fix(sidecar-v2): second-round review fixes

- cooldown_manager: move function-level imports to module top
- proxy.py: emergency_count counter now actually increments
- server.py: metrics reads emergency_count from proxy module
- dashboard.html: real JS CDN fallback (not just comment)
- requirements.txt: remove unused prometheus_client

Round 2 review residual fixes from 沈路明/陆怀瑾/梁思筑 feedback

Co-authored-by: multica-agent <github@multica.ai>
fix(sidecar-v2): incorporate review feedback - P0/P1 fixes
2026-06-25 17:53:48 +08:00 · 2026-06-25 17:12:33 +08:00 · 2026-06-25 16:39:01 +08:00
38 changed files with 3397 additions and 3937 deletions
@@ -1,644 +0,0 @@
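 # BIZ-46 Phase3: NVIDIA Sidecar Follow-up Architecture Design
 > **Architect**: 梁思筑 (architect)  
 > **Date**: 2026-06-24  
 > **Status**: Approved, proceeding to implementation  
 > **Source**: Follow-up to the second-round review of BIZ-42 Phase2  
 ---
 ## 1. Architecture Decoupling / Dependency Injection: SidecarContext
 ### 1.1 Current State
 `server.py` currently manages every core component through **module-level global variables**: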
 ```python
 # server.py global state (current)
 _config: SidecarConfig
 _http_client: httpx.AsyncClient
 _priority_queue: PriorityRequestQueue
 _token_bucket: AdaptiveTokenBucket
 _prometheus: PrometheusMetrics
 _health_service: HealthService
 _pending_requests: dict[str, tuple[asyncio.Future, float]]
 _stats: dict[str, int]
 _stats_lock: asyncio.Lock
 ```
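 **Problems**:
 - `webui.py` reaches back into the globals via `from nvidia_sidecar import server` (risk of a circular dependency)
 - Unit tests have to mock module-level variables, so tests cannot run in parallel
 - Future multi-instance / multi-tenant support would require rewriting the access logic in every module
 ### 1.2 Design: SidecarContext + FastAPI Dependency Injection
 #### 1.2.1 Core Data Structure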
 ```python
 # context.py
 from dataclasses import dataclass, field
 import asyncio
 import httpx
 from typing import Any
 @dataclass
 class SidecarContext:
    """Global Sidecar runtime context: the single container for all core components.
    Attached to FastAPI as app.state.sidecar; routes obtain it through Depends.
    """
    config: 'SidecarConfig'
    http_client: httpx.AsyncClient
    token_bucket: 'AdaptiveTokenBucket'
    priority_queue: 'PriorityRequestQueue'
    prometheus: 'PrometheusMetrics'
    health: 'HealthService'
    pending_requests: dict[str, tuple['asyncio.Future', float]] = field(default_factory=dict)
    stats: dict[str, int] = field(default_factory=lambda: {
        "total_requests": 0,
        "nvidia_requests": 0,
        "passthrough_requests": 0,
        "ratelimited_requests": 0,
        "queue_full_rejects": 0,
        "upstream_errors": 0,
        "start_time": 0,
    })
    stats_lock: asyncio.Lock = field(default_factory=asyncio.Lock)
    async def increment_stat(self, key: str, delta: int = 1) -> None:
        """Increment a stats counter; safe across coroutines (asyncio.Lock does not protect against threads)."""
        async with self.stats_lock:
            self.stats[key] = self.stats.get(key, 0) + delta
 ```
 #### 1.2.2 注入方式
 ```python
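 # server.py: create the context inside lifespan
 from contextlib import asynccontextmanager
 import httpx
 from fastapi import Depends, FastAPI, Request
 from nvidia_sidecar.context import SidecarContext
 @asynccontextmanager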
 async def lifespan(app: FastAPI):
    ctx = SidecarContext(
        config=load_config(),
        http_client=httpx.AsyncClient(...),
        token_bucket=AdaptiveTokenBucket(...),
        priority_queue=PriorityRequestQueue(...),
        prometheus=PrometheusMetrics(),
        health=HealthService(),
    )
    app.state.sidecar = ctx  # attach to FastAPI
    # ... start workers ...
    yield
    # ... cleanup ...
 app = FastAPI(lifespan=lifespan)
 # Dependency provider
 def get_context(request: Request) -> SidecarContext:
    return request.app.state.sidecar
 # Route usage
 @app.post("/v1/chat/completions")
 async def chat_completions(request: Request, ctx: SidecarContext = Depends(get_context)):
    return await _handle_proxy_request(request, "/v1/chat/completions", ctx)
 ```
 #### 1.2.3 Decoupling webui.py
 ```python
 # webui.py: no longer imports server
 from nvidia_sidecar.context import SidecarContext
 from fastapi import APIRouter, Depends, Request
 def get_webui_router():
    router = APIRouter(prefix="/api", tags=["webui"])
    def _get_ctx(request: Request) -> SidecarContext:
        return request.app.state.sidecar
    @router.get("/dashboard/stream")
    async def dashboard_stream(request: Request, ctx: SidecarContext = Depends(_get_ctx)):
        return await _dashboard_stream(request, ctx)
    @router.get("/admin/config")
    async def admin_get_config(ctx: SidecarContext = Depends(_get_ctx)):
        return await get_config(ctx)
    return router
 ```
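 #### 1.2.4 Trade-off Analysis
 | Dimension | Current (globals) | Option A (SidecarContext) | Option B (fully functional FastAPI dependencies) |
 |------|------------------|------------------------|-------------------------------------|
 | Testability | Poor (must mock the module) | Good (inject a mock context) | Excellent (each dependency injected separately) |
 | Change size | None | Medium (~8 files) | Large (every function signature changes) |
 | Readability | Fair | Good (ctx makes dependencies obvious) | Poor (parameter lists balloon) |
 | Multi-instance support | Not supported | Supported (multiple apps, multiple ctx) | Supported |
 | Circular dependency | Present (webui→server) | Eliminated | Eliminated |
 **Decision**: adopt Option A (SidecarContext), which balances change size against benefit.
 ### 1.3 Migration Plan
 Migrate incrementally in 3 steps, each of which can be merged independently:
 1. **Step 1**: Create `context.py`, define `SidecarContext`, instantiate it in `lifespan` and attach it to `app.state`
 2. **Step 2**: Switch route functions to `Depends(get_context)` and delete module-level `_config`, `_http_client`, etc.
 3. **Step 3**: Remove `from nvidia_sidecar import server` from `webui.py` and use dependency injection instead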
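The testability and multi-instance claims above can be illustrated with a minimal, standard-library-only sketch. `DemoContext` and `handle_request` are hypothetical stand-ins introduced here, not the project's actual `SidecarContext` or handlers; the point is only that state passed in as an argument stays isolated per context.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class DemoContext:
    """Hypothetical, trimmed-down stand-in for SidecarContext (stats only)."""

    stats: dict[str, int] = field(default_factory=lambda: {"total_requests": 0})
    stats_lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    async def increment_stat(self, key: str, delta: int = 1) -> None:
        # Serialize updates between coroutines sharing this context.
        async with self.stats_lock:
            self.stats[key] = self.stats.get(key, 0) + delta


async def handle_request(ctx: DemoContext) -> None:
    """Hypothetical handler: all state arrives through the ctx argument."""
    await ctx.increment_stat("total_requests")


async def run_demo() -> tuple[int, int]:
    # Two contexts model two app instances (or two tests running side by side).
    ctx_a, ctx_b = DemoContext(), DemoContext()
    await asyncio.gather(*(handle_request(ctx_a) for _ in range(3)))
    await handle_request(ctx_b)
    return ctx_a.stats["total_requests"], ctx_b.stats["total_requests"]


counts = asyncio.run(run_demo())
```

Because neither context touches module-level state, a test can build its own `DemoContext` without patching anything, which is the property Option A is chosen for.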
 ---
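 ## 2. Prometheus Label Cardinality Governance
 ### 2.1 Current State
 Metrics that currently use `model_id` as a label:
 | Metric | Label | Risk |
 |------|-------|------|
 | `sidecar_upstream_latency_seconds` | `model_id` | **High**: NVIDIA model names include version numbers and can grow without bound |
 | `sidecar_upstream_errors_total` | `status_code`, `model_id` | **Medium**: combined cardinality = number of models × number of status codes |
 ### 2.2 Cardinality Assessment
 The NVIDIA API currently exposes roughly 20-30 known models, but:
 - New models keep shipping (2-5 per month)
 - Model names carry version suffixes (`nvidia/deepseek-ai/deepseek-v4-pro`, `nvidia/llama-3.1-70b-instruct`, etc.)
 - A long-running deployment (6+ months) may accumulate 100+ label combinations
 **Conclusion**: cardinality is manageable today (<200 combinations), but it can balloon over time and should be governed early.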
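As a rough worked example of the growth described above, the sketch below multiplies the model count by the number of distinct status codes. The model counts and monthly growth come from this section; the figure of 5 distinct status codes is an assumption made only for illustration, and `projected_series` is a hypothetical helper.

```python
def projected_series(models_now: int, new_per_month: int, months: int, status_codes: int) -> int:
    """Label combinations for a metric labelled by (status_code, model_id)."""
    models = models_now + new_per_month * months
    return models * status_codes


STATUS_CODES = 5  # assumption: distinct upstream status codes actually observed

low = projected_series(20, 2, 6, STATUS_CODES)   # (20 + 2*6) * 5
high = projected_series(30, 5, 6, STATUS_CODES)  # (30 + 5*6) * 5
```

Under these assumptions the six-month projection lands between 160 and 300 combinations for the error counter alone, before histogram buckets multiply the latency metric further.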
 ### 2.3 Governance Plan
 | Metric | Current Label | New Label | Rationale |
 |------|-----------|-------------|------|
 | `upstream_latency_seconds` | `model_id` | `provider` | provider is fixed to `nvidia`, cardinality = 1 |
 | `upstream_errors_total` | `status_code`, `model_id` | `status_code`, `provider` | Same as above |
 **Migration path for model-level information**:
 - Model ID → structured JSON logs (already supported by structlog)
 - When model-level latency analysis is needed → ad hoc `/status` API queries or log aggregation
 ```python
 # metrics.py change
 self.upstream_latency_seconds: Histogram = Histogram(
    "sidecar_upstream_latency_seconds",
    "Upstream response latency in seconds",
    labelnames=["provider"],  # was: ["model_id"]
    buckets=(...),
 )
 self.upstream_errors_total: Counter = Counter(
    "sidecar_upstream_errors_total",
    "Upstream error count by status code",
    labelnames=["status_code", "provider"],  # was: ["status_code", "model_id"]
 )
 ```
 ```python
 # server.py change: model information moves to the log
 model_id = _extract_model(payload) or "unknown"
 provider = "nvidia"  # fixed value, since only NVIDIA requests go through the worker
 _prometheus.record_upstream_latency(provider, upstream_latency)
 if not resp.is_success:
    _prometheus.record_upstream_error(resp.status_code, provider)
 logger.info("request_completed", model_id=model_id, ...)  # the JSON log keeps the model information
 ```
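Once `model_id` lives only in the logs, per-model latency has to be recovered by aggregating log lines. A minimal sketch of that aggregation follows; it assumes each JSON line carries an `event` field with the value `request_completed`, a `model_id` field, and a numeric `upstream_latency` field. The latency field name and the helper `latency_by_model` are assumptions for illustration, since the original log call elides its remaining fields.

```python
import json
from collections import defaultdict
from statistics import mean


def latency_by_model(log_lines):
    """Average latency per model_id from JSON log lines (field names are assumed)."""
    samples = defaultdict(list)
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict) or record.get("event") != "request_completed":
            continue
        latency = record.get("upstream_latency")
        if isinstance(latency, (int, float)):
            samples[record.get("model_id", "unknown")].append(float(latency))
    return {model: mean(values) for model, values in samples.items()}


sample_logs = [
    '{"event": "request_completed", "model_id": "model-a", "upstream_latency": 0.5}',
    '{"event": "request_completed", "model_id": "model-a", "upstream_latency": 1.5}',
    '{"event": "request_completed", "model_id": "model-b", "upstream_latency": 2.0}',
    "not json",
]
averages = latency_by_model(sample_logs)
```

In production this role would normally be played by the log aggregation system rather than a script; the sketch only shows that the information removed from the metric labels remains recoverable.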
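 ### 2.4 Trade-off
 | Dimension | Keep model_id | Converge on provider |
 |------|--------------|----------------|
 | Cardinality risk | High (unbounded) | None (fixed at 1) |
 | Model-level analysis | Native Prometheus queries | Requires log aggregation |
 | Migration cost | None | Low (2 metric definitions + call sites) |
 **Decision**: converge on `provider`; model-level analysis is done through JSON logs plus a log aggregation system (ELK/Loki).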
 ---
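 ## 3. Shared SSE Snapshot Cache
 ### 3.1 Current State
 Every SSE client independently calls `_build_snapshot()` once per second. That method:
 - Reads the `_stats` dict (requires a lock)
 - Calls `_token_bucket.get_status()` (requires a lock)
 - Calls `_priority_queue.get_stats()` (requires an asyncio.Lock)
 With N dashboards open at once, that is N rounds of lock contention plus N redundant computations per second.
 ### 3.2 Design: Shared Cache with a 1s TTL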
 ```python
 # webui.py
 import asyncio
 import time
 from typing import Any
 _snapshot_cache: tuple[dict[str, Any], float] | None = None  # (data, timestamp)
 _snapshot_lock: asyncio.Lock = asyncio.Lock()
 _SNAPSHOT_TTL: float = 1.0  # 1-second TTL
 async def _build_snapshot_cached(ctx: SidecarContext) -> dict[str, Any]:
    """Shared snapshot cache with a 1-second TTL.
    All SSE clients share one snapshot, avoiding repeated computation and lock contention.
    """
    global _snapshot_cache
    cached = _snapshot_cache
    if cached is not None and time.monotonic() - cached[1] < _SNAPSHOT_TTL:
        return cached[0]
    async with _snapshot_lock:
        # Double-check with a fresh clock reading: another coroutine may have
        # rebuilt the snapshot while this one was waiting for the lock.
        cached = _snapshot_cache
        if cached is not None and time.monotonic() - cached[1] < _SNAPSHOT_TTL:
            return cached[0]
        snapshot = await _build_snapshot(ctx)
        # Stamp the entry after the build so the TTL counts from completion.
        _snapshot_cache = (snapshot, time.monotonic())
        return snapshot
 ```
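 ### 3.3 Performance Gain
 | Scenario | Current | After |
 |------|------|--------|
 | 1 client | 1 computation/s | 1 computation/s (unchanged) |
 | 5 clients | 5 computations/s, 5 lock contentions | 1 computation/s, 1 lock contention |
 | 20 clients | 20 computations/s, 20 lock contentions | 1 computation/s, 1 lock contention |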
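The 20-client row can be checked with a small self-contained simulation. `SnapshotCache` below is a hypothetical class written for this illustration (the design itself uses module-level variables in `webui.py`), and the 10 ms sleep is a stand-in for the real snapshot work.

```python
import asyncio
import time


class SnapshotCache:
    """Hypothetical TTL cache with double-checked locking, for illustration only."""

    def __init__(self, ttl: float = 1.0) -> None:
        self.ttl = ttl
        self.builds = 0          # how many times the expensive build ran
        self._cached = None      # (data, timestamp) or None
        self._lock = asyncio.Lock()

    async def _build(self) -> dict:
        self.builds += 1
        await asyncio.sleep(0.01)  # stand-in for gathering stats under locks
        return {"build": self.builds}

    async def get(self) -> dict:
        cached = self._cached
        if cached is not None and time.monotonic() - cached[1] < self.ttl:
            return cached[0]  # fast path: no lock taken
        async with self._lock:
            # Re-check: a concurrent caller may have refreshed the cache.
            cached = self._cached
            if cached is not None and time.monotonic() - cached[1] < self.ttl:
                return cached[0]
            data = await self._build()
            self._cached = (data, time.monotonic())
            return data


async def simulate(clients: int) -> int:
    cache = SnapshotCache(ttl=1.0)
    await asyncio.gather(*(cache.get() for _ in range(clients)))
    return cache.builds


builds_for_20_clients = asyncio.run(simulate(20))
```

All 20 concurrent callers miss the fast path, but only the first one to take the lock builds; the rest find a fresh entry on the second check.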
 ---
 ## 4. Deployment Support
 ### 4.1 Dockerfile
 ```dockerfile
 # services/nvidia_sidecar/Dockerfile
 FROM python:3.12-slim AS base
 WORKDIR /app
 # Install dependencies (takes advantage of Docker layer caching)
 COPY pyproject.toml .
 RUN pip install --no-cache-dir -e .
 # Copy the source
 COPY . .
 # Run as a non-root user
 RUN useradd -r -s /bin/false sidecar
 USER sidecar
 # Health check
 HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python -c "import httpx; r=httpx.get('http://127.0.0.1:9190/health'); exit(0 if r.status_code==200 else 1)"
 EXPOSE 9190 9191
 CMD ["uvicorn", "nvidia_sidecar.server:app", "--host", "0.0.0.0", "--port", "9190"]
 ```
 ### 4.2 systemd Service
 ```ini
 # services/nvidia_sidecar/deploy/nvidia-sidecar.service
 [Unit]
 Description=NVIDIA Sidecar Rate-Limiting Proxy
 After=network-online.target
 Wants=network-online.target
 [Service]
 Type=simple
 User=sidecar
 Group=sidecar
 WorkingDirectory=/opt/nvidia-sidecar
 ExecStart=/opt/nvidia-sidecar/.venv/bin/uvicorn nvidia_sidecar.server:app \
    --host 127.0.0.1 \
    --port 9190 \
    --log-level info
 Restart=always
 RestartSec=5
 # Environment variables
 EnvironmentFile=/opt/nvidia-sidecar/.env
 # Security hardening
 NoNewPrivileges=true
 ProtectSystem=strict
 ProtectHome=true
 PrivateTmp=true
 ReadWritePaths=/opt/nvidia-sidecar/logs
 # Resource limits
 LimitNOFILE=65536
 MemoryMax=512M
 [Install]
 WantedBy=multi-user.target
 ```
 ### 4.3 Environment Variables
 | Variable | Default | Description |
 |------|--------|------|
 | `SIDECAR_HOST` | `127.0.0.1` | Listen address |
 | `SIDECAR_PORT` | `9190` | Proxy port |
 | `SIDECAR_METRICS_PORT` | `9191` | Prometheus metrics port |
 | `SIDECAR_UPSTREAM` | `https://integrate.api.nvidia.com/v1` | Upstream API |
 | `SIDECAR_API_KEY` | (required) | NVIDIA API Key |
 | `SIDECAR_RATE_RPM` | `40` | Rate limit (RPM) |
 | `SIDECAR_BUCKET_CAPACITY` | `40` | Token bucket capacity |
 | `SIDECAR_TIMEOUT` | `60` | Request timeout (seconds) |
 | `SIDECAR_QUEUE_MAX` | `500` | Maximum queue size |
 | `SIDECAR_LOW_TIMEOUT` | `2` | Low-priority timeout (seconds) |
 | `SIDECAR_FALLBACK_PASSTHROUGH` | `true` | Whether to pass through when the queue is full |
 | `SIDECAR_LOG_LEVEL` | `INFO` | Log level |
 | `SIDECAR_ADMIN_TOKEN` | (optional) | Admin API authentication token |
 ### 4.4 Firewall Recommendations
 ```
 # Allow only the internal network to reach the proxy ports
 sudo ufw allow from 192.168.1.0/24 to any port 9190
 sudo ufw allow from 192.168.1.0/24 to any port 9191
 # Block external access
 sudo ufw deny 9190
 sudo ufw deny 9191
 ```
 ---
 ## 5. Reusing the HTTP Client for Readiness Checks
 ### 5.1 Current State
 `HealthService.check_upstream()` creates a new `httpx.AsyncClient` on every call:
 ```python
 # health.py: current
 async def check_upstream(self, upstream_url: str, timeout: float = 5.0, api_key: str = "") -> bool:
    async with httpx.AsyncClient(timeout=timeout) as client:  # a new client every time!
        resp = await client.get(...)
 ```
 K8s/systemd probes every 10-30s, and creating and tearing down an HTTP client on each probe adds unnecessary TCP connection overhead.
 ### 5.2 Approach: Reuse the Main http_client
 ```python
 # health.py: after the change
 async def check_upstream(
    self,
    upstream_url: str,
    http_client: httpx.AsyncClient,  # the main client, injected
    api_key: str = "",
    timeout: float = 5.0,
 ) -> bool:
    try:
        headers = {}
        if api_key:
            headers["authorization"] = f"Bearer {api_key}"
        resp = await http_client.get(
            f"{upstream_url.rstrip('/')}/v1/models",
            headers=headers,
            timeout=timeout,
        )
        return resp.status_code < 500
    except Exception:
        return False
 ```
 ```python
 # server.py: route call site
 @app.get("/health/ready")
 async def health_ready(ctx: SidecarContext = Depends(get_context)):
    queue_size = await ctx.priority_queue.get_queue_size()
    bucket_status = ctx.token_bucket.get_status()
    return await ctx.health.readiness(
        upstream_url=ctx.config.upstream_url,
        http_client=ctx.http_client,  # reuse the main client
        upstream_api_key=ctx.config.upstream_api_key or "",
        queue_current_size=queue_size,
        queue_max_size=ctx.config.queue_max_size,
        available_tokens=bucket_status["tokens"],
        bucket_capacity=bucket_status["capacity"],
    )
 ```
 **Note**: the readiness check uses a shorter timeout (5s) and does not affect the timeout configured for main proxy requests; httpx supports per-request timeout overrides.
 ---
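 ## 6. Retreat Concurrency / Deadlock Regression Tests
 ### 6.1 Risk Points
 `AdaptiveTokenBucket` holds two locks:
 - `_lock` (Lock): protects token consumption and refill
 - `_retreat_lock` (RLock): protects the retreat state machine
 Potential deadlock paths:
 1. `evaluate_retreat()` holds `_retreat_lock` → calls `get_429_rate()` (which also acquires `_retreat_lock`; RLock is reentrant ✅)
 2. `evaluate_retreat()` → `_apply_retreat()` → `set_rate()` → acquires `_lock` (the other lock)
 3. Worker thread: `consume()` holds `_lock` → never acquires `_retreat_lock` (no crossing ✅)
 The current design already avoids reentrancy deadlocks by using an RLock, but regression tests are needed so that future changes do not introduce one.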
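The lock layout above can be reduced to a standard-library miniature. `TwoLockBucket` is a hypothetical class written for this illustration; it mirrors only the acquisition order described in the three paths (retreat lock first, then the token lock, never the reverse) and is not the real `AdaptiveTokenBucket`.

```python
import threading


class TwoLockBucket:
    """Hypothetical miniature of the two-lock layout described above."""

    def __init__(self) -> None:
        self._lock = threading.Lock()            # guards tokens and rate
        self._retreat_lock = threading.RLock()   # guards retreat bookkeeping
        self.tokens = 1000
        self.rate = 40.0
        self.total = 0
        self.throttled = 0

    def get_429_rate(self) -> float:
        with self._retreat_lock:
            return self.throttled / self.total if self.total else 0.0

    def record_response(self, is_429: bool) -> None:
        with self._retreat_lock:
            self.total += 1
            self.throttled += int(is_429)

    def set_rate(self, rate: float) -> None:
        with self._lock:
            self.rate = rate

    def evaluate_retreat(self) -> None:
        with self._retreat_lock:              # path 1: outer acquisition
            if self.get_429_rate() > 0.05:    # re-enters the RLock safely
                self.set_rate(30.0)           # path 2: retreat lock, then _lock

    def consume(self) -> bool:
        with self._lock:                      # path 3: never takes _retreat_lock
            if self.tokens > 0:
                self.tokens -= 1
                return True
            return False


def run_threads(bucket: TwoLockBucket, rounds: int = 200) -> bool:
    def record_and_evaluate() -> None:
        for i in range(rounds):
            bucket.record_response(is_429=(i % 10 == 0))
            bucket.evaluate_retreat()

    def consume() -> None:
        for _ in range(rounds):
            bucket.consume()

    targets = (record_and_evaluate, record_and_evaluate, consume, consume)
    threads = [threading.Thread(target=t, daemon=True) for t in targets]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    return all(not t.is_alive() for t in threads)


bucket = TwoLockBucket()
finished = run_threads(bucket)
```

If `consume()` were ever changed to take `_retreat_lock` while holding `_lock`, the two paths would acquire the locks in opposite orders, which is the regression the tests in this section are meant to catch.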
 ### 6.2 Test Cases
 ```python
 # tests/test_retreat_concurrency.py
 import pytest
 import asyncio
 import threading
 from nvidia_sidecar.rate_limiter import AdaptiveTokenBucket, RetreatState
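 class TestRetreatConcurrency:
    """Concurrency-safety regression tests for retreat mode."""
    # Plain synchronous test: it only joins threads and never awaits.
    def test_concurrent_record_and_evaluate(self):
        """Concurrent record_response + evaluate_retreat across threads must not deadlock."""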
        bucket = AdaptiveTokenBucket(rate=40/60, capacity=40)
        errors: list[Exception] = []
        def worker_record():
            for i in range(1000):
                try:
                    bucket.record_response(is_429=(i % 10 == 0))
                except Exception as e:
                    errors.append(e)
        def worker_evaluate():
            for _ in range(1000):
                try:
                    bucket.evaluate_retreat()
                except Exception as e:
                    errors.append(e)
        threads = [
            threading.Thread(target=worker_record),
            threading.Thread(target=worker_record),
            threading.Thread(target=worker_evaluate),
            threading.Thread(target=worker_evaluate),
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join(timeout=10)
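        # All threads must finish within 10s (no deadlock)
        assert all(not t.is_alive() for t in threads), "threads did not finish, possible deadlock"
        assert not errors, f"concurrency errors: {errors}"
    # Plain synchronous test: it only joins threads and never awaits.
    def test_concurrent_consume_and_retreat(self):
        """Concurrent consume + evaluate_retreat across threads must not deadlock."""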
        bucket = AdaptiveTokenBucket(rate=40/60, capacity=40)
        errors: list[Exception] = []
        def worker_consume():
            for _ in range(500):
                try:
                    bucket.consume(tokens=1)
                except Exception as e:
                    errors.append(e)
        def worker_retreat():
            for _ in range(500):
                try:
                    bucket.record_response(is_429=False)
                    bucket.evaluate_retreat()
                except Exception as e:
                    errors.append(e)
        threads = [
            threading.Thread(target=worker_consume),
            threading.Thread(target=worker_consume),
            threading.Thread(target=worker_retreat),
            threading.Thread(target=worker_retreat),
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join(timeout=10)
        assert all(not t.is_alive() for t in threads), "threads did not finish, possible deadlock"
        assert not errors, f"concurrency errors: {errors}"
    def test_retreat_state_transitions_under_load(self):
        """Retreat state transitions are correct under high load."""
        bucket = AdaptiveTokenBucket(
            rate=40/60, capacity=40,
            retreat_429_threshold=0.05,
            retreat_factor=0.75,
        )
        # Simulate a high 429 rate
        for _ in range(100):
            bucket.record_response(is_429=True)
        state = bucket.evaluate_retreat()
        assert state == RetreatState.RETREAT
        assert bucket.get_effective_rate_rpm() < bucket.get_base_rate_rpm()
        # Simulate recovery
        for _ in range(200):
            bucket.record_response(is_429=False)
        # RECOVER_WINDOW has to elapse first
        import time
        time.sleep(0.1)  # let the time window pass
        bucket._last_state_change = 0  # force the time condition to trigger
        state = bucket.evaluate_retreat()
        assert state in (RetreatState.RECOVER, RetreatState.NORMAL)
 ```
 ---
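 ## 7. Dashboard UX Improvements
 ### 7.1 Improvement List
 | # | Item | Implementation | Priority |
 |---|--------|---------|--------|
 | 1 | 300ms smooth animation for the queue bar chart | CSS `transition: height 300ms ease` | P1 |
 | 2 | Overlay after 5s of SSE disconnection | JS timer + DOM overlay layer | P1 |
 | 3 | Queue chart title shows the total queued count | SSE data already includes `current_size`; update the title | P2 |
 | 4 | Sync configuration on page load | `fetch('/api/admin/config')` to initialize the form | P2 |
 ### 7.2 Key Implementation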
 ```javascript
 // dashboard.html: SSE disconnect detection
 let lastSSETime = Date.now();
 let reconnectMask = document.getElementById('reconnect-mask');
 eventSource.onmessage = (event) => {
    lastSSETime = Date.now();
    reconnectMask.style.display = 'none';
    // ... update the UI ...
 };
 // No data for 5s → show the overlay
 setInterval(() => {
    if (Date.now() - lastSSETime > 5000) {
        reconnectMask.style.display = 'flex';
    }
 }, 1000);
 // Queue bar chart animation
 // CSS: .queue-bar { transition: height 0.3s ease; }
 ```
 ```javascript
 // Sync configuration on page load
 async function loadConfig() {
    try {
        const resp = await fetch('/api/admin/config');
        if (resp.ok) {
            const config = await resp.json();
            document.getElementById('rate-rpm').value = config.rate_rpm;
            document.getElementById('queue-max').value = config.queue_max_size;
            // ...
        }
    } catch (e) {
        console.warn('Failed to load configuration (an Admin Token may be required)', e);
    }
 }
 loadConfig();
 ```
 ---
 ## 8. Implementation Schedule
 | Phase | Work | Estimated effort | Depends on |
 |------|------|---------|------|
 | **D1** | SidecarContext Steps 1-3 (decoupling migration) | 8h | None |
 | **D2** | Prometheus label convergence + logging improvements | 2h | D1 |
 | **D2** | Shared SSE cache | 2h | D1 |
 | **D2** | Readiness HTTP client reuse | 1h | D1 |
 | **D3** | Dockerfile + systemd service | 2h | None |
 | **D3** | Dashboard UX improvements | 3h | None |
 | **D3** | Retreat concurrency regression tests | 3h | None |
 | **D4** | Integration tests + mypy strict | 4h | D1-D3 |
 | **Total** | | **25h** | |
 ---
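 ## 9. Acceptance Criteria Mapping
 | Issue requirement | Section in this document | Status |
 |-----------|-----------|------|
 | SidecarContext / DI design landed or ADR written | §1 | ✅ Detailed design + migration plan |
 | Converge high-cardinality Prometheus labels | §2 | ✅ Converged on provider |
 | Shared SSE snapshot cache | §3 | ✅ 1s TTL design |
 | Dockerfile + systemd + deployment SOP | §4 | ✅ Complete files |
 | Readiness reuses the HTTP client | §5 | ✅ Main client injected |
 | Retreat concurrency / deadlock regression tests | §6 | ✅ Test cases |
 | Dashboard UX details | §7 | ✅ 4 improvements |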
@@ -1,3 +0,0 @@
 __pycache__/
 *.egg-info/
 .mypy_cache/
@@ -1,40 +1,46 @@
-# NVIDIA Sidecar rate-limiting proxy: production Docker image (BIZ-46 Phase3 §4)
+# Sidecar V2 — Multi-Pool Provider Proxy
-#
+FROM python:3.12-slim AS builder
 # Build:
 #   docker build -t nvidia-sidecar:latest .
 #
 # Run:
 #   docker run -d --name nvidia-sidecar \
 #     -p 127.0.0.1:9190:9190 \
 #     -p 127.0.0.1:9191:9191 \
 #     -e SIDECAR_API_KEY="nvapi-xxx" \
 #     -e SIDECAR_RATE_RPM=40 \
 #     -v $(pwd)/logs:/opt/nvidia-sidecar/logs \
 #     nvidia-sidecar:latest
 FROM python:3.12-slim AS base
 WORKDIR /app
-# Install dependencies (takes advantage of Docker layer caching)
+# Install dependencies
-COPY pyproject.toml .
+COPY requirements.txt .
-RUN pip install --no-cache-dir fastapi>=0.115 \
+RUN pip install --no-cache-dir --upgrade pip && \
-    "uvicorn[standard]>=0.34" httpx>=0.28 PyYAML>=6.0 \
+    pip install --no-cache-dir -r requirements.txt
    structlog>=24.4 "prometheus-client>=0.21" pydantic>=2.0
-# Copy the source
+# Copy application code
-COPY . .
+COPY config.py crypto.py main.py server.py proxy.py router.py \
     pool_manager.py cooldown_manager.py rate_limiter.py __init__.py \
     dashboard.html ./
 COPY storage/ ./storage/
-# Run as a non-root user
+# Create data directory
-RUN useradd -r -m -s /bin/false sidecar \
+RUN mkdir -p /app/data /app/data/backups
    && mkdir -p /opt/nvidia-sidecar/logs \
    && chown -R sidecar:sidecar /app /opt/nvidia-sidecar/logs
 USER sidecar
-# Health check
+FROM python:3.12-slim
-HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
+
-    CMD python -c "import httpx; r=httpx.get('http://127.0.0.1:9190/health'); exit(0 if r.status_code==200 else 1)"
+WORKDIR /app
 # Copy built artifacts
 COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
 COPY --from=builder /app /app
 # Environment
 ENV SIDECAR_HOST=0.0.0.0
 ENV SIDECAR_PORT=9190
 ENV SIDECAR_METRICS_PORT=9191
 ENV SIDECAR_DB_PATH=/app/data/sidecar_v2.db
 ENV SIDECAR_BACKUP_DIR=/app/data/backups
 ENV SIDECAR_ENCRYPTION_KEY=
 ENV SIDECAR_ADMIN_TOKEN=
 ENV LOG_FORMAT=json
 ENV PYTHONUNBUFFERED=1
 EXPOSE 9190 9191
-CMD ["uvicorn", "nvidia_sidecar.server:app", "--host", "0.0.0.0", "--port", "9190"]
+VOLUME ["/app/data"]
 HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:9190/health')" || exit 1
 ENTRYPOINT ["python3", "main.py"]
@@ -1,118 +1,77 @@
-# NVIDIA Sidecar Rate-Limiting Proxy
+# Sidecar V2 — Multi-Pool Provider Proxy
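-A transparent proxy layer for the NVIDIA API that provides **priority queuing + token-bucket rate limiting**.
+## Overview
 Sidecar V2 is the API proxy service for OpenClaw. It provides multi-provider pool management, load balancing, 429 cooldown, and RPM queue flow control.
-> BIZ-46 Phase3: architecture decoupling, Prometheus label governance, shared SSE cache, deployment support, test improvements, Dashboard UX polish.
+## Core Features
 - **Provider pool management**: primary pool + fallback pool, with providers added and removed dynamically
 - **429 cooldown**: detect 429 → automatic cooldown → exponential backoff → automatic recovery
 - **Independent RPM limit per provider**: each provider has its own token bucket
 - **Routing policy**: primary pool first → fallback pool as backstop → 503 when everything is exhausted
 - **WebUI management**: Dashboard + provider CRUD
 - **Usage statistics**: token usage + cost statistics + hourly/daily aggregation
 - **API key encryption**: stored encrypted with AES-256-GCM
-## Quick Start
+## Architecture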
-```bash
+```
-pip install .
+OpenClaw → Sidecar V2 (port 9190) → router → primary pool Provider 1,2,3...
-nvidia-sidecar
+                                        ↘ fallback pool Provider 4,5...
                                         ↘ all exhausted → 503
 ```
-Listens on `127.0.0.1:9190` and proxies to the NVIDIA API.
+## Quick Start
 ```bash
 # Set the encryption key (64 hex characters)
 export SIDECAR_ENCRYPTION_KEY="0000111122223333444455556666777788889999aaaabbbbccccddddeeeeffff"
 # Start the service
 python3 main.py
 # OR via uvicorn
 python3 -m uvicorn server:app --host 127.0.0.1 --port 9190
 ```
 ## WebUI
 Visit http://127.0.0.1:9190/dashboard
 ## API Endpoints
 ### Admin API
 - `GET  /api/admin/backends`: list all providers
 - `POST /api/admin/backends`: add a provider
 - `PUT  /api/admin/backends/{id}`: update a provider
 - `DELETE /api/admin/backends/{id}`: delete a provider
 - `GET  /api/admin/pools`: pool status summary
 - `GET  /api/admin/stats/total`: overall totals
 - `GET  /api/admin/stats/hourly`: hourly usage
 - `GET  /api/admin/stats/daily`: daily aggregates
 - `GET  /api/admin/stats/cooldown`: cooldown event history
 - `GET  /api/admin/config`: system configuration
 ### Proxy API (OpenAI-compatible)
 - `POST /v1/chat/completions`
 - `POST /v1/completions`
 - `POST /v1/embeddings`
 - `GET  /v1/models`
 ### Monitoring
 - `GET /health`: health check
 - `GET /dashboard/sse`: real-time Dashboard data stream (SSE)
 ## Environment Variables
 | Variable | Default | Description |
 |------|--------|------|
-| `SIDECAR_HOST` | `127.0.0.1` | Listen address |
+| SIDECAR_HOST | 127.0.0.1 | Listen address |
-| `SIDECAR_PORT` | `9190` | Listen port |
+| SIDECAR_PORT | 9190 | Listen port |
-| `SIDECAR_METRICS_PORT` | `9191` | Metrics port |
+| SIDECAR_ENCRYPTION_KEY | (required) | API key encryption key (64 hex chars) |
-| `SIDECAR_UPSTREAM` | `https://integrate.api.nvidia.com/v1` | Upstream API address |
+| SIDECAR_DB_PATH | ./data/sidecar_v2.db | SQLite database path |
-| `SIDECAR_API_KEY` | — | NVIDIA API Key (required) |
+| SIDECAR_RATE_RPM | 40 | Default RPM limit |
-| `SIDECAR_RATE_RPM` | `40` | Requests-per-minute limit |
+| SIDECAR_COOLDOWN_BASE | 30 | Base cooldown duration (seconds) |
-| `SIDECAR_BUCKET_CAPACITY` | `40` | Token bucket capacity |
+| SIDECAR_COOLDOWN_MAX | 600 | Maximum cooldown duration (seconds) |
 | `SIDECAR_TIMEOUT` | `60` | Upstream request timeout (seconds) |
 | `SIDECAR_QUEUE_MAX` | `500` | Maximum queue length |
 | `SIDECAR_LOW_TIMEOUT` | `2.0` | Token wait timeout for low-priority requests (seconds) |
 | `SIDECAR_FALLBACK_PASSTHROUGH` | `true` | Whether to pass through to the upstream when the queue is full |
 | `SIDECAR_LOG_LEVEL` | `INFO` | Log level |
-## YAML Configuration
+## Storage
-
+- SQLite (WAL mode)
-```yaml
+- Tables: backends, backend_usage_logs, cooldown_events, backend_health, system_config, daily_stats
 listen_port: 9292
 rate_rpm: 60
 upstream_api_key: "nvapi-xxx"
 ```
 ```bash
 nvidia-sidecar --config /etc/nvidia-sidecar.yaml
 ```
 ## API Endpoints
 | Path | Method | Description |
 |------|------|------|
 | `/v1/chat/completions` | POST | OpenAI Chat Completions proxy |
 | `/v1/completions` | POST | OpenAI Completions proxy (legacy) |
 | `/v1/embeddings` | POST | OpenAI Embeddings proxy |
 | `/v1/models` | GET | Model list proxy |
 | `/health` | GET | Liveness check |
 | `/health/ready` | GET | Readiness check (includes upstream connectivity) |
 | `/status` | GET | Full debug status (rate limiter + queue + retreat) |
 | `/api/dashboard/stream` | GET | SSE real-time dashboard feed |
 | `/api/dashboard` | GET | Dashboard HTML page |
 | `/api/admin/config` | GET/POST | Configuration query / hot reload (requires Admin Token) |
 | `/metrics` | :9191 | Prometheus metrics endpoint (separate port) |
 ## Deployment
 ### Docker (recommended)
 ```bash
 # Build
 docker build -t nvidia-sidecar:latest .
 # Run
 docker run -d --name nvidia-sidecar \
  -p 127.0.0.1:9190:9190 \
  -p 127.0.0.1:9191:9191 \
  -e SIDECAR_API_KEY="nvapi-xxx" \
  nvidia-sidecar:latest
 ```
 ### systemd
 ```bash
 # Install
 sudo cp deploy/nvidia-sidecar.service /etc/systemd/system/
 sudo systemctl daemon-reload
 sudo systemctl enable nvidia-sidecar
 # Configure environment variables
 sudo cp deploy/.env.example /opt/nvidia-sidecar/.env
 sudo vim /opt/nvidia-sidecar/.env  # fill in the real values
 # Start
 sudo systemctl start nvidia-sidecar
 sudo journalctl -u nvidia-sidecar -f  # follow the logs
 ```
 ### Environment Variables
 See `deploy/.env.example` for details.
 ### Firewall Recommendations
 ```bash
 # Allow only the internal network to reach the proxy ports
 sudo ufw allow from 192.168.1.0/24 to any port 9190
 sudo ufw allow from 192.168.1.0/24 to any port 9191
 # Block external access
 sudo ufw deny 9190
 sudo ufw deny 9191
 ```
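 ## Architecture
 ```
 request → gateway identification → [NVIDIA: priority queue → token-bucket rate limiting] → httpx → NVIDIA API
                 → [non-NVIDIA: passthrough] → httpx → upstream
 ```
 - **Four priority levels**: URGENT > HIGH > NORMAL > LOW (set through the `X-Priority` header)
 - **Queue-full policy**: PASSTHROUGH (pass through) / REJECT (503) / DROP_LOWEST (drop the lowest priority)
 - **Token bucket**: 40 RPM, thread-safe, supports blocking and non-blocking consumption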
@@ -1,41 +1 @@
-"""
+"""Sidecar V2 — Multi-pool provider proxy with cooldown, rate limiting, and WebUI management."""
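 NVIDIA Sidecar rate-limiting proxy: core proxy module.
 Provides four layers of protection for OpenAI Chat Completions compatible APIs:
    1. Request intake (FastAPI)
    2. Gateway identification → non-NVIDIA traffic passes straight through
    3. Priority queuing → token-bucket rate limiting
    4. Asynchronous httpx forwarding to the NVIDIA upstream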
 """
 from __future__ import annotations
 from nvidia_sidecar.config import SidecarConfig, load_config
 from nvidia_sidecar.rate_limiter import (
    Priority,
    TokenBucket,
    is_nvidia_gateway,
    normalize_gateway_name,
 )
 from nvidia_sidecar.priority_queue import (
    PriorityQueueItem,
    PriorityRequestQueue,
    QueueFullError,
    QueueFullPassthrough,
    QueueFullPolicy,
 )
 __version__ = "0.1.0"
 __all__ = [
    "SidecarConfig",
    "load_config",
    "Priority",
    "TokenBucket",
    "is_nvidia_gateway",
    "normalize_gateway_name",
    "PriorityQueueItem",
    "PriorityRequestQueue",
    "QueueFullError",
    "QueueFullPassthrough",
    "QueueFullPolicy",
 ]
@@ -1,221 +1,165 @@
-"""
+"""System configuration management for Sidecar V2."""
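 NVIDIA Sidecar rate-limiting proxy: configuration management module (§3.1)
 Centralizes Sidecar runtime parameters, with overrides from environment variables and a YAML configuration file.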
 """
 from __future__ import annotations
 import os
-import warnings
+import json
-from dataclasses import dataclass, field
+from dataclasses import dataclass, field, asdict
-from pathlib import Path
+from typing import Optional
 from typing import Any
@dataclass
-class SidecarConfig:
+class Config:
-    """Sidecar runtime configuration dataclass.
+    """Sidecar V2 runtime configuration.
-    Every field can be overridden by an environment variable. Priority: environment variable > YAML configuration file > default.
+    Sources (priority order):
    1. Environment variables (highest)
    2. system_config table in SQLite
    3. Defaults defined here
    """
-    # ---- Network ----
+    # Listen
-    listen_host: str = field(
+    host: str = "127.0.0.1"
-        default="127.0.0.1",
+    port: int = 9190
-        metadata={"env": "SIDECAR_HOST"},
+    metrics_port: int = 9191
    )
    listen_port: int = field(
        default=9190,
        metadata={"env": "SIDECAR_PORT"},
    )
    metrics_port: int = field(
        default=9191,
        metadata={"env": "SIDECAR_METRICS_PORT"},
    )
-    # ---- Upstream ----
+    # Queue
-    upstream_url: str = field(
+    queue_max_depth: int = 500
-        default="https://integrate.api.nvidia.com/v1",
+    queue_timeout_seconds: float = 30.0
        metadata={"env": "SIDECAR_UPSTREAM"},
    )
    upstream_api_key: str = field(
        default="",
        metadata={"env": "SIDECAR_API_KEY"},
    )
-    # ---- Rate limiting ----
+    # Provider
-    rate_rpm: int = field(
+    default_rpm_limit: int = 40
        default=40,
        metadata={"env": "SIDECAR_RATE_RPM"},
    )
    bucket_capacity: int = field(
        default=40,
        metadata={"env": "SIDECAR_BUCKET_CAPACITY"},
    )
-    # ---- Timeouts ----
+    # Cooldown
-    request_timeout: float = field(
+    cooldown_base_seconds: float = 30.0
-        default=60.0,
+    cooldown_max_seconds: float = 600.0
-        metadata={"env": "SIDECAR_TIMEOUT"},
+    cooldown_exponential_backoff: bool = True
    )
-    # ---- Queue ----
+    # Emergency channel: RPM fraction when all pools exhausted
-    queue_max_size: int = field(
+    emergency_rpm_fraction: float = 0.10
        default=500,
        metadata={"env": "SIDECAR_QUEUE_MAX"},
    )
    low_priority_timeout: float = field(
        default=2.0,
        metadata={"env": "SIDECAR_LOW_TIMEOUT"},
    )
-    # ---- Fallback ----
+    # Health check
-    fallback_enabled_passthrough: bool = field(
+    health_check_interval_seconds: int = 60
-        default=True,
+    health_check_timeout_seconds: int = 10
-        metadata={"env": "SIDECAR_FALLBACK_PASSTHROUGH"},
+    health_probe_endpoint: str = "/v1/models"
    )
-    # ---- Logging ----
+    # Admin auth
-    log_level: str = field(
+    admin_token: str = ""
        default="INFO",
        metadata={"env": "SIDECAR_LOG_LEVEL"},
    )
    # Encryption
    encryption_key: str = ""
-def _apply_env_overrides(config: SidecarConfig) -> SidecarConfig:
+    # Logging
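-    """Override configuration fields from environment variables.
+    log_level: str = "INFO"
-    Iterates over the dataclass fields of SidecarConfig; for every field that declares ``metadata={"env": ...}``,
+    # Database
-    checks whether the environment variable exists and, if so, converts it to the field's type and overrides the value.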
+    db_path: str = ""
-    """
+    backup_dir: str = ""
-    import dataclasses as _dc
+    backup_retention_days: int = 7
-    # Use typing.get_type_hints to resolve the stringified type annotations
+    # Rate limiter
-    # introduced by from __future__ import annotations (PEP 563)
+    rate_limiter_refill_interval_ms: int = 50
    try:
        resolved_types = __import__("typing").get_type_hints(type(config))
    except Exception:
        resolved_types = {}
-    for fld in _dc.fields(config):
+    # Router
-        env_key: str | None = fld.metadata.get("env")
+    router_refresh_interval_seconds: float = 5.0
        if env_key is None:
            continue
        env_val = os.environ.get(env_key)
        if env_val is None:
            continue
-        target_type = resolved_types.get(fld.name, fld.type)
+    # Max pool-internal retries
-        target_type_name: str = getattr(target_type, "__name__", str(target_type))
+    max_pool_retries: int = 5
-        try:
+
-            if target_type is bool or target_type == "bool":
+    # Pre-check cooldown threshold (seconds remaining)
-                parsed: bool = env_val.strip().lower() in ("true", "1", "yes", "on")
+    cooldown_precheck_threshold_seconds: float = 10.0
-                setattr(config, fld.name, parsed)
+
-            elif target_type is int or target_type == "int":
+    # Dashboard
-                setattr(config, fld.name, int(env_val))
+    dashboard_sse_interval_seconds: float = 1.0
-            elif target_type is float or target_type == "float":
+
-                setattr(config, fld.name, float(env_val))
+    # Stats
    stats_refresh_interval_seconds: float = 30.0
    # Request timeout
    default_request_timeout_seconds: int = 120
    @classmethod
    def from_env(cls) -> "Config":
        """Load configuration from environment variables."""
        c = cls()
        # Listen
        c.host = os.getenv("SIDECAR_HOST", c.host)
        c.port = int(os.getenv("SIDECAR_PORT", str(c.port)))
        c.metrics_port = int(os.getenv("SIDECAR_METRICS_PORT", str(c.metrics_port)))
        # Queue
        c.queue_max_depth = int(os.getenv("SIDECAR_QUEUE_MAX", str(c.queue_max_depth)))
        c.queue_timeout_seconds = float(
            os.getenv("SIDECAR_QUEUE_TIMEOUT", str(c.queue_timeout_seconds))
        )
        # Provider
        c.default_rpm_limit = int(
            os.getenv("SIDECAR_RATE_RPM", str(c.default_rpm_limit))
        )
        # Cooldown
        c.cooldown_base_seconds = float(
            os.getenv("SIDECAR_COOLDOWN_BASE", str(c.cooldown_base_seconds))
        )
        c.cooldown_max_seconds = float(
            os.getenv("SIDECAR_COOLDOWN_MAX", str(c.cooldown_max_seconds))
        )
        # Admin
        c.admin_token = os.getenv("SIDECAR_ADMIN_TOKEN", c.admin_token)
        # Encryption
        c.encryption_key = os.getenv("SIDECAR_ENCRYPTION_KEY", c.encryption_key)
        # Logging
        c.log_level = os.getenv("LOG_LEVEL", c.log_level).upper()
        # Database
        c.db_path = os.getenv(
            "SIDECAR_DB_PATH",
            os.path.join(os.getcwd(), "data", "sidecar_v2.db"),
        )
        c.backup_dir = os.getenv(
            "SIDECAR_BACKUP_DIR",
            os.path.join(os.getcwd(), "data", "backups"),
        )
        # V1 compatibility: migrate env vars
        c._migrate_v1_env()
        return c
    def _migrate_v1_env(self) -> None:
        """Migrate V1 environment variables to V2 defaults."""
        # V1 UPSTREAM endpoint
        upstream = os.getenv("SIDECAR_UPSTREAM")
        api_key = os.getenv("SIDECAR_API_KEY")
        if api_key and self.encryption_key:
            # These will be used during initial migration
            os.environ["_SIDECAR_V1_API_KEY"] = api_key
            os.environ["_SIDECAR_V1_UPSTREAM"] = upstream or "https://integrate.api.nvidia.com/v1"
    def to_db_dict(self) -> dict:
        """Serialize to dict for system_config storage."""
        result = {}
        for key, value in asdict(self).items():
            if isinstance(value, bool):
                result[key] = "true" if value else "false"
            elif isinstance(value, (int, float)):
                result[key] = str(value)
            else:
-                setattr(config, fld.name, env_val)
+                result[key] = value
-        except (ValueError, TypeError) as exc:
+        return result
            warnings.warn(
                f"Cannot convert environment variable {env_key}={env_val!r} to {target_type_name}: {exc}"
            )
-    return config
+    @classmethod
    def merge_db(cls, base: "Config", db_config: dict) -> "Config":
        """Merge DB config into base config (env vars already applied to base)."""
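        for key, value in base.__dict__.items():
            if key in db_config and key not in os.environ:
                # DB values only apply when no env var override
                raw = db_config[key]
                if isinstance(value, bool):
                    # to_db_dict() stores "true"/"false"; bool("false") would be True
                    raw = str(raw).strip().lower() in ("true", "1", "yes", "on")
                setattr(base, key, type(value)(raw))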
        return base
-def _validate_config(config: SidecarConfig) -> list[str]:
+# Singleton
-    """Validate the configuration and return a list of warnings/issues."""
+config = Config.from_env()
    issues: list[str] = []
    # Port conflict check
    if config.listen_port == config.metrics_port:
        issues.append(
            f"listen_port ({config.listen_port}) is the same as metrics_port ({config.metrics_port})"
        )
    # rate_rpm bounds check
    if config.rate_rpm <= 0:
        issues.append(
            f"rate_rpm ({config.rate_rpm}) is invalid; falling back to the default of 40"
        )
        config.rate_rpm = 40
    # queue_max_size sanity check
    if config.queue_max_size <= 0:
        issues.append(
            f"queue_max_size ({config.queue_max_size}) is invalid; falling back to the default of 500"
        )
        config.queue_max_size = 500
    # request_timeout sanity check
    if config.request_timeout <= 0:
        issues.append(
            f"request_timeout ({config.request_timeout}) is invalid; falling back to the default of 60"
        )
        config.request_timeout = 60.0
    elif config.request_timeout > 300.0:
        issues.append(
            f"request_timeout ({config.request_timeout}) is unusually high; clamped to 300"
        )
        config.request_timeout = 300.0
    return issues
 def load_config(path: str | None = None) -> SidecarConfig:
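    """Load the Sidecar configuration.
    Load order (later sources override earlier ones):
    1. Defaults (SidecarConfig dataclass defaults)
    2. YAML config file (if path is provided)
    3. Environment variable overrides
    Args:
        path: Optional path to a YAML config file. When None, only defaults and environment variables are used.
    Returns:
        A validated SidecarConfig instance.
    Raises:
        FileNotFoundError: The file given by path does not exist.
        yaml.YAMLError: YAML parsing failed.
    """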
    config = SidecarConfig()
    if path is not None:
        import yaml
        cfg_path = Path(path)
        if not cfg_path.is_file():
            raise FileNotFoundError(f"Config file not found: {cfg_path}")
        try:
            raw: dict[str, Any] = yaml.safe_load(cfg_path.read_text(encoding="utf-8")) or {}
        except yaml.YAMLError as exc:
            raise yaml.YAMLError(f"YAML parsing failed ({cfg_path}): {exc}") from exc
        # Override the declared fields
        for fld_name in (
            "listen_host", "listen_port", "metrics_port",
            "upstream_url", "upstream_api_key",
            "rate_rpm", "bucket_capacity",
            "request_timeout",
            "queue_max_size", "low_priority_timeout",
            "fallback_enabled_passthrough",
            "log_level",
        ):
            if fld_name in raw:
                setattr(config, fld_name, raw[fld_name])
    # Environment variable overrides (highest priority)
    config = _apply_env_overrides(config)
    # Validation
    issues = _validate_config(config)
    for issue in issues:
        warnings.warn(issue)
    return config
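
The layered loading above (defaults, then stored values, then environment overrides) depends on turning stored strings back into typed fields. The following is a minimal sketch of that idea, not the project's implementation: `DemoConfig`, `coerce_like`, `merge`, and every default value shown are hypothetical stand-ins chosen for illustration. It shows why booleans need explicit parsing, since `bool("false")` evaluates to `True` in Python.

```python
from dataclasses import dataclass, fields

_TRUE = {"true", "1", "yes", "on"}


def coerce_like(default, raw: str):
    """Convert a stored string back to the type of the field's current value."""
    if isinstance(default, bool):  # must precede int: bool is a subclass of int
        return raw.strip().lower() in _TRUE
    if isinstance(default, int):
        return int(raw)
    if isinstance(default, float):
        return float(raw)
    return raw


@dataclass
class DemoConfig:  # hypothetical stand-in, not the real Config class
    port: int = 8080
    queue_timeout_seconds: float = 30.0
    cooldown_exponential_backoff: bool = True
    log_level: str = "INFO"


def merge(base: DemoConfig, db_values: dict, env_overridden: set) -> DemoConfig:
    """Apply stored values only to fields without an environment override."""
    for fld in fields(base):
        if fld.name in db_values and fld.name not in env_overridden:
            current = getattr(base, fld.name)
            setattr(base, fld.name, coerce_like(current, db_values[fld.name]))
    return base


cfg = merge(
    DemoConfig(port=9000),  # pretend 9000 came from an environment variable
    {"port": "7000", "cooldown_exponential_backoff": "false", "log_level": "DEBUG"},
    env_overridden={"port"},
)
print(cfg)
```

Tracking overridden fields by field name, rather than checking the raw field name against `os.environ`, keeps the precedence rule explicit; whether the real code should do the same is a design decision for the authors.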
@@ -1,75 +0,0 @@
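 """
 NVIDIA Sidecar: SidecarContext dependency-injection container (§BIZ-46 Phase3)
 Consolidates all module-level global state into a single dataclass injected through FastAPI app.state,
 removing the reverse import from webui.py into server and enabling testability and multi-instance scaling.
 Design doc: docs/architecture/BIZ-46_Phase3_Architecture_Design.md §1
 """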
 from __future__ import annotations
 import asyncio
 import time
 from dataclasses import dataclass, field
 from typing import TYPE_CHECKING, Any
 import httpx
 if TYPE_CHECKING:
    from nvidia_sidecar.config import SidecarConfig
    from nvidia_sidecar.rate_limiter import AdaptiveTokenBucket
    from nvidia_sidecar.priority_queue import PriorityRequestQueue
    from nvidia_sidecar.metrics import PrometheusMetrics
    from nvidia_sidecar.health import HealthService
@dataclass
 class SidecarContext:
    """Global Sidecar runtime context: the single container for all core components.
    Injected into FastAPI via ``app.state.sidecar``; routes obtain it through ``Depends(get_context)``.
    """
    # ---- Core components ----
    config: SidecarConfig
    http_client: httpx.AsyncClient
    token_bucket: AdaptiveTokenBucket
    priority_queue: PriorityRequestQueue
    prometheus: PrometheusMetrics
    health: HealthService
    # ---- Runtime state ----
    pending_requests: dict[str, tuple["asyncio.Future[Any]", float]] = field(default_factory=dict)
    """Maps request_id to (response future, enqueued_at)."""
    stats: dict[str, int] = field(default_factory=lambda: {
        "total_requests": 0,
        "nvidia_requests": 0,
        "passthrough_requests": 0,
        "ratelimited_requests": 0,
        "queue_full_rejects": 0,
        "upstream_errors": 0,
        "start_time": 0,
    })
    stats_lock: asyncio.Lock = field(default_factory=asyncio.Lock)
    # ---- Cache ----
    snapshot_cache: tuple["dict[str, Any]", float] | None = None
    """Shared SSE snapshot cache: (data, timestamp)."""
    snapshot_cache_lock: asyncio.Lock = field(default_factory=asyncio.Lock)
    SNAPSHOT_CACHE_TTL: float = 1.0
    # ---- Convenience methods ----
    async def increment_stat(self, key: str, delta: int = 1) -> None:
        """Increment a stats counter; the asyncio.Lock makes this safe across coroutines, not OS threads."""
        async with self.stats_lock:
            self.stats[key] = self.stats.get(key, 0) + delta
    @property
    def uptime_seconds(self) -> int:
        """Service uptime in seconds."""
        st = self.stats.get("start_time", 0)
        return int(time.time() - st) if st else 0
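
The container above replaces module-level globals so that each instance carries its own state. A minimal, stdlib-only sketch of that property follows; `DemoContext`, `handle_request`, and `run_demo` are hypothetical names, and the real class additionally holds the HTTP client, token bucket, queue, and metrics objects. The handler receives the context as an argument, which is the role `Depends(get_context)` would play in a FastAPI route.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class DemoContext:  # hypothetical, trimmed stand-in for SidecarContext
    stats: dict = field(default_factory=lambda: {"total_requests": 0})
    stats_lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    async def increment_stat(self, key: str, delta: int = 1) -> None:
        # The lock serializes updates between coroutines on one event loop.
        async with self.stats_lock:
            self.stats[key] = self.stats.get(key, 0) + delta


async def handle_request(ctx: DemoContext) -> None:
    """Stand-in for a route handler that is handed its context explicitly."""
    await ctx.increment_stat("total_requests")


async def run_demo() -> tuple:
    # Two independent contexts: no shared module-level state between them.
    ctx_a, ctx_b = DemoContext(), DemoContext()
    await asyncio.gather(*(handle_request(ctx_a) for _ in range(100)))
    await handle_request(ctx_b)
    return ctx_a.stats["total_requests"], ctx_b.stats["total_requests"]


counts = asyncio.run(run_demo())
print(counts)
```

Because each test can build its own context, tests no longer need to patch module attributes and can run side by side.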
@@ -0,0 +1,114 @@
 """429 Cooldown management for backends using exponential backoff."""
 import time
 from datetime import datetime
 import structlog
 from config import config
 from storage.backend_store import set_backend_cooldown, clear_backend_cooldown, get_backend
 from storage.cooldown_store import log_cooldown_event, end_cooldown_event
 logger = structlog.get_logger("sidecar_v2.cooldown_manager")
 def calculate_cooldown(consecutive_count: int) -> float:
    """Calculate cooldown duration using exponential backoff.
    Formula: base * 2^(consecutive-1), capped at max.
    """
    base = config.cooldown_base_seconds
    max_seconds = config.cooldown_max_seconds
    if config.cooldown_exponential_backoff:
        duration = base * (2 ** (consecutive_count - 1))
    else:
        duration = base * consecutive_count
    return min(duration, max_seconds)
 def start_cooldown(backend_id: str, consecutive_count: int) -> float:
    """Start cooldown for a backend after 429.
    Returns: cooldown duration in seconds.
    """
    duration = calculate_cooldown(consecutive_count)
    cooldown_until_ts = time.time() + duration
    cooldown_until = time.strftime(
        "%Y-%m-%dT%H:%M:%SZ", time.gmtime(cooldown_until_ts)
    )
    set_backend_cooldown(backend_id, cooldown_until, consecutive_count)
    log_cooldown_event(
        backend_id=backend_id,
        consecutive_count=consecutive_count,
        cooldown_seconds=int(duration),
        response_summary=f"429 cooldown triggered (consecutive #{consecutive_count})",
    )
    logger.info(
        "cooldown_started",
        backend_id=backend_id,
        duration=round(duration, 1),
        consecutive=consecutive_count,
    )
    return duration
 def check_and_clear_cooldown(backend_id: str) -> bool:
    """Check if cooldown has expired for a backend.
    Returns True if cooldown was cleared (backend is back online).
    """
    backend = get_backend(backend_id, decrypt_key=False)
    if backend is None:
        return False
    if backend.status != "cooling":
        return False
    cooldown_until = backend.cooldown_until
    if not cooldown_until:
        clear_backend_cooldown(backend_id)
        return True
    # Parse cooldown_until as ISO timestamp
    try:
        dt = datetime.fromisoformat(cooldown_until.replace("Z", "+00:00"))
        cooldown_ts = dt.timestamp()
    except ValueError:
        # If parsing fails, clear and move on
        clear_backend_cooldown(backend_id)
        return True
    now = time.time()
    if now >= cooldown_ts:
        clear_backend_cooldown(backend_id)
        end_cooldown_event(backend_id)
        logger.info("cooldown_cleared", backend_id=backend_id)
        return True
    remaining = cooldown_ts - now
    logger.debug("cooldown_active", backend_id=backend_id, remaining_seconds=round(remaining, 1))
    return False
 def precheck_cooldown(backend_id: str) -> bool:
    """Check if backend should be skipped due to near-expiry cooldown.
    If cooldown will expire within config.cooldown_precheck_threshold_seconds,
    skip the backend so we don't hit it again right as it expires.
    """
    backend = get_backend(backend_id, decrypt_key=False)
    if backend is None or backend.status != "cooling":
        return False
    cooldown_until = backend.cooldown_until
    if not cooldown_until:
        return False
    try:
        dt = datetime.fromisoformat(cooldown_until.replace("Z", "+00:00"))
        cooldown_ts = dt.timestamp()
    except ValueError:
        return False
    remaining = cooldown_ts - time.time()
    return 0 < remaining <= config.cooldown_precheck_threshold_seconds
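
To make the backoff formula in `calculate_cooldown` concrete, here is a small worked sketch. The function names are hypothetical, and the base of 30 seconds and cap of 600 seconds are assumed values for illustration only (the diff does not show the configured defaults); the 10-second precheck threshold matches `cooldown_precheck_threshold_seconds` above.

```python
def cooldown_seconds(consecutive: int, base: float = 30.0, cap: float = 600.0,
                     exponential: bool = True) -> float:
    """base * 2^(consecutive-1) when exponential, else base * consecutive; capped."""
    if exponential:
        duration = base * (2 ** (consecutive - 1))
    else:
        duration = base * consecutive
    return min(duration, cap)


def in_precheck_window(remaining: float, threshold: float = 10.0) -> bool:
    """True when a cooldown is still active but ends within `threshold` seconds."""
    return 0 < remaining <= threshold


# Durations for the 1st through 6th consecutive 429 under the assumed values.
schedule = [cooldown_seconds(n) for n in range(1, 7)]
print(schedule)
print(in_precheck_window(4.2), in_precheck_window(45.0), in_precheck_window(-1.0))
```

With these assumed values the duration doubles on each consecutive 429 until the sixth, where the cap takes over. Note that a `consecutive` value of 0 would yield half the base under the exponential branch, so callers are expected to pass 1 or more.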
@@ -0,0 +1,108 @@
 """AES-256-GCM encryption for API Key storage."""
 import os
 import secrets
 import structlog
 from cryptography.hazmat.primitives.ciphers.aead import AESGCM
 logger = structlog.get_logger()
 _ENCRYPTION_KEY: bytes | None = None
 _cipher: AESGCM | None = None
 def init_crypto(hex_key: str) -> None:
    """Initialize the encryption module.
    Validates the key and prepares the cipher.
    Raises ValueError if key is invalid.
    """
    global _ENCRYPTION_KEY, _cipher
    if not hex_key:
        raise ValueError("FATAL: SIDECAR_ENCRYPTION_KEY not set")
    if len(hex_key) != 64:
        raise ValueError(
            f"FATAL: SIDECAR_ENCRYPTION_KEY must be 64 hex chars (32 bytes), "
            f"got {len(hex_key)} chars"
        )
    try:
        key_bytes = bytes.fromhex(hex_key)
    except ValueError as exc:
        raise ValueError(
            "FATAL: SIDECAR_ENCRYPTION_KEY must be valid hexadecimal"
        ) from exc
    _ENCRYPTION_KEY = key_bytes
    _cipher = AESGCM(key_bytes)
    logger.info("crypto_initialized")
 def encrypt(plaintext: str) -> str:
    """Encrypt plaintext using AES-256-GCM.
    Returns: hex-encoded nonce (12 bytes) + ciphertext + tag.
    Format: <nonce_hex>:<ciphertext_hex>
    """
    if _cipher is None:
        raise RuntimeError("Crypto not initialized. Call init_crypto() first.")
    nonce = secrets.token_bytes(12)
    ciphertext = _cipher.encrypt(nonce, plaintext.encode("utf-8"), None)
    return nonce.hex() + ":" + ciphertext.hex()
 def decrypt(encrypted: str) -> str:
    """Decrypt AES-256-GCM ciphertext.
    Args:
        encrypted: Format "<nonce_hex>:<ciphertext_hex>"
    Returns: Decrypted plaintext string.
    """
    if _cipher is None:
        raise RuntimeError("Crypto not initialized. Call init_crypto() first.")
    parts = encrypted.split(":", 1)
    if len(parts) != 2:
        raise ValueError("Invalid encrypted format: expected nonce:ciphertext")
    nonce = bytes.fromhex(parts[0])
    ciphertext = bytes.fromhex(parts[1])
    try:
        plaintext = _cipher.decrypt(nonce, ciphertext, None)
        return plaintext.decode("utf-8")
    except Exception as e:
        raise ValueError(f"Decryption failed: {e}") from e
 def is_initialized() -> bool:
    """Check if crypto has been initialized."""
    return _cipher is not None
 def mask_api_key(api_key_plain: str) -> str:
    """Mask API key for display: first 6 + last 4 chars, or only the first 2 for keys of 10 chars or fewer."""
    if len(api_key_plain) <= 10:
        return api_key_plain[:2] + "****"
    return api_key_plain[:6] + "****" + api_key_plain[-4:]
 def try_decrypt_existing(encrypted_value: str) -> str | None:
    """Try to decrypt an existing encrypted value.
    Returns the plaintext if successful, None if decryption fails
    (e.g., encryption key was changed).
    """
    try:
        return decrypt(encrypted_value)
    except Exception:
        logger.warning(
            "decrypt_existing_failed",
            hint="Encryption key may have been changed, existing keys unrecoverable"
        )
        return None
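
The module above expects a 64-character hex key and stores values as `<nonce_hex>:<ciphertext_hex>`. The sketch below illustrates only those two formats using the standard library; it performs no encryption, and `generate_hex_key`, `parse_hex_key`, and `split_envelope` are hypothetical helpers, not functions from the diff.

```python
import secrets


def generate_hex_key() -> str:
    """Produce a random 32-byte key as 64 hex characters."""
    return secrets.token_hex(32)


def parse_hex_key(hex_key: str) -> bytes:
    """Validate the key the same way init_crypto does: length first, then hex."""
    if len(hex_key) != 64:
        raise ValueError(f"key must be 64 hex chars, got {len(hex_key)}")
    return bytes.fromhex(hex_key)  # raises ValueError on non-hex input


def split_envelope(encrypted: str) -> tuple:
    """Split '<nonce_hex>:<ciphertext_hex>' into raw nonce and ciphertext bytes."""
    nonce_hex, sep, ciphertext_hex = encrypted.partition(":")
    if not sep:
        raise ValueError("expected nonce:ciphertext")
    nonce = bytes.fromhex(nonce_hex)
    if len(nonce) != 12:
        raise ValueError("nonce must be 12 bytes")
    return nonce, bytes.fromhex(ciphertext_hex)


key = generate_hex_key()
key_bytes = parse_hex_key(key)
nonce, body = split_envelope("00" * 12 + ":" + "ab" * 20)
print(len(key), len(key_bytes), len(nonce), len(body))
```

A key generated this way could be exported as `SIDECAR_ENCRYPTION_KEY`. As the warning in `try_decrypt_existing` notes, changing the key later leaves previously stored values unrecoverable.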
@@ -0,0 +1,623 @@
 <!DOCTYPE html>
 <html lang="zh-CN">
 <head>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0">
 <title>Sidecar V2 — Provider Pool Dashboard</title>
 <!-- Primary: jsDelivr CDN -->
 <script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script>
 <!-- Fallback: local static copy for offline/intranet deployments -->
 <script>
 (function() {
  var check = function() {
    if (typeof Chart === 'undefined') {
      var s = document.createElement('script');
      s.src = '/static/chart.umd.min.js';
      s.onerror = function() {
        console.warn('Chart.js unavailable (CDN + local both failed). Charts disabled.');
      };
      document.head.appendChild(s);
    }
  };
  // The CDN <script> above is parser-blocking, so by now it has either run or failed; check immediately.
  check();
 })();
 </script>
 <style>
  :root {
    --bg: #0f1117;
    --card-bg: #1a1d28;
    --border: #2a2d3a;
    --text: #e0e0e0;
    --text-dim: #888;
    --green: #23d160;
    --yellow: #ffdd57;
    --red: #ff3860;
    --blue: #3273dc;
    --purple: #b86bff;
    --cyan: #00d1b2;
    --orange: #ff8533;
  }
  * { margin: 0; padding: 0; box-sizing: border-box; }
  body {
    font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif;
    background: var(--bg);
    color: var(--text);
    min-height: 100vh;
  }
  /* Layout */
  .app { display: flex; height: 100vh; }
  .sidebar {
    width: 220px; background: var(--card-bg); border-right: 1px solid var(--border);
    padding: 20px 0; display: flex; flex-direction: column;
  }
  .sidebar h2 { padding: 0 20px 20px; font-size: 16px; color: var(--cyan); border-bottom: 1px solid var(--border); }
  .sidebar nav { flex: 1; padding: 10px 0; }
  .sidebar nav a {
    display: block; padding: 10px 20px; color: var(--text-dim); text-decoration: none;
    font-size: 13px; transition: 0.2s;
  }
  .sidebar nav a:hover, .sidebar nav a.active { color: var(--text); background: rgba(255,255,255,0.05); }
  .sidebar .status-bar { padding: 15px 20px; border-top: 1px solid var(--border); font-size: 11px; color: var(--text-dim); }
  .main { flex: 1; overflow-y: auto; padding: 24px; }
  .page { display: none; }
  .page.active { display: block; }
  /* Dashboard Cards */
  .cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 16px; margin-bottom: 24px; }
  .card {
    background: var(--card-bg); border: 1px solid var(--border); border-radius: 8px; padding: 16px;
  }
  .card .label { font-size: 12px; color: var(--text-dim); text-transform: uppercase;letter-spacing:0.5px;margin-bottom:6px; }
  .card .value { font-size: 28px; font-weight: 700; }
  .card .sub { font-size: 12px; color: var(--text-dim); margin-top: 4px; }
  .charts { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; }
  .chart-card {
    background: var(--card-bg); border: 1px solid var(--border); border-radius: 8px; padding: 16px;
  }
  .chart-card h3 { font-size: 14px; margin-bottom: 12px; color: var(--text-dim); }
  .chart-card canvas { max-height: 250px; }
  /* Pool Cards */
  .pool-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; margin-bottom: 24px; }
  .pool-card {
    background: var(--card-bg); border: 1px solid var(--border); border-radius: 8px; padding: 16px;
  }
  .pool-card h3 { font-size: 15px; margin-bottom: 12px; text-transform: uppercase; letter-spacing: 1px; }
  .pool-card h3.primary { color: var(--blue); }
  .pool-card h3.fallback { color: var(--orange); }
  .pool-stats { display: grid; grid-template-columns: repeat(4, 1fr); gap: 8px; }
  .pool-stat { text-align: center; }
  .pool-stat .num { font-size: 22px; font-weight: 700; }
  .pool-stat .lbl { font-size: 11px; color: var(--text-dim); margin-top: 2px; }
  .pool-stat.healthy .num { color: var(--green); }
  .pool-stat.cooling .num { color: var(--yellow); }
  .pool-stat.error .num { color: var(--red); }
  .pool-stat.total .num { color: var(--purple); }
  /* Tables */
  table { width: 100%; border-collapse: collapse; background: var(--card-bg); border-radius: 8px; overflow: hidden; }
  th { text-align: left; padding: 10px 12px; font-size: 11px; text-transform: uppercase; letter-spacing: 0.5px; color: var(--text-dim); background: rgba(255,255,255,0.03); border-bottom: 1px solid var(--border); }
  td { padding: 10px 12px; font-size: 13px; border-bottom: 1px solid var(--border); }
  tr:last-child td { border-bottom: none; }
  tr:hover { background: rgba(255,255,255,0.02); }
  .badge {
    display: inline-block; padding: 2px 8px; border-radius: 10px; font-size: 11px; font-weight: 600;
  }
  .badge.healthy { background: rgba(35,209,96,0.15); color: var(--green); }
  .badge.cooling { background: rgba(255,221,87,0.15); color: var(--yellow); }
  .badge.error { background: rgba(255,56,96,0.15); color: var(--red); }
  .badge.disabled { background: rgba(136,136,136,0.15); color: var(--text-dim); }
  .badge.primary { background: rgba(50,115,220,0.15); color: var(--blue); }
  .badge.fallback { background: rgba(255,133,51,0.15); color: var(--orange); }
  /* Buttons */
  .btn {
    padding: 6px 14px; border-radius: 6px; border: none; cursor: pointer; font-size: 12px; font-weight: 600;
    transition: 0.2s;
  }
  .btn-primary { background: var(--blue); color: #fff; }
  .btn-primary:hover { opacity: 0.85; }
  .btn-danger { background: var(--red); color: #fff; }
  .btn-danger:hover { opacity: 0.85; }
  .btn-sm { padding: 3px 10px; font-size: 11px; }
  .btn-outline { background: transparent; border: 1px solid var(--border); color: var(--text); }
  .btn-outline:hover { background: rgba(255,255,255,0.05); }
  .section-header { display: flex; justify-content: space-between; align-items: center; margin-bottom: 12px; }
  .section-header h3 { font-size: 15px; }
  /* Modal */
  .modal-overlay { display: none; position: fixed; top: 0; left: 0; right: 0; bottom: 0; background: rgba(0,0,0,0.7); z-index: 100; justify-content: center; align-items: center; }
  .modal-overlay.active { display: flex; }
  .modal { background: var(--card-bg); border: 1px solid var(--border); border-radius: 12px; padding: 24px; width: 560px; max-height: 80vh; overflow-y: auto; }
  .modal h3 { margin-bottom: 16px; font-size: 16px; }
  .form-group { margin-bottom: 12px; }
  .form-group label { display: block; font-size: 12px; color: var(--text-dim); margin-bottom: 4px; }
  .form-group input, .form-group select, .form-group textarea {
    width: 100%; padding: 8px 10px; background: var(--bg); border: 1px solid var(--border);
    border-radius: 6px; color: var(--text); font-size: 13px;
  }
  .form-group textarea { min-height: 80px; font-family: monospace; font-size: 12px; }
  .form-row { display: grid; grid-template-columns: 1fr 1fr; gap: 12px; }
  .form-actions { display: flex; gap: 8px; justify-content: flex-end; margin-top: 16px; }
  .model-mapping-row { display: flex; gap: 8px; align-items: center; margin-bottom: 8px; }
  .model-mapping-row input { flex: 1; }
  /* Utility */
  .text-green { color: var(--green); }
  .text-red { color: var(--red); }
  .text-dim { color: var(--text-dim); }
  .mb-16 { margin-bottom: 16px; }
  .mb-24 { margin-bottom: 24px; }
  .mt-24 { margin-top: 24px; }
  @media (max-width: 768px) {
    .charts, .pool-grid { grid-template-columns: 1fr; }
    .sidebar { display: none; }
  }
 </style>
 </head>
 <body>
 <div class="app">
  <!-- Sidebar -->
  <aside class="sidebar">
    <h2>🚀 Sidecar V2</h2>
    <nav>
      <a href="#" data-page="dashboard" class="active">📊 Dashboard</a>
      <a href="#" data-page="providers">🔌 Providers</a>
      <a href="#" data-page="usage">📈 Usage Stats</a>
      <a href="#" data-page="cooldown">🧊 Cooldown Log</a>
    </nav>
    <div class="status-bar" id="status-bar">Connected · Sidecar V2</div>
  </aside>
  <!-- Main Content -->
  <main class="main">
    <!-- Dashboard Page -->
    <div class="page active" id="page-dashboard">
      <div class="cards" id="stat-cards"></div>
      <div class="pool-grid" id="pool-grid"></div>
      <div class="charts" id="charts"></div>
    </div>
    <!-- Providers Page -->
    <div class="page" id="page-providers">
      <div class="section-header">
        <h3>Provider Backends</h3>
        <button class="btn btn-primary" onclick="showAddBackend()">+ Add Provider</button>
      </div>
      <table id="backends-table">
        <thead>
          <tr><th>Name</th><th>Label</th><th>Pool</th><th>Status</th><th>RPM</th><th>Models</th><th>Actions</th></tr>
        </thead>
        <tbody></tbody>
      </table>
    </div>
    <!-- Usage Page -->
    <div class="page" id="page-usage">
      <div class="section-header"><h3>Hourly Usage</h3></div>
      <div class="mb-16">
        <select id="usage-backend-filter" onchange="loadUsage()" class="btn btn-outline btn-sm">
          <option value="">All Backends</option>
        </select>
      </div>
      <table id="usage-table">
        <thead>
          <tr><th>Hour</th><th>Backend</th><th>Model</th><th>Requests</th><th>Errors</th><th>Tokens</th><th>Cost</th><th>Avg Latency</th></tr>
        </thead>
        <tbody></tbody>
      </table>
      <div class="section-header mt-24 mb-16"><h3>Daily Aggregation</h3></div>
      <table id="daily-table">
        <thead>
          <tr><th>Date</th><th>Pool</th><th>Requests</th><th>Errors</th><th>Tokens</th><th>Cost</th><th>Backends</th></tr>
        </thead>
        <tbody></tbody>
      </table>
    </div>
    <!-- Cooldown Page -->
    <div class="page" id="page-cooldown">
      <div class="section-header"><h3>Cooldown Event History</h3></div>
      <table id="cooldown-table">
        <thead>
          <tr><th>Time</th><th>Backend</th><th>Consecutive 429s</th><th>Duration</th><th>Summary</th></tr>
        </thead>
        <tbody></tbody>
      </table>
    </div>
  </main>
 </div>
 <!-- Add/Edit Backend Modal -->
 <div class="modal-overlay" id="backend-modal">
  <div class="modal">
    <h3 id="modal-title">Add Provider</h3>
    <form id="backend-form" onsubmit="saveBackend(event)">
      <input type="hidden" id="backend-id">
      <div class="form-row">
        <div class="form-group">
          <label>Name *</label>
          <input type="text" id="backend-name" placeholder="e.g. NVIDIA H100 Primary" required>
        </div>
        <div class="form-group">
          <label>Label</label>
          <input type="text" id="backend-label" placeholder="e.g. nvidia, siliconflow">
        </div>
      </div>
      <div class="form-group">
        <label>API Base URL *</label>
        <input type="url" id="backend-url" placeholder="https://integrate.api.nvidia.com/v1" required>
      </div>
      <div class="form-group">
        <label>API Key *</label>
        <input type="password" id="backend-key" placeholder="sk-..." required>
      </div>
      <div class="form-row">
        <div class="form-group">
          <label>Pool</label>
          <select id="backend-pool">
            <option value="primary">Primary</option>
            <option value="fallback">Fallback</option>
          </select>
        </div>
        <div class="form-group">
          <label>RPM Limit</label>
          <input type="number" id="backend-rpm" value="40" min="1" max="1000">
        </div>
      </div>
      <div class="form-row">
        <div class="form-group">
          <label>Timeout (seconds)</label>
          <input type="number" id="backend-timeout" value="120" min="10" max="600">
        </div>
        <div class="form-group">
          <label>Enabled</label>
          <select id="backend-enabled">
            <option value="true">Yes</option>
            <option value="false">No</option>
          </select>
        </div>
      </div>
      <div class="form-group">
        <label>Model Mappings (JSON: canonical → {native_id, cost, ...})</label>
        <textarea id="backend-mappings" placeholder='{"deepseek-ai/DeepSeek-V4-Pro":{"native_id":"deepseek-ai/deepseek-v4-pro","cost":{"input":0.000001,"output":0.000004}}}'></textarea>
      </div>
      <div class="form-actions">
        <button type="button" class="btn btn-outline" onclick="closeModal()">Cancel</button>
        <button type="submit" class="btn btn-primary">Save</button>
      </div>
    </form>
  </div>
 </div>
 <script>
 // ── Navigation ──
 document.querySelectorAll('.sidebar nav a').forEach(a => {
  a.addEventListener('click', e => {
    e.preventDefault();
    document.querySelectorAll('.sidebar nav a').forEach(l => l.classList.remove('active'));
    a.classList.add('active');
    document.querySelectorAll('.page').forEach(p => p.classList.remove('active'));
    document.getElementById('page-' + a.dataset.page).classList.add('active');
    loadPage(a.dataset.page);
  });
 });
 // ── SSE Connection ──
 const sse = new EventSource('/dashboard/sse');
 sse.onmessage = e => {
  const data = JSON.parse(e.data);
  if (data.type === 'snapshot') updateDashboard(data);
 };
 sse.onerror = () => {
  document.getElementById('status-bar').textContent = '⚠️ SSE Disconnected';
 };
 // ── Dashboard Update ──
 let costChart = null, tokenChart = null;
 function updateDashboard(data) {
  document.getElementById('status-bar').textContent =
    `⚡ Connected · Uptime ${formatDuration(data.uptime_seconds)}`;
  // Stat cards
  const st = data.total || {};
  const errRate = st.total_requests > 0 ? ((st.total_errors || 0) / st.total_requests * 100).toFixed(1) : '0.0';
  document.getElementById('stat-cards').innerHTML = `
    <div class="card"><div class="label">Total Requests</div><div class="value">${fmt(st.total_requests)}</div><div class="sub">Error rate: ${errRate}%</div></div>
    <div class="card"><div class="label">Total Tokens</div><div class="value">${fmt(st.total_tokens)}</div><div class="sub">Prompt: ${fmt(st.total_prompt_tokens)} · Completion: ${fmt(st.total_completion_tokens)}</div></div>
    <div class="card"><div class="label">Total Cost</div><div class="value">$${st.total_cost ? st.total_cost.toFixed(4) : '0.0000'}</div><div class="sub">USD</div></div>
    <div class="card"><div class="label">Uptime</div><div class="value">${formatDuration(data.uptime_seconds)}</div><div class="sub">Sidecar V2</div></div>
  `;
  // Pool grid
  let poolHTML = '';
  for (const [pool, ps] of Object.entries(data.pool || {})) {
    poolHTML += `
      <div class="pool-card">
        <h3 class="${pool}">${pool}</h3>
        <div class="pool-stats">
          <div class="pool-stat total"><div class="num">${ps.total}</div><div class="lbl">Total</div></div>
          <div class="pool-stat healthy"><div class="num">${ps.healthy}</div><div class="lbl">Healthy</div></div>
          <div class="pool-stat cooling"><div class="num">${ps.cooling}</div><div class="lbl">Cooling</div></div>
          <div class="pool-stat error"><div class="num">${ps.error}</div><div class="lbl">Error</div></div>
        </div>
      </div>`;
  }
  document.getElementById('pool-grid').innerHTML = poolHTML || '<div class="card">No pools configured</div>';
  // Update backend table if on providers page
  if (document.getElementById('page-providers').classList.contains('active')) {
    renderBackendsTable(data.backends || []);
  }
 }
 // ── Chart Updates (use SSE data to build chart data) ──
 function initCharts() {
  const cc = document.getElementById('cost-chart');
  const tc = document.getElementById('token-chart');
  if (!cc || !tc) return;
  if (costChart) costChart.destroy();
  if (tokenChart) tokenChart.destroy();
  costChart = new Chart(cc, {
    type: 'line', data: { labels: [], datasets: [{ label: 'Cost (USD)', data: [], borderColor: '#00d1b2', backgroundColor: 'rgba(0,209,178,0.1)', fill: true, tension: 0.3 }] },
    options: { responsive: true, maintainAspectRatio: true, plugins: { legend: { labels: { color: '#888' } } }, scales: { x: { ticks: { color: '#888', maxTicksLimit: 12 } }, y: { ticks: { color: '#888' } } } }
  });
  tokenChart = new Chart(tc, {
    type: 'line', data: { labels: [], datasets: [{ label: 'Total Tokens', data: [], borderColor: '#b86bff', backgroundColor: 'rgba(184,107,255,0.1)', fill: true, tension: 0.3 }] },
    options: { responsive: true, maintainAspectRatio: true, plugins: { legend: { labels: { color: '#888' } } }, scales: { x: { ticks: { color: '#888', maxTicksLimit: 12 } }, y: { ticks: { color: '#888' } } } }
  });
 }
 // ── Providers Page ──
 function renderBackendsTable(backends) {
  const tbody = document.querySelector('#backends-table tbody');
  tbody.innerHTML = backends.map(b => `
    <tr>
      <td><strong>${h(b.name)}</strong></td>
      <td><span class="badge ${b.label ? 'primary' : ''}">${h(b.label || '-')}</span></td>
      <td><span class="badge ${h(b.pool)}">${h(b.pool)}</span></td>
      <td><span class="badge ${h(b.status)}">${h(b.status)}</span></td>
      <td>${b.rpm_limit}</td>
      <td>${b.model_count || 0}</td>
      <td>
        <button class="btn btn-outline btn-sm" onclick="editBackend('${b.id}')">Edit</button>
        <button class="btn btn-danger btn-sm" onclick="deleteBackend('${b.id}')">Del</button>
      </td>
    </tr>`).join('');
 }
 function showAddBackend() {
  document.getElementById('modal-title').textContent = 'Add Provider';
  document.getElementById('backend-id').value = '';
  document.getElementById('backend-name').value = '';
  document.getElementById('backend-label').value = '';
  document.getElementById('backend-url').value = '';
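  document.getElementById('backend-key').value = '';
  // Undo the edit-mode overrides applied by editBackend().
  document.getElementById('backend-key').placeholder = 'sk-...';
  document.getElementById('backend-key').required = true;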
  document.getElementById('backend-pool').value = 'primary';
  document.getElementById('backend-rpm').value = '40';
  document.getElementById('backend-timeout').value = '120';
  document.getElementById('backend-enabled').value = 'true';
  document.getElementById('backend-mappings').value = '{}';
  document.getElementById('backend-modal').classList.add('active');
 }
 async function editBackend(id) {
  try {
    const res = await fetch('/api/admin/backends/' + id);
    if (!res.ok) throw new Error('HTTP ' + res.status);
    const b = await res.json();
    document.getElementById('modal-title').textContent = 'Edit Provider';
    document.getElementById('backend-id').value = b.id;
    document.getElementById('backend-name').value = b.name;
    document.getElementById('backend-label').value = b.label || '';
    document.getElementById('backend-url').value = b.api_base_url;
    document.getElementById('backend-key').value = '';
    document.getElementById('backend-key').placeholder = '(leave blank to keep current)';
    document.getElementById('backend-key').required = false;
    document.getElementById('backend-pool').value = b.pool;
    document.getElementById('backend-rpm').value = b.rpm_limit;
    document.getElementById('backend-timeout').value = b.timeout_seconds;
    document.getElementById('backend-enabled').value = b.enabled ? 'true' : 'false';
    document.getElementById('backend-mappings').value = JSON.stringify(b.model_mappings || {}, null, 2);
    document.getElementById('backend-modal').classList.add('active');
  } catch (e) { alert('Failed to load backend: ' + e.message); }
 }
 async function saveBackend(e) {
  e.preventDefault();
  const id = document.getElementById('backend-id').value;
  try {
    // Build the body inside try so invalid JSON in the mappings field is reported instead of thrown uncaught.
    const body = {
      name: document.getElementById('backend-name').value,
      label: document.getElementById('backend-label').value,
      api_base_url: document.getElementById('backend-url').value,
      pool: document.getElementById('backend-pool').value,
      rpm_limit: parseInt(document.getElementById('backend-rpm').value, 10),
      timeout_seconds: parseInt(document.getElementById('backend-timeout').value, 10),
      enabled: document.getElementById('backend-enabled').value === 'true',
      model_mappings: JSON.parse(document.getElementById('backend-mappings').value || '{}'),
    };
    const key = document.getElementById('backend-key').value;
    if (key) body.api_key = key;
    const method = id ? 'PUT' : 'POST';
    const url = id ? '/api/admin/backends/' + id : '/api/admin/backends';
    const res = await fetch(url, { method, headers: { 'Content-Type': 'application/json' }, body: JSON.stringify(body) });
    if (!res.ok) throw new Error((await res.json()).detail || 'Save failed');
    closeModal();
    refreshAll();
  } catch (err) { alert('Error: ' + err.message); }
 }
 async function deleteBackend(id) {
  if (!confirm('Delete this provider? This cannot be undone.')) return;
  try {
    const res = await fetch('/api/admin/backends/' + id, { method: 'DELETE' });
    if (!res.ok) throw new Error('HTTP ' + res.status);
    refreshAll();
  } catch (e) { alert('Delete failed: ' + e.message); }
 }
 function closeModal() { document.getElementById('backend-modal').classList.remove('active'); }
 // ── Load Pages ──
 async function loadPage(page) {
  if (page === 'dashboard') {
    initCharts();
    loadChartData();
  } else if (page === 'providers') {
    refreshAll();
  } else if (page === 'usage') {
    loadUsageFilter();
    loadUsage();
    loadDaily();
  } else if (page === 'cooldown') {
    loadCooldown();
  }
 }
 async function refreshAll() {
  try {
    const res = await fetch('/api/admin/backends');
    const backends = await res.json();
    renderBackendsTable(backends);
  } catch (e) { console.error(e); }
 }
 async function loadUsageFilter() {
  try {
    const res = await fetch('/api/admin/backends');
    const backends = await res.json();
    const sel = document.getElementById('usage-backend-filter');
    sel.innerHTML = '<option value="">All Backends</option>' +
      backends.map(b => `<option value="${b.id}">${h(b.name)}</option>`).join('');
  } catch (e) { console.error(e); }
 }
 async function loadUsage() {
  const sel = document.getElementById('usage-backend-filter');
  const backendId = sel.value;
  const url = backendId ? `/api/admin/stats/hourly?backend_id=${encodeURIComponent(backendId)}&hours=72` : '/api/admin/stats/hourly?hours=72';
  try {
    const res = await fetch(url);
    const data = await res.json();
    const tbody = document.querySelector('#usage-table tbody');
    tbody.innerHTML = data.map(r => `
      <tr>
        <td>${r.hour_bucket}</td>
        <td>${r.backend_id}</td>
        <td>${h(r.model)}</td>
        <td>${fmt(r.request_count)}</td>
        <td class="${r.error_count > 0 ? 'text-red' : 'text-green'}">${r.error_count}</td>
        <td>${fmt(r.total_tokens)}</td>
        <td>$${(r.cost || 0).toFixed(6)}</td>
        <td>${r.avg_latency_ms}ms</td>
      </tr>`).join('');
  } catch (e) { console.error(e); }
 }
 async function loadDaily() {
  try {
    const res = await fetch('/api/admin/stats/daily?days=30');
    const data = await res.json();
    const tbody = document.querySelector('#daily-table tbody');
    tbody.innerHTML = data.map(r => `
      <tr>
        <td>${r.date}</td>
        <td><span class="badge ${r.pool}">${r.pool}</span></td>
        <td>${fmt(r.total_requests)}</td>
        <td>${fmt(r.total_errors)}</td>
        <td>${fmt(r.total_tokens)}</td>
        <td>$${(r.total_cost || 0).toFixed(6)}</td>
        <td>${r.unique_backends}</td>
      </tr>`).join('');
  } catch (e) { console.error(e); }
 }
 async function loadCooldown() {
  try {
    const res = await fetch('/api/admin/stats/cooldown?limit=100');
    const data = await res.json();
    const tbody = document.querySelector('#cooldown-table tbody');
    tbody.innerHTML = data.map(r => `
      <tr>
        <td>${r.started_at}</td>
        <td>${r.backend_id}</td>
        <td>${r.consecutive_count}</td>
        <td>${r.cooldown_seconds}s</td>
        <td>${h(r.response_summary)}</td>
      </tr>`).join('');
  } catch (e) { console.error(e); }
 }
 async function loadChartData() {
  try {
    const res = await fetch('/api/admin/stats/hourly?hours=168');
    const data = await res.json();
    // Group by hour, sum
    const byHour = {};
    data.forEach(r => {
      const hour = r.hour_bucket.slice(0, 13);
      if (!byHour[hour]) byHour[hour] = { cost: 0, tokens: 0 };
      byHour[hour].cost += (r.cost || 0);
      byHour[hour].tokens += (r.total_tokens || 0);
    });
    const hours = Object.keys(byHour).sort();
    const costs = hours.map(h => byHour[h].cost);
    const tokens = hours.map(h => byHour[h].tokens);
    const labels = hours.map(h => h.slice(11, 16) + ' ' + h.slice(5, 10));
    if (costChart) {
      costChart.data.labels = labels;
      costChart.data.datasets[0].data = costs;
      costChart.update();
    }
    if (tokenChart) {
      tokenChart.data.labels = labels;
      tokenChart.data.datasets[0].data = tokens;
      tokenChart.update();
    }
  } catch (e) { console.error(e); }
 }
 // ── Helpers ──
 function fmt(n) { return (n || 0).toLocaleString(); }
 function h(s) { const d=document.createElement('div'); d.textContent=s||''; return d.innerHTML; }
 function formatDuration(s) {
  const d = Math.floor(s / 86400);
  const h = Math.floor((s % 86400) / 3600);
  const m = Math.floor((s % 3600) / 60);
  const parts = [];
  if (d) parts.push(d + 'd');
  if (h) parts.push(h + 'h');
  if (m || !parts.length) parts.push(m + 'm');
  return parts.join(' ');
 }
 // Initial load
 document.addEventListener('DOMContentLoaded', () => {
  // Ensure chart containers exist
  if (!document.getElementById('cost-chart')) {
    const chartsDiv = document.getElementById('charts');
    if (chartsDiv) {
      chartsDiv.innerHTML = `
        <div class="chart-card"><h3>Cost Over Time</h3><canvas id="cost-chart"></canvas></div>
        <div class="chart-card"><h3>Token Usage Over Time</h3><canvas id="token-chart"></canvas></div>`;
    }
  }
  initCharts();
  loadChartData();
 });
 </script>
 </body>
 </html>
@@ -1,31 +0,0 @@
 # NVIDIA Sidecar environment variable reference (BIZ-46 Phase3 §4)
 # Copy to .env and adjust as needed; used by Docker / systemd.
 # Network
 SIDECAR_HOST=127.0.0.1
 SIDECAR_PORT=9190
 SIDECAR_METRICS_PORT=9191
 # Upstream API (required)
 SIDECAR_UPSTREAM=https://integrate.api.nvidia.com/v1
 SIDECAR_API_KEY=nvapi-your-key-here
 # Rate limiting
 SIDECAR_RATE_RPM=40
 SIDECAR_BUCKET_CAPACITY=40
 # Timeout
 SIDECAR_TIMEOUT=60
 # Queue
 SIDECAR_QUEUE_MAX=500
 SIDECAR_LOW_TIMEOUT=2
 # Fallback
 SIDECAR_FALLBACK_PASSTHROUGH=true
 # Logging
 SIDECAR_LOG_LEVEL=INFO
 # Admin API authentication (optional; authentication is skipped when unset)
 # SIDECAR_ADMIN_TOKEN=your-admin-token-here
@@ -0,0 +1,90 @@
 # Sidecar V2 — API Key Encryption Rotation SOP
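 > Version: v1.0 | Maintainer: 严维序 (opengineer)
 ## Background
 Sidecar V2 stores every provider's API key encrypted with AES-256-GCM. The encryption key is supplied through the `SIDECAR_ENCRYPTION_KEY` environment variable and initialized at startup by `init_crypto()`.
 ## ⚠️ Critical warning
 **Changing SIDECAR_ENCRYPTION_KEY makes every stored API key undecryptable unless the old key is restored!**
 When the key changes, `try_decrypt_existing()` in `crypto.py` silently returns `None`, and the existing encrypted data can no longer be decrypted. Follow the steps below whenever you rotate the key.
 ## Safe rotation procedure
 ### Step 1: Inventory providers and collect their plaintext API keys (required)
 ```bash
 # With the sidecar still running on the old key, list the providers through the admin API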
 curl -s -H "Authorization: Bearer <ADMIN_TOKEN>" \
  http://127.0.0.1:9190/api/admin/backends | \
  python3 -c "
 import json, sys
 data = json.load(sys.stdin)
 # Note: api_key is masked in this response; retrieve the original keys again from a secure channel
 print(json.dumps(data, indent=2))
 "
 ```
 ### Step 2: Stop the service
 ```bash
 systemctl stop sidecar-v2
 # or
 docker compose down
 ```
 ### Step 3: Back up the database
 ```bash
 cp /app/data/sidecar_v2.db /app/data/backups/pre-rotation-$(date +%Y%m%d_%H%M%S).db
 ```
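A plain `cp` can capture a database file mid-write if any process still has it open. As a hedged alternative, the sketch below copies the database with the online backup API from Python's standard `sqlite3` module and then runs `PRAGMA integrity_check` on the copy. It assumes `sidecar_v2.db` is an ordinary SQLite file; the function name `backup_and_verify` is hypothetical and not part of Sidecar V2.

```python
import sqlite3
from pathlib import Path


def backup_and_verify(src_path: str, dest_path: str) -> bool:
    """Copy a SQLite database and report whether the copy passes integrity_check.

    Hypothetical helper for illustration only.
    """
    # sqlite3.connect() would silently create an empty file for a wrong path,
    # so fail early when the source does not exist.
    if not Path(src_path).is_file():
        raise FileNotFoundError(src_path)
    Path(dest_path).parent.mkdir(parents=True, exist_ok=True)

    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Connection.backup() takes a consistent snapshot of the source database.
        src.backup(dest)
        row = dest.execute("PRAGMA integrity_check").fetchone()
        return row is not None and row[0] == "ok"
    finally:
        dest.close()
        src.close()
```

Calling `backup_and_verify("/app/data/sidecar_v2.db", "/app/data/backups/pre-rotation.db")` and confirming it returns `True` gives some assurance that the rollback path in the emergency plan has a usable backup before the key is changed.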
 ### Step 4: Update the key
 Update `SIDECAR_ENCRYPTION_KEY` in `/etc/sidecar-v2/env` or in the Docker `.env` file:
 ```
 SIDECAR_ENCRYPTION_KEY=<new_64_hex_char_key>
 ```
 Generate a new key:
 ```bash
 python3 -c "import secrets; print(secrets.token_hex(32))"
 ```
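Before writing the new value into the env file, it can help to check that it really has the `<new_64_hex_char_key>` shape, since a truncated paste would only surface at startup. The sketch below is a minimal format check under that assumption (64 hex characters, that is 32 bytes for AES-256). Both function names are hypothetical, and whether `init_crypto()` applies the same rule is not confirmed by this document.

```python
import secrets
import string


def generate_encryption_key() -> str:
    """Return 32 random bytes as 64 hex characters (same call as the one-liner above)."""
    return secrets.token_hex(32)


def is_valid_encryption_key(value: str) -> bool:
    """Return True if value is exactly 64 hexadecimal characters.

    Hypothetical check: the real validation inside init_crypto() may differ.
    """
    return len(value) == 64 and all(ch in string.hexdigits for ch in value)
```

A deployment script could call `is_valid_encryption_key(os.environ["SIDECAR_ENCRYPTION_KEY"])` and abort before starting the service when it returns `False`.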
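 ### Step 5: Clear the encrypted keys and re-enter them
 Because the old encrypted data is unreadable after the key change:
 1. Start the service (at this point no existing provider has a usable API key).
 2. Re-enter the API key for every provider through the Admin API: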
 ```bash
 curl -s -X PUT -H "Authorization: Bearer <ADMIN_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"api_key": "<NEW_PLAIN_KEY>"}' \
  http://127.0.0.1:9190/api/admin/backends/<backend_id>
 ```
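With more than a handful of providers, repeating the `curl` call by hand is error-prone. The sketch below builds the same `PUT` request with Python's standard `urllib.request`, mirroring the URL, headers, and body of the command above. The function name `build_key_update_request` and its parameters are hypothetical; nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request


def build_key_update_request(
    base_url: str, backend_id: str, api_key: str, admin_token: str
) -> urllib.request.Request:
    """Build (but do not send) the PUT request that re-enters one provider's API key.

    Hypothetical helper mirroring the curl command in this step.
    """
    body = json.dumps({"api_key": api_key}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/admin/backends/{backend_id}",
        data=body,
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )
```

To apply it, loop over the backend IDs collected in Step 1 and call `urllib.request.urlopen(build_key_update_request(...))` for each one, reading the plaintext keys from a secure source rather than from shell history.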
 ### Step 6: Verify
 ```bash
 # Confirm that the providers report a healthy status
 curl -s -H "Authorization: Bearer <ADMIN_TOKEN>" http://127.0.0.1:9190/api/admin/pools
 # Send a test request
 curl -s -X POST http://127.0.0.1:9190/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"<model_name>","messages":[{"role":"user","content":"test"}],"max_tokens":5}'
 ```
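The pool check can also be automated. The sketch below assumes `/api/admin/pools` returns the mapping built by `PoolManager.get_pool_status()` in this change set, that is, pool name mapped to counts with `total`, `enabled`, `healthy`, `cooling`, and `error` keys. That response shape is an assumption, and the function name is hypothetical.

```python
def pools_without_healthy_backends(pool_status: dict) -> list[str]:
    """Return the names of pools that currently have zero healthy backends.

    Assumes pool_status looks like:
        {"primary": {"total": 2, "enabled": 2, "healthy": 1, "cooling": 1, "error": 0}, ...}
    """
    return sorted(
        name
        for name, counts in pool_status.items()
        if counts.get("healthy", 0) == 0
    )
```

An empty result after Step 5 suggests every pool has at least one backend whose re-entered key is working; a non-empty result names the pools to re-check.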
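 ## Emergency plan
 If something goes wrong during the key rotation:
 1. Restore the old key in the environment variable.
 2. Restore the database backup taken in Step 3.
 3. Restart the service.
 The old API keys work again because the restored data is still encrypted with the old encryption key.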
@@ -0,0 +1,56 @@
 # Sidecar V2 — Nginx reverse proxy config (reference)
 # Place at /etc/nginx/sites-available/sidecar-v2.conf
 # SSL certs managed by certbot or manually
 upstream sidecar_v2_main {
    server 127.0.0.1:9190;
 }
 upstream sidecar_v2_metrics {
    server 127.0.0.1:9191;
 }
 server {
    listen 443 ssl http2;
    server_name sidecar.example.com;
    ssl_certificate     /etc/ssl/certs/sidecar.pem;
    ssl_certificate_key /etc/ssl/private/sidecar.key;
    # Dashboard + Admin API (main port)
    location / {
        proxy_pass http://sidecar_v2_main;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
    # SSE support for dashboard real-time data
    location /dashboard/sse {
        proxy_pass http://sidecar_v2_main;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_buffering off;
        proxy_cache off;
        chunked_transfer_encoding off;
        proxy_read_timeout 86400s;
    }
    # Prometheus metrics
    location /metrics {
        proxy_pass http://sidecar_v2_metrics;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
    }
    # Health check
    location /health {
        proxy_pass http://sidecar_v2_main;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
    }
 }
@@ -1,49 +0,0 @@
 # NVIDIA Sidecar rate-limiting proxy: systemd service (BIZ-46 Phase3 §4)
 #
 # Install:
 #   sudo cp deploy/nvidia-sidecar.service /etc/systemd/system/
 #   sudo systemctl daemon-reload
 #   sudo systemctl enable nvidia-sidecar
 #   sudo systemctl start nvidia-sidecar
 #
 # Operations:
 #   sudo systemctl status nvidia-sidecar
 #   sudo journalctl -u nvidia-sidecar -f
 [Unit]
 Description=NVIDIA Sidecar Rate-Limiting Proxy
 Documentation=https://github.com/bizwings/nvidia-sidecar
 After=network-online.target
 Wants=network-online.target
 [Service]
 Type=simple
 User=sidecar
 Group=sidecar
 WorkingDirectory=/opt/nvidia-sidecar
 ExecStart=/opt/nvidia-sidecar/.venv/bin/uvicorn nvidia_sidecar.server:app \
    --host 127.0.0.1 \
    --port 9190 \
    --log-level info
 Restart=always
 RestartSec=5
 # Environment variables
 EnvironmentFile=/opt/nvidia-sidecar/.env
 # Security hardening
 NoNewPrivileges=true
 ProtectSystem=strict
 ProtectHome=true
 PrivateTmp=true
 ReadWritePaths=/opt/nvidia-sidecar/logs
 # Resource limits
 LimitNOFILE=65536
 MemoryMax=512M
 # Startup delay (wait for the network to be ready)
 ExecStartPre=/bin/sleep 1
 [Install]
 WantedBy=multi-user.target
@@ -0,0 +1,23 @@
 [Unit]
 Description=Sidecar V2 — Multi-Pool Provider Proxy
 After=network.target
 [Service]
 Type=simple
 User=openclaw
 Group=openclaw
 WorkingDirectory=/opt/sidecar-v2
 EnvironmentFile=/etc/sidecar-v2/env
 ExecStart=/opt/sidecar-v2/.venv/bin/python3 main.py
 Restart=always
 RestartSec=5
 # Security hardening
 NoNewPrivileges=yes
 ProtectSystem=strict
 ProtectHome=yes
 ReadWritePaths=/opt/sidecar-v2/data
 PrivateTmp=yes
 [Install]
 WantedBy=multi-user.target
@@ -0,0 +1,26 @@
 # Sidecar V2 — Multi-Pool Provider Proxy
 version: "3.9"
 services:
  sidecar-v2:
    build: .
    container_name: sidecar-v2
    restart: unless-stopped
    ports:
      - "9190:9190"  # Main proxy + admin API + dashboard
      - "9191:9191"  # Prometheus metrics
    environment:
      - SIDECAR_ENCRYPTION_KEY=${SIDECAR_ENCRYPTION_KEY}
      - SIDECAR_ADMIN_TOKEN=${SIDECAR_ADMIN_TOKEN:-change-me}
      - LOG_FORMAT=${LOG_FORMAT:-json}
      - SIDECAR_HOST=0.0.0.0
      - SIDECAR_PORT=9190
      - SIDECAR_METRICS_PORT=9191
      - SIDECAR_DB_PATH=/app/data/sidecar_v2.db
      - SIDECAR_BACKUP_DIR=/app/data/backups
    volumes:
      - sidecar-data:/app/data
 volumes:
  sidecar-data:
    driver: local
@@ -1,198 +0,0 @@
 """
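 NVIDIA Sidecar rate-limiting proxy: health check endpoints (§3.6)
 Provides Kubernetes / systemd compatible health checks:
    GET /health       - liveness check
    GET /health/ready - readiness check (including upstream connectivity)
 BIZ-46 Phase3: readiness HTTP client reuse. The main http_client is injected
 instead of creating a new client per check, which lowers connection overhead under frequent K8s/systemd probing.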
 """
 from __future__ import annotations
 import time
 from dataclasses import dataclass
 from typing import Any
 import httpx
@dataclass
 class HealthService:
    """Health check service.
    Encapsulates the liveness and readiness check logic called by the server.py routes.
    """
    start_time: float = 0.0
    version: str = "0.1.0"
    def __post_init__(self) -> None:
        if self.start_time == 0.0:
            self.start_time = time.time()
    @property
    def uptime_seconds(self) -> float:
        """Service uptime in seconds."""
        return time.time() - self.start_time
    async def check_upstream(
        self,
        upstream_url: str,
        http_client: httpx.AsyncClient,
        timeout: float = 5.0,
        api_key: str = "",
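    ) -> bool:
        """Check upstream connectivity, reusing the injected http_client (BIZ-46 Phase3).
        Args:
            upstream_url: NVIDIA API base URL.
            http_client: Shared httpx.AsyncClient (from ctx).
            timeout: Timeout in seconds (per-request override).
            api_key: Optional API key used for authentication.
        Returns:
            True if the upstream is reachable (any response with status < 500).
        """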
        try:
            headers: dict[str, str] = {}
            if api_key:
                headers["authorization"] = f"Bearer {api_key}"
            resp = await http_client.get(
                f"{upstream_url.rstrip('/')}/v1/models",
                headers=headers,
                timeout=timeout,
            )
            return resp.status_code < 500
        except Exception:
            return False
    def check_queue_healthy(
        self,
        current_size: int,
        max_size: int,
        threshold_ratio: float = 0.9,
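    ) -> bool:
        """Check whether the queue is healthy (not close to full).
        Args:
            current_size: Current queue length.
            max_size: Maximum queue capacity.
            threshold_ratio: Alert threshold as a fraction of capacity, default 0.9.
        Returns:
            True if the queue is healthy.
        """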
        if max_size <= 0:
            return True
        return current_size < max_size * threshold_ratio
    def check_token_bucket_healthy(
        self,
        available_tokens: float,
        capacity: int,
        threshold: float = 0.05,
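    ) -> bool:
        """Check whether the token bucket is healthy (tokens not exhausted).
        Args:
            available_tokens: Tokens currently available.
            capacity: Bucket capacity.
            threshold: Fraction of capacity at or below which the bucket counts as unhealthy.
        Returns:
            True if the token bucket is healthy.
        """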
        if capacity <= 0:
            return False
        return available_tokens > capacity * threshold
    def liveness(self) -> dict[str, Any]:
        """Liveness check response.
        Returns:
            Liveness JSON payload.
        """
        return {
            "status": "ok",
            "uptime": round(self.uptime_seconds, 1),
            "version": self.version,
        }
    async def readiness(
        self,
        upstream_url: str,
        upstream_api_key: str = "",
        queue_current_size: int = 0,
        queue_max_size: int = 500,
        available_tokens: float = 0.0,
        bucket_capacity: int = 40,
        http_client: httpx.AsyncClient | None = None,
    ) -> dict[str, Any]:
        """Readiness check response.
        Args:
            upstream_url: Upstream API address.
            upstream_api_key: API key.
            queue_current_size: Current queue length.
            queue_max_size: Maximum queue capacity.
            available_tokens: Tokens currently available.
            bucket_capacity: Bucket capacity.
            http_client: Shared httpx.AsyncClient (BIZ-46 Phase3).
                When None, falls back to creating a new client per call (for older callers).
        Returns:
            Readiness JSON payload.
        """
        if http_client is not None:
            upstream_ok = await self.check_upstream(
                upstream_url, http_client=http_client, api_key=upstream_api_key,
            )
        else:
            # Backward compatible: keep the old behavior when no http_client is given
            upstream_ok = await self.check_upstream_standalone(
                upstream_url, api_key=upstream_api_key,
            )
        queue_ok = self.check_queue_healthy(queue_current_size, queue_max_size)
        token_ok = self.check_token_bucket_healthy(available_tokens, bucket_capacity)
        all_ready = upstream_ok and queue_ok and token_ok
        return {
            "ready": all_ready,
            "upstream_reachable": upstream_ok,
            "queue_healthy": queue_ok,
            "token_bucket_healthy": token_ok,
        }
    async def check_upstream_standalone(
        self,
        upstream_url: str,
        timeout: float = 5.0,
        api_key: str = "",
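    ) -> bool:
        """Check upstream connectivity standalone (backward compatible; creates a new client per call).
        Args:
            upstream_url: NVIDIA API base URL.
            timeout: Timeout in seconds.
            api_key: Optional API key.
        Returns:
            True if the upstream is reachable (any response with status < 500).
        """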
        try:
            headers: dict[str, str] = {}
            if api_key:
                headers["authorization"] = f"Bearer {api_key}"
            async with httpx.AsyncClient(timeout=timeout) as client:
                resp = await client.get(
                    f"{upstream_url.rstrip('/')}/v1/models",
                    headers=headers,
                )
                return resp.status_code < 500
        except Exception:
            return False
@@ -0,0 +1,17 @@
 """Sidecar V2 entry point."""
 import uvicorn
 from config import config
 def main():
    uvicorn.run(
        "server:app",
        host=config.host,
        port=config.port,
        log_level=config.log_level.lower(),
    )
 if __name__ == "__main__":
    main()
@@ -1,277 +0,0 @@
 """
 NVIDIA Sidecar rate-limiting proxy: Prometheus metrics endpoint (§3.5)
 10 metrics on a dedicated port :9191, separate from the proxy port :9190.
 BIZ-46 Phase3: Prometheus label cardinality control. The model_id label is collapsed into provider.
 - upstream_latency_seconds: model_id → provider (fixed value "nvidia", cardinality = 1)
 - upstream_errors_total: model_id → provider
 - Model-level detail moves to structlog JSON logs
 """
 from __future__ import annotations
 import time
 import threading
 from typing import Any
 from prometheus_client import (
    CollectorRegistry,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
    make_asgi_app,
 )
 class PrometheusMetrics:
    """Prometheus metrics collector for the sidecar.
    Thread-safe: the recording and update methods are guarded by a ``threading.Lock``.
    """
    def __init__(self, registry: CollectorRegistry | None = None) -> None:
        """Initialize the 10 base Prometheus metrics plus the retreat-mode gauges.
        Args:
            registry: Optional custom registry; when None, a new private
                CollectorRegistry is created (the global default registry is not used).
        """
        self._lock: threading.Lock = threading.Lock()
        self._start_time: float = time.time()
        # ---- 1. Total requests (grouped by priority and status) ----
        self.requests_total: Counter = Counter(
            "sidecar_requests_total",
            "Total requests processed by priority and status",
            labelnames=["priority", "status"],
            registry=self._registry,
        )
        # ---- 2. Available tokens ----
        self.tokens_available: Gauge = Gauge(
            "sidecar_tokens_available",
            "Current number of available tokens",
            registry=self._registry,
        )
        # ---- 3. Token generation rate ----
        self.tokens_rate: Gauge = Gauge(
            "sidecar_tokens_rate",
            "Current token generation rate (tokens per minute)",
            registry=self._registry,
        )
        # ---- 4. Queue depth per priority ----
        self.queue_depth: Gauge = Gauge(
            "sidecar_queue_depth",
            "Queue depth by priority",
            labelnames=["priority"],
            registry=self._registry,
        )
        # ---- 5. Queue wait time histogram ----
        self.queue_latency_seconds: Histogram = Histogram(
            "sidecar_queue_latency_seconds",
            "Request wait time in queue in seconds",
            labelnames=["priority"],
            buckets=(0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0),
            registry=self._registry,
        )
        # ---- 6. Upstream response latency histogram (label collapsed: model_id → provider) ----
        self.upstream_latency_seconds: Histogram = Histogram(
            "sidecar_upstream_latency_seconds",
            "Upstream response latency in seconds",
            labelnames=["provider"],  # BIZ-46: was ["model_id"], converged to fixed-cardinality provider
            buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0, 300.0, 600.0),
            registry=self._registry,
        )
        # ---- 7. Upstream error count (label collapsed: model_id → provider) ----
        self.upstream_errors_total: Counter = Counter(
            "sidecar_upstream_errors_total",
            "Upstream error count by status code and provider",
            labelnames=["status_code", "provider"],  # BIZ-46: was ["model_id"], converged
            registry=self._registry,
        )
        # ---- 8. Fallback passthrough count ----
        self.fallback_passthrough_total: Counter = Counter(
            "sidecar_fallback_passthrough_total",
            "Total fallback / passthrough events (queue full or sidecar unavailable)",
            registry=self._registry,
        )
        # ---- 9. Health status ----
        self.health_status: Gauge = Gauge(
            "sidecar_health_status",
            "Sidecar health: 0=unhealthy, 1=healthy",
            registry=self._registry,
        )
        # ---- 10. Uptime ----
        self.uptime_seconds: Gauge = Gauge(
            "sidecar_uptime_seconds",
            "Process uptime in seconds",
            registry=self._registry,
        )
        # Retreat-mode metrics (additional; not counted among the 10 base metrics)
        self.retreat_state: Gauge = Gauge(
            "sidecar_retreat_state",
            "Adaptive retreat state: 0=NORMAL, 1=RETREAT, 2=RECOVER",
            registry=self._registry,
        )
        self.effective_rate_rpm: Gauge = Gauge(
            "sidecar_effective_rate_rpm",
            "Current effective rate in RPM (after retreat adjustments)",
            registry=self._registry,
        )
        self.upstream_429_rate: Gauge = Gauge(
            "sidecar_upstream_429_rate",
            "Upstream 429 rate over the retreat observation window (0.0-1.0)",
            registry=self._registry,
        )
        # Initial state
        self.health_status.set(1)
    # ---- ASGI app construction ----
    def build_asgi_app(self) -> Any:
        """Build the Prometheus ASGI application, to be served on the dedicated metrics port.
        Returns:
            An ASGI app that can be passed to uvicorn.
        """
        return make_asgi_app(registry=self._registry)
    # ---- Metric recording methods ----
    def record_request(self, priority: str, status: str) -> None:
        """Record one request.
        Args:
            priority: Priority name (URGENT / HIGH / NORMAL / LOW).
            status: Outcome (success / ratelimited / error).
        """
        with self._lock:
            self.requests_total.labels(priority=priority, status=status).inc()
    def record_queue_latency(self, priority: str, seconds: float) -> None:
        """Record queueing latency.
        Args:
            priority: Priority name.
            seconds: Seconds spent waiting in the queue.
        """
        with self._lock:
            self.queue_latency_seconds.labels(priority=priority).observe(seconds)
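    def record_upstream(
        self, status_code: int, provider: str, seconds: float | None = None
    ) -> None:
        """Record an upstream response (provider replaces model_id as the label, BIZ-46 Phase3).
        A latency sample is recorded only when the caller passes a measured value;
        observing a placeholder 0.0 would skew the histogram toward its lowest bucket.
        Args:
            status_code: HTTP status code (unused here; errors are counted by record_upstream_error).
            provider: Upstream provider identifier (fixed at "nvidia").
            seconds: Measured upstream latency in seconds, or None to record nothing.
        """
        if seconds is None:
            return
        with self._lock:
            self.upstream_latency_seconds.labels(provider=provider).observe(seconds)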
    def record_upstream_error(self, status_code: int, provider: str) -> None:
        """Record an upstream error (provider replaces model_id as the label, BIZ-46 Phase3).
        Args:
            status_code: HTTP status code of the error.
            provider: Upstream provider identifier (fixed at "nvidia").
        """
        with self._lock:
            self.upstream_errors_total.labels(
                status_code=str(status_code), provider=provider
            ).inc()
    def record_upstream_latency(self, provider: str, seconds: float) -> None:
        """Record upstream response latency (provider replaces model_id as the label, BIZ-46 Phase3).
        Args:
            provider: Upstream provider identifier (fixed at "nvidia").
            seconds: Response latency in seconds.
        """
        with self._lock:
            self.upstream_latency_seconds.labels(provider=provider).observe(seconds)
    def update_token_status(self, tokens: float, rate_per_minute: float) -> None:
        """Update the token bucket gauges.
        Args:
            tokens: Tokens currently available.
            rate_per_minute: Generation rate per minute.
        """
        with self._lock:
            self.tokens_available.set(tokens)
            self.tokens_rate.set(rate_per_minute)
    def update_queue_depth(self, depths: dict[str, int]) -> None:
        """Update the queue depth for each priority.
        Args:
            depths: Mapping of {priority_name: count}.
        """
        with self._lock:
            # Set every known priority label, defaulting to 0, so stale values do not linger
            for pri in ("URGENT", "HIGH", "NORMAL", "LOW"):
                self.queue_depth.labels(priority=pri).set(depths.get(pri, 0))
    def increment_fallback(self) -> None:
        """Increment the fallback passthrough counter by 1."""
        with self._lock:
            self.fallback_passthrough_total.inc()
    def set_health(self, healthy: bool) -> None:
        """Set the health status.
        Args:
            healthy: True = healthy, False = unhealthy.
        """
        with self._lock:
            self.health_status.set(1 if healthy else 0)
    def update_uptime(self) -> None:
        """Update the uptime gauge."""
        with self._lock:
            self.uptime_seconds.set(time.time() - self._start_time)
    # ---- Retreat-mode metrics ----
    def update_retreat_metrics(
        self,
        retreat_state: str,
        effective_rate_rpm: float,
        upstream_429_rate: float,
    ) -> None:
        """Update the retreat-mode metrics.
        Args:
            retreat_state: "normal" / "retreat" / "recover".
            effective_rate_rpm: Current effective rate (RPM).
            upstream_429_rate: Upstream 429 rate (0.0-1.0).
        """
        state_map: dict[str, int] = {"normal": 0, "retreat": 1, "recover": 2}
        with self._lock:
            self.retreat_state.set(state_map.get(retreat_state, 0))
            self.effective_rate_rpm.set(effective_rate_rpm)
            self.upstream_429_rate.set(upstream_429_rate)
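    # ---- Export ----
    def generate_latest(self) -> bytes:
        """Render the metrics in the Prometheus text exposition format.
        Returns:
            Metrics as bytes in the Prometheus text format.
        """
        # update_uptime() acquires self._lock itself. threading.Lock is not
        # reentrant, so calling it while already holding the lock would deadlock.
        self.update_uptime()
        return generate_latest(self._registry)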
@@ -0,0 +1,83 @@
 """Provider pool management: primary / fallback pool routing."""
 import structlog
 from typing import Optional
 from storage.backend_store import list_backends, get_pool_stats
 from storage.models import Backend
 logger = structlog.get_logger("sidecar_v2.pool_manager")
 class PoolManager:
    """Manages provider pools and selects healthy backends for a given model.
    Priority: primary pool → fallback pool.
    Within a pool: healthy backends only, in the order returned by list_backends().
    """
    def __init__(self):
        self._pool_order = ["primary", "fallback"]
    def get_available_backends(
        self, canonical_model: str, pool: Optional[str] = None
    ) -> list[Backend]:
        """Get all healthy, enabled backends that serve a model, in pool order.
        Args:
            canonical_model: Canonical model name to match.
            pool: Optional pool filter (primary/fallback). None = all pools.
        Returns:
            List of ready backends ordered by pool priority, then by list_backends() order
            (no additional sorting is applied here).
        """
        backends: list[Backend] = []
        pools_to_check = [pool] if pool else self._pool_order
        for p in pools_to_check:
            pool_backends = list_backends(pool=p, enabled_only=True, decrypt_key=True)
            for b in pool_backends:
                if b.status == "healthy" and b.has_model(canonical_model):
                    backends.append(b)
            if pool:
                break
        return backends
    def get_any_healthy_backends(self, pool: Optional[str] = None) -> list[Backend]:
        """Get all healthy, enabled backends regardless of model."""
        backends: list[Backend] = []
        pools_to_check = [pool] if pool else self._pool_order
        for p in pools_to_check:
            pool_backends = list_backends(pool=p, enabled_only=True, decrypt_key=True)
            for b in pool_backends:
                if b.status == "healthy":
                    backends.append(b)
            if pool:
                break
        return backends
    def get_pool_status(self) -> dict:
        """Get pool summary for dashboard."""
        stats = get_pool_stats()
        result = {}
        for pool in self._pool_order:
            s = stats.get(pool, {"total": 0, "enabled": 0, "healthy": 0, "cooling": 0, "error": 0})
            result[pool] = s
        # Also include any other pools
        for pool, s in stats.items():
            if pool not in result:
                result[pool] = s
        return result
    def is_pool_available(self, canonical_model: str, pool: str = "primary") -> bool:
        """Check if a pool has any healthy backends for a model."""
        backends = self.get_available_backends(canonical_model, pool=pool)
        return len(backends) > 0
    def is_any_pool_available(self, canonical_model: str) -> bool:
        """Check if any pool has healthy backends for a model."""
        for pool in self._pool_order:
            if self.is_pool_available(canonical_model, pool):
                return True
        return False
@@ -1,253 +0,0 @@
 """
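 NVIDIA Sidecar rate-limiting proxy: four-level priority request queue (§3.3)
 Holds pending NVIDIA API requests and dequeues them by priority, then FIFO.
 Supports three queue-full policies: PASSTHROUGH / REJECT / DROP_LOWEST.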
 """
 from __future__ import annotations
 import asyncio
 import heapq
 import time
 import uuid
 from dataclasses import dataclass, field
 from enum import Enum
 from typing import Any
 from nvidia_sidecar.rate_limiter import Priority
 # ---------------------------------------------------------------------------
 # Queue-full policy
 # ---------------------------------------------------------------------------
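 class QueueFullPolicy(str, Enum):
    """Policy applied when the queue is full."""
    PASSTHROUGH = "passthrough"   # Bypass the queue and go straight upstream (fail-open sub-policy)
    REJECT = "reject"             # Return 503 Service Unavailable
    DROP_LOWEST = "drop_lowest"   # Drop the lowest-priority queued item and insert the new request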
 # ---------------------------------------------------------------------------
 # Queue item
 # ---------------------------------------------------------------------------
@dataclass(order=True)
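 class PriorityQueueItem:
    """Priority queue item.
    ``sort_index`` is the tuple ``(priority, timestamp)``. It is the only field with
    ``compare=True``, so the generated ``__lt__`` orders items by priority first and by timestamp second.
    Lower values win (URGENT=1 beats HIGH=2).
    """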
    sort_index: tuple[int, float] = field(compare=True)
    priority: Priority = field(compare=False)
    request_id: str = field(compare=False)
    payload: dict[str, Any] = field(compare=False)
    enqueued_at: float = field(compare=False)
    headers: dict[str, str] = field(default_factory=dict, compare=False)
 # ---------------------------------------------------------------------------
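 # Priority request queue
 # ---------------------------------------------------------------------------
 class QueueFullError(Exception):
    """Raised when the queue is full and the policy is REJECT."""
    pass
 class QueueFullPassthrough(Exception):
    """Raised when the queue is full and the policy is PASSTHROUGH; the caller bypasses the queue and forwards upstream."""
    pass
 class PriorityRequestQueue:
    """Coroutine-safe four-level priority request queue.
    An ``asyncio.Lock`` guards concurrent operations (safe across tasks on one
    event loop, not across OS threads); blocking dequeue is built on ``heapq``
    + ``asyncio.Event``.
    """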
    def __init__(self, max_size: int = 500) -> None:
        """Initialize the priority queue.
        Args:
            max_size: Maximum queue capacity.
        Raises:
            ValueError: max_size <= 0.
        """
        if max_size <= 0:
            raise ValueError(f"max_size must be a positive integer, got: {max_size}")
        self.max_size: int = max_size
        self._heap: list[PriorityQueueItem] = []
        self._lock: asyncio.Lock = asyncio.Lock()
        self._not_empty: asyncio.Event = asyncio.Event()
        self._full_policy: QueueFullPolicy = QueueFullPolicy.PASSTHROUGH
        # Statistics
        self._total_enqueued: int = 0
        self._total_dequeued: int = 0
        self._total_dropped: int = 0
    # ---- Queue-full policy ----
    def set_full_policy(self, policy: QueueFullPolicy) -> None:
        """Set the policy applied when the queue is full.
        Args:
            policy: A QueueFullPolicy enum value.
        """
        self._full_policy = policy
    @property
    def full_policy(self) -> QueueFullPolicy:
        """Current queue-full policy."""
        return self._full_policy
    # ---- Dynamic capacity adjustment ----
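    def set_max_size(self, new_size: int) -> tuple[bool, str]:
        """Adjust the maximum queue capacity at runtime (hot reload).
        Shrinking is guarded: if new_size is smaller than the number of items
        currently queued, the change is rejected and the current depth is reported.
        Args:
            new_size: New maximum capacity.
        Returns:
            (success flag, message). On success the flag is True and the message
            compares the old and new capacity; on failure the flag is False and
            the message gives the rejection reason and the current depth.
        Raises:
            ValueError: new_size <= 0.
        """
        if new_size <= 0:
            raise ValueError(f"max_size must be a positive integer, got: {new_size}")
        current = len(self._heap)
        if new_size < current:
            return (False, f"Shrink rejected: new limit {new_size} < current queue depth {current}; drain the queue or raise the limit first")
        old = self.max_size
        self.max_size = new_size
        return (True, f"Queue limit adjusted: {old} → {new_size}{' (currently queued ' + str(current) + ')' if current > 0 else ''}")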
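    # ---- Enqueue ----
    async def put(
        self,
        item: dict[str, Any],
        priority: Priority = Priority.NORMAL,
        headers: dict[str, str] | None = None,
    ) -> str:
        """Put a request into the queue.
        Args:
            item: Request body (a JSON-serializable dict).
            priority: Request priority; defaults to NORMAL.
            headers: Original request headers.
        Returns:
            The unique request_id assigned to the request.
        Raises:
            QueueFullError: The queue is full and the policy is REJECT.
            QueueFullPassthrough: The queue is full and the policy is PASSTHROUGH.
        """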
        request_id = str(uuid.uuid4())
        headers = headers or {}
        queue_item = PriorityQueueItem(
            sort_index=(int(priority), time.monotonic()),
            priority=priority,
            request_id=request_id,
            payload=item,
            enqueued_at=time.monotonic(),
            headers=headers,
        )
        async with self._lock:
            queue_size = len(self._heap)
            if queue_size >= self.max_size:
                if self._full_policy == QueueFullPolicy.REJECT:
                    raise QueueFullError(
                        f"Queue full ({queue_size}/{self.max_size}), policy: reject"
                    )
                elif self._full_policy == QueueFullPolicy.DROP_LOWEST:
                    # Drop the lowest-priority item in the heap (largest sort_index).
                    # The heap is a min-heap, so finding the maximum needs a full scan.
                    max_val_item = max(self._heap, key=lambda x: x.sort_index)
                    self._heap.remove(max_val_item)
                    heapq.heapify(self._heap)
                    self._total_dropped += 1
                # PASSTHROUGH policy: do not enqueue; raise so the caller bypasses the queue
                else:
                    raise QueueFullPassthrough(
                        f"Queue full ({queue_size}/{self.max_size}), policy: passthrough"
                    )
            heapq.heappush(self._heap, queue_item)
            self._total_enqueued += 1
        self._not_empty.set()
        return request_id
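    # ---- Dequeue ----
    async def get(self, timeout: float = 1.0) -> PriorityQueueItem | None:
        """Take the next item from the queue (blocking, ordered by priority).
        Args:
            timeout: Maximum number of seconds to block; defaults to 1.0.
        Returns:
            The highest-priority item, or None if the timeout expires with the queue empty.
        """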
        deadline = time.monotonic() + timeout
        while True:
            async with self._lock:
                if self._heap:
                    item = heapq.heappop(self._heap)
                    self._total_dequeued += 1
                    if not self._heap:
                        self._not_empty.clear()
                    return item
            # Queue is empty; wait for a new item to be enqueued
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None
            try:
                await asyncio.wait_for(
                    self._not_empty.wait(),
                    timeout=remaining,
                )
            except asyncio.TimeoutError:
                return None
    # ---- Status queries ----
    async def get_queue_size(self) -> int:
        """Return the current queue length."""
        async with self._lock:
            return len(self._heap)
    async def get_stats(self) -> dict[str, Any]:
        """Return queue statistics."""
        async with self._lock:
            depth_by_priority: dict[str, int] = {}
            for item in self._heap:
                key = item.priority.name
                depth_by_priority[key] = depth_by_priority.get(key, 0) + 1
            return {
                "max_size": self.max_size,
                "current_size": len(self._heap),
                "total_enqueued": self._total_enqueued,
                "total_dequeued": self._total_dequeued,
                "total_dropped": self._total_dropped,
                "depth_by_priority": depth_by_priority,
                "full_policy": self._full_policy.value,
                "utilization": len(self._heap) / self.max_size if self.max_size > 0 else 0.0,
            }
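The ordering rule used by the removed queue module (priority first, then arrival order) can be sketched in isolation. This is a minimal illustration rather than the module's code: `Item`, `push`, and `pop_all` are hypothetical names, and a monotonically increasing counter stands in for the `time.monotonic()` timestamp so that ties resolve deterministically.

```python
import heapq
import itertools
from dataclasses import dataclass, field


# Hypothetical stand-in for PriorityQueueItem: only sort_index is compared.
@dataclass(order=True)
class Item:
    sort_index: tuple[int, int]
    name: str = field(compare=False)


_counter = itertools.count()  # arrival order; replaces the timestamp in this sketch


def push(heap: list[Item], priority: int, name: str) -> None:
    """Push with key (priority, arrival); a smaller priority value is served first."""
    heapq.heappush(heap, Item((priority, next(_counter)), name))


def pop_all(heap: list[Item]) -> list[str]:
    """Drain the heap in service order."""
    return [heapq.heappop(heap).name for _ in range(len(heap))]


heap: list[Item] = []
push(heap, 3, "normal-a")
push(heap, 1, "urgent")
push(heap, 3, "normal-b")
push(heap, 4, "low")
order = pop_all(heap)
print(order)  # ['urgent', 'normal-a', 'normal-b', 'low']
```

Because the key is a tuple, equal priorities fall back to the second element, which is what gives FIFO behavior within one priority level.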
@@ -0,0 +1,383 @@
 """Proxy request handling for Sidecar V2 — multi-pool routing + cooldown + rate limiting."""
 import asyncio
 import json
 import time
 from typing import Any, Optional
 import httpx
 import structlog
 from fastapi import Request
 from fastapi.responses import JSONResponse, Response, StreamingResponse
 from config import config
 from pool_manager import PoolManager
 from rate_limiter import PerBackendRateLimiter
 from router import Router
 from cooldown_manager import start_cooldown, check_and_clear_cooldown
 from storage.models import Backend
 from storage.usage_store import record_usage
 # Emergency activation counter (read by metrics endpoint)
 _emergency_count: int = 0
 def get_emergency_count() -> int:
    return _emergency_count
 logger: structlog.stdlib.BoundLogger = structlog.get_logger("sidecar_v2.proxy")
 def extract_model(body: dict[str, Any]) -> str:
    """Extract model identifier from request body."""
    return str(body.get("model", "unknown"))
 def build_error_response(status: int, message: str, error_type: str = "") -> JSONResponse:
    """Build a standard error response."""
    return JSONResponse(
        status_code=status,
        content={
            "error": {
                "message": message,
                "type": error_type or f"Error_{status}",
            }
        },
    )
 async def forward_to_backend(
    backend: Backend,
    method: str,
    path: str,
    body: bytes | None,
    headers: dict[str, str],
    stream: bool = False,
 ) -> httpx.Response:
    """Forward a request to a specific backend."""
    upstream_url = backend.api_base_url.rstrip("/") + path
    forward_headers = {
        k: v
        for k, v in headers.items()
        if k.lower() not in ("host", "content-length", "transfer-encoding")
    }
    if backend.api_key_plain:
        forward_headers["authorization"] = f"Bearer {backend.api_key_plain}"
    elif "authorization" not in {k.lower() for k in forward_headers}:
        forward_headers["authorization"] = "Bearer nvidia"
    timeout = httpx.Timeout(backend.timeout_seconds)
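    # NOTE: the client is closed when this block exits, so the body of a
    # response sent with stream=True may no longer be readable by the caller.
    async with httpx.AsyncClient(timeout=timeout) as client: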
        req = client.build_request(
            method=method,
            url=upstream_url,
            headers=forward_headers,
            content=body,
        )
        return await client.send(req, stream=stream)
 def build_response(resp: httpx.Response) -> Response:
    """Convert httpx.Response to FastAPI Response."""
    content_type = resp.headers.get("content-type", "")
    headers = {
        k: v
        for k, v in resp.headers.items()
        if k.lower() not in ("content-encoding", "transfer-encoding")
    }
    is_sse = "text/event-stream" in content_type
    is_chunked = resp.headers.get("transfer-encoding", "").lower() == "chunked"
    if is_sse or (is_chunked and headers.get("content-type", "") != "application/octet-stream"):
        return StreamingResponse(
            content=resp.aiter_bytes(),
            status_code=resp.status_code,
            headers=headers,
            media_type=content_type or "text/event-stream",
        )
    return Response(
        content=resp.content,
        status_code=resp.status_code,
        headers=headers,
        media_type=content_type or "application/json",
    )
 def extract_usage_from_response(
    resp: httpx.Response,
    resp_json: dict[str, Any],
    model: str,
 ) -> tuple[int, int, int]:
    """Extract token usage from response body (OpenAI-compatible)."""
    usage = resp_json.get("usage", {})
    prompt_tokens = usage.get("prompt_tokens", 0) or 0
    completion_tokens = usage.get("completion_tokens", 0) or 0
    # Try streaming chunks: aggregate from choices
    if not prompt_tokens and not completion_tokens:
        choices = resp_json.get("choices", [])
        for choice in choices:
            if isinstance(choice, dict):
                tokens = choice.get("usage", {})
                prompt_tokens += tokens.get("prompt_tokens", 0) or 0
                completion_tokens += tokens.get("completion_tokens", 0) or 0
    total_tokens = prompt_tokens + completion_tokens
    if total_tokens == 0:
        total_tokens = usage.get("total_tokens", 0) or 0
    return prompt_tokens, completion_tokens, total_tokens
 def calculate_cost(
    backend: Backend,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
 ) -> float:
    """Calculate cost using backend's model pricing."""
    cost_info = backend.get_model_cost(model)
    input_cost = cost_info.get("input", 0.0)
    output_cost = cost_info.get("output", 0.0)
    # Costs are per token
    return (prompt_tokens * input_cost + completion_tokens * output_cost)
 async def handle_proxy_request(
    pool_manager: PoolManager,
    rate_limiter: PerBackendRateLimiter,
    router: Router,
    request: Request,
    path: str,
 ) -> Response:
    """Main proxy handler: multi-pool routing with cooldown and rate limiting.
    Flow:
    1. Extract model → canonical name
    2. Pick backend via Router (primary → fallback)
    3. Forward request
    4. If 429 → cooldown backend, retry with another
    5. If all pools exhausted → emergency mode
    6. Track usage
    """
    start_time = time.monotonic()
    body_bytes: bytes = await request.body()
    raw_headers: dict[str, str] = dict(request.headers)
    body_json: dict[str, Any] = {}
    try:
        if body_bytes:
            parsed = json.loads(body_bytes)
            if isinstance(parsed, dict):
                body_json = parsed
    except (ValueError, TypeError):
        body_json = {}
    canonical_model = extract_model(body_json)
    is_stream = body_json.get("stream", False)
    # Try with pool routing
    max_retries = config.max_pool_retries
    for attempt in range(max_retries):
        # Check and clear expired cooldowns before picking
        _refresh_cooldowns()
        backend = router.pick_backend(canonical_model)
        if backend is None:
            break  # No backend available, fall through to emergency
        try:
            resp = await forward_to_backend(
                backend=backend,
                method=request.method,
                path=path,
                body=body_bytes if body_bytes else None,
                headers=raw_headers,
                stream=is_stream,
            )
            elapsed_ms = int((time.monotonic() - start_time) * 1000)
            # Handle 429 — cooldown and retry
            if resp.status_code == 429:
                new_count = backend.consecutive_429_count + 1
                start_cooldown(backend.id, new_count)
                resp_body = ""
                try:
                    resp_body = resp.text[:200]
                except Exception:
                    pass
                logger.warning(
                    "backend_429_cooldown",
                    backend_id=backend.id,
                    pool=backend.pool,
                    consecutive=new_count,
                    model=canonical_model,
                )
                # Track the error
                record_usage(
                    backend_id=backend.id,
                    model=canonical_model,
                    prompt_tokens=0,
                    completion_tokens=0,
                    cost=0.0,
                    latency_ms=elapsed_ms,
                    is_error=True,
                )
                continue  # Retry with another backend
            # Success — track usage
            resp_json: dict[str, Any] = {}
            try:
                if not is_stream and resp.content:
                    resp_json = json.loads(resp.content)
            except (ValueError, TypeError):
                pass
            prompt_tokens, completion_tokens, total_tokens = extract_usage_from_response(
                resp, resp_json, canonical_model
            )
            cost = calculate_cost(
                backend, canonical_model, prompt_tokens, completion_tokens
            )
            record_usage(
                backend_id=backend.id,
                model=canonical_model,
                prompt_tokens=prompt_tokens,
                completion_tokens=completion_tokens,
                cost=cost,
                latency_ms=elapsed_ms,
            )
            logger.info(
                "request_completed",
                backend_id=backend.id,
                pool=backend.pool,
                model=canonical_model,
                status=resp.status_code,
                tokens=total_tokens,
                cost=round(cost, 6),
                elapsed_ms=elapsed_ms,
            )
            return build_response(resp)
        except httpx.TimeoutException:
            logger.warning(
                "backend_timeout",
                backend_id=backend.id,
                model=canonical_model,
            )
            continue
        except (httpx.ConnectError, httpx.RemoteProtocolError) as exc:
            logger.warning(
                "backend_connection_error",
                backend_id=backend.id,
                model=canonical_model,
                error=str(exc),
            )
            continue
        except Exception as exc:
            logger.error(
                "proxy_error",
                backend_id=backend.id,
                model=canonical_model,
                error=str(exc),
            )
            continue
    # All pools exhausted — emergency rate-limited passthrough
    emergency_rpm = int(config.default_rpm_limit * config.emergency_rpm_fraction)
    if emergency_rpm < 1:
        emergency_rpm = 1
    logger.warning(
        "all_pools_exhausted_emergency",
        model=canonical_model,
        emergency_rpm=emergency_rpm,
    )
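    # Track emergency activation for metrics. The counter is module-level, so
    # it must be declared global before the augmented assignment.
    global _emergency_count
    _emergency_count += 1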
    # Emergency: try to get a token from any fallback backend at reduced RPM
    emergency_retries = 3
    for attempt in range(emergency_retries):
        backends = pool_manager.get_any_healthy_backends()
        for backend in backends:
            if rate_limiter.consume(backend.id, emergency_rpm):
                try:
                    resp = await forward_to_backend(
                        backend=backend,
                        method=request.method,
                        path=path,
                        body=body_bytes if body_bytes else None,
                        headers=raw_headers,
                        stream=is_stream,
                    )
                    elapsed_ms = int((time.monotonic() - start_time) * 1000)
                    if resp.status_code == 429:
                        start_cooldown(backend.id, backend.consecutive_429_count + 1)
                        continue
                    # Success in emergency mode
                    try:
                        resp_json: dict[str, Any] = {}
                        if not is_stream and resp.content:
                            resp_json = json.loads(resp.content)
                    except Exception:
                        resp_json = {}
                    prompt_tokens, completion_tokens, total_tokens = extract_usage_from_response(
                        resp, resp_json, canonical_model
                    )
                    cost_em = calculate_cost(backend, canonical_model, prompt_tokens, completion_tokens)
                    record_usage(
                        backend_id=backend.id,
                        model=canonical_model,
                        prompt_tokens=prompt_tokens,
                        completion_tokens=completion_tokens,
                        cost=cost_em,
                        latency_ms=elapsed_ms,
                    )
                    logger.info(
                        "emergency_passthrough_success",
                        backend_id=backend.id,
                        model=canonical_model,
                        emergency_rpm=emergency_rpm,
                    )
                    return build_response(resp)
                except Exception:
                    continue
    # All emergency attempts failed — return 503 for OpenClaw fallback chain
    return build_error_response(
        503,
        "All provider pools exhausted. OpenClaw fallback chain should activate.",
        "AllPoolsExhausted",
    )
 def _refresh_cooldowns() -> None:
    """Check and clear expired cooldowns for backends currently in cooling state.
    Lists all backends and re-checks only those with status='cooling'. The
    health_check_loop handles the periodic scanning; this is the on-demand
    refresh before proxy routing."""
    from storage.backend_store import list_backends
    backends = list_backends(decrypt_key=False)
    for backend in backends:
        if backend.status == "cooling":
            check_and_clear_cooldown(backend.id)
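The retry flow described in `handle_proxy_request` (pick a backend, cool it down on 429, try the next one, fall back to 503) can be illustrated without any network code. The sketch below is a simplification under stated assumptions: `FakeBackend`, `route_with_cooldown`, and the fixed `cooldown_seconds` are hypothetical, and it does not mirror `Router`, `start_cooldown`, or the emergency path.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeBackend:
    """Hypothetical backend that always answers with a canned HTTP status."""
    id: str
    status_code: int
    cooling_until: float = 0.0  # monotonic deadline; 0.0 means not cooling


def route_with_cooldown(
    backends: list[FakeBackend],
    max_retries: int = 3,
    cooldown_seconds: float = 30.0,
) -> tuple[int, Optional[str]]:
    """Try backends in order, skipping any that are cooling down.

    Returns (status, backend_id); (503, None) when every attempt fails.
    """
    for _ in range(max_retries):
        now = time.monotonic()
        candidate = next((b for b in backends if b.cooling_until <= now), None)
        if candidate is None:
            break  # nothing available; fall through to 503
        if candidate.status_code == 429:
            # Rate limited upstream: park this backend and try another one.
            candidate.cooling_until = now + cooldown_seconds
            continue
        return candidate.status_code, candidate.id
    return 503, None


pool = [FakeBackend("primary-1", 429), FakeBackend("fallback-1", 200)]
result = route_with_cooldown(pool)
print(result)  # (200, 'fallback-1')
```

The first attempt hits the rate-limited backend and parks it, so the second attempt lands on the fallback. With only rate-limited backends the loop runs out of candidates and the function returns `(503, None)`.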
@@ -1,48 +0,0 @@
 [project]
 name = "nvidia_sidecar"
 version = "0.1.0"
 description = "NVIDIA Sidecar rate-limiting proxy: priority queuing + token-bucket rate limiting for the NVIDIA API"
 readme = "README.md"
 license = { text = "MIT" }
 requires-python = ">=3.12"
 dependencies = [
    "fastapi>=0.115",
    "uvicorn[standard]>=0.34",
    "httpx>=0.28",
    "PyYAML>=6.0",
    "structlog>=24.4",
    "prometheus-client>=0.21",
    "pydantic>=2.0",
 ]
 [project.optional-dependencies]
 dev = [
    "pytest>=8.3",
    "pytest-asyncio>=0.24",
    "httpx>=0.28",
    "mypy>=1.14",
    "types-PyYAML",
 ]
 [project.scripts]
 nvidia-sidecar = "nvidia_sidecar.server:main"
 [build-system]
 requires = ["setuptools>=75", "wheel"]
 build-backend = "setuptools.build_meta"
 [tool.setuptools]
 packages = ["nvidia_sidecar"]
 [tool.setuptools.package-dir]
 # Flat layout: __init__.py + all .py files at project root
 "nvidia_sidecar" = "."
 [tool.mypy]
 python_version = "3.12"
 strict = true
 warn_return_any = true
 warn_unused_configs = true
 [[tool.mypy.overrides]]
 module = "structlog.*"
 ignore_missing_imports = true
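The rate-limiter hunk that follows keeps one token bucket per backend and refills each bucket lazily from the elapsed time. As a rough sketch of that arithmetic only, assuming an injected clock value instead of `time.monotonic()` and no locking, the hypothetical `SketchBucket` below shows how an RPM limit maps to a refill rate and capacity.

```python
class SketchBucket:
    """Token bucket refilled lazily from elapsed time (hypothetical sketch)."""

    def __init__(self, rpm: int, now: float = 0.0) -> None:
        self.rate = rpm / 60.0              # tokens per second
        self.capacity = float(max(rpm, 1))  # same capacity rule as the hunk below
        self.tokens = self.capacity         # bucket starts full
        self.last_refill = now

    def consume(self, now: float, tokens: int = 1) -> bool:
        """Refill for the time elapsed since the last call, then try to spend."""
        elapsed = now - self.last_refill
        if elapsed > 0:
            self.tokens = min(self.tokens + elapsed * self.rate, self.capacity)
        self.last_refill = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False


bucket = SketchBucket(rpm=2)        # 2 requests per minute: one token every 30 s
first = bucket.consume(now=0.0)     # bucket starts with 2 tokens
second = bucket.consume(now=0.0)
third = bucket.consume(now=0.0)     # empty, so this one is rejected
fourth = bucket.consume(now=31.0)   # 31 s later, just over one token is back
print(first, second, third, fourth)  # True True False True
```

Passing the clock in as an argument keeps the example deterministic; the code in the diff reads `time.monotonic()` and guards the state with a `threading.Lock`.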
@@ -1,130 +1,86 @@
-"""
+"""Per-backend rate limiter using token bucket algorithm."""
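 NVIDIA Sidecar rate-limiting proxy: token bucket + gateway detection module (§3.2)
 Core rate-limiting logic extracted from the BIZ-26 rate_limiter.py, without the multi-threaded scheduler, cache management, etc.
 Kept: Priority, TokenBucket, is_nvidia_gateway, normalize_gateway_name.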
 """
 from __future__ import annotations
 import time
 import threading
-from enum import IntEnum
+import time
 from typing import Any
-# ---------------------------------------------------------------------------
+class PerBackendRateLimiter:
-# Priority enum
+    """Manages independent token buckets for each backend.
 # ---------------------------------------------------------------------------
-class Priority(IntEnum):
+    Thread-safe. Each backend gets its own bucket with configurable RPM.
    """Request priority (a smaller value means higher priority)."""
    URGENT = 1
    HIGH = 2
    NORMAL = 3
    LOW = 4
 # ---------------------------------------------------------------------------
 # NVIDIA gateway alias set
 # ---------------------------------------------------------------------------
 NVIDIA_GATEWAY_ALIASES: set[str] = {
    # All NVIDIA provider names in the OpenClaw configuration
    "nvidia",
    "nvidia-gateway",
    "nvidia98053",
    "nvidialiuweicheng84",
    "nvidiavx",
    "nvidiavx18088980513",
    "nvidiavx64391942",
 }
 def is_nvidia_gateway(value: str | None) -> bool:
    """Return whether a gateway name or full model path belongs to an NVIDIA gateway.
    Args:
        value: Gateway name (e.g. ``"nvidia"``) or the prefix of a full model path
               (e.g. ``"nvidia/deepseek-ai/deepseek-v4-pro"``).
               Returns False immediately when None.
    Returns:
        True when the provider part of value matches a known NVIDIA alias.
    """
    if value is None:
        return False
    # Extract the provider prefix: the part before the first "/"
    provider = value.split("/", 1)[0].lower().strip()
    return provider in NVIDIA_GATEWAY_ALIASES
 def normalize_gateway_name(value: str | None) -> str | None:
    """Normalize a gateway name: extract the provider prefix and lowercase it.
    Args:
        value: Gateway name or full model path. Returns None when None.
    Returns:
        The lowercased provider prefix, or None.
    """
    if value is None:
        return None
    return value.split("/", 1)[0].lower().strip()
 # ---------------------------------------------------------------------------
 # Token bucket (thread-safe)
 # ---------------------------------------------------------------------------
 class TokenBucket:
    """Thread-safe token bucket implementation.
    Supports fixed-rate token refill and consumption, with overflow protection and optional blocking waits.
    """
-    def __init__(self, rate: float = 40 / 60, capacity: int = 40) -> None:
+    def __init__(self, refill_interval_ms: int = 50):
-        """Initialize the token bucket.
+        self._buckets: dict[str, _TokenBucket] = {}
        self._lock = threading.Lock()
        self._refill_interval_ms = refill_interval_ms
-        Args:
+    def ensure_bucket(self, backend_id: str, rpm_limit: int) -> None:
-            rate: Token refill rate (tokens/second). Defaults to 40/60 ≈ 0.667 token/s (40 RPM).
+        """Create or update a bucket for a backend."""
-            capacity: Maximum bucket capacity (tokens). Defaults to 40.
+        with self._lock:
+        with self._lock:
            if backend_id in self._buckets:
                existing = self._buckets[backend_id]
                existing.update_rate(rpm_limit)
            else:
                self._buckets[backend_id] = _TokenBucket(
                    rate=rpm_limit / 60.0,
                    capacity=max(rpm_limit, 1),
                )
    def remove_bucket(self, backend_id: str) -> None:
        """Remove a backend's bucket."""
        with self._lock:
            self._buckets.pop(backend_id, None)
    def consume(self, backend_id: str, rpm_limit: int, tokens: int = 1) -> bool:
        """Try to consume tokens for a backend. Returns True if allowed.
        Auto-creates the bucket if needed.
        """
-        self._rate: float = float(rate)
+        self.ensure_bucket(backend_id, rpm_limit)
        self._capacity: int = int(capacity)
        self._tokens: float = float(capacity)  # bucket starts full
        self._last_refill: float = time.monotonic()
        self._lock: threading.Lock = threading.Lock()
-    # ---- Internal methods ----
+        with self._lock:
            bucket = self._buckets.get(backend_id)
            if bucket is None:
                return False
        return bucket.consume(tokens)
    def get_status(self, backend_id: str) -> dict[str, Any] | None:
        """Get bucket status for a backend."""
        with self._lock:
            bucket = self._buckets.get(backend_id)
            if bucket is None:
                return None
            return bucket.get_status()
    def get_all_status(self) -> dict[str, dict[str, Any]]:
        """Get status of all buckets."""
        with self._lock:
            return {bid: b.get_status() for bid, b in self._buckets.items()}
 class _TokenBucket:
    """Internal token bucket with refill."""
    def __init__(self, rate: float, capacity: int):
        self._rate = float(rate)
        self._capacity = int(capacity)
        self._tokens = float(capacity)
        self._last_refill = time.monotonic()
        self._lock = threading.Lock()
    def _refill(self) -> None:
        """Refill tokens (the caller must hold _lock).
        Computes the number of new tokens from the time elapsed since the last refill, capped at capacity.
        """
        now = time.monotonic()
        elapsed = now - self._last_refill
        if elapsed > 0 and self._rate > 0:
-            new_tokens = elapsed * self._rate
+            self._tokens = min(self._tokens + elapsed * self._rate, float(self._capacity))
            self._tokens = min(self._tokens + new_tokens, float(self._capacity))
        self._last_refill = now
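    # ---- Public methods ----
    def consume(self, tokens: int = 1) -> bool:
        """Try to consume tokens immediately (non-blocking).
        Args:
            tokens: Number of tokens to consume; defaults to 1.
        Returns:
            True if the tokens were consumed; False if there were not enough tokens.
        """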
        if tokens <= 0:
            return True
        with self._lock:
            self._refill()
            if self._tokens >= tokens:
@@ -132,52 +88,15 @@ class TokenBucket:
                return True
            return False
-    def try_consume(self, tokens: int = 1, timeout: float = 2.0) -> bool:
+    def update_rate(self, rpm_limit: int) -> None:
-        """Try to consume tokens within the given time (blocking).
+        new_rate = rpm_limit / 60.0
-
+        with self._lock:
-        Args:
+            self._refill()
-            tokens: Number of tokens to consume; defaults to 1.
+            self._rate = new_rate
-            timeout: Maximum number of seconds to wait; defaults to 2.0.
+            self._capacity = max(rpm_limit, 1)
-
+            self._tokens = min(self._tokens, float(self._capacity))
        Returns:
            True if consumed before the timeout; False on timeout.
        """
        if tokens <= 0:
            return True
        deadline = time.monotonic() + timeout
        while True:
            with self._lock:
                self._refill()
                if self._tokens >= tokens:
                    self._tokens -= tokens
                    return True
            # Compute the remaining wait time after releasing the lock
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return False
            # Sleep until the next token is due to be refilled
            sleep_time = min(remaining, max(0.05, 1.0 / self._rate) if self._rate > 0 else remaining)
            time.sleep(sleep_time)
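    def wait_for_token(self, timeout: float | None = None) -> bool:
        """Wait for and try to consume one token.
        Args:
            timeout: Maximum number of seconds to wait; None waits indefinitely (not recommended).
        Returns:
            True if a token was consumed; False on timeout.
        """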
        return self.try_consume(tokens=1, timeout=timeout if timeout is not None else float("inf"))
    def get_status(self) -> dict[str, Any]:
        """Return the current state of the token bucket.
        Returns:
            A dict containing tokens, capacity, rate_per_minute, and utilization.
        """
        with self._lock:
            self._refill()
            rate_per_minute = self._rate * 60.0
@@ -189,254 +108,4 @@ class TokenBucket:
                "capacity": self._capacity,
                "rate_per_minute": round(rate_per_minute, 1),
                "utilization": round(utilization, 4),
-            }
+            }
    # ---- Properties ----
    @property
    def rate(self) -> float:
        """Current token refill rate (tokens/second)."""
        return self._rate
    @property
    def capacity(self) -> int:
        """Bucket capacity."""
        return self._capacity
    # ---- Dynamic rate adjustment (used by AdaptiveTokenBucket) ----
    def set_rate(self, rate: float) -> None:
        """Adjust the token refill rate at runtime (tokens/second).
        Args:
            rate: New rate (tokens/second).
        """
        with self._lock:
            self._refill()  # Refill at the old rate before switching
            self._rate = float(rate)
 # ---------------------------------------------------------------------------
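 # Retreat mode: AdaptiveTokenBucket (§ADR-009)
 # ---------------------------------------------------------------------------
 class RetreatState:
    """Retreat state-machine constants."""
    NORMAL: str = "normal"
    RETREAT: str = "retreat"
    RECOVER: str = "recover"
 class AdaptiveTokenBucket(TokenBucket):
    """Adaptive retreat token bucket (ADR-009).
    Monitors the upstream 429 rate (60 s sliding window) and adjusts the send rate automatically:
    - 429 rate < 5%   → NORMAL, keep the base rate
    - 429 rate 5-10%  → RETREAT, rate × 0.75
    - 429 rate 10-20% → RETREAT, reduce the rate again
    - 429 rate > 20%  → RETREAT, floor of 5 RPM + alert
    - 429 rate < 2% for 120 s in a row → RECOVER, restore gradually in +2 RPM steps
    Thread-safe; inherits every public interface of TokenBucket.
    """
    # ADR-009 parameters (can be overridden via the constructor)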
    RETREAT_WINDOW_SECONDS: float = 60.0
    RETREAT_429_THRESHOLD: float = 0.05
    RETREAT_FACTOR: float = 0.75
    RETREAT_MIN_RPM: float = 5.0
    RECOVER_WINDOW_SECONDS: float = 120.0
    RECOVER_429_THRESHOLD: float = 0.02
    RECOVER_INCREMENT_RPM: float = 2.0
    def __init__(
        self,
        rate: float = 40 / 60,
        capacity: int = 40,
        *,
        retreat_window_seconds: float = 60.0,
        retreat_429_threshold: float = 0.05,
        retreat_factor: float = 0.75,
        retreat_min_rpm: float = 5.0,
        recover_window_seconds: float = 120.0,
        recover_429_threshold: float = 0.02,
        recover_increment_rpm: float = 2.0,
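    ) -> None:
        """Initialize the adaptive retreat token bucket.
        Args:
            rate: Base token refill rate (tokens/second). Defaults to 40/60 ≈ 0.667 token/s.
            capacity: Maximum bucket capacity. Defaults to 40.
            retreat_window_seconds: Size of the 429-rate sliding window (seconds).
            retreat_429_threshold: 429-rate threshold that triggers a retreat.
            retreat_factor: Rate multiplier applied on each retreat.
            retreat_min_rpm: Minimum RPM while retreating.
            recover_window_seconds: Size of the recovery observation window (seconds).
            recover_429_threshold: 429-rate threshold that triggers recovery.
            recover_increment_rpm: RPM added on each recovery step.
        """
        super().__init__(rate=rate, capacity=capacity)
        # Base rate (never changes)
        self._base_rate: float = float(rate)
        # Retreat parameters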
        self.RETREAT_WINDOW_SECONDS = retreat_window_seconds
        self.RETREAT_429_THRESHOLD = retreat_429_threshold
        self.RETREAT_FACTOR = retreat_factor
        self.RETREAT_MIN_RPM = retreat_min_rpm
        self.RECOVER_WINDOW_SECONDS = recover_window_seconds
        self.RECOVER_429_THRESHOLD = recover_429_threshold
        self.RECOVER_INCREMENT_RPM = recover_increment_rpm
        # Retreat state machine
        self._retreat_state: str = RetreatState.NORMAL
        # 429 sliding window: [(timestamp, is_429), ...]
        self._429_window: list[tuple[float, bool]] = []
        # Time of the last state change
        self._last_state_change: float = time.monotonic()
        # Retreat state lock (an RLock avoids a re-entrant deadlock on evaluate_retreat() → get_429_rate())
        self._retreat_lock: threading.RLock = threading.RLock()
    # ---- 429 feedback ----
    def record_response(self, is_429: bool) -> None:
        """Record whether an upstream response was a 429.
        Args:
            is_429: True if upstream returned 429.
        """
        now = time.monotonic()
        with self._retreat_lock:
            self._429_window.append((now, is_429))
            # Drop old records that fall outside the observation window
            cutoff = now - max(
                self.RETREAT_WINDOW_SECONDS,
                self.RECOVER_WINDOW_SECONDS,
            )
            self._429_window = [
                (ts, flag) for ts, flag in self._429_window
                if ts >= cutoff
            ]
    def get_429_rate(self, window_seconds: float | None = None) -> float:
        """Return the 429 rate within the given window.
        Args:
            window_seconds: Sliding-window size; None uses RETREAT_WINDOW_SECONDS.
        Returns:
            The 429 rate, between 0.0 and 1.0.
        """
        ws = window_seconds or self.RETREAT_WINDOW_SECONDS
        now = time.monotonic()
        with self._retreat_lock:
            in_window = [flag for ts, flag in self._429_window if now - ts <= ws]
            if not in_window:
                return 0.0
            return sum(1 for f in in_window if f) / len(in_window)
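    # ---- Retreat state evaluation ----
    def evaluate_retreat(self) -> str:
        """Evaluate and update the retreat state, returning the new state name.
        Each call uses the current 429 rate and the time spent in the current state to decide whether to enter RETREAT or RECOVER.
        Returns:
            "normal" / "retreat" / "recover".
        """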
        now = time.monotonic()
        with self._retreat_lock:
            retreat_rate = self.get_429_rate(self.RETREAT_WINDOW_SECONDS)
            recover_rate = self.get_429_rate(self.RECOVER_WINDOW_SECONDS)
            if self._retreat_state == RetreatState.NORMAL:
                if retreat_rate >= self.RETREAT_429_THRESHOLD:
                    self._retreat_state = RetreatState.RETREAT
                    self._last_state_change = now
                    self._apply_retreat()
            elif self._retreat_state == RetreatState.RETREAT:
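                # Sustained high 429 rate: slow down again
                if retreat_rate >= self.RETREAT_429_THRESHOLD * 2:
                    # 429 rate is at least twice the retreat threshold (e.g. 10% for a 5% threshold), so retreat further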
                    if self._rate > self.RETREAT_MIN_RPM / 60.0:
                        self._apply_retreat()
                elif recover_rate < self.RECOVER_429_THRESHOLD:
                    time_in_low = now - self._last_state_change
                    if time_in_low >= self.RECOVER_WINDOW_SECONDS:
                        self._retreat_state = RetreatState.RECOVER
                        self._last_state_change = now
                        self._apply_recover()
            elif self._retreat_state == RetreatState.RECOVER:
                if retreat_rate >= self.RETREAT_429_THRESHOLD:
                    # 429 rate climbed again during recovery; re-enter retreat
                    self._retreat_state = RetreatState.RETREAT
                    self._last_state_change = now
                    self._apply_retreat()
                elif self._rate >= self._base_rate:
                    # Back at the base rate
                    self._rate = self._base_rate
                    self._retreat_state = RetreatState.NORMAL
                    self._last_state_change = now
                else:
                    # Keep recovering step by step
                    self._apply_recover()
            return self._retreat_state
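    def _apply_retreat(self) -> None:
        """Apply one retreat step (multiplicative slow-down, floored at RETREAT_MIN_RPM)."""
        new_rate: float = max(
            self.RETREAT_MIN_RPM / 60.0,
            self._rate * self.RETREAT_FACTOR,
        )
        self._rate = new_rate
    def _apply_recover(self) -> None:
        """Apply one recovery step (additive speed-up, capped at the base rate)."""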
        increment: float = self.RECOVER_INCREMENT_RPM / 60.0
        new_rate: float = min(self._base_rate, self._rate + increment)
        self._rate = new_rate
    # ---- State queries ----
    def get_retreat_state(self) -> str:
        """Return the current retreat state.
        Returns:
            "normal" / "retreat" / "recover".
        """
        with self._retreat_lock:
            return self._retreat_state
    def get_effective_rate_rpm(self) -> float:
        """Return the current effective rate in RPM, including any retreat slow-down.
        Returns:
            Current requests per minute.
        """
        with self._lock:
            return self._rate * 60.0
    def get_base_rate_rpm(self) -> float:
        """Return the base rate in RPM, i.e. the rate when not retreating.
        Returns:
            Base requests per minute.
        """
        return self._base_rate * 60.0
    def reset_to_base(self) -> None:
        """Manually reset to the base rate (for operator intervention)."""
        with self._retreat_lock:
            self._rate = self._base_rate
            self._retreat_state = RetreatState.NORMAL
            self._last_state_change = time.monotonic()
            self._429_window.clear()
@@ -0,0 +1,6 @@
 # Sidecar V2 — Multi-Pool Provider Proxy
 fastapi>=0.115.0,<1.0.0
 uvicorn[standard]>=0.30.0,<1.0.0
 httpx>=0.27.0,<1.0.0
 structlog>=24.0.0,<25.0.0
 cryptography>=42.0.0,<44.0.0
@@ -0,0 +1,62 @@
 """Model → Backend routing logic for Sidecar V2."""
 import structlog
 from typing import Optional
 from storage.models import Backend
 from pool_manager import PoolManager
 from rate_limiter import PerBackendRateLimiter
 logger = structlog.get_logger("sidecar_v2.router")
 class Router:
    """Routes model requests to the best available backend.
    Pick strategy:
    1. Primary pool → healthy backends supporting the model
    2. Rate-limiter check → skip if RPM exhausted
    3. Fallback pool → repeat above
    4. If all exhausted → return None (caller handles emergency)
    """
    def __init__(self, pool_manager: PoolManager, rate_limiter: PerBackendRateLimiter):
        self._pool_manager = pool_manager
        self._rate_limiter = rate_limiter
    def pick_backend(self, canonical_model: str) -> Optional[Backend]:
        """Pick the best available backend for a model.
        Tries primary pool first, then fallback.
        Within each pool, skips backends at RPM limit.
        Returns None if no backend available.
        """
        # Try pools in order
        for pool in ["primary", "fallback"]:
            backends = self._pool_manager.get_available_backends(
                canonical_model, pool=pool
            )
            for backend in backends:
                # Rate-limit check
                if self._rate_limiter.consume(
                    backend.id, backend.rpm_limit
                ):
                    return backend
                # Skip this backend, try next
                logger.debug(
                    "backend_rate_limited",
                    backend_id=backend.id,
                    pool=pool,
                    model=canonical_model,
                )
            if not backends:
                logger.debug("pool_exhausted", pool=pool, model=canonical_model)
            else:
                logger.debug("pool_rpm_exhausted", pool=pool, model=canonical_model)
        return None
    def get_all_pools_exhausted_info(self, canonical_model: str) -> bool:
        """Check if ALL pools are exhausted for a model."""
        return not self._pool_manager.is_any_pool_available(canonical_model)
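
Review note: the pick strategy in `Router.pick_backend` (primary pool first, then fallback, skipping backends whose RPM budget is spent) can be checked in isolation with in-memory stand-ins. The sketch below assumes nothing about the real `PoolManager` or `PerBackendRateLimiter` beyond the two calls visible in this file; `FakeBackend`, `FakePoolManager` and `FakeRateLimiter` are hypothetical test doubles, and the fake limiter never refills.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeBackend:
    id: str
    rpm_limit: int


class FakePoolManager:
    """Hypothetical stand-in that returns a fixed backend list per pool."""

    def __init__(self, pools: dict[str, list[FakeBackend]]) -> None:
        self._pools = pools

    def get_available_backends(self, canonical_model: str, pool: str) -> list[FakeBackend]:
        return list(self._pools.get(pool, []))


class FakeRateLimiter:
    """Hypothetical limiter: allows rpm_limit calls per backend, no refill."""

    def __init__(self) -> None:
        self._used: dict[str, int] = {}

    def consume(self, backend_id: str, rpm_limit: int) -> bool:
        used = self._used.get(backend_id, 0)
        if used >= rpm_limit:
            return False
        self._used[backend_id] = used + 1
        return True


def pick_backend(
    pools: FakePoolManager, limiter: FakeRateLimiter, model: str
) -> Optional[FakeBackend]:
    """Return the first backend with budget left, trying primary then fallback."""
    for pool in ("primary", "fallback"):
        for backend in pools.get_available_backends(model, pool=pool):
            if limiter.consume(backend.id, backend.rpm_limit):
                return backend
    return None


pools = FakePoolManager(
    {"primary": [FakeBackend("a", 1)], "fallback": [FakeBackend("b", 1)]}
)
limiter = FakeRateLimiter()
picks = [pick_backend(pools, limiter, "demo-model") for _ in range(3)]
print([p.id if p else None for p in picks])
```

Note that `consume` both checks and spends budget, so a backend that is picked but then fails upstream has still used one slot; whether that is desired is a design decision for the real limiter.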
@@ -1,327 +0,0 @@
 <!DOCTYPE html>
 <html lang="zh-CN">
 <head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>NVIDIA Sidecar — 实时仪表盘</title>
  <script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.7/dist/chart.umd.min.js"></script>
  <style>
    * { margin: 0; padding: 0; box-sizing: border-box; }
    body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; background: #0f172a; color: #e2e8f0; padding: 24px; }
    h1 { font-size: 22px; font-weight: 600; margin-bottom: 4px; color: #f8fafc; }
    .subtitle { color: #94a3b8; font-size: 13px; margin-bottom: 24px; }
    .grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(380px, 1fr)); gap: 20px; margin-bottom: 24px; }
    .card { background: #1e293b; border-radius: 12px; padding: 20px; border: 1px solid #334155; }
    .card h2 { font-size: 15px; font-weight: 600; color: #94a3b8; margin-bottom: 14px; text-transform: uppercase; letter-spacing: 0.05em; }
    .card canvas { max-height: 220px; }
    .stat-row { display: flex; gap: 16px; flex-wrap: wrap; }
    .stat { flex: 1; min-width: 100px; background: #0f172a; border-radius: 8px; padding: 12px; text-align: center; border: 1px solid #334155; }
    .stat .value { font-size: 28px; font-weight: 700; color: #38bdf8; }
    .stat .label { font-size: 11px; color: #64748b; margin-top: 4px; text-transform: uppercase; }
    .stat.warn .value { color: #f59e0b; }
    .stat.danger .value { color: #ef4444; }
    .retreat-badge { display: inline-block; padding: 2px 10px; border-radius: 999px; font-size: 12px; font-weight: 600; }
    .retreat-badge.normal { background: #065f46; color: #6ee7b7; }
    .retreat-badge.retreat { background: #78350f; color: #fbbf24; }
    .retreat-badge.recover { background: #1e3a5f; color: #60a5fa; }
    .config-panel { background: #1e293b; border-radius: 12px; padding: 20px; border: 1px solid #334155; }
    .config-panel h2 { font-size: 15px; font-weight: 600; color: #94a3b8; margin-bottom: 14px; text-transform: uppercase; letter-spacing: 0.05em; }
    .config-row { display: flex; align-items: center; gap: 12px; margin-bottom: 12px; flex-wrap: wrap; }
    .config-row label { min-width: 100px; font-size: 13px; color: #cbd5e1; }
    .config-row input, .config-row select { background: #0f172a; border: 1px solid #334155; border-radius: 6px; color: #e2e8f0; padding: 6px 10px; font-size: 13px; }
    .config-row input[type="range"] { width: 140px; }
    .config-row button { background: #38bdf8; color: #0f172a; border: none; border-radius: 6px; padding: 6px 16px; font-size: 13px; font-weight: 600; cursor: pointer; }
    .config-row button:hover { background: #7dd3fc; }
    .config-row button:disabled { background: #475569; cursor: not-allowed; }
    .toast { position: fixed; top: 16px; right: 16px; padding: 10px 20px; border-radius: 8px; font-size: 13px; z-index: 999; animation: fadeInOut 3s; }
    .toast.success { background: #065f46; color: #6ee7b7; }
    .toast.error { background: #7f1d1d; color: #fca5a5; }
    @keyframes fadeInOut { 0% { opacity: 0; transform: translateY(-8px); } 10% { opacity: 1; transform: translateY(0); } 80% { opacity: 1; } 100% { opacity: 0; } }
    .disconnected { background: #7f1d1d; color: #fca5a5; padding: 4px 10px; border-radius: 4px; font-size: 12px; display: inline-block; margin-left: 8px; }
    .connected { background: #065f46; color: #6ee7b7; padding: 4px 10px; border-radius: 4px; font-size: 12px; display: inline-block; margin-left: 8px; }
    /* BIZ-46 Phase3: 队列柱状图 300ms 平滑动画 */
    .queue-bar { transition: height 0.3s ease; }
    /* BIZ-46 Phase3: SSE 断连 5s 半透明遮罩 */
    #reconnect-mask {
      display: none;
      position: fixed;
      top: 0; left: 0; right: 0; bottom: 0;
      background: rgba(15, 23, 42, 0.85);
      z-index: 1000;
      justify-content: center;
      align-items: center;
      flex-direction: column;
    }
    #reconnect-mask.visible { display: flex; }
    #reconnect-mask .mask-icon { font-size: 48px; margin-bottom: 16px; }
    #reconnect-mask .mask-text { color: #94a3b8; font-size: 16px; font-weight: 500; }
    #reconnect-mask .mask-sub { color: #64748b; font-size: 13px; margin-top: 8px; }
  </style>
 </head>
 <body>
  <!-- BIZ-46 Phase3: SSE 断连遮罩 -->
  <div id="reconnect-mask">
    <div class="mask-icon">⚠️</div>
    <div class="mask-text">数据暂不可用</div>
    <div class="mask-sub">SSE 连接中断，正在重连…</div>
  </div>
  <h1>🚀 NVIDIA Sidecar 实时仪表盘
    <span id="conn-status" class="connected">已连接</span>
  </h1>
  <p class="subtitle">令牌桶限流 · 优先级队列 · 避退模式 · 实时监控</p>
  <!-- 状态卡片 -->
  <div class="stat-row" style="margin-bottom: 24px;">
    <div class="stat"><div class="value" id="val-total">0</div><div class="label">总请求</div></div>
    <div class="stat"><div class="value" id="val-nvidia">0</div><div class="label">NVIDIA 请求</div></div>
    <div class="stat"><div class="value" id="val-rate">0</div><div class="label">当前 RPM</div></div>
    <div class="stat"><div class="value" id="val-429">0%</div><div class="label">上游 429 率</div></div>
    <div class="stat"><div class="value" id="val-retreat">正常</div><div class="label">避退状态</div></div>
    <div class="stat"><div class="value" id="val-uptime">0s</div><div class="label">运行时间</div></div>
  </div>
  <!-- 图表 -->
  <div class="grid">
    <div class="card">
      <h2>📊 令牌桶使用率</h2>
      <canvas id="chart-tokens"></canvas>
    </div>
    <div class="card">
      <!-- BIZ-46 Phase3: 队列图标题显示总排队数 -->
      <h2>📈 队列深度 <span id="queue-total" style="font-size:13px;color:#38bdf8;">(共 0)</span></h2>
      <canvas id="chart-queue"></canvas>
    </div>
    <div class="card">
      <h2>📉 请求吞吐量 (最近 20 点)</h2>
      <canvas id="chart-throughput"></canvas>
    </div>
    <div class="card">
      <h2>⚙️ 速率历史</h2>
      <canvas id="chart-rate"></canvas>
    </div>
  </div>
  <!-- 配置面板 -->
  <div class="config-panel">
    <h2>🔧 实时配置</h2>
    <div class="config-row">
      <label>速率 (RPM)</label>
      <input type="range" id="cfg-rate-rpm" min="1" max="100" value="40" oninput="document.getElementById('cfg-rate-val').textContent=this.value">
      <span id="cfg-rate-val" style="min-width:30px;">40</span>
    </div>
    <div class="config-row">
      <label>队列上限</label>
      <input type="number" id="cfg-queue-max" value="500" min="1" max="2000" style="width:80px;">
    </div>
    <div class="config-row">
      <button onclick="applyConfig()">应用配置</button>
    </div>
  </div>
 <script>
 // SSE 连接
 let evtSource = null;
 let dataHistory = { throughput: [], rates: [] };
 const MAX_HISTORY = 20;
 let lastSSETime = Date.now();
 // BIZ-46 Phase3: SSE 断连 5s 遮罩
 function checkReconnect() {
  const mask = document.getElementById('reconnect-mask');
  if (Date.now() - lastSSETime > 5000) {
    mask.classList.add('visible');
  }
 }
 setInterval(checkReconnect, 1000);
 function connectSSE() {
  if (evtSource) evtSource.close();
  evtSource = new EventSource('/api/dashboard/stream');
  evtSource.onmessage = (e) => {
    try {
      const snap = JSON.parse(e.data);
      lastSSETime = Date.now();
      // 隐藏断连遮罩
      document.getElementById('reconnect-mask').classList.remove('visible');
      updateDashboard(snap);
      document.getElementById('conn-status').className = 'connected';
      document.getElementById('conn-status').textContent = '已连接';
    } catch (err) {
      document.getElementById('conn-status').className = 'disconnected';
      document.getElementById('conn-status').textContent = '解析错误';
    }
  };
  evtSource.onerror = () => {
    document.getElementById('conn-status').className = 'disconnected';
    document.getElementById('conn-status').textContent = '断开 - 重连中';
  };
 }
 // 初始化 Chart.js
 const ctxTokens = document.getElementById('chart-tokens').getContext('2d');
 const chartTokens = new Chart(ctxTokens, {
  type: 'doughnut',
  data: {
    labels: ['已用令牌', '可用令牌'],
    datasets: [{ data: [0, 40], backgroundColor: ['#ef4444', '#22c55e'], borderWidth: 0 }]
  },
  options: { responsive: true, maintainAspectRatio: true, cutout: '65%', plugins: { legend: { position: 'bottom', labels: { color: '#94a3b8' } } },
    // BIZ-46 Phase3: 300ms 平滑动画
    animation: { duration: 300 } }
 });
 const ctxQueue = document.getElementById('chart-queue').getContext('2d');
 const chartQueue = new Chart(ctxQueue, {
  type: 'bar',
  data: {
    labels: ['URGENT', 'HIGH', 'NORMAL', 'LOW'],
    datasets: [{ label: '排队数', data: [0, 0, 0, 0], backgroundColor: ['#ef4444', '#f59e0b', '#38bdf8', '#a78bfa'] }]
  },
  options: { responsive: true, maintainAspectRatio: true,
    scales: { y: { beginAtZero: true, ticks: { color: '#94a3b8' } }, x: { ticks: { color: '#94a3b8' } } },
    plugins: { legend: { display: false } },
    // BIZ-46 Phase3: 300ms 平滑动画
    animation: { duration: 300 } }
 });
 const ctxThroughput = document.getElementById('chart-throughput').getContext('2d');
 const chartThroughput = new Chart(ctxThroughput, {
  type: 'line',
  data: { labels: [], datasets: [
    { label: '成功', data: [], borderColor: '#22c55e', backgroundColor: '#22c55e20', fill: false, tension: 0.3, pointRadius: 2 },
    { label: '429', data: [], borderColor: '#f59e0b', backgroundColor: '#f59e0b20', fill: false, tension: 0.3, pointRadius: 2 },
    { label: '直通', data: [], borderColor: '#a78bfa', backgroundColor: '#a78bfa20', fill: false, tension: 0.3, pointRadius: 2 },
  ]},
  options: { responsive: true, maintainAspectRatio: true,
    scales: { y: { beginAtZero: true, ticks: { color: '#94a3b8' } }, x: { ticks: { color: '#94a3b8' } } },
    plugins: { legend: { position: 'bottom', labels: { color: '#94a3b8' } } },
    animation: { duration: 300 } }
 });
 const ctxRate = document.getElementById('chart-rate').getContext('2d');
 const chartRate = new Chart(ctxRate, {
  type: 'line',
  data: { labels: [], datasets: [
    { label: '有效 RPM', data: [], borderColor: '#38bdf8', fill: false, tension: 0.3, pointRadius: 2 },
    { label: '基准 RPM', data: [], borderColor: '#64748b', fill: false, tension: 0.3, pointRadius: 2, borderDash: [4, 4] },
  ]},
  options: { responsive: true, maintainAspectRatio: true,
    scales: { y: { beginAtZero: true, ticks: { color: '#94a3b8' } }, x: { ticks: { color: '#94a3b8' } } },
    plugins: { legend: { position: 'bottom', labels: { color: '#94a3b8' } } },
    animation: { duration: 300 } }
 });
 function updateDashboard(snap) {
  const r = snap.requests || {};
  const tb = snap.token_bucket || {};
  const rt = snap.retreat || {};
  document.getElementById('val-total').textContent = (r.total || 0).toLocaleString();
  document.getElementById('val-nvidia').textContent = (r.nvidia || 0).toLocaleString();
  document.getElementById('val-rate').textContent = Math.round(rt.effective_rpm || 40);
  document.getElementById('val-429').textContent = ((rt.upstream_429_rate || 0) * 100).toFixed(1) + '%';
  document.getElementById('val-uptime').textContent = fmtDuration(snap.uptime_seconds || 0);
  const retreatEl = document.getElementById('val-retreat');
  const state = rt.state || 'normal';
  retreatEl.textContent = state === 'retreat' ? '⚠️ 避退' : state === 'recover' ? '↗ 恢复中' : '✅ 正常';
  retreatEl.style.color = state === 'retreat' ? '#f59e0b' : state === 'recover' ? '#60a5fa' : '#22c55e';
  chartTokens.data.datasets[0].data = [
    Math.round((tb.capacity ?? 40) - (tb.tokens ?? 40)),
    Math.round(tb.tokens ?? 0)
  ];
  chartTokens.update();
  const qs = snap.queue || {};
  const perPriority = qs.per_priority || {};
  const summedQueued = ['URGENT', 'HIGH', 'NORMAL', 'LOW'].reduce((sum, p) => sum + (perPriority[p] || 0), 0);
  const totalQueued = summedQueued || qs.current_size || 0;
  chartQueue.data.datasets[0].data = [
    perPriority.URGENT || 0,
    perPriority.HIGH || 0,
    perPriority.NORMAL || 0,
    perPriority.LOW || 0
  ];
  chartQueue.update();
  // BIZ-46 Phase3: 队列图标题显示总排队数
  document.getElementById('queue-total').textContent = '(共 ' + totalQueued + ')';
  const now = new Date().toLocaleTimeString();
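  // Keep the cumulative count in each history point so the delta is taken against the previous total, not the previous delta
  const prevTotal = dataHistory.throughput.length > 0 ? dataHistory.throughput[dataHistory.throughput.length - 1].nvidiaTotal : 0;
  const throughput = Math.max(0, (r.nvidia || 0) - prevTotal);
  dataHistory.throughput.push({ time: now, nvidia: throughput, nvidiaTotal: r.nvidia || 0, ratelimited: r.ratelimited || 0, passthrough: r.passthrough || 0 });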
  dataHistory.rates.push({ time: now, effective: rt.effective_rpm || 40, base: rt.base_rpm || 40 });
  if (dataHistory.throughput.length > MAX_HISTORY) dataHistory.throughput.shift();
  if (dataHistory.rates.length > MAX_HISTORY) dataHistory.rates.shift();
  chartThroughput.data.labels = dataHistory.throughput.map(d => d.time);
  chartThroughput.data.datasets[0].data = dataHistory.throughput.map(d => d.nvidia);
  chartThroughput.data.datasets[1].data = dataHistory.throughput.map(d => d.ratelimited);
  chartThroughput.data.datasets[2].data = dataHistory.throughput.map(d => d.passthrough);
  chartThroughput.update();
  chartRate.data.labels = dataHistory.rates.map(d => d.time);
  chartRate.data.datasets[0].data = dataHistory.rates.map(d => d.effective);
  chartRate.data.datasets[1].data = dataHistory.rates.map(d => d.base);
  chartRate.update();
 }
 function fmtDuration(s) {
  s = Math.floor(s);  // drop fractional seconds so the label stays integral
  if (s < 60) return s + 's';
  if (s < 3600) return Math.floor(s/60) + 'm ' + (s%60) + 's';
  return Math.floor(s/3600) + 'h ' + Math.floor((s%3600)/60) + 'm';
 }
 async function applyConfig() {
  const btn = document.querySelector('.config-row button');
  btn.disabled = true;
  try {
    const resp = await fetch('/api/admin/config', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        rate_rpm: parseInt(document.getElementById('cfg-rate-rpm').value),
        queue_max_size: parseInt(document.getElementById('cfg-queue-max').value),
      })
    });
    const result = await resp.json();
    showToast(resp.ok ? 'success' : 'error', resp.ok ? '配置已更新' : (result.detail || '配置更新失败'));
  } catch (err) {
    showToast('error', '请求失败: ' + err.message);
  }
  btn.disabled = false;
 }
 function showToast(type, msg) {
  const t = document.createElement('div');
  t.className = 'toast ' + type;
  t.textContent = msg;
  document.body.appendChild(t);
  setTimeout(() => t.remove(), 3000);
 }
 // BIZ-46 Phase3: 页面加载时同步当前配置值
 async function loadConfig() {
  try {
    const resp = await fetch('/api/admin/config');
    if (resp.ok) {
      const config = await resp.json();
      document.getElementById('cfg-rate-rpm').value = config.rate_rpm || 40;
      document.getElementById('cfg-rate-val').textContent = config.rate_rpm || 40;
      document.getElementById('cfg-queue-max').value = config.queue_max_size || 500;
    }
  } catch (e) {
    console.warn('配置加载失败（可能需要 Admin Token）', e);
  }
 }
 loadConfig();
 connectSSE();
 </script>
 </body>
 </html>
@@ -0,0 +1 @@
 # Sidecar V2 storage module
@@ -0,0 +1,252 @@
 """CRUD operations for Backend (provider) management."""
 import json
 import time
 from typing import Optional
 from storage.db import get_connection, generate_id
 from storage.models import Backend, ModelMapping
 from crypto import encrypt, decrypt
 def create_backend(backend: Backend) -> Backend:
    """Create a new backend. Encrypts API key before storage."""
    if not backend.id:
        backend.id = generate_id("bkd")
    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    backend.created_at = now
    backend.updated_at = now
    api_key_encrypted = encrypt(backend.api_key_plain)
    with get_connection() as conn:
        conn.execute(
            """INSERT INTO backends (id, name, label, api_base_url, api_key_encrypted,
               api, timeout_seconds, rpm_limit, pool, enabled, status, model_mappings_json,
               source, cooldown_until, consecutive_429_count, metadata_json, created_at, updated_at)
               VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
            (
                backend.id, backend.name, backend.label, backend.api_base_url,
                api_key_encrypted, backend.api, backend.timeout_seconds,
                backend.rpm_limit, backend.pool, 1 if backend.enabled else 0,
                backend.status, json.dumps(_mappings_to_dict(backend.model_mappings)),
                backend.source, backend.cooldown_until,
                backend.consecutive_429_count,
                json.dumps(backend.metadata), backend.created_at, backend.updated_at,
            ),
        )
        conn.commit()
    return backend
 def get_backend(backend_id: str, decrypt_key: bool = True) -> Optional[Backend]:
    """Get a single backend by ID."""
    with get_connection() as conn:
        row = conn.execute(
            "SELECT * FROM backends WHERE id = ?", (backend_id,)
        ).fetchone()
    if row is None:
        return None
    return _row_to_backend(row, decrypt_key=decrypt_key)
 def list_backends(
    pool: Optional[str] = None,
    enabled_only: bool = False,
    decrypt_key: bool = False,
 ) -> list[Backend]:
    """List backends, optionally filtered by pool and by enabled flag."""
    with get_connection() as conn:
        if pool:
            rows = conn.execute(
                "SELECT * FROM backends WHERE pool = ? ORDER BY created_at",
                (pool,),
            ).fetchall()
        else:
            rows = conn.execute(
                "SELECT * FROM backends ORDER BY pool, created_at"
            ).fetchall()
    backends = [_row_to_backend(r, decrypt_key=decrypt_key) for r in rows]
    if enabled_only:
        backends = [b for b in backends if b.enabled]
    return backends
 def update_backend(backend_id: str, updates: dict) -> Optional[Backend]:
    """Update backend fields. If api_key_plain is provided, re-encrypt."""
    current = get_backend(backend_id, decrypt_key=True)
    if current is None:
        return None
    # Apply updates
    allowed = {
        "name", "label", "api_base_url", "api", "timeout_seconds",
        "rpm_limit", "pool", "enabled", "status", "source",
        "cooldown_until", "consecutive_429_count", "metadata",
    }
    for key, value in updates.items():
        if key in allowed:
            setattr(current, key, value)
    current.updated_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    # Handle API key update
    api_key_encrypted = None
    if "api_key_plain" in updates and updates["api_key_plain"]:
        current.api_key_plain = updates["api_key_plain"]
        api_key_encrypted = encrypt(updates["api_key_plain"])
    # Handle model_mappings update
    mappings_json = None
    if "model_mappings" in updates:
        current.model_mappings = updates["model_mappings"]
        mappings_json = json.dumps(_mappings_to_dict(current.model_mappings))
    with get_connection() as conn:
        # Build dynamic UPDATE
        set_clauses = [
            "name = ?", "label = ?", "api_base_url = ?", "api = ?",
            "timeout_seconds = ?", "rpm_limit = ?", "pool = ?", "enabled = ?",
            "status = ?", "source = ?", "cooldown_until = ?",
            "consecutive_429_count = ?", "metadata_json = ?", "updated_at = ?",
        ]
        params = [
            current.name, current.label, current.api_base_url, current.api,
            current.timeout_seconds, current.rpm_limit, current.pool,
            1 if current.enabled else 0, current.status, current.source,
            current.cooldown_until, current.consecutive_429_count,
            json.dumps(current.metadata), current.updated_at,
        ]
        if api_key_encrypted:
            set_clauses.append("api_key_encrypted = ?")
            params.append(api_key_encrypted)
        if mappings_json is not None:
            set_clauses.append("model_mappings_json = ?")
            params.append(mappings_json)
        params.append(backend_id)
        conn.execute(
            f"UPDATE backends SET {', '.join(set_clauses)} WHERE id = ?",
            params,
        )
        conn.commit()
    return get_backend(backend_id, decrypt_key=False)
 def delete_backend(backend_id: str) -> bool:
    """Delete a backend. Returns True if deleted."""
    with get_connection() as conn:
        cursor = conn.execute("DELETE FROM backends WHERE id = ?", (backend_id,))
        conn.commit()
        return cursor.rowcount > 0
 def set_backend_status(backend_id: str, status: str) -> bool:
    """Quickly set backend status (healthy/cooling/error/disabled)."""
    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    with get_connection() as conn:
        cursor = conn.execute(
            "UPDATE backends SET status = ?, updated_at = ? WHERE id = ?",
            (status, now, backend_id),
        )
        conn.commit()
        return cursor.rowcount > 0
 def set_backend_cooldown(backend_id: str, cooldown_until: str, count: int) -> bool:
    """Set cooldown state on a backend."""
    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    with get_connection() as conn:
        cursor = conn.execute(
            """UPDATE backends SET status = 'cooling', cooldown_until = ?,
               consecutive_429_count = ?, updated_at = ? WHERE id = ?""",
            (cooldown_until, count, now, backend_id),
        )
        conn.commit()
        return cursor.rowcount > 0
 def clear_backend_cooldown(backend_id: str) -> bool:
    """Clear cooldown (back to healthy)."""
    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    with get_connection() as conn:
        cursor = conn.execute(
            """UPDATE backends SET status = 'healthy', cooldown_until = NULL,
               consecutive_429_count = 0, updated_at = ? WHERE id = ?""",
            (now, backend_id),
        )
        conn.commit()
        return cursor.rowcount > 0
 def get_pool_stats() -> dict:
    """Get summary stats per pool."""
    with get_connection() as conn:
        rows = conn.execute(
            """SELECT pool, COUNT(*) as total,
               SUM(CASE WHEN enabled = 1 THEN 1 ELSE 0 END) as enabled,
               SUM(CASE WHEN status = 'healthy' THEN 1 ELSE 0 END) as healthy,
               SUM(CASE WHEN status = 'cooling' THEN 1 ELSE 0 END) as cooling,
               SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) as error
               FROM backends GROUP BY pool"""
        ).fetchall()
    stats = {}
    for row in rows:
        stats[row["pool"]] = {
            "total": row["total"],
            "enabled": row["enabled"],
            "healthy": row["healthy"],
            "cooling": row["cooling"],
            "error": row["error"],
        }
    return stats
 def _row_to_backend(row, decrypt_key: bool = True) -> Backend:
    """Convert a DB row to a Backend instance."""
    mappings_raw = row["model_mappings_json"] or "{}"
    mappings_dict = json.loads(mappings_raw)
    model_mappings = {}
    for canonical_name, mm in mappings_dict.items():
        model_mappings[canonical_name] = ModelMapping.from_dict(mm)
    backend = Backend(
        id=row["id"],
        name=row["name"],
        label=row["label"],
        api_base_url=row["api_base_url"],
        api_key_encrypted=row["api_key_encrypted"] or "",
        api=row["api"],
        timeout_seconds=row["timeout_seconds"],
        rpm_limit=row["rpm_limit"],
        pool=row["pool"],
        enabled=bool(row["enabled"]),
        status=row["status"],
        model_mappings=model_mappings,
        source=row["source"],
        cooldown_until=row["cooldown_until"],
        consecutive_429_count=row["consecutive_429_count"],
        metadata=json.loads(row["metadata_json"] or "{}"),
        created_at=row["created_at"],
        updated_at=row["updated_at"],
    )
    if decrypt_key and backend.api_key_encrypted:
        from crypto import try_decrypt_existing
        plain = try_decrypt_existing(backend.api_key_encrypted)
        if plain:
            backend.api_key_plain = plain
    return backend
 def _mappings_to_dict(mappings: dict[str, ModelMapping]) -> dict:
    """Convert ModelMapping dict to JSON-safe dict."""
    return {k: v.to_dict() for k, v in mappings.items()}
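
Review note: `update_backend` builds its `UPDATE` statement from a fixed list of columns plus two optional ones. A related pattern, sketched below against an in-memory SQLite database, writes only the whitelisted columns that the caller actually supplied. This is an alternative for comparison, not the code in this diff: the three-column `ALLOWED_COLUMNS` tuple, the reduced table schema and `update_backend_fields` are assumptions made for the example.

```python
import sqlite3

# Hypothetical subset of updatable columns, for illustration only.
ALLOWED_COLUMNS = ("name", "rpm_limit", "pool")


def update_backend_fields(
    conn: sqlite3.Connection, backend_id: str, updates: dict
) -> bool:
    """Update only whitelisted columns that are present in `updates`."""
    # Column names come from the whitelist, never from caller input;
    # values are always bound as parameters.
    fields = {col: updates[col] for col in ALLOWED_COLUMNS if col in updates}
    if not fields:
        return False
    set_sql = ", ".join(f"{col} = ?" for col in fields)
    params = [*fields.values(), backend_id]
    cursor = conn.execute(f"UPDATE backends SET {set_sql} WHERE id = ?", params)
    conn.commit()
    return cursor.rowcount > 0


conn = sqlite3.connect(":memory:")
# Assumed, heavily reduced schema; the real table has many more columns.
conn.execute(
    "CREATE TABLE backends (id TEXT PRIMARY KEY, name TEXT, rpm_limit INTEGER, pool TEXT)"
)
conn.execute("INSERT INTO backends VALUES ('bkd_1', 'demo', 40, 'primary')")
conn.commit()

# The non-whitelisted "id" key is silently ignored.
update_backend_fields(conn, "bkd_1", {"rpm_limit": 60, "id": "ignored"})
print(conn.execute("SELECT id, name, rpm_limit, pool FROM backends").fetchone())
```

Interpolating column names into SQL is only safe here because they are drawn from a constant tuple; the values themselves always go through `?` placeholders.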
@@ -0,0 +1,55 @@
 """System configuration KV store operations."""
 import time
 from typing import Optional, Any
 from storage.db import get_connection
 def get_config(key: str) -> Optional[str]:
    """Get a single config value."""
    with get_connection() as conn:
        row = conn.execute(
            "SELECT value FROM system_config WHERE key = ?", (key,)
        ).fetchone()
    return row["value"] if row else None
 def set_config(key: str, value: str, description: str = "") -> None:
    """Set or update a config value."""
    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    with get_connection() as conn:
        conn.execute(
            """INSERT INTO system_config (key, value, description, updated_at)
               VALUES (?, ?, ?, ?)
               ON CONFLICT(key) DO UPDATE SET
               value = excluded.value,
               description = excluded.description,
               updated_at = excluded.updated_at""",
            (key, value, description, now),
        )
        conn.commit()
 def delete_config(key: str) -> bool:
    """Delete a config value."""
    with get_connection() as conn:
        cursor = conn.execute(
            "DELETE FROM system_config WHERE key = ?", (key,)
        )
        conn.commit()
        return cursor.rowcount > 0
 def list_configs() -> list[dict]:
    """List all system config entries."""
    with get_connection() as conn:
        rows = conn.execute("SELECT * FROM system_config ORDER BY key").fetchall()
    return [dict(row) for row in rows]
 def get_all_configs_as_dict() -> dict[str, str]:
    """Get all configs as a simple dict."""
    with get_connection() as conn:
        rows = conn.execute("SELECT key, value FROM system_config").fetchall()
    return {row["key"]: row["value"] for row in rows}
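
Review note: `set_config` relies on SQLite's `INSERT ... ON CONFLICT(key) DO UPDATE` upsert, which needs SQLite 3.24 or newer. The sketch below runs the same statement against an in-memory database to show that a second write replaces the row instead of duplicating it. The table definition is an assumption (the real `system_config` DDL is not part of this hunk), and `upsert_config` is a hypothetical name.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
# Assumed schema: the upsert only requires that `key` is unique.
conn.execute(
    """CREATE TABLE system_config (
           key TEXT PRIMARY KEY,
           value TEXT NOT NULL,
           description TEXT NOT NULL DEFAULT '',
           updated_at TEXT NOT NULL
       )"""
)


def upsert_config(key: str, value: str, description: str = "") -> None:
    """Insert the key, or overwrite value/description/updated_at if it exists."""
    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    conn.execute(
        """INSERT INTO system_config (key, value, description, updated_at)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(key) DO UPDATE SET
               value = excluded.value,
               description = excluded.description,
               updated_at = excluded.updated_at""",
        (key, value, description, now),
    )
    conn.commit()


upsert_config("rate_rpm", "40", "initial value")
upsert_config("rate_rpm", "60")  # same key: updates in place
rows = conn.execute("SELECT key, value, description FROM system_config").fetchall()
print([dict(row) for row in rows])
```

One consequence worth knowing: because `description` is always overwritten, a later call that omits it resets the stored description to the empty string.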
@@ -0,0 +1,74 @@
 """Cooldown event logging."""
 import time
 from typing import Optional
 from storage.db import get_connection, generate_id
 from storage.models import CooldownEvent
 def log_cooldown_event(
    backend_id: str,
    consecutive_count: int,
    cooldown_seconds: int,
    response_summary: str = "",
 ) -> CooldownEvent:
    """Record a cooldown event."""
    event = CooldownEvent(
        id=generate_id("cev"),
        backend_id=backend_id,
        consecutive_count=consecutive_count,
        cooldown_seconds=cooldown_seconds,
        response_summary=response_summary,
        started_at=time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    )
    with get_connection() as conn:
        conn.execute(
            """INSERT INTO cooldown_events
               (id, backend_id, consecutive_count, cooldown_seconds,
                response_summary, started_at)
               VALUES (?, ?, ?, ?, ?, ?)""",
            (event.id, event.backend_id, event.consecutive_count,
             event.cooldown_seconds, event.response_summary, event.started_at),
        )
        conn.commit()
    return event
 def end_cooldown_event(backend_id: str) -> bool:
    """Mark the latest open cooldown event as ended."""
    ended_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    with get_connection() as conn:
        # Close only the latest open event for this backend. A subquery is used
        # because UPDATE ... ORDER BY ... LIMIT is a syntax error unless SQLite
        # was compiled with SQLITE_ENABLE_UPDATE_DELETE_LIMIT.
        cursor = conn.execute(
            """UPDATE cooldown_events SET ended_at = ?
               WHERE id = (
                   SELECT id FROM cooldown_events
                   WHERE backend_id = ? AND ended_at IS NULL
                   ORDER BY started_at DESC LIMIT 1
               )""",
            (ended_at, backend_id),
        )
        conn.commit()
        return cursor.rowcount > 0
 def get_cooldown_history(
    backend_id: Optional[str] = None,
    limit: int = 50,
 ) -> list[dict]:
    """Get cooldown event history."""
    with get_connection() as conn:
        if backend_id:
            rows = conn.execute(
                """SELECT * FROM cooldown_events
                   WHERE backend_id = ?
                   ORDER BY started_at DESC LIMIT ?""",
                (backend_id, limit),
            ).fetchall()
        else:
            rows = conn.execute(
                """SELECT * FROM cooldown_events
                   ORDER BY started_at DESC LIMIT ?""",
                (limit,),
            ).fetchall()
    return [dict(row) for row in rows]
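Closing only the most recent open event is the subtle part of this module. `UPDATE ... ORDER BY ... LIMIT` is accepted only by SQLite builds compiled with `SQLITE_ENABLE_UPDATE_DELETE_LIMIT`, so a portable form selects the target row in a subquery. The sketch below assumes an in-memory table with a reduced set of columns; `close_latest` and the event ids are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE cooldown_events (
           id TEXT PRIMARY KEY,
           backend_id TEXT NOT NULL,
           started_at TEXT NOT NULL,
           ended_at TEXT
       )"""
)
conn.executemany(
    "INSERT INTO cooldown_events (id, backend_id, started_at) VALUES (?, ?, ?)",
    [
        ("cev_old", "b1", "2026-06-25T01:00:00Z"),
        ("cev_new", "b1", "2026-06-25T02:00:00Z"),
    ],
)


def close_latest(backend_id: str, ended_at: str) -> bool:
    """Set ended_at on the newest open event; return False if none is open."""
    cursor = conn.execute(
        """UPDATE cooldown_events SET ended_at = ?
           WHERE id = (
               SELECT id FROM cooldown_events
               WHERE backend_id = ? AND ended_at IS NULL
               ORDER BY started_at DESC LIMIT 1
           )""",
        (ended_at, backend_id),
    )
    conn.commit()
    return cursor.rowcount > 0


closed = close_latest("b1", "2026-06-25T03:00:00Z")
still_open = [
    row[0]
    for row in conn.execute("SELECT id FROM cooldown_events WHERE ended_at IS NULL")
]
nothing_to_close = close_latest("unknown-backend", "2026-06-25T03:00:00Z")
```

When no open event exists the subquery yields NULL, the `WHERE id = NULL` comparison matches nothing, and `rowcount` is 0.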
@@ -0,0 +1,193 @@
 """SQLite database connection management with WAL mode."""
 import os
 import sqlite3
 import uuid
 import structlog
 from contextlib import contextmanager
 from typing import Generator
 from config import config
 logger = structlog.get_logger()
 # Module-level DB path
 _DB_PATH: str = ""
 def init_db(db_path: str = "") -> None:
    """Initialize the database connection and ensure WAL mode.
    Creates the data directory if needed and verifies integrity.
    """
    global _DB_PATH
    _DB_PATH = db_path or config.db_path
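    # Ensure data directory exists (a bare filename has no directory part,
    # and os.makedirs("") would raise FileNotFoundError)
    db_dir = os.path.dirname(_DB_PATH)
    if db_dir:
        os.makedirs(db_dir, exist_ok=True)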
    # Test connection and enable WAL
    conn = _get_raw_connection()
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("PRAGMA wal_autocheckpoint=1000")
        conn.execute("PRAGMA foreign_keys=ON")
        conn.execute("PRAGMA busy_timeout=5000")
        logger.info("db_initialized", path=_DB_PATH, mode="WAL")
    finally:
        conn.close()
 def _get_raw_connection() -> sqlite3.Connection:
    """Get a raw sqlite3 connection."""
    conn = sqlite3.connect(_DB_PATH, check_same_thread=False)
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA foreign_keys=ON")
    return conn
@contextmanager
 def get_connection() -> Generator[sqlite3.Connection, None, None]:
    """Get a database connection with WAL enabled."""
    conn = _get_raw_connection()
    try:
        yield conn
    finally:
        conn.close()
 def generate_id(prefix: str = "") -> str:
    """Generate a unique ID with optional prefix."""
    uid = uuid.uuid4().hex[:12]
    return f"{prefix}_{uid}" if prefix else uid
 def create_tables() -> None:
    """Create all tables if they don't exist."""
    with get_connection() as conn:
        conn.executescript(_DDL)
        conn.commit()
        logger.info("tables_created")
 def run_integrity_check() -> bool:
    """Run PRAGMA integrity_check and return True if OK."""
    with get_connection() as conn:
        result = conn.execute("PRAGMA integrity_check").fetchone()
        ok = result[0] == "ok"
        if not ok:
            logger.error("integrity_check_failed", result=result[0])
        return ok
 def get_db_sizes() -> dict:
    """Get database and WAL file sizes."""
    result = {"db_bytes": 0, "wal_bytes": 0}
    db_path = _DB_PATH
    if os.path.exists(db_path):
        result["db_bytes"] = os.path.getsize(db_path)
    wal_path = db_path + "-wal"
    if os.path.exists(wal_path):
        result["wal_bytes"] = os.path.getsize(wal_path)
    return result
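 def wal_checkpoint(mode: str = "TRUNCATE") -> None:
    """Execute WAL checkpoint."""
    # PRAGMA arguments cannot be bound, so allow only known checkpoint modes.
    if mode.upper() not in ("PASSIVE", "FULL", "RESTART", "TRUNCATE"):
        raise ValueError(f"invalid wal_checkpoint mode: {mode!r}")
    with get_connection() as conn:
        conn.execute(f"PRAGMA wal_checkpoint({mode})")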
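PRAGMA statements do not take bound parameters, so a checkpoint mode that comes from a caller is usually checked against a fixed set before it is interpolated into the SQL text. The sketch below is one way to do that, under the assumption of a throwaway database in a temporary directory; `CHECKPOINT_MODES` and `checkpoint` are hypothetical names.

```python
import os
import sqlite3
import tempfile

# The four checkpoint modes documented by SQLite.
CHECKPOINT_MODES = ("PASSIVE", "FULL", "RESTART", "TRUNCATE")


def checkpoint(conn: sqlite3.Connection, mode: str = "TRUNCATE") -> None:
    """Run a WAL checkpoint after validating the mode against the allowlist."""
    if mode not in CHECKPOINT_MODES:
        raise ValueError(f"invalid wal_checkpoint mode: {mode!r}")
    conn.execute(f"PRAGMA wal_checkpoint({mode})").fetchall()


with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "demo.db"))
    # journal_mode returns the mode that is actually in effect.
    journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.commit()
    checkpoint(conn)  # valid mode: runs the checkpoint
    try:
        checkpoint(conn, "TRUNCATE); DROP TABLE t; --")
        rejected = False
    except ValueError:
        rejected = True
    conn.close()
```

WAL is a property of the database file, which is why the sketch uses a file on disk; an in-memory database reports `memory` as its journal mode instead.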
 _DDL = """
 -- Backend configuration table (core)
 CREATE TABLE IF NOT EXISTS backends (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    label TEXT DEFAULT '',
    api_base_url TEXT NOT NULL,
    api_key_encrypted TEXT NOT NULL,
    api TEXT NOT NULL DEFAULT 'openai-completions',
    timeout_seconds INTEGER NOT NULL DEFAULT 120,
    rpm_limit INTEGER NOT NULL DEFAULT 40,
    pool TEXT NOT NULL DEFAULT 'primary'
        CHECK(pool IN ('primary', 'fallback')),
    enabled INTEGER NOT NULL DEFAULT 1,
    status TEXT NOT NULL DEFAULT 'healthy'
        CHECK(status IN ('healthy', 'cooling', 'error', 'disabled')),
    model_mappings_json TEXT DEFAULT '{}',
    source TEXT NOT NULL DEFAULT 'webui'
        CHECK(source IN ('webui', 'env', 'import')),
    cooldown_until TEXT,
    consecutive_429_count INTEGER DEFAULT 0,
    metadata_json TEXT DEFAULT '{}',
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
 );
 -- Usage logs (hour-bucketed, UPSERT-safe)
 CREATE TABLE IF NOT EXISTS backend_usage_logs (
    id TEXT PRIMARY KEY,
    backend_id TEXT NOT NULL REFERENCES backends(id) ON DELETE CASCADE,
    model TEXT DEFAULT 'unknown',
    prompt_tokens INTEGER DEFAULT 0,
    completion_tokens INTEGER DEFAULT 0,
    total_tokens INTEGER DEFAULT 0,
    cost REAL DEFAULT 0.0,
    request_count INTEGER DEFAULT 0,
    error_count INTEGER DEFAULT 0,
    avg_latency_ms INTEGER DEFAULT 0,
    ttft_ms INTEGER DEFAULT 0,
    hour_bucket TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
 );
 CREATE UNIQUE INDEX IF NOT EXISTS idx_usage_backend_hour
    ON backend_usage_logs(backend_id, hour_bucket);
 -- Cooldown event log
 CREATE TABLE IF NOT EXISTS cooldown_events (
    id TEXT PRIMARY KEY,
    backend_id TEXT NOT NULL REFERENCES backends(id) ON DELETE CASCADE,
    consecutive_count INTEGER NOT NULL DEFAULT 1,
    cooldown_seconds INTEGER NOT NULL,
    response_summary TEXT DEFAULT '',
    started_at TEXT NOT NULL DEFAULT (datetime('now')),
    ended_at TEXT
 );
 CREATE INDEX IF NOT EXISTS idx_cooldown_backend_time
    ON cooldown_events(backend_id, started_at);
 -- Backend health state
 CREATE TABLE IF NOT EXISTS backend_health (
    backend_id TEXT PRIMARY KEY REFERENCES backends(id) ON DELETE CASCADE,
    state TEXT NOT NULL DEFAULT 'healthy'
        CHECK(state IN ('healthy', 'degraded', 'down')),
    last_latency_ms INTEGER DEFAULT 0,
    last_status_code INTEGER DEFAULT 200,
    success_rate_5m REAL DEFAULT 1.0,
    consecutive_failures INTEGER DEFAULT 0,
    last_check_at TEXT NOT NULL DEFAULT (datetime('now'))
 );
 -- System configuration KV store
 CREATE TABLE IF NOT EXISTS system_config (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL,
    description TEXT DEFAULT '',
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
 );
 -- Daily aggregated stats
 CREATE TABLE IF NOT EXISTS daily_stats (
    id TEXT PRIMARY KEY,
    date TEXT NOT NULL,
    pool TEXT NOT NULL CHECK(pool IN ('primary', 'fallback')),
    total_requests INTEGER DEFAULT 0,
    total_errors INTEGER DEFAULT 0,
    total_tokens INTEGER DEFAULT 0,
    total_cost REAL DEFAULT 0.0,
    unique_backends INTEGER DEFAULT 0,
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
 );
 CREATE UNIQUE INDEX IF NOT EXISTS idx_daily_date_pool ON daily_stats(date, pool);
 """
@@ -0,0 +1,161 @@
 """Data models for Sidecar V2 — backend-centric, Canonical Name routing."""
 from dataclasses import dataclass, field, asdict
 from typing import Optional
 import json
@dataclass
 class ModelMapping:
    """A single model mapping within a backend: Canonical Name → native_id + properties."""
    native_id: str
    reasoning: bool = False
    reasoning_effort: bool = False
    input_modalities: list[str] = field(default_factory=lambda: ["text"])
    cost: dict = field(default_factory=lambda: {
        "input": 0.0, "output": 0.0, "cacheRead": 0.0, "cacheWrite": 0.0
    })
    context_window: int = 128000
    max_tokens: int = 65536
    compat: dict = field(default_factory=dict)
    def to_dict(self) -> dict:
        return asdict(self)
    @classmethod
    def from_dict(cls, d: dict) -> "ModelMapping":
        defaults = {
            "native_id": "",
            "reasoning": False,
            "reasoning_effort": False,
            "input_modalities": ["text"],
            "cost": {"input": 0.0, "output": 0.0, "cacheRead": 0.0, "cacheWrite": 0.0},
            "context_window": 128000,
            "max_tokens": 65536,
            "compat": {},
        }
        defaults.update(d)
        return cls(**{k: v for k, v in defaults.items() if k in cls.__dataclass_fields__})
@dataclass
 class Backend:
    """A physical API backend (API Key + URL).
    Represents a single API key endpoint. Multiple backends can serve the same
    Canonical Models through their model_mappings.
    """
    id: str = ""
    name: str = ""
    label: str = ""  # e.g., "nvidia", "siliconflow" — WebUI tag only
    api_base_url: str = ""
    api_key_encrypted: str = ""
    api: str = "openai-completions"
    timeout_seconds: int = 120
    rpm_limit: int = 40
    pool: str = "primary"  # primary | fallback
    enabled: bool = True
    status: str = "healthy"  # healthy | cooling | error | disabled
    model_mappings: dict[str, ModelMapping] = field(default_factory=dict)
    source: str = "webui"  # webui | env | import
    cooldown_until: Optional[str] = None
    consecutive_429_count: int = 0
    metadata: dict = field(default_factory=dict)
    created_at: str = ""
    updated_at: str = ""
    # Runtime fields (not persisted)
    api_key_plain: str = ""  # decrypted at load time, not serialized to DB
    def has_model(self, canonical_name: str) -> bool:
        """Check if backend supports a given Canonical Model."""
        return canonical_name in self.model_mappings
    def get_native_id(self, canonical_name: str) -> str:
        """Get this backend's native model ID for a Canonical Name."""
        mm = self.model_mappings.get(canonical_name)
        return mm.native_id if mm else canonical_name
    def get_model_cost(self, canonical_name: str) -> dict:
        """Get cost info for a Canonical Model on this backend."""
        mm = self.model_mappings.get(canonical_name)
        return mm.cost if mm else {"input": 0.0, "output": 0.0, "cacheRead": 0.0, "cacheWrite": 0.0}
    def to_dict(self, mask_key: bool = True) -> dict:
        """Convert to dict for API responses."""
        d = asdict(self)
        # Remove runtime-only fields
        d.pop("api_key_plain", None)
        d.pop("api_key_encrypted", None)
        # Mask API key
        if mask_key and self.api_key_plain:
            d["api_key"] = _mask_key(self.api_key_plain)
        elif self.api_key_plain:
            d["api_key"] = self.api_key_plain
        else:
            d["api_key"] = ""
        # Convert model_mappings to dict for serialization
        d["model_mappings"] = {
            k: v.to_dict() for k, v in self.model_mappings.items()
        }
        return d
 def _mask_key(key: str) -> str:
    if len(key) <= 10:
        return key[:2] + "****"
    return key[:6] + "****" + key[-4:]
@dataclass
 class CooldownEvent:
    id: str = ""
    backend_id: str = ""
    consecutive_count: int = 1
    cooldown_seconds: int = 60
    response_summary: str = ""
    started_at: str = ""
    ended_at: Optional[str] = None
@dataclass
 class BackendHealth:
    backend_id: str = ""
    state: str = "healthy"  # healthy | degraded | down
    last_latency_ms: int = 0
    last_status_code: int = 200
    success_rate_5m: float = 1.0
    consecutive_failures: int = 0
    last_check_at: str = ""
@dataclass
 class UsageLog:
    id: str = ""
    backend_id: str = ""
    model: str = "unknown"
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    cost: float = 0.0
    request_count: int = 0
    error_count: int = 0
    avg_latency_ms: int = 0
    ttft_ms: int = 0
    hour_bucket: str = ""
@dataclass
 class DailyStats:
    id: str = ""
    date: str = ""
    pool: str = "primary"
    total_requests: int = 0
    total_errors: int = 0
    total_tokens: int = 0
    total_cost: float = 0.0
    unique_backends: int = 0
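`ModelMapping.from_dict` drops keys that are not dataclass fields, which lets stored JSON carry extra keys without breaking construction. A reduced sketch of the same idea follows; the `Mapping` class is a hypothetical three-field stand-in and uses `dataclasses.fields` rather than `__dataclass_fields__`.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Mapping:
    """Hypothetical, trimmed-down stand-in for ModelMapping."""

    native_id: str
    reasoning: bool = False
    input_modalities: list[str] = field(default_factory=lambda: ["text"])

    @classmethod
    def from_dict(cls, d: dict) -> "Mapping":
        # Keep only keys that match a declared field; defaults fill the rest.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in d.items() if k in known})


mapping = Mapping.from_dict({"native_id": "vendor/model-x", "legacy_flag": True})
```

The unknown `legacy_flag` key is ignored and `input_modalities` falls back to its default factory, so each instance gets its own list.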
@@ -0,0 +1,155 @@
 """Usage logging and daily statistics aggregation."""
 import time
 from typing import Optional
 from storage.db import get_connection, generate_id
 def record_usage(
    backend_id: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    cost: float,
    latency_ms: int,
    ttft_ms: int = 0,
    is_error: bool = False,
 ) -> None:
    """Record a single request's usage, hour-bucketed with UPSERT."""
    hour_bucket = time.strftime("%Y-%m-%dT%H:00:00Z", time.gmtime())
    uid = generate_id("use")
    with get_connection() as conn:
        # Try update existing hour bucket
        cursor = conn.execute(
            """UPDATE backend_usage_logs SET
               prompt_tokens = prompt_tokens + ?,
               completion_tokens = completion_tokens + ?,
               total_tokens = total_tokens + ?,
               cost = cost + ?,
               request_count = request_count + 1,
               error_count = error_count + ?,
               avg_latency_ms = CAST((avg_latency_ms * request_count + ?) / (request_count + 1) AS INTEGER),
               ttft_ms = CASE WHEN ? > 0 THEN CAST((ttft_ms * request_count + ?) / (request_count + 1) AS INTEGER) ELSE ttft_ms END
               WHERE backend_id = ? AND hour_bucket = ?""",
            (
                prompt_tokens, completion_tokens,
                prompt_tokens + completion_tokens,
                cost,
                1 if is_error else 0,
                latency_ms,
                ttft_ms, ttft_ms,
                backend_id, hour_bucket,
            ),
        )
        if cursor.rowcount == 0:
            # Insert new hour bucket
            conn.execute(
                """INSERT INTO backend_usage_logs
                   (id, backend_id, model, prompt_tokens, completion_tokens,
                    total_tokens, cost, request_count, error_count,
                    avg_latency_ms, ttft_ms, hour_bucket)
                   VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
                (
                    uid, backend_id, model,
                    prompt_tokens, completion_tokens,
                    prompt_tokens + completion_tokens,
                    cost, 1, 1 if is_error else 0,
                    latency_ms, ttft_ms, hour_bucket,
                ),
            )
        conn.commit()
 def get_hourly_usage(
    backend_id: Optional[str] = None,
    since: Optional[str] = None,
    limit: int = 168,
 ) -> list[dict]:
    """Get hourly usage data, optionally filtered by backend and time range."""
    with get_connection() as conn:
        if backend_id and since:
            rows = conn.execute(
                """SELECT * FROM backend_usage_logs
                   WHERE backend_id = ? AND hour_bucket >= ?
                   ORDER BY hour_bucket DESC LIMIT ?""",
                (backend_id, since, limit),
            ).fetchall()
        elif backend_id:
            rows = conn.execute(
                """SELECT * FROM backend_usage_logs
                   WHERE backend_id = ? ORDER BY hour_bucket DESC LIMIT ?""",
                (backend_id, limit),
            ).fetchall()
        elif since:
            rows = conn.execute(
                """SELECT * FROM backend_usage_logs
                   WHERE hour_bucket >= ? ORDER BY hour_bucket DESC LIMIT ?""",
                (since, limit),
            ).fetchall()
        else:
            rows = conn.execute(
                """SELECT * FROM backend_usage_logs
                   ORDER BY hour_bucket DESC LIMIT ?""",
                (limit,),
            ).fetchall()
    return [dict(row) for row in rows]
 def get_total_stats() -> dict:
    """Get aggregate stats across all backends."""
    with get_connection() as conn:
        # An aggregate query without GROUP BY always returns exactly one row;
        # COALESCE turns the NULL sums of an empty table into zeros.
        row = conn.execute(
            """SELECT
               COALESCE(SUM(request_count), 0) as total_requests,
               COALESCE(SUM(error_count), 0) as total_errors,
               COALESCE(SUM(total_tokens), 0) as total_tokens,
               COALESCE(SUM(prompt_tokens), 0) as total_prompt_tokens,
               COALESCE(SUM(completion_tokens), 0) as total_completion_tokens,
               COALESCE(SUM(cost), 0.0) as total_cost
               FROM backend_usage_logs"""
        ).fetchone()
    return dict(row)
 def aggregate_daily_stats(date: str) -> None:
    """Aggregate hourly usage into daily stats table."""
    with get_connection() as conn:
        # Aggregate per pool
        conn.execute("""DELETE FROM daily_stats WHERE date = ?""", (date,))
        conn.execute(
            """INSERT INTO daily_stats (id, date, pool, total_requests,
               total_errors, total_tokens, total_cost, unique_backends)
               SELECT
                   ? || '-' || b.pool,
                   ?,
                   b.pool,
                   SUM(u.request_count),
                   SUM(u.error_count),
                   SUM(u.total_tokens),
                   SUM(u.cost),
                   COUNT(DISTINCT u.backend_id)
               FROM backend_usage_logs u
               JOIN backends b ON u.backend_id = b.id
               WHERE u.hour_bucket LIKE ?
               GROUP BY b.pool""",
            (generate_id("day"), date, date + "%"),
        )
        conn.commit()
 def get_daily_stats(days: int = 30) -> list[dict]:
    """Get daily aggregated stats."""
    with get_connection() as conn:
        rows = conn.execute(
            """SELECT * FROM daily_stats ORDER BY date DESC LIMIT ?""",
            (days,),
        ).fetchall()
    return [dict(row) for row in rows]
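The hour-bucket accumulation in `record_usage` is an update followed by a conditional insert. The same running average can also be expressed as a single `ON CONFLICT` statement against the unique `(backend_id, hour_bucket)` pair. The sketch below is an alternative formulation, not the module's code, and assumes a reduced in-memory table; `usage_logs` and `record` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE usage_logs (
           backend_id TEXT NOT NULL,
           hour_bucket TEXT NOT NULL,
           request_count INTEGER DEFAULT 0,
           avg_latency_ms INTEGER DEFAULT 0,
           UNIQUE (backend_id, hour_bucket)
       )"""
)


def record(backend_id: str, hour_bucket: str, latency_ms: int) -> None:
    """Add one request to the bucket and fold its latency into the average."""
    conn.execute(
        """INSERT INTO usage_logs (backend_id, hour_bucket, request_count, avg_latency_ms)
           VALUES (?, ?, 1, ?)
           ON CONFLICT(backend_id, hour_bucket) DO UPDATE SET
               -- Right-hand sides see the row as it was before this update.
               avg_latency_ms = (avg_latency_ms * request_count + excluded.avg_latency_ms)
                                / (request_count + 1),
               request_count = request_count + 1""",
        (backend_id, hour_bucket, latency_ms),
    )
    conn.commit()


record("b1", "2026-06-25T09:00:00Z", 100)
record("b1", "2026-06-25T09:00:00Z", 200)
bucket_row = conn.execute(
    "SELECT request_count, avg_latency_ms FROM usage_logs"
).fetchone()
```

Because both operands are integers the division truncates, matching the integer `avg_latency_ms` column.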
@@ -1 +0,0 @@
 # nvidia_sidecar tests
@@ -1,207 +0,0 @@
 """
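 Concurrency/deadlock regression tests for retreat mode (BIZ-46 Phase3 §6)
 Covers the thread safety of AdaptiveTokenBucket under multi-threaded load:
 - concurrent record_response + evaluate_retreat
 - concurrent consume + record_response + evaluate_retreat
 - correctness of retreat state transitions under high load
 Design doc: docs/architecture/BIZ-46_Phase3_Architecture_Design.md §6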
 """
 from __future__ import annotations
 import threading
 import time
 import pytest
 from nvidia_sidecar.rate_limiter import AdaptiveTokenBucket, RetreatState
 class TestRetreatConcurrency:
    """Concurrency-safety regression tests for retreat mode."""
    @pytest.mark.asyncio
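    async def test_concurrent_record_and_evaluate(self) -> None:
        """Concurrent record_response + evaluate_retreat from multiple threads must not deadlock.
        Four threads run at the same time:
        - 2 threads call record_response (1000 times each)
        - 2 threads call evaluate_retreat (1000 times each)
        All threads must finish within 10s; otherwise the run is treated as a deadlock.
        """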
        bucket = AdaptiveTokenBucket(rate=40 / 60, capacity=40)
        errors: list[Exception] = []
        def worker_record() -> None:
            for i in range(1000):
                try:
                    bucket.record_response(is_429=(i % 10 == 0))
                except Exception as e:
                    errors.append(e)
        def worker_evaluate() -> None:
            for _ in range(1000):
                try:
                    bucket.evaluate_retreat()
                except Exception as e:
                    errors.append(e)
        threads = [
            threading.Thread(target=worker_record),
            threading.Thread(target=worker_record),
            threading.Thread(target=worker_evaluate),
            threading.Thread(target=worker_evaluate),
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join(timeout=10)
        alive_threads = [t for t in threads if t.is_alive()]
        assert not alive_threads, (
            f"{len(alive_threads)} thread(s) did not finish; possible deadlock"
        )
        assert not errors, f"Concurrency errors: {errors}"
    @pytest.mark.asyncio
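    async def test_concurrent_consume_and_retreat(self) -> None:
        """Concurrent consume + record_response + evaluate_retreat from multiple threads must not deadlock.
        Covers the cross-lock case where _lock (TokenBucket) and _retreat_lock
        (AdaptiveTokenBucket) are held by different threads at the same time.
        """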
        bucket = AdaptiveTokenBucket(rate=40 / 60, capacity=40)
        errors: list[Exception] = []
        def worker_consume() -> None:
            for _ in range(500):
                try:
                    bucket.consume(tokens=1)
                except Exception as e:
                    errors.append(e)
        def worker_retreat() -> None:
            for _ in range(500):
                try:
                    bucket.record_response(is_429=False)
                    bucket.evaluate_retreat()
                except Exception as e:
                    errors.append(e)
        threads = [
            threading.Thread(target=worker_consume),
            threading.Thread(target=worker_consume),
            threading.Thread(target=worker_retreat),
            threading.Thread(target=worker_retreat),
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join(timeout=10)
        alive_threads = [t for t in threads if t.is_alive()]
        assert not alive_threads, (
            f"{len(alive_threads)} thread(s) did not finish; possible deadlock"
        )
        assert not errors, f"Concurrency errors: {errors}"
    @pytest.mark.asyncio
    async def test_retreat_state_transitions_under_load(self) -> None:
        """Retreat state transitions stay correct under high load.
        1. Inject 100 responses with 429 → expect RETREAT
        2. Inject 200 successes → let time advance → expect recovery
        """
        bucket = AdaptiveTokenBucket(
            rate=40 / 60,
            capacity=40,
            retreat_window_seconds=0.1,
            retreat_429_threshold=0.05,
            retreat_factor=0.75,
            retreat_min_rpm=5.0,
            recover_window_seconds=0.01,
        )
        # Phase 1: simulate a high 429 rate
        for _ in range(100):
            bucket.record_response(is_429=True)
        state = bucket.evaluate_retreat()
        assert state == RetreatState.RETREAT, (
            f"A high 429 rate should trigger retreat, got: {state}"
        )
        assert bucket.get_effective_rate_rpm() < bucket.get_base_rate_rpm(), (
            f"The rate after retreat should be below the base rate, got: "
            f"{bucket.get_effective_rate_rpm()} vs {bucket.get_base_rate_rpm()}"
        )
        # Phase 2: simulate recovery
        time.sleep(0.15)  # wait for the 429s to age out of the short window
        for _ in range(200):
            bucket.record_response(is_429=False)
        for _ in range(10):
            state = bucket.evaluate_retreat()
        assert state in (RetreatState.RECOVER, RetreatState.NORMAL), (
            f"Expected RECOVER or NORMAL after recovery, got: {state}"
        )
    @pytest.mark.asyncio
    async def test_try_consume_concurrency_safety(self) -> None:
        """Concurrent try_consume calls must not deadlock."""
        bucket = AdaptiveTokenBucket(rate=40 / 60, capacity=40)
        errors: list[Exception] = []
        results: list[bool] = []
        def worker() -> None:
            for _ in range(200):
                try:
                    got = bucket.try_consume(tokens=1, timeout=0.1)
                    results.append(got)
                except Exception as e:
                    errors.append(e)
        threads = [threading.Thread(target=worker) for _ in range(8)]
        for t in threads:
            t.start()
        for t in threads:
            t.join(timeout=10)
        alive = [t for t in threads if t.is_alive()]
        assert not alive, f"{len(alive)} thread(s) did not finish; possible deadlock"
        assert not errors, f"Concurrency errors: {errors}"
        successful = sum(1 for r in results if r)
        assert successful > 0, (
            f"The token bucket should hand out at least some tokens, succeeded: {successful}/{len(results)}"
        )
    @pytest.mark.asyncio
    async def test_high_load_state_coherence(self) -> None:
        """Token bucket state coherence under high load: total consumed ≤ initial tokens + refill."""
        bucket = AdaptiveTokenBucket(rate=10.0, capacity=100)
        consumed_count: list[int] = [0]
        lock = threading.Lock()
        def worker() -> None:
            local_consumed = 0
            for _ in range(50):
                if bucket.consume(tokens=1):
                    local_consumed += 1
                time.sleep(0.001)
            with lock:
                consumed_count[0] += local_consumed
        threads = [threading.Thread(target=worker) for _ in range(10)]
        for t in threads:
            t.start()
        for t in threads:
            t.join(timeout=15)
        max_expected = 100 + int(10.0 * 5)
        assert consumed_count[0] <= max_expected, (
            f"Unexpected consumed count: {consumed_count[0]}, expected ≤ {max_expected}"
        )
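The tests above share one pattern: start the workers, join each with a timeout, and treat any thread that is still alive as a suspected deadlock. A self-contained sketch of that pattern follows, with a hypothetical lock-protected `Counter` standing in for `AdaptiveTokenBucket` and a hypothetical `run_workers` helper.

```python
import threading
from typing import Callable


class Counter:
    """Hypothetical shared object guarded by a single lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.value = 0

    def incr(self) -> None:
        with self._lock:
            self.value += 1


def run_workers(
    target: Callable[[], None], n_threads: int, timeout: float
) -> list[threading.Thread]:
    """Run target in n_threads threads; return the threads that did not finish."""
    # daemon=True keeps a genuinely stuck thread from blocking interpreter exit.
    threads = [threading.Thread(target=target, daemon=True) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=timeout)
    return [t for t in threads if t.is_alive()]


counter = Counter()


def worker() -> None:
    for _ in range(1000):
        counter.incr()


stuck = run_workers(worker, n_threads=4, timeout=10.0)
```

Note that `join(timeout=...)` is applied per thread, so the worst-case wait is the timeout multiplied by the number of threads.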
@@ -1,325 +0,0 @@
 """
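 NVIDIA Sidecar: WebUI backend API
 Provides dashboard SSE real-time push and a config hot-reload API.
 BIZ-46 Phase3:
 - Architecture decoupling: drop the reverse import of server in favor of Depends(get_context) (§1)
 - Shared SSE cache: snapshot cache with a 1s TTL, so multiple clients do not rebuild it (§3)
 - Dashboard UX: sync config on page load + queue depth in the title (§7)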
 """
 from __future__ import annotations
 import asyncio
 import json
 import os
 import time
 from pathlib import Path
 from typing import Any, AsyncGenerator
 import structlog
 from fastapi import APIRouter, Depends, HTTPException, Request
 from fastapi.responses import HTMLResponse, JSONResponse, StreamingResponse
 from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
 from pydantic import BaseModel
 from nvidia_sidecar.context import SidecarContext
 webui_router: APIRouter = APIRouter(prefix="/api", tags=["webui"])
 logger: structlog.stdlib.BoundLogger = structlog.get_logger("nvidia_sidecar.webui")
 STATIC_DIR: Path = Path(__file__).parent / "static"
 # dashboard.html cache (严维序 review #6 / 梁思筑 review #8: avoid reading from disk on every request)
 _dashboard_html_cache: tuple[str, float] | None = None
 _DASHBOARD_CACHE_TTL: float = 300.0  # 5 minutes
 # Admin API authentication (严维序 review #1)
 _ADMIN_TOKEN: str | None = os.environ.get("SIDECAR_ADMIN_TOKEN")
 _admin_auth_scheme: HTTPBearer = HTTPBearer(auto_error=False)
 def _get_ctx(request: Request) -> SidecarContext:
    """Return the SidecarContext (injected at the webui route level to avoid a circular import of server)."""
    return request.app.state.sidecar  # type: ignore[no-any-return]
 # ---------------------------------------------------------------------------
 # Config hot-reload models
 # ---------------------------------------------------------------------------
 class ConfigPatch(BaseModel):
    """Config fields that can be changed at runtime."""
    rate_rpm: int | None = None
    queue_max_size: int | None = None
    fallback_enabled_passthrough: bool | None = None
 # ---------------------------------------------------------------------------
 # SSE snapshot construction (BIZ-46 Phase3: shared cache with a 1s TTL)
 # ---------------------------------------------------------------------------
 async def _build_snapshot(ctx: SidecarContext) -> dict[str, Any]:
    """Build a snapshot of the current state (read from SidecarContext, including queue depth).
    BIZ-46 Phase3: no longer reaches into server's globals through a reverse import.
    """
    try:
        bucket_status = ctx.token_bucket.get_status()
        now = time.time()
        queue_data: dict[str, Any] = {"current_size": 0, "per_priority": {}}
        try:
            queue_stats = await ctx.priority_queue.get_stats()
            queue_data = {
                "max_size": queue_stats.get("max_size", 0),
                "current_size": queue_stats.get("current_size", 0),
                "per_priority": queue_stats.get("depth_by_priority", {}),
                "total_enqueued": queue_stats.get("total_enqueued", 0),
                "total_dequeued": queue_stats.get("total_dequeued", 0),
                "total_dropped": queue_stats.get("total_dropped", 0),
            }
        except Exception:
            logger.warning(
                "queue_stats_unavailable",
                message="Failed to fetch queue stats; the dashboard queue depth may be inaccurate",
            )
        return {
            "timestamp": now,
            "uptime_seconds": ctx.uptime_seconds,
            "token_bucket": bucket_status,
            "queue": queue_data,
            "retreat": {
                "state": ctx.token_bucket.get_retreat_state(),
                "effective_rpm": round(ctx.token_bucket.get_effective_rate_rpm(), 1),
                "base_rpm": round(ctx.token_bucket.get_base_rate_rpm(), 1),
                "upstream_429_rate": round(ctx.token_bucket.get_429_rate(), 4),
            },
            "requests": {
                "total": ctx.stats.get("total_requests", 0),
                "nvidia": ctx.stats.get("nvidia_requests", 0),
                "passthrough": ctx.stats.get("passthrough_requests", 0),
                "ratelimited": ctx.stats.get("ratelimited_requests", 0),
            },
            "errors": {
                "queue_full_rejects": ctx.stats.get("queue_full_rejects", 0),
                "upstream_errors": ctx.stats.get("upstream_errors", 0),
            },
        }
    except Exception:
        logger.exception("snapshot_build_error")
        return {"error": "snapshot_unavailable", "timestamp": time.time()}
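 async def _build_snapshot_cached(ctx: SidecarContext) -> dict[str, Any]:
    """Shared snapshot cache with a 1s TTL (BIZ-46 Phase3 §3).
    Multiple SSE clients share one snapshot, avoiding repeated computation and lock contention.
    Performance benefit:
    - 1 client: 1 build/s (unchanged)
    - 5 clients: ~5 builds/s → 1 build/s
    - 20 clients: ~20 builds/s → 1 build/s
    """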
    now_cache = time.monotonic()
    if ctx.snapshot_cache is not None:
        data, ts = ctx.snapshot_cache
        if now_cache - ts < ctx.SNAPSHOT_CACHE_TTL:
            return data
    async with ctx.snapshot_cache_lock:
        # Double-check (avoids rebuilding when several coroutines miss at once)
        if ctx.snapshot_cache is not None:
            data, ts = ctx.snapshot_cache
            if now_cache - ts < ctx.SNAPSHOT_CACHE_TTL:
                return data
        snapshot = await _build_snapshot(ctx)
        ctx.snapshot_cache = (snapshot, now_cache)
        return snapshot
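The double-checked TTL cache used by `_build_snapshot_cached` can be exercised outside FastAPI. The sketch below is a generic version under stated assumptions: `TTLCache` and `build` are hypothetical names, and the timestamp is taken after the build finishes rather than before the lock is acquired.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional


class TTLCache:
    """Hypothetical single-value async cache with double-checked locking."""

    def __init__(self, build: Callable[[], Awaitable[Any]], ttl: float = 1.0) -> None:
        self._build = build
        self._ttl = ttl
        self._lock = asyncio.Lock()
        self._entry: Optional[tuple[Any, float]] = None

    def _fresh(self) -> Optional[tuple[Any, float]]:
        entry = self._entry
        if entry is not None and time.monotonic() - entry[1] < self._ttl:
            return entry
        return None

    async def get(self) -> Any:
        entry = self._fresh()  # fast path: no lock taken on a hit
        if entry is not None:
            return entry[0]
        async with self._lock:
            entry = self._fresh()  # re-check: another coroutine may have rebuilt
            if entry is not None:
                return entry[0]
            value = await self._build()
            self._entry = (value, time.monotonic())
            return value


async def demo() -> int:
    calls = 0

    async def build() -> dict:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # simulate an expensive snapshot build
        return {"build": calls}

    cache = TTLCache(build, ttl=1.0)
    await asyncio.gather(*(cache.get() for _ in range(20)))  # 20 concurrent readers
    return calls


build_calls = asyncio.run(demo())
```

All twenty readers miss at first, but only the one holding the lock builds; the rest find a fresh entry on the second check.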
 # ---------------------------------------------------------------------------
 # Dashboard SSE push
 # ---------------------------------------------------------------------------
 async def _dashboard_stream(request: Request, ctx: SidecarContext) -> StreamingResponse:
    """Push a full Sidecar state snapshot over SSE (once per second).
    Consumed by the EventSource in dashboard.html.
    BIZ-46 Phase3: uses the shared cache _build_snapshot_cached, so multiple clients do not recompute.
    """
    async def event_generator() -> AsyncGenerator[str, None]:
        first_frame = True
        while True:
            if await request.is_disconnected():
                break
            try:
                snapshot: dict[str, Any] = await _build_snapshot_cached(ctx)
                payload_sse = f"data: {json.dumps(snapshot, ensure_ascii=False)}\n\n"
                if first_frame:
                    payload_sse = f"retry: 3000\n{payload_sse}"
                    first_frame = False
                yield payload_sse
            except Exception:
                logger.exception("dashboard_sse_error")
                yield f"data: {json.dumps({'error': 'internal'})}\n\n"
            await asyncio.sleep(1.0)
    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
        },
    )
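Each message the generator yields is a Server-Sent Events frame: a `data:` line terminated by a blank line, with an optional `retry:` line on the first frame to set the client's reconnect delay in milliseconds. A small sketch of that framing follows; `sse_frame` is a hypothetical helper, not part of the module.

```python
import json
from typing import Any, Optional


def sse_frame(data: dict[str, Any], retry_ms: Optional[int] = None) -> str:
    """Serialize one SSE message; a blank line marks the end of the event."""
    frame = f"data: {json.dumps(data, ensure_ascii=False)}\n\n"
    if retry_ms is not None:
        # The retry field tells EventSource how long to wait before reconnecting.
        frame = f"retry: {retry_ms}\n{frame}"
    return frame


first = sse_frame({"queue": 3}, retry_ms=3000)
later = sse_frame({"queue": 4})
```

Sending `retry:` once is enough, since the browser keeps the value for later reconnects.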
 # ---------------------------------------------------------------------------
 # Config hot reload
 # ---------------------------------------------------------------------------
 async def get_config(ctx: SidecarContext) -> dict[str, Any]:
    """Return the current full config (read from SidecarContext)."""
    config = ctx.config
    effective_rpm = float(ctx.token_bucket.get_effective_rate_rpm())
    return {
        "listen_host": config.listen_host,
        "listen_port": config.listen_port,
        "metrics_port": config.metrics_port,
        "upstream_url": config.upstream_url,
        "upstream_api_key": _mask_api_key(config.upstream_api_key),
        "rate_rpm": round(effective_rpm, 1),
        "bucket_capacity": config.bucket_capacity,
        "request_timeout": config.request_timeout,
        "queue_max_size": config.queue_max_size,
        "low_priority_timeout": config.low_priority_timeout,
        "fallback_enabled_passthrough": config.fallback_enabled_passthrough,
        "log_level": config.log_level,
    }
 async def update_config(body: ConfigPatch, ctx: SidecarContext) -> JSONResponse:
    """Modify configuration items online; changes take effect immediately."""
    config = ctx.config
    changed: list[str] = []
    if body.rate_rpm is not None:
        if body.rate_rpm <= 0:
            raise HTTPException(status_code=400, detail="rate_rpm must be > 0")
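        # Apply to the bucket first (requests/minute -> requests/second) so that
        # config is not left out of sync with the bucket if set_rate raises.
        ctx.token_bucket.set_rate(body.rate_rpm / 60.0)
        config.rate_rpm = body.rate_rpm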
        changed.append("rate_rpm")
    if body.queue_max_size is not None:
        if body.queue_max_size <= 0:
            raise HTTPException(status_code=400, detail="queue_max_size must be > 0")
        ok, msg = ctx.priority_queue.set_max_size(body.queue_max_size)
        if not ok:
            raise HTTPException(status_code=400, detail=msg)
        config.queue_max_size = body.queue_max_size
        changed.append("queue_max_size")
        logger.info("queue_max_size_updated", detail=msg)
    if body.fallback_enabled_passthrough is not None:
        config.fallback_enabled_passthrough = body.fallback_enabled_passthrough
        changed.append("fallback_enabled_passthrough")
    logger.info("config_updated", changed=changed)
    return JSONResponse(
        content={"status": "ok", "changed": changed},
    )
 def _mask_api_key(key: str) -> str:
    """Mask an API key, keeping only the first 4 characters for identification.
    严维序 review #2 / 沈路明 review #3: prevents the API key from leaking in plaintext.
    """
    if not key:
        return ""
    if len(key) <= 4:
        # Too short to show a prefix without revealing most or all of the key.
        return "****"
    return key[:4] + "****"
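 # Usage sketch (illustrative only; the key below is a made-up value):
 #
 #   _mask_api_key("")                  # returns ""
 #   _mask_api_key("nvapi-0123456789")  # returns "nvap****"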
 # ---------------------------------------------------------------------------
 # Route registration
 # ---------------------------------------------------------------------------
@webui_router.get("/dashboard/stream")
 async def dashboard_stream(
    request: Request,
    ctx: SidecarContext = Depends(_get_ctx),
 ) -> StreamingResponse:
    """SSE dashboard real-time push endpoint (BIZ-46 Phase3: uses the shared cache)."""
    return await _dashboard_stream(request, ctx)
 async def _verify_admin_auth(
    credentials: HTTPAuthorizationCredentials | None = Depends(_admin_auth_scheme),
 ) -> None:
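    """Bearer-token authentication for the Admin API (严维序 review #1).
    If the SIDECAR_ADMIN_TOKEN environment variable is set, requests must carry a matching Bearer token.
    If it is not set, authentication is skipped (development/test environments).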
    """
    if _ADMIN_TOKEN is None:
        return  # No admin token configured; allow unauthenticated access
    if credentials is None:
        raise HTTPException(
            status_code=401,
            detail="Bearer token authentication required (Admin API)",
            headers={"WWW-Authenticate": "Bearer"},
        )
    if credentials.credentials != _ADMIN_TOKEN:
        raise HTTPException(status_code=403, detail="Invalid admin token")
@webui_router.get("/admin/config")
 async def admin_get_config(
    _auth: None = Depends(_verify_admin_auth),
    ctx: SidecarContext = Depends(_get_ctx),
 ) -> JSONResponse:
    """Return the current configuration (requires admin authentication)."""
    return JSONResponse(content=await get_config(ctx))
@webui_router.post("/admin/config")
 async def admin_update_config(
    body: ConfigPatch,
    _auth: None = Depends(_verify_admin_auth),
    ctx: SidecarContext = Depends(_get_ctx),
 ) -> JSONResponse:
    """Modify configuration online (hot reload; requires admin authentication)."""
    return await update_config(body, ctx)
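 # Usage sketch (illustrative only): the host, port, and path prefix below are
 # assumptions that depend on how webui_router is mounted, and the token is a
 # placeholder. Only the fields present in the JSON body are changed.
 #
 #   import httpx
 #
 #   resp = httpx.post(
 #       "http://127.0.0.1:8080/admin/config",
 #       headers={"Authorization": "Bearer <SIDECAR_ADMIN_TOKEN>"},
 #       json={"rate_rpm": 120, "fallback_enabled_passthrough": False},
 #   )
 #   resp.json()  # on success: {"status": "ok", "changed": [...]} listing the fields applied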
 # ---------------------------------------------------------------------------
 # Dashboard static page
 # ---------------------------------------------------------------------------
 def _get_dashboard_html() -> str:
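    """Return the dashboard HTML, with caching (严维序 review #6 / 梁思筑 review #8).
    The content is cached for 5 minutes after the first load to avoid reading from disk on every request.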
    """
    global _dashboard_html_cache
    now = time.monotonic()
    if _dashboard_html_cache is not None:
        cached_content, cached_at = _dashboard_html_cache
        if now - cached_at < _DASHBOARD_CACHE_TTL:
            return cached_content
    dashboard_path = STATIC_DIR / "dashboard.html"
    if dashboard_path.is_file():
        content = dashboard_path.read_text(encoding="utf-8")
        _dashboard_html_cache = (content, now)
        return content
    return "<h1>dashboard.html not found</h1>"
@webui_router.get("/dashboard", include_in_schema=False)
 async def dashboard_page() -> HTMLResponse:
    """Dashboard HTML page (with caching)."""
    return HTMLResponse(content=_get_dashboard_html())