fix(sidecar-v2): second-round review fixes

- cooldown_manager: move function-level imports to module top
- proxy.py: emergency_count counter now actually increments
- server.py: metrics reads emergency_count from proxy module
- dashboard.html: real JS CDN fallback (not just comment)
- requirements.txt: remove unused prometheus_client

Round 2 review residual fixes from 沈路明/陆怀瑾/梁思筑 feedback

Co-authored-by: multica-agent <github@multica.ai>
fix(sidecar-v2): incorporate review feedback - P0/P1 fixes
2026-06-25 17:53:48 +08:00 · 2026-06-25 17:12:33 +08:00 · 2026-06-25 16:39:01 +08:00
38 changed files with 3397 additions and 3937 deletions
@@ -1,644 +0,0 @@
-# BIZ-46 Phase3: NVIDIA Sidecar Follow-up Architecture Design
-
-> **Architect**: 梁思筑 (architect)  
-> **Date**: 2026-06-24  
-> **Status**: Approved, proceeding to implementation  
-> **Source**: BIZ-42 Phase2 second-round review follow-up  
-
---
-
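-## 1. Architecture Decoupling / Dependency Injection: SidecarContext
-
-### 1.1 Current State
-
-`server.py` currently manages all core components through **module-level global variables**:
-
-```python
-# server.py global state (current)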
-_config: SidecarConfig
-_http_client: httpx.AsyncClient
-_priority_queue: PriorityRequestQueue
-_token_bucket: AdaptiveTokenBucket
-_prometheus: PrometheusMetrics
-_health_service: HealthService
-_pending_requests: dict[str, tuple[asyncio.Future, float]]
-_stats: dict[str, int]
-_stats_lock: asyncio.Lock
-```
-
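-**Problems**:
-- `webui.py` reverse-imports the globals via `from nvidia_sidecar import server` (circular-dependency risk)
-- Unit tests have to mock module-level variables, so tests cannot run in parallel
-- A future multi-instance/multi-tenant extension would require rewriting every module's access logic
-
-### 1.2 Design: SidecarContext + FastAPI Dependency Injection
-
-#### 1.2.1 Core Data Structure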
-
-```python
-# context.py
-from dataclasses import dataclass, field
-import asyncio
-import httpx
-from typing import Any
-
-@dataclass
-class SidecarContext:
-    """Sidecar-wide runtime context: the single container for all core components.
-    
-    Injected into FastAPI via app.state.sidecar; routes obtain it through Depends.
-    """
-    config: 'SidecarConfig'
-    http_client: httpx.AsyncClient
-    token_bucket: 'AdaptiveTokenBucket'
-    priority_queue: 'PriorityRequestQueue'
-    prometheus: 'PrometheusMetrics'
-    health: 'HealthService'
-    pending_requests: dict[str, tuple['asyncio.Future', float]] = field(default_factory=dict)
-    stats: dict[str, int] = field(default_factory=lambda: {
-        "total_requests": 0,
-        "nvidia_requests": 0,
-        "passthrough_requests": 0,
-        "ratelimited_requests": 0,
-        "queue_full_rejects": 0,
-        "upstream_errors": 0,
-        "start_time": 0,
-    })
-    stats_lock: asyncio.Lock = field(default_factory=asyncio.Lock)
-    
-    async def increment_stat(self, key: str, delta: int = 1) -> None:
-        """Increment a stats counter under the asyncio lock (coroutine-safe, not thread-safe)."""
-        async with self.stats_lock:
-            self.stats[key] = self.stats.get(key, 0) + delta
-```
-
-#### 1.2.2 Injection
-
-```python
-# server.py: create the context inside lifespan
-from nvidia_sidecar.context import SidecarContext
-
-@asynccontextmanager
-async def lifespan(app: FastAPI):
-    ctx = SidecarContext(
-        config=load_config(),
-        http_client=httpx.AsyncClient(...),
-        token_bucket=AdaptiveTokenBucket(...),
-        priority_queue=PriorityRequestQueue(...),
-        prometheus=PrometheusMetrics(),
-        health=HealthService(),
-    )
-    app.state.sidecar = ctx  # inject into FastAPI
-    # ... start workers ...
-    yield
-    # ... cleanup ...
-
-# Dependency-injection function
-def get_context(request: Request) -> SidecarContext:
-    return request.app.state.sidecar
-
-# Usage in a route
-@app.post("/v1/chat/completions")
-async def chat_completions(request: Request, ctx: SidecarContext = Depends(get_context)):
-    return await _handle_proxy_request(request, "/v1/chat/completions", ctx)
-```
-
-#### 1.2.3 Decoupling webui.py
-
-```python
-# webui.py: no longer reverse-imports server
-from nvidia_sidecar.context import SidecarContext
-from fastapi import Depends
-
-def get_webui_router():
-    router = APIRouter(prefix="/api", tags=["webui"])
-    
-    def _get_ctx(request: Request) -> SidecarContext:
-        return request.app.state.sidecar
-    
-    @router.get("/dashboard/stream")
-    async def dashboard_stream(request: Request, ctx: SidecarContext = Depends(_get_ctx)):
-        return await _dashboard_stream(request, ctx)
-    
-    @router.get("/admin/config")
-    async def admin_get_config(ctx: SidecarContext = Depends(_get_ctx)):
-        return await get_config(ctx)
-    
-    return router
-```
-
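-#### 1.2.4 Trade-off Analysis
-
-| Dimension | Current (globals) | Option A (SidecarContext) | Option B (fully functional FastAPI dependencies) |
-|------|------------------|------------------------|-------------------------------------|
-| Testability | Poor (must mock the module) | Good (inject a mock context) | Excellent (each dependency injected independently) |
-| Change size | None | Medium (~8 files) | Large (every function signature changes) |
-| Readability | Fair | Good (ctx makes dependencies obvious) | Poor (parameter lists balloon) |
-| Multi-instance support | No | Yes (multiple apps, multiple contexts) | Yes |
-| Circular dependency | Present (webui→server) | Eliminated | Eliminated |
-
-**Decision**: adopt Option A (SidecarContext), which balances change size against benefit.
-
-### 1.3 Migration Plan
-
-Migrate incrementally in 3 steps, each of which can be merged on its own:
-
-1. **Step 1**: Create `context.py`, define `SidecarContext`, instantiate it in `lifespan`, and attach it to `app.state`
-2. **Step 2**: Switch route functions to `Depends(get_context)` and delete the module-level `_config`, `_http_client`, etc.
-3. **Step 3**: Remove `from nvidia_sidecar import server` from `webui.py` and use dependency injection instead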
-
---
-
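-## 2. Prometheus Label Cardinality Governance
-
-### 2.1 Current State
-
-Metrics that currently use `model_id` as a label:
-
-| Metric | Label | Risk |
-|------|-------|------|
-| `sidecar_upstream_latency_seconds` | `model_id` | **High**: NVIDIA model names include version numbers and can grow without bound |
-| `sidecar_upstream_errors_total` | `status_code`, `model_id` | **Medium**: combined cardinality = number of models × number of status codes |
-
-### 2.2 Cardinality Assessment
-
-The NVIDIA API currently has roughly 20-30 known models, but:
-- New models keep being released (2-5 per month)
-- Model names carry version suffixes (`nvidia/deepseek-ai/deepseek-v4-pro`, `nvidia/llama-3.1-70b-instruct`, etc.)
-- A long-running process (6+ months) may accumulate 100+ label combinations
-
-**Conclusion**: cardinality is manageable today (<200 combinations), but it can grow over the long term, so it should be governed ahead of time.
-
-### 2.3 Governance Plan
-
-| Metric | Current Label | Adjusted Label | Rationale |
-|------|-----------|-------------|------|
-| `sidecar_upstream_latency_seconds` | `model_id` | `provider` | provider is fixed to `nvidia`, so cardinality = 1 |
-| `sidecar_upstream_errors_total` | `status_code`, `model_id` | `status_code`, `provider` | Same as above |
-
-**Migration path for model-level information**:
-- Model ID → structured JSON logs (already supported by structlog)
-- When model-level latency analysis is needed → ad hoc `/status` API queries or log aggregation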
-
-```python
-# metrics.py changes
-self.upstream_latency_seconds: Histogram = Histogram(
-    "sidecar_upstream_latency_seconds",
-    "Upstream response latency in seconds",
-    labelnames=["provider"],  # was: ["model_id"]
-    buckets=(...),
-)
-
-self.upstream_errors_total: Counter = Counter(
-    "sidecar_upstream_errors_total",
-    "Upstream error count by status code",
-    labelnames=["status_code", "provider"],  # was: ["status_code", "model_id"]
-)
-```
-
-```python
-# server.py changes: model information moves to the logs
-model_id = _extract_model(payload) or "unknown"
-provider = "nvidia"  # fixed value, because only NVIDIA requests go through the worker
-_prometheus.record_upstream_latency(provider, upstream_latency)
-if not resp.is_success:
-    _prometheus.record_upstream_error(resp.status_code, provider)
-logger.info("request_completed", model_id=model_id, ...)  # the JSON log keeps the model information
-```
-
-### 2.4 Trade-off
-
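-| Dimension | Keep model_id | Converge to provider |
-|------|--------------|----------------|
-| Cardinality risk | High (unbounded) | None (fixed at 1) |
-| Model-level analysis | Native Prometheus queries | Requires log aggregation |
-| Migration cost | None | Low (change 2 metric definitions + call sites) |
-
-**Decision**: converge to `provider`; model-level analysis is done through JSON logs plus a log aggregation system (ELK/Loki).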
-
---
-
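-## 3. Shared SSE Snapshot Cache
-
-### 3.1 Current State
-
-Every SSE client independently calls `_build_snapshot()` once per second. That method:
-- Reads the `_stats` dict (requires a lock)
-- Calls `_token_bucket.get_status()` (requires a lock)
-- Calls `_priority_queue.get_stats()` (requires an asyncio.Lock)
-
-With N dashboards open at once, that is N lock contentions plus N redundant computations every second.
-
-### 3.2 Design: Shared Cache with a 1s TTL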
-
-```python
-# webui.py
-_snapshot_cache: tuple[dict[str, Any], float] | None = None  # (data, timestamp)
-_snapshot_lock: asyncio.Lock = asyncio.Lock()
-_SNAPSHOT_TTL: float = 1.0  # 1-second TTL
-
-async def _build_snapshot_cached(ctx: SidecarContext) -> dict[str, Any]:
-    """Shared snapshot cache with a 1s TTL.
-    
-    Multiple SSE clients share one snapshot, avoiding redundant computation and lock contention.
-    """
-    global _snapshot_cache
-    
-    now = time.monotonic()
-    if _snapshot_cache is not None:
-        data, ts = _snapshot_cache
-        if now - ts < _SNAPSHOT_TTL:
-            return data
-    
-    async with _snapshot_lock:
-        # Double-check (so coroutines that missed at the same time do not each rebuild)
-        if _snapshot_cache is not None:
-            data, ts = _snapshot_cache
-            if now - ts < _SNAPSHOT_TTL:
-                return data
-        
-        snapshot = await _build_snapshot(ctx)
-        _snapshot_cache = (snapshot, now)
-        return snapshot
-```
-
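-### 3.3 Performance Gain
-
-| Scenario | Current | After optimization |
-|------|------|--------|
-| 1 client | 1 computation/s | 1 computation/s (no change) |
-| 5 clients | 5 computations/s, 5 lock contentions | 1 computation/s, 1 lock contention |
-| 20 clients | 20 computations/s, 20 lock contentions | 1 computation/s, 1 lock contention |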
-
---
-
-## 4. Deployment Support
-
-### 4.1 Dockerfile
-
-```dockerfile
-# services/nvidia_sidecar/Dockerfile
-FROM python:3.12-slim AS base
-
-WORKDIR /app
-
-# Install dependencies (leveraging Docker layer caching)
-COPY pyproject.toml .
-RUN pip install --no-cache-dir -e .
-
-# Copy the source code
-COPY . .
-
-# Run as a non-root user
-RUN useradd -r -s /bin/false sidecar
-USER sidecar
-
-# Health check
-HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
-    CMD python -c "import httpx; r=httpx.get('http://127.0.0.1:9190/health'); exit(0 if r.status_code==200 else 1)"
-
-EXPOSE 9190 9191
-
-CMD ["uvicorn", "nvidia_sidecar.server:app", "--host", "0.0.0.0", "--port", "9190"]
-```
-
-### 4.2 systemd Service
-
-```ini
-# services/nvidia_sidecar/deploy/nvidia-sidecar.service
-[Unit]
-Description=NVIDIA Sidecar Rate-Limiting Proxy
-After=network-online.target
-Wants=network-online.target
-
-[Service]
-Type=simple
-User=sidecar
-Group=sidecar
-WorkingDirectory=/opt/nvidia-sidecar
-ExecStart=/opt/nvidia-sidecar/.venv/bin/uvicorn nvidia_sidecar.server:app \
-    --host 127.0.0.1 \
-    --port 9190 \
-    --log-level info
-Restart=always
-RestartSec=5
-
-# Environment variables
-EnvironmentFile=/opt/nvidia-sidecar/.env
-
-# Security hardening
-NoNewPrivileges=true
-ProtectSystem=strict
-ProtectHome=true
-PrivateTmp=true
-ReadWritePaths=/opt/nvidia-sidecar/logs
-
-# Resource limits
-LimitNOFILE=65536
-MemoryMax=512M
-
-[Install]
-WantedBy=multi-user.target
-```
-
-### 4.3 Environment Variables
-
-| Variable | Default | Description |
-|------|--------|------|
-| `SIDECAR_HOST` | `127.0.0.1` | Listen address |
-| `SIDECAR_PORT` | `9190` | Proxy port |
-| `SIDECAR_METRICS_PORT` | `9191` | Prometheus metrics port |
-| `SIDECAR_UPSTREAM` | `https://integrate.api.nvidia.com/v1` | Upstream API |
-| `SIDECAR_API_KEY` | (required) | NVIDIA API Key |
-| `SIDECAR_RATE_RPM` | `40` | Rate limit (RPM) |
-| `SIDECAR_BUCKET_CAPACITY` | `40` | Token bucket capacity |
-| `SIDECAR_TIMEOUT` | `60` | Request timeout (seconds) |
-| `SIDECAR_QUEUE_MAX` | `500` | Maximum queue capacity |
-| `SIDECAR_LOW_TIMEOUT` | `2` | Low-priority timeout (seconds) |
-| `SIDECAR_FALLBACK_PASSTHROUGH` | `true` | Whether to pass through when the queue is full |
-| `SIDECAR_LOG_LEVEL` | `INFO` | Log level |
-| `SIDECAR_ADMIN_TOKEN` | (optional) | Admin API auth token |
-
-### 4.4 Firewall Recommendations
-
-```
-# Allow only the internal network to reach the proxy ports
-sudo ufw allow from 192.168.1.0/24 to any port 9190
-sudo ufw allow from 192.168.1.0/24 to any port 9191
-# Block external access
-sudo ufw deny 9190
-sudo ufw deny 9191
-```
-
---
-
-## 5. Readiness HTTP Client Reuse
-
-### 5.1 Current State
-
-`HealthService.check_upstream()` creates a new `httpx.AsyncClient` on every call:
-
-```python
-# health.py (current)
-async def check_upstream(self, upstream_url: str, timeout: float = 5.0, api_key: str = "") -> bool:
-    async with httpx.AsyncClient(timeout=timeout) as client:  # a new client every time!
-        resp = await client.get(...)
-```
-
-K8s/systemd probes every 10-30s, and creating and tearing down an HTTP client on each probe adds unnecessary TCP connection overhead.
-
-### 5.2 Approach: Reuse the Main http_client
-
-```python
-# health.py (optimized)
-async def check_upstream(
-    self,
-    upstream_url: str,
-    http_client: httpx.AsyncClient,  # the main client, injected
-    api_key: str = "",
-    timeout: float = 5.0,
-) -> bool:
-    try:
-        headers = {}
-        if api_key:
-            headers["authorization"] = f"Bearer {api_key}"
-        resp = await http_client.get(
-            f"{upstream_url.rstrip('/')}/v1/models",
-            headers=headers,
-            timeout=timeout,
-        )
-        return resp.status_code < 500
-    except Exception:
-        return False
-```
-
-```python
-# server.py: call site in the route
-@app.get("/health/ready")
-async def health_ready(ctx: SidecarContext = Depends(get_context)):
-    queue_size = await ctx.priority_queue.get_queue_size()
-    bucket_status = ctx.token_bucket.get_status()
-    return await ctx.health.readiness(
-        upstream_url=ctx.config.upstream_url,
-        http_client=ctx.http_client,  # reuse the main client
-        upstream_api_key=ctx.config.upstream_api_key or "",
-        queue_current_size=queue_size,
-        queue_max_size=ctx.config.queue_max_size,
-        available_tokens=bucket_status["tokens"],
-        bucket_capacity=bucket_status["capacity"],
-    )
-```
-
-**Note**: the readiness check uses a shorter timeout (5s) and does not affect the timeout configured for main proxy requests; httpx supports per-request timeout overrides.
-
---
-
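-## 6. Retreat Concurrency/Deadlock Regression Tests
-
-### 6.1 Risk Points
-
-`AdaptiveTokenBucket` has two locks:
-- `_lock` (Lock): protects token consumption/refill
-- `_retreat_lock` (RLock): protects the retreat state machine
-
-Potential deadlock paths:
-1. `evaluate_retreat()` holds `_retreat_lock` → calls `get_429_rate()` (which also acquires `_retreat_lock`; RLock is reentrant ✅)
-2. `evaluate_retreat()` → `_apply_retreat()` → `set_rate()` → acquires `_lock` (the other lock)
-3. Worker thread: `consume()` holds `_lock` → never acquires `_retreat_lock` (no crossover ✅)
-
-The current design already avoids reentrancy deadlock by using an RLock, but regression tests are needed to make sure future changes do not introduce one.
-
-### 6.2 Test Cases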
-
-```python
-# tests/test_retreat_concurrency.py
-import pytest
-import asyncio
-import threading
-from nvidia_sidecar.rate_limiter import AdaptiveTokenBucket, RetreatState
-
-class TestRetreatConcurrency:
-    """Concurrency-safety regression tests for retreat mode."""
-
-    @pytest.mark.asyncio
-    async def test_concurrent_record_and_evaluate(self):
-        """Concurrent record_response + evaluate_retreat across threads must not deadlock."""
-        bucket = AdaptiveTokenBucket(rate=40/60, capacity=40)
-        errors: list[Exception] = []
-        
-        def worker_record():
-            for i in range(1000):
-                try:
-                    bucket.record_response(is_429=(i % 10 == 0))
-                except Exception as e:
-                    errors.append(e)
-        
-        def worker_evaluate():
-            for _ in range(1000):
-                try:
-                    bucket.evaluate_retreat()
-                except Exception as e:
-                    errors.append(e)
-        
-        threads = [
-            threading.Thread(target=worker_record),
-            threading.Thread(target=worker_record),
-            threading.Thread(target=worker_evaluate),
-            threading.Thread(target=worker_evaluate),
-        ]
-        for t in threads:
-            t.start()
-        for t in threads:
-            t.join(timeout=10)
-        
-        # Every thread must finish within its 10s join timeout (no deadlock)
-        assert all(not t.is_alive() for t in threads), "threads did not finish; suspected deadlock"
-        assert not errors, f"concurrency errors: {errors}"
-
-    @pytest.mark.asyncio
-    async def test_concurrent_consume_and_retreat(self):
-        """Concurrent consume + evaluate_retreat across threads must not deadlock."""
-        bucket = AdaptiveTokenBucket(rate=40/60, capacity=40)
-        errors: list[Exception] = []
-        
-        def worker_consume():
-            for _ in range(500):
-                try:
-                    bucket.consume(tokens=1)
-                except Exception as e:
-                    errors.append(e)
-        
-        def worker_retreat():
-            for _ in range(500):
-                try:
-                    bucket.record_response(is_429=False)
-                    bucket.evaluate_retreat()
-                except Exception as e:
-                    errors.append(e)
-        
-        threads = [
-            threading.Thread(target=worker_consume),
-            threading.Thread(target=worker_consume),
-            threading.Thread(target=worker_retreat),
-            threading.Thread(target=worker_retreat),
-        ]
-        for t in threads:
-            t.start()
-        for t in threads:
-            t.join(timeout=10)
-        
-        assert all(not t.is_alive() for t in threads), "threads did not finish; suspected deadlock"
-        assert not errors, f"concurrency errors: {errors}"
-
-    def test_retreat_state_transitions_under_load(self):
-        """Retreat state transitions are correct under heavy load."""
-        bucket = AdaptiveTokenBucket(
-            rate=40/60, capacity=40,
-            retreat_429_threshold=0.05,
-            retreat_factor=0.75,
-        )
-        
-        # Simulate a high 429 rate
-        for _ in range(100):
-            bucket.record_response(is_429=True)
-        
-        state = bucket.evaluate_retreat()
-        assert state == RetreatState.RETREAT
-        assert bucket.get_effective_rate_rpm() < bucket.get_base_rate_rpm()
-        
-        # Simulate recovery
-        for _ in range(200):
-            bucket.record_response(is_429=False)
-        
-        # Must wait out RECOVER_WINDOW
-        import time
-        time.sleep(0.1)  # make sure the time window has elapsed
-        bucket._last_state_change = 0  # force the time condition to trigger
-        state = bucket.evaluate_retreat()
-        assert state in (RetreatState.RECOVER, RetreatState.NORMAL)
-```
-
---
-
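-## 7. Dashboard UX Improvements
-
-### 7.1 Improvement List
-
-| # | Improvement | Implementation | Priority |
-|---|--------|---------|--------|
-| 1 | 300ms smooth animation for the queue bar chart | CSS `transition: height 300ms ease` | P1 |
-| 2 | Mask after 5s of SSE disconnection | JS timer + DOM mask layer | P1 |
-| 3 | Queue chart title shows the total queued count | SSE data already includes `current_size`; update the title | P2 |
-| 4 | Sync configuration on page load | `fetch('/api/admin/config')` to initialize the form | P2 |
-
-### 7.2 Key Implementation
-
-```javascript
-// dashboard.html: SSE disconnect detection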
-let lastSSETime = Date.now();
-let reconnectMask = document.getElementById('reconnect-mask');
-
-eventSource.onmessage = (event) => {
-    lastSSETime = Date.now();
-    reconnectMask.style.display = 'none';
-    // ... update the UI ...
-};
-
-// No data for 5s → show the mask
-setInterval(() => {
-    if (Date.now() - lastSSETime > 5000) {
-        reconnectMask.style.display = 'flex';
-    }
-}, 1000);
-
-// Queue bar chart animation
-// CSS: .queue-bar { transition: height 0.3s ease; }
-```
-
-```javascript
-// Sync configuration on page load
-async function loadConfig() {
-    try {
-        const resp = await fetch('/api/admin/config');
-        if (resp.ok) {
-            const config = await resp.json();
-            document.getElementById('rate-rpm').value = config.rate_rpm;
-            document.getElementById('queue-max').value = config.queue_max_size;
-            // ...
-        }
-    } catch (e) {
-        console.warn('Failed to load config (an Admin Token may be required)', e);
-    }
-}
-loadConfig();
-```
-
---
-
-## 8. Implementation Schedule
-
-| Phase | Content | Estimated effort | Depends on |
-|------|------|---------|------|
-| **D1** | SidecarContext Steps 1-3 (decoupling migration) | 8h | None |
-| **D2** | Prometheus label convergence + logging enhancements | 2h | D1 |
-| **D2** | SSE shared cache | 2h | D1 |
-| **D2** | Readiness HTTP client reuse | 1h | D1 |
-| **D3** | Dockerfile + systemd service | 2h | None |
-| **D3** | Dashboard UX improvements | 3h | None |
-| **D3** | Retreat concurrency regression tests | 3h | None |
-| **D4** | Integration tests + mypy strict | 4h | D1-D3 |
-| **Total** | | **25h** | |
-
---
-
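-## 9. Acceptance Criteria Mapping
-
-| Issue requirement | Section in this document | Status |
-|-----------|-----------|------|
-| SidecarContext / DI design implemented or ADR | §1 | ✅ Detailed design + migration plan |
-| Prometheus high-cardinality label convergence | §2 | ✅ Converged to provider |
-| SSE snapshot shared cache | §3 | ✅ 1s TTL design |
-| Dockerfile + systemd + deployment SOP | §4 | ✅ Complete files |
-| Readiness reuses the HTTP client | §5 | ✅ Main client injected |
-| Retreat concurrency/deadlock regression tests | §6 | ✅ Test cases |
-| Dashboard UX details | §7 | ✅ 4 improvements |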
@@ -1,3 +0,0 @@
-__pycache__/
-*.egg-info/
-.mypy_cache/
@@ -1,40 +1,46 @@
-# NVIDIA Sidecar rate-limiting proxy: production Docker image (BIZ-46 Phase3 §4)
-#
-# Build:
-#   docker build -t nvidia-sidecar:latest .
-#
-# Run:
-#   docker run -d --name nvidia-sidecar \
-#     -p 127.0.0.1:9190:9190 \
-#     -p 127.0.0.1:9191:9191 \
-#     -e SIDECAR_API_KEY="nvapi-xxx" \
-#     -e SIDECAR_RATE_RPM=40 \
-#     -v $(pwd)/logs:/opt/nvidia-sidecar/logs \
-#     nvidia-sidecar:latest
-
-FROM python:3.12-slim AS base
+# Sidecar V2 — Multi-Pool Provider Proxy
+FROM python:3.12-slim AS builder

 WORKDIR /app

-# Install dependencies (leveraging Docker layer caching)
-COPY pyproject.toml .
-RUN pip install --no-cache-dir fastapi>=0.115 \
-    "uvicorn[standard]>=0.34" httpx>=0.28 PyYAML>=6.0 \
-    structlog>=24.4 "prometheus-client>=0.21" pydantic>=2.0
+# Install dependencies
+COPY requirements.txt .
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir -r requirements.txt

-# Copy the source code
-COPY . .
+# Copy application code
+COPY config.py crypto.py main.py server.py proxy.py router.py \
+     pool_manager.py cooldown_manager.py rate_limiter.py __init__.py \
+     dashboard.html ./
+COPY storage/ ./storage/

-# Run as a non-root user
-RUN useradd -r -m -s /bin/false sidecar \
-    && mkdir -p /opt/nvidia-sidecar/logs \
-    && chown -R sidecar:sidecar /app /opt/nvidia-sidecar/logs
-USER sidecar
+# Create data directory
+RUN mkdir -p /app/data /app/data/backups

-# Health check
-HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
-    CMD python -c "import httpx; r=httpx.get('http://127.0.0.1:9190/health'); exit(0 if r.status_code==200 else 1)"
+FROM python:3.12-slim
+
+WORKDIR /app
+
+# Copy built artifacts
+COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
+COPY --from=builder /app /app
+
+# Environment
+ENV SIDECAR_HOST=0.0.0.0
+ENV SIDECAR_PORT=9190
+ENV SIDECAR_METRICS_PORT=9191
+ENV SIDECAR_DB_PATH=/app/data/sidecar_v2.db
+ENV SIDECAR_BACKUP_DIR=/app/data/backups
+ENV SIDECAR_ENCRYPTION_KEY=
+ENV SIDECAR_ADMIN_TOKEN=
+ENV LOG_FORMAT=json
+ENV PYTHONUNBUFFERED=1

 EXPOSE 9190 9191

-CMD ["uvicorn", "nvidia_sidecar.server:app", "--host", "0.0.0.0", "--port", "9190"]
+VOLUME ["/app/data"]
+
+HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
+    CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:9190/health')" || exit 1
+
+ENTRYPOINT ["python3", "main.py"]
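Reviewer sketch: the new image declares `SIDECAR_ENCRYPTION_KEY` with an empty default, while the README in this commit lists it as required and 64 hex characters long. How `config.py` or `crypto.py` actually validates the key is not visible in this diff, so the following is only an assumed start-up guard; `load_encryption_key` is a hypothetical helper name, not a function from the repository.

```python
import os
import string


def load_encryption_key(env_var: str = "SIDECAR_ENCRYPTION_KEY") -> bytes:
    """Return the 32-byte key stored as 64 hex characters in `env_var` (hypothetical helper)."""
    raw = os.environ.get(env_var, "").strip()
    if not raw:
        # The Dockerfile default is empty, so the key has to be supplied at run time.
        raise ValueError(f"{env_var} is empty; pass it at run time, e.g. docker run -e")
    if len(raw) != 64 or any(c not in string.hexdigits for c in raw):
        raise ValueError(f"{env_var} must be exactly 64 hexadecimal characters")
    return bytes.fromhex(raw)  # 64 hex characters decode to 32 bytes (an AES-256 key size)
```

Failing fast at start-up like this would surface a missing key as a clear error instead of a later encryption failure; whether the service should refuse to start or fall back to another mode is a design decision left to the authors.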
@@ -1,118 +1,77 @@
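-# NVIDIA Sidecar Rate-Limiting Proxy
+# Sidecar V2 — Multi-Pool Provider Proxy

-A transparent proxy layer that provides **priority queuing + token-bucket rate limiting** for the NVIDIA API.
+## Overview
+Sidecar V2 is OpenClaw's API proxy service. It implements multi-provider pool management, load balancing, 429 cooldown, and RPM-based queue flow control.

-> BIZ-46 Phase3: architecture decoupling, Prometheus label governance, SSE shared cache, deployment support, test improvements, Dashboard UX improvements.
+## Core Features
+- **Provider pool management**: primary pool + fallback pool, with providers added and removed dynamically
+- **429 cooldown**: detect 429 → automatic cooldown → exponential backoff → automatic recovery
+- **Independent per-provider RPM limiting**: each provider has its own token bucket
+- **Routing strategy**: primary pool first → fallback pool as backstop → return 503 when all are exhausted
+- **WebUI management**: dashboard + provider CRUD
+- **Usage statistics**: token usage + cost statistics + hourly/daily aggregation
+- **API key encryption**: encrypted at rest with AES-256-GCM

-## Quick Start
+## Architecture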

-```bash
-pip install .
-nvidia-sidecar
+```
+OpenClaw → Sidecar V2 (port 9190) → router → primary pool Provider 1,2,3...
+                                          ↘ fallback pool Provider 4,5...
+                                          ↘ all exhausted → 503
 ```

-Listens on `127.0.0.1:9190` and proxies to the NVIDIA API.
+## Quick Start
+
+```bash
+# Set the encryption key (64 hex characters)
+export SIDECAR_ENCRYPTION_KEY="0000111122223333444455556666777788889999aaaabbbbccccddddeeeeffff"
+
+# Start the service
+python3 main.py
+
+# Or via uvicorn
+python3 -m uvicorn server:app --host 127.0.0.1 --port 9190
+```
+
+## WebUI
+Open http://127.0.0.1:9190/dashboard
+
+## API Endpoints
+
+### Admin API
+- `GET  /api/admin/backends`: list all providers
+- `POST /api/admin/backends`: add a provider
+- `PUT  /api/admin/backends/{id}`: update a provider
+- `DELETE /api/admin/backends/{id}`: delete a provider
+- `GET  /api/admin/pools`: pool status summary
+- `GET  /api/admin/stats/total`: overall totals
+- `GET  /api/admin/stats/hourly`: hourly usage
+- `GET  /api/admin/stats/daily`: daily aggregates
+- `GET  /api/admin/stats/cooldown`: cooldown event history
+- `GET  /api/admin/config`: system configuration
+
+### Proxy API (OpenAI-compatible)
+- `POST /v1/chat/completions`
+- `POST /v1/completions`
+- `POST /v1/embeddings`
+- `GET  /v1/models`
+
+### Monitoring
+- `GET /health`: health check
+- `GET /dashboard/sse`: real-time dashboard data stream (SSE)

 ## Environment Variables

 | Variable | Default | Description |
 |------|--------|------|
-| `SIDECAR_HOST` | `127.0.0.1` | Listen address |
-| `SIDECAR_PORT` | `9190` | Listen port |
-| `SIDECAR_METRICS_PORT` | `9191` | Metrics port |
-| `SIDECAR_UPSTREAM` | `https://integrate.api.nvidia.com/v1` | Upstream API address |
-| `SIDECAR_API_KEY` | — | NVIDIA API Key (required) |
-| `SIDECAR_RATE_RPM` | `40` | Requests-per-minute limit |
-| `SIDECAR_BUCKET_CAPACITY` | `40` | Token bucket capacity |
-| `SIDECAR_TIMEOUT` | `60` | Upstream request timeout (seconds) |
-| `SIDECAR_QUEUE_MAX` | `500` | Maximum queue length |
-| `SIDECAR_LOW_TIMEOUT` | `2.0` | Low-priority token wait timeout (seconds) |
-| `SIDECAR_FALLBACK_PASSTHROUGH` | `true` | Whether to pass through to upstream when the queue is full |
-| `SIDECAR_LOG_LEVEL` | `INFO` | Log level |
+| SIDECAR_HOST | 127.0.0.1 | Listen address |
+| SIDECAR_PORT | 9190 | Listen port |
+| SIDECAR_ENCRYPTION_KEY | (required) | API key encryption key (64 hex chars) |
+| SIDECAR_DB_PATH | ./data/sidecar_v2.db | SQLite database path |
+| SIDECAR_RATE_RPM | 40 | Default RPM limit |
+| SIDECAR_COOLDOWN_BASE | 30 | Base cooldown duration (seconds) |
+| SIDECAR_COOLDOWN_MAX | 600 | Maximum cooldown duration (seconds) |

-## YAML Configuration
-
-```yaml
-listen_port: 9292
-rate_rpm: 60
-upstream_api_key: "nvapi-xxx"
-```
-
-```bash
-nvidia-sidecar --config /etc/nvidia-sidecar.yaml
-```
-
-## API Endpoints
-
-| Path | Method | Description |
-|------|------|------|
-| `/v1/chat/completions` | POST | OpenAI Chat Completions proxy |
-| `/v1/completions` | POST | OpenAI Completions proxy (legacy) |
-| `/v1/embeddings` | POST | OpenAI Embeddings proxy |
-| `/v1/models` | GET | Model list proxy |
-| `/health` | GET | Liveness check |
-| `/health/ready` | GET | Readiness check (includes upstream connectivity) |
-| `/status` | GET | Full state for debugging (rate limiter + queue + retreat) |
-| `/api/dashboard/stream` | GET | SSE real-time dashboard push |
-| `/api/dashboard` | GET | Dashboard HTML page |
-| `/api/admin/config` | GET/POST | Config query / hot reload (requires Admin Token) |
-| `/metrics` | :9191 | Prometheus metrics endpoint (separate port) |
-
-## Deployment
-
-### Docker (recommended)
-
-```bash
-# Build
-docker build -t nvidia-sidecar:latest .
-
-# Run
-docker run -d --name nvidia-sidecar \
-  -p 127.0.0.1:9190:9190 \
-  -p 127.0.0.1:9191:9191 \
-  -e SIDECAR_API_KEY="nvapi-xxx" \
-  nvidia-sidecar:latest
-```
-
-### systemd
-
-```bash
-# Install
-sudo cp deploy/nvidia-sidecar.service /etc/systemd/system/
-sudo systemctl daemon-reload
-sudo systemctl enable nvidia-sidecar
-
-# Configure environment variables
-sudo cp deploy/.env.example /opt/nvidia-sidecar/.env
-sudo vim /opt/nvidia-sidecar/.env  # fill in the real values
-
-# Start
-sudo systemctl start nvidia-sidecar
-sudo journalctl -u nvidia-sidecar -f  # follow the logs
-```
-
-### Environment Variables
-
-See `deploy/.env.example` for details.
-
-### Firewall Recommendations
-
-```bash
-# Allow only the internal network to reach the proxy ports
-sudo ufw allow from 192.168.1.0/24 to any port 9190
-sudo ufw allow from 192.168.1.0/24 to any port 9191
-# Block external access
-sudo ufw deny 9190
-sudo ufw deny 9191
-```
-
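-## Architecture
-
-```
-Request → gateway identification → [NVIDIA: priority queuing → token-bucket rate limiting] → httpx → NVIDIA API
-                                 → [non-NVIDIA: passthrough] → httpx → upstream
-```
-
-- **Four priority levels**: URGENT > HIGH > NORMAL > LOW (set via the `X-Priority` header)
-- **Queue-full policy**: PASSTHROUGH / REJECT (503) / DROP_LOWEST (drop the lowest priority)
-- **Token bucket**: 40 RPM, thread-safe, supports blocking and non-blocking consumption
+## Storage
+- SQLite (WAL mode)
+- Tables: backends, backend_usage_logs, cooldown_events, backend_health, system_config, daily_stats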
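Reviewer sketch: the new README describes 429 cooldown with exponential backoff and documents `SIDECAR_COOLDOWN_BASE` (30 s) and `SIDECAR_COOLDOWN_MAX` (600 s), but the growth factor is not stated in this diff. The function below assumes the common doubling schedule purely for illustration; `cooldown_seconds` and its parameters are hypothetical and may not match `cooldown_manager.py`.

```python
def cooldown_seconds(consecutive_429s: int, base: float = 30.0, max_seconds: float = 600.0) -> float:
    """Cooldown for the n-th consecutive 429: base * 2**(n-1), capped at max_seconds.

    Assumption: the backoff doubles on each consecutive 429. The commit only
    states that the backoff is exponential, with a base and a maximum.
    """
    if consecutive_429s < 1:
        return 0.0  # no 429 seen yet, so no cooldown
    # Bound the exponent so a very long streak cannot produce huge intermediate values.
    exponent = min(consecutive_429s - 1, 32)
    return min(base * (2 ** exponent), max_seconds)


# Schedule under the documented defaults: 30, 60, 120, 240, 480, then capped at 600.
schedule = [cooldown_seconds(n) for n in range(1, 8)]
```

With these defaults the cap is reached on the sixth consecutive 429, so a provider that keeps returning 429 would be retried at most once every ten minutes.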
@@ -1,41 +1 @@
-"""
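-NVIDIA Sidecar rate-limiting proxy: core proxy module.
-
-Provides four layers of protection for OpenAI Chat Completions-compatible APIs:
-    1. Request intake (FastAPI)
-    2. Gateway identification → non-NVIDIA traffic passes straight through
-    3. Priority queuing → token-bucket rate limiting
-    4. Asynchronous forwarding to the NVIDIA upstream via httpx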
-"""
-
-from __future__ import annotations
-
-from nvidia_sidecar.config import SidecarConfig, load_config
-from nvidia_sidecar.rate_limiter import (
-    Priority,
-    TokenBucket,
-    is_nvidia_gateway,
-    normalize_gateway_name,
-)
-from nvidia_sidecar.priority_queue import (
-    PriorityQueueItem,
-    PriorityRequestQueue,
-    QueueFullError,
-    QueueFullPassthrough,
-    QueueFullPolicy,
-)
-
-__version__ = "0.1.0"
-__all__ = [
-    "SidecarConfig",
-    "load_config",
-    "Priority",
-    "TokenBucket",
-    "is_nvidia_gateway",
-    "normalize_gateway_name",
-    "PriorityQueueItem",
-    "PriorityRequestQueue",
-    "QueueFullError",
-    "QueueFullPassthrough",
-    "QueueFullPolicy",
-]
+"""Sidecar V2 — Multi-pool provider proxy with cooldown, rate limiting, and WebUI management."""
@@ -1,221 +1,165 @@
-"""
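-NVIDIA Sidecar rate-limiting proxy: configuration management module (§3.1)
-
-Centrally manages Sidecar runtime parameters, with support for environment-variable overrides and a YAML config file.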
-"""
-
-from __future__ import annotations
+"""System configuration management for Sidecar V2."""

 import os
-import warnings
-from dataclasses import dataclass, field
-from pathlib import Path
-from typing import Any
+import json
+from dataclasses import dataclass, field, asdict
+from typing import Optional


@dataclass
-class SidecarConfig:
-    """Dataclass holding the Sidecar runtime configuration.
+class Config:
+    """Sidecar V2 runtime configuration.

-    Every field can be overridden by an environment variable. Priority: environment variable > YAML config file > default.
+    Sources (priority order):
+    1. Environment variables (highest)
+    2. system_config table in SQLite
+    3. Defaults defined here
    """

-    # ---- Network ----
-    listen_host: str = field(
-        default="127.0.0.1",
-        metadata={"env": "SIDECAR_HOST"},
-    )
-    listen_port: int = field(
-        default=9190,
-        metadata={"env": "SIDECAR_PORT"},
-    )
-    metrics_port: int = field(
-        default=9191,
-        metadata={"env": "SIDECAR_METRICS_PORT"},
+    # Listen
+    host: str = "127.0.0.1"
+    port: int = 9190
+    metrics_port: int = 9191
+
+    # Queue
+    queue_max_depth: int = 500
+    queue_timeout_seconds: float = 30.0
+
+    # Provider
+    default_rpm_limit: int = 40
+
+    # Cooldown
+    cooldown_base_seconds: float = 30.0
+    cooldown_max_seconds: float = 600.0
+    cooldown_exponential_backoff: bool = True
+
+    # Emergency channel: RPM fraction when all pools exhausted
+    emergency_rpm_fraction: float = 0.10
+
+    # Health check
+    health_check_interval_seconds: int = 60
+    health_check_timeout_seconds: int = 10
+    health_probe_endpoint: str = "/v1/models"
+
+    # Admin auth
+    admin_token: str = ""
+
+    # Encryption
+    encryption_key: str = ""
+
+    # Logging
+    log_level: str = "INFO"
+
+    # Database
+    db_path: str = ""
+    backup_dir: str = ""
+    backup_retention_days: int = 7
+
+    # Rate limiter
+    rate_limiter_refill_interval_ms: int = 50
+
+    # Router
+    router_refresh_interval_seconds: float = 5.0
+
+    # Max pool-internal retries
+    max_pool_retries: int = 5
+
+    # Pre-check cooldown threshold (seconds remaining)
+    cooldown_precheck_threshold_seconds: float = 10.0
+
+    # Dashboard
+    dashboard_sse_interval_seconds: float = 1.0
+
+    # Stats
+    stats_refresh_interval_seconds: float = 30.0
+
+    # Request timeout
+    default_request_timeout_seconds: int = 120
+
+    @classmethod
+    def from_env(cls) -> "Config":
+        """Load configuration from environment variables."""
+        c = cls()
+
+        # Listen
+        c.host = os.getenv("SIDECAR_HOST", c.host)
+        c.port = int(os.getenv("SIDECAR_PORT", str(c.port)))
+        c.metrics_port = int(os.getenv("SIDECAR_METRICS_PORT", str(c.metrics_port)))
+
+        # Queue
+        c.queue_max_depth = int(os.getenv("SIDECAR_QUEUE_MAX", str(c.queue_max_depth)))
+        c.queue_timeout_seconds = float(
+            os.getenv("SIDECAR_QUEUE_TIMEOUT", str(c.queue_timeout_seconds))
        )

-    # ---- Upstream ----
-    upstream_url: str = field(
-        default="https://integrate.api.nvidia.com/v1",
-        metadata={"env": "SIDECAR_UPSTREAM"},
-    )
-    upstream_api_key: str = field(
-        default="",
-        metadata={"env": "SIDECAR_API_KEY"},
+        # Provider
+        c.default_rpm_limit = int(
+            os.getenv("SIDECAR_RATE_RPM", str(c.default_rpm_limit))
        )

-    # ---- Rate limiting ----
-    rate_rpm: int = field(
-        default=40,
-        metadata={"env": "SIDECAR_RATE_RPM"},
+        # Cooldown
+        c.cooldown_base_seconds = float(
+            os.getenv("SIDECAR_COOLDOWN_BASE", str(c.cooldown_base_seconds))
        )
-    bucket_capacity: int = field(
-        default=40,
-        metadata={"env": "SIDECAR_BUCKET_CAPACITY"},
+        c.cooldown_max_seconds = float(
+            os.getenv("SIDECAR_COOLDOWN_MAX", str(c.cooldown_max_seconds))
        )

-    # ---- Timeout ----
-    request_timeout: float = field(
-        default=60.0,
-        metadata={"env": "SIDECAR_TIMEOUT"},
+        # Admin
+        c.admin_token = os.getenv("SIDECAR_ADMIN_TOKEN", c.admin_token)
+
+        # Encryption
+        c.encryption_key = os.getenv("SIDECAR_ENCRYPTION_KEY", c.encryption_key)
+
+        # Logging
+        c.log_level = os.getenv("LOG_LEVEL", c.log_level).upper()
+
+        # Database
+        c.db_path = os.getenv(
+            "SIDECAR_DB_PATH",
+            os.path.join(os.getcwd(), "data", "sidecar_v2.db"),
+        )
+        c.backup_dir = os.getenv(
+            "SIDECAR_BACKUP_DIR",
+            os.path.join(os.getcwd(), "data", "backups"),
        )

-    # ---- Queue ----
-    queue_max_size: int = field(
-        default=500,
-        metadata={"env": "SIDECAR_QUEUE_MAX"},
-    )
-    low_priority_timeout: float = field(
-        default=2.0,
-        metadata={"env": "SIDECAR_LOW_TIMEOUT"},
-    )
+        # V1 compatibility: migrate env vars
+        c._migrate_v1_env()

-    # ---- Fallback ----
-    fallback_enabled_passthrough: bool = field(
-        default=True,
-        metadata={"env": "SIDECAR_FALLBACK_PASSTHROUGH"},
-    )
+        return c

-    # ---- Logging ----
-    log_level: str = field(
-        default="INFO",
-        metadata={"env": "SIDECAR_LOG_LEVEL"},
-    )
+    def _migrate_v1_env(self) -> None:
+        """Migrate V1 environment variables to V2 defaults."""
+        # V1 UPSTREAM endpoint
+        upstream = os.getenv("SIDECAR_UPSTREAM")
+        api_key = os.getenv("SIDECAR_API_KEY")
+        if api_key and self.encryption_key:
+            # These will be used during initial migration
+            os.environ["_SIDECAR_V1_API_KEY"] = api_key
+            os.environ["_SIDECAR_V1_UPSTREAM"] = upstream or "https://integrate.api.nvidia.com/v1"

-
-def _apply_env_overrides(config: SidecarConfig) -> SidecarConfig:
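-    """Override config fields with environment variables.
-
-    Iterates over SidecarConfig's dataclass fields; for each field declaring ``metadata={"env": ...}``,
-    checks whether the environment variable exists and, if so, overrides the field after type conversion.
-    """
-    import dataclasses as _dc
-
-    # Use typing.get_type_hints to resolve the stringified type annotations
-    # introduced by from __future__ import annotations (PEP 563)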
-    try:
-        resolved_types = __import__("typing").get_type_hints(type(config))
-    except Exception:
-        resolved_types = {}
-
-    for fld in _dc.fields(config):
-        env_key: str | None = fld.metadata.get("env")
-        if env_key is None:
-            continue
-        env_val = os.environ.get(env_key)
-        if env_val is None:
-            continue
-
-        target_type = resolved_types.get(fld.name, fld.type)
-        target_type_name: str = getattr(target_type, "__name__", str(target_type))
-        try:
-            if target_type is bool or target_type == "bool":
-                parsed: bool = env_val.strip().lower() in ("true", "1", "yes", "on")
-                setattr(config, fld.name, parsed)
-            elif target_type is int or target_type == "int":
-                setattr(config, fld.name, int(env_val))
-            elif target_type is float or target_type == "float":
-                setattr(config, fld.name, float(env_val))
+    def to_db_dict(self) -> dict:
+        """Serialize to dict for system_config storage."""
+        result = {}
+        for key, value in asdict(self).items():
+            if isinstance(value, bool):
+                result[key] = "true" if value else "false"
+            elif isinstance(value, (int, float)):
+                result[key] = str(value)
            else:
-                setattr(config, fld.name, env_val)
-        except (ValueError, TypeError) as exc:
-            warnings.warn(
-                f"Cannot convert environment variable {env_key}={env_val!r} to {target_type_name}: {exc}"
-            )
+                result[key] = value
+        return result

-    return config
+    @classmethod
+    def merge_db(cls, base: "Config", db_config: dict) -> "Config":
+        """Merge DB config into base config (env vars already applied to base)."""
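+        for key, value in base.__dict__.items():
+            if key in db_config and key not in os.environ:
+                # DB values only apply when no env var override
+                raw = db_config[key]
+                if isinstance(value, bool):
+                    # to_db_dict() stores bools as "true"/"false"; bool("false") would be True
+                    raw = str(raw).strip().lower() in ("true", "1", "yes", "on")
+                setattr(base, key, type(value)(raw))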
+        return base


-def _validate_config(config: SidecarConfig) -> list[str]:
-    """Validate the configuration and return a list of warnings/issues."""
-    issues: list[str] = []
-
-    # Port conflict check
-    if config.listen_port == config.metrics_port:
-        issues.append(
-            f"listen_port ({config.listen_port}) is the same as metrics_port ({config.metrics_port})"
-        )
-
-    # rate_rpm bounds check
-    if config.rate_rpm <= 0:
-        issues.append(
-            f"rate_rpm ({config.rate_rpm}) is invalid; falling back to default 40"
-        )
-        config.rate_rpm = 40
-
-    # queue_max_size sanity check
-    if config.queue_max_size <= 0:
-        issues.append(
-            f"queue_max_size ({config.queue_max_size}) is invalid; falling back to default 500"
-        )
-        config.queue_max_size = 500
-
-    # request_timeout sanity check
-    if config.request_timeout <= 0:
-        issues.append(
-            f"request_timeout ({config.request_timeout}) is invalid; falling back to default 60"
-        )
-        config.request_timeout = 60.0
-    elif config.request_timeout > 300.0:
-        issues.append(
-            f"request_timeout ({config.request_timeout}) is unusually high; clamped to 300"
-        )
-        config.request_timeout = 300.0
-
-    return issues
-
-
-def load_config(path: str | None = None) -> SidecarConfig:
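-    """Load the Sidecar configuration.
-
-    Load order (later overrides earlier):
-    1. Defaults (SidecarConfig dataclass defaults)
-    2. YAML config file (if path is provided)
-    3. Environment variable overrides
-
-    Args:
-        path: Optional YAML config file path. When None, only defaults + environment variables are used.
-
-    Returns:
-        A validated SidecarConfig instance.
-
-    Raises:
-        FileNotFoundError: The file specified by path does not exist.
-        yaml.YAMLError: YAML parsing failed.
-    """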
-    config = SidecarConfig()
-
-    if path is not None:
-        import yaml
-
-        cfg_path = Path(path)
-        if not cfg_path.is_file():
-            raise FileNotFoundError(f"Config file not found: {cfg_path}")
-
-        try:
-            raw: dict[str, Any] = yaml.safe_load(cfg_path.read_text(encoding="utf-8")) or {}
-        except yaml.YAMLError as exc:
-            raise yaml.YAMLError(f"YAML parsing failed ({cfg_path}): {exc}") from exc
-
-        # Override declared fields
-        for fld_name in (
-            "listen_host", "listen_port", "metrics_port",
-            "upstream_url", "upstream_api_key",
-            "rate_rpm", "bucket_capacity",
-            "request_timeout",
-            "queue_max_size", "low_priority_timeout",
-            "fallback_enabled_passthrough",
-            "log_level",
-        ):
-            if fld_name in raw:
-                setattr(config, fld_name, raw[fld_name])
-
-    # Environment variable overrides (highest priority)
-    config = _apply_env_overrides(config)
-
-    # Validation
-    issues = _validate_config(config)
-    for issue in issues:
-        warnings.warn(issue)
-
-    return config
+# Singleton
+config = Config.from_env()
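
The `to_db_dict` / `merge_db` pair above stores every scalar as a string and later casts it back with the type of the current value. A minimal sketch of that round trip follows, mainly to show why booleans need their own branch: `bool("false")` is `True`. `DemoConfig`, `coerce`, and the default values are hypothetical stand-ins, not the real `Config` class, and the sketch ignores the environment-variable precedence check.

```python
from dataclasses import asdict, dataclass


@dataclass
class DemoConfig:
    """Hypothetical stand-in for Config; field names borrowed from the diff."""
    cooldown_exponential_backoff: bool = True
    default_rpm_limit: int = 40
    cooldown_base_seconds: float = 30.0
    log_level: str = "INFO"


def to_db_dict(cfg: DemoConfig) -> dict:
    """Serialize scalars to strings, mirroring the diff's to_db_dict()."""
    out = {}
    for key, value in asdict(cfg).items():
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            out[key] = "true" if value else "false"
        elif isinstance(value, (int, float)):
            out[key] = str(value)
        else:
            out[key] = value
    return out


def coerce(raw, like):
    """Cast a stored value back to the type of the current value `like`."""
    if isinstance(like, bool):
        # bool("false") is True, so booleans are parsed by text instead
        return str(raw).strip().lower() in ("true", "1", "yes", "on")
    return type(like)(raw)


def merge_db(base: DemoConfig, db_config: dict) -> DemoConfig:
    """Apply stored values onto base (env-var precedence omitted in this sketch)."""
    for key, current in asdict(base).items():
        if key in db_config:
            setattr(base, key, coerce(db_config[key], current))
    return base


stored = to_db_dict(DemoConfig(cooldown_exponential_backoff=False, default_rpm_limit=25))
restored = merge_db(DemoConfig(), stored)
print(stored["cooldown_exponential_backoff"], restored.cooldown_exponential_backoff)
```

With a plain `type(value)(db_value)` cast, the stored `"false"` would come back as `True`, silently re-enabling the flag.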
@@ -1,75 +0,0 @@
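-"""
-NVIDIA Sidecar: SidecarContext dependency injection container (§BIZ-46 Phase3)
-
-Consolidates all module-level global state into a single dataclass injected via FastAPI app.state,
-eliminating the reverse import webui.py → server and supporting testability and multi-instance scaling.
-
-Design doc: docs/architecture/BIZ-46_Phase3_Architecture_Design.md §1
-"""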
-
-from __future__ import annotations
-
-import asyncio
-import time
-from dataclasses import dataclass, field
-from typing import TYPE_CHECKING, Any
-
-import httpx
-
-if TYPE_CHECKING:
-    from nvidia_sidecar.config import SidecarConfig
-    from nvidia_sidecar.rate_limiter import AdaptiveTokenBucket
-    from nvidia_sidecar.priority_queue import PriorityRequestQueue
-    from nvidia_sidecar.metrics import PrometheusMetrics
-    from nvidia_sidecar.health import HealthService
-
-
-@dataclass
-class SidecarContext:
-    """Sidecar global runtime context: the single container for all core components.
-
-    Injected into FastAPI via ``app.state.sidecar``; routes obtain it through ``Depends(get_context)``.
-    """
-
-    # ---- Core components ----
-    config: SidecarConfig
-    http_client: httpx.AsyncClient
-    token_bucket: AdaptiveTokenBucket
-    priority_queue: PriorityRequestQueue
-    prometheus: PrometheusMetrics
-    health: HealthService
-
-    # ---- Runtime state ----
-    pending_requests: dict[str, tuple["asyncio.Future[Any]", float]] = field(default_factory=dict)
-    """Mapping of request_id → (response future, enqueued_at)."""
-
-    stats: dict[str, int] = field(default_factory=lambda: {
-        "total_requests": 0,
-        "nvidia_requests": 0,
-        "passthrough_requests": 0,
-        "ratelimited_requests": 0,
-        "queue_full_rejects": 0,
-        "upstream_errors": 0,
-        "start_time": 0,
-    })
-
-    stats_lock: asyncio.Lock = field(default_factory=asyncio.Lock)
-
-    # ---- Cache ----
-    snapshot_cache: tuple["dict[str, Any]", float] | None = None
-    """Shared SSE snapshot cache: (data, timestamp)."""
-    snapshot_cache_lock: asyncio.Lock = field(default_factory=asyncio.Lock)
-    SNAPSHOT_CACHE_TTL: float = 1.0
-
-    # ---- Convenience methods ----
-
-    async def increment_stat(self, key: str, delta: int = 1) -> None:
-        """Concurrency-safe increment of a stats counter (guarded by an asyncio lock)."""
-        async with self.stats_lock:
-            self.stats[key] = self.stats.get(key, 0) + delta
-
-    @property
-    def uptime_seconds(self) -> int:
-        """Service uptime in seconds."""
-        st = self.stats.get("start_time", 0)
-        return int(time.time() - st) if st else 0
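
For reference, the lock-guarded counter pattern used by the removed `increment_stat` can be sketched with the standard library alone. `DemoStats` is a hypothetical name for illustration; the point is that `asyncio.Lock` serializes coroutines on one event loop and offers no protection across OS threads.

```python
import asyncio


class DemoStats:
    """Hypothetical stand-in for the stats portion of SidecarContext."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}
        self.lock = asyncio.Lock()

    async def increment(self, key: str, delta: int = 1) -> None:
        # The lock keeps the read-modify-write atomic with respect to other coroutines
        async with self.lock:
            self.counts[key] = self.counts.get(key, 0) + delta


async def main() -> dict[str, int]:
    stats = DemoStats()  # created inside the running loop
    await asyncio.gather(*(stats.increment("total_requests") for _ in range(100)))
    return stats.counts


counts = asyncio.run(main())
print(counts)
```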
@@ -0,0 +1,114 @@
+"""429 Cooldown management for backends using exponential backoff."""
+
+import time
+from datetime import datetime, timezone
+import structlog
+from config import config
+from storage.backend_store import set_backend_cooldown, clear_backend_cooldown, get_backend
+from storage.cooldown_store import log_cooldown_event, end_cooldown_event
+
+logger = structlog.get_logger("sidecar_v2.cooldown_manager")
+
+
+def calculate_cooldown(consecutive_count: int) -> float:
+    """Calculate cooldown duration using exponential backoff.
+
+    Formula: base * 2^(consecutive-1), capped at max; when exponential backoff is disabled, base * consecutive with the same cap.
+    """
+    base = config.cooldown_base_seconds
+    max_seconds = config.cooldown_max_seconds
+    if config.cooldown_exponential_backoff:
+        duration = base * (2 ** (consecutive_count - 1))
+    else:
+        duration = base * consecutive_count
+    return min(duration, max_seconds)
+
+
+def start_cooldown(backend_id: str, consecutive_count: int) -> float:
+    """Start cooldown for a backend after 429.
+
+    Returns: cooldown duration in seconds.
+    """
+    duration = calculate_cooldown(consecutive_count)
+    cooldown_until_ts = time.time() + duration
+    cooldown_until = time.strftime(
+        "%Y-%m-%dT%H:%M:%SZ", time.gmtime(cooldown_until_ts)
+    )
+
+    set_backend_cooldown(backend_id, cooldown_until, consecutive_count)
+    log_cooldown_event(
+        backend_id=backend_id,
+        consecutive_count=consecutive_count,
+        cooldown_seconds=int(duration),
+        response_summary=f"429 cooldown triggered (consecutive #{consecutive_count})",
+    )
+
+    logger.info(
+        "cooldown_started",
+        backend_id=backend_id,
+        duration=round(duration, 1),
+        consecutive=consecutive_count,
+    )
+    return duration
+
+
+def check_and_clear_cooldown(backend_id: str) -> bool:
+    """Check if cooldown has expired for a backend.
+
+    Returns True if cooldown was cleared (backend is back online).
+    """
+    backend = get_backend(backend_id, decrypt_key=False)
+    if backend is None:
+        return False
+
+    if backend.status != "cooling":
+        return False
+
+    cooldown_until = backend.cooldown_until
+    if not cooldown_until:
+        clear_backend_cooldown(backend_id)
+        return True
+
+    # Parse cooldown_until as ISO timestamp
+    try:
+        dt = datetime.fromisoformat(cooldown_until.replace("Z", "+00:00"))
+        cooldown_ts = dt.timestamp()
+    except ValueError:
+        # If parsing fails, clear and move on
+        clear_backend_cooldown(backend_id)
+        return True
+
+    now = time.time()
+    if now >= cooldown_ts:
+        clear_backend_cooldown(backend_id)
+        end_cooldown_event(backend_id)
+        logger.info("cooldown_cleared", backend_id=backend_id)
+        return True
+
+    remaining = cooldown_ts - now
+    logger.debug("cooldown_active", backend_id=backend_id, remaining_seconds=round(remaining, 1))
+    return False
+
+
+def precheck_cooldown(backend_id: str) -> bool:
+    """Check if backend should be skipped due to near-expiry cooldown.
+
+    If cooldown will expire within config.cooldown_precheck_threshold_seconds,
+    skip the backend so we don't hit it again right as it expires.
+    """
+    backend = get_backend(backend_id, decrypt_key=False)
+    if backend is None or backend.status != "cooling":
+        return False
+
+    cooldown_until = backend.cooldown_until
+    if not cooldown_until:
+        return False
+
+    try:
+        dt = datetime.fromisoformat(cooldown_until.replace("Z", "+00:00"))
+        cooldown_ts = dt.timestamp()
+    except ValueError:
+        return False
+
+    remaining = cooldown_ts - time.time()
+    return 0 < remaining <= config.cooldown_precheck_threshold_seconds
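
To make the backoff schedule in `calculate_cooldown` concrete, here is a minimal standalone sketch, together with the timestamp round trip that `start_cooldown` and `check_and_clear_cooldown` rely on. The `base` and `cap` values are assumptions chosen for illustration (the real defaults live in `config` and are not shown in this diff), and the `consecutive < 1` guard is an addition not present in the original.

```python
import time
from datetime import datetime


def cooldown_seconds(consecutive: int, base: float = 30.0, cap: float = 600.0,
                     exponential: bool = True) -> float:
    """base * 2^(n-1) (or base * n when linear), capped at `cap`."""
    if consecutive < 1:
        raise ValueError("consecutive must be >= 1")  # guard added for this sketch
    duration = base * 2 ** (consecutive - 1) if exponential else base * consecutive
    return min(duration, cap)


def to_iso_z(ts: float) -> str:
    """Format a Unix timestamp the way start_cooldown() does (whole seconds, UTC)."""
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(ts))


def from_iso_z(text: str) -> float:
    """Parse it back the way check_and_clear_cooldown() does."""
    return datetime.fromisoformat(text.replace("Z", "+00:00")).timestamp()


schedule = [cooldown_seconds(n) for n in range(1, 7)]
print(schedule)  # [30.0, 60.0, 120.0, 240.0, 480.0, 600.0]

# Sub-second precision is dropped by the string format
round_trip = from_iso_z(to_iso_z(1_700_000_000.9))
print(round_trip)  # 1700000000.0
```

Because the stored string has one-second resolution, a cooldown can clear up to one second earlier than the float computed in `start_cooldown`.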
@@ -0,0 +1,108 @@
+"""AES-256-GCM encryption for API Key storage."""
+
+import os
+import secrets
+import structlog
+from cryptography.hazmat.primitives.ciphers.aead import AESGCM
+
+logger = structlog.get_logger()
+
+_ENCRYPTION_KEY: bytes | None = None
+_cipher: AESGCM | None = None
+
+
+def init_crypto(hex_key: str) -> None:
+    """Initialize the encryption module.
+
+    Validates the key and prepares the cipher.
+    Raises ValueError if key is invalid.
+    """
+    global _ENCRYPTION_KEY, _cipher
+
+    if not hex_key:
+        raise ValueError("FATAL: SIDECAR_ENCRYPTION_KEY not set")
+
+    if len(hex_key) != 64:
+        raise ValueError(
+            f"FATAL: SIDECAR_ENCRYPTION_KEY must be 64 hex chars (32 bytes), "
+            f"got {len(hex_key)} chars"
+        )
+
+    try:
+        key_bytes = bytes.fromhex(hex_key)
+    except ValueError as exc:
+        raise ValueError(
+            "FATAL: SIDECAR_ENCRYPTION_KEY must be valid hexadecimal"
+        ) from exc
+
+    # Key validated: cache the raw bytes and build the AES-GCM cipher.
+    _ENCRYPTION_KEY = key_bytes
+    _cipher = AESGCM(key_bytes)
+    logger.info("crypto_initialized")
+
+
+def encrypt(plaintext: str) -> str:
+    """Encrypt plaintext using AES-256-GCM.
+
+    Returns: hex-encoded nonce (12 bytes) + ciphertext + tag.
+    Format: <nonce_hex>:<ciphertext_hex>
+    """
+    if _cipher is None:
+        raise RuntimeError("Crypto not initialized. Call init_crypto() first.")
+
+    nonce = secrets.token_bytes(12)
+    ciphertext = _cipher.encrypt(nonce, plaintext.encode("utf-8"), None)
+    return nonce.hex() + ":" + ciphertext.hex()
+
+
+def decrypt(encrypted: str) -> str:
+    """Decrypt AES-256-GCM ciphertext.
+
+    Args:
+        encrypted: Format "<nonce_hex>:<ciphertext_hex>"
+
+    Returns: Decrypted plaintext string.
+    """
+    if _cipher is None:
+        raise RuntimeError("Crypto not initialized. Call init_crypto() first.")
+
+    parts = encrypted.split(":", 1)
+    if len(parts) != 2:
+        raise ValueError("Invalid encrypted format: expected nonce:ciphertext")
+
+    nonce = bytes.fromhex(parts[0])
+    ciphertext = bytes.fromhex(parts[1])
+
+    try:
+        plaintext = _cipher.decrypt(nonce, ciphertext, None)
+        return plaintext.decode("utf-8")
+    except Exception as e:
+        raise ValueError(f"Decryption failed: {e}") from e
+
+
+def is_initialized() -> bool:
+    """Check if crypto has been initialized."""
+    return _cipher is not None
+
+
+def mask_api_key(api_key_plain: str) -> str:
+    """Mask API key for display: show first 6 + last 4 chars."""
+    if len(api_key_plain) <= 10:
+        return api_key_plain[:2] + "****"
+    return api_key_plain[:6] + "****" + api_key_plain[-4:]
+
+
+def try_decrypt_existing(encrypted_value: str) -> str | None:
+    """Try to decrypt an existing encrypted value.
+
+    Returns the plaintext if successful, None if decryption fails
+    (e.g., encryption key was changed).
+    """
+    try:
+        return decrypt(encrypted_value)
+    except Exception:
+        logger.warning(
+            "decrypt_existing_failed",
+            hint="Encryption key may have been changed, existing keys unrecoverable"
+        )
+        return None
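
The key validation and the `<nonce_hex>:<ciphertext_hex>` envelope used above can be exercised without the `cryptography` package. The sketch below covers only parsing and packing; it performs no encryption, and `parse_hex_key`, `pack`, and `unpack` are hypothetical helper names. The extra length checks on the decoded bytes are an assumption added here, since `bytes.fromhex` ignores ASCII whitespace and a 64-character string can therefore decode to fewer than 32 bytes.

```python
import secrets

NONCE_BYTES = 12  # nonce size used by encrypt() in the diff
KEY_BYTES = 32    # AES-256


def parse_hex_key(hex_key: str) -> bytes:
    """Validate a 64-hex-char key and return the raw 32 bytes."""
    if len(hex_key) != 2 * KEY_BYTES:
        raise ValueError(f"key must be 64 hex chars, got {len(hex_key)}")
    try:
        key = bytes.fromhex(hex_key)
    except ValueError as exc:
        raise ValueError("key must be valid hexadecimal") from exc
    if len(key) != KEY_BYTES:  # fromhex skips whitespace, so re-check the decoded size
        raise ValueError("key must decode to exactly 32 bytes")
    return key


def pack(nonce: bytes, sealed: bytes) -> str:
    """Build the '<nonce_hex>:<ciphertext_hex>' envelope."""
    return nonce.hex() + ":" + sealed.hex()


def unpack(envelope: str) -> tuple[bytes, bytes]:
    """Split the envelope back into (nonce, ciphertext-with-tag)."""
    parts = envelope.split(":", 1)
    if len(parts) != 2:
        raise ValueError("expected nonce:ciphertext")
    nonce, sealed = bytes.fromhex(parts[0]), bytes.fromhex(parts[1])
    if len(nonce) != NONCE_BYTES:
        raise ValueError("nonce must be 12 bytes")
    return nonce, sealed


key = parse_hex_key("00" * 32)
nonce = secrets.token_bytes(NONCE_BYTES)
envelope = pack(nonce, b"\x01\x02")  # placeholder bytes, not real ciphertext
print(len(key), unpack(envelope) == (nonce, b"\x01\x02"))
```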
@@ -0,0 +1,623 @@
+<!DOCTYPE html>
+<html lang="zh-CN">
+<head>
+<meta charset="UTF-8">
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+<title>Sidecar V2 — Provider Pool Dashboard</title>
+<!-- Primary: jsDelivr CDN -->
+<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script>
+<!-- Fallback: local static copy for offline/intranet deployments -->
+<script>
+(function() {
+  var check = function() {
+    if (typeof Chart === 'undefined') {
+      var s = document.createElement('script');
+      s.src = '/static/chart.umd.min.js';
+      s.onerror = function() {
+        console.warn('Chart.js unavailable (CDN + local both failed). Charts disabled.');
+      };
+      document.head.appendChild(s);
+    }
+  };
+  // Check after CDN script has had a chance to load
+  setTimeout(check, 2000);
+})();
+</script>
+<style>
+  :root {
+    --bg: #0f1117;
+    --card-bg: #1a1d28;
+    --border: #2a2d3a;
+    --text: #e0e0e0;
+    --text-dim: #888;
+    --green: #23d160;
+    --yellow: #ffdd57;
+    --red: #ff3860;
+    --blue: #3273dc;
+    --purple: #b86bff;
+    --cyan: #00d1b2;
+    --orange: #ff8533;
+  }
+  * { margin: 0; padding: 0; box-sizing: border-box; }
+  body {
+    font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif;
+    background: var(--bg);
+    color: var(--text);
+    min-height: 100vh;
+  }
+
+  /* Layout */
+  .app { display: flex; height: 100vh; }
+  .sidebar {
+    width: 220px; background: var(--card-bg); border-right: 1px solid var(--border);
+    padding: 20px 0; display: flex; flex-direction: column;
+  }
+  .sidebar h2 { padding: 0 20px 20px; font-size: 16px; color: var(--cyan); border-bottom: 1px solid var(--border); }
+  .sidebar nav { flex: 1; padding: 10px 0; }
+  .sidebar nav a {
+    display: block; padding: 10px 20px; color: var(--text-dim); text-decoration: none;
+    font-size: 13px; transition: 0.2s;
+  }
+  .sidebar nav a:hover, .sidebar nav a.active { color: var(--text); background: rgba(255,255,255,0.05); }
+  .sidebar .status-bar { padding: 15px 20px; border-top: 1px solid var(--border); font-size: 11px; color: var(--text-dim); }
+
+  .main { flex: 1; overflow-y: auto; padding: 24px; }
+  .page { display: none; }
+  .page.active { display: block; }
+
+  /* Dashboard Cards */
+  .cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 16px; margin-bottom: 24px; }
+  .card {
+    background: var(--card-bg); border: 1px solid var(--border); border-radius: 8px; padding: 16px;
+  }
+  .card .label { font-size: 12px; color: var(--text-dim); text-transform: uppercase;letter-spacing:0.5px;margin-bottom:6px; }
+  .card .value { font-size: 28px; font-weight: 700; }
+  .card .sub { font-size: 12px; color: var(--text-dim); margin-top: 4px; }
+
+  .charts { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; }
+  .chart-card {
+    background: var(--card-bg); border: 1px solid var(--border); border-radius: 8px; padding: 16px;
+  }
+  .chart-card h3 { font-size: 14px; margin-bottom: 12px; color: var(--text-dim); }
+  .chart-card canvas { max-height: 250px; }
+
+  /* Pool Cards */
+  .pool-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; margin-bottom: 24px; }
+  .pool-card {
+    background: var(--card-bg); border: 1px solid var(--border); border-radius: 8px; padding: 16px;
+  }
+  .pool-card h3 { font-size: 15px; margin-bottom: 12px; text-transform: uppercase; letter-spacing: 1px; }
+  .pool-card h3.primary { color: var(--blue); }
+  .pool-card h3.fallback { color: var(--orange); }
+  .pool-stats { display: grid; grid-template-columns: repeat(4, 1fr); gap: 8px; }
+  .pool-stat { text-align: center; }
+  .pool-stat .num { font-size: 22px; font-weight: 700; }
+  .pool-stat .lbl { font-size: 11px; color: var(--text-dim); margin-top: 2px; }
+  .pool-stat.healthy .num { color: var(--green); }
+  .pool-stat.cooling .num { color: var(--yellow); }
+  .pool-stat.error .num { color: var(--red); }
+  .pool-stat.total .num { color: var(--purple); }
+
+  /* Tables */
+  table { width: 100%; border-collapse: collapse; background: var(--card-bg); border-radius: 8px; overflow: hidden; }
+  th { text-align: left; padding: 10px 12px; font-size: 11px; text-transform: uppercase; letter-spacing: 0.5px; color: var(--text-dim); background: rgba(255,255,255,0.03); border-bottom: 1px solid var(--border); }
+  td { padding: 10px 12px; font-size: 13px; border-bottom: 1px solid var(--border); }
+  tr:last-child td { border-bottom: none; }
+  tr:hover { background: rgba(255,255,255,0.02); }
+  .badge {
+    display: inline-block; padding: 2px 8px; border-radius: 10px; font-size: 11px; font-weight: 600;
+  }
+  .badge.healthy { background: rgba(35,209,96,0.15); color: var(--green); }
+  .badge.cooling { background: rgba(255,221,87,0.15); color: var(--yellow); }
+  .badge.error { background: rgba(255,56,96,0.15); color: var(--red); }
+  .badge.disabled { background: rgba(136,136,136,0.15); color: var(--text-dim); }
+  .badge.primary { background: rgba(50,115,220,0.15); color: var(--blue); }
+  .badge.fallback { background: rgba(255,133,51,0.15); color: var(--orange); }
+
+  /* Buttons */
+  .btn {
+    padding: 6px 14px; border-radius: 6px; border: none; cursor: pointer; font-size: 12px; font-weight: 600;
+    transition: 0.2s;
+  }
+  .btn-primary { background: var(--blue); color: #fff; }
+  .btn-primary:hover { opacity: 0.85; }
+  .btn-danger { background: var(--red); color: #fff; }
+  .btn-danger:hover { opacity: 0.85; }
+  .btn-sm { padding: 3px 10px; font-size: 11px; }
+  .btn-outline { background: transparent; border: 1px solid var(--border); color: var(--text); }
+  .btn-outline:hover { background: rgba(255,255,255,0.05); }
+
+  .section-header { display: flex; justify-content: space-between; align-items: center; margin-bottom: 12px; }
+  .section-header h3 { font-size: 15px; }
+
+  /* Modal */
+  .modal-overlay { display: none; position: fixed; top: 0; left: 0; right: 0; bottom: 0; background: rgba(0,0,0,0.7); z-index: 100; justify-content: center; align-items: center; }
+  .modal-overlay.active { display: flex; }
+  .modal { background: var(--card-bg); border: 1px solid var(--border); border-radius: 12px; padding: 24px; width: 560px; max-height: 80vh; overflow-y: auto; }
+  .modal h3 { margin-bottom: 16px; font-size: 16px; }
+  .form-group { margin-bottom: 12px; }
+  .form-group label { display: block; font-size: 12px; color: var(--text-dim); margin-bottom: 4px; }
+  .form-group input, .form-group select, .form-group textarea {
+    width: 100%; padding: 8px 10px; background: var(--bg); border: 1px solid var(--border);
+    border-radius: 6px; color: var(--text); font-size: 13px;
+  }
+  .form-group textarea { min-height: 80px; font-family: monospace; font-size: 12px; }
+  .form-row { display: grid; grid-template-columns: 1fr 1fr; gap: 12px; }
+  .form-actions { display: flex; gap: 8px; justify-content: flex-end; margin-top: 16px; }
+  .model-mapping-row { display: flex; gap: 8px; align-items: center; margin-bottom: 8px; }
+  .model-mapping-row input { flex: 1; }
+
+  /* Utility */
+  .text-green { color: var(--green); }
+  .text-red { color: var(--red); }
+  .text-dim { color: var(--text-dim); }
+  .mb-16 { margin-bottom: 16px; }
+  .mb-24 { margin-bottom: 24px; }
+  .mt-24 { margin-top: 24px; }
+
+  @media (max-width: 768px) {
+    .charts, .pool-grid { grid-template-columns: 1fr; }
+    .sidebar { display: none; }
+  }
+</style>
+</head>
+<body>
+<div class="app">
+  <!-- Sidebar -->
+  <aside class="sidebar">
+    <h2>🚀 Sidecar V2</h2>
+    <nav>
+      <a href="#" data-page="dashboard" class="active">📊 Dashboard</a>
+      <a href="#" data-page="providers">🔌 Providers</a>
+      <a href="#" data-page="usage">📈 Usage Stats</a>
+      <a href="#" data-page="cooldown">🧊 Cooldown Log</a>
+    </nav>
+    <div class="status-bar" id="status-bar">Connected · Sidecar V2</div>
+  </aside>
+
+  <!-- Main Content -->
+  <main class="main">
+    <!-- Dashboard Page -->
+    <div class="page active" id="page-dashboard">
+      <div class="cards" id="stat-cards"></div>
+      <div class="pool-grid" id="pool-grid"></div>
+      <div class="charts" id="charts"></div>
+    </div>
+
+    <!-- Providers Page -->
+    <div class="page" id="page-providers">
+      <div class="section-header">
+        <h3>Provider Backends</h3>
+        <button class="btn btn-primary" onclick="showAddBackend()">+ Add Provider</button>
+      </div>
+      <table id="backends-table">
+        <thead>
+          <tr><th>Name</th><th>Label</th><th>Pool</th><th>Status</th><th>RPM</th><th>Models</th><th>Actions</th></tr>
+        </thead>
+        <tbody></tbody>
+      </table>
+    </div>
+
+    <!-- Usage Page -->
+    <div class="page" id="page-usage">
+      <div class="section-header"><h3>Hourly Usage</h3></div>
+      <div class="mb-16">
+        <select id="usage-backend-filter" onchange="loadUsage()" class="btn btn-outline btn-sm">
+          <option value="">All Backends</option>
+        </select>
+      </div>
+      <table id="usage-table">
+        <thead>
+          <tr><th>Hour</th><th>Backend</th><th>Model</th><th>Requests</th><th>Errors</th><th>Tokens</th><th>Cost</th><th>Avg Latency</th></tr>
+        </thead>
+        <tbody></tbody>
+      </table>
+
+      <div class="section-header mt-24 mb-16"><h3>Daily Aggregation</h3></div>
+      <table id="daily-table">
+        <thead>
+          <tr><th>Date</th><th>Pool</th><th>Requests</th><th>Errors</th><th>Tokens</th><th>Cost</th><th>Backends</th></tr>
+        </thead>
+        <tbody></tbody>
+      </table>
+    </div>
+
+    <!-- Cooldown Page -->
+    <div class="page" id="page-cooldown">
+      <div class="section-header"><h3>Cooldown Event History</h3></div>
+      <table id="cooldown-table">
+        <thead>
+          <tr><th>Time</th><th>Backend</th><th>Consecutive 429s</th><th>Duration</th><th>Summary</th></tr>
+        </thead>
+        <tbody></tbody>
+      </table>
+    </div>
+  </main>
+</div>
+
+<!-- Add/Edit Backend Modal -->
+<div class="modal-overlay" id="backend-modal">
+  <div class="modal">
+    <h3 id="modal-title">Add Provider</h3>
+    <form id="backend-form" onsubmit="saveBackend(event)">
+      <input type="hidden" id="backend-id">
+      <div class="form-row">
+        <div class="form-group">
+          <label>Name *</label>
+          <input type="text" id="backend-name" placeholder="e.g. NVIDIA H100 Primary" required>
+        </div>
+        <div class="form-group">
+          <label>Label</label>
+          <input type="text" id="backend-label" placeholder="e.g. nvidia, siliconflow">
+        </div>
+      </div>
+      <div class="form-group">
+        <label>API Base URL *</label>
+        <input type="url" id="backend-url" placeholder="https://integrate.api.nvidia.com/v1" required>
+      </div>
+      <div class="form-group">
+        <label>API Key *</label>
+        <input type="password" id="backend-key" placeholder="sk-..." required>
+      </div>
+      <div class="form-row">
+        <div class="form-group">
+          <label>Pool</label>
+          <select id="backend-pool">
+            <option value="primary">Primary</option>
+            <option value="fallback">Fallback</option>
+          </select>
+        </div>
+        <div class="form-group">
+          <label>RPM Limit</label>
+          <input type="number" id="backend-rpm" value="40" min="1" max="1000">
+        </div>
+      </div>
+      <div class="form-row">
+        <div class="form-group">
+          <label>Timeout (seconds)</label>
+          <input type="number" id="backend-timeout" value="120" min="10" max="600">
+        </div>
+        <div class="form-group">
+          <label>Enabled</label>
+          <select id="backend-enabled">
+            <option value="true">Yes</option>
+            <option value="false">No</option>
+          </select>
+        </div>
+      </div>
+      <div class="form-group">
+        <label>Model Mappings (JSON: canonical → {native_id, cost, ...})</label>
+        <textarea id="backend-mappings" placeholder='{"deepseek-ai/DeepSeek-V4-Pro":{"native_id":"deepseek-ai/deepseek-v4-pro","cost":{"input":0.000001,"output":0.000004}}}'></textarea>
+      </div>
+      <div class="form-actions">
+        <button type="button" class="btn btn-outline" onclick="closeModal()">Cancel</button>
+        <button type="submit" class="btn btn-primary">Save</button>
+      </div>
+    </form>
+  </div>
+</div>
+
+<script>
+// ── Navigation ──
+document.querySelectorAll('.sidebar nav a').forEach(a => {
+  a.addEventListener('click', e => {
+    e.preventDefault();
+    document.querySelectorAll('.sidebar nav a').forEach(l => l.classList.remove('active'));
+    a.classList.add('active');
+    document.querySelectorAll('.page').forEach(p => p.classList.remove('active'));
+    document.getElementById('page-' + a.dataset.page).classList.add('active');
+    loadPage(a.dataset.page);
+  });
+});
+
+// ── SSE Connection ──
+const sse = new EventSource('/dashboard/sse');
+sse.onmessage = e => {
+  const data = JSON.parse(e.data);
+  if (data.type === 'snapshot') updateDashboard(data);
+};
+
+sse.onerror = () => {
+  document.getElementById('status-bar').textContent = '⚠️ SSE Disconnected';
+};
+
+// ── Dashboard Update ──
+let costChart = null, tokenChart = null;
+
+function updateDashboard(data) {
+  document.getElementById('status-bar').textContent =
+    `⚡ Connected · Uptime ${formatDuration(data.uptime_seconds)}`;
+
+  // Stat cards
+  const st = data.total || {};
+  const errRate = st.total_requests > 0 ? ((st.total_errors || 0) / st.total_requests * 100).toFixed(1) : '0.0';
+  document.getElementById('stat-cards').innerHTML = `
+    <div class="card"><div class="label">Total Requests</div><div class="value">${fmt(st.total_requests)}</div><div class="sub">Error rate: ${errRate}%</div></div>
+    <div class="card"><div class="label">Total Tokens</div><div class="value">${fmt(st.total_tokens)}</div><div class="sub">Prompt: ${fmt(st.total_prompt_tokens)} · Completion: ${fmt(st.total_completion_tokens)}</div></div>
+    <div class="card"><div class="label">Total Cost</div><div class="value">$${st.total_cost ? st.total_cost.toFixed(4) : '0.0000'}</div><div class="sub">USD</div></div>
+    <div class="card"><div class="label">Uptime</div><div class="value">${formatDuration(data.uptime_seconds)}</div><div class="sub">Sidecar V2</div></div>
+  `;
+
+  // Pool grid
+  let poolHTML = '';
+  for (const [pool, ps] of Object.entries(data.pool || {})) {
+    poolHTML += `
+      <div class="pool-card">
+        <h3 class="${pool}">${pool}</h3>
+        <div class="pool-stats">
+          <div class="pool-stat total"><div class="num">${ps.total}</div><div class="lbl">Total</div></div>
+          <div class="pool-stat healthy"><div class="num">${ps.healthy}</div><div class="lbl">Healthy</div></div>
+          <div class="pool-stat cooling"><div class="num">${ps.cooling}</div><div class="lbl">Cooling</div></div>
+          <div class="pool-stat error"><div class="num">${ps.error}</div><div class="lbl">Error</div></div>
+        </div>
+      </div>`;
+  }
+  document.getElementById('pool-grid').innerHTML = poolHTML || '<div class="card">No pools configured</div>';
+
+  // Update backend table if on providers page
+  if (document.getElementById('page-providers').classList.contains('active')) {
+    renderBackendsTable(data.backends || []);
+  }
+}
+
+// ── Chart Updates (use SSE data to build chart data) ──
+function initCharts() {
+  const cc = document.getElementById('cost-chart');
+  const tc = document.getElementById('token-chart');
+  if (!cc || !tc) return;
+
+  if (costChart) costChart.destroy();
+  if (tokenChart) tokenChart.destroy();
+
+  costChart = new Chart(cc, {
+    type: 'line', data: { labels: [], datasets: [{ label: 'Cost (USD)', data: [], borderColor: '#00d1b2', backgroundColor: 'rgba(0,209,178,0.1)', fill: true, tension: 0.3 }] },
+    options: { responsive: true, maintainAspectRatio: true, plugins: { legend: { labels: { color: '#888' } } }, scales: { x: { ticks: { color: '#888', maxTicksLimit: 12 } }, y: { ticks: { color: '#888' } } } }
+  });
+
+  tokenChart = new Chart(tc, {
+    type: 'line', data: { labels: [], datasets: [{ label: 'Total Tokens', data: [], borderColor: '#b86bff', backgroundColor: 'rgba(184,107,255,0.1)', fill: true, tension: 0.3 }] },
+    options: { responsive: true, maintainAspectRatio: true, plugins: { legend: { labels: { color: '#888' } } }, scales: { x: { ticks: { color: '#888', maxTicksLimit: 12 } }, y: { ticks: { color: '#888' } } } }
+  });
+}
+
+// ── Providers Page ──
+function renderBackendsTable(backends) {
+  const tbody = document.querySelector('#backends-table tbody');
+  tbody.innerHTML = backends.map(b => `
+    <tr>
+      <td><strong>${h(b.name)}</strong></td>
+      <td><span class="badge ${b.label ? 'primary' : ''}">${h(b.label || '-')}</span></td>
+      <td><span class="badge ${b.pool}">${h(b.pool)}</span></td>
+      <td><span class="badge ${b.status}">${h(b.status)}</span></td>
+      <td>${b.rpm_limit}</td>
+      <td>${b.model_count || 0}</td>
+      <td>
+        <button class="btn btn-outline btn-sm" onclick="editBackend('${b.id}')">Edit</button>
+        <button class="btn btn-danger btn-sm" onclick="deleteBackend('${b.id}')">Del</button>
+      </td>
+    </tr>`).join('');
+}
+
+function showAddBackend() {
+  document.getElementById('modal-title').textContent = 'Add Provider';
+  document.getElementById('backend-id').value = '';
+  document.getElementById('backend-name').value = '';
+  document.getElementById('backend-label').value = '';
+  document.getElementById('backend-url').value = '';
+  document.getElementById('backend-key').value = '';
+  document.getElementById('backend-pool').value = 'primary';
+  document.getElementById('backend-rpm').value = '40';
+  document.getElementById('backend-timeout').value = '120';
+  document.getElementById('backend-enabled').value = 'true';
+  document.getElementById('backend-mappings').value = '{}';
+  document.getElementById('backend-modal').classList.add('active');
+}
+
+async function editBackend(id) {
+  try {
+    const res = await fetch('/api/admin/backends/' + id);
+    if (!res.ok) throw new Error('HTTP ' + res.status);
+    const b = await res.json();
+    document.getElementById('modal-title').textContent = 'Edit Provider';
+    document.getElementById('backend-id').value = b.id;
+    document.getElementById('backend-name').value = b.name;
+    document.getElementById('backend-label').value = b.label || '';
+    document.getElementById('backend-url').value = b.api_base_url;
+    document.getElementById('backend-key').value = '';
+    document.getElementById('backend-key').placeholder = '(leave blank to keep current)';
+    document.getElementById('backend-key').required = false;
+    document.getElementById('backend-pool').value = b.pool;
+    document.getElementById('backend-rpm').value = b.rpm_limit;
+    document.getElementById('backend-timeout').value = b.timeout_seconds;
+    document.getElementById('backend-enabled').value = b.enabled ? 'true' : 'false';
+    document.getElementById('backend-mappings').value = JSON.stringify(b.model_mappings || {}, null, 2);
+    document.getElementById('backend-modal').classList.add('active');
+  } catch (e) { alert('Failed to load backend: ' + e.message); }
+}
+
+async function saveBackend(e) {
+  e.preventDefault();
+  const id = document.getElementById('backend-id').value;
+  try {
+    // Build the payload inside try so invalid JSON in the mappings field is reported to the user
+    const body = {
+      name: document.getElementById('backend-name').value,
+      label: document.getElementById('backend-label').value,
+      api_base_url: document.getElementById('backend-url').value,
+      pool: document.getElementById('backend-pool').value,
+      rpm_limit: parseInt(document.getElementById('backend-rpm').value, 10),
+      timeout_seconds: parseInt(document.getElementById('backend-timeout').value, 10),
+      enabled: document.getElementById('backend-enabled').value === 'true',
+      model_mappings: JSON.parse(document.getElementById('backend-mappings').value || '{}'),
+    };
+
+    // Only send api_key when the user typed one; a blank field keeps the stored key
+    const key = document.getElementById('backend-key').value;
+    if (key) body.api_key = key;
+
+    const method = id ? 'PUT' : 'POST';
+    const url = id ? '/api/admin/backends/' + id : '/api/admin/backends';
+    const res = await fetch(url, { method, headers: { 'Content-Type': 'application/json' }, body: JSON.stringify(body) });
+    if (!res.ok) throw new Error((await res.json()).detail || 'Save failed');
+    closeModal();
+    refreshAll();
+  } catch (err) { alert('Error: ' + err.message); }
+}
+
+async function deleteBackend(id) {
+  if (!confirm('Delete this provider? This cannot be undone.')) return;
+  try {
+    const res = await fetch('/api/admin/backends/' + id, { method: 'DELETE' });
+    if (!res.ok) throw new Error('HTTP ' + res.status);
+    refreshAll();
+  } catch (e) { alert('Delete failed: ' + e.message); }
+}
+
+function closeModal() { document.getElementById('backend-modal').classList.remove('active'); }
+
+// ── Load Pages ──
+async function loadPage(page) {
+  if (page === 'dashboard') {
+    initCharts();
+    loadChartData();
+  } else if (page === 'providers') {
+    refreshAll();
+  } else if (page === 'usage') {
+    loadUsageFilter();
+    loadUsage();
+    loadDaily();
+  } else if (page === 'cooldown') {
+    loadCooldown();
+  }
+}
+
+async function refreshAll() {
+  try {
+    const res = await fetch('/api/admin/backends');
+    const backends = await res.json();
+    renderBackendsTable(backends);
+  } catch (e) { console.error(e); }
+}
+
+async function loadUsageFilter() {
+  try {
+    const res = await fetch('/api/admin/backends');
+    const backends = await res.json();
+    const sel = document.getElementById('usage-backend-filter');
+    sel.innerHTML = '<option value="">All Backends</option>' +
+      backends.map(b => `<option value="${b.id}">${h(b.name)}</option>`).join('');
+  } catch (e) { console.error(e); }
+}
+
+async function loadUsage() {
+  const sel = document.getElementById('usage-backend-filter');
+  const backendId = sel.value;
+  const url = backendId ? `/api/admin/stats/hourly?backend_id=${encodeURIComponent(backendId)}&hours=72` : '/api/admin/stats/hourly?hours=72';
+  try {
+    const res = await fetch(url);
+    const data = await res.json();
+    const tbody = document.querySelector('#usage-table tbody');
+    tbody.innerHTML = data.map(r => `
+      <tr>
+        <td>${r.hour_bucket}</td>
+        <td>${r.backend_id}</td>
+        <td>${h(r.model)}</td>
+        <td>${fmt(r.request_count)}</td>
+        <td class="${r.error_count > 0 ? 'text-red' : 'text-green'}">${r.error_count}</td>
+        <td>${fmt(r.total_tokens)}</td>
+        <td>$${(r.cost || 0).toFixed(6)}</td>
+        <td>${r.avg_latency_ms}ms</td>
+      </tr>`).join('');
+  } catch (e) { console.error(e); }
+}
+
+async function loadDaily() {
+  try {
+    const res = await fetch('/api/admin/stats/daily?days=30');
+    const data = await res.json();
+    const tbody = document.querySelector('#daily-table tbody');
+    tbody.innerHTML = data.map(r => `
+      <tr>
+        <td>${r.date}</td>
+        <td><span class="badge ${r.pool}">${r.pool}</span></td>
+        <td>${fmt(r.total_requests)}</td>
+        <td>${fmt(r.total_errors)}</td>
+        <td>${fmt(r.total_tokens)}</td>
+        <td>$${(r.total_cost || 0).toFixed(6)}</td>
+        <td>${r.unique_backends}</td>
+      </tr>`).join('');
+  } catch (e) { console.error(e); }
+}
+
+async function loadCooldown() {
+  try {
+    const res = await fetch('/api/admin/stats/cooldown?limit=100');
+    const data = await res.json();
+    const tbody = document.querySelector('#cooldown-table tbody');
+    tbody.innerHTML = data.map(r => `
+      <tr>
+        <td>${r.started_at}</td>
+        <td>${r.backend_id}</td>
+        <td>${r.consecutive_count}</td>
+        <td>${r.cooldown_seconds}s</td>
+        <td>${h(r.response_summary)}</td>
+      </tr>`).join('');
+  } catch (e) { console.error(e); }
+}
+
+async function loadChartData() {
+  try {
+    const res = await fetch('/api/admin/stats/hourly?hours=168');
+    const data = await res.json();
+    // Sum cost and tokens per hour bucket across all backends and models
+    const byHour = {};
+    data.forEach(r => {
+      const hour = r.hour_bucket.slice(0, 13);
+      if (!byHour[hour]) byHour[hour] = { cost: 0, tokens: 0 };
+      byHour[hour].cost += (r.cost || 0);
+      byHour[hour].tokens += (r.total_tokens || 0);
+    });
+    const hours = Object.keys(byHour).sort();
+    const costs = hours.map(k => byHour[k].cost);
+    const tokens = hours.map(k => byHour[k].tokens);
+    // Keys hold only date and hour (13 chars), so build "HH:00 MM-DD" labels explicitly
+    const labels = hours.map(k => k.slice(11, 13) + ':00 ' + k.slice(5, 10));
+
+    if (costChart) {
+      costChart.data.labels = labels;
+      costChart.data.datasets[0].data = costs;
+      costChart.update();
+    }
+    if (tokenChart) {
+      tokenChart.data.labels = labels;
+      tokenChart.data.datasets[0].data = tokens;
+      tokenChart.update();
+    }
+  } catch (e) { console.error(e); }
+}
+
+// ── Helpers ──
+function fmt(n) { return (n || 0).toLocaleString(); }
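+// Escape a value for insertion as HTML text content (0 and false are preserved)
+function h(s) { const d = document.createElement('div'); d.textContent = s == null ? '' : String(s); return d.innerHTML; }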
+function formatDuration(s) {
+  const d = Math.floor(s / 86400);
+  const h = Math.floor((s % 86400) / 3600);
+  const m = Math.floor((s % 3600) / 60);
+  const parts = [];
+  if (d) parts.push(d + 'd');
+  if (h) parts.push(h + 'h');
+  if (m || !parts.length) parts.push(m + 'm');
+  return parts.join(' ');
+}
+
+// Initial load
+document.addEventListener('DOMContentLoaded', () => {
+  // Ensure chart containers exist
+  if (!document.getElementById('cost-chart')) {
+    const chartsDiv = document.getElementById('charts');
+    if (chartsDiv) {
+      chartsDiv.innerHTML = `
+        <div class="chart-card"><h3>Cost Over Time</h3><canvas id="cost-chart"></canvas></div>
+        <div class="chart-card"><h3>Token Usage Over Time</h3><canvas id="token-chart"></canvas></div>`;
+    }
+  }
+  initCharts();
+  loadChartData();
+});
+</script>
+</body>
+</html>
@@ -1,31 +0,0 @@
-# NVIDIA Sidecar environment variable list (BIZ-46 Phase3 §4)
-# Copy to .env and adjust as needed; used by Docker / systemd.
-
-# Network
-SIDECAR_HOST=127.0.0.1
-SIDECAR_PORT=9190
-SIDECAR_METRICS_PORT=9191
-
-# Upstream API (required)
-SIDECAR_UPSTREAM=https://integrate.api.nvidia.com/v1
-SIDECAR_API_KEY=nvapi-your-key-here
-
-# Rate limiting
-SIDECAR_RATE_RPM=40
-SIDECAR_BUCKET_CAPACITY=40
-
-# Timeout
-SIDECAR_TIMEOUT=60
-
-# Queue
-SIDECAR_QUEUE_MAX=500
-SIDECAR_LOW_TIMEOUT=2
-
-# Fallback
-SIDECAR_FALLBACK_PASSTHROUGH=true
-
-# Logging
-SIDECAR_LOG_LEVEL=INFO
-
-# Admin API authentication (optional; authentication is skipped when unset)
-# SIDECAR_ADMIN_TOKEN=your-admin-token-here
@@ -0,0 +1,90 @@
+# Sidecar V2 — API Key Encryption Rotation SOP
+
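+> Version: v1.0 | Maintainer: 严维序 (opengineer)
+
+## Background
+
+Sidecar V2 stores every provider's API key encrypted with AES-256-GCM. The encryption key is passed in through the `SIDECAR_ENCRYPTION_KEY` environment variable and initialized at startup by `init_crypto()`.
+
+## ⚠️ Critical Warning
+
+**Changing SIDECAR_ENCRYPTION_KEY makes every stored API key permanently unrecoverable!**
+
+Once the key changes, `try_decrypt_existing()` in `crypto.py` silently returns `None` and the existing encrypted data can no longer be decrypted. Follow the steps below to rotate the key safely.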
+
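+## Safe Rotation Procedure
+
+### Step 1: Record the current backends and collect the plaintext API keys (required)
+
+```bash
+# Start the sidecar with the old key and export the backend list through the admin API
+curl -s -H "Authorization: Bearer <ADMIN_TOKEN>" \
+  http://127.0.0.1:9190/api/admin/backends | \
+  python3 -c "
+import json, sys
+data = json.load(sys.stdin)
+# Note: api_key is masked in this output; fetch the original keys again from a secure channel
+print(json.dumps(data, indent=2))
+"
+```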
+
+### Step 2: Stop the service
+
+```bash
+systemctl stop sidecar-v2
+# or
+docker compose down
+```
+
+### Step 3: Back up the database
+
+```bash
+cp /app/data/sidecar_v2.db /app/data/backups/pre-rotation-$(date +%Y%m%d_%H%M%S).db
+```
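If the database ever has to be copied while a process may still hold it open, a plain `cp` can capture a half-written file. A minimal sketch using Python's standard `sqlite3` backup API, assuming the file behind `SIDECAR_DB_PATH` is a SQLite database (the `.db` suffix suggests it, but the diff does not confirm it). `backup_sqlite` and its parameters are hypothetical names, not part of the sidecar code:

```python
import sqlite3
from pathlib import Path


def backup_sqlite(src: str, dst: str) -> None:
    """Copy a SQLite database to dst using the online backup API."""
    # Create the backup directory if it does not exist yet.
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    source = sqlite3.connect(src)
    target = sqlite3.connect(dst)
    try:
        # Connection.backup copies every page of the source into the target.
        source.backup(target)
    finally:
        target.close()
        source.close()
```

With the service stopped, as in Step 2, `cp` is sufficient; the backup API only matters for copies taken while the service is running.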
+
+### Step 4: Update the key
+
+Update `SIDECAR_ENCRYPTION_KEY` in `/etc/sidecar-v2/env` or in the Docker `.env` file:
+
+```
+SIDECAR_ENCRYPTION_KEY=<new_64_hex_char_key>
+```
+
+Generate a new key:
+```bash
+python3 -c "import secrets; print(secrets.token_hex(32))"
+```
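A malformed key is easier to catch before the service restarts than after. The sketch below checks that a candidate value has the 64-hex-character shape described above and decodes to 32 bytes, the key size AES-256 requires. `parse_encryption_key` and `generate_encryption_key` are hypothetical helpers for illustration; how `init_crypto()` actually validates its input is not shown in this diff:

```python
import secrets


def parse_encryption_key(hex_key: str) -> bytes:
    """Validate a 64-hex-character key and return its 32 raw bytes."""
    cleaned = hex_key.strip()
    if len(cleaned) != 64:
        raise ValueError("expected 64 hex characters (32 bytes for AES-256)")
    try:
        key = bytes.fromhex(cleaned)
    except ValueError as exc:
        raise ValueError("key must contain only hex characters") from exc
    if len(key) != 32:
        # bytes.fromhex skips embedded whitespace, so re-check the decoded size.
        raise ValueError("key did not decode to 32 bytes")
    return key


def generate_encryption_key() -> str:
    """Return a fresh random key in the same format as the command above."""
    return secrets.token_hex(32)
```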
+
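+### Step 5: Clear the encrypted keys and re-enter them
+
+Because the old ciphertext is unreadable after the key change:
+
+1. Start the service (at this point the API keys of all existing providers are unusable).
+2. Re-enter the API key of every provider through the Admin API: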
+```bash
+curl -s -X PUT -H "Authorization: Bearer <ADMIN_TOKEN>" \
+  -H "Content-Type: application/json" \
+  -d '{"api_key": "<NEW_PLAIN_KEY>"}' \
+  http://127.0.0.1:9190/api/admin/backends/<backend_id>
+```
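With more than a handful of providers, the per-backend `curl` call is easy to script. A minimal sketch using only the standard library, assuming the same `PUT /api/admin/backends/<backend_id>` endpoint and bearer token as the command above. `build_key_update_request` and `reenter_keys` are hypothetical names, and the `keys` mapping must be filled from your own secure store:

```python
import json
import urllib.request


def build_key_update_request(base_url: str, admin_token: str,
                             backend_id: str, api_key: str) -> urllib.request.Request:
    """Build the PUT request that re-enters one provider's API key."""
    payload = json.dumps({"api_key": api_key}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/admin/backends/{backend_id}",
        data=payload,
        method="PUT",
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
    )


def reenter_keys(base_url: str, admin_token: str, keys: dict[str, str]) -> None:
    """Send one PUT per backend; urlopen raises HTTPError on a 4xx/5xx reply."""
    for backend_id, api_key in keys.items():
        req = build_key_update_request(base_url, admin_token, backend_id, api_key)
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(backend_id, resp.status)
```

Keeping request construction separate from sending makes the payload checkable without touching a live sidecar.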
+
+### Step 6: Verify
+
+```bash
+# Confirm that the provider status is healthy
+curl -s -H "Authorization: Bearer <ADMIN_TOKEN>" http://127.0.0.1:9190/api/admin/pools
+# Send a test request
+curl -s -X POST http://127.0.0.1:9190/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model":"<model_name>","messages":[{"role":"user","content":"test"}],"max_tokens":5}'
+```
+
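+## Contingency Plan
+
+If something goes wrong during the key rotation:
+
+1. Restore the old key in the environment variable
+2. Restore the database backup taken in Step 3
+3. Restart the service
+
+The old keys then work again, because any data that was not overwritten is still encrypted with the old key.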
@@ -0,0 +1,56 @@
+# Sidecar V2 — Nginx reverse proxy config (reference)
+# Place at /etc/nginx/sites-available/sidecar-v2.conf
+# SSL certs managed by certbot or manually
+
+upstream sidecar_v2_main {
+    server 127.0.0.1:9190;
+}
+
+upstream sidecar_v2_metrics {
+    server 127.0.0.1:9191;
+}
+
+server {
+    listen 443 ssl http2;
+    server_name sidecar.example.com;
+
+    ssl_certificate     /etc/ssl/certs/sidecar.pem;
+    ssl_certificate_key /etc/ssl/private/sidecar.key;
+
+    # Dashboard + Admin API (main port)
+    location / {
+        proxy_pass http://sidecar_v2_main;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+    }
+
+    # SSE support for dashboard real-time data
+    location /dashboard/sse {
+        proxy_pass http://sidecar_v2_main;
+        proxy_http_version 1.1;
+        proxy_set_header Connection "";
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_buffering off;
+        proxy_cache off;
+        chunked_transfer_encoding off;
+        proxy_read_timeout 86400s;
+    }
+
+    # Prometheus metrics
+    location /metrics {
+        proxy_pass http://sidecar_v2_metrics;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+    }
+
+    # Health check
+    location /health {
+        proxy_pass http://sidecar_v2_main;
+        proxy_http_version 1.1;
+        proxy_set_header Host $host;
+    }
+}
@@ -1,49 +0,0 @@
-# NVIDIA Sidecar rate-limiting proxy: systemd service (BIZ-46 Phase3 §4)
-#
-# Install:
-#   sudo cp deploy/nvidia-sidecar.service /etc/systemd/system/
-#   sudo systemctl daemon-reload
-#   sudo systemctl enable nvidia-sidecar
-#   sudo systemctl start nvidia-sidecar
-#
-# Operations:
-#   sudo systemctl status nvidia-sidecar
-#   sudo journalctl -u nvidia-sidecar -f
-
-[Unit]
-Description=NVIDIA Sidecar Rate-Limiting Proxy
-Documentation=https://github.com/bizwings/nvidia-sidecar
-After=network-online.target
-Wants=network-online.target
-
-[Service]
-Type=simple
-User=sidecar
-Group=sidecar
-WorkingDirectory=/opt/nvidia-sidecar
-ExecStart=/opt/nvidia-sidecar/.venv/bin/uvicorn nvidia_sidecar.server:app \
-    --host 127.0.0.1 \
-    --port 9190 \
-    --log-level info
-Restart=always
-RestartSec=5
-
-# Environment variables
-EnvironmentFile=/opt/nvidia-sidecar/.env
-
-# Security hardening
-NoNewPrivileges=true
-ProtectSystem=strict
-ProtectHome=true
-PrivateTmp=true
-ReadWritePaths=/opt/nvidia-sidecar/logs
-
-# Resource limits
-LimitNOFILE=65536
-MemoryMax=512M
-
-# Startup delay (wait for the network to be ready)
-ExecStartPre=/bin/sleep 1
-
-[Install]
-WantedBy=multi-user.target
@@ -0,0 +1,23 @@
+[Unit]
+Description=Sidecar V2 — Multi-Pool Provider Proxy
+After=network.target
+
+[Service]
+Type=simple
+User=openclaw
+Group=openclaw
+WorkingDirectory=/opt/sidecar-v2
+EnvironmentFile=/etc/sidecar-v2/env
+ExecStart=/opt/sidecar-v2/.venv/bin/python3 main.py
+Restart=always
+RestartSec=5
+
+# Security hardening
+NoNewPrivileges=yes
+ProtectSystem=strict
+ProtectHome=yes
+ReadWritePaths=/opt/sidecar-v2/data
+PrivateTmp=yes
+
+[Install]
+WantedBy=multi-user.target
@@ -0,0 +1,26 @@
+# Sidecar V2 — Multi-Pool Provider Proxy
+version: "3.9"
+
+services:
+  sidecar-v2:
+    build: .
+    container_name: sidecar-v2
+    restart: unless-stopped
+    ports:
+      - "9190:9190"  # Main proxy + admin API + dashboard
+      - "9191:9191"  # Prometheus metrics
+    environment:
+      - SIDECAR_ENCRYPTION_KEY=${SIDECAR_ENCRYPTION_KEY}
+      - SIDECAR_ADMIN_TOKEN=${SIDECAR_ADMIN_TOKEN:?SIDECAR_ADMIN_TOKEN must be set}
+      - LOG_FORMAT=${LOG_FORMAT:-json}
+      - SIDECAR_HOST=0.0.0.0
+      - SIDECAR_PORT=9190
+      - SIDECAR_METRICS_PORT=9191
+      - SIDECAR_DB_PATH=/app/data/sidecar_v2.db
+      - SIDECAR_BACKUP_DIR=/app/data/backups
+    volumes:
+      - sidecar-data:/app/data
+
+volumes:
+  sidecar-data:
+    driver: local
@@ -1,198 +0,0 @@
-"""
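-NVIDIA Sidecar rate-limiting proxy: health check endpoints (§3.6)
-
-Provides Kubernetes / systemd compatible health checks:
-    GET /health       : liveness check
-    GET /health/ready : readiness check (includes upstream connectivity)
-
-BIZ-46 Phase3: readiness HTTP client reuse. The main http_client is injected, so a new
-client is no longer created per check, which cuts connection overhead under frequent K8s/systemd probes.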
-"""
-
-from __future__ import annotations
-
-import time
-from dataclasses import dataclass
-from typing import Any
-
-import httpx
-
-
-@dataclass
-class HealthService:
-    """Health check service.
-
-    Encapsulates the liveness and readiness check logic called by the server.py routes.
-    """
-
-    start_time: float = 0.0
-    version: str = "0.1.0"
-
-    def __post_init__(self) -> None:
-        if self.start_time == 0.0:
-            self.start_time = time.time()
-
-    @property
-    def uptime_seconds(self) -> float:
-        """Service uptime in seconds."""
-        return time.time() - self.start_time
-
-    async def check_upstream(
-        self,
-        upstream_url: str,
-        http_client: httpx.AsyncClient,
-        timeout: float = 5.0,
-        api_key: str = "",
-    ) -> bool:
-        """Check upstream connectivity (reuses the injected http_client, BIZ-46 Phase3).
-
-        Args:
-            upstream_url: NVIDIA API base URL.
-            http_client: Shared httpx.AsyncClient (from ctx).
-            timeout: Timeout in seconds (per-request override).
-            api_key: Optional API key used for authentication.
-
-        Returns:
-            True if the upstream is reachable.
-        """
-        try:
-            headers: dict[str, str] = {}
-            if api_key:
-                headers["authorization"] = f"Bearer {api_key}"
-
-            resp = await http_client.get(
-                f"{upstream_url.rstrip('/')}/v1/models",
-                headers=headers,
-                timeout=timeout,
-            )
-            return resp.status_code < 500
-        except Exception:
-            return False
-
-    def check_queue_healthy(
-        self,
-        current_size: int,
-        max_size: int,
-        threshold_ratio: float = 0.9,
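-    ) -> bool:
-        """Check whether the queue is healthy (not close to full).
-
-        Args:
-            current_size: Current queue length.
-            max_size: Maximum queue capacity.
-            threshold_ratio: Alert threshold ratio, default 0.9.
-
-        Returns:
-            True if the queue is healthy.
-        """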
-        if max_size <= 0:
-            return True
-        return current_size < max_size * threshold_ratio
-
-    def check_token_bucket_healthy(
-        self,
-        available_tokens: float,
-        capacity: int,
-        threshold: float = 0.05,
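-    ) -> bool:
-        """Check whether the token bucket is healthy (tokens not exhausted).
-
-        Args:
-            available_tokens: Tokens currently available.
-            capacity: Bucket capacity.
-            threshold: Unhealthy when available tokens are at or below this fraction of capacity.
-
-        Returns:
-            True if the token bucket is healthy.
-        """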
-        if capacity <= 0:
-            return False
-        return available_tokens > capacity * threshold
-
-    def liveness(self) -> dict[str, Any]:
-        """Liveness check response.
-
-        Returns:
-            The liveness JSON payload.
-        """
-        return {
-            "status": "ok",
-            "uptime": round(self.uptime_seconds, 1),
-            "version": self.version,
-        }
-
-    async def readiness(
-        self,
-        upstream_url: str,
-        upstream_api_key: str = "",
-        queue_current_size: int = 0,
-        queue_max_size: int = 500,
-        available_tokens: float = 0.0,
-        bucket_capacity: int = 40,
-        http_client: httpx.AsyncClient | None = None,
-    ) -> dict[str, Any]:
-        """Readiness check response.
-
-        Args:
-            upstream_url: Upstream API address.
-            upstream_api_key: API key.
-            queue_current_size: Current queue length.
-            queue_max_size: Maximum queue capacity.
-            available_tokens: Tokens currently available.
-            bucket_capacity: Bucket capacity.
-            http_client: Shared httpx.AsyncClient (BIZ-46 Phase3).
-                When None, falls back to creating a new client per call (compatible with old callers).
-
-        Returns:
-            The readiness JSON payload.
-        """
-        if http_client is not None:
-            upstream_ok = await self.check_upstream(
-                upstream_url, http_client=http_client, api_key=upstream_api_key,
-            )
-        else:
-            # Backward compatibility: keep the old behavior when no http_client is given
-            upstream_ok = await self.check_upstream_standalone(
-                upstream_url, api_key=upstream_api_key,
-            )
-
-        queue_ok = self.check_queue_healthy(queue_current_size, queue_max_size)
-        token_ok = self.check_token_bucket_healthy(available_tokens, bucket_capacity)
-        all_ready = upstream_ok and queue_ok and token_ok
-
-        return {
-            "ready": all_ready,
-            "upstream_reachable": upstream_ok,
-            "queue_healthy": queue_ok,
-            "token_bucket_healthy": token_ok,
-        }
-
-    async def check_upstream_standalone(
-        self,
-        upstream_url: str,
-        timeout: float = 5.0,
-        api_key: str = "",
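-    ) -> bool:
-        """Check upstream connectivity standalone (backward compatible; creates a new client per call).
-
-        Args:
-            upstream_url: NVIDIA API base URL.
-            timeout: Timeout in seconds.
-            api_key: Optional API key.
-
-        Returns:
-            True if the upstream is reachable.
-        """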
-        try:
-            headers: dict[str, str] = {}
-            if api_key:
-                headers["authorization"] = f"Bearer {api_key}"
-
-            async with httpx.AsyncClient(timeout=timeout) as client:
-                resp = await client.get(
-                    f"{upstream_url.rstrip('/')}/v1/models",
-                    headers=headers,
-                )
-                return resp.status_code < 500
-        except Exception:
-            return False
@@ -0,0 +1,17 @@
+"""Sidecar V2 entry point."""
+
+import uvicorn
+from config import config
+
+
+def main():
+    uvicorn.run(
+        "server:app",
+        host=config.host,
+        port=config.port,
+        log_level=config.log_level.lower(),
+    )
+
+
+if __name__ == "__main__":
+    main()
@@ -1,277 +0,0 @@
-"""
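-NVIDIA Sidecar rate-limiting proxy: Prometheus metrics endpoint (§3.5)
-
-Ten metrics, served on a dedicated port :9191, separate from the proxy port :9190.
-
-BIZ-46 Phase3: Prometheus label cardinality control. The model_id label is converged to provider.
- upstream_latency_seconds: model_id → provider (fixed value "nvidia", cardinality = 1)
- upstream_errors_total: model_id → provider
- Model-level information moves to structlog JSON logs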
-"""
-
-from __future__ import annotations
-
-import time
-import threading
-from typing import Any
-
-from prometheus_client import (
-    CollectorRegistry,
-    Counter,
-    Gauge,
-    Histogram,
-    generate_latest,
-    make_asgi_app,
-)
-
-
-class PrometheusMetrics:
-    """Sidecar Prometheus metrics collector.
-
-    Thread-safe: all public methods are guarded by a ``threading.Lock``.
-    """
-
-    def __init__(self, registry: CollectorRegistry | None = None) -> None:
-        """Initialize all 10 Prometheus metrics.
-
-        Args:
-            registry: Optional custom registry; when None, a new private CollectorRegistry is created.
-        """
-        self._registry: CollectorRegistry = registry or CollectorRegistry()
-        self._lock: threading.Lock = threading.Lock()
-        self._start_time: float = time.time()
-
-        # ---- 1. Total requests (grouped by priority + status) ----
-        self.requests_total: Counter = Counter(
-            "sidecar_requests_total",
-            "Total requests processed by priority and status",
-            labelnames=["priority", "status"],
-            registry=self._registry,
-        )
-
-        # ---- 2. Available tokens ----
-        self.tokens_available: Gauge = Gauge(
-            "sidecar_tokens_available",
-            "Current number of available tokens",
-            registry=self._registry,
-        )
-
-        # ---- 3. Token generation rate ----
-        self.tokens_rate: Gauge = Gauge(
-            "sidecar_tokens_rate",
-            "Current token generation rate (tokens per minute)",
-            registry=self._registry,
-        )
-
-        # ---- 4. Queue depth per priority ----
-        self.queue_depth: Gauge = Gauge(
-            "sidecar_queue_depth",
-            "Queue depth by priority",
-            labelnames=["priority"],
-            registry=self._registry,
-        )
-
-        # ---- 5. Queue wait time histogram ----
-        self.queue_latency_seconds: Histogram = Histogram(
-            "sidecar_queue_latency_seconds",
-            "Request wait time in queue in seconds",
-            labelnames=["priority"],
-            buckets=(0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0),
-            registry=self._registry,
-        )
-
-        # ---- 6. Upstream response latency histogram (label converged: model_id → provider) ----
-        self.upstream_latency_seconds: Histogram = Histogram(
-            "sidecar_upstream_latency_seconds",
-            "Upstream response latency in seconds",
-            labelnames=["provider"],  # BIZ-46: was ["model_id"], converged to fixed-cardinality provider
-            buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0, 300.0, 600.0),
-            registry=self._registry,
-        )
-
-        # ---- 7. Upstream error count (label converged: model_id → provider) ----
-        self.upstream_errors_total: Counter = Counter(
-            "sidecar_upstream_errors_total",
-            "Upstream error count by status code and provider",
-            labelnames=["status_code", "provider"],  # BIZ-46: was ["model_id"], converged
-            registry=self._registry,
-        )
-
-        # ---- 8. Fallback passthrough count ----
-        self.fallback_passthrough_total: Counter = Counter(
-            "sidecar_fallback_passthrough_total",
-            "Total fallback / passthrough events (queue full or sidecar unavailable)",
-            registry=self._registry,
-        )
-
-        # ---- 9. Health status ----
-        self.health_status: Gauge = Gauge(
-            "sidecar_health_status",
-            "Sidecar health: 0=unhealthy, 1=healthy",
-            registry=self._registry,
-        )
-
-        # ---- 10. Uptime ----
-        self.uptime_seconds: Gauge = Gauge(
-            "sidecar_uptime_seconds",
-            "Process uptime in seconds",
-            registry=self._registry,
-        )
-
-        # Retreat-mode metrics (additional, not counted among the base 10)
-        self.retreat_state: Gauge = Gauge(
-            "sidecar_retreat_state",
-            "Adaptive retreat state: 0=NORMAL, 1=RETREAT, 2=RECOVER",
-            registry=self._registry,
-        )
-        self.effective_rate_rpm: Gauge = Gauge(
-            "sidecar_effective_rate_rpm",
-            "Current effective rate in RPM (after retreat adjustments)",
-            registry=self._registry,
-        )
-        self.upstream_429_rate: Gauge = Gauge(
-            "sidecar_upstream_429_rate",
-            "Upstream 429 rate over the retreat observation window (0.0-1.0)",
-            registry=self._registry,
-        )
-
-        # Initialization
-        self.health_status.set(1)
-
-    # ---- ASGI app construction ----
-
-    def build_asgi_app(self) -> Any:
-        """Build the Prometheus ASGI app, to be mounted on a dedicated port.
-
-        Returns:
-            An ASGI app that can be passed to uvicorn.
-        """
-        return make_asgi_app(registry=self._registry)
-
-    # ---- Metric recording methods ----
-
-    def record_request(self, priority: str, status: str) -> None:
-        """Record one request.
-
-        Args:
-            priority: Priority name (URGENT / HIGH / NORMAL / LOW).
-            status: Status (success / ratelimited / error).
-        """
-        with self._lock:
-            self.requests_total.labels(priority=priority, status=status).inc()
-
-    def record_queue_latency(self, priority: str, seconds: float) -> None:
-        """Record queue wait latency.
-
-        Args:
-            priority: Priority name.
-            seconds: Seconds spent waiting in the queue.
-        """
-        with self._lock:
-            self.queue_latency_seconds.labels(priority=priority).observe(seconds)
-
-    def record_upstream(self, status_code: int, provider: str) -> None:
-        """Record an upstream response (label converged: provider replaces model_id, BIZ-46 Phase3).
-
-        Args:
-            status_code: HTTP status code.
-            provider: Upstream provider identifier (fixed "nvidia").
-        """
-        with self._lock:
-            self.upstream_latency_seconds.labels(provider=provider).observe(0.0)
-
-    def record_upstream_error(self, status_code: int, provider: str) -> None:
-        """Record an upstream error (label converged: provider replaces model_id, BIZ-46 Phase3).
-
-        Args:
-            status_code: Error HTTP status code.
-            provider: Upstream provider identifier (fixed "nvidia").
-        """
-        with self._lock:
-            self.upstream_errors_total.labels(
-                status_code=str(status_code), provider=provider
-            ).inc()
-
-    def record_upstream_latency(self, provider: str, seconds: float) -> None:
-        """Record upstream response latency (label converged: provider replaces model_id, BIZ-46 Phase3).
-
-        Args:
-            provider: Upstream provider identifier (fixed "nvidia").
-            seconds: Response latency in seconds.
-        """
-        with self._lock:
-            self.upstream_latency_seconds.labels(provider=provider).observe(seconds)
-
-    def update_token_status(self, tokens: float, rate_per_minute: float) -> None:
-        """Update the token bucket status.
-
-        Args:
-            tokens: Tokens currently available.
-            rate_per_minute: Rate per minute.
-        """
-        with self._lock:
-            self.tokens_available.set(tokens)
-            self.tokens_rate.set(rate_per_minute)
-
-    def update_queue_depth(self, depths: dict[str, int]) -> None:
-        """Update the queue depth of every priority.
-
-        Args:
-            depths: {priority_name: count} mapping.
-        """
-        with self._lock:
-            # Set every known label, defaulting to 0, so stale values do not linger
-            for pri in ("URGENT", "HIGH", "NORMAL", "LOW"):
-                self.queue_depth.labels(priority=pri).set(depths.get(pri, 0))
-
-    def increment_fallback(self) -> None:
-        """Increment the fallback passthrough counter by 1."""
-        with self._lock:
-            self.fallback_passthrough_total.inc()
-
-    def set_health(self, healthy: bool) -> None:
-        """Set the health status.
-
-        Args:
-            healthy: True = healthy, False = unhealthy.
-        """
-        with self._lock:
-            self.health_status.set(1 if healthy else 0)
-
-    def update_uptime(self) -> None:
-        """Update the uptime gauge."""
-        with self._lock:
-            self.uptime_seconds.set(time.time() - self._start_time)
-
-    # ---- Retreat-mode metrics ----
-
-    def update_retreat_metrics(
-        self,
-        retreat_state: str,
-        effective_rate_rpm: float,
-        upstream_429_rate: float,
-    ) -> None:
-        """Update the retreat-mode metrics.
-
-        Args:
-            retreat_state: "normal" / "retreat" / "recover".
-            effective_rate_rpm: Current effective rate (RPM).
-            upstream_429_rate: Upstream 429 rate (0.0-1.0).
-        """
-        state_map: dict[str, int] = {"normal": 0, "retreat": 1, "recover": 2}
-        with self._lock:
-            self.retreat_state.set(state_map.get(retreat_state, 0))
-            self.effective_rate_rpm.set(effective_rate_rpm)
-            self.upstream_429_rate.set(upstream_429_rate)
-
-    # ---- Export ----
-
-    def generate_latest(self) -> bytes:
-        """Generate the metrics in Prometheus text format.
-
-        Returns:
-            Prometheus text-format bytes.
-        """
-        with self._lock:
-            self.update_uptime()
-        return generate_latest(self._registry)
@@ -0,0 +1,83 @@
+"""Provider pool management: primary / fallback pool routing."""
+
+import structlog
+from typing import Optional
+
+from storage.backend_store import list_backends, get_pool_stats
+from storage.models import Backend
+
+logger = structlog.get_logger("sidecar_v2.pool_manager")
+
+
+class PoolManager:
+    """Manages provider pools and selects healthy backends for a given model.
+
+    Priority: primary pool → fallback pool.
+    Within a pool: healthy, enabled backends only, in the order returned by list_backends().
+    """
+
+    def __init__(self):
+        self._pool_order = ["primary", "fallback"]
+
+    def get_available_backends(
+        self, canonical_model: str, pool: Optional[str] = None
+    ) -> list[Backend]:
+        """Get all healthy, enabled backends that serve a model, in pool order.
+
+        Args:
+            canonical_model: Canonical model name to match.
+            pool: Optional pool filter (primary/fallback). None = all pools.
+
+        Returns:
+            Healthy backends in pool priority order; within a pool they keep the
+            order returned by list_backends() (no RPM-based sorting is applied here).
+        """
+        backends: list[Backend] = []
+
+        pools_to_check = [pool] if pool else self._pool_order
+        for p in pools_to_check:
+            pool_backends = list_backends(pool=p, enabled_only=True, decrypt_key=True)
+            for b in pool_backends:
+                if b.status == "healthy" and b.has_model(canonical_model):
+                    backends.append(b)
+            if pool:
+                break
+
+        return backends
+
+    def get_any_healthy_backends(self, pool: Optional[str] = None) -> list[Backend]:
+        """Get all healthy, enabled backends regardless of model."""
+        backends: list[Backend] = []
+        pools_to_check = [pool] if pool else self._pool_order
+        for p in pools_to_check:
+            pool_backends = list_backends(pool=p, enabled_only=True, decrypt_key=True)
+            for b in pool_backends:
+                if b.status == "healthy":
+                    backends.append(b)
+            if pool:
+                break
+        return backends
+
+    def get_pool_status(self) -> dict:
+        """Get pool summary for dashboard."""
+        stats = get_pool_stats()
+        result = {}
+        for pool in self._pool_order:
+            s = stats.get(pool, {"total": 0, "enabled": 0, "healthy": 0, "cooling": 0, "error": 0})
+            result[pool] = s
+        # Also include any other pools
+        for pool, s in stats.items():
+            if pool not in result:
+                result[pool] = s
+        return result
+
+    def is_pool_available(self, canonical_model: str, pool: str = "primary") -> bool:
+        """Check if a pool has any healthy backends for a model."""
+        backends = self.get_available_backends(canonical_model, pool=pool)
+        return len(backends) > 0
+
+    def is_any_pool_available(self, canonical_model: str) -> bool:
+        """Check if any pool has healthy backends for a model."""
+        for pool in self._pool_order:
+            if self.is_pool_available(canonical_model, pool):
+                return True
+        return False
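The pool-priority selection in `PoolManager.get_available_backends` can be sketched without the storage layer. This is a minimal illustration, not the project's code: `FakeBackend`, `POOL_ORDER`, and `select_backends` are hypothetical names, and an in-memory list stands in for `list_backends`.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBackend:
    """Hypothetical stand-in for storage.models.Backend."""
    id: str
    pool: str
    status: str = "healthy"
    models: set[str] = field(default_factory=set)

    def has_model(self, name: str) -> bool:
        return name in self.models


# Assumed pool order, mirroring PoolManager._pool_order.
POOL_ORDER = ["primary", "fallback"]


def select_backends(store, model, pool=None):
    """Return healthy backends serving `model`, primary pool first."""
    pools = [pool] if pool else POOL_ORDER
    return [
        b
        for p in pools          # outer loop fixes the pool priority
        for b in store          # inner loop keeps the store's own order
        if b.pool == p and b.status == "healthy" and b.has_model(model)
    ]


store = [
    FakeBackend("fb-1", "fallback", models={"llama"}),
    FakeBackend("pr-1", "primary", "cooling", {"llama"}),
    FakeBackend("pr-2", "primary", models={"llama"}),
]
print([b.id for b in select_backends(store, "llama")])  # ['pr-2', 'fb-1']
```

The cooling primary backend is skipped, and the fallback backend is listed only after every healthy primary one, regardless of where it sits in the store.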
@@ -1,253 +0,0 @@
-"""
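-NVIDIA Sidecar rate-limiting proxy: four-level priority request queue module (§3.3)
-
-Manages pending NVIDIA API requests, dequeued by priority and then FIFO.
-Supports three queue-full policies: PASSTHROUGH / REJECT / DROP_LOWEST.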
-"""
-
-from __future__ import annotations
-
-import asyncio
-import heapq
-import time
-import uuid
-from dataclasses import dataclass, field
-from enum import Enum
-from typing import Any
-
-from nvidia_sidecar.rate_limiter import Priority
-
-
-# ---------------------------------------------------------------------------
-# Queue-full policy
-# ---------------------------------------------------------------------------
-
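-class QueueFullPolicy(str, Enum):
-    """Policy applied when the queue is full."""
-    PASSTHROUGH = "passthrough"   # Go straight to upstream, bypassing the queue (fail-open sub-policy)
-    REJECT = "reject"             # Return 503 Service Unavailable
-    DROP_LOWEST = "drop_lowest"   # Drop the lowest-priority queued item, then insert the new request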
-
-
-# ---------------------------------------------------------------------------
-# Queue item
-# ---------------------------------------------------------------------------
-
-@dataclass(order=True)
-class PriorityQueueItem:
-    """Priority queue item.
-
-    ``sort_index`` is the tuple ``(priority, timestamp)``;
-    the generated ``__lt__`` compares fields in order: priority first, then timestamp.
-    Smaller values win (URGENT=1 beats HIGH=2).
-    """
-    sort_index: tuple[int, float] = field(compare=True)
-    priority: Priority = field(compare=False)
-    request_id: str = field(compare=False)
-    payload: dict[str, Any] = field(compare=False)
-    enqueued_at: float = field(compare=False)
-    headers: dict[str, str] = field(default_factory=dict, compare=False)
-
-
-# ---------------------------------------------------------------------------
-# Priority request queue
-# ---------------------------------------------------------------------------
-
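-class QueueFullError(Exception):
-    """Raised when the queue is full and the policy is REJECT."""
-    pass
-
-
-class QueueFullPassthrough(Exception):
-    """Raised when the queue is full and the policy is PASSTHROUGH; the caller bypasses the queue and goes straight to upstream."""
-    pass
-
-
-class PriorityRequestQueue:
-    """Coroutine-safe four-level priority request queue.
-
-    Uses an ``asyncio.Lock`` to guard concurrent operations and
-    ``heapq`` plus ``asyncio.Event`` to implement blocking dequeue.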
-    """
-
-    def __init__(self, max_size: int = 500) -> None:
-        """Initialize the priority queue.
-
-        Args:
-            max_size: Maximum queue capacity.
-
-        Raises:
-            ValueError: max_size <= 0.
-        """
-        if max_size <= 0:
-            raise ValueError(f"max_size must be a positive integer, got: {max_size}")
-        self.max_size: int = max_size
-        self._heap: list[PriorityQueueItem] = []
-        self._lock: asyncio.Lock = asyncio.Lock()
-        self._not_empty: asyncio.Event = asyncio.Event()
-        self._full_policy: QueueFullPolicy = QueueFullPolicy.PASSTHROUGH
-
-        # Statistics
-        self._total_enqueued: int = 0
-        self._total_dequeued: int = 0
-        self._total_dropped: int = 0
-
-    # ---- Queue-full policy ----
-
-    def set_full_policy(self, policy: QueueFullPolicy) -> None:
-        """Set the policy applied when the queue is full.
-
-        Args:
-            policy: A QueueFullPolicy enum value.
-        """
-        self._full_policy = policy
-
-    @property
-    def full_policy(self) -> QueueFullPolicy:
-        """Current queue-full policy."""
-        return self._full_policy
-
-    # ---- Dynamic capacity adjustment ----
-
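-    def set_max_size(self, new_size: int) -> tuple[bool, str]:
-        """Adjust the maximum queue capacity at runtime (hot reload).
-
-        Shrinking is guarded: if new_size is smaller than the number of
-        queued items, the change is rejected and the current depth is reported.
-
-        Args:
-            new_size: New maximum capacity.
-
-        Returns:
-            (success flag, message). On success: True plus an old-vs-new capacity message;
-            on failure: False plus the rejection reason and the current depth.
-
-        Raises:
-            ValueError: new_size <= 0.
-        """
-        if new_size <= 0:
-            raise ValueError(f"max_size must be a positive integer, got: {new_size}")
-        current = len(self._heap)
-        if new_size < current:
-            return (False, f"Shrink rejected: new limit {new_size} < current queue depth {current}; drain the queue or raise the limit first")
-        old = self.max_size
-        self.max_size = new_size
-        return (True, f"Queue limit adjusted: {old} → {new_size}{' (currently queued: ' + str(current) + ')' if current > 0 else ''}")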
-
-    # ---- Enqueue ----
-
-    async def put(
-        self,
-        item: dict[str, Any],
-        priority: Priority = Priority.NORMAL,
-        headers: dict[str, str] | None = None,
-    ) -> str:
-        """Put a request into the queue.
-
-        Args:
-            item: Request body (a dict parsed from JSON).
-            priority: Request priority, NORMAL by default.
-            headers: Original request headers.
-
-        Returns:
-            The unique request_id assigned to the request.
-
-        Raises:
-            QueueFullError: The queue is full and the policy is REJECT.
-        """
-        request_id = str(uuid.uuid4())
-        headers = headers or {}
-
-        queue_item = PriorityQueueItem(
-            sort_index=(int(priority), time.monotonic()),
-            priority=priority,
-            request_id=request_id,
-            payload=item,
-            enqueued_at=time.monotonic(),
-            headers=headers,
-        )
-
-        async with self._lock:
-            queue_size = len(self._heap)
-            if queue_size >= self.max_size:
-                if self._full_policy == QueueFullPolicy.REJECT:
-                    raise QueueFullError(
-                        f"Queue full ({queue_size}/{self.max_size}), policy: reject"
-                    )
-                elif self._full_policy == QueueFullPolicy.DROP_LOWEST:
-                    # Drop the lowest-priority (largest-value) item in the heap.
-                    # The heap is a min-heap, so finding the maximum requires a scan.
-                    max_val_item = max(self._heap, key=lambda x: x.sort_index)
-                    self._heap.remove(max_val_item)
-                    heapq.heapify(self._heap)
-                    self._total_dropped += 1
-                # PASSTHROUGH policy: do not enqueue; raise so the caller bypasses the queue
-                else:
-                    raise QueueFullPassthrough(
-                        f"Queue full ({queue_size}/{self.max_size}), policy: passthrough"
-                    )
-
-            heapq.heappush(self._heap, queue_item)
-            self._total_enqueued += 1
-
-        self._not_empty.set()
-        return request_id
-
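-    # ---- Dequeue ----
-
-    async def get(self, timeout: float = 1.0) -> PriorityQueueItem | None:
-        """Take the next item from the queue (blocking, priority-ordered).
-
-        Args:
-            timeout: Maximum seconds to block, 1.0 by default.
-
-        Returns:
-            The highest-priority queue item, or None if the wait times out with no item.
-        """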
-        deadline = time.monotonic() + timeout
-        while True:
-            async with self._lock:
-                if self._heap:
-                    item = heapq.heappop(self._heap)
-                    self._total_dequeued += 1
-                    if not self._heap:
-                        self._not_empty.clear()
-                    return item
-
-            # Queue is empty; wait for a new item to be enqueued
-            remaining = deadline - time.monotonic()
-            if remaining <= 0:
-                return None
-            try:
-                await asyncio.wait_for(
-                    self._not_empty.wait(),
-                    timeout=remaining,
-                )
-            except asyncio.TimeoutError:
-                return None
-
-    # ---- Status queries ----
-
-    async def get_queue_size(self) -> int:
-        """Return the current queue length."""
-        async with self._lock:
-            return len(self._heap)
-
-    async def get_stats(self) -> dict[str, Any]:
-        """Return queue statistics."""
-        async with self._lock:
-            depth_by_priority: dict[str, int] = {}
-            for item in self._heap:
-                key = item.priority.name
-                depth_by_priority[key] = depth_by_priority.get(key, 0) + 1
-
-            return {
-                "max_size": self.max_size,
-                "current_size": len(self._heap),
-                "total_enqueued": self._total_enqueued,
-                "total_dequeued": self._total_dequeued,
-                "total_dropped": self._total_dropped,
-                "depth_by_priority": depth_by_priority,
-                "full_policy": self._full_policy.value,
-                "utilization": len(self._heap) / self.max_size if self.max_size > 0 else 0.0,
-            }
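The ordering and DROP_LOWEST behaviour of the removed queue can be reproduced in a few lines with `heapq`. This is a simplified sketch under stated assumptions: `push` is a hypothetical helper, entries are plain tuples rather than `PriorityQueueItem`, and an `itertools.count()` sequence replaces the `time.monotonic()` timestamp so the result is deterministic.

```python
import heapq
import itertools

_seq = itertools.count()  # assumed tie-breaker standing in for the timestamp


def push(heap, priority, request_id, max_size):
    """Push an entry; when full, evict the lowest-priority entry first.

    Returns the request_id of the evicted entry, or None.
    """
    dropped = None
    if len(heap) >= max_size:
        # A min-heap keeps the *smallest* tuple at index 0, so finding the
        # largest (lowest priority, newest) entry needs a linear scan.
        victim = max(heap)
        heap.remove(victim)
        heapq.heapify(heap)  # restore the heap invariant after removal
        dropped = victim[2]
    heapq.heappush(heap, (priority, next(_seq), request_id))
    return dropped


heap = []
push(heap, 3, "normal-req", max_size=2)
push(heap, 4, "low-req", max_size=2)
evicted = push(heap, 1, "urgent-req", max_size=2)
print(evicted)                  # low-req
print(heapq.heappop(heap)[2])   # urgent-req
```

The eviction costs O(n) per full-queue insert; that is acceptable for a queue capped at a few hundred entries, which is the range the removed module defaulted to.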
@@ -0,0 +1,383 @@
+"""Proxy request handling for Sidecar V2 — multi-pool routing + cooldown + rate limiting."""
+
+import asyncio
+import json
+import time
+from typing import Any, Optional
+
+import httpx
+import structlog
+from fastapi import Request
+from fastapi.responses import JSONResponse, Response, StreamingResponse
+
+from config import config
+from pool_manager import PoolManager
+from rate_limiter import PerBackendRateLimiter
+from router import Router
+from cooldown_manager import start_cooldown, check_and_clear_cooldown
+from storage.models import Backend
+from storage.usage_store import record_usage
+
+# Emergency activation counter (read by metrics endpoint)
+_emergency_count: int = 0
+
+
+def get_emergency_count() -> int:
+    return _emergency_count
+
+
+logger: structlog.stdlib.BoundLogger = structlog.get_logger("sidecar_v2.proxy")
+
+
+def extract_model(body: dict[str, Any]) -> str:
+    """Extract model identifier from request body."""
+    return str(body.get("model", "unknown"))
+
+
+def build_error_response(status: int, message: str, error_type: str = "") -> JSONResponse:
+    """Build a standard error response."""
+    return JSONResponse(
+        status_code=status,
+        content={
+            "error": {
+                "message": message,
+                "type": error_type or f"Error_{status}",
+            }
+        },
+    )
+
+
+async def forward_to_backend(
+    backend: Backend,
+    method: str,
+    path: str,
+    body: bytes | None,
+    headers: dict[str, str],
+    stream: bool = False,
+) -> httpx.Response:
+    """Forward a request to a specific backend."""
+    upstream_url = backend.api_base_url.rstrip("/") + path
+
+    forward_headers = {
+        k: v
+        for k, v in headers.items()
+        if k.lower() not in ("host", "content-length", "transfer-encoding")
+    }
+
+    if backend.api_key_plain:
+        forward_headers["authorization"] = f"Bearer {backend.api_key_plain}"
+    elif "authorization" not in {k.lower() for k in forward_headers}:
+        forward_headers["authorization"] = "Bearer nvidia"
+
+    timeout = httpx.Timeout(backend.timeout_seconds)
+
+    async with httpx.AsyncClient(timeout=timeout) as client:
+        req = client.build_request(
+            method=method,
+            url=upstream_url,
+            headers=forward_headers,
+            content=body,
+        )
+        return await client.send(req, stream=stream)
+
+
+def build_response(resp: httpx.Response) -> Response:
+    """Convert httpx.Response to FastAPI Response."""
+    content_type = resp.headers.get("content-type", "")
+    headers = {
+        k: v
+        for k, v in resp.headers.items()
+        if k.lower() not in ("content-encoding", "transfer-encoding")
+    }
+
+    is_sse = "text/event-stream" in content_type
+    is_chunked = resp.headers.get("transfer-encoding", "").lower() == "chunked"
+    if is_sse or (is_chunked and headers.get("content-type", "") != "application/octet-stream"):
+        return StreamingResponse(
+            content=resp.aiter_bytes(),
+            status_code=resp.status_code,
+            headers=headers,
+            media_type=content_type or "text/event-stream",
+        )
+
+    return Response(
+        content=resp.content,
+        status_code=resp.status_code,
+        headers=headers,
+        media_type=content_type or "application/json",
+    )
+
+
+def extract_usage_from_response(
+    resp: httpx.Response,
+    resp_json: dict[str, Any],
+    model: str,
+) -> tuple[int, int, int]:
+    """Extract token usage from response body (OpenAI-compatible)."""
+    usage = resp_json.get("usage", {})
+    prompt_tokens = usage.get("prompt_tokens", 0) or 0
+    completion_tokens = usage.get("completion_tokens", 0) or 0
+
+    # Try streaming chunks: aggregate from choices
+    if not prompt_tokens and not completion_tokens:
+        choices = resp_json.get("choices", [])
+        for choice in choices:
+            if isinstance(choice, dict):
+                tokens = choice.get("usage", {})
+                prompt_tokens += tokens.get("prompt_tokens", 0) or 0
+                completion_tokens += tokens.get("completion_tokens", 0) or 0
+
+    total_tokens = prompt_tokens + completion_tokens
+    if total_tokens == 0:
+        total_tokens = usage.get("total_tokens", 0) or 0
+
+    return prompt_tokens, completion_tokens, total_tokens
+
+
+def calculate_cost(
+    backend: Backend,
+    model: str,
+    prompt_tokens: int,
+    completion_tokens: int,
+) -> float:
+    """Calculate cost using backend's model pricing."""
+    cost_info = backend.get_model_cost(model)
+    input_cost = cost_info.get("input", 0.0)
+    output_cost = cost_info.get("output", 0.0)
+    # Costs are per token
+    return (prompt_tokens * input_cost + completion_tokens * output_cost)
+
+
+async def handle_proxy_request(
+    pool_manager: PoolManager,
+    rate_limiter: PerBackendRateLimiter,
+    router: Router,
+    request: Request,
+    path: str,
+) -> Response:
+    """Main proxy handler: multi-pool routing with cooldown and rate limiting.
+
+    Flow:
+    1. Extract model → canonical name
+    2. Pick backend via Router (primary → fallback)
+    3. Forward request
+    4. If 429 → cooldown backend, retry with another
+    5. If all pools exhausted → emergency mode
+    6. Track usage
+    """
+    start_time = time.monotonic()
+
+    body_bytes: bytes = await request.body()
+    raw_headers: dict[str, str] = dict(request.headers)
+
+    body_json: dict[str, Any] = {}
+    try:
+        if body_bytes:
+            parsed = json.loads(body_bytes)
+            if isinstance(parsed, dict):
+                body_json = parsed
+    except (ValueError, TypeError):
+        body_json = {}
+
+    canonical_model = extract_model(body_json)
+    is_stream = body_json.get("stream", False)
+
+    # Try with pool routing
+    max_retries = config.max_pool_retries
+    for attempt in range(max_retries):
+        # Check and clear expired cooldowns before picking
+        _refresh_cooldowns()
+
+        backend = router.pick_backend(canonical_model)
+        if backend is None:
+            break  # No backend available, fall through to emergency
+
+        try:
+            resp = await forward_to_backend(
+                backend=backend,
+                method=request.method,
+                path=path,
+                body=body_bytes if body_bytes else None,
+                headers=raw_headers,
+                stream=is_stream,
+            )
+            elapsed_ms = int((time.monotonic() - start_time) * 1000)
+
+            # Handle 429 — cooldown and retry
+            if resp.status_code == 429:
+                new_count = backend.consecutive_429_count + 1
+                start_cooldown(backend.id, new_count)
+
+                try:
+                    resp_body = resp.text[:200]
+                except Exception:  # e.g. a streamed body that has not been read
+                    resp_body = ""
+
+                logger.warning(
+                    "backend_429_cooldown",
+                    backend_id=backend.id,
+                    pool=backend.pool,
+                    consecutive=new_count,
+                    model=canonical_model,
+                    upstream_body=resp_body,
+                )
+
+                # Track the error
+                record_usage(
+                    backend_id=backend.id,
+                    model=canonical_model,
+                    prompt_tokens=0,
+                    completion_tokens=0,
+                    cost=0.0,
+                    latency_ms=elapsed_ms,
+                    is_error=True,
+                )
+
+                continue  # Retry with another backend
+
+            # Success — track usage
+            resp_json: dict[str, Any] = {}
+            try:
+                if not is_stream and resp.content:
+                    resp_json = json.loads(resp.content)
+            except (ValueError, TypeError):
+                pass
+
+            prompt_tokens, completion_tokens, total_tokens = extract_usage_from_response(
+                resp, resp_json, canonical_model
+            )
+            cost = calculate_cost(
+                backend, canonical_model, prompt_tokens, completion_tokens
+            )
+
+            record_usage(
+                backend_id=backend.id,
+                model=canonical_model,
+                prompt_tokens=prompt_tokens,
+                completion_tokens=completion_tokens,
+                cost=cost,
+                latency_ms=elapsed_ms,
+            )
+
+            logger.info(
+                "request_completed",
+                backend_id=backend.id,
+                pool=backend.pool,
+                model=canonical_model,
+                status=resp.status_code,
+                tokens=total_tokens,
+                cost=round(cost, 6),
+                elapsed_ms=elapsed_ms,
+            )
+
+            return build_response(resp)
+
+        except httpx.TimeoutException:
+            logger.warning(
+                "backend_timeout",
+                backend_id=backend.id,
+                model=canonical_model,
+            )
+            continue
+        except (httpx.ConnectError, httpx.RemoteProtocolError) as exc:
+            logger.warning(
+                "backend_connection_error",
+                backend_id=backend.id,
+                model=canonical_model,
+                error=str(exc),
+            )
+            continue
+        except Exception as exc:
+            logger.error(
+                "proxy_error",
+                backend_id=backend.id,
+                model=canonical_model,
+                error=str(exc),
+            )
+            continue
+
+    # All pools exhausted — emergency rate-limited passthrough
+    emergency_rpm = int(config.default_rpm_limit * config.emergency_rpm_fraction)
+    if emergency_rpm < 1:
+        emergency_rpm = 1
+
+    logger.warning(
+        "all_pools_exhausted_emergency",
+        model=canonical_model,
+        emergency_rpm=emergency_rpm,
+    )
+
+    global _emergency_count  # module-level counter: track emergency activations for metrics
+    _emergency_count += 1
+
+    # Emergency: try to get a token from any fallback backend at reduced RPM
+    emergency_retries = 3
+    for attempt in range(emergency_retries):
+        backends = pool_manager.get_any_healthy_backends()
+        for backend in backends:
+            if rate_limiter.consume(backend.id, emergency_rpm):
+                try:
+                    resp = await forward_to_backend(
+                        backend=backend,
+                        method=request.method,
+                        path=path,
+                        body=body_bytes if body_bytes else None,
+                        headers=raw_headers,
+                        stream=is_stream,
+                    )
+                    elapsed_ms = int((time.monotonic() - start_time) * 1000)
+
+                    if resp.status_code == 429:
+                        start_cooldown(backend.id, backend.consecutive_429_count + 1)
+                        continue
+
+                    # Success in emergency mode
+                    try:
+                        resp_json: dict[str, Any] = {}
+                        if not is_stream and resp.content:
+                            resp_json = json.loads(resp.content)
+                    except Exception:
+                        resp_json = {}
+
+                    prompt_tokens, completion_tokens, total_tokens = extract_usage_from_response(
+                        resp, resp_json, canonical_model
+                    )
+                    cost_em = calculate_cost(backend, canonical_model, prompt_tokens, completion_tokens)
+
+                    record_usage(
+                        backend_id=backend.id,
+                        model=canonical_model,
+                        prompt_tokens=prompt_tokens,
+                        completion_tokens=completion_tokens,
+                        cost=cost_em,
+                        latency_ms=elapsed_ms,
+                    )
+
+                    logger.info(
+                        "emergency_passthrough_success",
+                        backend_id=backend.id,
+                        model=canonical_model,
+                        emergency_rpm=emergency_rpm,
+                    )
+                    return build_response(resp)
+                except Exception:
+                    continue
+
+    # All emergency attempts failed — return 503 for OpenClaw fallback chain
+    return build_error_response(
+        503,
+        "All provider pools exhausted. OpenClaw fallback chain should activate.",
+        "AllPoolsExhausted",
+    )
+
+
+def _refresh_cooldowns() -> None:
+    """Check and clear expired cooldowns for backends currently in cooling state.
+
+    Lists every backend and re-checks those with status='cooling' (the
+    health_check_loop scans periodically; this is the on-demand refresh before routing)."""
+    from storage.backend_store import list_backends
+    backends = list_backends(decrypt_key=False)
+    for backend in backends:
+        if backend.status == "cooling":
+            check_and_clear_cooldown(backend.id)
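A module-level counter such as `_emergency_count` can only be incremented from inside a function if the function declares the name `global`; otherwise Python treats it as a local and the augmented assignment fails at call time. The sketch below isolates that rule. `record_emergency` and `broken_record` are hypothetical helpers written for illustration, not functions from proxy.py.

```python
_emergency_count = 0  # module-level counter, same shape as in proxy.py


def get_emergency_count() -> int:
    # Reading a module-level name needs no declaration.
    return _emergency_count


def record_emergency() -> None:
    # `global` makes the augmented assignment rebind the module-level name.
    global _emergency_count
    _emergency_count += 1


def broken_record() -> None:
    # No `global`: the assignment makes the name local to this function,
    # so reading it first raises UnboundLocalError when called.
    _emergency_count += 1
```

Calling `record_emergency()` twice leaves `get_emergency_count()` at 2, while `broken_record()` raises `UnboundLocalError`. Inside a coroutine the same rule applies, since `async def` bodies follow the ordinary scoping rules.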
@@ -1,48 +0,0 @@
-[project]
-name = "nvidia_sidecar"
-version = "0.1.0"
-description = "NVIDIA Sidecar rate-limiting proxy: priority queueing plus token-bucket rate limiting for the NVIDIA API"
-readme = "README.md"
-license = { text = "MIT" }
-requires-python = ">=3.12"
-dependencies = [
-    "fastapi>=0.115",
-    "uvicorn[standard]>=0.34",
-    "httpx>=0.28",
-    "PyYAML>=6.0",
-    "structlog>=24.4",
-    "prometheus-client>=0.21",
-    "pydantic>=2.0",
-]
-
-[project.optional-dependencies]
-dev = [
-    "pytest>=8.3",
-    "pytest-asyncio>=0.24",
-    "httpx>=0.28",
-    "mypy>=1.14",
-    "types-PyYAML",
-]
-
-[project.scripts]
-nvidia-sidecar = "nvidia_sidecar.server:main"
-
-[build-system]
-requires = ["setuptools>=75", "wheel"]
-build-backend = "setuptools.build_meta"
-
-[tool.setuptools]
-packages = ["nvidia_sidecar"]
-
-[tool.setuptools.package-dir]
-# Flat layout: __init__.py + all .py files at project root
-"nvidia_sidecar" = "."
-
-[tool.mypy]
-python_version = "3.12"
-strict = true
-warn_return_any = true
-warn_unused_configs = true
-[[tool.mypy.overrides]]
-module = "structlog.*"
-ignore_missing_imports = true
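The rate limiter diff that follows rests on the token bucket idea: tokens refill at `rpm / 60` per second up to a capacity of `rpm`, and each request spends one. The sketch below shows only that arithmetic. It is not the project's `_TokenBucket`: the class name, the injected `clock` parameter, and the omission of locking are simplifications made here so the behaviour can be checked deterministically.

```python
class SketchTokenBucket:
    """Single-threaded token bucket with an injectable clock (illustrative)."""

    def __init__(self, rpm: int, clock):
        self.rate = rpm / 60.0            # tokens added per second
        self.capacity = max(rpm, 1)       # burst size, at least one token
        self.tokens = float(self.capacity)  # bucket starts full
        self.clock = clock
        self.last = clock()

    def consume(self, n: int = 1) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.tokens + (now - self.last) * self.rate,
                          float(self.capacity))
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


fake_now = [0.0]                       # mutable cell used as a fake clock
bucket = SketchTokenBucket(60, lambda: fake_now[0])

burst = sum(bucket.consume() for _ in range(61))
print(burst)                           # 60: the 61st request is refused

fake_now[0] = 2.0                      # two seconds later, two tokens are back
print(bucket.consume(), bucket.consume(), bucket.consume())  # True True False
```

A production version would guard `consume` with a lock, as the classes in the diff do with `threading.Lock`.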
@@ -1,130 +1,86 @@
-"""
-NVIDIA Sidecar rate-limiting proxy: token bucket + gateway detection module (§3.2)
+"""Per-backend rate limiter using a token bucket algorithm."""

-Core rate-limiting logic extracted from BIZ-26 rate_limiter.py, without the multi-threaded scheduler, cache management, etc.
-Kept: Priority, TokenBucket, is_nvidia_gateway, normalize_gateway_name.
-"""
-
-from __future__ import annotations
-
-import time
 import threading
-from enum import IntEnum
+import time
 from typing import Any


-# ---------------------------------------------------------------------------
-# Priority enum
-# ---------------------------------------------------------------------------
+class PerBackendRateLimiter:
+    """Manages independent token buckets for each backend.

-class Priority(IntEnum):
-    """Request priority (a smaller value means higher priority)."""
-    URGENT = 1
-    HIGH = 2
-    NORMAL = 3
-    LOW = 4
-
-
-# ---------------------------------------------------------------------------
-# NVIDIA gateway alias set
-# ---------------------------------------------------------------------------
-
-NVIDIA_GATEWAY_ALIASES: set[str] = {
-    # All NVIDIA provider names in the OpenClaw configuration
-    "nvidia",
-    "nvidia-gateway",
-    "nvidia98053",
-    "nvidialiuweicheng84",
-    "nvidiavx",
-    "nvidiavx18088980513",
-    "nvidiavx64391942",
-}
-
-
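-def is_nvidia_gateway(value: str | None) -> bool:
-    """Return whether a gateway name or full model path belongs to an NVIDIA gateway.
-
-    Args:
-        value: Gateway name (e.g. ``"nvidia"``) or a full model path
-               (e.g. ``"nvidia/deepseek-ai/deepseek-v4-pro"``).
-               When None, False is returned immediately.
-
-    Returns:
-        True when the provider part of value matches a known NVIDIA alias.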
+    Thread-safe. Each backend gets its own bucket with configurable RPM.
    """
-    if value is None:
+
+    def __init__(self, refill_interval_ms: int = 50):
+        self._buckets: dict[str, _TokenBucket] = {}
+        self._lock = threading.Lock()
+        self._refill_interval_ms = refill_interval_ms
+
+    def ensure_bucket(self, backend_id: str, rpm_limit: int) -> None:
+        """Create or update a bucket for a backend."""
+        with self._lock:
+            if backend_id in self._buckets:
+                existing = self._buckets[backend_id]
+                existing.update_rate(rpm_limit)
+            else:
+                self._buckets[backend_id] = _TokenBucket(
+                    rate=rpm_limit / 60.0,
+                    capacity=max(rpm_limit, 1),
+                )
+
+    def remove_bucket(self, backend_id: str) -> None:
+        """Remove a backend's bucket."""
+        with self._lock:
+            self._buckets.pop(backend_id, None)
+
+    def consume(self, backend_id: str, rpm_limit: int, tokens: int = 1) -> bool:
+        """Try to consume tokens for a backend. Returns True if allowed.
+
+        Auto-creates the bucket if needed.
+        """
+        self.ensure_bucket(backend_id, rpm_limit)
+
+        with self._lock:
+            bucket = self._buckets.get(backend_id)
+            if bucket is None:
                return False

-    # Extract the provider prefix: the part before the first "/"
-    provider = value.split("/", 1)[0].lower().strip()
-    return provider in NVIDIA_GATEWAY_ALIASES
+        return bucket.consume(tokens)

-
-def normalize_gateway_name(value: str | None) -> str | None:
-    """Normalize a gateway name: extract the provider prefix and lowercase it.
-
-    Args:
-        value: Gateway name or full model path. When None, None is returned.
-
-    Returns:
-        The lowercase provider prefix, or None.
-    """
-    if value is None:
+    def get_status(self, backend_id: str) -> dict[str, Any] | None:
+        """Get bucket status for a backend."""
+        with self._lock:
+            bucket = self._buckets.get(backend_id)
+            if bucket is None:
                return None
-    return value.split("/", 1)[0].lower().strip()
+            return bucket.get_status()
+
+    def get_all_status(self) -> dict[str, dict[str, Any]]:
+        """Get status of all buckets."""
+        with self._lock:
+            return {bid: b.get_status() for bid, b in self._buckets.items()}


-# ---------------------------------------------------------------------------
-# Token bucket (thread-safe)
-# ---------------------------------------------------------------------------
+class _TokenBucket:
+    """Internal token bucket with refill."""

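-class TokenBucket:
-    """Thread-safe token bucket implementation.
-
-    Supports fixed-rate token refill and consumption, with overflow protection and optional blocking waits.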
-    """
-
-    def __init__(self, rate: float = 40 / 60, capacity: int = 40) -> None:
-        """Initialize the token bucket.
-
-        Args:
-            rate: Token refill rate (tokens/second). Default 40/60 ≈ 0.667 token/s (40 RPM).
-            capacity: Maximum bucket capacity (tokens). Default 40.
-        """
-        self._rate: float = float(rate)
-        self._capacity: int = int(capacity)
-        self._tokens: float = float(capacity)  # bucket starts full
-        self._last_refill: float = time.monotonic()
-        self._lock: threading.Lock = threading.Lock()
-
-    # ---- Internal methods ----
+    def __init__(self, rate: float, capacity: int):
+        self._rate = float(rate)
+        self._capacity = int(capacity)
+        self._tokens = float(capacity)
+        self._last_refill = time.monotonic()
+        self._lock = threading.Lock()

    def _refill(self) -> None:
-        """Refill tokens (the caller must hold _lock).
-
-        Adds tokens based on the time elapsed since the last refill, capped at capacity.
-        """
        now = time.monotonic()
        elapsed = now - self._last_refill
        if elapsed > 0 and self._rate > 0:
-            new_tokens = elapsed * self._rate
-            self._tokens = min(self._tokens + new_tokens, float(self._capacity))
+            self._tokens = min(self._tokens + elapsed * self._rate, float(self._capacity))
        self._last_refill = now

-    # ---- Public methods ----
-
    def consume(self, tokens: int = 1) -> bool:
-        """Try to consume tokens immediately (non-blocking).
-
-        Args:
-            tokens: Number of tokens to consume, 1 by default.
-
-        Returns:
-            True if consumed; False if there are not enough tokens.
-        """
        if tokens <= 0:
            return True
-
        with self._lock:
            self._refill()
            if self._tokens >= tokens:
@@ -132,52 +88,15 @@ class TokenBucket:
                return True
            return False

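-    def try_consume(self, tokens: int = 1, timeout: float = 2.0) -> bool:
-        """Try to consume tokens within a time limit (blocking).
-
-        Args:
-            tokens: Number of tokens to consume, 1 by default.
-            timeout: Maximum seconds to wait, 2.0 by default.
-
-        Returns:
-            True if consumed before the timeout; False on timeout.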
-        """
-        if tokens <= 0:
-            return True
-
-        deadline = time.monotonic() + timeout
-        while True:
+    def update_rate(self, rpm_limit: int) -> None:
+        new_rate = rpm_limit / 60.0
        with self._lock:
            self._refill()
-                if self._tokens >= tokens:
-                    self._tokens -= tokens
-                    return True
-
-            # Compute the remaining wait time after releasing the lock
-            remaining = deadline - time.monotonic()
-            if remaining <= 0:
-                return False
-            # Sleep until roughly when the next token should be refilled
-            sleep_time = min(remaining, max(0.05, 1.0 / self._rate) if self._rate > 0 else remaining)
-            time.sleep(sleep_time)
-
-    def wait_for_token(self, timeout: float | None = None) -> bool:
-        """Wait for and try to consume one token.
-
-        Args:
-            timeout: Maximum seconds to wait; None waits indefinitely (not recommended).
-
-        Returns:
-            True if consumed; False on timeout.
-        """
-        return self.try_consume(tokens=1, timeout=timeout if timeout is not None else float("inf"))
+            self._rate = new_rate
+            self._capacity = max(rpm_limit, 1)
+            self._tokens = min(self._tokens, float(self._capacity))

    def get_status(self) -> dict[str, Any]:
-        """Return the current token bucket state.
-
-        Returns:
-            A dict containing tokens, capacity, rate_per_minute, and utilization.
-        """
        with self._lock:
            self._refill()
            rate_per_minute = self._rate * 60.0
@@ -190,253 +109,3 @@ class TokenBucket:
                "rate_per_minute": round(rate_per_minute, 1),
                "utilization": round(utilization, 4),
            }
-
-    # ---- Properties ----
-
-    @property
-    def rate(self) -> float:
-        """Current token refill rate (tokens/second)."""
-        return self._rate
-
-    @property
-    def capacity(self) -> int:
-        """Bucket capacity."""
-        return self._capacity
-
-    # ---- Dynamic rate adjustment (used by AdaptiveTokenBucket) ----
-
-    def set_rate(self, rate: float) -> None:
-        """Adjust the token refill rate (tokens/second) at runtime.
-
-        Args:
-            rate: New rate (tokens/second).
-        """
-        with self._lock:
-            self._refill()  # refill at the old rate before switching
-            self._rate = float(rate)
-
-
-# ---------------------------------------------------------------------------
-# Retreat (back-off) mode: AdaptiveTokenBucket (§ADR-009)
-# ---------------------------------------------------------------------------
-
-class RetreatState:
-    """Retreat state machine constants."""
-    NORMAL: str = "normal"
-    RETREAT: str = "retreat"
-    RECOVER: str = "recover"
-
-
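-class AdaptiveTokenBucket(TokenBucket):
-    """Adaptive retreat token bucket (ADR-009).
-
-    Monitors the upstream 429 rate over a 60 s sliding window and adjusts the send rate automatically:
-
-    - 429 rate < 5%   → NORMAL, keep the base rate
-    - 429 rate 5-10%  → RETREAT, rate × 0.75
-    - 429 rate 10-20% → RETREAT, reduce the rate again
-    - 429 rate > 20%  → RETREAT, floor of 5 RPM + alert
-    - 429 rate < 2% for 120 s straight → RECOVER, restore gradually in +2 RPM steps
-
-    Thread-safe; inherits every public method of TokenBucket.
-    """
-
-    # ADR-009 parameters (can be overridden through the constructor)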
-    RETREAT_WINDOW_SECONDS: float = 60.0
-    RETREAT_429_THRESHOLD: float = 0.05
-    RETREAT_FACTOR: float = 0.75
-    RETREAT_MIN_RPM: float = 5.0
-    RECOVER_WINDOW_SECONDS: float = 120.0
-    RECOVER_429_THRESHOLD: float = 0.02
-    RECOVER_INCREMENT_RPM: float = 2.0
-
-    def __init__(
-        self,
-        rate: float = 40 / 60,
-        capacity: int = 40,
-        *,
-        retreat_window_seconds: float = 60.0,
-        retreat_429_threshold: float = 0.05,
-        retreat_factor: float = 0.75,
-        retreat_min_rpm: float = 5.0,
-        recover_window_seconds: float = 120.0,
-        recover_429_threshold: float = 0.02,
-        recover_increment_rpm: float = 2.0,
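-    ) -> None:
-        """Initialize the adaptive retreat token bucket.
-
-        Args:
-            rate: Base token refill rate (tokens per second). Defaults to 40/60 ≈ 0.667 token/s.
-            capacity: Maximum bucket capacity. Defaults to 40.
-            retreat_window_seconds: Sliding-window size for the 429 rate (seconds).
-            retreat_429_threshold: 429-rate threshold that triggers a retreat.
-            retreat_factor: Rate multiplier applied on each retreat step.
-            retreat_min_rpm: Minimum RPM while retreating.
-            recover_window_seconds: Observation-window size for recovery (seconds).
-            recover_429_threshold: 429-rate threshold that triggers recovery.
-            recover_increment_rpm: RPM added on each recovery step.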
-        """
-        super().__init__(rate=rate, capacity=capacity)
-
-        # Base rate (never changes)
-        self._base_rate: float = float(rate)
-
-        # Retreat parameters
-        self.RETREAT_WINDOW_SECONDS = retreat_window_seconds
-        self.RETREAT_429_THRESHOLD = retreat_429_threshold
-        self.RETREAT_FACTOR = retreat_factor
-        self.RETREAT_MIN_RPM = retreat_min_rpm
-        self.RECOVER_WINDOW_SECONDS = recover_window_seconds
-        self.RECOVER_429_THRESHOLD = recover_429_threshold
-        self.RECOVER_INCREMENT_RPM = recover_increment_rpm
-
-        # Retreat state machine
-        self._retreat_state: str = RetreatState.NORMAL
-
-        # 429 sliding window: [(timestamp, is_429), ...]
-        self._429_window: list[tuple[float, bool]] = []
-
-        # Time of the last state change
-        self._last_state_change: float = time.monotonic()
-
-        # Retreat state lock (RLock avoids a re-entrant deadlock when evaluate_retreat() calls get_429_rate())
-        self._retreat_lock: threading.RLock = threading.RLock()
-
-    # ---- 429 feedback ----
-
-    def record_response(self, is_429: bool) -> None:
-        """Record whether an upstream response was a 429.
-
-        Args:
-            is_429: True if the upstream returned 429.
-        """
-        now = time.monotonic()
-        with self._retreat_lock:
-            self._429_window.append((now, is_429))
-            # Drop records older than the longest observation window
-            cutoff = now - max(
-                self.RETREAT_WINDOW_SECONDS,
-                self.RECOVER_WINDOW_SECONDS,
-            )
-            self._429_window = [
-                (ts, flag) for ts, flag in self._429_window
-                if ts >= cutoff
-            ]
-
-    def get_429_rate(self, window_seconds: float | None = None) -> float:
-        """Return the 429 rate within the given window.
-
-        Args:
-            window_seconds: Sliding-window size; None uses RETREAT_WINDOW_SECONDS.
-
-        Returns:
-            The 429 rate, between 0.0 and 1.0.
-        """
-        ws = window_seconds or self.RETREAT_WINDOW_SECONDS
-        now = time.monotonic()
-        with self._retreat_lock:
-            in_window = [flag for ts, flag in self._429_window if now - ts <= ws]
-            if not in_window:
-                return 0.0
-            return sum(1 for f in in_window if f) / len(in_window)
-
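-    # ---- Retreat state evaluation ----
-
-    def evaluate_retreat(self) -> str:
-        """Evaluate and update the retreat state; return the new state name.
-
-        Each call uses the current 429 rate and its duration to decide whether to enter RETREAT or RECOVER.
-
-        Returns:
-            "normal" / "retreat" / "recover".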
-        """
-        now = time.monotonic()
-        with self._retreat_lock:
-            retreat_rate = self.get_429_rate(self.RETREAT_WINDOW_SECONDS)
-            recover_rate = self.get_429_rate(self.RECOVER_WINDOW_SECONDS)
-
-            if self._retreat_state == RetreatState.NORMAL:
-                if retreat_rate >= self.RETREAT_429_THRESHOLD:
-                    self._retreat_state = RetreatState.RETREAT
-                    self._last_state_change = now
-                    self._apply_retreat()
-
-            elif self._retreat_state == RetreatState.RETREAT:
-                # Sustained high 429 rate → reduce the rate again
-                if retreat_rate >= self.RETREAT_429_THRESHOLD * 2:
-                    # 429 rate at or above 10%: reduce the rate again
-                    if self._rate > self.RETREAT_MIN_RPM / 60.0:
-                        self._apply_retreat()
-                elif recover_rate < self.RECOVER_429_THRESHOLD:
-                    time_in_low = now - self._last_state_change
-                    if time_in_low >= self.RECOVER_WINDOW_SECONDS:
-                        self._retreat_state = RetreatState.RECOVER
-                        self._last_state_change = now
-                        self._apply_recover()
-
-            elif self._retreat_state == RetreatState.RECOVER:
-                if retreat_rate >= self.RETREAT_429_THRESHOLD:
-                    # 429 rate rose again during recovery; re-enter retreat
-                    self._retreat_state = RetreatState.RETREAT
-                    self._last_state_change = now
-                    self._apply_retreat()
-                elif self._rate >= self._base_rate:
-                    # Back at the base rate
-                    self._rate = self._base_rate
-                    self._retreat_state = RetreatState.NORMAL
-                    self._last_state_change = now
-                else:
-                    # Keep recovering step by step
-                    self._apply_recover()
-
-            return self._retreat_state
-
-    def _apply_retreat(self) -> None:
-        """Apply one retreat (rate-reduction) step."""
-        new_rate: float = max(
-            self.RETREAT_MIN_RPM / 60.0,
-            self._rate * self.RETREAT_FACTOR,
-        )
-        self._rate = new_rate
-
-    def _apply_recover(self) -> None:
-        """Apply one recovery (rate-increase) step."""
-        increment: float = self.RECOVER_INCREMENT_RPM / 60.0
-        new_rate: float = min(self._base_rate, self._rate + increment)
-        self._rate = new_rate
-
-    # ---- State queries ----
-
-    def get_retreat_state(self) -> str:
-        """Return the current retreat state.
-
-        Returns:
-            "normal" / "retreat" / "recover".
-        """
-        with self._retreat_lock:
-            return self._retreat_state
-
-    def get_effective_rate_rpm(self) -> float:
-        """Return the current effective rate (RPM), including any retreat reduction.
-
-        Returns:
-            The current per-minute rate.
-        """
-        with self._lock:
-            return self._rate * 60.0
-
-    def get_base_rate_rpm(self) -> float:
-        """Return the base rate (RPM), i.e. the rate when not retreating.
-
-        Returns:
-            The base per-minute rate.
-        """
-        return self._base_rate * 60.0
-
-    def reset_to_base(self) -> None:
-        """Manually reset to the base rate (for operator intervention)."""
-        with self._retreat_lock:
-            self._rate = self._base_rate
-            self._retreat_state = RetreatState.NORMAL
-            self._last_state_change = time.monotonic()
-            self._429_window.clear()
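Review note: the retreat arithmetic removed in the hunk above can be condensed into a small standalone sketch for reference. This is not the removed implementation. `SlidingWindow429`, `retreat_step`, and `recover_step` are hypothetical names, timestamps are passed in explicitly instead of being read from `time.monotonic()`, and a single window is used where the original pruned by the longer of the retreat and recover windows. The defaults mirror the ADR-009 values shown above (60 s window, factor 0.75, floor of 5 RPM, +2 RPM per recovery step).

```python
from collections import deque


class SlidingWindow429:
    """Hypothetical helper: keeps (timestamp, is_429) samples for one window."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        self._samples: deque[tuple[float, bool]] = deque()

    def record(self, now: float, is_429: bool) -> None:
        """Append a sample and evict everything older than the window."""
        self._samples.append((now, is_429))
        cutoff = now - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def rate(self, now: float) -> float:
        """Fraction of in-window samples that were 429s (0.0 when empty)."""
        cutoff = now - self.window_seconds
        flags = [flag for ts, flag in self._samples if ts >= cutoff]
        return sum(flags) / len(flags) if flags else 0.0


def retreat_step(rate_rpm: float, factor: float = 0.75, min_rpm: float = 5.0) -> float:
    """One retreat step: scale the rate down, but never below the floor."""
    return max(min_rpm, rate_rpm * factor)


def recover_step(rate_rpm: float, base_rpm: float, increment_rpm: float = 2.0) -> float:
    """One recovery step: add a fixed increment, capped at the base rate."""
    return min(base_rpm, rate_rpm + increment_rpm)
```

With a 40 RPM base rate, one retreat step yields 30 RPM, and one recovery step from there yields 32 RPM; the floor and the cap keep the rate inside the 5 RPM to base-rate range.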
@@ -0,0 +1,6 @@
+# Sidecar V2 — Multi-Pool Provider Proxy
+fastapi>=0.115.0,<1.0.0
+uvicorn[standard]>=0.30.0,<1.0.0
+httpx>=0.27.0,<1.0.0
+structlog>=24.0.0,<25.0.0
+cryptography>=42.0.0,<44.0.0
@@ -0,0 +1,62 @@
+"""Model → Backend routing logic for Sidecar V2."""
+
+import structlog
+from typing import Optional
+
+from storage.models import Backend
+from pool_manager import PoolManager
+from rate_limiter import PerBackendRateLimiter
+
+logger = structlog.get_logger("sidecar_v2.router")
+
+
+class Router:
+    """Routes model requests to the best available backend.
+
+    Pick strategy:
+    1. Primary pool → healthy backends supporting the model
+    2. Rate-limiter check → skip if RPM exhausted
+    3. Fallback pool → repeat above
+    4. If all exhausted → return None (caller handles emergency)
+    """
+
+    def __init__(self, pool_manager: PoolManager, rate_limiter: PerBackendRateLimiter):
+        self._pool_manager = pool_manager
+        self._rate_limiter = rate_limiter
+
+    def pick_backend(self, canonical_model: str) -> Optional[Backend]:
+        """Pick the best available backend for a model.
+
+        Tries primary pool first, then fallback.
+        Within each pool, skips backends at RPM limit.
+        Returns None if no backend available.
+        """
+        # Try pools in order
+        for pool in ["primary", "fallback"]:
+            backends = self._pool_manager.get_available_backends(
+                canonical_model, pool=pool
+            )
+            for backend in backends:
+                # Rate-limit check
+                if self._rate_limiter.consume(
+                    backend.id, backend.rpm_limit
+                ):
+                    return backend
+                # Skip this backend, try next
+                logger.debug(
+                    "backend_rate_limited",
+                    backend_id=backend.id,
+                    pool=pool,
+                    model=canonical_model,
+                )
+
+            if not backends:
+                logger.debug("pool_exhausted", pool=pool, model=canonical_model)
+            else:
+                logger.debug("pool_rpm_exhausted", pool=pool, model=canonical_model)
+
+        return None
+
+    def get_all_pools_exhausted_info(self, canonical_model: str) -> bool:
+        """Check if ALL pools are exhausted for a model."""
+        return not self._pool_manager.is_any_pool_available(canonical_model)
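Review note: the pick strategy in `Router.pick_backend` can be exercised in isolation with stand-ins for its collaborators. The sketch below assumes only the two calls visible in the diff, `get_available_backends(model, pool=...)` and `consume(backend_id, rpm_limit)`. `StubBackend`, `StubPoolManager`, and `StubRateLimiter` are hypothetical test doubles, and the stub limiter never refills, which is a simplification and not a claim about `PerBackendRateLimiter`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubBackend:
    """Hypothetical stand-in exposing only the fields the router reads."""
    id: str
    rpm_limit: int


class StubPoolManager:
    """Returns a fixed backend list per pool, ignoring the model name."""

    def __init__(self, pools: dict[str, list[StubBackend]]) -> None:
        self._pools = pools

    def get_available_backends(self, canonical_model: str, pool: str) -> list[StubBackend]:
        return self._pools.get(pool, [])


class StubRateLimiter:
    """Allows rpm_limit consumes per backend and never refills."""

    def __init__(self) -> None:
        self._used: dict[str, int] = {}

    def consume(self, backend_id: str, rpm_limit: int) -> bool:
        used = self._used.get(backend_id, 0)
        if used >= rpm_limit:
            return False
        self._used[backend_id] = used + 1
        return True


def pick_backend(pool_manager, rate_limiter, canonical_model: str) -> Optional[StubBackend]:
    """Primary pool first, then fallback; skip backends that are at their RPM limit."""
    for pool in ("primary", "fallback"):
        for backend in pool_manager.get_available_backends(canonical_model, pool=pool):
            if rate_limiter.consume(backend.id, backend.rpm_limit):
                return backend
    return None


pools = {"primary": [StubBackend("p1", 1)], "fallback": [StubBackend("f1", 1)]}
manager, limiter = StubPoolManager(pools), StubRateLimiter()
picks = [pick_backend(manager, limiter, "some-model") for _ in range(3)]
```

Here the first pick lands on the primary backend, the second spills over to the fallback once the primary's single request is spent, and the third returns `None`, which is the point at which the caller would take the emergency path.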
@@ -1,327 +0,0 @@
-<!DOCTYPE html>
-<html lang="zh-CN">
-<head>
-  <meta charset="UTF-8">
-  <meta name="viewport" content="width=device-width, initial-scale=1.0">
-  <title>NVIDIA Sidecar — 实时仪表盘</title>
-  <script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.7/dist/chart.umd.min.js"></script>
-  <style>
-    * { margin: 0; padding: 0; box-sizing: border-box; }
-    body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; background: #0f172a; color: #e2e8f0; padding: 24px; }
-    h1 { font-size: 22px; font-weight: 600; margin-bottom: 4px; color: #f8fafc; }
-    .subtitle { color: #94a3b8; font-size: 13px; margin-bottom: 24px; }
-    .grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(380px, 1fr)); gap: 20px; margin-bottom: 24px; }
-    .card { background: #1e293b; border-radius: 12px; padding: 20px; border: 1px solid #334155; }
-    .card h2 { font-size: 15px; font-weight: 600; color: #94a3b8; margin-bottom: 14px; text-transform: uppercase; letter-spacing: 0.05em; }
-    .card canvas { max-height: 220px; }
-    .stat-row { display: flex; gap: 16px; flex-wrap: wrap; }
-    .stat { flex: 1; min-width: 100px; background: #0f172a; border-radius: 8px; padding: 12px; text-align: center; border: 1px solid #334155; }
-    .stat .value { font-size: 28px; font-weight: 700; color: #38bdf8; }
-    .stat .label { font-size: 11px; color: #64748b; margin-top: 4px; text-transform: uppercase; }
-    .stat.warn .value { color: #f59e0b; }
-    .stat.danger .value { color: #ef4444; }
-    .retreat-badge { display: inline-block; padding: 2px 10px; border-radius: 999px; font-size: 12px; font-weight: 600; }
-    .retreat-badge.normal { background: #065f46; color: #6ee7b7; }
-    .retreat-badge.retreat { background: #78350f; color: #fbbf24; }
-    .retreat-badge.recover { background: #1e3a5f; color: #60a5fa; }
-    .config-panel { background: #1e293b; border-radius: 12px; padding: 20px; border: 1px solid #334155; }
-    .config-panel h2 { font-size: 15px; font-weight: 600; color: #94a3b8; margin-bottom: 14px; text-transform: uppercase; letter-spacing: 0.05em; }
-    .config-row { display: flex; align-items: center; gap: 12px; margin-bottom: 12px; flex-wrap: wrap; }
-    .config-row label { min-width: 100px; font-size: 13px; color: #cbd5e1; }
-    .config-row input, .config-row select { background: #0f172a; border: 1px solid #334155; border-radius: 6px; color: #e2e8f0; padding: 6px 10px; font-size: 13px; }
-    .config-row input[type="range"] { width: 140px; }
-    .config-row button { background: #38bdf8; color: #0f172a; border: none; border-radius: 6px; padding: 6px 16px; font-size: 13px; font-weight: 600; cursor: pointer; }
-    .config-row button:hover { background: #7dd3fc; }
-    .config-row button:disabled { background: #475569; cursor: not-allowed; }
-    .toast { position: fixed; top: 16px; right: 16px; padding: 10px 20px; border-radius: 8px; font-size: 13px; z-index: 999; animation: fadeInOut 3s; }
-    .toast.success { background: #065f46; color: #6ee7b7; }
-    .toast.error { background: #7f1d1d; color: #fca5a5; }
-    @keyframes fadeInOut { 0% { opacity: 0; transform: translateY(-8px); } 10% { opacity: 1; transform: translateY(0); } 80% { opacity: 1; } 100% { opacity: 0; } }
-    .disconnected { background: #7f1d1d; color: #fca5a5; padding: 4px 10px; border-radius: 4px; font-size: 12px; display: inline-block; margin-left: 8px; }
-    .connected { background: #065f46; color: #6ee7b7; padding: 4px 10px; border-radius: 4px; font-size: 12px; display: inline-block; margin-left: 8px; }
-
-    /* BIZ-46 Phase3: 300ms smooth animation for the queue bar chart */
-    .queue-bar { transition: height 0.3s ease; }
-
-    /* BIZ-46 Phase3: translucent mask after 5s of SSE disconnection */
-    #reconnect-mask {
-      display: none;
-      position: fixed;
-      top: 0; left: 0; right: 0; bottom: 0;
-      background: rgba(15, 23, 42, 0.85);
-      z-index: 1000;
-      justify-content: center;
-      align-items: center;
-      flex-direction: column;
-    }
-    #reconnect-mask.visible { display: flex; }
-    #reconnect-mask .mask-icon { font-size: 48px; margin-bottom: 16px; }
-    #reconnect-mask .mask-text { color: #94a3b8; font-size: 16px; font-weight: 500; }
-    #reconnect-mask .mask-sub { color: #64748b; font-size: 13px; margin-top: 8px; }
-  </style>
-</head>
-<body>
-  <!-- BIZ-46 Phase3: SSE disconnection mask -->
-  <div id="reconnect-mask">
-    <div class="mask-icon">⚠️</div>
-    <div class="mask-text">数据暂不可用</div>
-    <div class="mask-sub">SSE 连接中断，正在重连…</div>
-  </div>
-
-  <h1>🚀 NVIDIA Sidecar 实时仪表盘
-    <span id="conn-status" class="connected">已连接</span>
-  </h1>
-  <p class="subtitle">令牌桶限流 · 优先级队列 · 避退模式 · 实时监控</p>
-
-  <!-- Status cards -->
-  <div class="stat-row" style="margin-bottom: 24px;">
-    <div class="stat"><div class="value" id="val-total">0</div><div class="label">总请求</div></div>
-    <div class="stat"><div class="value" id="val-nvidia">0</div><div class="label">NVIDIA 请求</div></div>
-    <div class="stat"><div class="value" id="val-rate">0</div><div class="label">当前 RPM</div></div>
-    <div class="stat"><div class="value" id="val-429">0%</div><div class="label">上游 429 率</div></div>
-    <div class="stat"><div class="value" id="val-retreat">正常</div><div class="label">避退状态</div></div>
-    <div class="stat"><div class="value" id="val-uptime">0s</div><div class="label">运行时间</div></div>
-  </div>
-
-  <!-- Charts -->
-  <div class="grid">
-    <div class="card">
-      <h2>📊 令牌桶使用率</h2>
-      <canvas id="chart-tokens"></canvas>
-    </div>
-    <div class="card">
-      <!-- BIZ-46 Phase3: queue chart title shows the total queued count -->
-      <h2>📈 队列深度 <span id="queue-total" style="font-size:13px;color:#38bdf8;">(共 0)</span></h2>
-      <canvas id="chart-queue"></canvas>
-    </div>
-    <div class="card">
-      <h2>📉 请求吞吐量 (最近 20 点)</h2>
-      <canvas id="chart-throughput"></canvas>
-    </div>
-    <div class="card">
-      <h2>⚙️ 速率历史</h2>
-      <canvas id="chart-rate"></canvas>
-    </div>
-  </div>
-
-  <!-- Config panel -->
-  <div class="config-panel">
-    <h2>🔧 实时配置</h2>
-    <div class="config-row">
-      <label>速率 (RPM)</label>
-      <input type="range" id="cfg-rate-rpm" min="1" max="100" value="40" oninput="document.getElementById('cfg-rate-val').textContent=this.value">
-      <span id="cfg-rate-val" style="min-width:30px;">40</span>
-    </div>
-    <div class="config-row">
-      <label>队列上限</label>
-      <input type="number" id="cfg-queue-max" value="500" min="1" max="2000" style="width:80px;">
-    </div>
-    <div class="config-row">
-      <button onclick="applyConfig()">应用配置</button>
-    </div>
-  </div>
-
-<script>
-// SSE connection
-let evtSource = null;
-let dataHistory = { throughput: [], rates: [] };
-const MAX_HISTORY = 20;
-let lastSSETime = Date.now();
-
-// BIZ-46 Phase3: show the mask after 5s without an SSE message
-function checkReconnect() {
-  const mask = document.getElementById('reconnect-mask');
-  if (Date.now() - lastSSETime > 5000) {
-    mask.classList.add('visible');
-  }
-}
-setInterval(checkReconnect, 1000);
-
-function connectSSE() {
-  if (evtSource) evtSource.close();
-  evtSource = new EventSource('/api/dashboard/stream');
-  evtSource.onmessage = (e) => {
-    try {
-      const snap = JSON.parse(e.data);
-      lastSSETime = Date.now();
-      // Hide the disconnection mask
-      document.getElementById('reconnect-mask').classList.remove('visible');
-      updateDashboard(snap);
-      document.getElementById('conn-status').className = 'connected';
-      document.getElementById('conn-status').textContent = '已连接';
-    } catch (err) {
-      document.getElementById('conn-status').className = 'disconnected';
-      document.getElementById('conn-status').textContent = '解析错误';
-    }
-  };
-  evtSource.onerror = () => {
-    document.getElementById('conn-status').className = 'disconnected';
-    document.getElementById('conn-status').textContent = '断开 - 重连中';
-  };
-}
-
-// Initialize Chart.js
-const ctxTokens = document.getElementById('chart-tokens').getContext('2d');
-const chartTokens = new Chart(ctxTokens, {
-  type: 'doughnut',
-  data: {
-    labels: ['已用令牌', '可用令牌'],
-    datasets: [{ data: [0, 40], backgroundColor: ['#ef4444', '#22c55e'], borderWidth: 0 }]
-  },
-  options: { responsive: true, maintainAspectRatio: true, cutout: '65%', plugins: { legend: { position: 'bottom', labels: { color: '#94a3b8' } } },
-    // BIZ-46 Phase3: 300ms smooth animation
-    animation: { duration: 300 } }
-});
-
-const ctxQueue = document.getElementById('chart-queue').getContext('2d');
-const chartQueue = new Chart(ctxQueue, {
-  type: 'bar',
-  data: {
-    labels: ['URGENT', 'HIGH', 'NORMAL', 'LOW'],
-    datasets: [{ label: '排队数', data: [0, 0, 0, 0], backgroundColor: ['#ef4444', '#f59e0b', '#38bdf8', '#a78bfa'] }]
-  },
-  options: { responsive: true, maintainAspectRatio: true,
-    scales: { y: { beginAtZero: true, ticks: { color: '#94a3b8' } }, x: { ticks: { color: '#94a3b8' } } },
-    plugins: { legend: { display: false } },
-    // BIZ-46 Phase3: 300ms smooth animation
-    animation: { duration: 300 } }
-});
-
-const ctxThroughput = document.getElementById('chart-throughput').getContext('2d');
-const chartThroughput = new Chart(ctxThroughput, {
-  type: 'line',
-  data: { labels: [], datasets: [
-    { label: '成功', data: [], borderColor: '#22c55e', backgroundColor: '#22c55e20', fill: false, tension: 0.3, pointRadius: 2 },
-    { label: '429', data: [], borderColor: '#f59e0b', backgroundColor: '#f59e0b20', fill: false, tension: 0.3, pointRadius: 2 },
-    { label: '直通', data: [], borderColor: '#a78bfa', backgroundColor: '#a78bfa20', fill: false, tension: 0.3, pointRadius: 2 },
-  ]},
-  options: { responsive: true, maintainAspectRatio: true,
-    scales: { y: { beginAtZero: true, ticks: { color: '#94a3b8' } }, x: { ticks: { color: '#94a3b8' } } },
-    plugins: { legend: { position: 'bottom', labels: { color: '#94a3b8' } } },
-    animation: { duration: 300 } }
-});
-
-const ctxRate = document.getElementById('chart-rate').getContext('2d');
-const chartRate = new Chart(ctxRate, {
-  type: 'line',
-  data: { labels: [], datasets: [
-    { label: '有效 RPM', data: [], borderColor: '#38bdf8', fill: false, tension: 0.3, pointRadius: 2 },
-    { label: '基准 RPM', data: [], borderColor: '#64748b', fill: false, tension: 0.3, pointRadius: 2, borderDash: [4, 4] },
-  ]},
-  options: { responsive: true, maintainAspectRatio: true,
-    scales: { y: { beginAtZero: true, ticks: { color: '#94a3b8' } }, x: { ticks: { color: '#94a3b8' } } },
-    plugins: { legend: { position: 'bottom', labels: { color: '#94a3b8' } } },
-    animation: { duration: 300 } }
-});
-
-function updateDashboard(snap) {
-  const r = snap.requests || {};
-  const tb = snap.token_bucket || {};
-  const rt = snap.retreat || {};
-
-  document.getElementById('val-total').textContent = (r.total || 0).toLocaleString();
-  document.getElementById('val-nvidia').textContent = (r.nvidia || 0).toLocaleString();
-  document.getElementById('val-rate').textContent = Math.round(rt.effective_rpm || 40);
-  document.getElementById('val-429').textContent = ((rt.upstream_429_rate || 0) * 100).toFixed(1) + '%';
-  document.getElementById('val-uptime').textContent = fmtDuration(snap.uptime_seconds || 0);
-
-  const retreatEl = document.getElementById('val-retreat');
-  const state = rt.state || 'normal';
-  retreatEl.textContent = state === 'retreat' ? '⚠️ 避退' : state === 'recover' ? '↗ 恢复中' : '✅ 正常';
-  retreatEl.style.color = state === 'retreat' ? '#f59e0b' : state === 'recover' ? '#60a5fa' : '#22c55e';
-
-  chartTokens.data.datasets[0].data = [
-    Math.round((tb.capacity || 40) - (tb.tokens ?? 40)),
-    Math.round(tb.tokens || 0)
-  ];
-  chartTokens.update();
-
-  const qs = snap.queue || {};
-  const perPriority = qs.per_priority || {};
-  const totalQueued = perPriority.URGENT + perPriority.HIGH + perPriority.NORMAL + perPriority.LOW || qs.current_size || 0;
-  chartQueue.data.datasets[0].data = [
-    perPriority.URGENT || 0,
-    perPriority.HIGH || 0,
-    perPriority.NORMAL || 0,
-    perPriority.LOW || 0
-  ];
-  chartQueue.update();
-
-  // BIZ-46 Phase3: queue chart title shows the total queued count
-  document.getElementById('queue-total').textContent = '(共 ' + totalQueued + ')';
-
-  const now = new Date().toLocaleTimeString();
-  const prev = dataHistory.throughput.length > 0 ? dataHistory.throughput[dataHistory.throughput.length - 1].nvidia : 0;
-  const throughput = Math.max(0, (r.nvidia || 0) - prev);
-
-  dataHistory.throughput.push({ time: now, nvidia: throughput, ratelimited: r.ratelimited || 0, passthrough: r.passthrough || 0 });
-  dataHistory.rates.push({ time: now, effective: rt.effective_rpm || 40, base: rt.base_rpm || 40 });
-  if (dataHistory.throughput.length > MAX_HISTORY) dataHistory.throughput.shift();
-  if (dataHistory.rates.length > MAX_HISTORY) dataHistory.rates.shift();
-
-  chartThroughput.data.labels = dataHistory.throughput.map(d => d.time);
-  chartThroughput.data.datasets[0].data = dataHistory.throughput.map(d => d.nvidia);
-  chartThroughput.data.datasets[1].data = dataHistory.throughput.map(d => d.ratelimited);
-  chartThroughput.data.datasets[2].data = dataHistory.throughput.map(d => d.passthrough);
-  chartThroughput.update();
-
-  chartRate.data.labels = dataHistory.rates.map(d => d.time);
-  chartRate.data.datasets[0].data = dataHistory.rates.map(d => d.effective);
-  chartRate.data.datasets[1].data = dataHistory.rates.map(d => d.base);
-  chartRate.update();
-}
-
-function fmtDuration(s) {
-  if (s < 60) return s + 's';
-  if (s < 3600) return Math.floor(s/60) + 'm ' + (s%60) + 's';
-  return Math.floor(s/3600) + 'h ' + Math.floor((s%3600)/60) + 'm';
-}
-
-async function applyConfig() {
-  const btn = document.querySelector('.config-row button');
-  btn.disabled = true;
-  try {
-    const resp = await fetch('/api/admin/config', {
-      method: 'POST',
-      headers: { 'Content-Type': 'application/json' },
-      body: JSON.stringify({
-        rate_rpm: parseInt(document.getElementById('cfg-rate-rpm').value),
-        queue_max_size: parseInt(document.getElementById('cfg-queue-max').value),
-      })
-    });
-    const result = await resp.json();
-    showToast(resp.ok ? 'success' : 'error', resp.ok ? '配置已更新' : (result.detail || '配置更新失败'));
-  } catch (err) {
-    showToast('error', '请求失败: ' + err.message);
-  }
-  btn.disabled = false;
-}
-
-function showToast(type, msg) {
-  const t = document.createElement('div');
-  t.className = 'toast ' + type;
-  t.textContent = msg;
-  document.body.appendChild(t);
-  setTimeout(() => t.remove(), 3000);
-}
-
-// BIZ-46 Phase3: sync the current config values on page load
-async function loadConfig() {
-  try {
-    const resp = await fetch('/api/admin/config');
-    if (resp.ok) {
-      const config = await resp.json();
-      document.getElementById('cfg-rate-rpm').value = config.rate_rpm || 40;
-      document.getElementById('cfg-rate-val').textContent = config.rate_rpm || 40;
-      document.getElementById('cfg-queue-max').value = config.queue_max_size || 500;
-    }
-  } catch (e) {
-    console.warn('配置加载失败（可能需要 Admin Token）', e);
-  }
-}
-
-loadConfig();
-connectSSE();
-</script>
-</body>
-</html>
@@ -0,0 +1 @@
+# Sidecar V2 storage module
@@ -0,0 +1,252 @@
+"""CRUD operations for Backend (provider) management."""
+
+import json
+import time
+from typing import Optional
+
+from storage.db import get_connection, generate_id
+from storage.models import Backend, ModelMapping
+from crypto import encrypt, decrypt
+
+
+def create_backend(backend: Backend) -> Backend:
+    """Create a new backend. Encrypts API key before storage."""
+    if not backend.id:
+        backend.id = generate_id("bkd")
+
+    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+    backend.created_at = now
+    backend.updated_at = now
+
+    api_key_encrypted = encrypt(backend.api_key_plain)
+
+    with get_connection() as conn:
+        conn.execute(
+            """INSERT INTO backends (id, name, label, api_base_url, api_key_encrypted,
+               api, timeout_seconds, rpm_limit, pool, enabled, status, model_mappings_json,
+               source, cooldown_until, consecutive_429_count, metadata_json, created_at, updated_at)
+               VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
+            (
+                backend.id, backend.name, backend.label, backend.api_base_url,
+                api_key_encrypted, backend.api, backend.timeout_seconds,
+                backend.rpm_limit, backend.pool, 1 if backend.enabled else 0,
+                backend.status, json.dumps(_mappings_to_dict(backend.model_mappings)),
+                backend.source, backend.cooldown_until,
+                backend.consecutive_429_count,
+                json.dumps(backend.metadata), backend.created_at, backend.updated_at,
+            ),
+        )
+        conn.commit()
+
+    return backend
+
+
+def get_backend(backend_id: str, decrypt_key: bool = True) -> Optional[Backend]:
+    """Get a single backend by ID."""
+    with get_connection() as conn:
+        row = conn.execute(
+            "SELECT * FROM backends WHERE id = ?", (backend_id,)
+        ).fetchone()
+
+    if row is None:
+        return None
+
+    return _row_to_backend(row, decrypt_key=decrypt_key)
+
+
+def list_backends(
+    pool: Optional[str] = None,
+    enabled_only: bool = False,
+    decrypt_key: bool = False,
+) -> list[Backend]:
+    """List backends, optionally filtered by pool and by enabled flag."""
+    with get_connection() as conn:
+        if pool:
+            rows = conn.execute(
+                "SELECT * FROM backends WHERE pool = ? ORDER BY created_at",
+                (pool,),
+            ).fetchall()
+        else:
+            rows = conn.execute(
+                "SELECT * FROM backends ORDER BY pool, created_at"
+            ).fetchall()
+
+    backends = [_row_to_backend(r, decrypt_key=decrypt_key) for r in rows]
+    if enabled_only:
+        backends = [b for b in backends if b.enabled]
+    return backends
+
+
+def update_backend(backend_id: str, updates: dict) -> Optional[Backend]:
+    """Update whitelisted backend fields; re-encrypt the key if api_key_plain is provided."""
+    current = get_backend(backend_id, decrypt_key=True)
+    if current is None:
+        return None
+
+    # Apply updates
+    allowed = {
+        "name", "label", "api_base_url", "api", "timeout_seconds",
+        "rpm_limit", "pool", "enabled", "status", "source",
+        "cooldown_until", "consecutive_429_count", "metadata",
+    }
+    for key, value in updates.items():
+        if key in allowed:
+            setattr(current, key, value)
+
+    current.updated_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+
+    # Handle API key update
+    api_key_encrypted = None
+    if "api_key_plain" in updates and updates["api_key_plain"]:
+        current.api_key_plain = updates["api_key_plain"]
+        api_key_encrypted = encrypt(updates["api_key_plain"])
+
+    # Handle model_mappings update
+    mappings_json = None
+    if "model_mappings" in updates:
+        current.model_mappings = updates["model_mappings"]
+        mappings_json = json.dumps(_mappings_to_dict(current.model_mappings))
+
+    with get_connection() as conn:
+        # Build dynamic UPDATE
+        set_clauses = [
+            "name = ?", "label = ?", "api_base_url = ?", "api = ?",
+            "timeout_seconds = ?", "rpm_limit = ?", "pool = ?", "enabled = ?",
+            "status = ?", "source = ?", "cooldown_until = ?",
+            "consecutive_429_count = ?", "metadata_json = ?", "updated_at = ?",
+        ]
+        params = [
+            current.name, current.label, current.api_base_url, current.api,
+            current.timeout_seconds, current.rpm_limit, current.pool,
+            1 if current.enabled else 0, current.status, current.source,
+            current.cooldown_until, current.consecutive_429_count,
+            json.dumps(current.metadata), current.updated_at,
+        ]
+        if api_key_encrypted:
+            set_clauses.append("api_key_encrypted = ?")
+            params.append(api_key_encrypted)
+        if mappings_json is not None:
+            set_clauses.append("model_mappings_json = ?")
+            params.append(mappings_json)
+        params.append(backend_id)
+
+        conn.execute(
+            f"UPDATE backends SET {', '.join(set_clauses)} WHERE id = ?",
+            params,
+        )
+        conn.commit()
+
+    return get_backend(backend_id, decrypt_key=False)
+
+
+def delete_backend(backend_id: str) -> bool:
+    """Delete a backend. Returns True if deleted."""
+    with get_connection() as conn:
+        cursor = conn.execute("DELETE FROM backends WHERE id = ?", (backend_id,))
+        conn.commit()
+        return cursor.rowcount > 0
+
+
+def set_backend_status(backend_id: str, status: str) -> bool:
+    """Quickly set backend status (healthy/cooling/error/disabled)."""
+    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+    with get_connection() as conn:
+        cursor = conn.execute(
+            "UPDATE backends SET status = ?, updated_at = ? WHERE id = ?",
+            (status, now, backend_id),
+        )
+        conn.commit()
+        return cursor.rowcount > 0
+
+
+def set_backend_cooldown(backend_id: str, cooldown_until: str, count: int) -> bool:
+    """Set cooldown state on a backend."""
+    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+    with get_connection() as conn:
+        cursor = conn.execute(
+            """UPDATE backends SET status = 'cooling', cooldown_until = ?,
+               consecutive_429_count = ?, updated_at = ? WHERE id = ?""",
+            (cooldown_until, count, now, backend_id),
+        )
+        conn.commit()
+        return cursor.rowcount > 0
+
+
+def clear_backend_cooldown(backend_id: str) -> bool:
+    """Clear cooldown (back to healthy)."""
+    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+    with get_connection() as conn:
+        cursor = conn.execute(
+            """UPDATE backends SET status = 'healthy', cooldown_until = NULL,
+               consecutive_429_count = 0, updated_at = ? WHERE id = ?""",
+            (now, backend_id),
+        )
+        conn.commit()
+        return cursor.rowcount > 0
+
+
+def get_pool_stats() -> dict:
+    """Get summary stats per pool."""
+    with get_connection() as conn:
+        rows = conn.execute(
+            """SELECT pool, COUNT(*) as total,
+               SUM(CASE WHEN enabled = 1 THEN 1 ELSE 0 END) as enabled,
+               SUM(CASE WHEN status = 'healthy' THEN 1 ELSE 0 END) as healthy,
+               SUM(CASE WHEN status = 'cooling' THEN 1 ELSE 0 END) as cooling,
+               SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) as error
+               FROM backends GROUP BY pool"""
+        ).fetchall()
+    stats = {}
+    for row in rows:
+        stats[row["pool"]] = {
+            "total": row["total"],
+            "enabled": row["enabled"],
+            "healthy": row["healthy"],
+            "cooling": row["cooling"],
+            "error": row["error"],
+        }
+    return stats
+
+
+def _row_to_backend(row, decrypt_key: bool = True) -> Backend:
+    """Convert a DB row to a Backend instance."""
+    mappings_raw = row["model_mappings_json"] or "{}"
+    mappings_dict = json.loads(mappings_raw)
+
+    model_mappings = {}
+    for canonical_name, mm in mappings_dict.items():
+        model_mappings[canonical_name] = ModelMapping.from_dict(mm)
+
+    backend = Backend(
+        id=row["id"],
+        name=row["name"],
+        label=row["label"],
+        api_base_url=row["api_base_url"],
+        api_key_encrypted=row["api_key_encrypted"] or "",
+        api=row["api"],
+        timeout_seconds=row["timeout_seconds"],
+        rpm_limit=row["rpm_limit"],
+        pool=row["pool"],
+        enabled=bool(row["enabled"]),
+        status=row["status"],
+        model_mappings=model_mappings,
+        source=row["source"],
+        cooldown_until=row["cooldown_until"],
+        consecutive_429_count=row["consecutive_429_count"],
+        metadata=json.loads(row["metadata_json"] or "{}"),
+        created_at=row["created_at"],
+        updated_at=row["updated_at"],
+    )
+
+    if decrypt_key and backend.api_key_encrypted:
+        from crypto import try_decrypt_existing
+        plain = try_decrypt_existing(backend.api_key_encrypted)
+        if plain:
+            backend.api_key_plain = plain
+
+    return backend
+
+
+def _mappings_to_dict(mappings: dict[str, ModelMapping]) -> dict:
+    """Convert ModelMapping dict to JSON-safe dict."""
+    return {k: v.to_dict() for k, v in mappings.items()}
@@ -0,0 +1,55 @@
+"""System configuration KV store operations."""
+
+import time
+from typing import Optional, Any
+
+from storage.db import get_connection
+
+
+def get_config(key: str) -> Optional[str]:
+    """Get a single config value."""
+    with get_connection() as conn:
+        row = conn.execute(
+            "SELECT value FROM system_config WHERE key = ?", (key,)
+        ).fetchone()
+    return row["value"] if row else None
+
+
+def set_config(key: str, value: str, description: str = "") -> None:
+    """Set or update a config value."""
+    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+    with get_connection() as conn:
+        conn.execute(
+            """INSERT INTO system_config (key, value, description, updated_at)
+               VALUES (?, ?, ?, ?)
+               ON CONFLICT(key) DO UPDATE SET
+               value = excluded.value,
+               description = excluded.description,
+               updated_at = excluded.updated_at""",
+            (key, value, description, now),
+        )
+        conn.commit()
+
+
+def delete_config(key: str) -> bool:
+    """Delete a config value."""
+    with get_connection() as conn:
+        cursor = conn.execute(
+            "DELETE FROM system_config WHERE key = ?", (key,)
+        )
+        conn.commit()
+        return cursor.rowcount > 0
+
+
+def list_configs() -> list[dict]:
+    """List all system config entries."""
+    with get_connection() as conn:
+        rows = conn.execute("SELECT * FROM system_config ORDER BY key").fetchall()
+    return [dict(row) for row in rows]
+
+
+def get_all_configs_as_dict() -> dict[str, str]:
+    """Get all configs as a simple dict."""
+    with get_connection() as conn:
+        rows = conn.execute("SELECT key, value FROM system_config").fetchall()
+    return {row["key"]: row["value"] for row in rows}
@@ -0,0 +1,74 @@
+"""Cooldown event logging."""
+
+import time
+from typing import Optional
+
+from storage.db import get_connection, generate_id
+from storage.models import CooldownEvent
+
+
+def log_cooldown_event(
+    backend_id: str,
+    consecutive_count: int,
+    cooldown_seconds: int,
+    response_summary: str = "",
+) -> CooldownEvent:
+    """Record a cooldown event."""
+    event = CooldownEvent(
+        id=generate_id("cev"),
+        backend_id=backend_id,
+        consecutive_count=consecutive_count,
+        cooldown_seconds=cooldown_seconds,
+        response_summary=response_summary,
+        started_at=time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
+    )
+
+    with get_connection() as conn:
+        conn.execute(
+            """INSERT INTO cooldown_events
+               (id, backend_id, consecutive_count, cooldown_seconds,
+                response_summary, started_at)
+               VALUES (?, ?, ?, ?, ?, ?)""",
+            (event.id, event.backend_id, event.consecutive_count,
+             event.cooldown_seconds, event.response_summary, event.started_at),
+        )
+        conn.commit()
+
+    return event
+
+
+def end_cooldown_event(backend_id: str) -> bool:
+    """Mark the latest open cooldown event as ended."""
+    ended_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
+    with get_connection() as conn:
+        # Subquery: UPDATE ... ORDER BY/LIMIT needs a non-default SQLite build
+        cursor = conn.execute(
+            """UPDATE cooldown_events SET ended_at = ? WHERE id = (
+               SELECT id FROM cooldown_events WHERE backend_id = ? AND ended_at IS NULL
+               ORDER BY started_at DESC LIMIT 1)""",
+            (ended_at, backend_id),
+        )
+        conn.commit()
+        return cursor.rowcount > 0
+
+
+def get_cooldown_history(
+    backend_id: Optional[str] = None,
+    limit: int = 50,
+) -> list[dict]:
+    """Get cooldown event history."""
+    with get_connection() as conn:
+        if backend_id:
+            rows = conn.execute(
+                """SELECT * FROM cooldown_events
+                   WHERE backend_id = ?
+                   ORDER BY started_at DESC LIMIT ?""",
+                (backend_id, limit),
+            ).fetchall()
+        else:
+            rows = conn.execute(
+                """SELECT * FROM cooldown_events
+                   ORDER BY started_at DESC LIMIT ?""",
+                (limit,),
+            ).fetchall()
+    return [dict(row) for row in rows]
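Review note on `end_cooldown_event`: SQLite accepts `UPDATE ... ORDER BY ... LIMIT` only when it was compiled with `SQLITE_ENABLE_UPDATE_DELETE_LIMIT`, so a portable way to close just the latest open event is to pick its `id` in a subquery. The sketch below is an illustration, not code from this commit: the table is a trimmed-down stand-in for `cooldown_events`, and the names `end_latest_open_event` and `demo_end_latest` are hypothetical.

```python
import sqlite3


def end_latest_open_event(conn: sqlite3.Connection, backend_id: str, ended_at: str) -> bool:
    """Close only the most recent open event for one backend (hypothetical helper)."""
    cursor = conn.execute(
        """UPDATE cooldown_events SET ended_at = ?
           WHERE id = (
               SELECT id FROM cooldown_events
               WHERE backend_id = ? AND ended_at IS NULL
               ORDER BY started_at DESC LIMIT 1
           )""",
        (ended_at, backend_id),
    )
    conn.commit()
    # rowcount is 0 when the subquery finds no open event (id = NULL matches nothing)
    return cursor.rowcount > 0


def demo_end_latest() -> list[tuple]:
    conn = sqlite3.connect(":memory:")
    # Reduced schema: only the columns the UPDATE touches
    conn.execute(
        """CREATE TABLE cooldown_events (
               id TEXT PRIMARY KEY,
               backend_id TEXT NOT NULL,
               started_at TEXT NOT NULL,
               ended_at TEXT)"""
    )
    conn.executemany(
        "INSERT INTO cooldown_events (id, backend_id, started_at) VALUES (?, ?, ?)",
        [
            ("cev_old", "b1", "2026-06-25T08:00:00Z"),
            ("cev_new", "b1", "2026-06-25T09:00:00Z"),
        ],
    )
    end_latest_open_event(conn, "b1", "2026-06-25T09:05:00Z")
    rows = conn.execute(
        "SELECT id, ended_at FROM cooldown_events ORDER BY started_at"
    ).fetchall()
    conn.close()
    return rows
```

Only `cev_new` receives an `ended_at` value here; the older open event is left untouched, which matches the docstring's "latest open cooldown event".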
@@ -0,0 +1,193 @@
+"""SQLite database connection management with WAL mode."""
+
+import os
+import sqlite3
+import uuid
+import structlog
+from contextlib import contextmanager
+from typing import Generator
+
+from config import config
+
+logger = structlog.get_logger()
+
+# Module-level DB path
+_DB_PATH: str = ""
+
+
+def init_db(db_path: str = "") -> None:
+    """Initialize the database connection and ensure WAL mode.
+
+    Creates the data directory if needed; integrity is verified separately by run_integrity_check().
+    """
+    global _DB_PATH
+    _DB_PATH = db_path or config.db_path
+
+    # Ensure data directory exists (abspath keeps dirname non-empty for bare filenames)
+    os.makedirs(os.path.dirname(os.path.abspath(_DB_PATH)), exist_ok=True)
+
+    # Test connection and enable WAL
+    conn = _get_raw_connection()
+    try:
+        conn.execute("PRAGMA journal_mode=WAL")
+        conn.execute("PRAGMA wal_autocheckpoint=1000")
+        conn.execute("PRAGMA foreign_keys=ON")
+        conn.execute("PRAGMA busy_timeout=5000")
+        logger.info("db_initialized", path=_DB_PATH, mode="WAL")
+    finally:
+        conn.close()
+
+
+def _get_raw_connection() -> sqlite3.Connection:
+    """Get a raw sqlite3 connection."""
+    conn = sqlite3.connect(_DB_PATH, check_same_thread=False)
+    conn.row_factory = sqlite3.Row
+    conn.execute("PRAGMA journal_mode=WAL")
+    conn.execute("PRAGMA foreign_keys=ON")
+    return conn
+
+
+@contextmanager
+def get_connection() -> Generator[sqlite3.Connection, None, None]:
+    """Get a database connection with WAL enabled."""
+    conn = _get_raw_connection()
+    try:
+        yield conn
+    finally:
+        conn.close()
+
+
+def generate_id(prefix: str = "") -> str:
+    """Generate a unique ID with optional prefix."""
+    uid = uuid.uuid4().hex[:12]
+    return f"{prefix}_{uid}" if prefix else uid
+
+
+def create_tables() -> None:
+    """Create all tables if they don't exist."""
+    with get_connection() as conn:
+        conn.executescript(_DDL)
+        conn.commit()
+        logger.info("tables_created")
+
+
+def run_integrity_check() -> bool:
+    """Run PRAGMA integrity_check and return True if OK."""
+    with get_connection() as conn:
+        result = conn.execute("PRAGMA integrity_check").fetchone()
+        ok = result[0] == "ok"
+        if not ok:
+            logger.error("integrity_check_failed", result=result[0])
+        return ok
+
+
+def get_db_sizes() -> dict:
+    """Get database and WAL file sizes."""
+    result = {"db_bytes": 0, "wal_bytes": 0}
+    db_path = _DB_PATH
+    if os.path.exists(db_path):
+        result["db_bytes"] = os.path.getsize(db_path)
+    wal_path = db_path + "-wal"
+    if os.path.exists(wal_path):
+        result["wal_bytes"] = os.path.getsize(wal_path)
+    return result
+
+
+def wal_checkpoint(mode: str = "TRUNCATE") -> None:
+    """Execute WAL checkpoint."""
+    with get_connection() as conn:
+        conn.execute(f"PRAGMA wal_checkpoint({mode})")
+
+
+_DDL = """
+-- Backend configuration table (core)
+CREATE TABLE IF NOT EXISTS backends (
+    id TEXT PRIMARY KEY,
+    name TEXT NOT NULL,
+    label TEXT DEFAULT '',
+    api_base_url TEXT NOT NULL,
+    api_key_encrypted TEXT NOT NULL,
+    api TEXT NOT NULL DEFAULT 'openai-completions',
+    timeout_seconds INTEGER NOT NULL DEFAULT 120,
+    rpm_limit INTEGER NOT NULL DEFAULT 40,
+    pool TEXT NOT NULL DEFAULT 'primary'
+        CHECK(pool IN ('primary', 'fallback')),
+    enabled INTEGER NOT NULL DEFAULT 1,
+    status TEXT NOT NULL DEFAULT 'healthy'
+        CHECK(status IN ('healthy', 'cooling', 'error', 'disabled')),
+    model_mappings_json TEXT DEFAULT '{}',
+    source TEXT NOT NULL DEFAULT 'webui'
+        CHECK(source IN ('webui', 'env', 'import')),
+    cooldown_until TEXT,
+    consecutive_429_count INTEGER DEFAULT 0,
+    metadata_json TEXT DEFAULT '{}',
+    created_at TEXT NOT NULL DEFAULT (datetime('now')),
+    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
+);
+
+-- Usage logs (hour-bucketed, UPSERT-safe)
+CREATE TABLE IF NOT EXISTS backend_usage_logs (
+    id TEXT PRIMARY KEY,
+    backend_id TEXT NOT NULL REFERENCES backends(id) ON DELETE CASCADE,
+    model TEXT DEFAULT 'unknown',
+    prompt_tokens INTEGER DEFAULT 0,
+    completion_tokens INTEGER DEFAULT 0,
+    total_tokens INTEGER DEFAULT 0,
+    cost REAL DEFAULT 0.0,
+    request_count INTEGER DEFAULT 0,
+    error_count INTEGER DEFAULT 0,
+    avg_latency_ms INTEGER DEFAULT 0,
+    ttft_ms INTEGER DEFAULT 0,
+    hour_bucket TEXT NOT NULL,
+    created_at TEXT NOT NULL DEFAULT (datetime('now'))
+);
+CREATE UNIQUE INDEX IF NOT EXISTS idx_usage_backend_hour
+    ON backend_usage_logs(backend_id, hour_bucket);
+
+-- Cooldown event log
+CREATE TABLE IF NOT EXISTS cooldown_events (
+    id TEXT PRIMARY KEY,
+    backend_id TEXT NOT NULL REFERENCES backends(id) ON DELETE CASCADE,
+    consecutive_count INTEGER NOT NULL DEFAULT 1,
+    cooldown_seconds INTEGER NOT NULL,
+    response_summary TEXT DEFAULT '',
+    started_at TEXT NOT NULL DEFAULT (datetime('now')),
+    ended_at TEXT
+);
+CREATE INDEX IF NOT EXISTS idx_cooldown_backend_time
+    ON cooldown_events(backend_id, started_at);
+
+-- Backend health state
+CREATE TABLE IF NOT EXISTS backend_health (
+    backend_id TEXT PRIMARY KEY REFERENCES backends(id) ON DELETE CASCADE,
+    state TEXT NOT NULL DEFAULT 'healthy'
+        CHECK(state IN ('healthy', 'degraded', 'down')),
+    last_latency_ms INTEGER DEFAULT 0,
+    last_status_code INTEGER DEFAULT 200,
+    success_rate_5m REAL DEFAULT 1.0,
+    consecutive_failures INTEGER DEFAULT 0,
+    last_check_at TEXT NOT NULL DEFAULT (datetime('now'))
+);
+
+-- System configuration KV store
+CREATE TABLE IF NOT EXISTS system_config (
+    key TEXT PRIMARY KEY,
+    value TEXT NOT NULL,
+    description TEXT DEFAULT '',
+    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
+);
+
+-- Daily aggregated stats
+CREATE TABLE IF NOT EXISTS daily_stats (
+    id TEXT PRIMARY KEY,
+    date TEXT NOT NULL,
+    pool TEXT NOT NULL CHECK(pool IN ('primary', 'fallback')),
+    total_requests INTEGER DEFAULT 0,
+    total_errors INTEGER DEFAULT 0,
+    total_tokens INTEGER DEFAULT 0,
+    total_cost REAL DEFAULT 0.0,
+    unique_backends INTEGER DEFAULT 0,
+    created_at TEXT NOT NULL DEFAULT (datetime('now'))
+);
+CREATE UNIQUE INDEX IF NOT EXISTS idx_daily_date_pool ON daily_stats(date, pool);
+"""
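Review note on `wal_checkpoint`: the `mode` argument is interpolated directly into the PRAGMA statement, so one option is to validate it against the four checkpoint modes SQLite documents before executing. This is a minimal sketch under that assumption; `checked_wal_checkpoint` and `_CHECKPOINT_MODES` are hypothetical names and are not part of this commit.

```python
import sqlite3

# The checkpoint modes documented for PRAGMA wal_checkpoint
_CHECKPOINT_MODES = frozenset({"PASSIVE", "FULL", "RESTART", "TRUNCATE"})


def checked_wal_checkpoint(conn: sqlite3.Connection, mode: str = "TRUNCATE"):
    """Run a WAL checkpoint, rejecting any mode outside the known set."""
    normalized = mode.upper()
    if normalized not in _CHECKPOINT_MODES:
        raise ValueError(f"unsupported checkpoint mode: {mode!r}")
    # Interpolation is safe here because the value was checked against a fixed set
    return conn.execute(f"PRAGMA wal_checkpoint({normalized})").fetchone()
```

PRAGMA arguments cannot be bound with `?` placeholders, which is why the check is an allow-list rather than a parameterized query.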
@@ -0,0 +1,161 @@
+"""Data models for Sidecar V2 — backend-centric, Canonical Name routing."""
+
+from dataclasses import dataclass, field, asdict
+from typing import Optional
+import json
+
+
+@dataclass
+class ModelMapping:
+    """A single model mapping within a backend: Canonical Name → native_id + properties."""
+
+    native_id: str
+    reasoning: bool = False
+    reasoning_effort: bool = False
+    input_modalities: list[str] = field(default_factory=lambda: ["text"])
+    cost: dict = field(default_factory=lambda: {
+        "input": 0.0, "output": 0.0, "cacheRead": 0.0, "cacheWrite": 0.0
+    })
+    context_window: int = 128000
+    max_tokens: int = 65536
+    compat: dict = field(default_factory=dict)
+
+    def to_dict(self) -> dict:
+        return asdict(self)
+
+    @classmethod
+    def from_dict(cls, d: dict) -> "ModelMapping":
+        defaults = {
+            "native_id": "",
+            "reasoning": False,
+            "reasoning_effort": False,
+            "input_modalities": ["text"],
+            "cost": {"input": 0.0, "output": 0.0, "cacheRead": 0.0, "cacheWrite": 0.0},
+            "context_window": 128000,
+            "max_tokens": 65536,
+            "compat": {},
+        }
+        defaults.update(d)
+        return cls(**{k: v for k, v in defaults.items() if k in cls.__dataclass_fields__})
+
+
+@dataclass
+class Backend:
+    """A physical API backend (API Key + URL).
+
+    Represents a single API key endpoint. Multiple backends can serve the same
+    Canonical Models through their model_mappings.
+    """
+
+    id: str = ""
+    name: str = ""
+    label: str = ""  # e.g., "nvidia", "siliconflow" — WebUI tag only
+    api_base_url: str = ""
+    api_key_encrypted: str = ""
+    api: str = "openai-completions"
+    timeout_seconds: int = 120
+    rpm_limit: int = 40
+    pool: str = "primary"  # primary | fallback
+    enabled: bool = True
+    status: str = "healthy"  # healthy | cooling | error | disabled
+    model_mappings: dict[str, ModelMapping] = field(default_factory=dict)
+    source: str = "webui"  # webui | env | import
+    cooldown_until: Optional[str] = None
+    consecutive_429_count: int = 0
+    metadata: dict = field(default_factory=dict)
+    created_at: str = ""
+    updated_at: str = ""
+
+    # Runtime fields (not persisted)
+    api_key_plain: str = ""  # decrypted at load time, not serialized to DB
+
+    def has_model(self, canonical_name: str) -> bool:
+        """Check if backend supports a given Canonical Model."""
+        return canonical_name in self.model_mappings
+
+    def get_native_id(self, canonical_name: str) -> str:
+        """Get this backend's native model ID for a Canonical Name."""
+        mm = self.model_mappings.get(canonical_name)
+        return mm.native_id if mm else canonical_name
+
+    def get_model_cost(self, canonical_name: str) -> dict:
+        """Get cost info for a Canonical Model on this backend."""
+        mm = self.model_mappings.get(canonical_name)
+        return mm.cost if mm else {"input": 0.0, "output": 0.0, "cacheRead": 0.0, "cacheWrite": 0.0}
+
+    def to_dict(self, mask_key: bool = True) -> dict:
+        """Convert to dict for API responses."""
+        d = asdict(self)
+        # Remove runtime-only fields
+        d.pop("api_key_plain", None)
+        d.pop("api_key_encrypted", None)
+
+        # Mask API key
+        if mask_key and self.api_key_plain:
+            d["api_key"] = _mask_key(self.api_key_plain)
+        elif self.api_key_plain:
+            d["api_key"] = self.api_key_plain
+        else:
+            d["api_key"] = ""
+
+        # Convert model_mappings to dict for serialization
+        d["model_mappings"] = {
+            k: v.to_dict() for k, v in self.model_mappings.items()
+        }
+        return d
+
+
+def _mask_key(key: str) -> str:
+    if len(key) <= 10:
+        return key[:2] + "****"
+    return key[:6] + "****" + key[-4:]
+
+
+@dataclass
+class CooldownEvent:
+    id: str = ""
+    backend_id: str = ""
+    consecutive_count: int = 1
+    cooldown_seconds: int = 60
+    response_summary: str = ""
+    started_at: str = ""
+    ended_at: Optional[str] = None
+
+
+@dataclass
+class BackendHealth:
+    backend_id: str = ""
+    state: str = "healthy"  # healthy | degraded | down
+    last_latency_ms: int = 0
+    last_status_code: int = 200
+    success_rate_5m: float = 1.0
+    consecutive_failures: int = 0
+    last_check_at: str = ""
+
+
+@dataclass
+class UsageLog:
+    id: str = ""
+    backend_id: str = ""
+    model: str = "unknown"
+    prompt_tokens: int = 0
+    completion_tokens: int = 0
+    total_tokens: int = 0
+    cost: float = 0.0
+    request_count: int = 0
+    error_count: int = 0
+    avg_latency_ms: int = 0
+    ttft_ms: int = 0
+    hour_bucket: str = ""
+
+
+@dataclass
+class DailyStats:
+    id: str = ""
+    date: str = ""
+    pool: str = "primary"
+    total_requests: int = 0
+    total_errors: int = 0
+    total_tokens: int = 0
+    total_cost: float = 0.0
+    unique_backends: int = 0
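Review note on `_mask_key`: the masking rule keeps the first six and last four characters of longer keys, and only the first two characters of keys with ten or fewer. The standalone copy below exists only to show the resulting shapes; `mask_key` is a hypothetical public name and the sample keys are made up.

```python
def mask_key(key: str) -> str:
    """Same rule as _mask_key in this diff, copied for illustration."""
    if len(key) <= 10:
        # Short keys: reveal two characters at most
        return key[:2] + "****"
    # Longer keys: keep a recognizable prefix and suffix
    return key[:6] + "****" + key[-4:]


examples = {
    "long": mask_key("nvapi-abcdefghijkl1234"),
    "short": mask_key("short"),
    "empty": mask_key(""),
}
```

An empty key yields `"****"`, which is why `Backend.to_dict` checks `api_key_plain` before masking and returns an empty string otherwise.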
@@ -0,0 +1,155 @@
+"""Usage logging and daily statistics aggregation."""
+
+import time
+from typing import Optional
+
+from storage.db import get_connection, generate_id
+
+
+def record_usage(
+    backend_id: str,
+    model: str,
+    prompt_tokens: int,
+    completion_tokens: int,
+    cost: float,
+    latency_ms: int,
+    ttft_ms: int = 0,
+    is_error: bool = False,
+) -> None:
+    """Record a single request's usage, hour-bucketed with UPSERT."""
+    hour_bucket = time.strftime("%Y-%m-%dT%H:00:00Z", time.gmtime())
+    uid = generate_id("use")
+
+    with get_connection() as conn:
+        # Try update existing hour bucket
+        cursor = conn.execute(
+            """UPDATE backend_usage_logs SET
+               prompt_tokens = prompt_tokens + ?,
+               completion_tokens = completion_tokens + ?,
+               total_tokens = total_tokens + ?,
+               cost = cost + ?,
+               request_count = request_count + 1,
+               error_count = error_count + ?,
+               avg_latency_ms = CAST((avg_latency_ms * request_count + ?) / (request_count + 1) AS INTEGER),
+               ttft_ms = CASE WHEN ? > 0 THEN CAST((ttft_ms * request_count + ?) / (request_count + 1) AS INTEGER) ELSE ttft_ms END
+               WHERE backend_id = ? AND hour_bucket = ?""",
+            (
+                prompt_tokens, completion_tokens,
+                prompt_tokens + completion_tokens,
+                cost,
+                1 if is_error else 0,
+                latency_ms,
+                ttft_ms, ttft_ms,
+                backend_id, hour_bucket,
+            ),
+        )
+        if cursor.rowcount == 0:
+            # Insert new hour bucket
+            conn.execute(
+                """INSERT INTO backend_usage_logs
+                   (id, backend_id, model, prompt_tokens, completion_tokens,
+                    total_tokens, cost, request_count, error_count,
+                    avg_latency_ms, ttft_ms, hour_bucket)
+                   VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
+                (
+                    uid, backend_id, model,
+                    prompt_tokens, completion_tokens,
+                    prompt_tokens + completion_tokens,
+                    cost, 1, 1 if is_error else 0,
+                    latency_ms, ttft_ms, hour_bucket,
+                ),
+            )
+        conn.commit()
+
+
+def get_hourly_usage(
+    backend_id: Optional[str] = None,
+    since: Optional[str] = None,
+    limit: int = 168,
+) -> list[dict]:
+    """Get hourly usage data, optionally filtered by backend and time range."""
+    with get_connection() as conn:
+        if backend_id and since:
+            rows = conn.execute(
+                """SELECT * FROM backend_usage_logs
+                   WHERE backend_id = ? AND hour_bucket >= ?
+                   ORDER BY hour_bucket DESC LIMIT ?""",
+                (backend_id, since, limit),
+            ).fetchall()
+        elif backend_id:
+            rows = conn.execute(
+                """SELECT * FROM backend_usage_logs
+                   WHERE backend_id = ? ORDER BY hour_bucket DESC LIMIT ?""",
+                (backend_id, limit),
+            ).fetchall()
+        elif since:
+            rows = conn.execute(
+                """SELECT * FROM backend_usage_logs
+                   WHERE hour_bucket >= ? ORDER BY hour_bucket DESC LIMIT ?""",
+                (since, limit),
+            ).fetchall()
+        else:
+            rows = conn.execute(
+                """SELECT * FROM backend_usage_logs
+                   ORDER BY hour_bucket DESC LIMIT ?""",
+                (limit,),
+            ).fetchall()
+    return [dict(row) for row in rows]
+
+
+def get_total_stats() -> dict:
+    """Get aggregate stats across all backends."""
+    with get_connection() as conn:
+        row = conn.execute(
+            """SELECT
+               COALESCE(SUM(request_count), 0) as total_requests,
+               COALESCE(SUM(error_count), 0) as total_errors,
+               COALESCE(SUM(total_tokens), 0) as total_tokens,
+               COALESCE(SUM(prompt_tokens), 0) as total_prompt_tokens,
+               COALESCE(SUM(completion_tokens), 0) as total_completion_tokens,
+               COALESCE(SUM(cost), 0.0) as total_cost
+               FROM backend_usage_logs"""
+        ).fetchone()
+    if row is None:
+        return {
+            "total_requests": 0, "total_errors": 0,
+            "total_tokens": 0, "total_prompt_tokens": 0,
+            "total_completion_tokens": 0, "total_cost": 0.0,
+        }
+    return dict(row)
+
+
+def aggregate_daily_stats(date: str) -> None:
+    """Aggregate hourly usage into daily stats table."""
+    with get_connection() as conn:
+        # Aggregate per pool
+        conn.execute("""DELETE FROM daily_stats WHERE date = ?""", (date,))
+        conn.execute(
+            """INSERT INTO daily_stats (id, date, pool, total_requests,
+               total_errors, total_tokens, total_cost, unique_backends)
+               SELECT
+                   ? || '-' || b.pool,
+                   ?,
+                   b.pool,
+                   SUM(u.request_count),
+                   SUM(u.error_count),
+                   SUM(u.total_tokens),
+                   SUM(u.cost),
+                   COUNT(DISTINCT u.backend_id)
+               FROM backend_usage_logs u
+               JOIN backends b ON u.backend_id = b.id
+               WHERE u.hour_bucket LIKE ?
+               GROUP BY b.pool""",
+            (generate_id("day"), date, date + "%"),
+        )
+        conn.commit()
+
+
+def get_daily_stats(days: int = 30) -> list[dict]:
+    """Get daily aggregated stats."""
+    with get_connection() as conn:
+        rows = conn.execute(
+            """SELECT * FROM daily_stats ORDER BY date DESC LIMIT ?""",
+            (days,),
+        ).fetchall()
+    return [dict(row) for row in rows]
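Review note on `record_usage`: the docstring mentions UPSERT, while the body runs an UPDATE followed by a conditional INSERT. Since `idx_usage_backend_hour` is a unique index on `(backend_id, hour_bucket)`, the same running average could be expressed as a single `INSERT ... ON CONFLICT DO UPDATE`, the form `set_config` already uses. The sketch below assumes a reduced `usage` table and the hypothetical helpers `record_latency` and `demo_hourly_upsert`; it is not the commit's implementation.

```python
import sqlite3


def record_latency(conn: sqlite3.Connection, backend_id: str,
                   hour_bucket: str, latency_ms: int) -> None:
    """Fold one request into its hour bucket with a single UPSERT statement."""
    conn.execute(
        """INSERT INTO usage (backend_id, hour_bucket, request_count, avg_latency_ms)
           VALUES (?, ?, 1, ?)
           ON CONFLICT(backend_id, hour_bucket) DO UPDATE SET
               avg_latency_ms = CAST(
                   (avg_latency_ms * request_count + excluded.avg_latency_ms)
                   / (request_count + 1) AS INTEGER),
               request_count = request_count + 1""",
        (backend_id, hour_bucket, latency_ms),
    )
    conn.commit()


def demo_hourly_upsert() -> tuple:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE usage (
               backend_id TEXT NOT NULL,
               hour_bucket TEXT NOT NULL,
               request_count INTEGER DEFAULT 0,
               avg_latency_ms INTEGER DEFAULT 0)"""
    )
    # The conflict target must be backed by a unique index
    conn.execute(
        "CREATE UNIQUE INDEX idx_usage_backend_hour ON usage(backend_id, hour_bucket)"
    )
    for latency in (100, 200, 300):
        record_latency(conn, "b1", "2026-06-25T09:00:00Z", latency)
    row = conn.execute(
        "SELECT request_count, avg_latency_ms FROM usage"
    ).fetchone()
    conn.close()
    return tuple(row)
```

In the `DO UPDATE` clause, bare column names refer to the row already stored and `excluded.*` to the row that failed to insert, so the average is computed from the pre-update `request_count`, as in the original UPDATE.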
@@ -1 +0,0 @@
-# nvidia_sidecar tests
@@ -1,207 +0,0 @@
-"""
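-Retreat-mode concurrency/deadlock regression tests (BIZ-46 Phase3 6)
-
-Covers AdaptiveTokenBucket thread safety in multi-threaded scenarios:
- Concurrent record_response + evaluate_retreat
- Concurrent consume + record_response + evaluate_retreat
- Correct retreat state transitions under high load
-
-Design doc: docs/architecture/BIZ-46_Phase3_Architecture_Design.md 6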
-"""
-
-from __future__ import annotations
-
-import threading
-import time
-
-import pytest
-
-from nvidia_sidecar.rate_limiter import AdaptiveTokenBucket, RetreatState
-
-
-class TestRetreatConcurrency:
-    """Concurrency-safety regression tests for retreat mode."""
-
-    @pytest.mark.asyncio
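-    async def test_concurrent_record_and_evaluate(self) -> None:
-        """Concurrent record_response + evaluate_retreat across threads must not deadlock.
-
-        Four threads run at the same time:
-        - 2 threads call record_response (1000 times each)
-        - 2 threads call evaluate_retreat (1000 times each)
-
-        All threads must finish within 10s; otherwise a deadlock is assumed.
-        """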
-        bucket = AdaptiveTokenBucket(rate=40 / 60, capacity=40)
-        errors: list[Exception] = []
-
-        def worker_record() -> None:
-            for i in range(1000):
-                try:
-                    bucket.record_response(is_429=(i % 10 == 0))
-                except Exception as e:
-                    errors.append(e)
-
-        def worker_evaluate() -> None:
-            for _ in range(1000):
-                try:
-                    bucket.evaluate_retreat()
-                except Exception as e:
-                    errors.append(e)
-
-        threads = [
-            threading.Thread(target=worker_record),
-            threading.Thread(target=worker_record),
-            threading.Thread(target=worker_evaluate),
-            threading.Thread(target=worker_evaluate),
-        ]
-        for t in threads:
-            t.start()
-        for t in threads:
-            t.join(timeout=10)
-
-        alive_threads = [t for t in threads if t.is_alive()]
-        assert not alive_threads, (
-            f"{len(alive_threads)} thread(s) did not finish; possible deadlock"
-        )
-        assert not errors, f"Concurrency errors: {errors}"
-
-    @pytest.mark.asyncio
-    async def test_concurrent_consume_and_retreat(self) -> None:
-        """Concurrent consume + record_response + evaluate_retreat across threads must not deadlock.
-
-        Covers the cross-locking case where _lock (TokenBucket) and _retreat_lock
-        (AdaptiveTokenBucket) are held by different threads at the same time.
-        """
-        bucket = AdaptiveTokenBucket(rate=40 / 60, capacity=40)
-        errors: list[Exception] = []
-
-        def worker_consume() -> None:
-            for _ in range(500):
-                try:
-                    bucket.consume(tokens=1)
-                except Exception as e:
-                    errors.append(e)
-
-        def worker_retreat() -> None:
-            for _ in range(500):
-                try:
-                    bucket.record_response(is_429=False)
-                    bucket.evaluate_retreat()
-                except Exception as e:
-                    errors.append(e)
-
-        threads = [
-            threading.Thread(target=worker_consume),
-            threading.Thread(target=worker_consume),
-            threading.Thread(target=worker_retreat),
-            threading.Thread(target=worker_retreat),
-        ]
-        for t in threads:
-            t.start()
-        for t in threads:
-            t.join(timeout=10)
-
-        alive_threads = [t for t in threads if t.is_alive()]
-        assert not alive_threads, (
-            f"{len(alive_threads)} thread(s) did not finish; possible deadlock"
-        )
-        assert not errors, f"Concurrency errors: {errors}"
-
-    @pytest.mark.asyncio
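-    async def test_retreat_state_transitions_under_load(self) -> None:
-        """Retreat state transitions are correct under high load.
-
-        1. Inject 100 429 responses → verify RETREAT is entered
-        2. Inject 200 successes → manually advance time → verify recovery
-        """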
-        bucket = AdaptiveTokenBucket(
-            rate=40 / 60,
-            capacity=40,
-            retreat_window_seconds=0.1,
-            retreat_429_threshold=0.05,
-            retreat_factor=0.75,
-            retreat_min_rpm=5.0,
-            recover_window_seconds=0.01,
-        )
-
-        # Phase 1: simulate a high 429 rate
-        for _ in range(100):
-            bucket.record_response(is_429=True)
-
-        state = bucket.evaluate_retreat()
-        assert state == RetreatState.RETREAT, (
-            f"A high 429 rate should trigger retreat, got: {state}"
-        )
-        assert bucket.get_effective_rate_rpm() < bucket.get_base_rate_rpm(), (
-            f"Rate after retreat should be below the base rate, got: "
-            f"{bucket.get_effective_rate_rpm()} vs {bucket.get_base_rate_rpm()}"
-        )
-
-        # Phase 2: simulate recovery
-        time.sleep(0.15)  # wait for the 429s to expire from the short window
-        for _ in range(200):
-            bucket.record_response(is_429=False)
-
-        for _ in range(10):
-            state = bucket.evaluate_retreat()
-
-        assert state in (RetreatState.RECOVER, RetreatState.NORMAL), (
-            f"State after recovery should be RECOVER or NORMAL, got: {state}"
-        )
-
-    @pytest.mark.asyncio
-    async def test_try_consume_concurrency_safety(self) -> None:
-        """Concurrent try_consume must not deadlock."""
-        bucket = AdaptiveTokenBucket(rate=40 / 60, capacity=40)
-        errors: list[Exception] = []
-        results: list[bool] = []
-
-        def worker() -> None:
-            for _ in range(200):
-                try:
-                    got = bucket.try_consume(tokens=1, timeout=0.1)
-                    results.append(got)
-                except Exception as e:
-                    errors.append(e)
-
-        threads = [threading.Thread(target=worker) for _ in range(8)]
-        for t in threads:
-            t.start()
-        for t in threads:
-            t.join(timeout=10)
-
-        alive = [t for t in threads if t.is_alive()]
-        assert not alive, f"{len(alive)} thread(s) did not finish; possible deadlock"
-        assert not errors, f"Concurrency errors: {errors}"
-        successful = sum(1 for r in results if r)
-        assert successful > 0, (
-            f"The token bucket should consume at least some tokens; succeeded: {successful}/{len(results)}"
-        )
-
-    @pytest.mark.asyncio
-    async def test_high_load_state_coherence(self) -> None:
-        """Token bucket state stays consistent under high load: total consumed ≤ initial tokens + refill."""
-        bucket = AdaptiveTokenBucket(rate=10.0, capacity=100)
-        consumed_count: list[int] = [0]
-        lock = threading.Lock()
-
-        def worker() -> None:
-            local_consumed = 0
-            for _ in range(50):
-                if bucket.consume(tokens=1):
-                    local_consumed += 1
-                time.sleep(0.001)
-            with lock:
-                consumed_count[0] += local_consumed
-
-        threads = [threading.Thread(target=worker) for _ in range(10)]
-        for t in threads:
-            t.start()
-        for t in threads:
-            t.join(timeout=15)
-
-        max_expected = 100 + int(10.0 * 5)
-        assert consumed_count[0] <= max_expected, (
-            f"Unexpected consumption: {consumed_count[0]}, expected ≤ {max_expected}"
-        )
@@ -1,325 +0,0 @@
-"""
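-NVIDIA Sidecar: WebUI backend API
-
-Provides real-time SSE push for the dashboard plus a config hot-reload API.
-
-BIZ-46 Phase3:
- Decoupling: drop the reverse import of server; use Depends(get_context) instead (§1)
- Shared SSE cache: snapshot cache with a 1s TTL, so multiple clients do not rebuild it (§3)
- Dashboard UX: sync config on page load + queue depth title (§7)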
-"""
-
-from __future__ import annotations
-
-import asyncio
-import json
-import os
-import time
-from pathlib import Path
-from typing import Any, AsyncGenerator
-
-import structlog
-from fastapi import APIRouter, Depends, HTTPException, Request
-from fastapi.responses import HTMLResponse, JSONResponse, StreamingResponse
-from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
-from pydantic import BaseModel
-
-from nvidia_sidecar.context import SidecarContext
-
-webui_router: APIRouter = APIRouter(prefix="/api", tags=["webui"])
-logger: structlog.stdlib.BoundLogger = structlog.get_logger("nvidia_sidecar.webui")
-
-STATIC_DIR: Path = Path(__file__).parent / "static"
-
-# dashboard.html cache (严维序 review #6 / 梁思筑 review #8: avoid reading from disk on every request)
-_dashboard_html_cache: tuple[str, float] | None = None
-_DASHBOARD_CACHE_TTL: float = 300.0  # 5 minutes
-
-# Admin API authentication (严维序 review #1)
-_ADMIN_TOKEN: str | None = os.environ.get("SIDECAR_ADMIN_TOKEN")
-_admin_auth_scheme: HTTPBearer = HTTPBearer(auto_error=False)
-
-
-def _get_ctx(request: Request) -> SidecarContext:
-    """Get the SidecarContext (route-level injection for webui; avoids a circular import of server)."""
-    return request.app.state.sidecar  # type: ignore[no-any-return]
-
-
-# ---------------------------------------------------------------------------
-# Hot-reload configuration model
-# ---------------------------------------------------------------------------
-
-class ConfigPatch(BaseModel):
-    """Configuration fields that can be modified at runtime."""
-    rate_rpm: int | None = None
-    queue_max_size: int | None = None
-    fallback_enabled_passthrough: bool | None = None
-
-
-# ---------------------------------------------------------------------------
-# SSE snapshot construction (BIZ-46 Phase3: shared cache with a 1s TTL)
-# ---------------------------------------------------------------------------
-
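-async def _build_snapshot(ctx: SidecarContext) -> dict[str, Any]:
-    """Build a snapshot of the current state (read from SidecarContext, including queue depth).
-
-    BIZ-46 Phase3: global variables are no longer accessed through a reverse import of server.
-    """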
-    try:
-        bucket_status = ctx.token_bucket.get_status()
-        now = time.time()
-
-        queue_data: dict[str, Any] = {"current_size": 0, "per_priority": {}}
-        try:
-            queue_stats = await ctx.priority_queue.get_stats()
-            queue_data = {
-                "max_size": queue_stats.get("max_size", 0),
-                "current_size": queue_stats.get("current_size", 0),
-                "per_priority": queue_stats.get("depth_by_priority", {}),
-                "total_enqueued": queue_stats.get("total_enqueued", 0),
-                "total_dequeued": queue_stats.get("total_dequeued", 0),
-                "total_dropped": queue_stats.get("total_dropped", 0),
-            }
-        except Exception:
-            logger.warning(
-                "queue_stats_unavailable",
-                message="Failed to fetch queue stats; the dashboard queue depth may be inaccurate",
-            )
-
-        return {
-            "timestamp": now,
-            "uptime_seconds": ctx.uptime_seconds,
-            "token_bucket": bucket_status,
-            "queue": queue_data,
-            "retreat": {
-                "state": ctx.token_bucket.get_retreat_state(),
-                "effective_rpm": round(ctx.token_bucket.get_effective_rate_rpm(), 1),
-                "base_rpm": round(ctx.token_bucket.get_base_rate_rpm(), 1),
-                "upstream_429_rate": round(ctx.token_bucket.get_429_rate(), 4),
-            },
-            "requests": {
-                "total": ctx.stats.get("total_requests", 0),
-                "nvidia": ctx.stats.get("nvidia_requests", 0),
-                "passthrough": ctx.stats.get("passthrough_requests", 0),
-                "ratelimited": ctx.stats.get("ratelimited_requests", 0),
-            },
-            "errors": {
-                "queue_full_rejects": ctx.stats.get("queue_full_rejects", 0),
-                "upstream_errors": ctx.stats.get("upstream_errors", 0),
-            },
-        }
-    except Exception:
-        logger.exception("snapshot_build_error")
-        return {"error": "snapshot_unavailable", "timestamp": time.time()}
-
-
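-async def _build_snapshot_cached(ctx: SidecarContext) -> dict[str, Any]:
-    """Shared snapshot cache with a 1s TTL (BIZ-46 Phase3 §3).
-
-    All SSE clients share one snapshot, avoiding repeated computation and lock contention.
-
-    Performance gain:
-    - 1 client: 1 build/s (unchanged)
-    - 5 clients: ~5 builds/s → 1 build/s
-    - 20 clients: ~20 builds/s → 1 build/s
-    """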
-    now_cache = time.monotonic()
-    if ctx.snapshot_cache is not None:
-        data, ts = ctx.snapshot_cache
-        if now_cache - ts < ctx.SNAPSHOT_CACHE_TTL:
-            return data
-
-    async with ctx.snapshot_cache_lock:
-        # Double-check (prevents several coroutines that missed at the same time from each rebuilding)
-        if ctx.snapshot_cache is not None:
-            data, ts = ctx.snapshot_cache
-            if now_cache - ts < ctx.SNAPSHOT_CACHE_TTL:
-                return data
-
-        snapshot = await _build_snapshot(ctx)
-        ctx.snapshot_cache = (snapshot, now_cache)
-        return snapshot
-
-
-# ---------------------------------------------------------------------------
-# Dashboard SSE push
-# ---------------------------------------------------------------------------
-
-async def _dashboard_stream(request: Request, ctx: SidecarContext) -> StreamingResponse:
-    """Push a full Sidecar state snapshot over SSE once per second.
-
-    Consumed by the EventSource in dashboard.html.
-
-    BIZ-46 Phase3: uses the shared cache _build_snapshot_cached, so multiple clients do not recompute.
-    """
-    async def event_generator() -> AsyncGenerator[str, None]:
-        first_frame = True
-        while True:
-            if await request.is_disconnected():
-                break
-            try:
-                snapshot: dict[str, Any] = await _build_snapshot_cached(ctx)
-                payload_sse = f"data: {json.dumps(snapshot, ensure_ascii=False)}\n\n"
-                if first_frame:
-                    payload_sse = f"retry: 3000\n{payload_sse}"
-                    first_frame = False
-                yield payload_sse
-            except Exception:
-                logger.exception("dashboard_sse_error")
-                yield f"data: {json.dumps({'error': 'internal'})}\n\n"
-            await asyncio.sleep(1.0)
-
-    return StreamingResponse(
-        event_generator(),
-        media_type="text/event-stream",
-        headers={
-            "Cache-Control": "no-cache",
-            "X-Accel-Buffering": "no",
-        },
-    )
-
-
-# ---------------------------------------------------------------------------
-# Hot-reload configuration
-# ---------------------------------------------------------------------------
-
-async def get_config(ctx: SidecarContext) -> dict[str, Any]:
-    """Return the current full configuration (read from SidecarContext)."""
-    config = ctx.config
-    effective_rpm = float(ctx.token_bucket.get_effective_rate_rpm())
-    return {
-        "listen_host": config.listen_host,
-        "listen_port": config.listen_port,
-        "metrics_port": config.metrics_port,
-        "upstream_url": config.upstream_url,
-        "upstream_api_key": _mask_api_key(config.upstream_api_key),
-        "rate_rpm": round(effective_rpm, 1),
-        "bucket_capacity": config.bucket_capacity,
-        "request_timeout": config.request_timeout,
-        "queue_max_size": config.queue_max_size,
-        "low_priority_timeout": config.low_priority_timeout,
-        "fallback_enabled_passthrough": config.fallback_enabled_passthrough,
-        "log_level": config.log_level,
-    }
-
-
-async def update_config(body: ConfigPatch, ctx: SidecarContext) -> JSONResponse:
-    """Modify configuration fields at runtime; changes take effect immediately."""
-    config = ctx.config
-    changed: list[str] = []
-
-    if body.rate_rpm is not None:
-        if body.rate_rpm <= 0:
-            raise HTTPException(status_code=400, detail="rate_rpm must be > 0")
-        config.rate_rpm = body.rate_rpm
-        ctx.token_bucket.set_rate(body.rate_rpm / 60.0)
-        changed.append("rate_rpm")
-
-    if body.queue_max_size is not None:
-        if body.queue_max_size <= 0:
-            raise HTTPException(status_code=400, detail="queue_max_size must be > 0")
-        ok, msg = ctx.priority_queue.set_max_size(body.queue_max_size)
-        if not ok:
-            raise HTTPException(status_code=400, detail=msg)
-        config.queue_max_size = body.queue_max_size
-        changed.append("queue_max_size")
-        logger.info("queue_max_size_updated", detail=msg)
-
-    if body.fallback_enabled_passthrough is not None:
-        config.fallback_enabled_passthrough = body.fallback_enabled_passthrough
-        changed.append("fallback_enabled_passthrough")
-
-    logger.info("config_updated", changed=changed)
-    return JSONResponse(
-        content={"status": "ok", "changed": changed},
-    )
-
-
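-def _mask_api_key(key: str) -> str:
-    """Mask an API key, keeping only the first 4 characters (2 for keys of 4 characters or fewer) for identification.
-
-    严维序 review #2 / 沈路明 review #3: prevent the API key from leaking in plaintext.
-    """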
-    if not key:
-        return ""
-    if len(key) <= 4:
-        return key[:2] + "****"
-    return key[:4] + "****"
-
-
-# ---------------------------------------------------------------------------
-# Route registration
-# ---------------------------------------------------------------------------
-
-@webui_router.get("/dashboard/stream")
-async def dashboard_stream(
-    request: Request,
-    ctx: SidecarContext = Depends(_get_ctx),
-) -> StreamingResponse:
-    """SSE endpoint for real-time dashboard push (BIZ-46 Phase3: uses the shared cache)."""
-    return await _dashboard_stream(request, ctx)
-
-
-async def _verify_admin_auth(
-    credentials: HTTPAuthorizationCredentials | None = Depends(_admin_auth_scheme),
-) -> None:
-    """Bearer token authentication for the Admin API (严维序 review #1).
-
-    If the SIDECAR_ADMIN_TOKEN environment variable is set, requests must carry a matching Bearer token.
-    If it is unset, authentication is skipped (development/test environments).
-    """
-    if _ADMIN_TOKEN is None:
-        return  # no admin token configured; allow unauthenticated access
-    if credentials is None:
-        raise HTTPException(status_code=401, detail="Bearer token authentication required (Admin API)")
-    if credentials.credentials != _ADMIN_TOKEN:
-        raise HTTPException(status_code=403, detail="Invalid admin token")
-
-
-@webui_router.get("/admin/config")
-async def admin_get_config(
-    _auth: None = Depends(_verify_admin_auth),
-    ctx: SidecarContext = Depends(_get_ctx),
-) -> JSONResponse:
-    """Return the current configuration (requires Admin authentication)."""
-    return JSONResponse(content=await get_config(ctx))
-
-
-@webui_router.post("/admin/config")
-async def admin_update_config(
-    body: ConfigPatch,
-    _auth: None = Depends(_verify_admin_auth),
-    ctx: SidecarContext = Depends(_get_ctx),
-) -> JSONResponse:
-    """Modify the configuration at runtime (hot reload, requires Admin authentication)."""
-    return await update_config(body, ctx)
-
-
-# ---------------------------------------------------------------------------
-# Static dashboard page
-# ---------------------------------------------------------------------------
-
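-def _get_dashboard_html() -> str:
-    """Return the dashboard HTML (cached; 严维序 review #6 / 梁思筑 review #8).
-
-    After the first load the content is cached for 5 minutes to avoid reading from disk on every request.
-    """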
-    global _dashboard_html_cache
-    now = time.monotonic()
-    if _dashboard_html_cache is not None:
-        cached_content, cached_at = _dashboard_html_cache
-        if now - cached_at < _DASHBOARD_CACHE_TTL:
-            return cached_content
-
-    dashboard_path = STATIC_DIR / "dashboard.html"
-    if dashboard_path.is_file():
-        content = dashboard_path.read_text(encoding="utf-8")
-        _dashboard_html_cache = (content, now)
-        return content
-    return "<h1>dashboard.html not found</h1>"
-
-
-@webui_router.get("/dashboard", include_in_schema=False)
-async def dashboard_page() -> HTMLResponse:
-    """Dashboard HTML page (served with the caching strategy above)."""
-    return HTMLResponse(content=_get_dashboard_html())
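The shared snapshot cache in this file follows a double-checked pattern: an unlocked freshness check, then a second check under an `asyncio.Lock` before rebuilding. Below is a minimal standalone sketch of that pattern, not the module's actual code. `SnapshotCache`, `demo`, and the `build` coroutine are hypothetical names, and unlike the file above it re-reads the monotonic clock after acquiring the lock and stamps the entry after the build finishes.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Dict, Optional, Tuple

Snapshot = Dict[str, Any]


class SnapshotCache:
    """Hypothetical TTL cache shared by all SSE clients (illustration only)."""

    def __init__(self, build: Callable[[], Awaitable[Snapshot]], ttl: float = 1.0) -> None:
        self._build = build
        self._ttl = ttl
        self._entry: Optional[Tuple[Snapshot, float]] = None
        self._lock = asyncio.Lock()
        self.builds = 0  # counts real builds, for demonstration only

    def _fresh(self) -> Optional[Snapshot]:
        """Return the cached snapshot if it is younger than the TTL."""
        if self._entry is None:
            return None
        data, stamped_at = self._entry
        return data if time.monotonic() - stamped_at < self._ttl else None

    async def get(self) -> Snapshot:
        cached = self._fresh()  # fast path, no lock taken
        if cached is not None:
            return cached
        async with self._lock:
            cached = self._fresh()  # re-check: another task may have rebuilt it
            if cached is not None:
                return cached
            data = await self._build()
            self.builds += 1
            self._entry = (data, time.monotonic())  # stamp after the build
            return data


async def demo(clients: int = 20) -> int:
    """Simulate concurrent SSE clients and return how many builds happened."""

    async def build() -> Snapshot:
        await asyncio.sleep(0.01)  # stand-in for gathering queue/bucket stats
        return {"timestamp": time.time()}

    cache = SnapshotCache(build, ttl=1.0)  # created inside the running loop
    results = await asyncio.gather(*(cache.get() for _ in range(clients)))
    assert all(r is results[0] for r in results)  # every client got the same object
    return cache.builds


print(asyncio.run(demo()))
```

With twenty simulated clients arriving together, only the first caller should build; the rest wait on the lock and then return the entry it stored, which is the effect the "~20 builds/s → 1 build/s" note in the docstring describes.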