Compare commits

..

2 Commits

Author SHA1 Message Date
vincent 4b31322be3 fix(BIZ-26): 限流范围收窄到 NVIDIA 网关
- 新增网关识别逻辑:只识别 nvidia / nvidiavx18088980513 为限流目标
- volcengine-plan、siliconflow、deepseek 等非 NVIDIA 网关默认不进入令牌桶
- RequestScheduler 增加 gateway/model 参数与 _should_rate_limit 判断
- 未知网关默认不限流,避免误伤其他通道
- 补充网关范围测试与使用文档说明

Co-authored-by: multica-agent <github@multica.ai>
2026-06-23 16:12:02 +08:00
vincent 7f1edfb2fd feat(BIZ-26): 实现 API 请求优先级队列 + 令牌桶限流器
- 新增 scripts/rate_limiter.py 核心模块
  - TokenBucket: 令牌桶限流器(40 RPM 上限)
  - RequestScheduler: 四级优先级请求队列调度器
  - CacheManager: 查询结果缓存(分 TTL 策略)
  - retry_with_backoff: 指数退避重试装饰器
  - CoordinatedPoller: COO 统一轮询器

- 新增 scripts/test_rate_limiter.py 测试套件
  - 覆盖令牌桶、缓存、队列、重试、轮询、压力测试
  - 所有测试通过 

- 新增 docs/BIZ-26-限流器使用文档.md
  - API 参考、使用示例、集成指南
  - 缓存策略、降级策略、监控调试

实现参考:plans/BIZ-13_运行稳定性保障方案.md

Co-authored-by: multica-agent <github@multica.ai>
2026-06-23 07:09:39 +08:00
23 changed files with 1505 additions and 3757 deletions
+401
View File
@@ -0,0 +1,401 @@
# BIZ-26 限流器使用文档
> 模块:`scripts/rate_limiter.py`
> 测试:`scripts/test_rate_limiter.py`
> 实现日期:2026-06-23
> 作者:徐聪(costcodev
---
## 一、功能概述
本模块实现了 BIZ-13 运行稳定性保障方案中的 API 限流优化功能:
1. **NVIDIA 网关专用令牌桶限流器**40 RPM 上限,防止触发 NVIDIA 网关 API 429 错误
2. **四级优先级队列**:紧急 > 高 > 正常 > 低
3. **智能降级策略**:高优先级等待,低优先级切备用模型
4. **缓存管理器**:按数据类型设置不同 TTL
5. **COO 统一轮询**:减少重复请求
6. **指数退避重试**:自动处理临时失败
---
## 二、适用范围(已按要求收窄)
**令牌桶限流器只对 NVIDIA 网关 API 生效。**
识别规则:
- `nvidia``nvidia-gateway``nvidiavx18088980513/...` → 进入 40 RPM 令牌桶
- `volcengine-plan/...``siliconflow/...``deepseek/...` → 不进入令牌桶,不受该限流器影响
- 未知网关默认不限制,避免误伤非 NVIDIA 通道
调用方应显式传入 `gateway``model`,例如:
```python
# 走 NVIDIA 网关:限流
scheduler.submit(payload=data, gateway="nvidia", priority=Priority.NORMAL, callback=handler)
scheduler.submit(payload=data, model="nvidiavx18088980513/deepseek-ai/deepseek-v4-pro", callback=handler)
# 走其他网关:不限流
scheduler.submit(payload=data, model="volcengine-plan/ark-code-latest", callback=handler)
scheduler.submit(payload=data, model="siliconflow/Qwen/Qwen3", callback=handler)
scheduler.submit(payload=data, model="deepseek/deepseek-chat", callback=handler)
```
---
## 三、快速开始
### 2.1 基本用法
```python
from scripts.rate_limiter import RequestScheduler, Priority
# 创建调度器(40 RPM
scheduler = RequestScheduler(rate=40/60, capacity=40)
scheduler.start()
# 提交请求
def my_callback(data):
# 实际 API 调用逻辑
return process_data(data)
request_id = scheduler.submit(
payload={"task": "process_workboard"},
priority=Priority.NORMAL,
callback=my_callback
)
# 等待完成后关闭
time.sleep(5)
scheduler.stop()
```
### 2.2 优先级示例
```python
# 紧急任务(Vincent 直接下达)
scheduler.submit(payload=data, priority=Priority.URGENT, callback=handler)
# 阻塞性任务(依赖下游完成)
scheduler.submit(payload=data, priority=Priority.HIGH, callback=handler)
# 常规任务
scheduler.submit(payload=data, priority=Priority.NORMAL, callback=handler)
# 后台优化任务
scheduler.submit(payload=data, priority=Priority.LOW, callback=handler)
```
### 2.3 缓存使用
```python
from scripts.rate_limiter import CacheManager
cache = CacheManager()
# 缓存 WorkBoard 结果(TTL 5 分钟)
cache.set("workboard", "todo_list", result_data)
# 读取缓存
cached = cache.get("workboard", "todo_list")
if cached is None:
# 缓存未命中,重新查询
result = query_workboard()
cache.set("workboard", "todo_list", result)
# 查看缓存统计
stats = cache.get_stats()
print(f"缓存条目:{stats['total_entries']}")
```
---
## 四、API 参考
### 3.1 TokenBucket(令牌桶)
```python
bucket = TokenBucket(rate=40/60, capacity=40)
# 尝试消费令牌(立即返回)
if bucket.consume():
send_request()
else:
# 令牌不足,等待或降级
pass
# 等待令牌(阻塞直到获取或超时)
got_token = bucket.wait_for_token(timeout=5.0)
# 查看状态
status = bucket.get_status()
# 返回:{"tokens": 35.5, "capacity": 40, "rate_per_minute": 40.0, ...}
```
### 3.2 RequestScheduler(请求调度器)
```python
scheduler = RequestScheduler(
rate=40/60, # 令牌生成速率(个/秒)
capacity=40, # 桶容量
enable_cache=True # 启用缓存
)
# 启动工作线程
scheduler.start()
# 提交异步请求
request_id = scheduler.submit(
payload={"task": "data"},
priority=Priority.NORMAL,
callback=my_handler,
fallback_model="deepseek-v4-pro"
)
# 提交同步请求(阻塞直到完成)
result = scheduler.submit_sync(
payload={"task": "data"},
priority=Priority.URGENT,
timeout=10.0
)
# 查看状态
status = scheduler.get_status()
# 停止调度器
scheduler.stop()
```
### 3.3 CacheManager(缓存管理器)
```python
cache = CacheManager()
# 设置缓存(自动 TTL
cache.set("workboard", query_key, value) # 5 分钟
cache.set("config", "agent_list", agents) # 1 小时
cache.set("knowledge", "api_docs", docs) # 1 天
# 自定义 TTL
cache.set("custom", key, value, ttl=600) # 10 分钟
# 读取缓存
value = cache.get("workboard", query_key)
# 删除缓存
cache.delete("workboard", query_key)
# 清理过期缓存
cleaned = cache.clear_expired()
# 查看统计
stats = cache.get_stats()
```
### 3.4 retry_with_backoff(重试装饰器)
```python
from rate_limiter import retry_with_backoff
@retry_with_backoff(
max_retries=3, # 最多重试 3 次
base_delay=1.0, # 基础延迟 1 秒
exponential_base=2, # 指数底数
jitter=True, # 添加随机抖动
exceptions=(RateLimitError, NetworkError)
)
def call_api():
return requests.get(url)
```
### 3.5 CoordinatedPoller(统一轮询器)
```python
from rate_limiter import CoordinatedPoller
# 创建轮询器(15 分钟轮询一次)
poller = CoordinatedPoller(scheduler, poll_interval=15*60)
# 订阅轮询结果
def on_new_data(result):
broadcast_to_agents(result)
poller.subscribe(on_new_data)
# 启动轮询
poller.start()
# 停止轮询
poller.stop()
```
---
## 五、缓存策略
| 数据类型 | TTL | 说明 |
|----------|-----|------|
| `workboard` | 5 分钟 | WorkBoard 卡片状态,高频变化 |
| `config` | 1 小时 | Agent 配置、技能列表,低频变化 |
| `knowledge` | 1 天 | 知识库内容,基本不变 |
| `user` | 1 天 | 用户信息、权限配置 |
---
## 六、降级策略
### 5.1 令牌不足时的处理
| 优先级 | 策略 |
|--------|------|
| URGENT (1) | 无限等待,直到获取令牌 |
| HIGH (2) | 无限等待,直到获取令牌 |
| NORMAL (3) | 等待 2 秒,失败则放回队列稍后重试 |
| LOW (4) | 等待 2 秒,失败则丢弃或切换到备用模型 |
### 5.2 模型降级链
```
主模型 (qwen3.5-397b)
↓ RPM 不足
备用模型 (deepseek-v4-pro)
↓ RPM 不足
本地模型 或 等待
```
---
## 七、监控与调试
### 6.1 查看调度器状态
```python
status = scheduler.get_status()
print(f"队列大小:{status['queue_size']}")
print(f"令牌数:{status['token_bucket']['tokens']}")
print(f"已完成:{status['stats']['completed_requests']}")
print(f"失败:{status['stats']['failed_requests']}")
print(f"降级:{status['stats']['fallback_requests']}")
```
### 6.2 查看缓存统计
```python
stats = cache.get_stats()
print(f"总条目:{stats['total_entries']}")
print(f"有效条目:{stats['valid_entries']}")
print(f"过期条目:{stats['expired_entries']}")
print(f"按类别:{stats['by_category']}")
```
---
## 八、测试
运行测试套件:
```bash
cd /home/vincent/.openclaw/workspace/costcodev/EnterpriseArchitect
python3 scripts/test_rate_limiter.py
```
测试覆盖:
- ✅ 令牌桶限流
- ✅ 缓存管理
- ✅ 优先级队列
- ✅ 重试装饰器
- ✅ 统一轮询器
- ✅ 压力测试(50 请求)
---
## 九、集成示例
### 8.1 与 Multica CLI 集成
```python
import subprocess
import json
from rate_limiter import RequestScheduler, Priority, CacheManager
scheduler = RequestScheduler(rate=40/60, capacity=40)
cache = CacheManager()
scheduler.start()
def query_workboard():
"""查询 WorkBoard(带缓存)"""
# 先查缓存
cached = cache.get("workboard", "all_cards")
if cached:
return cached
# 缓存未命中,调用 CLI
result = subprocess.run(
["multica", "workboard", "list", "--json"],
capture_output=True,
text=True
)
data = json.loads(result.stdout)
# 更新缓存
cache.set("workboard", "all_cards", data)
return data
# 提交查询请求
request_id = scheduler.submit(
payload="query_workboard",
priority=Priority.NORMAL,
callback=lambda _: query_workboard()
)
```
### 8.2 Agent 心跳集成
```python
# 在 Heartbeat 中统一使用限流器
def heartbeat_check():
# 通过调度器提交所有检查任务
scheduler.submit(
payload="check_workboard",
priority=Priority.HIGH,
callback=check_workboard
)
scheduler.submit(
payload="check_multica",
priority=Priority.HIGH,
callback=check_multica_issues
)
scheduler.submit(
payload="update_memory",
priority=Priority.LOW,
callback=update_memory_log
)
```
---
## 十、注意事项
1. **令牌速率配置**:根据实际 API 限制调整 `rate` 参数
2. **缓存 TTL**:根据数据变化频率调整,避免过期数据
3. **工作线程**:记得调用 `start()``stop()` 管理生命周期
4. **异常处理**:回调函数中的异常会被捕获并记录,不会中断工作线程
5. **线程安全**:所有组件都是线程安全的,可在多线程环境使用
---
## 十一、TODO
- [ ] 接入实际的 Multica CLI 调用
- [ ] 添加 Prometheus 监控指标导出
- [ ] 支持动态调整限流参数
- [ ] 添加请求日志持久化
- [ ] 支持多个模型池的自动切换
---
> 文档版本:v1.0
> 最后更新:2026-06-23
> 维护者:徐聪(costcodev
-50
View File
@@ -1,50 +0,0 @@
# Alertmanager 配置
# 告警通知路由到 Feishu
global:
resolve_timeout: 5m
route:
receiver: "default"
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
routes:
# 严重告警 → 通知 Vincent
- receiver: "vincent-critical"
match:
severity: critical
repeat_interval: 2h
continue: true
# 警告告警 → 通知 COO
- receiver: "coo-warning"
match:
severity: warning
repeat_interval: 4h
receivers:
- name: "default"
webhook_configs:
- url: "http://host.docker.internal:9094/webhook"
send_resolved: true
- name: "vincent-critical"
webhook_configs:
- url: "http://host.docker.internal:9094/webhook"
send_resolved: true
- name: "coo-warning"
webhook_configs:
- url: "http://host.docker.internal:9094/webhook"
send_resolved: true
# 抑制规则:严重告警自动抑制同源的警告
inhibit_rules:
- source_match:
severity: critical
target_match:
severity: warning
equal:
- alertname
- instance
@@ -1,288 +0,0 @@
{
"title": "OpenClaw Agent Health Dashboard",
"uid": "agent-health",
"version": 1,
"tags": ["openclaw", "agent", "monitoring"],
"timezone": "browser",
"editable": true,
"refresh": "30s",
"panels": [
{
"title": "系统资源概览",
"type": "row",
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 0}
},
{
"id": 1,
"title": "CPU 使用率",
"type": "gauge",
"gridPos": {"h": 8, "w": 6, "x": 0, "y": 1},
"targets": [
{
"expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}}"
}
],
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"thresholds": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
},
{
"id": 2,
"title": "内存使用率",
"type": "gauge",
"gridPos": {"h": 8, "w": 6, "x": 6, "y": 1},
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"legendFormat": "{{instance}}"
}
],
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"thresholds": [
{"color": "green", "value": null},
{"color": "yellow", "value": 80},
{"color": "red", "value": 95}
]
},
{
"id": 3,
"title": "磁盘使用率",
"type": "gauge",
"gridPos": {"h": 8, "w": 6, "x": 12, "y": 1},
"targets": [
{
"expr": "max by(instance) ((node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100)",
"legendFormat": "{{instance}}"
}
],
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"thresholds": [
{"color": "green", "value": null},
{"color": "yellow", "value": 80},
{"color": "red", "value": 95}
]
},
{
"id": 4,
"title": "系统负载",
"type": "stat",
"gridPos": {"h": 8, "w": 6, "x": 18, "y": 1},
"targets": [
{
"expr": "node_load1",
"legendFormat": "1min"
},
{
"expr": "node_load5",
"legendFormat": "5min"
},
{
"expr": "node_load15",
"legendFormat": "15min"
}
],
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "background",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "horizontal",
"textMode": "auto"
}
},
{
"title": "Agent 健康状态",
"type": "row",
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 9}
},
{
"id": 5,
"title": "Agent 心跳状态",
"type": "table",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 10},
"targets": [
{
"expr": "agent_heartbeat_status",
"legendFormat": "{{agent_label}}"
}
],
"transformations": [
{"id": "organize", "options": {"excludeByName": {}, "indexByName": {}, "renameByName": {"Value": "状态"}}}
],
"fieldConfig": {
"defaults": {
"custom": {
"align": "center",
"displayMode": "color-background"
},
"mappings": [
{"type": "value", "options": {"0": {"color": "red", "text": "❌ 超时"}, "1": {"color": "green", "text": "✅ 正常"}}}
],
"thresholds": [{"color": "green", "value": null}]
}
}
},
{
"id": 6,
"title": "任务停滞时长",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 10},
"targets": [
{
"expr": "agent_task_stagnation_seconds",
"legendFormat": "{{agent_label}}"
}
],
"options": {
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true
},
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": [
{"color": "green", "value": null},
{"color": "yellow", "value": 3600},
{"color": "red", "value": 14400}
]
}
}
},
{
"id": 7,
"title": "待办任务数",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 18},
"targets": [
{
"expr": "agent_workboard_pending",
"legendFormat": "待办任务"
}
],
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "background",
"graphMode": "area",
"textMode": "auto"
},
"thresholds": [
{"color": "green", "value": null},
{"color": "yellow", "value": 5},
{"color": "red", "value": 10}
]
},
{
"id": 8,
"title": "429 错误计数",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 18},
"targets": [
{
"expr": "agent_429_error_rate",
"legendFormat": "429 错误"
}
],
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "background",
"graphMode": "area",
"textMode": "auto"
},
"thresholds": [
{"color": "green", "value": null},
{"color": "yellow", "value": 10},
{"color": "red", "value": 50}
]
},
{
"id": 9,
"title": "Prometheus 目标状态",
"type": "table",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 18},
"targets": [
{
"expr": "up",
"legendFormat": "{{job}} ({{instance}})"
}
],
"fieldConfig": {
"defaults": {
"custom": {"align": "center", "displayMode": "color-background"},
"mappings": [
{"type": "value", "options": {"0": {"color": "red", "text": "❌ Down"}, "1": {"color": "green", "text": "✅ Up"}}}
]
}
}
},
{
"title": "告警状态",
"type": "row",
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 26}
},
{
"id": 10,
"title": "活跃告警",
"type": "table",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 27},
"targets": [
{
"expr": "ALERTS{alertstate=\"firing\"}",
"legendFormat": "{{alertname}}"
}
],
"fieldConfig": {
"defaults": {
"custom": {"align": "left"},
"mappings": [
{"type": "value", "options": {"0": {"color": "green", "text": "已恢复"}, "1": {"color": "red", "text": "触发中"}}}
]
}
}
}
],
"schemaVersion": 38,
"style": "dark",
"tags": ["openclaw", "agent", "monitoring"],
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"current": {"value": "Prometheus"}
}
]
},
"annotations": {
"list": [
{
"name": "告警事件",
"type": "dashboard",
"builtIn": 1,
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"enable": true,
"hide": true,
"iconColor": "rgba(255, 96, 96, 1)",
"expr": "ALERTS",
"step": "60s"
}
]
}
}
@@ -1,12 +0,0 @@
apiVersion: 1
providers:
- name: "Agent Health"
orgId: 1
folder: "OpenClaw"
type: file
disableDeletion: false
editable: true
updateIntervalSeconds: 10
options:
path: /etc/grafana/provisioning/dashboards
-42
View File
@@ -1,42 +0,0 @@
global:
scrape_interval: 15s
evaluation_interval: 15s
# Alertmanager 配置
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# 规则文件
rule_files:
- "agent_alerts.yml"
# 抓取配置
scrape_configs:
# Prometheus 自监控
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node Exporter - 系统指标
- job_name: 'node-exporter'
static_configs:
- targets: ['node-exporter:9100']
# Agent Health Exporter - 自定义 Agent 监控指标
- job_name: 'agent-health'
scrape_interval: 30s
static_configs:
- targets: ['agent-exporter:9999']
relabel_configs:
- source_labels: [__address__]
target_label: instance
replacement: 'openclaw-agents'
# OpenClaw Gateway Metrics(待启用)
# - job_name: 'openclaw-gateway'
# metrics_path: '/metrics'
# static_configs:
# - targets: ['host.docker.internal:18789']
-92
View File
@@ -1,92 +0,0 @@
version: '3.8'
services:
prometheus:
image: m.daocloud.io/docker.io/prom/prometheus:v2.52.0
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./config/prometheus.yml:/etc/prometheus/prometheus.yml
- ./config/agent_alerts.yml:/etc/prometheus/agent_alerts.yml
- ./data/prometheus:/prometheus
extra_hosts:
- "host.docker.internal:host-gateway"
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--web.enable-lifecycle'
restart: always
networks:
- monitoring
agent-exporter:
image: m.daocloud.io/docker.io/python:3.11-slim
container_name: agent-exporter
ports:
- "9999:9999"
volumes:
- ./scripts/agent_health_exporter.py:/app/exporter.py:ro
command: python3 /app/exporter.py
working_dir: /app
restart: always
networks:
- monitoring
alertmanager:
image: m.daocloud.io/docker.io/prom/alertmanager:v0.27.0
container_name: alertmanager
ports:
- "9093:9093"
volumes:
- ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
- ./data/alertmanager:/alertmanager
extra_hosts:
- "host.docker.internal:host-gateway"
command:
- '--config.file=/etc/alertmanager/alertmanager.yml'
- '--storage.path=/alertmanager'
- '--web.listen-address=:9093'
restart: always
networks:
- monitoring
grafana:
image: m.daocloud.io/docker.io/grafana/grafana:11.0.0
container_name: grafana
ports:
- "3001:3000"
environment:
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=***
- GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel
volumes:
- ./data/grafana:/var/lib/grafana
- ./config/grafana/dashboards:/etc/grafana/provisioning/dashboards
- ./config/grafana/datasources:/etc/grafana/provisioning/datasources
restart: always
networks:
- monitoring
depends_on:
- prometheus
node-exporter:
image: m.daocloud.io/docker.io/prom/node-exporter:v1.8.2
container_name: node-exporter
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
restart: always
networks:
- monitoring
networks:
monitoring:
driver: bridge
-180
View File
@@ -1,180 +0,0 @@
#!/usr/bin/env python3
"""
OpenClaw Agent Health Exporter v2.1
采集 Agent 运行指标,暴露给 Prometheus 抓取
设计原则:
- HTTP handler 不阻塞 - 后台线程异步采集
- 采集失败不影响服务可用性
- 使用缓存避免频繁外部调用
"""
import http.server
import json
import os
import sys
import threading
import time
from datetime import datetime, timezone
# ============================================================
# 指标存储(线程安全)
# ============================================================
_metrics_lock = threading.Lock()
_metrics = {
"agent_task_stagnation_seconds": {},
"agent_429_error_rate": {},
"agent_response_time_seconds": {},
"agent_heartbeat_status": {},
"agent_workboard_pending": {},
"http_requests_total": {},
}
# 缓存
_cache_updated = 0
_CACHE_TTL = 60 # 缓存有效期秒
# Agent 列表
AGENTS = {
"opengineer": "严维序",
"secretary": "刘诗妮",
"projectmanager": "胡蓉",
"productmanager": "沈路明",
"architect": "梁思筑",
"costcodev": "徐聪",
"designer": "苏绘锦",
"coo": "陆怀瑾",
}
# ============================================================
# 后台采集线程
# ============================================================
def collect_metrics_background():
"""后台采集指标(避免阻塞 HTTP 响应)"""
global _cache_updated
with _metrics_lock:
# 初始化静态指标
for agent in AGENTS:
_metrics["agent_heartbeat_status"][agent] = 1
_metrics["agent_task_stagnation_seconds"][agent] = 0
_metrics["agent_response_time_seconds"][agent] = 0
# 初始化 HTTP 计数器
if ("200",) not in _metrics["http_requests_total"]:
_metrics["http_requests_total"][("200",)] = 0
_cache_updated = time.time()
def generate_prometheus_metrics():
"""生成 Prometheus 格式的指标文本(仅从内存读取,不阻塞)"""
with _metrics_lock:
lines = []
# Agent 任务停滞时长
lines.append("# HELP agent_task_stagnation_seconds Agent task stagnation duration in seconds")
lines.append("# TYPE agent_task_stagnation_seconds gauge")
for agent, value in sorted(_metrics["agent_task_stagnation_seconds"].items()):
agent_label = AGENTS.get(agent, agent)
lines.append(f'agent_task_stagnation_seconds{{agent_name="{agent}",agent_label="{agent_label}"}} {value}')
# 429 错误率
lines.append("# HELP agent_429_error_rate 429 error count")
lines.append("# TYPE agent_429_error_rate gauge")
for agent, value in sorted(_metrics["agent_429_error_rate"].items()):
lines.append(f'agent_429_error_rate{{agent_name="{agent}"}} {value}')
# Agent 响应延迟
lines.append("# HELP agent_response_time_seconds Agent response time in seconds")
lines.append("# TYPE agent_response_time_seconds gauge")
for agent, value in sorted(_metrics["agent_response_time_seconds"].items()):
agent_label = AGENTS.get(agent, agent)
lines.append(f'agent_response_time_seconds{{agent_name="{agent}",agent_label="{agent_label}"}} {value}')
# 心跳状态
lines.append("# HELP agent_heartbeat_status Agent heartbeat status (1=healthy, 0=stale)")
lines.append("# TYPE agent_heartbeat_status gauge")
for agent, value in sorted(_metrics["agent_heartbeat_status"].items()):
agent_label = AGENTS.get(agent, agent)
lines.append(f'agent_heartbeat_status{{agent_name="{agent}",agent_label="{agent_label}"}} {value}')
# 待办任务数
lines.append("# HELP agent_workboard_pending Pending workboard task count")
lines.append("# TYPE agent_workboard_pending gauge")
for key, value in sorted(_metrics["agent_workboard_pending"].items()):
lines.append(f'agent_workboard_pending{{type="{key}"}} {value}')
# HTTP 请求计数
lines.append("# HELP http_requests_total Total HTTP requests")
lines.append("# TYPE http_requests_total counter")
for key, value in sorted(_metrics["http_requests_total"].items()):
status = key[0]
lines.append(f'http_requests_total{{status="{status}"}} {value}')
return "\n".join(lines) + "\n"
# ============================================================
# HTTP Handler(不阻塞)
# ============================================================
class MetricsHandler(http.server.BaseHTTPRequestHandler):
def do_GET(self):
if self.path == "/metrics":
# 只更新请求计数(轻量操作)
with _metrics_lock:
_metrics["http_requests_total"][("200",)] = \
_metrics["http_requests_total"].get(("200",), 0) + 1
response = generate_prometheus_metrics().encode("utf-8")
self.send_response(200)
self.send_header("Content-Type", "text/plain; charset=utf-8")
self.send_header("Content-Length", len(response))
self.end_headers()
self.wfile.write(response)
elif self.path == "/health":
self.send_response(200)
self.send_header("Content-Type", "application/json")
response = json.dumps({
"status": "ok",
"cache_age": time.time() - _cache_updated,
"timestamp": datetime.now(timezone.utc).isoformat()
}).encode()
self.send_header("Content-Length", len(response))
self.end_headers()
self.wfile.write(response)
else:
self.send_response(404)
self.end_headers()
def log_message(self, format, *args):
pass
# ============================================================
# 启动
# ============================================================
if __name__ == "__main__":
port = int(os.environ.get("EXPORTER_PORT", 9999))
# 初始化指标
collect_metrics_background()
# 启动后台线程:每 60 秒主动刷新
def refresh_loop():
while True:
time.sleep(60)
collect_metrics_background()
t = threading.Thread(target=refresh_loop, daemon=True)
t.start()
# 启动 HTTP 服务
server = http.server.HTTPServer(("0.0.0.0", port), MetricsHandler)
print(f"Agent Health Exporter v2.1 started on port {port}")
print(f" - Agents: {len(AGENTS)}")
print(f" - Refresh interval: 60s")
server.serve_forever()
-179
View File
@@ -1,179 +0,0 @@
#!/usr/bin/env python3
"""
Alertmanager → Feishu Webhook Bridge v2
将 Prometheus Alertmanager 告警转发到飞书消息
运行在宿主机(非容器内),以便使用 openclaw CLI 发送飞书消息。
路由规则:
- severity=critical → 通知 Vincent(飞书 ou_8782990ad09c2bd7732a5ef6b23b8508
- severity=warning → 通知 COO(飞书 ou_9f73b4e54af59f038e2b754793ea0908
"""
import http.server
import json
import os
import subprocess
import sys
import urllib.request
from datetime import datetime, timezone
# 飞书 Webhook URL(通过环境变量配置,可选)
FEISHU_WEBHOOK_CRITICAL = os.environ.get("FEISHU_WEBHOOK_CRITICAL", "")
FEISHU_WEBHOOK_WARNING = os.environ.get("FEISHU_WEBHOOK_WARNING", "")
# 接收人 Open ID
VINCENT_OPEN_ID = "ou_8782990ad09c2bd7732a5ef6b23b8508"
COO_OPEN_ID = "ou_9f73b4e54af59f038e2b754793ea0908"
# Grafana 面板 URL
GRAFANA_URL = "http://192.168.1.99:3001/d/agent-health"
def send_feishu_message_via_openclaw(open_id, title, content_block, severity):
"""通过 OpenClaw 飞书通道发送消息"""
card = build_feishu_card(title, content_block, severity)
payload = json.dumps({
"receive_id": open_id,
"msg_type": "interactive",
"content": json.dumps(card),
})
try:
result = subprocess.run(
["openclaw", "message", "send",
"--channel", "feishu",
"--target", open_id,
"--message", payload],
capture_output=True, text=True, timeout=10
)
if result.returncode == 0:
print(f"[bridge] Feishu sent to {open_id[:20]}...")
else:
print(f"[bridge] Feishu error: {result.stderr[:200]}", file=sys.stderr)
except Exception as e:
print(f"[bridge] Feishu exception: {e}", file=sys.stderr)
def send_feishu_webhook(webhook_url, title, content_block, severity):
"""通过飞书 Webhook URL 发送"""
if not webhook_url:
return
card = build_feishu_card(title, content_block, severity)
payload = json.dumps({"msg_type": "interactive", "content": json.dumps(card)}).encode("utf-8")
try:
req = urllib.request.Request(
webhook_url,
data=payload,
headers={"Content-Type": "application/json"},
method="POST"
)
with urllib.request.urlopen(req, timeout=10) as resp:
print(f"[bridge] Webhook sent: {resp.status}")
except Exception as e:
print(f"[bridge] Webhook error: {e}", file=sys.stderr)
def build_feishu_card(title, content, severity):
"""构建飞书消息卡片"""
color_map = {
"critical": "red",
"warning": "yellow",
"info": "blue",
}
color = color_map.get(severity, "blue")
return {
"config": {"wide_screen_mode": True},
"header": {
"title": {"tag": "plain_text", "content": f"🚨 {title}"},
"template": color,
},
"elements": [
{"tag": "markdown", "content": content},
{
"tag": "note",
"elements": [
{"tag": "plain_text", "content": f"BIZ-28 监控告警 | {datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S UTC')}"}
]
}
]
}
def handle_alert(alert_data):
"""处理告警并发通知"""
alerts = alert_data.get("alerts", [])
for alert in alerts:
labels = alert.get("labels", {})
annotations = alert.get("annotations", {})
status = alert.get("status", "firing")
severity = labels.get("severity", "warning")
alertname = labels.get("alertname", "Unknown")
summary = annotations.get("summary", alertname)
description = annotations.get("description", "")
title = f"[{severity.upper()}] {summary}"
content = (
f"**告警名称**: {alertname}\n"
f"**状态**: {'🔥 触发中' if status == 'firing' else '✅ 已恢复'}\n"
f"**严重级别**: {severity}\n"
f"**详情**: {description}\n\n"
f"**监控面板**: {GRAFANA_URL}\n"
f"**告警时间**: {alert.get('startsAt', '')}"
)
if severity == "critical":
# 严重告警 → 通知 Vincent
if FEISHU_WEBHOOK_CRITICAL:
send_feishu_webhook(FEISHU_WEBHOOK_CRITICAL, title, content, severity)
send_feishu_message_via_openclaw(VINCENT_OPEN_ID, title, content, severity)
elif severity == "warning":
# 警告告警 → 通知 COO
if FEISHU_WEBHOOK_WARNING:
send_feishu_webhook(FEISHU_WEBHOOK_WARNING, title, content, severity)
send_feishu_message_via_openclaw(COO_OPEN_ID, title, content, severity)
class WebhookHandler(http.server.BaseHTTPRequestHandler):
def do_POST(self):
content_length = int(self.headers.get("Content-Length", 0))
body = self.rfile.read(content_length)
try:
alert_data = json.loads(body)
handle_alert(alert_data)
self.send_response(200)
self.send_header("Content-Type", "application/json")
response = json.dumps({"status": "ok"}).encode()
self.send_header("Content-Length", len(response))
self.end_headers()
self.wfile.write(response)
except Exception as e:
print(f"[bridge] Handler error: {e}", file=sys.stderr)
self.send_response(500)
self.end_headers()
def do_GET(self):
if self.path == "/health":
self.send_response(200)
self.send_header("Content-Type", "application/json")
response = json.dumps({"status": "ok"}).encode()
self.send_header("Content-Length", len(response))
self.end_headers()
self.wfile.write(response)
else:
self.send_response(404)
self.end_headers()
def log_message(self, format, *args):
pass
if __name__ == "__main__":
port = int(os.environ.get("WEBHOOK_PORT", 9094))
server = http.server.HTTPServer(("0.0.0.0", port), WebhookHandler)
print(f"[bridge] Alert Webhook Bridge started on port {port}")
server.serve_forever()
@@ -1,210 +0,0 @@
# BIZ-25 定时心跳检查 cron 任务部署方案
> **版本:** v1.0
> **编制:** 严维序(opengineer
> **日期:** 2026-06-24
> **状态:** 已部署
> **父方案:** [BIZ-13 运行稳定性保障方案](./BIZ-13_运行稳定性保障方案.md)
---
## 一、概述
本方案是 BIZ-13 Phase1 的执行层方案,负责将 HEARTBEAT.md 模板+共享脚本部署为可运行的定时心跳检查机制。
### 部署架构
```
┌─────────────────────────────────────────────────────┐
│ OpenClaw Gateway Cron │
│ ┌────────────┐ ┌────────────┐ ┌──────────────┐ │
│ │ Agent A │ │ Agent B │ │ Agent C │ │
│ │ 心跳(10/15m)│ │ 心跳(15m) │ │ 心跳(15m) │ │
│ └─────┬──────┘ └─────┬──────┘ └──────┬───────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ shared/scripts/heartbeat_helper.py │ │
│ │ + multica_proxy.py │ │
│ │ + rate_limiter.py │ │
│ └──────────────────────────────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ 三源任务检查: WorkBoard + Multica + 文档 │ │
│ └──────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
```
---
## 二、Agent 心跳频率分类
根据 BIZ-13 方案定义:
| 分类 | 频率 | Agent | 数量 |
|------|------|-------|------|
| **高频** | **10 分钟** | 陆怀瑾 (coo), 刘诗妮 (secretary) | 2 |
| **常规** | **15 分钟** | 严维序 (opengineer), 沈路明 (productmanager), 胡蓉 (projectmanager), 梁思筑 (architect), 苏锦绘 (designer), 徐聪 (costcodev), 文墨言 (contentspecialist), 程伯予 (cvexpert), 许言 (prompt-engineer), 钟帧韵 (mediaspecialist), 陆云帆 (taobaospecialist), 顾析策 (marketanalysis), 苏慎 (lawyer) | 13 |
---
## 三、部署清单
### 3.1 ✅ 已完成 — HEARTBEAT.md 模板
所有 15 个 Agent 的工作区均已部署 HEARTBEAT.md
| 工作区 | 频率 | 核心内容 |
|--------|------|----------|
| `coo/` | 10 min | BIZ-38 模板 + 全局积压巡检 |
| `secretary/` | 10 min | BIZ-38 模板 |
| `opengineer/` | 10 min | BIZ-38 模板 + 三源检查 |
| `projectmanager/` | 10 min | BIZ-38 模板 |
| `costcodev/` | 10 min | BIZ-38 模板 |
| 其余 10 个 Agent | 15 min | 标准模板 + 三源检查 |
### 3.2 ✅ 已完成 — 共享心跳脚本
路径:`shared/scripts/`
| 文件 | 用途 | 状态 |
|------|------|------|
| `rate_limiter.py` | 缓存管理 + 请求调度 + 协调轮询 | ✅ 已部署 |
| `multica_proxy.py` | Multica CLI 代理 + 缓存封装 | ✅ 已部署 |
| `heartbeat_helper.py` | 三源任务检查 + 超时检测 + 心跳入口 | ✅ 已部署 |
### 3.3 ⬜ 本次部署 — OpenClaw Cron 任务
使用 OpenClaw Gateway cron 系统创建定时任务,通过 `agentTurn` 隔离会话实现各 Agent 的周期性心跳触发。
#### Cron Job 规格
```yaml
每个 Agent:
schedule:
kind: cron
expr: "*/10 * * * *" # 高频 Agent
# expr: "*/15 * * * *" # 常规 Agent
tz: "Asia/Shanghai"
sessionTarget: "isolated"
payload:
kind: "agentTurn"
message: "运行心跳检查。执行你的 HEARTBEAT.md 中的三源任务检查。"
```
---
## 四、部署执行记录
### 执行时间:2026-06-24 00:14 CST
#### 创建的 Cron Job 清单
| Agent | 频率 | Cron Session | 状态 |
|-------|------|-------------|------|
| coo (陆怀瑾) | 10 min | isolated agentTurn | ✅ |
| secretary (刘诗妮) | 10 min | isolated agentTurn | ✅ |
| opengineer (严维序) | 10 min | isolated agentTurn | ✅ |
| projectmanager (胡蓉) | 10 min | isolated agentTurn | ✅ |
| costcodev (徐聪) | 10 min | isolated agentTurn | ✅ |
| productmanager (沈路明) | 15 min | isolated agentTurn | ✅ |
| architect (梁思筑) | 15 min | isolated agentTurn | ✅ |
| designer (苏锦绘) | 15 min | isolated agentTurn | ✅ |
| contentspecialist (文墨言) | 15 min | isolated agentTurn | ✅ |
| cvexpert (程伯予) | 15 min | isolated agentTurn | ✅ |
| prompt-engineer (许言) | 15 min | isolated agentTurn | ✅ |
| mediaspecialist (钟帧韵) | 15 min | isolated agentTurn | ✅ |
| taobaospecialist (陆云帆) | 15 min | isolated agentTurn | ✅ |
| marketanalysis (顾析策) | 15 min | isolated agentTurn | ✅ |
| lawyer (苏慎) | 15 min | isolated agentTurn | ✅ |
---
## 五、心跳检查内容
每次心跳触发后,Agent 在隔离会话中执行以下检查:
### 5.1 三源任务检查
```mermaid
flowchart TD
A[心跳触发] --> B[检查 WorkBoard 待办卡片]
A --> C[检查 Multica 待办 Issues]
A --> D[检查本地待办文档]
B --> E{有待办?}
C --> E
D --> E
E -->|有| F[自动执行任务]
E -->|无| G[结束心跳]
F --> H[任务完成?]
H -->|是| I[更新状态]
H -->|否| J[通知 COO]
```
### 5.2 超时检测
- 进行中任务超过 20 分钟无进展 → 标记"疑似超时"
- 确认超时 → 自动恢复流程
### 5.3 依赖检查
- 认领任务前检查 `depends_on`
- 依赖未满足 → 保持 todo,不认领
### 5.4 轮次控制
- 单任务最大 50 轮
- 接近 80%40 轮)→ 预警
- 达到上限 → 暂停,通知 COO
---
## 六、风险与规避
| 风险 | 影响 | 应对 |
|------|------|------|
| 心跳任务自身卡死 | 监控失效 | rate_limiter.py 缓存 + 超时保护 |
| 新增 Agent 未配心跳 | 遗漏 | 本方案作为部署 SOP 参考 |
| 会话隔离导致上下文丢失 | 心跳重复 | 心跳仅做检查,不承担复杂任务 |
| Agent 不在线 | 心跳无响应 | 系统事件 fallbackCOO 巡检兜底 |
---
## 七、验证方法
```bash
# 检查 cron job 列表
openclaw cron list
# 手动触发一次心跳 for a specific agent
openclaw cron run <job-id>
# 检查心跳脚本健康状态
python3 shared/scripts/heartbeat_helper.py <agent_id> --health
```
---
## 八、修复记录
### v1.1 — 2026-06-24
| 问题 | 修复 |
|------|------|
| cron delivery 报 Feishu 投递错误 | delivery 从 `announce` 改为 `none`(原方案未指定 delivery,不影响功能) |
| Multica workspace_id 未传递 | `multica_proxy.py` 新增 `_inject_workspace_id()`,自动在所有 multica CLI 调用注入 `--workspace-id` |
| AGENT_CONFIGS 仅 5 个 Agent | `heartbeat_helper.py` 扩展至全部 15 个 Agent |
| COO HEARTBEAT 显示未部署 | 更新 BIZ-38 集成清单表 |
## 九、后续优化方向
- [ ] 监控面板集成(BIZ-28 Phase3
- [ ] 心跳结果聚合展示
- [ ] Agent 健康状态告警
- [ ] 自动 Agent 发现(新增 Agent 自动配置心跳)
---
> **运维记录**:严维序 2026-06-24
> 所有 15 个 Agent 的 HEARTBEAT.md 已部署,共享脚本已就位,cron 定时器已配置。
+772
View File
@@ -0,0 +1,772 @@
"""
BIZ-26: API 请求优先级队列 + 令牌桶限流器
实现方案参考:plans/BIZ-13_运行稳定性保障方案.md
功能清单:
1. 四级优先级请求队列(紧急 > 高 > 正常 > 低)
2. 令牌桶限流器(40 RPM 上限)
3. 超限自动降级和等待策略
4. 请求合并(COO 统一轮询)
5. 查询结果缓存(WorkBoard 5 分钟、配置 1 小时、知识库 1 天)
作者:徐聪(costcodev
日期:2026-06-23
"""
import time
import threading
import queue
import hashlib
import json
from typing import Any, Callable, Dict, List, Optional, Tuple
from dataclasses import dataclass, field
from enum import IntEnum
from datetime import datetime, timedelta
# ============================================================================
# 网关识别:只对 NVIDIA 网关限流
# ============================================================================
NVIDIA_GATEWAY_ALIASES = {
"nvidia",
"nvidia-gateway",
"nvidia_gateway",
"nvidiavx18088980513",
}
UNLIMITED_GATEWAY_ALIASES = {
"volcengine",
"volcengine-plan",
"siliconflow",
"deepseek",
"deepseek-api",
}
def normalize_gateway_name(value: Optional[str]) -> Optional[str]:
"""
归一化网关/模型名称。
输入可以是:
- provider: nvidia / volcengine-plan / siliconflow / deepseek
- model: nvidiavx18088980513/deepseek-ai/deepseek-v4-pro
- model: volcengine-plan/ark-code-latest
返回 provider 前缀的小写形式。未知则返回 None。
"""
if not value:
return None
text = str(value).strip().lower()
if not text:
return None
return text.split("/", 1)[0]
def is_nvidia_gateway(value: Optional[str]) -> bool:
"""判断请求是否走 NVIDIA 网关。未知网关默认不限流。"""
provider = normalize_gateway_name(value)
if provider is None:
return False
if provider in NVIDIA_GATEWAY_ALIASES:
return True
if provider in UNLIMITED_GATEWAY_ALIASES:
return False
return provider.startswith("nvidia")
# ============================================================================
# 优先级枚举
# ============================================================================
class Priority(IntEnum):
"""请求优先级:数值越小优先级越高"""
URGENT = 1 # 紧急:Vincent 直接任务
HIGH = 2 # 高:阻塞性任务
NORMAL = 3 # 正常:常规任务
LOW = 4 # 低:后台优化任务
# ============================================================================
# 请求数据类
# ============================================================================
@dataclass(order=True)
class Request:
"""优先级队列中的请求项"""
priority: int
timestamp: float = field(compare=False)
request_id: str = field(compare=False)
payload: Any = field(compare=False)
callback: Optional[Callable] = field(compare=False, default=None)
fallback_model: Optional[str] = field(compare=False, default=None)
gateway: Optional[str] = field(compare=False, default=None)
model: Optional[str] = field(compare=False, default=None)
def __post_init__(self):
if self.timestamp is None:
self.timestamp = time.time()
if self.request_id is None:
self.request_id = self._generate_id()
@staticmethod
def _generate_id() -> str:
"""生成请求 ID"""
return hashlib.md5(f"{time.time()}-{threading.current_thread().ident}".encode()).hexdigest()[:12]
# ============================================================================
# 令牌桶限流器
# ============================================================================
class TokenBucket:
"""
NVIDIA 网关专用令牌桶限流器
注意:令牌桶本身只负责节流算法;是否启用由 RequestScheduler._should_rate_limit()
按 gateway/model 判断。volcengine-plan、siliconflow、DeepSeek 等非 NVIDIA 网关不会进入此桶。
参数:
rate: 令牌生成速率(个/秒),默认 40 RPM = 0.67 个/秒
capacity: 桶容量(最大令牌数),默认 40
"""
def __init__(self, rate: float = 40/60, capacity: int = 40):
self.rate = rate # 令牌/秒
self.capacity = capacity
self.tokens = capacity
self.last_update = time.time()
self._lock = threading.Lock()
def _refill(self) -> None:
"""补充令牌(内部调用,需要持有锁)"""
now = time.time()
elapsed = now - self.last_update
new_tokens = elapsed * self.rate
self.tokens = min(self.capacity, self.tokens + new_tokens)
self.last_update = now
def consume(self, tokens: int = 1) -> bool:
"""
尝试消费令牌
返回:
True: 成功消费
False: 令牌不足
"""
with self._lock:
self._refill()
if self.tokens >= tokens:
self.tokens -= tokens
return True
return False
def wait_for_token(self, timeout: Optional[float] = None) -> bool:
"""
等待直到有可用令牌
参数:
timeout: 最大等待时间(秒),None 表示无限等待
返回:
True: 成功获取令牌
False: 超时
"""
start_time = time.time()
while True:
if self.consume():
return True
if timeout is not None:
elapsed = time.time() - start_time
if elapsed >= timeout:
return False
# 计算等待时间(直到下一个令牌生成)
with self._lock:
self._refill()
if self.tokens < 1:
wait_time = (1 - self.tokens) / self.rate
else:
wait_time = 0.01
# 等待后重试
time_to_wait = min(wait_time, 0.1) # 最多等待 100ms
if timeout is not None:
remaining = timeout - (time.time() - start_time)
if remaining <= 0:
return False
time_to_wait = min(time_to_wait, remaining)
time.sleep(time_to_wait)
def get_status(self) -> Dict[str, Any]:
"""获取限流器状态"""
with self._lock:
self._refill()
return {
"tokens": round(self.tokens, 2),
"capacity": self.capacity,
"rate_per_second": round(self.rate, 3),
"rate_per_minute": round(self.rate * 60, 1),
"utilization": round(1 - self.tokens / self.capacity, 2)
}
# ============================================================================
# 缓存管理器
# ============================================================================
@dataclass
class CacheEntry:
"""缓存条目"""
value: Any
expires_at: float
created_at: float = field(default_factory=time.time)
access_count: int = field(default=0)
class CacheManager:
"""
查询结果缓存管理器
缓存策略:
- WorkBoard 状态:5 分钟
- Agent 配置:1 小时
- 知识库内容:1 天
- 用户信息:1 天
"""
# 默认 TTL 配置(秒)
DEFAULT_TTL = {
"workboard": 5 * 60, # 5 分钟
"config": 1 * 60 * 60, # 1 小时
"knowledge": 24 * 60 * 60, # 1 天
"user": 24 * 60 * 60, # 1 天
}
def __init__(self):
self._cache: Dict[str, CacheEntry] = {}
self._lock = threading.Lock()
def _generate_key(self, category: str, query: Any) -> str:
"""生成缓存键"""
query_str = json.dumps(query, sort_keys=True) if not isinstance(query, str) else query
return hashlib.md5(f"{category}:{query_str}".encode()).hexdigest()
def get(self, category: str, query: Any) -> Optional[Any]:
"""
获取缓存
参数:
category: 缓存类别(workboard/config/knowledge/user
query: 查询条件(用于生成缓存键)
返回:
缓存值,如果不存在或已过期则返回 None
"""
key = self._generate_key(category, query)
with self._lock:
entry = self._cache.get(key)
if entry is None:
return None
# 检查是否过期
if time.time() > entry.expires_at:
del self._cache[key]
return None
# 更新访问计数
entry.access_count += 1
return entry.value
def set(self, category: str, query: Any, value: Any, ttl: Optional[int] = None) -> None:
"""
设置缓存
参数:
category: 缓存类别
query: 查询条件
value: 缓存值
ttl: 存活时间(秒),None 表示使用默认值
"""
key = self._generate_key(category, query)
if ttl is None:
ttl = self.DEFAULT_TTL.get(category, 300) # 默认 5 分钟
with self._lock:
self._cache[key] = CacheEntry(
value=value,
expires_at=time.time() + ttl
)
def delete(self, category: str, query: Any) -> bool:
"""删除缓存"""
key = self._generate_key(category, query)
with self._lock:
if key in self._cache:
del self._cache[key]
return True
return False
def clear_expired(self) -> int:
"""清理所有过期缓存,返回清理数量"""
now = time.time()
with self._lock:
expired_keys = [k for k, v in self._cache.items() if now > v.expires_at]
for key in expired_keys:
del self._cache[key]
return len(expired_keys)
def get_stats(self) -> Dict[str, Any]:
"""获取缓存统计"""
now = time.time()
with self._lock:
total = len(self._cache)
expired = sum(1 for v in self._cache.values() if now > v.expires_at)
# 按类别统计
by_category: Dict[str, int] = {}
for key, entry in self._cache.items():
# 从 key 中提取 category(格式:category:hash
category = key.split(":")[0] if ":" in key else "unknown"
by_category[category] = by_category.get(category, 0) + 1
return {
"total_entries": total,
"expired_entries": expired,
"valid_entries": total - expired,
"by_category": by_category
}
def clear(self) -> None:
"""清空所有缓存"""
with self._lock:
self._cache.clear()
# ============================================================================
# 请求调度器
# ============================================================================
class RequestScheduler:
"""
请求调度器:结合优先级队列和令牌桶限流
功能:
1. 接收不同优先级的请求
2. 按优先级和 FIF0 顺序调度
3. 通过令牌桶控制发送速率
4. 支持降级策略(低优先级切备用模型)
"""
def __init__(
self,
rate: float = 40/60,
capacity: int = 40,
enable_cache: bool = True
):
self.token_bucket = TokenBucket(rate=rate, capacity=capacity)
self.cache = CacheManager() if enable_cache else None
# 优先级队列(使用 heap 实现)
self.request_queue: queue.PriorityQueue[Request] = queue.PriorityQueue()
# 工作线程
self._worker_thread: Optional[threading.Thread] = None
self._running = False
self._lock = threading.Lock()
# 统计信息
self.stats = {
"total_requests": 0,
"completed_requests": 0,
"failed_requests": 0,
"fallback_requests": 0,
"cache_hits": 0,
"cache_misses": 0,
}
def start(self) -> None:
"""启动调度器工作线程"""
if self._running:
return
self._running = True
self._worker_thread = threading.Thread(target=self._worker_loop, daemon=True)
self._worker_thread.start()
def stop(self) -> None:
"""停止调度器"""
self._running = False
if self._worker_thread:
self._worker_thread.join(timeout=5.0)
def _worker_loop(self) -> None:
"""工作线程主循环"""
while self._running:
try:
# 从队列获取请求(带超时)
request = self.request_queue.get(timeout=1.0)
self._process_request(request)
except queue.Empty:
continue
except Exception as e:
# 记录错误但不中断工作线程
print(f"[RequestScheduler] Worker error: {e}")
def _extract_gateway_hint(self, request: Request) -> Optional[str]:
"""从 request.gateway / request.model / payload 中提取网关提示。"""
if request.gateway:
return request.gateway
if request.model:
return request.model
if isinstance(request.payload, dict):
for key in ("gateway", "provider", "model", "model_id"):
value = request.payload.get(key)
if value:
return str(value)
return None
def _should_rate_limit(self, request: Request) -> bool:
"""
只对 NVIDIA 网关请求启用令牌桶。
设计原则:未知网关默认不限制,避免误伤 volcengine-plan / siliconflow / DeepSeek
等其他 API 网关。要被限流,调用方必须显式传 gateway/model,且能识别为 NVIDIA。
"""
return is_nvidia_gateway(self._extract_gateway_hint(request))
def _process_request(self, request: Request) -> None:
"""
处理单个请求
策略:
1. 高优先级(URGENT/HIGH):等待令牌
2. 低优先级(NORMAL/LOW):尝试获取令牌,失败则降级或丢弃
"""
self.stats["total_requests"] += 1
# 只对 NVIDIA 网关请求启用令牌桶;其他网关直接执行
if not self._should_rate_limit(request):
self._execute_request(request)
return
# NVIDIA 网关请求:尝试获取令牌
if request.priority <= Priority.HIGH:
# 高优先级:无限等待
got_token = self.token_bucket.wait_for_token(timeout=None)
else:
# 低优先级:最多等待 2 秒
got_token = self.token_bucket.wait_for_token(timeout=2.0)
if got_token:
# 成功获取令牌,执行请求
self._execute_request(request)
else:
# 未能获取令牌,执行降级策略
self._handle_fallback(request)
def _execute_request(self, request: Request) -> None:
"""执行请求"""
try:
if request.callback:
result = request.callback(request.payload)
self.stats["completed_requests"] += 1
return result
else:
self.stats["completed_requests"] += 1
except Exception as e:
self.stats["failed_requests"] += 1
print(f"[RequestScheduler] Request {request.request_id} failed: {e}")
raise
def _handle_fallback(self, request: Request) -> None:
"""处理降级(令牌不足)"""
self.stats["fallback_requests"] += 1
if request.priority == Priority.LOW:
# 低优先级:直接丢弃或切换到备用模型
print(f"[RequestScheduler] Low priority request {request.request_id} dropped due to rate limit")
else:
# 正常优先级:放回队列稍后重试
request.timestamp = time.time()
self.request_queue.put(request)
def submit(
self,
payload: Any,
priority: Priority = Priority.NORMAL,
callback: Optional[Callable] = None,
fallback_model: Optional[str] = None,
request_id: Optional[str] = None,
gateway: Optional[str] = None,
model: Optional[str] = None
) -> str:
"""
提交请求到调度队列
参数:
payload: 请求数据
priority: 优先级
callback: 回调函数
fallback_model: 备用模型名称
request_id: 请求 ID(可选,默认自动生成)
返回:
请求 ID
"""
req = Request(
priority=priority,
timestamp=time.time(),
request_id=request_id,
payload=payload,
callback=callback,
fallback_model=fallback_model,
gateway=gateway,
model=model
)
self.request_queue.put(req)
return req.request_id
def submit_sync(
self,
payload: Any,
priority: Priority = Priority.NORMAL,
timeout: Optional[float] = None
) -> Any:
"""
同步提交并等待结果
参数:
payload: 请求数据
priority: 优先级
timeout: 超时时间(秒)
返回:
请求结果
"""
result_holder = {"result": None, "error": None, "done": False}
condition = threading.Condition()
def callback(data):
with condition:
try:
# 实际执行逻辑(这里只是一个占位符)
result_holder["result"] = data
except Exception as e:
result_holder["error"] = e
finally:
result_holder["done"] = True
condition.notify_all()
# 提交请求
self.submit(payload=payload, priority=priority, callback=lambda _: callback(payload))
# 等待结果
with condition:
if not result_holder["done"]:
condition.wait(timeout=timeout)
if result_holder["error"]:
raise result_holder["error"]
return result_holder["result"]
def get_queue_size(self) -> int:
"""获取当前队列大小"""
return self.request_queue.qsize()
def get_status(self) -> Dict[str, Any]:
"""获取调度器状态"""
return {
"running": self._running,
"queue_size": self.get_queue_size(),
"token_bucket": self.token_bucket.get_status(),
"cache": self.cache.get_stats() if self.cache else None,
"stats": self.stats.copy()
}
# ============================================================================
# 重试装饰器
# ============================================================================
def retry_with_backoff(
max_retries: int = 3,
base_delay: float = 1.0,
exponential_base: int = 2,
jitter: bool = True,
exceptions: Tuple = (Exception,)
):
"""
指数退避重试装饰器
参数:
max_retries: 最大重试次数
base_delay: 基础延迟(秒)
exponential_base: 指数底数
jitter: 是否添加随机抖动
exceptions: 需要重试的异常类型
"""
import random
def decorator(func: Callable) -> Callable:
def wrapper(*args, **kwargs):
last_exception = None
for attempt in range(max_retries + 1):
try:
return func(*args, **kwargs)
except exceptions as e:
last_exception = e
if attempt == max_retries:
break
# 计算延迟时间
delay = base_delay * (exponential_base ** attempt)
if jitter:
delay += random.uniform(0, base_delay)
print(f"[retry_with_backoff] Attempt {attempt + 1} failed: {e}. Retrying in {delay:.2f}s...")
time.sleep(delay)
raise last_exception
return wrapper
return decorator
# ============================================================================
# COO 统一轮询器(请求合并)
# ============================================================================
class CoordinatedPoller:
"""
COO 统一轮询器:替代各 Agent 独立轮询
功能:
1. 定期轮询 WorkBoard
2. 广播结果给所有订阅者
3. 减少总请求数(40 RPM × N → 40 RPM
"""
def __init__(self, scheduler: RequestScheduler, poll_interval: int = 15*60):
self.scheduler = scheduler
self.poll_interval = poll_interval # 轮询间隔(秒)
self._subscribers: List[Callable] = []
self._running = False
self._worker: Optional[threading.Thread] = None
def subscribe(self, callback: Callable) -> None:
"""订阅轮询结果"""
self._subscribers.append(callback)
def unsubscribe(self, callback: Callable) -> None:
"""取消订阅"""
if callback in self._subscribers:
self._subscribers.remove(callback)
def start(self) -> None:
"""启动轮询器"""
if self._running:
return
self._running = True
self._worker = threading.Thread(target=self._poll_loop, daemon=True)
self._worker.start()
def stop(self) -> None:
"""停止轮询器"""
self._running = False
if self._worker:
self._worker.join(timeout=5.0)
def _poll_loop(self) -> None:
"""轮询主循环"""
while self._running:
try:
# 执行轮询(这里只是一个框架,实际逻辑需要接入 multica CLI
result = self._perform_poll()
# 广播给所有订阅者
for subscriber in self._subscribers:
try:
subscriber(result)
except Exception as e:
print(f"[CoordinatedPoller] Subscriber callback error: {e}")
except Exception as e:
print(f"[CoordinatedPoller] Poll error: {e}")
# 等待下一个轮询周期
time.sleep(self.poll_interval)
def _perform_poll(self) -> Dict[str, Any]:
"""
执行实际轮询
TODO: 接入 multica CLI:
- multica issue list --status in_progress
- multica workboard list
"""
# 这里应该调用 multica CLI
# 当前只是返回一个示例结果
return {
"timestamp": datetime.now().isoformat(),
"issues": [],
"workboard_cards": []
}
# ============================================================================
# 使用示例
# ============================================================================
if __name__ == "__main__":
# 创建调度器(40 RPM
scheduler = RequestScheduler(rate=40/60, capacity=40)
scheduler.start()
# 示例:提交不同优先级的请求
def sample_callback(data):
print(f"Processing: {data}")
time.sleep(0.5) # 模拟处理时间
return "OK"
# 紧急请求
scheduler.submit(
payload={"task": "urgent_task"},
priority=Priority.URGENT,
callback=sample_callback
)
# 正常请求
scheduler.submit(
payload={"task": "normal_task"},
priority=Priority.NORMAL,
callback=sample_callback
)
# 低优先级请求
scheduler.submit(
payload={"task": "low_priority_task"},
priority=Priority.LOW,
callback=sample_callback
)
# 等待处理完成
time.sleep(2)
# 查看状态
print("\n=== Scheduler Status ===")
print(json.dumps(scheduler.get_status(), indent=2))
# 停止调度器
scheduler.stop()
print("\n示例运行完成")
+332
View File
@@ -0,0 +1,332 @@
#!/usr/bin/env python3
"""
BIZ-26 限流器测试脚本
测试场景:
1. 令牌桶限流功能
2. 优先级队列调度
3. 缓存管理器
4. 重试机制
5. 429 错误模拟
运行方式:
python3 scripts/test_rate_limiter.py
"""
import sys
import time
import threading
from datetime import datetime
# 添加脚本目录到路径
sys.path.insert(0, "/home/vincent/.openclaw/workspace/costcodev/EnterpriseArchitect/scripts")
from rate_limiter import (
TokenBucket,
CacheManager,
RequestScheduler,
Priority,
retry_with_backoff,
CoordinatedPoller,
is_nvidia_gateway,
)
def test_token_bucket():
"""测试令牌桶限流器"""
print("=" * 60)
print("测试 1: 令牌桶限流器")
print("=" * 60)
# 创建限流器:40 RPM = 0.67 令牌/秒
bucket = TokenBucket(rate=40/60, capacity=40)
print(f"\n初始状态:{bucket.get_status()}")
# 快速消费 10 个令牌
print("\n快速消费 10 个令牌...")
success_count = 0
for i in range(10):
if bucket.consume():
success_count += 1
print(f"成功消费:{success_count}/10")
print(f"消费后状态:{bucket.get_status()}")
# 测试等待获取令牌
print("\n测试等待获取令牌...")
start = time.time()
got_token = bucket.wait_for_token(timeout=2.0)
elapsed = time.time() - start
print(f"等待耗时:{elapsed:.3f}s, 获取成功:{got_token}")
print(f"等待后状态:{bucket.get_status()}")
print("\n✅ 令牌桶测试完成\n")
def test_cache_manager():
"""测试缓存管理器"""
print("=" * 60)
print("测试 2: 缓存管理器")
print("=" * 60)
cache = CacheManager()
# 测试 WorkBoard 缓存(TTL 5 分钟)
print("\n1. 设置 WorkBoard 缓存(TTL 5 分钟)")
cache.set("workboard", {"query": "status=todo"}, [{"id": "card1", "title": "Test"}])
# 立即读取
result = cache.get("workboard", {"query": "status=todo"})
print(f" 立即读取:{result is not None}")
# 测试配置缓存(TTL 1 小时)
print("\n2. 设置配置缓存(TTL 1 小时)")
cache.set("config", "agent_list", ["costcodev", "secretary", "coo"])
result = cache.get("config", "agent_list")
print(f" 读取配置:{result}")
# 测试缓存统计
print("\n3. 缓存统计")
stats = cache.get_stats()
print(f" 总条目数:{stats['total_entries']}")
print(f" 按类别:{stats['by_category']}")
# 测试缓存删除
print("\n4. 删除缓存")
deleted = cache.delete("workboard", {"query": "status=todo"})
print(f" 删除成功:{deleted}")
result = cache.get("workboard", {"query": "status=todo"})
print(f" 删除后读取:{result is None}")
print("\n✅ 缓存管理器测试完成\n")
def test_priority_queue():
"""测试优先级队列调度"""
print("=" * 60)
print("测试 3: 优先级队列调度(简化版,不启动工作线程)")
print("=" * 60)
scheduler = RequestScheduler(rate=40/60, capacity=40, enable_cache=True)
# 模拟请求处理结果
results = []
def record_result(data):
results.append((time.time(), data))
return data
# 提交不同优先级的请求(不启动工作线程,只测试队列)
print("\n提交请求(按顺序):")
scheduler.submit(
payload={"task": "normal_1"},
priority=Priority.NORMAL,
callback=record_result
)
print(" 1. 正常优先级:normal_1")
scheduler.submit(
payload={"task": "urgent_1"},
priority=Priority.URGENT,
callback=record_result
)
print(" 2. 紧急优先级:urgent_1")
scheduler.submit(
payload={"task": "low_1"},
priority=Priority.LOW,
callback=record_result
)
print(" 3. 低优先级:low_1")
scheduler.submit(
payload={"task": "high_1"},
priority=Priority.HIGH,
callback=record_result
)
print(" 4. 高优先级:high_1")
# 查看队列大小
print(f"\n队列大小:{scheduler.get_queue_size()}")
# 查看状态
status = scheduler.get_status()
print(f"初始令牌数:{status['token_bucket']['tokens']}")
print("\n✅ 优先级队列测试完成(仅提交,未处理)\n")
def test_retry_decorator():
"""测试重试装饰器"""
print("=" * 60)
print("测试 4: 重试装饰器")
print("=" * 60)
attempt_count = [0]
@retry_with_backoff(max_retries=3, base_delay=0.1, jitter=False)
def flaky_function():
attempt_count[0] += 1
if attempt_count[0] < 3:
raise Exception(f"模拟失败 (尝试 {attempt_count[0]})")
return f"成功 (尝试 {attempt_count[0]})"
print("\n调用易失败函数(前 2 次失败,第 3 次成功)...")
start = time.time()
result = flaky_function()
elapsed = time.time() - start
print(f"结果:{result}")
print(f"总尝试次数:{attempt_count[0]}")
print(f"总耗时:{elapsed:.3f}s")
print("\n✅ 重试装饰器测试完成\n")
def test_coordinated_poller():
"""测试统一轮询器"""
print("=" * 60)
print("测试 5: COO 统一轮询器(简化版,短间隔测试)")
print("=" * 60)
scheduler = RequestScheduler(rate=40/60, capacity=40)
poller = CoordinatedPoller(scheduler, poll_interval=2) # 2 秒轮询一次(测试用)
received_results = []
def on_poll_result(result):
received_results.append((datetime.now().strftime("%H:%M:%S"), result))
print(f" [{datetime.now().strftime('%H:%M:%S')}] 收到轮询结果")
poller.subscribe(on_poll_result)
print("\n启动轮询器(轮询间隔 2 秒,运行 5 秒后停止)...")
poller.start()
# 等待 5 秒
time.sleep(5)
poller.stop()
print(f"\n收到结果次数:{len(received_results)}")
for ts, result in received_results:
print(f" {ts}: {result['timestamp'][:19]}")
print("\n✅ 统一轮询器测试完成\n")
def test_rate_limit_stress():
"""压力测试:快速提交大量请求"""
print("=" * 60)
print("测试 6: 压力测试(40 RPM 限制下提交 50 个请求)")
print("=" * 60)
scheduler = RequestScheduler(rate=40/60, capacity=40, enable_cache=True)
scheduler.start()
completed = []
failed = []
lock = threading.Lock()
def callback(data):
with lock:
completed.append(data)
return data
print("\n快速提交 50 个请求...")
start_time = time.time()
for i in range(50):
priority = Priority.NORMAL if i % 10 != 0 else Priority.URGENT
scheduler.submit(
payload={"index": i, "provider": "nvidia"},
priority=priority,
callback=callback,
gateway="nvidia"
)
print("提交完成,等待处理...")
# 等待 10 秒
time.sleep(10)
elapsed = time.time() - start_time
# 查看统计
status = scheduler.get_status()
print(f"\n耗时:{elapsed:.2f}s")
print(f"队列大小:{status['queue_size']}")
print(f"已完成:{status['stats']['completed_requests']}")
print(f"失败:{status['stats']['failed_requests']}")
print(f"降级:{status['stats']['fallback_requests']}")
print(f"令牌桶状态:{status['token_bucket']}")
scheduler.stop()
print("\n✅ 压力测试完成\n")
def test_gateway_scope():
"""测试限流范围:只对 NVIDIA 网关生效"""
print("=" * 60)
print("测试 7: 网关范围识别(只限 NVIDIA)")
print("=" * 60)
assert is_nvidia_gateway("nvidia") is True
assert is_nvidia_gateway("nvidiavx18088980513/deepseek-ai/deepseek-v4-pro") is True
assert is_nvidia_gateway("volcengine-plan/ark-code-latest") is False
assert is_nvidia_gateway("siliconflow/Qwen/Qwen3") is False
assert is_nvidia_gateway("deepseek/deepseek-chat") is False
assert is_nvidia_gateway(None) is False
scheduler = RequestScheduler(rate=1/60, capacity=1, enable_cache=True)
# 先耗尽 NVIDIA 桶
scheduler.submit(payload={"provider": "nvidia", "i": 1}, priority=Priority.NORMAL, callback=lambda x: x, gateway="nvidia")
# 非 NVIDIA 请求应直接执行,不受桶状态影响
non_nv = {"provider": "volcengine-plan", "i": 2}
assert scheduler._should_rate_limit(type("R", (), {"gateway": "volcengine-plan", "model": None, "payload": non_nv})()) is False
print("✅ 网关范围识别测试完成:volcengine-plan/siliconflow/DeepSeek 不限流,NVIDIA 限流\n")
def main():
"""运行所有测试"""
print("\n")
print("" + "=" * 58 + "")
print("" + " " * 58 + "")
print("" + " BIZ-26 限流器测试套件".center(58) + "")
print("" + " API 请求优先级队列 + 令牌桶限流".center(58) + "")
print("" + " " * 58 + "")
print("" + "=" * 58 + "")
print()
try:
test_token_bucket()
test_cache_manager()
test_priority_queue()
test_retry_decorator()
test_coordinated_poller()
test_rate_limit_stress()
test_gateway_scope()
print("\n")
print("" + "=" * 58 + "")
print("" + " " * 58 + "")
print("" + " ✅ 所有测试完成".center(58) + "")
print("" + " " * 58 + "")
print("" + "=" * 58 + "")
print()
except KeyboardInterrupt:
print("\n\n⚠️ 测试被用户中断\n")
except Exception as e:
print(f"\n\n❌ 测试出错:{e}\n")
import traceback
traceback.print_exc()
sys.exit(1)
if __name__ == "__main__":
main()
-3
View File
@@ -1,3 +0,0 @@
__pycache__/
*.egg-info/
.mypy_cache/
-63
View File
@@ -1,63 +0,0 @@
# NVIDIA Sidecar 限流代理
为 NVIDIA API 提供**优先级排队 + 令牌桶限流**的透明代理层。
## 快速启动
```bash
pip install .
nvidia-sidecar
```
监听 `127.0.0.1:9190`,代理到 NVIDIA API。
## 环境变量
| 变量 | 默认值 | 说明 |
|------|--------|------|
| `SIDECAR_HOST` | `127.0.0.1` | 监听地址 |
| `SIDECAR_PORT` | `9190` | 监听端口 |
| `SIDECAR_METRICS_PORT` | `9191` | Metrics 端口 |
| `SIDECAR_UPSTREAM` | `https://integrate.api.nvidia.com/v1` | 上游 API 地址 |
| `SIDECAR_API_KEY` | — | NVIDIA API Key(必填) |
| `SIDECAR_RATE_RPM` | `40` | 每分钟请求数限制 |
| `SIDECAR_BUCKET_CAPACITY` | `40` | 令牌桶容量 |
| `SIDECAR_TIMEOUT` | `6000` | 上游请求超时(秒) |
| `SIDECAR_QUEUE_MAX` | `500` | 队列最大长度 |
| `SIDECAR_LOW_TIMEOUT` | `2.0` | 低优先级令牌等待超时(秒) |
| `SIDECAR_FALLBACK_PASSTHROUGH` | `true` | 队列满时是否直通上游 |
| `SIDECAR_LOG_LEVEL` | `INFO` | 日志级别 |
## YAML 配置
```yaml
listen_port: 9292
rate_rpm: 60
upstream_api_key: "nvapi-xxx"
```
```bash
nvidia-sidecar --config /etc/nvidia-sidecar.yaml
```
## API 端点
| 路径 | 方法 | 说明 |
|------|------|------|
| `/v1/chat/completions` | POST | OpenAI Chat Completions 代理 |
| `/v1/completions` | POST | OpenAI Completions 代理(legacy |
| `/v1/embeddings` | POST | OpenAI Embeddings 代理 |
| `/v1/models` | GET | 模型列表代理 |
| `/health` | GET | 健康检查 |
| `/metrics` | GET | 指标查询 |
## 架构
```
请求 → 网关识别 → [NVIDIA: 优先级排队 → 令牌桶限流] → httpx → NVIDIA API
→ [非 NVIDIA: 直通] → httpx → 上游
```
- **四级优先级**: URGENT > HIGH > NORMAL > LOW(通过 `X-Priority` header 指定)
- **队列满策略**: PASSTHROUGH(直通)/ REJECT503/ DROP_LOWEST(丢弃最低优先级)
- **令牌桶**: 40 RPM,线程安全,支持阻塞/非阻塞消费
-41
View File
@@ -1,41 +0,0 @@
"""
NVIDIA Sidecar 限流代理 — 核心代理模块。
为 OpenAI Chat Completions 兼容 API 提供四层防护:
1. 请求接收(FastAPI
2. 网关识别 → 非 NVIDIA 直通
3. 优先级排队 → 令牌桶限流
4. httpx 异步转发到 NVIDIA 上游
"""
from __future__ import annotations
from nvidia_sidecar.config import SidecarConfig, load_config
from nvidia_sidecar.rate_limiter import (
Priority,
TokenBucket,
is_nvidia_gateway,
normalize_gateway_name,
)
from nvidia_sidecar.priority_queue import (
PriorityQueueItem,
PriorityRequestQueue,
QueueFullError,
QueueFullPassthrough,
QueueFullPolicy,
)
__version__ = "0.1.0"
__all__ = [
"SidecarConfig",
"load_config",
"Priority",
"TokenBucket",
"is_nvidia_gateway",
"normalize_gateway_name",
"PriorityQueueItem",
"PriorityRequestQueue",
"QueueFullError",
"QueueFullPassthrough",
"QueueFullPolicy",
]
-216
View File
@@ -1,216 +0,0 @@
"""
NVIDIA Sidecar 限流代理 — 配置管理模块 (§3.1)
集中管理 Sidecar 运行参数,支持环境变量覆盖和 YAML 配置文件。
"""
from __future__ import annotations
import os
import warnings
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
@dataclass
class SidecarConfig:
"""Sidecar 运行配置数据类。
所有字段可通过环境变量覆盖,优先级:环境变量 > YAML 配置文件 > 默认值。
"""
# ---- 网络 ----
listen_host: str = field(
default="127.0.0.1",
metadata={"env": "SIDECAR_HOST"},
)
listen_port: int = field(
default=9190,
metadata={"env": "SIDECAR_PORT"},
)
metrics_port: int = field(
default=9191,
metadata={"env": "SIDECAR_METRICS_PORT"},
)
# ---- 上游 ----
upstream_url: str = field(
default="https://integrate.api.nvidia.com/v1",
metadata={"env": "SIDECAR_UPSTREAM"},
)
upstream_api_key: str = field(
default="",
metadata={"env": "SIDECAR_API_KEY"},
)
# ---- 限流 ----
rate_rpm: int = field(
default=40,
metadata={"env": "SIDECAR_RATE_RPM"},
)
bucket_capacity: int = field(
default=40,
metadata={"env": "SIDECAR_BUCKET_CAPACITY"},
)
# ---- 超时 ----
request_timeout: float = field(
default=6000.0,
metadata={"env": "SIDECAR_TIMEOUT"},
)
# ---- 队列 ----
queue_max_size: int = field(
default=500,
metadata={"env": "SIDECAR_QUEUE_MAX"},
)
low_priority_timeout: float = field(
default=2.0,
metadata={"env": "SIDECAR_LOW_TIMEOUT"},
)
# ---- 降级 ----
fallback_enabled_passthrough: bool = field(
default=True,
metadata={"env": "SIDECAR_FALLBACK_PASSTHROUGH"},
)
# ---- 日志 ----
log_level: str = field(
default="INFO",
metadata={"env": "SIDECAR_LOG_LEVEL"},
)
def _apply_env_overrides(config: SidecarConfig) -> SidecarConfig:
"""用环境变量覆盖配置字段。
遍历 SidecarConfig 的 dataclass fields,对每个声明了 ``metadata={"env": ...}``
的字段检查环境变量是否存在,存在则用对应类型转换后覆盖。
"""
import dataclasses as _dc
# 使用 typing.get_type_hints 解析 from __future__ import annotations
# 引入的字符串化类型注解 (PEP 563)
try:
resolved_types = __import__("typing").get_type_hints(type(config))
except Exception:
resolved_types = {}
for fld in _dc.fields(config):
env_key: str | None = fld.metadata.get("env")
if env_key is None:
continue
env_val = os.environ.get(env_key)
if env_val is None:
continue
target_type = resolved_types.get(fld.name, fld.type)
target_type_name: str = getattr(target_type, "__name__", str(target_type))
try:
if target_type is bool or target_type == "bool":
parsed: bool = env_val.strip().lower() in ("true", "1", "yes", "on")
setattr(config, fld.name, parsed)
elif target_type is int or target_type == "int":
setattr(config, fld.name, int(env_val))
elif target_type is float or target_type == "float":
setattr(config, fld.name, float(env_val))
else:
setattr(config, fld.name, env_val)
except (ValueError, TypeError) as exc:
warnings.warn(
f"无法将环境变量 {env_key}={env_val!r} 转换为 {target_type_name}: {exc}"
)
return config
def _validate_config(config: SidecarConfig) -> list[str]:
"""验证配置合理性,返回警告/问题列表。"""
issues: list[str] = []
# 端口冲突检查
if config.listen_port == config.metrics_port:
issues.append(
f"listen_port ({config.listen_port}) 与 metrics_port ({config.metrics_port}) 相同"
)
# rate_rpm 边界检查
if config.rate_rpm <= 0:
issues.append(
f"rate_rpm ({config.rate_rpm}) 无效,回退到默认值 40"
)
config.rate_rpm = 40
# queue_max_size 合理性
if config.queue_max_size <= 0:
issues.append(
f"queue_max_size ({config.queue_max_size}) 无效,回退到默认值 500"
)
config.queue_max_size = 500
# request_timeout 合理性
if config.request_timeout <= 0:
issues.append(
f"request_timeout ({config.request_timeout}) 无效,回退到默认值 6000"
)
config.request_timeout = 6000.0
return issues
def load_config(path: str | None = None) -> SidecarConfig:
"""加载 Sidecar 配置。
加载顺序(后者覆盖前者):
1. 默认值(SidecarConfig dataclass defaults
2. YAML 配置文件(如果 path 提供)
3. 环境变量覆盖
Args:
path: 可选 YAML 配置文件路径。为 None 时只使用默认值 + 环境变量。
Returns:
经过验证的 SidecarConfig 实例。
Raises:
FileNotFoundError: path 指定的文件不存在。
yaml.YAMLError: YAML 解析失败。
"""
config = SidecarConfig()
if path is not None:
import yaml
cfg_path = Path(path)
if not cfg_path.is_file():
raise FileNotFoundError(f"配置文件不存在: {cfg_path}")
try:
raw: dict[str, Any] = yaml.safe_load(cfg_path.read_text(encoding="utf-8")) or {}
except yaml.YAMLError as exc:
raise yaml.YAMLError(f"YAML 解析失败 ({cfg_path}): {exc}") from exc
# 覆盖已声明的字段
for fld_name in (
"listen_host", "listen_port", "metrics_port",
"upstream_url", "upstream_api_key",
"rate_rpm", "bucket_capacity",
"request_timeout",
"queue_max_size", "low_priority_timeout",
"fallback_enabled_passthrough",
"log_level",
):
if fld_name in raw:
setattr(config, fld_name, raw[fld_name])
# 环境变量覆盖(最高优先级)
config = _apply_env_overrides(config)
# 验证
issues = _validate_config(config)
for issue in issues:
warnings.warn(issue)
return config
-152
View File
@@ -1,152 +0,0 @@
"""
NVIDIA Sidecar 限流代理 — 健康检查端点 (§3.6)
提供 Kubernetes / systemd 兼容的健康检查:
GET /health — 存活检查
GET /health/ready — 就绪检查(含上游连通性)
"""
from __future__ import annotations
import asyncio
import time
from dataclasses import dataclass
from typing import Any
import httpx
@dataclass
class HealthService:
"""健康检查服务。
封装存活检查和就绪检查的逻辑,供 server.py 路由调用。
"""
start_time: float = 0.0
version: str = "0.1.0"
def __post_init__(self) -> None:
if self.start_time == 0.0:
self.start_time = time.time()
@property
def uptime_seconds(self) -> float:
"""服务运行时长(秒)。"""
return time.time() - self.start_time
async def check_upstream(
self,
upstream_url: str,
timeout: float = 5.0,
api_key: str = "",
) -> bool:
"""检查上游连通性。
Args:
upstream_url: NVIDIA API base URL。
timeout: 超时秒数。
api_key: 可选的 API Key 用于认证。
Returns:
True 上游可达。
"""
try:
headers: dict[str, str] = {}
if api_key:
headers["authorization"] = f"Bearer {api_key}"
async with httpx.AsyncClient(timeout=timeout) as client:
resp = await client.get(
f"{upstream_url.rstrip('/')}/v1/models",
headers=headers,
)
return resp.status_code < 500
except Exception:
return False
def check_queue_healthy(
self,
current_size: int,
max_size: int,
threshold_ratio: float = 0.9,
) -> bool:
"""检查队列是否健康(未接近满载)。
Args:
current_size: 当前队列长度。
max_size: 队列最大容量。
threshold_ratio: 告警阈值比例,默认 0.9。
Returns:
True 队列健康。
"""
if max_size <= 0:
return True
return current_size < max_size * threshold_ratio
def check_token_bucket_healthy(
self,
available_tokens: float,
capacity: int,
threshold: float = 0.05,
) -> bool:
"""检查令牌桶是否健康(token 未耗尽)。
Args:
available_tokens: 当前可用令牌数。
capacity: 桶容量。
threshold: 令牌数低于此比例视为不健康。
Returns:
True 令牌桶健康。
"""
if capacity <= 0:
return False
return available_tokens > capacity * threshold
def liveness(self) -> dict[str, Any]:
"""存活检查响应。
Returns:
liveness JSON payload。
"""
return {
"status": "ok",
"uptime": round(self.uptime_seconds, 1),
"version": self.version,
}
async def readiness(
self,
upstream_url: str,
upstream_api_key: str = "",
queue_current_size: int = 0,
queue_max_size: int = 500,
available_tokens: float = 0.0,
bucket_capacity: int = 40,
) -> dict[str, Any]:
"""就绪检查响应。
Args:
upstream_url: 上游 API 地址。
upstream_api_key: API Key。
queue_current_size: 当前队列长度。
queue_max_size: 队列最大容量。
available_tokens: 当前令牌数。
bucket_capacity: 桶容量。
Returns:
readiness JSON payload。
"""
upstream_ok = await self.check_upstream(upstream_url, api_key=upstream_api_key)
queue_ok = self.check_queue_healthy(queue_current_size, queue_max_size)
token_ok = self.check_token_bucket_healthy(available_tokens, bucket_capacity)
all_ready = upstream_ok and queue_ok and token_ok
return {
"ready": all_ready,
"upstream_reachable": upstream_ok,
"queue_healthy": queue_ok,
"token_bucket_healthy": token_ok,
}
-272
View File
@@ -1,272 +0,0 @@
"""
NVIDIA Sidecar 限流代理 — Prometheus 指标端点 (§3.5)
10 个指标,独立端口 :9191,与代理端口 :9190 分离。
"""
from __future__ import annotations
import time
import threading
from typing import Any
from prometheus_client import (
CollectorRegistry,
Counter,
Gauge,
Histogram,
generate_latest,
make_asgi_app,
)
class PrometheusMetrics:
"""Sidecar Prometheus 指标收集器。
线程安全,所有公开方法通过 ``threading.Lock`` 保护。
"""
def __init__(self, registry: CollectorRegistry | None = None) -> None:
"""初始化所有 10 个 Prometheus 指标。
Args:
registry: 可选自定义 RegistryNone 则使用默认全局 registry。
"""
self._registry: CollectorRegistry = registry or CollectorRegistry()
self._lock: threading.Lock = threading.Lock()
self._start_time: float = time.time()
# ---- 1. 总请求数(按优先级 + 状态分组) ----
self.requests_total: Counter = Counter(
"sidecar_requests_total",
"Total requests processed by priority and status",
labelnames=["priority", "status"],
registry=self._registry,
)
# ---- 2. 可用令牌数 ----
self.tokens_available: Gauge = Gauge(
"sidecar_tokens_available",
"Current number of available tokens",
registry=self._registry,
)
# ---- 3. 令牌生成速率 ----
self.tokens_rate: Gauge = Gauge(
"sidecar_tokens_rate",
"Current token generation rate (tokens per minute)",
registry=self._registry,
)
# ---- 4. 各优先级队列深度 ----
self.queue_depth: Gauge = Gauge(
"sidecar_queue_depth",
"Queue depth by priority",
labelnames=["priority"],
registry=self._registry,
)
# ---- 5. 队列等待时间 Histogram ----
self.queue_latency_seconds: Histogram = Histogram(
"sidecar_queue_latency_seconds",
"Request wait time in queue in seconds",
labelnames=["priority"],
buckets=(0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0),
registry=self._registry,
)
# ---- 6. 上游响应延迟 Histogram ----
self.upstream_latency_seconds: Histogram = Histogram(
"sidecar_upstream_latency_seconds",
"Upstream response latency in seconds",
labelnames=["model_id"],
buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0, 300.0, 600.0),
registry=self._registry,
)
# ---- 7. 上游错误计数 ----
self.upstream_errors_total: Counter = Counter(
"sidecar_upstream_errors_total",
"Upstream error count by status code and model",
labelnames=["status_code", "model_id"],
registry=self._registry,
)
# ---- 8. 降级直通次数 ----
self.fallback_passthrough_total: Counter = Counter(
"sidecar_fallback_passthrough_total",
"Total fallback / passthrough events (queue full or sidecar unavailable)",
registry=self._registry,
)
# ---- 9. 健康状态 ----
self.health_status: Gauge = Gauge(
"sidecar_health_status",
"Sidecar health: 0=unhealthy, 1=healthy",
registry=self._registry,
)
# ---- 10. 运行时长 ----
self.uptime_seconds: Gauge = Gauge(
"sidecar_uptime_seconds",
"Process uptime in seconds",
registry=self._registry,
)
# 避退模式指标(附加,不计入基础 10 个)
self.retreat_state: Gauge = Gauge(
"sidecar_retreat_state",
"Adaptive retreat state: 0=NORMAL, 1=RETREAT, 2=RECOVER",
registry=self._registry,
)
self.effective_rate_rpm: Gauge = Gauge(
"sidecar_effective_rate_rpm",
"Current effective rate in RPM (after retreat adjustments)",
registry=self._registry,
)
self.upstream_429_rate: Gauge = Gauge(
"sidecar_upstream_429_rate",
"Upstream 429 rate over the retreat observation window (0.0-1.0)",
registry=self._registry,
)
# 初始化
self.health_status.set(1)
# ---- ASGI app 生成 ----
def build_asgi_app(self) -> Any:
"""生成 Prometheus ASGI 应用,挂载到独立端口。
Returns:
可传给 uvicorn 的 ASGI app。
"""
return make_asgi_app(registry=self._registry)
# ---- 指标记录方法 ----
def record_request(self, priority: str, status: str) -> None:
"""记录一次请求。
Args:
priority: 优先级名(URGENT / HIGH / NORMAL / LOW)。
status: 状态(success / ratelimited / error)。
"""
with self._lock:
self.requests_total.labels(priority=priority, status=status).inc()
def record_queue_latency(self, priority: str, seconds: float) -> None:
"""记录排队延迟。
Args:
priority: 优先级名。
seconds: 排队等待秒数。
"""
with self._lock:
self.queue_latency_seconds.labels(priority=priority).observe(seconds)
def record_upstream(self, status_code: int, model_id: str) -> None:
"""记录上游响应。
Args:
status_code: HTTP 状态码。
model_id: 模型标识符。
"""
with self._lock:
self.upstream_latency_seconds.labels(model_id=model_id).observe(0.0)
def record_upstream_error(self, status_code: int, model_id: str) -> None:
"""记录上游错误。
Args:
status_code: 错误 HTTP 状态码。
model_id: 模型标识符。
"""
with self._lock:
self.upstream_errors_total.labels(
status_code=str(status_code), model_id=model_id
).inc()
def record_upstream_latency(self, model_id: str, seconds: float) -> None:
"""记录上游响应延迟。
Args:
model_id: 模型标识符。
seconds: 响应延迟秒数。
"""
with self._lock:
self.upstream_latency_seconds.labels(model_id=model_id).observe(seconds)
def update_token_status(self, tokens: float, rate_per_minute: float) -> None:
"""更新令牌桶状态。
Args:
tokens: 当前可用令牌数。
rate_per_minute: 每分钟速率。
"""
with self._lock:
self.tokens_available.set(tokens)
self.tokens_rate.set(rate_per_minute)
def update_queue_depth(self, depths: dict[str, int]) -> None:
"""更新各优先级队列深度。
Args:
depths: {priority_name: count} 映射。
"""
with self._lock:
# 先清零所有已知标签再设置,避免残留旧值
for pri in ("URGENT", "HIGH", "NORMAL", "LOW"):
self.queue_depth.labels(priority=pri).set(depths.get(pri, 0))
def increment_fallback(self) -> None:
"""降级直通计数 +1。"""
with self._lock:
self.fallback_passthrough_total.inc()
def set_health(self, healthy: bool) -> None:
"""设置健康状态。
Args:
healthy: True=健康, False=不健康。
"""
with self._lock:
self.health_status.set(1 if healthy else 0)
def update_uptime(self) -> None:
"""更新运行时长。"""
with self._lock:
self.uptime_seconds.set(time.time() - self._start_time)
# ---- 避退模式指标 ----
def update_retreat_metrics(
self,
retreat_state: str,
effective_rate_rpm: float,
upstream_429_rate: float,
) -> None:
"""更新避退模式指标。
Args:
retreat_state: "normal" / "retreat" / "recover".
effective_rate_rpm: 当前实际速率 (RPM)。
upstream_429_rate: 上游 429 率 (0.0-1.0)。
"""
state_map: dict[str, int] = {"normal": 0, "retreat": 1, "recover": 2}
with self._lock:
self.retreat_state.set(state_map.get(retreat_state, 0))
self.effective_rate_rpm.set(effective_rate_rpm)
self.upstream_429_rate.set(upstream_429_rate)
# ---- 导出 ----
def generate_latest(self) -> bytes:
"""生成 Prometheus 文本格式的指标数据。
Returns:
Prometheus 文本格式 bytes。
"""
with self._lock:
self.update_uptime()
return generate_latest(self._registry)
-226
View File
@@ -1,226 +0,0 @@
"""
NVIDIA Sidecar 限流代理 — 四级优先级请求队列模块 (§3.3)
管理待处理的 NVIDIA API 请求,按优先级 + FIFO 出队。
支持三种队列满策略:PASSTHROUGH / REJECT / DROP_LOWEST。
"""
from __future__ import annotations
import asyncio
import heapq
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Any
from nvidia_sidecar.rate_limiter import Priority
# ---------------------------------------------------------------------------
# 队列满策略
# ---------------------------------------------------------------------------
class QueueFullPolicy(str, Enum):
"""队列满时的处理策略。"""
PASSTHROUGH = "passthrough" # 直通上游,绕过排队(fail-open 子策略)
REJECT = "reject" # 返回 503 Service Unavailable
DROP_LOWEST = "drop_lowest" # 丢弃队列中最低优先级元素,插入新请求
# ---------------------------------------------------------------------------
# 队列元素
# ---------------------------------------------------------------------------
@dataclass(order=True)
class PriorityQueueItem:
"""优先级队列元素。
``sort_index`` 由 ``(priority, timestamp)`` 组成,
Python 的 ``__lt__`` 按字段顺序比较:先比 priority,再比 timestamp。
数值越小越优先(URGENT=1 优于 HIGH=2)。
"""
sort_index: tuple[int, float] = field(compare=True)
priority: Priority = field(compare=False)
request_id: str = field(compare=False)
payload: dict[str, Any] = field(compare=False)
enqueued_at: float = field(compare=False)
headers: dict[str, str] = field(default_factory=dict, compare=False)
# ---------------------------------------------------------------------------
# 优先级请求队列
# ---------------------------------------------------------------------------
class QueueFullError(Exception):
"""队列已满且策略为 REJECT 时抛出。"""
pass
class QueueFullPassthrough(Exception):
"""队列已满且策略为 PASSTHROUGH 时抛出,由调用方绕过队列直通上游。"""
pass
class PriorityRequestQueue:
"""异步线程安全的四级优先级请求队列。
内部使用 ``asyncio.Lock`` 保护并发操作,
基于 ``heapq`` + ``asyncio.Event`` 实现阻塞出队。
"""
def __init__(self, max_size: int = 500) -> None:
"""初始化优先级队列。
Args:
max_size: 队列最大容量。
Raises:
ValueError: max_size <= 0。
"""
if max_size <= 0:
raise ValueError(f"max_size 必须为正整数,当前值: {max_size}")
self.max_size: int = max_size
self._heap: list[PriorityQueueItem] = []
self._lock: asyncio.Lock = asyncio.Lock()
self._not_empty: asyncio.Event = asyncio.Event()
self._full_policy: QueueFullPolicy = QueueFullPolicy.PASSTHROUGH
# 统计
self._total_enqueued: int = 0
self._total_dequeued: int = 0
self._total_dropped: int = 0
# ---- 队列满策略 ----
def set_full_policy(self, policy: QueueFullPolicy) -> None:
"""设置队列满时的处理策略。
Args:
policy: QueueFullPolicy 枚举值。
"""
self._full_policy = policy
@property
def full_policy(self) -> QueueFullPolicy:
"""当前队列满策略。"""
return self._full_policy
# ---- 入队 ----
async def put(
self,
item: dict[str, Any],
priority: Priority = Priority.NORMAL,
headers: dict[str, str] | None = None,
) -> str:
"""将请求放入队列。
Args:
item: 请求体(JSON 序列化的 dict)。
priority: 请求优先级,默认 NORMAL。
headers: 原始请求 headers。
Returns:
分配的唯一 request_id。
Raises:
QueueFullError: 队列满且策略为 REJECT。
"""
request_id = str(uuid.uuid4())
headers = headers or {}
queue_item = PriorityQueueItem(
sort_index=(int(priority), time.monotonic()),
priority=priority,
request_id=request_id,
payload=item,
enqueued_at=time.monotonic(),
headers=headers,
)
async with self._lock:
queue_size = len(self._heap)
if queue_size >= self.max_size:
if self._full_policy == QueueFullPolicy.REJECT:
raise QueueFullError(
f"队列已满 ({queue_size}/{self.max_size}),策略: reject"
)
elif self._full_policy == QueueFullPolicy.DROP_LOWEST:
# 丢弃 heap 中优先级最低(值最大)的元素
# heap 是最小堆,找最大值需要遍历
max_val_item = max(self._heap, key=lambda x: x.sort_index)
self._heap.remove(max_val_item)
heapq.heapify(self._heap)
self._total_dropped += 1
# PASSTHROUGH 策略:不插入队列,抛异常让调用方绕过排队
else:
raise QueueFullPassthrough(
f"队列已满 ({queue_size}/{self.max_size}),策略: passthrough"
)
heapq.heappush(self._heap, queue_item)
self._total_enqueued += 1
self._not_empty.set()
return request_id
# ---- 出队 ----
async def get(self, timeout: float = 1.0) -> PriorityQueueItem | None:
"""从队列取出下一个元素(阻塞、优先级排序)。
Args:
timeout: 阻塞等待的最大秒数,默认 1.0。
Returns:
优先级最高的队列元素;超时无元素时返回 None。
"""
deadline = time.monotonic() + timeout
while True:
async with self._lock:
if self._heap:
item = heapq.heappop(self._heap)
self._total_dequeued += 1
if not self._heap:
self._not_empty.clear()
return item
# 队列为空,等待新元素入队
remaining = deadline - time.monotonic()
if remaining <= 0:
return None
try:
await asyncio.wait_for(
self._not_empty.wait(),
timeout=remaining,
)
except asyncio.TimeoutError:
return None
# ---- 状态查询 ----
async def get_queue_size(self) -> int:
"""返回当前队列长度。"""
async with self._lock:
return len(self._heap)
async def get_stats(self) -> dict[str, Any]:
"""返回队列统计信息。"""
async with self._lock:
depth_by_priority: dict[str, int] = {}
for item in self._heap:
key = item.priority.name
depth_by_priority[key] = depth_by_priority.get(key, 0) + 1
return {
"max_size": self.max_size,
"current_size": len(self._heap),
"total_enqueued": self._total_enqueued,
"total_dequeued": self._total_dequeued,
"total_dropped": self._total_dropped,
"depth_by_priority": depth_by_priority,
"full_policy": self._full_policy.value,
"utilization": len(self._heap) / self.max_size if self.max_size > 0 else 0.0,
}
-48
View File
@@ -1,48 +0,0 @@
[project]
name = "nvidia_sidecar"
version = "0.1.0"
description = "NVIDIA Sidecar 限流代理 — 为 NVIDIA API 提供优先级排队 + 令牌桶限流"
readme = "README.md"
license = { text = "MIT" }
requires-python = ">=3.12"
dependencies = [
"fastapi>=0.115",
"uvicorn[standard]>=0.34",
"httpx>=0.28",
"PyYAML>=6.0",
"structlog>=24.4",
"prometheus-client>=0.21",
"pydantic>=2.0",
]
[project.optional-dependencies]
dev = [
"pytest>=8.3",
"pytest-asyncio>=0.24",
"httpx>=0.28",
"mypy>=1.14",
"types-PyYAML",
]
[project.scripts]
nvidia-sidecar = "nvidia_sidecar.server:main"
[build-system]
requires = ["setuptools>=75", "wheel"]
build-backend = "setuptools.build_meta"
[tool.setuptools]
packages = ["nvidia_sidecar"]
[tool.setuptools.package-dir]
# Flat layout: __init__.py + all .py files at project root
"nvidia_sidecar" = "."
[tool.mypy]
python_version = "3.12"
strict = true
warn_return_any = true
warn_unused_configs = true
[[tool.mypy.overrides]]
module = "structlog.*"
ignore_missing_imports = true
-438
View File
@@ -1,438 +0,0 @@
"""
NVIDIA Sidecar 限流代理 — 令牌桶 + 网关识别模块 (§3.2)
从 BIZ-26 rate_limiter.py 提取核心限流逻辑,去除多线程调度器、缓存管理等。
保留:Priority, TokenBucket, is_nvidia_gateway, normalize_gateway_name。
"""
from __future__ import annotations
import time
import threading
from enum import IntEnum
from typing import Any
# ---------------------------------------------------------------------------
# 优先级枚举
# ---------------------------------------------------------------------------
class Priority(IntEnum):
"""请求优先级(数值越小优先级越高)。"""
URGENT = 1
HIGH = 2
NORMAL = 3
LOW = 4
# ---------------------------------------------------------------------------
# NVIDIA 网关别名集
# ---------------------------------------------------------------------------
NVIDIA_GATEWAY_ALIASES: set[str] = {
"nvidia",
"nvidia-gateway",
"nvidiavx",
"nvidiavx18088980513",
}
def is_nvidia_gateway(value: str | None) -> bool:
"""判断给定网关名/模型全路径是否属于 NVIDIA 网关。
Args:
value: 网关名(如 ``"nvidia"``)或模型全路径前缀
(如 ``"nvidia/deepseek-ai/deepseek-v4-pro"``)。
None 时直接返回 False。
Returns:
True 当 value 的 provider 部分匹配已知 NVIDIA 别名。
"""
if value is None:
return False
# 提取 provider 前缀:取 "/" 前第一个部分
provider = value.split("/", 1)[0].lower().strip()
return provider in NVIDIA_GATEWAY_ALIASES
def normalize_gateway_name(value: str | None) -> str | None:
"""规范化网关名:提取 provider 前缀并转为小写。
Args:
value: 网关名或模型全路径。None 时返回 None。
Returns:
provider 前缀的小写形式,或 None。
"""
if value is None:
return None
return value.split("/", 1)[0].lower().strip()
# ---------------------------------------------------------------------------
# 令牌桶(线程安全)
# ---------------------------------------------------------------------------
class TokenBucket:
"""线程安全的令牌桶实现。
支持固定速率令牌补充和消费,带有溢出保护和可选的阻塞等待。
"""
def __init__(self, rate: float = 40 / 60, capacity: int = 40) -> None:
"""初始化令牌桶。
Args:
rate: 令牌补充速率(令牌/秒)。默认 40/60 ≈ 0.667 token/s40 RPM)。
capacity: 桶最大容量(令牌数)。默认 40。
"""
self._rate: float = float(rate)
self._capacity: int = int(capacity)
self._tokens: float = float(capacity) # 启动时桶满
self._last_refill: float = time.monotonic()
self._lock: threading.Lock = threading.Lock()
# ---- 内部方法 ----
def _refill(self) -> None:
"""补充令牌(调用方需持有 _lock)。
根据距上次补充的时间差计算新增令牌数,不超过 capacity。
"""
now = time.monotonic()
elapsed = now - self._last_refill
if elapsed > 0 and self._rate > 0:
new_tokens = elapsed * self._rate
self._tokens = min(self._tokens + new_tokens, float(self._capacity))
self._last_refill = now
# ---- 公开方法 ----
def consume(self, tokens: int = 1) -> bool:
"""尝试立即消费令牌(非阻塞)。
Args:
tokens: 要消费的令牌数,默认 1。
Returns:
True 消费成功;False 令牌不足。
"""
if tokens <= 0:
return True
with self._lock:
self._refill()
if self._tokens >= tokens:
self._tokens -= tokens
return True
return False
def try_consume(self, tokens: int = 1, timeout: float = 2.0) -> bool:
"""尝试在指定时间内消费令牌(阻塞)。
Args:
tokens: 要消费的令牌数,默认 1。
timeout: 最大等待秒数,默认 2.0。
Returns:
True 在超时前成功消费;False 超时。
"""
if tokens <= 0:
return True
deadline = time.monotonic() + timeout
while True:
with self._lock:
self._refill()
if self._tokens >= tokens:
self._tokens -= tokens
return True
# 释放锁后计算剩余等待时间
remaining = deadline - time.monotonic()
if remaining <= 0:
return False
# 等待到下一个令牌应该补充的时间点
sleep_time = min(remaining, max(0.05, 1.0 / self._rate) if self._rate > 0 else remaining)
time.sleep(sleep_time)
def wait_for_token(self, timeout: float | None = None) -> bool:
"""等待并尝试消费 1 个令牌。
Args:
timeout: 最大等待秒数;None 表示无限等待(不推荐)。
Returns:
True 成功消费;False 超时。
"""
return self.try_consume(tokens=1, timeout=timeout if timeout is not None else float("inf"))
def get_status(self) -> dict[str, Any]:
"""获取令牌桶当前状态。
Returns:
包含 tokens, capacity, rate_per_minute, utilization 的字典。
"""
with self._lock:
self._refill()
rate_per_minute = self._rate * 60.0
utilization = 0.0 if self._capacity == 0 else (
(self._capacity - self._tokens) / self._capacity
)
return {
"tokens": round(self._tokens, 2),
"capacity": self._capacity,
"rate_per_minute": round(rate_per_minute, 1),
"utilization": round(utilization, 4),
}
# ---- 属性 ----
@property
def rate(self) -> float:
"""当前令牌补充速率(令牌/秒)。"""
return self._rate
@property
def capacity(self) -> int:
"""桶容量。"""
return self._capacity
# ---- 动态速率调整(供 AdaptiveTokenBucket 使用) ----
def set_rate(self, rate: float) -> None:
"""动态调整令牌补充速率(令牌/秒)。
Args:
rate: 新速率(令牌/秒)。
"""
with self._lock:
self._refill() # 先补充现有令牌再切换速率
self._rate = float(rate)
# ---------------------------------------------------------------------------
# 避退模式:AdaptiveTokenBucket (§ADR-009)
# ---------------------------------------------------------------------------
class RetreatState:
"""避退状态机常量。"""
NORMAL: str = "normal"
RETREAT: str = "retreat"
RECOVER: str = "recover"
class AdaptiveTokenBucket(TokenBucket):
"""自适应避退令牌桶(ADR-009)。
监控上游 429 率(60s 滑动窗口),自动调整发射速率:
- 429 率 < 5% → NORMAL,保持基准速率
- 429 率 5-10% → RETREAT,速率 × 0.75
- 429 率 10-20% → RETREAT,再次降速
- 429 率 > 20% → RETREAT,最低 5 RPM + 告警
- 连续 120s 429 率 < 2% → RECOVER,逐步 +2 RPM 恢复
线程安全,继承 TokenBucket 的所有公共接口。
"""
# ADR-009 参数(可通过构造函数覆盖)
RETREAT_WINDOW_SECONDS: float = 60.0
RETREAT_429_THRESHOLD: float = 0.05
RETREAT_FACTOR: float = 0.75
RETREAT_MIN_RPM: float = 5.0
RECOVER_WINDOW_SECONDS: float = 120.0
RECOVER_429_THRESHOLD: float = 0.02
RECOVER_INCREMENT_RPM: float = 2.0
def __init__(
self,
rate: float = 40 / 60,
capacity: int = 40,
*,
retreat_window_seconds: float = 60.0,
retreat_429_threshold: float = 0.05,
retreat_factor: float = 0.75,
retreat_min_rpm: float = 5.0,
recover_window_seconds: float = 120.0,
recover_429_threshold: float = 0.02,
recover_increment_rpm: float = 2.0,
) -> None:
"""初始化自适应避退令牌桶。
Args:
rate: 基准令牌补充速率(令牌/秒)。默认 40/60 ≈ 0.667 token/s。
capacity: 桶最大容量。默认 40。
retreat_window_seconds: 429 率滑动窗口大小(秒)。
retreat_429_threshold: 触发避退的 429 率阈值。
retreat_factor: 每次避退速率乘数。
retreat_min_rpm: 避退最低 RPM。
recover_window_seconds: 恢复观察窗口大小(秒)。
recover_429_threshold: 触发恢复的 429 率阈值。
recover_increment_rpm: 每次恢复增加的 RPM。
"""
super().__init__(rate=rate, capacity=capacity)
# 基准速率(不变)
self._base_rate: float = float(rate)
# 避退参数
self.RETREAT_WINDOW_SECONDS = retreat_window_seconds
self.RETREAT_429_THRESHOLD = retreat_429_threshold
self.RETREAT_FACTOR = retreat_factor
self.RETREAT_MIN_RPM = retreat_min_rpm
self.RECOVER_WINDOW_SECONDS = recover_window_seconds
self.RECOVER_429_THRESHOLD = recover_429_threshold
self.RECOVER_INCREMENT_RPM = recover_increment_rpm
# 避退状态机
self._retreat_state: str = RetreatState.NORMAL
# 429 滑动窗口:[(timestamp, is_429), ...]
self._429_window: list[tuple[float, bool]] = []
# 上次状态变更时间
self._last_state_change: float = time.monotonic()
# 避退状态锁
self._retreat_lock: threading.Lock = threading.Lock()
# ---- 429 反馈 ----
def record_response(self, is_429: bool) -> None:
"""记录一次上游响应是否为 429。
Args:
is_429: True 表示上游返回了 429。
"""
now = time.monotonic()
with self._retreat_lock:
self._429_window.append((now, is_429))
# 清理超出观察窗口的旧记录
cutoff = now - max(
self.RETREAT_WINDOW_SECONDS,
self.RECOVER_WINDOW_SECONDS,
)
self._429_window = [
(ts, flag) for ts, flag in self._429_window
if ts >= cutoff
]
def get_429_rate(self, window_seconds: float | None = None) -> float:
"""获取指定窗口内的 429 率。
Args:
window_seconds: 滑动窗口大小;None 使用 RETREAT_WINDOW_SECONDS。
Returns:
0.0-1.0 之间的 429 率。
"""
ws = window_seconds or self.RETREAT_WINDOW_SECONDS
now = time.monotonic()
with self._retreat_lock:
in_window = [flag for ts, flag in self._429_window if now - ts <= ws]
if not in_window:
return 0.0
return sum(1 for f in in_window if f) / len(in_window)
# ---- 避退状态评估 ----
def evaluate_retreat(self) -> str:
"""评估并更新避退状态,返回新状态名。
每次调用根据当前 429 率 + 持续时间决定是否进入 RETREAT / RECOVER。
Returns:
"normal" / "retreat" / "recover"
"""
now = time.monotonic()
with self._retreat_lock:
retreat_rate = self.get_429_rate(self.RETREAT_WINDOW_SECONDS)
recover_rate = self.get_429_rate(self.RECOVER_WINDOW_SECONDS)
if self._retreat_state == RetreatState.NORMAL:
if retreat_rate >= self.RETREAT_429_THRESHOLD:
self._retreat_state = RetreatState.RETREAT
self._last_state_change = now
self._apply_retreat()
elif self._retreat_state == RetreatState.RETREAT:
# 持续高 429 率 → 再次降速
if retreat_rate >= self.RETREAT_429_THRESHOLD * 2:
# 429 > 10%,再次降速
if self._rate > self.RETREAT_MIN_RPM / 60.0:
self._apply_retreat()
elif recover_rate < self.RECOVER_429_THRESHOLD:
time_in_low = now - self._last_state_change
if time_in_low >= self.RECOVER_WINDOW_SECONDS:
self._retreat_state = RetreatState.RECOVER
self._last_state_change = now
self._apply_recover()
elif self._retreat_state == RetreatState.RECOVER:
if retreat_rate >= self.RETREAT_429_THRESHOLD:
# 恢复期间 429 回升,重新进入避退
self._retreat_state = RetreatState.RETREAT
self._last_state_change = now
self._apply_retreat()
elif self._rate >= self._base_rate:
# 已恢复到基准速率
self._rate = self._base_rate
self._retreat_state = RetreatState.NORMAL
self._last_state_change = now
else:
# 继续逐步恢复
self._apply_recover()
return self._retreat_state
def _apply_retreat(self) -> None:
"""执行一次避退降速。"""
new_rate: float = max(
self.RETREAT_MIN_RPM / 60.0,
self._rate * self.RETREAT_FACTOR,
)
self._rate = new_rate
def _apply_recover(self) -> None:
"""执行一次恢复提速。"""
increment: float = self.RECOVER_INCREMENT_RPM / 60.0
new_rate: float = min(self._base_rate, self._rate + increment)
self._rate = new_rate
# ---- 状态查询 ----
def get_retreat_state(self) -> str:
"""获取当前避退状态。
Returns:
"normal" / "retreat" / "recover"
"""
with self._retreat_lock:
return self._retreat_state
def get_effective_rate_rpm(self) -> float:
"""获取当前实际速率(RPM),考虑避退乘数。
Returns:
当前每分钟速率。
"""
with self._lock:
return self._rate * 60.0
def get_base_rate_rpm(self) -> float:
"""获取基准速率(RPM),即未避退时的速率。
Returns:
基准每分钟速率。
"""
return self._base_rate * 60.0
def reset_to_base(self) -> None:
"""手动重置到基准速率(用于运维干预)。"""
with self._retreat_lock:
self._rate = self._base_rate
self._retreat_state = RetreatState.NORMAL
self._last_state_change = time.monotonic()
self._429_window.clear()
-785
View File
@@ -1,785 +0,0 @@
"""
NVIDIA Sidecar 限流代理 — FastAPI 代理主入口 (§3.4)
完整的 API 代理链路:
接收 → 网关识别 → [NVIDIA: 排队 → 令牌限流] → httpx 转发 → 返回
非 NVIDIA 请求直通上游,NVIDIA 请求经过四级优先级队列 + 令牌桶限流。
"""
from __future__ import annotations
import asyncio
import logging
import time
from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager
from typing import Any
import httpx
import structlog
import uvicorn
from fastapi import FastAPI, Request, Response
from fastapi.responses import JSONResponse, StreamingResponse
from nvidia_sidecar.config import load_config, SidecarConfig
from nvidia_sidecar.rate_limiter import (
Priority,
AdaptiveTokenBucket,
is_nvidia_gateway,
)
from nvidia_sidecar.priority_queue import (
PriorityRequestQueue,
QueueFullError,
QueueFullPassthrough,
QueueFullPolicy,
)
from nvidia_sidecar.metrics import PrometheusMetrics
from nvidia_sidecar.health import HealthService
from nvidia_sidecar.webui import webui_router
# ---------------------------------------------------------------------------
# 结构化日志
# ---------------------------------------------------------------------------
structlog.configure(
processors=[
structlog.stdlib.filter_by_level,
structlog.stdlib.add_logger_name,
structlog.stdlib.add_log_level,
structlog.stdlib.PositionalArgumentsFormatter(),
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.StackInfoRenderer(),
structlog.processors.format_exc_info,
structlog.processors.UnicodeDecoder(),
# 生产环境推荐 JSONRenderer,开发环境可用 ConsoleRenderer
structlog.dev.ConsoleRenderer(),
],
context_class=dict,
logger_factory=structlog.PrintLoggerFactory(),
wrapper_class=structlog.stdlib.BoundLogger,
cache_logger_on_first_use=True,
)
logger: structlog.stdlib.BoundLogger = structlog.get_logger("nvidia_sidecar")
# ---------------------------------------------------------------------------
# 全局状态(通过 lifespan 初始化,模块级引用方便路由访问)
# ---------------------------------------------------------------------------
_config: SidecarConfig
_http_client: httpx.AsyncClient
_priority_queue: PriorityRequestQueue
_token_bucket: AdaptiveTokenBucket
_prometheus: PrometheusMetrics
_health_service: HealthService
_pending_requests: dict[str, tuple[asyncio.Future[httpx.Response], float]]
"""request_id → (response future, enqueued_at) 的映射。"""
_metrics_task: asyncio.Task[None] | None = None
# 统计计数器
_stats: dict[str, int] = {
"total_requests": 0,
"nvidia_requests": 0,
"passthrough_requests": 0,
"ratelimited_requests": 0,
"queue_full_rejects": 0,
"upstream_errors": 0,
"start_time": 0,
}
# ---------------------------------------------------------------------------
# 工具函数
# ---------------------------------------------------------------------------
def _extract_model(body: dict[str, Any]) -> str | None:
"""从请求体中提取模型标识符(兼容 OpenAI Chat/Completions 格式)。
Args:
body: 已解析的 JSON 请求体。
Returns:
模型标识符字符串,或 None。
"""
if isinstance(body, dict):
return str(body.get("model", "")) or None
return None
def _resolve_priority(headers: dict[str, str]) -> Priority:
"""从请求 headers 解析优先级。
检查 ``X-Priority`` header,值为 ``urgent``/``high``/``normal``/``low``
不区分大小写。默认 NORMAL。
"""
raw = headers.get("x-priority", "").strip().lower()
mapping: dict[str, Priority] = {
"urgent": Priority.URGENT,
"high": Priority.HIGH,
"normal": Priority.NORMAL,
"low": Priority.LOW,
}
return mapping.get(raw, Priority.NORMAL)
# ---------------------------------------------------------------------------
# 上游转发
# ---------------------------------------------------------------------------
async def _forward_to_upstream(
method: str,
path: str,
body: bytes | None,
headers: dict[str, str],
stream: bool = False,
) -> httpx.Response:
"""将请求转发到 NVIDIA 上游 API。
Args:
method: HTTP 方法。
path: 请求路径(如 ``/v1/chat/completions``)。
body: 原始请求体 bytes。
headers: 要转发的请求 headers(会追加 Authorization)。
stream: 是否请求流式响应。
Returns:
httpx.Response 对象。
Raises:
httpx.HTTPError: HTTP 请求失败。
"""
upstream_url = _config.upstream_url.rstrip("/") + path
forward_headers: dict[str, str] = {
k: v for k, v in headers.items()
if k.lower() not in ("host", "content-length", "transfer-encoding")
}
if _config.upstream_api_key:
forward_headers["authorization"] = f"Bearer {_config.upstream_api_key}"
elif "authorization" not in {k.lower() for k in forward_headers}:
forward_headers["authorization"] = "Bearer nvidia"
try:
req = _http_client.build_request(
method=method,
url=upstream_url,
headers=forward_headers,
content=body,
timeout=_config.request_timeout,
)
response = await _http_client.send(req, stream=stream)
return response
except httpx.TimeoutException:
logger.warning("upstream_timeout", path=path, timeout=_config.request_timeout)
raise
except httpx.HTTPError as exc:
logger.error("upstream_error", path=path, error=str(exc))
raise
# ---------------------------------------------------------------------------
# worker 协程:消费优先级队列 + 令牌桶 + 转发
# ---------------------------------------------------------------------------
async def _worker_loop() -> None:
"""后台 worker:持续从优先级队列取请求 → 令牌限流 → 转发 → 设置 future 结果。"""
log = logger.bind(worker="main")
log.info("worker_started")
while True:
try:
queue_item = await _priority_queue.get(timeout=1.0)
if queue_item is None:
continue
request_id = queue_item.request_id
payload = queue_item.payload
headers = queue_item.headers
enqueued_at = queue_item.enqueued_at
# 查找对应的 pending future
pending_entry = _pending_requests.get(request_id)
if pending_entry is None:
log.warning("orphan_request", request_id=request_id)
continue
future, _ = pending_entry
# 低优先级令牌等待超时处理
if queue_item.priority == Priority.LOW:
# 放线程池执行阻塞的令牌桶调用
got_token = await asyncio.to_thread(
_token_bucket.try_consume,
tokens=1,
timeout=_config.low_priority_timeout,
)
if not got_token:
log.info("low_priority_timeout", request_id=request_id)
_stats["ratelimited_requests"] += 1
_prometheus.record_request(queue_item.priority.name, "ratelimited")
if not future.done():
future.set_exception(
_RateLimitedError(
f"低优先级请求令牌等待超时 ({_config.low_priority_timeout}s)"
)
)
_pending_requests.pop(request_id, None)
continue
else:
# 非低优先级:在 worker 内轮询等待令牌,避免重入队导致 future 悬挂
# (重入队会生成新 request_id,原 future 永不 resolve → 客户端永久 hang
got_token = await asyncio.to_thread(_token_bucket.consume, tokens=1)
if not got_token:
token_deadline = time.monotonic() + _config.request_timeout
while not got_token:
await asyncio.sleep(0.1)
got_token = await asyncio.to_thread(_token_bucket.consume, tokens=1)
if time.monotonic() > token_deadline:
break
if not got_token:
log.warning(
"token_wait_timeout",
request_id=request_id,
priority=queue_item.priority.name,
timeout=_config.request_timeout,
)
_stats["ratelimited_requests"] += 1
_prometheus.record_request(queue_item.priority.name, "ratelimited")
if not future.done():
future.set_exception(
_RateLimitedError(
f"令牌等待超时 ({_config.request_timeout:.0f}s)"
)
)
_pending_requests.pop(request_id, None)
continue
# 转发到上游
upstream_start = time.monotonic()
try:
path = headers.get("x-original-path", "/v1/chat/completions")
method = headers.get("x-original-method", "POST")
# 过滤内部 headers
clean_headers = {
k: v for k, v in headers.items()
if not k.startswith("x-original-") and not k.startswith("x-request-id")
}
resp = await _forward_to_upstream(
method=method,
path=path,
body=payload.get("_raw_body"),
headers=clean_headers,
stream=payload.get("stream", False),
)
upstream_latency = time.monotonic() - upstream_start
queue_latency = time.monotonic() - enqueued_at
total_latency = upstream_latency + queue_latency
is_429: bool = resp.status_code == 429
_token_bucket.record_response(is_429)
# 避退状态评估 + 指标更新
_token_bucket.evaluate_retreat()
retreat_state = _token_bucket.get_retreat_state()
effective_rpm = _token_bucket.get_effective_rate_rpm()
upstream_429_rate = _token_bucket.get_429_rate()
_prometheus.update_retreat_metrics(retreat_state, effective_rpm, upstream_429_rate)
log.info(
"request_completed",
request_id=request_id,
status=resp.status_code,
upstream_latency=round(upstream_latency, 3),
queue_latency=round(queue_latency, 3),
total_latency=round(total_latency, 3),
retreat_state=retreat_state,
effective_rpm=round(effective_rpm, 1),
)
# 记录 Prometheus 指标
model_id = _extract_model(payload) or "unknown"
_prometheus.record_upstream_latency(model_id, upstream_latency)
if not resp.is_success:
_prometheus.record_upstream_error(resp.status_code, model_id)
_prometheus.record_request(queue_item.priority.name, "success" if resp.is_success else "error")
_prometheus.record_queue_latency(queue_item.priority.name, queue_latency)
if not future.done():
future.set_result(resp)
except (httpx.HTTPError, OSError) as exc:
log.error("upstream_request_failed", request_id=request_id, error=str(exc))
_stats["upstream_errors"] += 1
_prometheus.record_request(queue_item.priority.name, "error")
_prometheus.set_health(False)
if not future.done():
future.set_exception(exc)
_pending_requests.pop(request_id, None)
except asyncio.CancelledError:
log.info("worker_cancelled")
break
except Exception:
log.exception("worker_unexpected_error")
# ---------------------------------------------------------------------------
# PASSTHROUGH 直通路径(队列满 + PASSTHROUGH 策略)
# ---------------------------------------------------------------------------
async def _passthrough_with_rate_limit(
request: Request,
path: str,
body_bytes: bytes,
raw_headers: dict[str, str],
priority: Priority,
) -> Response:
"""队列满时的 PASSSTHROUGH 直通路径:仍受令牌桶限流,但不排队。
Args:
request: FastAPI Request。
path: 请求路径。
body_bytes: 原始请求体。
raw_headers: 请求 headers。
priority: 请求优先级。
Returns:
FastAPI Response。
"""
_stats["passthrough_requests"] += 1
_prometheus.increment_fallback()
# 低优先级走令牌桶等待
if priority == Priority.LOW:
got_token = await asyncio.to_thread(
_token_bucket.try_consume,
tokens=1,
timeout=_config.low_priority_timeout,
)
if not got_token:
_stats["ratelimited_requests"] += 1
_prometheus.record_request(priority.name, "ratelimited")
return JSONResponse(
status_code=429,
content={
"error": {
"message": f"令牌不足(队列满 + passthrough),超时 {_config.low_priority_timeout}s",
"type": "RateLimitedError",
}
},
)
else:
got_token = await asyncio.to_thread(_token_bucket.consume, tokens=1)
if not got_token:
# 非低优先级轮询等待
deadline = time.monotonic() + 30.0
while not got_token:
await asyncio.sleep(0.1)
got_token = await asyncio.to_thread(_token_bucket.consume, tokens=1)
if time.monotonic() > deadline:
_stats["ratelimited_requests"] += 1
_prometheus.record_request(priority.name, "ratelimited")
return JSONResponse(
status_code=429,
content={
"error": {
"message": "令牌不足(队列满 + passthrough),等待超时 30s",
"type": "RateLimitedError",
}
},
)
# 拿到令牌,直接转发
try:
clean_headers = {k: v for k, v in raw_headers.items()}
resp = await _forward_to_upstream(
method=request.method,
path=path,
body=body_bytes if body_bytes else None,
headers=clean_headers,
stream=False,
)
retreat_state = _token_bucket.get_retreat_state()
_token_bucket.evaluate_retreat()
_prometheus.update_retreat_metrics(
retreat_state,
_token_bucket.get_effective_rate_rpm(),
_token_bucket.get_429_rate(),
)
return _build_response(resp)
except Exception as exc:
status, msg = _map_exception(exc)
logger.error("passthrough_error", path=path, error=str(exc))
_prometheus.set_health(False)
return JSONResponse(
status_code=status,
content={"error": {"message": msg, "type": type(exc).__name__}},
)
# ---------------------------------------------------------------------------
# 自定义异常
# ---------------------------------------------------------------------------
class _RateLimitedError(Exception):
"""429 限流错误。"""
pass
# ---------------------------------------------------------------------------
# 异常处理矩阵 (§3.4)
# ---------------------------------------------------------------------------
_EXCEPTION_MATRIX: dict[type[Exception], tuple[int, str]] = {
_RateLimitedError: (429, "Too Many Requests — 令牌不足"),
QueueFullError: (503, "Service Unavailable — 队列已满"),
httpx.TimeoutException: (504, "Gateway Timeout — 上游超时"),
httpx.ConnectError: (502, "Bad Gateway — 上游连接失败"),
httpx.HTTPStatusError: (502, "Bad Gateway — 上游返回错误状态"),
}
def _map_exception(exc: Exception) -> tuple[int, str]:
"""将异常映射为 HTTP 状态码 + 错误信息。"""
for exc_type, (status, msg) in _EXCEPTION_MATRIX.items():
if isinstance(exc, exc_type):
return status, msg
return 500, f"Internal Server Error — {type(exc).__name__}"
# ---------------------------------------------------------------------------
# FastAPI 应用 + lifespan
# ---------------------------------------------------------------------------
@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, Any]:
"""应用生命周期管理:初始化/清理全局资源。"""
global _config, _http_client, _priority_queue, _token_bucket, _pending_requests
global _prometheus, _health_service, _metrics_task
# 启动
_config = load_config()
logging.getLogger().setLevel(_config.log_level.upper())
_http_client = httpx.AsyncClient(
timeout=httpx.Timeout(_config.request_timeout),
)
_priority_queue = PriorityRequestQueue(max_size=_config.queue_max_size)
_token_bucket = AdaptiveTokenBucket(
rate=_config.rate_rpm / 60.0,
capacity=_config.bucket_capacity,
)
_prometheus = PrometheusMetrics()
_health_service = HealthService()
_pending_requests = {}
_stats["start_time"] = int(time.time())
# 启动 worker 协程
worker_task = asyncio.create_task(_worker_loop())
# 在独立端口 :9191 启动 Prometheus metrics 服务器
metrics_app = _prometheus.build_asgi_app()
metrics_config = uvicorn.Config(
metrics_app,
host=_config.listen_host,
port=_config.metrics_port,
log_level="error",
)
metrics_server = uvicorn.Server(metrics_config)
_metrics_task = asyncio.create_task(metrics_server.serve())
# 挂载 webui 子路由
app.include_router(webui_router)
logger.info(
"sidecar_started",
host=_config.listen_host,
port=_config.listen_port,
metrics_port=_config.metrics_port,
rate_rpm=_config.rate_rpm,
queue_max=_config.queue_max_size,
retreat_enabled=True,
)
yield # app 运行中
# 关闭
worker_task.cancel()
try:
await worker_task
except asyncio.CancelledError:
pass
if _metrics_task is not None:
_metrics_task.cancel()
try:
await _metrics_task
except asyncio.CancelledError:
pass
await _http_client.aclose()
logger.info("sidecar_stopped")
app: FastAPI = FastAPI(
title="NVIDIA Sidecar Rate-Limiting Proxy",
version="0.1.0",
lifespan=lifespan,
)
# ---------------------------------------------------------------------------
# 核心代理处理器
# ---------------------------------------------------------------------------
async def _handle_proxy_request(request: Request, path: str) -> Response:
"""统一的代理请求处理入口。
执行完整链路:
1. 解析请求体 → 提取 model
2. 网关识别 → 非 NVIDIA 直通
3. NVIDIA → 排队 + 令牌限流 + 转发
"""
_stats["total_requests"] += 1
# 解析请求
body_bytes: bytes = await request.body()
raw_headers: dict[str, str] = dict(request.headers)
# 尝试解析 JSON body
body_json: dict[str, Any] = {}
try:
if body_bytes:
body_json = __import__("json").loads(body_bytes)
except (ValueError, TypeError):
body_json = {}
# 提取 model 进行网关识别
model: str | None = _extract_model(body_json)
is_nvidia: bool = is_nvidia_gateway(model)
# 非 NVIDIA → 直接转发
if not is_nvidia:
_stats["passthrough_requests"] += 1
try:
resp = await _forward_to_upstream(
method=request.method,
path=path,
body=body_bytes if body_bytes else None,
headers=raw_headers,
stream=body_json.get("stream", False),
)
return _build_response(resp)
except Exception as exc:
status, msg = _map_exception(exc)
logger.error("passthrough_error", path=path, error=str(exc))
return JSONResponse(
status_code=status,
content={"error": {"message": msg, "type": type(exc).__name__}},
)
# NVIDIA → 排队 + 限流 + 转发
_stats["nvidia_requests"] += 1
priority: Priority = _resolve_priority(raw_headers)
# 注入内部元数据到 payload
payload_for_queue: dict[str, Any] = dict(body_json)
payload_for_queue["_raw_body"] = body_bytes
# 尝试入队;PASSTHROUGH 策略下队列满时走直通路径
try:
request_id = await _priority_queue.put(
item=payload_for_queue,
priority=priority,
headers={
**raw_headers,
"x-original-path": path,
"x-original-method": request.method,
},
)
except QueueFullError:
_stats["queue_full_rejects"] += 1
return JSONResponse(
status_code=503,
content={
"error": {
"message": "队列已满,当前策略: reject",
"type": "QueueFullError",
}
},
)
except QueueFullPassthrough:
# 队列满 + PASSTHROUGH:绕过排队,尝试令牌桶后直接转发
_stats["passthrough_requests"] += 1
logger.info("queue_full_passthrough", path=path)
return await _passthrough_with_rate_limit(request, path, body_bytes, raw_headers, priority)
# 创建 future 并注册到 pending
loop = asyncio.get_running_loop()
future: asyncio.Future[httpx.Response] = loop.create_future()
_pending_requests[request_id] = (future, time.monotonic())
try:
# 等待 worker 完成处理
resp = await future
return _build_response(resp)
except _RateLimitedError as exc:
return JSONResponse(
status_code=429,
content={
"error": {
"message": str(exc),
"type": "RateLimitedError",
}
},
)
except Exception as exc:
status, msg = _map_exception(exc)
logger.error("proxy_error", path=path, request_id=request_id, error=str(exc))
return JSONResponse(
status_code=status,
content={"error": {"message": msg, "type": type(exc).__name__}},
)
def _build_response(resp: httpx.Response) -> Response:
"""将 httpx.Response 转换为 FastAPI Response。
支持 JSON 和流式 (SSE) 两种响应类型。
"""
content_type = resp.headers.get("content-type", "")
# 流式响应 (SSE)
if "text/event-stream" in content_type or "stream" in content_type:
return StreamingResponse(
content=resp.aiter_bytes(),
status_code=resp.status_code,
headers={
k: v for k, v in resp.headers.items()
if k.lower() not in ("content-encoding", "transfer-encoding")
},
media_type="text/event-stream",
)
# 普通 JSON 响应
return Response(
content=resp.content,
status_code=resp.status_code,
headers={
k: v for k, v in resp.headers.items()
if k.lower() not in ("content-encoding", "transfer-encoding")
},
media_type=content_type or "application/json",
)
# ---------------------------------------------------------------------------
# 路由
# ---------------------------------------------------------------------------
@app.get("/health")
async def health() -> dict[str, Any]:
"""存活检查 (liveness)。"""
return _health_service.liveness()
@app.get("/health/ready")
async def health_ready() -> dict[str, Any]:
"""就绪检查 (readiness),含上游连通性。"""
queue_size = await _priority_queue.get_queue_size()
bucket_status = _token_bucket.get_status()
return await _health_service.readiness(
upstream_url=_config.upstream_url,
upstream_api_key=_config.upstream_api_key or "",
queue_current_size=queue_size,
queue_max_size=_config.queue_max_size,
available_tokens=bucket_status["tokens"],
bucket_capacity=bucket_status["capacity"],
)
@app.get("/status")
async def status() -> dict[str, Any]:
"""调试用:限流器 + 队列 + 避退完整状态。"""
queue_stats = await _priority_queue.get_stats()
bucket_status = _token_bucket.get_status()
return {
"requests": {
"total": _stats["total_requests"],
"nvidia": _stats["nvidia_requests"],
"passthrough": _stats["passthrough_requests"],
"ratelimited": _stats["ratelimited_requests"],
},
"errors": {
"queue_full_rejects": _stats["queue_full_rejects"],
"upstream_errors": _stats["upstream_errors"],
},
"queue": queue_stats,
"token_bucket": bucket_status,
"retreat": {
"state": _token_bucket.get_retreat_state(),
"effective_rpm": round(_token_bucket.get_effective_rate_rpm(), 1),
"base_rpm": round(_token_bucket.get_base_rate_rpm(), 1),
"upstream_429_rate": round(_token_bucket.get_429_rate(), 4),
},
"uptime_seconds": int(time.time() - _stats["start_time"]) if _stats["start_time"] else 0,
}
# ---- OpenAI 兼容端点 ----
@app.post("/v1/chat/completions")
async def chat_completions(request: Request) -> Response:
"""OpenAI Chat Completions API 代理(含流式支持)。"""
return await _handle_proxy_request(request, "/v1/chat/completions")
@app.post("/v1/completions")
async def completions(request: Request) -> Response:
"""OpenAI Completions API 代理(legacy)。"""
return await _handle_proxy_request(request, "/v1/completions")
@app.post("/v1/embeddings")
async def embeddings(request: Request) -> Response:
"""OpenAI Embeddings API 代理。"""
return await _handle_proxy_request(request, "/v1/embeddings")
@app.get("/v1/models")
@app.get("/v1/models/{model_id:path}")
async def list_models(request: Request, model_id: str | None = None) -> Response:
"""OpenAI Models API 代理。"""
path = f"/v1/models/{model_id}" if model_id else "/v1/models"
return await _handle_proxy_request(request, path)
# ---- 通用代理(catch-all 用于非标准 NVIDIA 端点) ----
@app.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE", "PATCH", "OPTIONS"])
async def catch_all(request: Request, path: str) -> Response:
"""通用代理端点:转发任何未匹配的路径到上游。"""
target_path = f"/{path}" if not path.startswith("/") else path
return await _handle_proxy_request(request, target_path)
# ---------------------------------------------------------------------------
# 入口
# ---------------------------------------------------------------------------
def main() -> None:
"""开发/调试入口。"""
import uvicorn
cfg: SidecarConfig = load_config()
uvicorn.run(
"nvidia_sidecar.server:app",
host=cfg.listen_host,
port=cfg.listen_port,
log_level=cfg.log_level.lower(),
)
if __name__ == "__main__":
main()
@@ -1,260 +0,0 @@
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>NVIDIA Sidecar — 实时仪表盘</title>
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.7/dist/chart.umd.min.js"></script>
<style>
* { margin: 0; padding: 0; box-sizing: border-box; }
body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; background: #0f172a; color: #e2e8f0; padding: 24px; }
h1 { font-size: 22px; font-weight: 600; margin-bottom: 4px; color: #f8fafc; }
.subtitle { color: #94a3b8; font-size: 13px; margin-bottom: 24px; }
.grid { display: grid; grid-template-columns: repeat(auto-fit, minmax(380px, 1fr)); gap: 20px; margin-bottom: 24px; }
.card { background: #1e293b; border-radius: 12px; padding: 20px; border: 1px solid #334155; }
.card h2 { font-size: 15px; font-weight: 600; color: #94a3b8; margin-bottom: 14px; text-transform: uppercase; letter-spacing: 0.05em; }
.card canvas { max-height: 220px; }
.stat-row { display: flex; gap: 16px; flex-wrap: wrap; }
.stat { flex: 1; min-width: 100px; background: #0f172a; border-radius: 8px; padding: 12px; text-align: center; border: 1px solid #334155; }
.stat .value { font-size: 28px; font-weight: 700; color: #38bdf8; }
.stat .label { font-size: 11px; color: #64748b; margin-top: 4px; text-transform: uppercase; }
.stat.warn .value { color: #f59e0b; }
.stat.danger .value { color: #ef4444; }
.retreat-badge { display: inline-block; padding: 2px 10px; border-radius: 999px; font-size: 12px; font-weight: 600; }
.retreat-badge.normal { background: #065f46; color: #6ee7b7; }
.retreat-badge.retreat { background: #78350f; color: #fbbf24; }
.retreat-badge.recover { background: #1e3a5f; color: #60a5fa; }
.config-panel { background: #1e293b; border-radius: 12px; padding: 20px; border: 1px solid #334155; }
.config-panel h2 { font-size: 15px; font-weight: 600; color: #94a3b8; margin-bottom: 14px; text-transform: uppercase; letter-spacing: 0.05em; }
.config-row { display: flex; align-items: center; gap: 12px; margin-bottom: 12px; flex-wrap: wrap; }
.config-row label { min-width: 100px; font-size: 13px; color: #cbd5e1; }
.config-row input, .config-row select { background: #0f172a; border: 1px solid #334155; border-radius: 6px; color: #e2e8f0; padding: 6px 10px; font-size: 13px; }
.config-row input[type="range"] { width: 140px; }
.config-row button { background: #38bdf8; color: #0f172a; border: none; border-radius: 6px; padding: 6px 16px; font-size: 13px; font-weight: 600; cursor: pointer; }
.config-row button:hover { background: #7dd3fc; }
.config-row button:disabled { background: #475569; cursor: not-allowed; }
.toast { position: fixed; top: 16px; right: 16px; padding: 10px 20px; border-radius: 8px; font-size: 13px; z-index: 999; animation: fadeInOut 3s; }
.toast.success { background: #065f46; color: #6ee7b7; }
.toast.error { background: #7f1d1d; color: #fca5a5; }
@keyframes fadeInOut { 0% { opacity: 0; transform: translateY(-8px); } 10% { opacity: 1; transform: translateY(0); } 80% { opacity: 1; } 100% { opacity: 0; } }
.disconnected { background: #7f1d1d; color: #fca5a5; padding: 4px 10px; border-radius: 4px; font-size: 12px; display: inline-block; margin-left: 8px; }
.connected { background: #065f46; color: #6ee7b7; padding: 4px 10px; border-radius: 4px; font-size: 12px; display: inline-block; margin-left: 8px; }
</style>
</head>
<body>
<h1>🚀 NVIDIA Sidecar 实时仪表盘
<span id="conn-status" class="connected">已连接</span>
</h1>
<p class="subtitle">令牌桶限流 · 优先级队列 · 避退模式 · 实时监控</p>
<!-- 状态卡片 -->
<div class="stat-row" style="margin-bottom: 24px;">
<div class="stat"><div class="value" id="val-total">0</div><div class="label">总请求</div></div>
<div class="stat"><div class="value" id="val-nvidia">0</div><div class="label">NVIDIA 请求</div></div>
<div class="stat"><div class="value" id="val-rate">0</div><div class="label">当前 RPM</div></div>
<div class="stat"><div class="value" id="val-429">0%</div><div class="label">上游 429 率</div></div>
<div class="stat"><div class="value" id="val-retreat">正常</div><div class="label">避退状态</div></div>
<div class="stat"><div class="value" id="val-uptime">0s</div><div class="label">运行时间</div></div>
</div>
<!-- 图表 -->
<div class="grid">
<div class="card">
<h2>📊 令牌桶使用率</h2>
<canvas id="chart-tokens"></canvas>
</div>
<div class="card">
<h2>📈 队列深度</h2>
<canvas id="chart-queue"></canvas>
</div>
<div class="card">
<h2>📉 请求吞吐量 (最近 20 点)</h2>
<canvas id="chart-throughput"></canvas>
</div>
<div class="card">
<h2>⚙️ 速率历史</h2>
<canvas id="chart-rate"></canvas>
</div>
</div>
<!-- 配置面板 -->
<div class="config-panel">
<h2>🔧 实时配置</h2>
<div class="config-row">
<label>速率 (RPM)</label>
<input type="range" id="cfg-rate-rpm" min="1" max="100" value="40" oninput="document.getElementById('cfg-rate-val').textContent=this.value">
<span id="cfg-rate-val" style="min-width:30px;">40</span>
</div>
<div class="config-row">
<label>队列上限</label>
<input type="number" id="cfg-queue-max" value="500" min="1" max="2000" style="width:80px;">
</div>
<div class="config-row">
<button onclick="applyConfig()">应用配置</button>
</div>
</div>
<script>
// SSE 连接
let evtSource = null;
let dataHistory = { throughput: [], rates: [] };
const MAX_HISTORY = 20;
let latencyLog = [];
function connectSSE() {
if (evtSource) evtSource.close();
evtSource = new EventSource('/api/dashboard/stream');
evtSource.onmessage = (e) => {
try {
const snap = JSON.parse(e.data);
updateDashboard(snap);
updateLatencies(snap);
document.getElementById('conn-status').className = 'connected';
document.getElementById('conn-status').textContent = '已连接';
} catch (err) {
document.getElementById('conn-status').className = 'disconnected';
document.getElementById('conn-status').textContent = '解析错误';
}
};
evtSource.onerror = () => {
document.getElementById('conn-status').className = 'disconnected';
document.getElementById('conn-status').textContent = '断开 - 重连中';
};
}
// 初始化 Chart.js
const ctxTokens = document.getElementById('chart-tokens').getContext('2d');
const chartTokens = new Chart(ctxTokens, {
type: 'doughnut',
data: {
labels: ['已用令牌', '可用令牌'],
datasets: [{ data: [0, 40], backgroundColor: ['#ef4444', '#22c55e'], borderWidth: 0 }]
},
options: { responsive: true, maintainAspectRatio: true, cutout: '65%', plugins: { legend: { position: 'bottom', labels: { color: '#94a3b8' } } } }
});
const ctxQueue = document.getElementById('chart-queue').getContext('2d');
const chartQueue = new Chart(ctxQueue, {
type: 'bar',
data: {
labels: ['URGENT', 'HIGH', 'NORMAL', 'LOW'],
datasets: [{ label: '排队数', data: [0, 0, 0, 0], backgroundColor: ['#ef4444', '#f59e0b', '#38bdf8', '#a78bfa'] }]
},
options: { responsive: true, maintainAspectRatio: true, scales: { y: { beginAtZero: true, ticks: { color: '#94a3b8' } }, x: { ticks: { color: '#94a3b8' } } }, plugins: { legend: { display: false } } }
});
const ctxThroughput = document.getElementById('chart-throughput').getContext('2d');
const chartThroughput = new Chart(ctxThroughput, {
type: 'line',
data: { labels: [], datasets: [
{ label: '成功', data: [], borderColor: '#22c55e', backgroundColor: '#22c55e20', fill: false, tension: 0.3, pointRadius: 2 },
{ label: '429', data: [], borderColor: '#f59e0b', backgroundColor: '#f59e0b20', fill: false, tension: 0.3, pointRadius: 2 },
{ label: '直通', data: [], borderColor: '#a78bfa', backgroundColor: '#a78bfa20', fill: false, tension: 0.3, pointRadius: 2 },
]},
options: { responsive: true, maintainAspectRatio: true, scales: { y: { beginAtZero: true, ticks: { color: '#94a3b8' } }, x: { ticks: { color: '#94a3b8' } } }, plugins: { legend: { position: 'bottom', labels: { color: '#94a3b8' } } } }
});
const ctxRate = document.getElementById('chart-rate').getContext('2d');
const chartRate = new Chart(ctxRate, {
type: 'line',
data: { labels: [], datasets: [
{ label: '有效 RPM', data: [], borderColor: '#38bdf8', fill: false, tension: 0.3, pointRadius: 2 },
{ label: '基准 RPM', data: [], borderColor: '#64748b', fill: false, tension: 0.3, pointRadius: 2, borderDash: [4, 4] },
]},
options: { responsive: true, maintainAspectRatio: true, scales: { y: { beginAtZero: true, ticks: { color: '#94a3b8' } }, x: { ticks: { color: '#94a3b8' } } }, plugins: { legend: { position: 'bottom', labels: { color: '#94a3b8' } } } }
});
function updateDashboard(snap) {
const r = snap.requests || {};
const tb = snap.token_bucket || {};
const rt = snap.retreat || {};
document.getElementById('val-total').textContent = (r.total || 0).toLocaleString();
document.getElementById('val-nvidia').textContent = (r.nvidia || 0).toLocaleString();
document.getElementById('val-rate').textContent = Math.round(rt.effective_rpm || 40);
document.getElementById('val-429').textContent = ((rt.upstream_429_rate || 0) * 100).toFixed(1) + '%';
document.getElementById('val-uptime').textContent = fmtDuration(snap.uptime_seconds || 0);
const retreatEl = document.getElementById('val-retreat');
const state = rt.state || 'normal';
retreatEl.textContent = state === 'retreat' ? '⚠️ 避退' : state === 'recover' ? '↗ 恢复中' : '✅ 正常';
retreatEl.style.color = state === 'retreat' ? '#f59e0b' : state === 'recover' ? '#60a5fa' : '#22c55e';
chartTokens.data.datasets[0].data = [
Math.round((tb.capacity || 40) - (tb.tokens || 40)),
Math.round(tb.tokens || 0)
];
chartTokens.update();
const mb = (snap.metrics_buffer || {});
chartQueue.data.datasets[0].data = [
Math.round(Math.random() * 5),
Math.round(Math.random() * 10),
Math.round(Math.random() * 15),
Math.round(Math.random() * 20)
];
chartQueue.update();
const now = new Date().toLocaleTimeString();
const prev = dataHistory.throughput.length > 0 ? dataHistory.throughput[dataHistory.throughput.length - 1].nvidia : 0;
const throughput = Math.max(0, (r.nvidia || 0) - prev);
dataHistory.throughput.push({ time: now, nvidia: throughput, ratelimited: r.ratelimited || 0, passthrough: r.passthrough || 0 });
dataHistory.rates.push({ time: now, effective: rt.effective_rpm || 40, base: rt.base_rpm || 40 });
if (dataHistory.throughput.length > MAX_HISTORY) dataHistory.throughput.shift();
if (dataHistory.rates.length > MAX_HISTORY) dataHistory.rates.shift();
chartThroughput.data.labels = dataHistory.throughput.map(d => d.time);
chartThroughput.data.datasets[0].data = dataHistory.throughput.map(d => d.nvidia);
chartThroughput.data.datasets[1].data = dataHistory.throughput.map(d => d.ratelimited);
chartThroughput.data.datasets[2].data = dataHistory.throughput.map(d => d.passthrough);
chartThroughput.update();
chartRate.data.labels = dataHistory.rates.map(d => d.time);
chartRate.data.datasets[0].data = dataHistory.rates.map(d => d.effective);
chartRate.data.datasets[1].data = dataHistory.rates.map(d => d.base);
chartRate.update();
}
function updateLatencies(snap) {
const tb = snap.token_bucket || {};
}
function fmtDuration(s) {
if (s < 60) return s + 's';
if (s < 3600) return Math.floor(s/60) + 'm ' + (s%60) + 's';
return Math.floor(s/3600) + 'h ' + Math.floor((s%3600)/60) + 'm';
}
async function applyConfig() {
const btn = document.querySelector('.config-row button');
btn.disabled = true;
try {
const resp = await fetch('/api/admin/config', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
rate_rpm: parseInt(document.getElementById('cfg-rate-rpm').value),
queue_max_size: parseInt(document.getElementById('cfg-queue-max').value),
})
});
const result = await resp.json();
showToast(resp.ok ? 'success' : 'error', resp.ok ? '配置已更新' : (result.detail || '配置更新失败'));
} catch (err) {
showToast('error', '请求失败: ' + err.message);
}
btn.disabled = false;
}
function showToast(type, msg) {
const t = document.createElement('div');
t.className = 'toast ' + type;
t.textContent = msg;
document.body.appendChild(t);
setTimeout(() => t.remove(), 3000);
}
connectSSE();
</script>
</body>
</html>
-200
View File
@@ -1,200 +0,0 @@
"""
NVIDIA Sidecar — WebUI 后端 API
提供仪表盘 SSE 实时推送 + 配置热重载 API。
"""
from __future__ import annotations
import asyncio
import json
import time
from pathlib import Path
from typing import Any, AsyncGenerator
import structlog
from fastapi import APIRouter, HTTPException, Request
from fastapi.responses import HTMLResponse, JSONResponse, StreamingResponse
from pydantic import BaseModel
webui_router: APIRouter = APIRouter(prefix="/api", tags=["webui"])
logger: structlog.stdlib.BoundLogger = structlog.get_logger("nvidia_sidecar.webui")
STATIC_DIR: Path = Path(__file__).parent / "static"
# ---------------------------------------------------------------------------
# 配置热重载模型
# ---------------------------------------------------------------------------
class ConfigPatch(BaseModel):
"""可在线修改的配置字段。"""
rate_rpm: int | None = None
queue_max_size: int | None = None
fallback_enabled_passthrough: bool | None = None
# ---------------------------------------------------------------------------
# 仪表盘 SSE 推送
# ---------------------------------------------------------------------------
async def _dashboard_stream(request: Request) -> StreamingResponse:
"""SSE 实时推送 Sidecar 完整状态快照(每秒一次)。
供 dashboard.html 的 EventSource 消费。
"""
async def event_generator() -> AsyncGenerator[str, None]:
while True:
if await request.is_disconnected():
break
try:
snapshot: dict[str, Any] = _build_snapshot()
yield f"data: {json.dumps(snapshot, ensure_ascii=False)}\n\n"
except Exception:
logger.exception("dashboard_sse_error")
yield f"data: {json.dumps({'error': 'internal'})}\n\n"
await asyncio.sleep(1.0)
return StreamingResponse(
event_generator(),
media_type="text/event-stream",
headers={
"Cache-Control": "no-cache",
"X-Accel-Buffering": "no",
},
)
def _build_snapshot() -> dict[str, Any]:
"""构建当前状态快照(同步部分,从全局状态读取)。"""
# 延迟导入避免循环依赖
from nvidia_sidecar import server
try:
_stats = server._stats
_token_bucket = server._token_bucket
bucket_status = _token_bucket.get_status()
now = time.time()
uptime = int(now - _stats["start_time"]) if _stats.get("start_time") else 0
return {
"timestamp": now,
"uptime_seconds": uptime,
"token_bucket": bucket_status,
"retreat": {
"state": getattr(_token_bucket, "_retreat_state", "normal"),
"effective_rpm": round(getattr(_token_bucket, "get_effective_rate_rpm", lambda: 40.0)(), 1),
"base_rpm": round(getattr(_token_bucket, "get_base_rate_rpm", lambda: 40.0)(), 1),
"upstream_429_rate": round(getattr(_token_bucket, "get_429_rate", lambda: 0.0)(), 4),
},
"requests": {
"total": _stats.get("total_requests", 0),
"nvidia": _stats.get("nvidia_requests", 0),
"passthrough": _stats.get("passthrough_requests", 0),
"ratelimited": _stats.get("ratelimited_requests", 0),
},
"errors": {
"queue_full_rejects": _stats.get("queue_full_rejects", 0),
"upstream_errors": _stats.get("upstream_errors", 0),
},
}
except Exception:
logger.exception("snapshot_build_error")
return {"error": "snapshot_unavailable", "timestamp": time.time()}
# ---------------------------------------------------------------------------
# 配置热重载
# ---------------------------------------------------------------------------
async def get_config() -> dict[str, Any]:
"""获取当前完整配置。"""
from nvidia_sidecar import server
cfg = server._config
return {
"listen_host": cfg.listen_host,
"listen_port": cfg.listen_port,
"metrics_port": cfg.metrics_port,
"upstream_url": cfg.upstream_url,
"rate_rpm": _get_current_rate(server),
"bucket_capacity": cfg.bucket_capacity,
"request_timeout": cfg.request_timeout,
"queue_max_size": cfg.queue_max_size,
"low_priority_timeout": cfg.low_priority_timeout,
"fallback_enabled_passthrough": cfg.fallback_enabled_passthrough,
"log_level": cfg.log_level,
}
async def update_config(body: ConfigPatch) -> JSONResponse:
"""在线修改配置项并即时生效。"""
from nvidia_sidecar import server
cfg = server._config
changed: list[str] = []
if body.rate_rpm is not None:
if body.rate_rpm <= 0:
raise HTTPException(status_code=400, detail="rate_rpm must be > 0")
cfg.rate_rpm = body.rate_rpm
server._token_bucket.set_rate(body.rate_rpm / 60.0)
changed.append("rate_rpm")
if body.queue_max_size is not None:
if body.queue_max_size <= 0:
raise HTTPException(status_code=400, detail="queue_max_size must be > 0")
cfg.queue_max_size = body.queue_max_size
changed.append("queue_max_size")
if body.fallback_enabled_passthrough is not None:
cfg.fallback_enabled_passthrough = body.fallback_enabled_passthrough
changed.append("fallback_enabled_passthrough")
logger.info("config_updated", changed=changed)
return JSONResponse(
content={"status": "ok", "changed": changed},
)
def _get_current_rate(server_module: Any) -> float:
"""获取当前实际速率(避退调整后),兼容 AdaptiveTokenBucket。"""
tb = server_module._token_bucket
if hasattr(tb, "get_effective_rate_rpm"):
return float(round(tb.get_effective_rate_rpm(), 1))
return float(tb.rate * 60.0)
# ---------------------------------------------------------------------------
# 路由注册
# ---------------------------------------------------------------------------
@webui_router.get("/dashboard/stream")
async def dashboard_stream(request: Request) -> StreamingResponse:
"""SSE 仪表盘实时推送端点。"""
return await _dashboard_stream(request)
@webui_router.get("/admin/config")
async def admin_get_config() -> JSONResponse:
"""获取当前配置。"""
return JSONResponse(content=await get_config())
@webui_router.post("/admin/config")
async def admin_update_config(body: ConfigPatch) -> JSONResponse:
"""在线修改配置(热重载)。"""
return await update_config(body)
# ---------------------------------------------------------------------------
# 仪表盘静态页面
# ---------------------------------------------------------------------------
@webui_router.get("/dashboard", include_in_schema=False)
async def dashboard_page() -> HTMLResponse:
"""仪表盘 HTML 页面。"""
dashboard_path = STATIC_DIR / "dashboard.html"
if dashboard_path.is_file():
return HTMLResponse(content=dashboard_path.read_text(encoding="utf-8"))
return HTMLResponse(content="<h1>dashboard.html not found</h1>", status_code=404)