Compare commits

13 Commits

Author SHA1 Message Date
vincent 474f1eddfd fix(sidecar-v2): second-round review fixes
- cooldown_manager: move function-level imports to module top
- proxy.py: emergency_count counter now actually increments
- server.py: metrics reads emergency_count from proxy module
- dashboard.html: real JS CDN fallback (not just comment)
- requirements.txt: remove unused prometheus_client

Round 2 review residual fixes from 沈路明/陆怀瑾/梁思筑 feedback

Co-authored-by: multica-agent <github@multica.ai>
2026-06-25 17:53:48 +08:00
vincent 4f415fb500 fix(sidecar-v2): incorporate review feedback - P0/P1 fixes
P0 fixes:
- Admin API Bearer Token auth middleware
- Encryption key missing -> CRITICAL log + sys.exit(1)
- Prometheus metrics endpoint (:9191)
- requirements.txt + Dockerfile + docker-compose.yml + systemd + nginx

P1 fixes:
- Dead code removed from _refresh_cooldowns()
- Stream detection fixed (text/event-stream only)
- Emergency passthrough (10% RPM retry before 503)
- Active health probing for backends
- SQLite daily backup loop with retention
- Chart.js CDN fallback
- Key rotation SOP document
- JSON log format support
- Deploy files: systemd unit + nginx config

BIZ-52 review re-entry

Co-authored-by: multica-agent <github@multica.ai>
2026-06-25 17:12:33 +08:00
vincent 611ebd11a8 feat(sidecar-v2): implement multi-pool provider proxy with cooldown, rate limiting, WebUI
BIZ-52 Step 3 implementation:
- storage: backend/usage/cooldown/config CRUD with SQLite WAL
- crypto: AES-256-GCM API key encryption
- pool_manager: primary/fallback pool routing
- cooldown_manager: 429 exponential backoff cooldown
- rate_limiter: per-backend token bucket RPM control
- router: model → backend routing with pool priority
- proxy: multi-pool request forwarding with retry
- server: FastAPI admin API + OpenAI-compatible proxy + SSE
- dashboard: WebUI with provider CRUD, stats, charts

Co-authored-by: multica-agent <github@multica.ai>
2026-06-25 16:39:01 +08:00
vincent 4fd89b038d feat(knowledge): opengineer - create operations/standards domain knowledge entries (deployment process / troubleshooting / server operations standard)
Co-authored-by: multica-agent <github@multica.ai>
2026-06-24 12:21:26 +08:00
vincent 394f9e2780 chore(BIZ-24): mark the UUID mapping table and the deliverables list as completed
Co-authored-by: multica-agent <github@multica.ai>
2026-06-24 12:21:26 +08:00
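vincent 1747512117 feat(BIZ-24): generate and deploy HEARTBEAT.md v1.1 for 14 agents
- Generated a personalized HEARTBEAT.md for each of the 14 agents
- Deployed to each agent workspace (/home/vincent/.openclaw/workspace/<agent>/)
- Includes the actual OpenClaw Agent ID + Multica UUID
- Categories: 2 high-frequency / 6 development / 6 business
- Each file includes the three-source unified monitoring script (WorkBoard + Multica + to-do documents)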

Co-authored-by: multica-agent <github@multica.ai>
2026-06-24 12:21:26 +08:00
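vincent 1561c2eaeb feat(BIZ-24): v1.1 - add unified monitoring across all task sources (WorkBoard + Multica + to-do documents)
Changes:
- Added "Rule 0: unified monitoring across all task sources" (rules expanded from 5 to 6)
- Three-source monitoring script: WorkBoard, Multica issues, to-do documents
- Timeout detection extended across platforms (WorkBoard + Multica)
- Auto-recovery now includes a Multica recovery flow
- Dependency check now includes Multica parent_issue_id
- Heartbeat checklist expanded from 4 to 6 items
- Global rules expanded from 6 to 7
- Added the Agent Multica UUID mapping table
- COO-only script for patrolling the backlog across all platforms

Addresses Vincent's review feedback: agent monitoring should cover Multica issues so that no work is missed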

Co-authored-by: multica-agent <github@multica.ai>
2026-06-24 12:21:26 +08:00
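vincent ae2fd1032f BIZ-13 Phase 1: enhance HEARTBEAT.md for all agents: add timeout detection, auto-recovery, dependency checks, round limits, context control
- Updated the HEARTBEAT.md files of 15 agents
- Added a standard template for agent runtime stability assurance
- Updated the BIZ-13 plan document (v1.1, Phase 1 in progress)
- Heartbeat frequency tiers: high-frequency 10min / development 15min / business 15min
- Timeout threshold tiers: high-frequency 60min / development 120min / business 90min
- Round limit tiers: high-frequency 50 rounds / development 100 rounds / business 30 rounds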

Co-authored-by: multica-agent <github@multica.ai>
2026-06-24 12:21:26 +08:00
vincent 01640e0617 docs: BIZ-19 Agent knowledge base integration guide + knowledge query best practices
Co-authored-by: multica-agent <github@multica.ai>
2026-06-24 12:21:26 +08:00
vincent 5942be573b feat: BIZ-24 HEARTBEAT.md enhancement template for all agents
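- No-asking rule: execute a task as soon as it is found; asking the user for permission is prohibited
- Timeout detection rule: high-frequency 10min / development 15min / business 15min
- Auto-recovery rule: automatically reschedule a task that times out without progress
- Dependency pre-check: dependencies must be checked before a task starts
- Maximum round limit: high-frequency 50 rounds / development 100 rounds / business 30 rounds

Phase 1 of the BIZ-13 runtime stability assurance plan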

Co-authored-by: multica-agent <github@multica.ai>
2026-06-24 12:21:26 +08:00
vincent 3246a1f0d9 BIZ-25: v1.1 fix delivery/workspace_id/AGENT_CONFIGS
Co-authored-by: multica-agent <github@multica.ai>
2026-06-24 07:41:53 +08:00
vincent cca4089f2a BIZ-25: Phase 1 cron deployment plan - heartbeat scheduled-task configuration for 15 agents
Co-authored-by: multica-agent <github@multica.ai>
2026-06-24 00:21:26 +08:00
vincent f4191f82f5 BIZ-28: deploy monitoring dashboard + alert config
Co-authored-by: multica-agent <github@multica.ai>
2026-06-23 15:56:49 +08:00
40 changed files with 5357 additions and 1 deletions
+2
@@ -12,8 +12,10 @@
| [产品/](产品/) | PRD、需求分析 | 沈路明 (productmanager) | — |
| [技术/](技术/) | 开发规范、代码审查 | 徐聪 (costcodev) | — |
| [设计/](设计/) | UI设计、品牌规范 | 苏绘锦 (designer) | — |
| [运维/](运维/) | 部署流程、故障排查、服务器运维 | 严维序 (opengineer) | 3 |
| [运营/](运营/) | 活动策划、数据分析 | 陆怀瑾 (coo) | — |
| [行政/](行政/) | 合同、报销流程 | 刘诗妮 (secretary) | — |
| [规范/](规范/) | 运维标准、安全基线、合规要求 | 严维序 (opengineer) | — |
## 知识条目格式
+3 -1
@@ -5,7 +5,9 @@
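## Scope
-Covers development standards, code review, architecture design, deployment and operations, technology selection, and other technical-team knowledge.
+Covers development standards, code review, architecture design, technology selection, and other core technical-team knowledge.
+
+> ⚠️ Deployment and operations knowledge has moved to the [运维/](../运维/) domain.
## Entry List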
+25
@@ -0,0 +1,25 @@
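# Standards Domain Knowledge
**Owner**: 严维序 (opengineer)
**Reviewer**: 陆怀瑾 (coo)
## Scope
Covers operations standards, security standards, compliance requirements, and other standards-type knowledge entries that support standardized team operations.
## Entry List
| File | Description | Status |
|--------|------|------|
| [服务器运维标准_v1.0.md](../运维/服务器运维标准_v1.0.md) | Server inspection, monitoring, and backup operations standard | See the operations domain |
## To Be Built
- Database operations standard
- Security audit baseline
- Data compliance handling process
---
> Maintainer: 严维序 (opengineer)
> Last updated: 2026-06-24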
+27
@@ -0,0 +1,27 @@
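# Operations Domain Knowledge
**Owner**: 严维序 (opengineer)
**Reviewer**: 陆怀瑾 (coo)
## Scope
Covers server operations, deployment processes, troubleshooting, monitoring configuration, security assurance, and other core knowledge of the operations team.
## Entry List
| File | Description | Status |
|--------|------|------|
| [部署流程_v1.0.md](部署流程_v1.0.md) | Service deployment SOP and change management process | ✅ |
| [故障排查手册_v1.0.md](故障排查手册_v1.0.md) | Locating and handling common faults | ✅ |
| [服务器运维标准_v1.0.md](服务器运维标准_v1.0.md) | Server inspection, monitoring, and backup operations standard | 🆕 |
## To Be Built
- Database operations guide
- Security hardening checklist
- Disaster recovery and emergency restoration plan
---
> Maintainer: 严维序 (opengineer)
> Last updated: 2026-06-24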
+274
@@ -0,0 +1,274 @@
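# Troubleshooting Handbook
## Metadata
| Attribute | Value |
|------|-----|
| **Domain** | Operations |
| **Owner** | 严维序 (opengineer) |
| **Version** | v1.0 |
| **Created** | 2026-06-24 |
| **Last updated** | 2026-06-24 |
| **Tags** | troubleshooting, operations, fault diagnosis |
## Overview
This handbook collects methods for locating and fixing common system and service faults in the BizWings environment. It covers SSH connections, Nginx, databases, disk space, Docker, and other core scenarios.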
---
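## 1. SSH Connection Faults
### 1.1 Connection timeout
```bash
# Diagnostic steps
ssh -vvv root@<ip> -p <port>   # show verbose connection logs
ping <ip>                      # check network reachability
nmap <ip> -p <port>            # check the port state
```
**Common causes**
- The target server's firewall has not opened the port
- The source IP is not on the allowlist
- The server is heavily loaded and sshd responds slowly
**Solutions**
1. Check the server firewall: `iptables -L -n` or `ufw status`
2. Check that sshd is running: `systemctl status sshd`
3. Check the load: `top -bn1 | head -5` (batch mode, so the output can be piped)
### 1.2 Authentication failure
```bash
# Diagnostic steps
ssh -p <port> root@<ip>   # try a password login
# On failure, expect the message: Permission denied (publickey,password)
```
**Common causes**
- Wrong password (check the record in TOOLS.md)
- SSH key authentication is misconfigured
- `/etc/ssh/sshd_config` sets `PasswordAuthentication no`
**Solutions**
1. Confirm the password matches TOOLS.md
2. Check `sshd_config`: `grep PasswordAuthentication /etc/ssh/sshd_config`
3. Temporarily allow password login: `sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config && systemctl reload sshd`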
---
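## 2. Nginx Service Faults
### 2.1 Nginx fails to start / stuck in "activating"
```bash
# Diagnostic steps
systemctl status nginx                 # check the status
journalctl -u nginx --no-pager -n 50   # read the logs
nginx -t                               # check the configuration syntax
```
**Root cause (from experience)**: leftover processes keep the port occupied
```bash
# Fix
pkill -9 nginx           # force-kill leftover processes
sleep 2
systemctl start nginx    # start again
systemctl status nginx   # confirm the status
```
### 2.2 502 Bad Gateway
```bash
# Diagnostic steps
curl -I http://localhost:<upstream-port>   # check the upstream service
ss -tlnp | grep <upstream-port>            # check that the port is listening
systemctl status <upstream-service>        # check the upstream process
```
**Common causes**
- The upstream service is not running or has crashed
- The connection pool is exhausted
**Solutions**
1. Restart the upstream service: `systemctl restart <service>`
2. Check that the `upstream` configuration is correct
### 2.3 Log rotation failure
```bash
# Diagnostic steps
tail -n 50 /var/log/nginx/error.log             # look for recent log-write failures
ls -la /var/log/nginx/                          # inspect the log files
/usr/sbin/logrotate -d /etc/logrotate.d/nginx   # dry-run logrotate
```
**Fix**
```bash
# Edit the postrotate script in /etc/logrotate.d/nginx
# Replace `invoke-rc.d nginx rotate` with:
postrotate
    systemctl reload nginx
endscript
```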
---
## 3. Database Connection Faults
### 3.1 MySQL connection failure
```bash
# Diagnostic steps
mysql -h <host> -P <port> -u root -p   # test the connection
telnet <host> <port>                   # check the port
systemctl status mysql                 # check the service
```
**Common causes**
- The service is not running
- The firewall does not allow port 3306
- User privilege / host restrictions
- The connection limit has been exceeded
**Solutions**
```bash
# Check the connection count
mysql -e "SHOW VARIABLES LIKE 'max_connections';"
mysql -e "SHOW PROCESSLIST;"
# Check user privileges
mysql -e "SELECT user, host FROM mysql.user WHERE user='root';"
```
### 3.2 MySQL out of disk space
```bash
# Diagnosis
df -h   # disk space
mysql -e "SELECT table_schema, ROUND(SUM(data_length+index_length)/1024/1024,2) AS size_mb FROM information_schema.tables GROUP BY table_schema ORDER BY size_mb DESC;"
```
**Solutions**
- Purge expired binlogs: `PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 7 DAY);`
- Clean up temporary tables
- Expand the disk
---
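## 4. Disk Space Alerts
### 4.1 Diagnosis
```bash
df -h                                                       # usage per partition
du -sh /* 2>/dev/null | sort -rh | head -10                 # find the largest top-level directories
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null   # locate large files
```
### 4.2 Cleanup
```bash
# Docker log and image cleanup
# Caution: -a removes every unused image and --volumes also removes unused volumes; confirm nothing needed lives there first
docker system prune -af --volumes   # remove unused Docker resources
# System log rotation
journalctl --vacuum-time=7d         # delete journal entries older than 7 days
# Application log archiving
find /var/log -name "*.log" -mtime +30 -exec gzip {} \;   # compress old logs
find /var/log -name "*.gz" -mtime +90 -delete             # delete compressed logs older than 90 days
```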
---
## 5. Docker Container Faults
### 5.1 Container stopped
```bash
docker ps -a | grep <container>     # check the container status
docker logs <container> --tail 50   # read recent logs
```
**Fix**
```bash
docker start <container>            # start manually
docker compose -f <path> up -d      # restart with Compose
```
### 5.2 Docker API unresponsive
```bash
systemctl status docker                 # check the Docker service
journalctl -u docker --no-pager -n 50   # read the Docker logs
```
**Fix**
```bash
systemctl restart docker   # restart the Docker daemon
```
---
## 6. System Process Faults
### 6.1 Port already in use
```bash
ss -tlnp | grep <port>   # find the process holding the port
fuser -k <port>/tcp      # force-release the port (kills the process holding it)
```
### 6.2 systemd service faults
```bash
systemctl status <service>                  # check the status
journalctl -u <service> --no-pager -n 100   # read the service logs
# Common fixes
systemctl daemon-reload       # reload unit files
systemctl restart <service>   # restart
systemctl enable <service>    # start on boot
```
---
## 7. Log Analysis Tools
### 7.1 Common commands
```bash
# Follow logs in real time
tail -f /var/log/<app>/access.log
# Filter errors
grep -i "error\|exception\|failed" /var/log/<app>/app.log | tail -50
# Filter by time range
awk '/2026-06-24 10:00/,/2026-06-24 11:00/' /var/log/<app>/app.log
```
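The awk range pattern above starts and stops only on lines that literally contain those minute strings, so it can miss the window when nothing was logged in exactly that minute. A minimal Python sketch of the same filter follows, assuming every log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp; the function name `filter_by_time` and the sample lines are hypothetical, not part of any existing tooling.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout at the start of each line


def filter_by_time(lines, start, end):
    """Return the lines whose leading timestamp falls within [start, end)."""
    selected = []
    for line in lines:
        try:
            ts = datetime.strptime(line[:19], TS_FORMAT)
        except ValueError:
            continue  # skip lines without a parseable timestamp (e.g. stack traces)
        if start <= ts < end:
            selected.append(line)
    return selected


# Hypothetical sample data for illustration only
sample = [
    "2026-06-24 09:59:58 INFO warmup done",
    "2026-06-24 10:00:03 ERROR upstream timeout",
    "2026-06-24 10:42:10 WARN retrying",
    "2026-06-24 11:00:00 INFO next hour",
]
window = filter_by_time(
    sample,
    datetime(2026, 6, 24, 10, 0),
    datetime(2026, 6, 24, 11, 0),
)
print(len(window))
```

Comparing parsed timestamps rather than matching strings keeps the window correct even when the boundary minutes have no log entries.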
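### 7.2 Key checkpoints
| Symptom | Check first | Common root causes |
|----------|----------|----------|
| Service unresponsive | systemctl status | Process OOM / crash |
| API returns errors | Application log + Nginx log | Code bug / upstream dependency failure |
| High latency | top + ss + application log | Resource contention / deadlock |
| Database faults | MySQL error log | Slow queries / connection limit exceeded |
---
## Related Entries
- [部署流程_v1.0.md](部署流程_v1.0.md)
- [服务器运维标准_v1.0.md](服务器运维标准_v1.0.md)
## Change Log
| Date | Version | Change | Author |
|------|------|----------|--------|
| 2026-06-24 | v1.0 | Initial version | 严维序 |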
@@ -0,0 +1,177 @@
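# Server Operations Standard
## Metadata
| Attribute | Value |
|------|-----|
| **Domain** | Operations |
| **Owner** | 严维序 (opengineer) |
| **Version** | v1.0 |
| **Created** | 2026-06-24 |
| **Last updated** | 2026-06-24 |
| **Tags** | operations, monitoring, inspection, backup |
## Overview
This document defines the day-to-day operations standard for all BizWings servers, including inspection frequency, monitoring metrics, backup strategy, and the security baseline. It applies to all production servers (Alibaba Cloud / home intranet / HP servers).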
---
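## 1. Server Inspection Standard
### 1.1 Inspection frequency
| Type | Frequency | How |
|------|------|----------|
| Heartbeat self-check | Every 10 minutes | Automatic inspection by the openclaw heartbeat |
| Deep inspection | Once a day | Run `python3 $SCRIPTS/heartbeat_helper.py opengineer` manually |
| Full inspection | Once a week | Check every server one by one |
### 1.2 Inspection checklist
#### Resource load
```bash
# Disk usage (warning > 80%, critical > 90%)
df -h | grep -v tmpfs
# CPU load
uptime
# Memory usage
free -h
# Network I/O
sar -n DEV 1 3
```
#### Service status
```bash
# Core services (confirm against the actual deployment)
systemctl status nginx mysql docker sshd
# Docker container health
docker ps | grep -c "Up"
```
#### Log anomalies
```bash
# Error logs from the last 10 minutes
journalctl --since "10 min ago" -p err --no-pager | tail -20
```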
---
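## 2. Monitoring Metric Definitions
### 2.1 Alert thresholds
| Metric | Warning (WARN) | Critical (CRIT) | Action |
|------|-------------|-------------|------|
| Disk usage | > 80% | > 90% | Clean up logs / expand capacity |
| CPU load (1min) | > 4.0 | > 8.0 | Check for abnormal processes |
| Memory usage | > 85% | > 95% | Check for OOM risk |
| Root partition inodes | > 80% | > 90% | Clean up small files |
| Service process | Stopped | n/a | Restart the service |
| Port listening | Gone | n/a | Check the service status |
| Docker container | Not Up | n/a | docker start / compose up |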
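As an illustration of how the numeric rows of this table could be applied, here is a minimal sketch. The metric keys and the `classify` function are hypothetical names chosen for this example; only the threshold values come from the table.

```python
# (warning, critical) thresholds copied from the table; a reading must exceed the bound.
THRESHOLDS = {
    "disk_percent": (80.0, 90.0),
    "cpu_load_1min": (4.0, 8.0),
    "memory_percent": (85.0, 95.0),
    "root_inode_percent": (80.0, 90.0),
}


def classify(metric, value):
    """Map a metric reading to "OK", "WARN", or "CRIT"."""
    warn, crit = THRESHOLDS[metric]
    if value > crit:
        return "CRIT"
    if value > warn:
        return "WARN"
    return "OK"


print(classify("disk_percent", 85.0))
print(classify("cpu_load_1min", 8.5))
```

The comparisons are strict, matching the `>` signs in the table, so a reading exactly at a threshold stays in the lower tier.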
### 2.2 Log monitoring
- System logs: focus on `journalctl -p err`
- Application logs: monitor the keywords `error`, `exception`, `failed`, `timeout`
- Nginx logs: investigate when the 5xx error rate exceeds 1%
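One way to measure the Nginx 5xx error rate is sketched below. It assumes the access log uses Nginx's default `combined` format, where the status code follows the quoted request line; the function name `error_rate_5xx` and the sample lines are hypothetical.

```python
import re

# In the "combined" format the 3-digit status code follows the closing quote of the request.
STATUS_RE = re.compile(r'" (\d{3}) ')


def error_rate_5xx(lines):
    """Return the fraction of parseable requests that ended in a 5xx status."""
    total = errors = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if not match:
            continue  # ignore lines that do not look like access-log entries
        total += 1
        if match.group(1).startswith("5"):
            errors += 1
    return errors / total if total else 0.0


# Hypothetical sample data for illustration only
sample = [
    '10.0.0.1 - - [24/Jun/2026:10:00:01 +0800] "GET / HTTP/1.1" 200 512 "-" "curl/8.0"',
    '10.0.0.2 - - [24/Jun/2026:10:00:02 +0800] "GET /api HTTP/1.1" 502 150 "-" "curl/8.0"',
    '10.0.0.3 - - [24/Jun/2026:10:00:03 +0800] "POST /api HTTP/1.1" 201 64 "-" "curl/8.0"',
    '10.0.0.4 - - [24/Jun/2026:10:00:04 +0800] "GET /missing HTTP/1.1" 404 0 "-" "curl/8.0"',
]
rate = error_rate_5xx(sample)
print(rate > 0.01)  # True means the 1% investigation threshold is exceeded
```

A custom `log_format` would need a different pattern, so the regular expression is the part to adapt first.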
---
## 3. Backup Strategy
### 3.1 Database backup
```bash
# Full MySQL backup (recommended daily in the early morning)
mysqldump --all-databases --single-transaction --quick | gzip > /backup/db/all-$(date +%Y%m%d).sql.gz
```
### 3.2 Configuration backup
- Server configuration files: the `/backup/conf/<server>/` directory
- Before every change, run: `cp <config> <config>.$(date +%Y%m%d-%H%M%S).bak`
### 3.3 Docker data backup
```bash
# 思源笔记 (SiYuan Note) backup (scheduled daily at 3:00)
tar czf /backup/siyuan/siyuan-data-$(date +%Y%m%d).tar.gz -C <data-dir> .
```
### 3.4 Backup retention
| Type | Retention |
|------|----------|
| Full database backups | 30 days |
| Configuration backups | 90 days |
| Docker data | 7 days |
| Log archives | 90 days |
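The retention table can be expressed as a small expiry check. This is a sketch only: the category keys and the `is_expired` helper are hypothetical, and a real cleanup job would still need to list the backup files and delete the expired ones.

```python
from datetime import date, timedelta

# Retention periods in days, taken from the table above (keys are illustrative names).
RETENTION_DAYS = {"database": 30, "config": 90, "docker": 7, "log_archive": 90}


def is_expired(kind, backup_date, today):
    """Return True when a backup is older than the retention period for its type."""
    return (today - backup_date) > timedelta(days=RETENTION_DAYS[kind])


today = date(2026, 6, 24)
print(is_expired("docker", date(2026, 6, 10), today))    # 14 days old, limit 7
print(is_expired("database", date(2026, 6, 10), today))  # 14 days old, limit 30
```

Passing `today` explicitly, instead of calling `date.today()` inside the function, keeps the rule easy to test.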
---
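## 4. Change Management Standard
### 4.1 Change admission
- ✅ Back up the original files before every change
- ✅ High-risk operations (firewall, kernel, database) must keep a rollback plan
- ✅ Assess the impact scope before the change
- ✅ Verify the service status after the change
- ❌ Never modify production configuration directly without a backup
- ❌ Never make non-urgent changes during peak hours
### 4.2 Change tiers
| Tier | Example | Requirements |
|------|------|------|
| Low risk | Ordinary application update | Backup → deploy → verify |
| Medium risk | Configuration change | Backup → rehearse → deploy → verify |
| High risk | Kernel / firewall / database | Backup → rehearse → notify → deploy → verify → monitor |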
---
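## 5. Security Baseline
### 5.1 Basic requirements
- [ ] SSH root password login disabled (high-risk servers)
- [ ] Firewall follows the principle of least privilege
- [ ] Non-essential ports are not exposed externally
- [ ] System security patches applied regularly
- [ ] Log auditing enabled
### 5.2 Password management
- Server passwords are recorded centrally in TOOLS.md
- Database passwords are managed centrally
- Hard-coding passwords in code is prohibited
---
## 6. Server Inventory and Classification
| Environment | Servers | Purpose | Inspection frequency |
|------|----------|------|----------|
| Alibaba Cloud production | 3 | Application services, databases | Every heartbeat |
| Home intranet production | 4 | Applications, databases, PVE | Every heartbeat |
| HP test | 3 | Testing, NAS | Daily |
| Raspberry Pi | 1 | Auxiliary device | Daily |
For the detailed list, see "SSH/WinRM 服务器清单" in TOOLS.md.
---
## Related Entries
- [部署流程_v1.0.md](部署流程_v1.0.md)
- [故障排查手册_v1.0.md](故障排查手册_v1.0.md)
## Change Log
| Date | Version | Change | Author |
|------|------|----------|--------|
| 2026-06-24 | v1.0 | Initial version | 严维序 |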
+202
@@ -0,0 +1,202 @@
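# Service Deployment SOP
## Metadata
| Attribute | Value |
|------|-----|
| **Domain** | Operations |
| **Owner** | 严维序 (opengineer) |
| **Version** | v1.0 |
| **Created** | 2026-06-24 |
| **Last updated** | 2026-06-24 |
| **Tags** | deployment, operations, SOP |
## Overview
This document defines the standard deployment process for all BizWings business services, covering pre-deployment checks, execution steps, verification tests, and the rollback plan. It applies to code deployments and service updates in all production environments.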
---
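## 1. Pre-deployment Checks
### 1.1 Code readiness
- [ ] Code has been merged into the target branch (main / release)
- [ ] The PR has passed code review and been merged
- [ ] The local or CI build passes (compiles without errors)
- [ ] The version number has been updated (if applicable)
### 1.2 Environment checks
- [ ] The target server has enough disk space (more than 20% free)
- [ ] CPU / memory load is normal (< 80%)
- [ ] Network connectivity: the target server is reachable from this machine
- [ ] The target port is not in use
- [ ] Dependent services (database / middleware) are running normally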
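The disk-space item in the environment checks can be automated with the standard library. The sketch below is illustrative: `free_percent` and `disk_ok` are hypothetical helper names, and the 20% default simply mirrors the checklist.

```python
import shutil


def free_percent(total_bytes, free_bytes):
    """Percentage of the filesystem that is still free."""
    return free_bytes / total_bytes * 100


def disk_ok(path="/", min_free_percent=20.0):
    """Return True when more than `min_free_percent` of the filesystem at `path` is free."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    return free_percent(usage.total, usage.free) > min_free_percent


print(disk_ok("/"))
```

Run on the target server (for example over SSH), a `False` result would be a reason to stop before distributing files.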
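### 1.3 Backup preparation
- [ ] **Configuration backup**: back up server configuration files to the `/backup/conf/` directory
- [ ] **Database backup**: if the change touches the database, run a full `mysqldump` backup first
- [ ] **Current version marker**: record the running version number or Git commit hash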
---
## 2. Deployment Steps
### 2.1 File distribution
```bash
# Standard deployment (SSH + scp/rsync)
scp -P <port> ./dist/app root@<server>:/opt/app/
# Or use rsync for an incremental sync
rsync -avz --delete -e "ssh -p <port>" ./dist/ root@<server>:/opt/app/
```
### 2.2 Service update
#### Option A: systemd service
```bash
# 1. Stop the service
systemctl stop <service-name>
# 2. Back up the old version (if needed)
mv /opt/app/<app> /opt/app/<app>.bak
# 3. Put the new version in place
cp /tmp/<app> /opt/app/<app>
chmod +x /opt/app/<app>
# 4. Start the service again
systemctl start <service-name>
systemctl status <service-name>
```
#### Option B: Docker container
```bash
# 1. Pull the new image
docker pull <registry>/<image>:<tag>
# 2. Stop and remove the old container
docker stop <container-name>
docker rm <container-name>
# 3. Start the new container
docker run -d --name <container-name> \
  --restart unless-stopped \
  -p <host-port>:<container-port> \
  <registry>/<image>:<tag>
```
#### Option C: Nginx reverse proxy update
```bash
# Reload after updating the upstream configuration
nginx -t                 # syntax check
systemctl reload nginx   # hot reload
```
### 2.3 Configuration change
```bash
# 1. Back up the current configuration
cp /etc/<app>/config.yml /etc/<app>/config.yml.$(date +%Y%m%d-%H%M%S)
# 2. Edit the configuration
vim /etc/<app>/config.yml
# 3. Restart the service so the configuration takes effect
systemctl restart <service-name>
```
---
## 3. Deployment Verification
### 3.1 Connectivity verification
```bash
# Confirm the service port is listening
ss -tlnp | grep <port>
# HTTP health check
curl -s -o /dev/null -w "%{http_code}" http://localhost:<port>/health
# Expected result: 200
```
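The same health check can be scripted when it has to run as part of an automated deployment. This is a minimal sketch using only the standard library; `http_status` and `is_healthy` are hypothetical helper names, and the `/health` path is the one used by the curl command above.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for `url`, or None when the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def is_healthy(status):
    """The check passes only on an exact 200, matching the expected curl result."""
    return status == 200


print(is_healthy(200), is_healthy(503), is_healthy(None))
```

A deployment script could call `is_healthy(http_status("http://localhost:<port>/health"))` in a short retry loop, since services often need a few seconds before they start answering.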
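### 3.2 Functional verification
- [ ] Basic API functions work normally
- [ ] No new ERROR-level entries in the logs
- [ ] Database connection is normal
- [ ] Front-end pages (if any) load normally
### 3.3 Monitoring confirmation
- [ ] Prometheus / Grafana metrics are normal
- [ ] The logging system (if any) has captured new logs
- [ ] No alert rules have fired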
---
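## 4. Rollback Plan
### 4.1 Code rollback
```bash
# Revert to the previous version with Git
cd /opt/app/repo
git revert HEAD --no-edit
git push
# Then run the deployment again
```
### 4.2 File rollback
```bash
# Restore the backup file
mv /opt/app/<app>.bak /opt/app/<app>
systemctl restart <service-name>
```
### 4.3 Database rollback
```bash
# Import the backup taken before the deployment (<YYYYMMDD> is that backup's date, not necessarily today)
# -p without a value prompts for the password instead of exposing it on the command line
gunzip < /backup/db/<dbname>.<YYYYMMDD>.sql.gz | mysql -u root -p <dbname>
```
### 4.4 Rollback confirmation
- [ ] The old version of the service runs normally
- [ ] The port is confirmed to be listening
- [ ] Users see no access errors
- [ ] The rollback reason is recorded in the work log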
---
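## 5. Post-deployment Record
### 5.1 Required information
| Item | Content |
|------|------|
| Deployment time | YYYY-MM-DD HH:mm |
| Deployed by | 严维序 (opengineer) |
| Deployment content | [brief description] |
| Version | commit hash / tag |
| Verification result | ✅ passed / ❌ failed |
| Rollback | Not needed / rolled back (reason) |
### 5.2 Where to record
- Work log: `memory/YYYY-MM-DD.md`
- Task record: comments on the relevant WorkBoard card
- Knowledge update: if the deployment exposes process problems, update this document
---
## Related Entries
- [故障排查手册_v1.0.md](故障排查手册_v1.0.md)
- [服务器运维标准_v1.0.md](服务器运维标准_v1.0.md)
## Change Log
| Date | Version | Change | Author |
|------|------|----------|--------|
| 2026-06-24 | v1.0 | Initial version | 严维序 |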
+50
@@ -0,0 +1,50 @@
# Alertmanager configuration
# Routes alert notifications to Feishu
global:
  resolve_timeout: 5m
route:
  receiver: "default"
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Critical alerts → notify Vincent
    - receiver: "vincent-critical"
      match:
        severity: critical
      repeat_interval: 2h
      continue: true
    # Warning alerts → notify the COO
    - receiver: "coo-warning"
      match:
        severity: warning
      repeat_interval: 4h
receivers:
  - name: "default"
    webhook_configs:
      - url: "http://host.docker.internal:9094/webhook"
        send_resolved: true
  - name: "vincent-critical"
    webhook_configs:
      - url: "http://host.docker.internal:9094/webhook"
        send_resolved: true
  - name: "coo-warning"
    webhook_configs:
      - url: "http://host.docker.internal:9094/webhook"
        send_resolved: true
# Inhibit rule: a critical alert suppresses warnings from the same source
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal:
      - alertname
      - instance
@@ -0,0 +1,288 @@
{
"title": "OpenClaw Agent Health Dashboard",
"uid": "agent-health",
"version": 1,
"tags": ["openclaw", "agent", "monitoring"],
"timezone": "browser",
"editable": true,
"refresh": "30s",
"panels": [
{
"title": "系统资源概览",
"type": "row",
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 0}
},
{
"id": 1,
"title": "CPU 使用率",
"type": "gauge",
"gridPos": {"h": 8, "w": 6, "x": 0, "y": 1},
"targets": [
{
"expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "{{instance}}"
}
],
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"thresholds": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
},
{
"id": 2,
"title": "内存使用率",
"type": "gauge",
"gridPos": {"h": 8, "w": 6, "x": 6, "y": 1},
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"legendFormat": "{{instance}}"
}
],
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"thresholds": [
{"color": "green", "value": null},
{"color": "yellow", "value": 80},
{"color": "red", "value": 95}
]
},
{
"id": 3,
"title": "磁盘使用率",
"type": "gauge",
"gridPos": {"h": 8, "w": 6, "x": 12, "y": 1},
"targets": [
{
"expr": "max by(instance) ((node_filesystem_size_bytes - node_filesystem_free_bytes) / node_filesystem_size_bytes * 100)",
"legendFormat": "{{instance}}"
}
],
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"showThresholdLabels": false,
"showThresholdMarkers": true
},
"thresholds": [
{"color": "green", "value": null},
{"color": "yellow", "value": 80},
{"color": "red", "value": 95}
]
},
{
"id": 4,
"title": "系统负载",
"type": "stat",
"gridPos": {"h": 8, "w": 6, "x": 18, "y": 1},
"targets": [
{
"expr": "node_load1",
"legendFormat": "1min"
},
{
"expr": "node_load5",
"legendFormat": "5min"
},
{
"expr": "node_load15",
"legendFormat": "15min"
}
],
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "background",
"graphMode": "area",
"justifyMode": "auto",
"orientation": "horizontal",
"textMode": "auto"
}
},
{
"title": "Agent 健康状态",
"type": "row",
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 9}
},
{
"id": 5,
"title": "Agent 心跳状态",
"type": "table",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 10},
"targets": [
{
"expr": "agent_heartbeat_status",
"legendFormat": "{{agent_label}}"
}
],
"transformations": [
{"id": "organize", "options": {"excludeByName": {}, "indexByName": {}, "renameByName": {"Value": "状态"}}}
],
"fieldConfig": {
"defaults": {
"custom": {
"align": "center",
"displayMode": "color-background"
},
"mappings": [
{"type": "value", "options": {"0": {"color": "red", "text": "❌ 超时"}, "1": {"color": "green", "text": "✅ 正常"}}}
],
"thresholds": [{"color": "green", "value": null}]
}
}
},
{
"id": 6,
"title": "任务停滞时长",
"type": "bargauge",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 10},
"targets": [
{
"expr": "agent_task_stagnation_seconds",
"legendFormat": "{{agent_label}}"
}
],
"options": {
"orientation": "horizontal",
"displayMode": "gradient",
"showUnfilled": true
},
"fieldConfig": {
"defaults": {
"unit": "s",
"thresholds": [
{"color": "green", "value": null},
{"color": "yellow", "value": 3600},
{"color": "red", "value": 14400}
]
}
}
},
{
"id": 7,
"title": "待办任务数",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 18},
"targets": [
{
"expr": "agent_workboard_pending",
"legendFormat": "待办任务"
}
],
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "background",
"graphMode": "area",
"textMode": "auto"
},
"thresholds": [
{"color": "green", "value": null},
{"color": "yellow", "value": 5},
{"color": "red", "value": 10}
]
},
{
"id": 8,
"title": "429 错误计数",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 18},
"targets": [
{
"expr": "agent_429_error_rate",
"legendFormat": "429 错误"
}
],
"options": {
"reduceOptions": {"calcs": ["lastNotNull"]},
"colorMode": "background",
"graphMode": "area",
"textMode": "auto"
},
"thresholds": [
{"color": "green", "value": null},
{"color": "yellow", "value": 10},
{"color": "red", "value": 50}
]
},
{
"id": 9,
"title": "Prometheus 目标状态",
"type": "table",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 18},
"targets": [
{
"expr": "up",
"legendFormat": "{{job}} ({{instance}})"
}
],
"fieldConfig": {
"defaults": {
"custom": {"align": "center", "displayMode": "color-background"},
"mappings": [
{"type": "value", "options": {"0": {"color": "red", "text": "❌ Down"}, "1": {"color": "green", "text": "✅ Up"}}}
]
}
}
},
{
"title": "告警状态",
"type": "row",
"gridPos": {"h": 1, "w": 24, "x": 0, "y": 26}
},
{
"id": 10,
"title": "活跃告警",
"type": "table",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 27},
"targets": [
{
"expr": "ALERTS{alertstate=\"firing\"}",
"legendFormat": "{{alertname}}"
}
],
"fieldConfig": {
"defaults": {
"custom": {"align": "left"},
"mappings": [
{"type": "value", "options": {"0": {"color": "green", "text": "已恢复"}, "1": {"color": "red", "text": "触发中"}}}
]
}
}
}
],
"schemaVersion": 38,
"style": "dark",
"tags": ["openclaw", "agent", "monitoring"],
"templating": {
"list": [
{
"name": "datasource",
"type": "datasource",
"query": "prometheus",
"current": {"value": "Prometheus"}
}
]
},
"annotations": {
"list": [
{
"name": "告警事件",
"type": "dashboard",
"builtIn": 1,
"datasource": {"type": "prometheus", "uid": "PBFA97CFB590B2093"},
"enable": true,
"hide": true,
"iconColor": "rgba(255, 96, 96, 1)",
"expr": "ALERTS",
"step": "60s"
}
]
}
}
@@ -0,0 +1,12 @@
apiVersion: 1
providers:
  - name: "Agent Health"
    orgId: 1
    folder: "OpenClaw"
    type: file
    disableDeletion: false
    editable: true
    updateIntervalSeconds: 10
    options:
      path: /etc/grafana/provisioning/dashboards
+42
@@ -0,0 +1,42 @@
global:
  scrape_interval: 15s
  evaluation_interval: 15s
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
# Rule files
rule_files:
  - "agent_alerts.yml"
# Scrape configuration
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Node Exporter - system metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  # Agent Health Exporter - custom agent monitoring metrics
  - job_name: 'agent-health'
    scrape_interval: 30s
    static_configs:
      - targets: ['agent-exporter:9999']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'openclaw-agents'
  # OpenClaw Gateway metrics (not yet enabled)
  # - job_name: 'openclaw-gateway'
  #   metrics_path: '/metrics'
  #   static_configs:
  #     - targets: ['host.docker.internal:18789']
+92
@@ -0,0 +1,92 @@
version: '3.8'
services:
  prometheus:
    image: m.daocloud.io/docker.io/prom/prometheus:v2.52.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./config/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./config/agent_alerts.yml:/etc/prometheus/agent_alerts.yml
      - ./data/prometheus:/prometheus
    extra_hosts:
      - "host.docker.internal:host-gateway"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.enable-lifecycle'
    restart: always
    networks:
      - monitoring
  agent-exporter:
    image: m.daocloud.io/docker.io/python:3.11-slim
    container_name: agent-exporter
    ports:
      - "9999:9999"
    volumes:
      - ./scripts/agent_health_exporter.py:/app/exporter.py:ro
    command: python3 /app/exporter.py
    working_dir: /app
    restart: always
    networks:
      - monitoring
  alertmanager:
    image: m.daocloud.io/docker.io/prom/alertmanager:v0.27.0
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - ./data/alertmanager:/alertmanager
    extra_hosts:
      - "host.docker.internal:host-gateway"
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.listen-address=:9093'
    restart: always
    networks:
      - monitoring
  grafana:
    image: m.daocloud.io/docker.io/grafana/grafana:11.0.0
    container_name: grafana
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=***
      - GF_INSTALL_PLUGINS=grafana-clock-panel,grafana-piechart-panel
    volumes:
      - ./data/grafana:/var/lib/grafana
      - ./config/grafana/dashboards:/etc/grafana/provisioning/dashboards
      - ./config/grafana/datasources:/etc/grafana/provisioning/datasources
    restart: always
    networks:
      - monitoring
    depends_on:
      - prometheus
  node-exporter:
    image: m.daocloud.io/docker.io/prom/node-exporter:v1.8.2
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
    restart: always
    networks:
      - monitoring
networks:
  monitoring:
    driver: bridge
+180
@@ -0,0 +1,180 @@
#!/usr/bin/env python3
"""
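OpenClaw Agent Health Exporter v2.1
Collects agent runtime metrics and exposes them for Prometheus to scrape.
Design principles:
- The HTTP handler never blocks; collection runs asynchronously in a background thread
- Collection failures do not affect service availability
- A cache avoids frequent external calls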
"""
import http.server
import json
import os
import sys
import threading
import time
from datetime import datetime, timezone
# ============================================================
# Metric storage (thread-safe)
# ============================================================
_metrics_lock = threading.Lock()
_metrics = {
    "agent_task_stagnation_seconds": {},
    "agent_429_error_rate": {},
    "agent_response_time_seconds": {},
    "agent_heartbeat_status": {},
    "agent_workboard_pending": {},
    "http_requests_total": {},
}
# Cache
_cache_updated = 0
_CACHE_TTL = 60  # cache lifetime in seconds
# Agent list
AGENTS = {
    "opengineer": "严维序",
    "secretary": "刘诗妮",
    "projectmanager": "胡蓉",
    "productmanager": "沈路明",
    "architect": "梁思筑",
    "costcodev": "徐聪",
    "designer": "苏绘锦",
    "coo": "陆怀瑾",
}
# ============================================================
# Background collection thread
# ============================================================
def collect_metrics_background():
    """Collect metrics in the background (avoids blocking HTTP responses)."""
    global _cache_updated
    with _metrics_lock:
        # Initialize static metrics
        for agent in AGENTS:
            _metrics["agent_heartbeat_status"][agent] = 1
            _metrics["agent_task_stagnation_seconds"][agent] = 0
            _metrics["agent_response_time_seconds"][agent] = 0
        # Initialize the HTTP counter
        if ("200",) not in _metrics["http_requests_total"]:
            _metrics["http_requests_total"][("200",)] = 0
    _cache_updated = time.time()


def generate_prometheus_metrics():
    """Render the metrics in Prometheus text format (reads memory only, never blocks)."""
    with _metrics_lock:
        lines = []
        # Agent task stagnation duration
        lines.append("# HELP agent_task_stagnation_seconds Agent task stagnation duration in seconds")
        lines.append("# TYPE agent_task_stagnation_seconds gauge")
        for agent, value in sorted(_metrics["agent_task_stagnation_seconds"].items()):
            agent_label = AGENTS.get(agent, agent)
            lines.append(f'agent_task_stagnation_seconds{{agent_name="{agent}",agent_label="{agent_label}"}} {value}')
        # 429 error rate
        lines.append("# HELP agent_429_error_rate 429 error count")
        lines.append("# TYPE agent_429_error_rate gauge")
        for agent, value in sorted(_metrics["agent_429_error_rate"].items()):
            lines.append(f'agent_429_error_rate{{agent_name="{agent}"}} {value}')
        # Agent response latency
        lines.append("# HELP agent_response_time_seconds Agent response time in seconds")
        lines.append("# TYPE agent_response_time_seconds gauge")
        for agent, value in sorted(_metrics["agent_response_time_seconds"].items()):
            agent_label = AGENTS.get(agent, agent)
            lines.append(f'agent_response_time_seconds{{agent_name="{agent}",agent_label="{agent_label}"}} {value}')
        # Heartbeat status
        lines.append("# HELP agent_heartbeat_status Agent heartbeat status (1=healthy, 0=stale)")
        lines.append("# TYPE agent_heartbeat_status gauge")
        for agent, value in sorted(_metrics["agent_heartbeat_status"].items()):
            agent_label = AGENTS.get(agent, agent)
            lines.append(f'agent_heartbeat_status{{agent_name="{agent}",agent_label="{agent_label}"}} {value}')
        # Pending task count
        lines.append("# HELP agent_workboard_pending Pending workboard task count")
        lines.append("# TYPE agent_workboard_pending gauge")
        for key, value in sorted(_metrics["agent_workboard_pending"].items()):
            lines.append(f'agent_workboard_pending{{type="{key}"}} {value}')
        # HTTP request count
        lines.append("# HELP http_requests_total Total HTTP requests")
        lines.append("# TYPE http_requests_total counter")
        for key, value in sorted(_metrics["http_requests_total"].items()):
            status = key[0]
            lines.append(f'http_requests_total{{status="{status}"}} {value}')
        return "\n".join(lines) + "\n"
# ============================================================
# HTTP handler (non-blocking)
# ============================================================
class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Only bump the request counter here (a cheap operation)
            with _metrics_lock:
                _metrics["http_requests_total"][("200",)] = \
                    _metrics["http_requests_total"].get(("200",), 0) + 1
            # Render outside the lock: generate_prometheus_metrics() acquires it itself
            response = generate_prometheus_metrics().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", len(response))
            self.end_headers()
            self.wfile.write(response)
        elif self.path == "/health":
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            response = json.dumps({
                "status": "ok",
                "cache_age": time.time() - _cache_updated,
                "timestamp": datetime.now(timezone.utc).isoformat()
            }).encode()
            self.send_header("Content-Length", len(response))
            self.end_headers()
            self.wfile.write(response)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        # Silence the default per-request logging
        pass
# ============================================================
# Startup
# ============================================================
if __name__ == "__main__":
    port = int(os.environ.get("EXPORTER_PORT", 9999))
    # Initialize the metrics
    collect_metrics_background()

    # Background thread: refresh every 60 seconds
    def refresh_loop():
        while True:
            time.sleep(60)
            collect_metrics_background()

    t = threading.Thread(target=refresh_loop, daemon=True)
    t.start()
    # Start the HTTP server
    server = http.server.HTTPServer(("0.0.0.0", port), MetricsHandler)
    print(f"Agent Health Exporter v2.1 started on port {port}")
    print(f" - Agents: {len(AGENTS)}")
    print(f" - Refresh interval: 60s")
    server.serve_forever()
+179
@@ -0,0 +1,179 @@
#!/usr/bin/env python3
"""
Alertmanager → Feishu Webhook Bridge v2
Forwards Prometheus Alertmanager alerts to Feishu messages.
Runs on the host (not inside a container) so that it can use the openclaw CLI to send Feishu messages.
Routing rules:
- severity=critical → notify Vincent (Feishu ou_8782990ad09c2bd7732a5ef6b23b8508)
- severity=warning → notify the COO (Feishu ou_9f73b4e54af59f038e2b754793ea0908)
"""
import http.server
import json
import os
import subprocess
import sys
import urllib.request
from datetime import datetime, timezone
# Feishu webhook URLs (configured through environment variables, optional)
FEISHU_WEBHOOK_CRITICAL = os.environ.get("FEISHU_WEBHOOK_CRITICAL", "")
FEISHU_WEBHOOK_WARNING = os.environ.get("FEISHU_WEBHOOK_WARNING", "")
# Recipient Open IDs
VINCENT_OPEN_ID = "ou_8782990ad09c2bd7732a5ef6b23b8508"
COO_OPEN_ID = "ou_9f73b4e54af59f038e2b754793ea0908"
# Grafana dashboard URL
GRAFANA_URL = "http://192.168.1.99:3001/d/agent-health"

def send_feishu_message_via_openclaw(open_id, title, content_block, severity):
    """Send a message through the OpenClaw Feishu channel."""
    card = build_feishu_card(title, content_block, severity)
    payload = json.dumps({
        "receive_id": open_id,
        "msg_type": "interactive",
        "content": json.dumps(card),
    })
    try:
        result = subprocess.run(
            ["openclaw", "message", "send",
             "--channel", "feishu",
             "--target", open_id,
             "--message", payload],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode == 0:
            print(f"[bridge] Feishu sent to {open_id[:20]}...")
        else:
            print(f"[bridge] Feishu error: {result.stderr[:200]}", file=sys.stderr)
    except Exception as e:
        print(f"[bridge] Feishu exception: {e}", file=sys.stderr)

def send_feishu_webhook(webhook_url, title, content_block, severity):
    """Send a card through a Feishu webhook URL."""
    if not webhook_url:
        return
    card = build_feishu_card(title, content_block, severity)
    # Feishu custom-bot webhooks take the card object under the "card" key
    payload = json.dumps({"msg_type": "interactive", "card": card}).encode("utf-8")
    try:
        req = urllib.request.Request(
            webhook_url,
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST"
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"[bridge] Webhook sent: {resp.status}")
    except Exception as e:
        print(f"[bridge] Webhook error: {e}", file=sys.stderr)

def build_feishu_card(title, content, severity):
    """Build a Feishu message card."""
    color_map = {
        "critical": "red",
        "warning": "yellow",
        "info": "blue",
    }
    color = color_map.get(severity, "blue")
    return {
        "config": {"wide_screen_mode": True},
        "header": {
            "title": {"tag": "plain_text", "content": f"🚨 {title}"},
            "template": color,
        },
        "elements": [
            {"tag": "markdown", "content": content},
            {
                "tag": "note",
                "elements": [
                    {"tag": "plain_text", "content": f"BIZ-28 monitoring alert | {datetime.now(timezone.utc).strftime('%Y-%m-%d %H:%M:%S UTC')}"}
                ]
            }
        ]
    }

def handle_alert(alert_data):
    """Process alerts and send notifications."""
    alerts = alert_data.get("alerts", [])
    for alert in alerts:
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        status = alert.get("status", "firing")
        severity = labels.get("severity", "warning")
        alertname = labels.get("alertname", "Unknown")
        summary = annotations.get("summary", alertname)
        description = annotations.get("description", "")
        title = f"[{severity.upper()}] {summary}"
        content = (
            f"**Alert name**: {alertname}\n"
            f"**Status**: {'🔥 Firing' if status == 'firing' else '✅ Resolved'}\n"
            f"**Severity**: {severity}\n"
            f"**Details**: {description}\n\n"
            f"**Dashboard**: {GRAFANA_URL}\n"
            f"**Alert time**: {alert.get('startsAt', '')}"
        )
        # Severities other than critical and warning are not forwarded
        if severity == "critical":
            # Critical alerts → notify Vincent
            if FEISHU_WEBHOOK_CRITICAL:
                send_feishu_webhook(FEISHU_WEBHOOK_CRITICAL, title, content, severity)
            send_feishu_message_via_openclaw(VINCENT_OPEN_ID, title, content, severity)
        elif severity == "warning":
            # Warning alerts → notify COO
            if FEISHU_WEBHOOK_WARNING:
                send_feishu_webhook(FEISHU_WEBHOOK_WARNING, title, content, severity)
            send_feishu_message_via_openclaw(COO_OPEN_ID, title, content, severity)

class WebhookHandler(http.server.BaseHTTPRequestHandler):
    def do_POST(self):
        content_length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(content_length)
        try:
            alert_data = json.loads(body)
            handle_alert(alert_data)
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            response = json.dumps({"status": "ok"}).encode()
            self.send_header("Content-Length", len(response))
            self.end_headers()
            self.wfile.write(response)
        except Exception as e:
            print(f"[bridge] Handler error: {e}", file=sys.stderr)
            self.send_response(500)
            self.end_headers()

    def do_GET(self):
        if self.path == "/health":
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            response = json.dumps({"status": "ok"}).encode()
            self.send_header("Content-Length", len(response))
            self.end_headers()
            self.wfile.write(response)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass


if __name__ == "__main__":
    port = int(os.environ.get("WEBHOOK_PORT", 9094))
    server = http.server.HTTPServer(("0.0.0.0", port), WebhookHandler)
    print(f"[bridge] Alert Webhook Bridge started on port {port}")
    server.serve_forever()
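
The severity routing in `handle_alert` can be checked without Feishu or the openclaw CLI by replaying a hand-written Alertmanager payload. The following is a minimal sketch under stated assumptions: `route_alerts` and the `ROUTES` labels are hypothetical stand-ins for the real senders and Open IDs, and the payload carries only the fields the bridge reads.

```python
import json

# Hypothetical recipient labels standing in for the Open IDs used by the bridge.
ROUTES = {"critical": "vincent", "warning": "coo"}


def route_alerts(alert_data):
    """Return (recipient, title) pairs using the same severity rules as handle_alert."""
    routed = []
    for alert in alert_data.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        severity = labels.get("severity", "warning")  # same default as the bridge
        summary = annotations.get("summary", labels.get("alertname", "Unknown"))
        recipient = ROUTES.get(severity)
        if recipient is not None:  # other severities are not forwarded
            routed.append((recipient, f"[{severity.upper()}] {summary}"))
    return routed


# Illustrative payload; alert names and summaries are made up for this example.
sample = json.loads("""
{
  "alerts": [
    {"status": "firing",
     "labels": {"alertname": "AgentDown", "severity": "critical"},
     "annotations": {"summary": "Agent coo is down"}},
    {"status": "firing", "labels": {"alertname": "AgentSlow"}},
    {"status": "firing", "labels": {"alertname": "Noise", "severity": "info"}}
  ]
}
""")

for recipient, title in route_alerts(sample):
    print(recipient, title)
```

An alert without a `severity` label falls back to `warning`, and an `info` alert produces no notification, which mirrors the branches in `handle_alert`.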
@@ -0,0 +1,210 @@
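# BIZ-25 Scheduled Heartbeat Check: Cron Job Deployment Plan
> **Version:** v1.0
> **Prepared by:** 严维序 (opengineer)
> **Date:** 2026-06-24
> **Status:** Deployed
> **Parent plan:** [BIZ-13 Runtime Stability Assurance Plan](./BIZ-13_运行稳定性保障方案.md)
---
## 1. Overview
This plan is the execution-layer plan for BIZ-13 Phase 1. It deploys the HEARTBEAT.md template and the shared scripts as a runnable scheduled heartbeat check.
### Deployment Architecture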
```
┌─────────────────────────────────────────────────────┐
│                OpenClaw Gateway Cron                │
│  ┌────────────┐  ┌────────────┐  ┌──────────────┐   │
│  │  Agent A   │  │  Agent B   │  │   Agent C    │   │
│  │ heartbeat  │  │ heartbeat  │  │  heartbeat   │   │
│  │  (10/15m)  │  │   (15m)    │  │    (15m)     │   │
│  └─────┬──────┘  └─────┬──────┘  └──────┬───────┘   │
│        │               │                │           │
│        ▼               ▼                ▼           │
│  ┌──────────────────────────────────────────┐       │
│  │  shared/scripts/heartbeat_helper.py      │       │
│  │  + multica_proxy.py                      │       │
│  │  + rate_limiter.py                       │       │
│  └──────────────────────────────────────────┘       │
│        │               │                │           │
│        ▼               ▼                ▼           │
│  ┌──────────────────────────────────────────┐       │
│  │  Three-source task check:                │       │
│  │  WorkBoard + Multica + docs              │       │
│  └──────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────┘
```
---
## 2. Agent Heartbeat Frequency Classes
As defined in the BIZ-13 plan:
| Class | Frequency | Agent | Count |
|------|------|-------|------|
| **High frequency** | **10 minutes** | 陆怀瑾 (coo), 刘诗妮 (secretary) | 2 |
| **Regular** | **15 minutes** | 严维序 (opengineer), 沈路明 (productmanager), 胡蓉 (projectmanager), 梁思筑 (architect), 苏锦绘 (designer), 徐聪 (costcodev), 文墨言 (contentspecialist), 程伯予 (cvexpert), 许言 (prompt-engineer), 钟帧韵 (mediaspecialist), 陆云帆 (taobaospecialist), 顾析策 (marketanalysis), 苏慎 (lawyer) | 13 |
---
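## 3. Deployment Checklist
### 3.1 ✅ Done: HEARTBEAT.md template
HEARTBEAT.md is deployed in the workspaces of all 15 agents:
| Workspace | Frequency | Core content |
|--------|------|----------|
| `coo/` | 10 min | BIZ-38 template + global backlog inspection |
| `secretary/` | 10 min | BIZ-38 template |
| `opengineer/` | 10 min | BIZ-38 template + three-source check |
| `projectmanager/` | 10 min | BIZ-38 template |
| `costcodev/` | 10 min | BIZ-38 template |
| Remaining 10 agents | 15 min | Standard template + three-source check |
### 3.2 ✅ Done: shared heartbeat scripts
Path: `shared/scripts/`
| File | Purpose | Status |
|------|------|------|
| `rate_limiter.py` | Cache management + request scheduling + coordinated polling | ✅ Deployed |
| `multica_proxy.py` | Multica CLI proxy + cache wrapper | ✅ Deployed |
| `heartbeat_helper.py` | Three-source task check + timeout detection + heartbeat entry point | ✅ Deployed |
### 3.3 ⬜ This deployment: OpenClaw cron jobs
Scheduled jobs are created with the OpenClaw Gateway cron system. Each agent's periodic heartbeat is triggered through an isolated `agentTurn` session.
#### Cron Job Specification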
```yaml
Each agent:
  schedule:
    kind: cron
    expr: "*/10 * * * *"    # high-frequency agents
    # expr: "*/15 * * * *"  # regular agents
    tz: "Asia/Shanghai"
  sessionTarget: "isolated"
  payload:
    kind: "agentTurn"
    message: "Run the heartbeat check. Perform the three-source task check in your HEARTBEAT.md."
```
---
## 4. Deployment Execution Record
### Execution time: 2026-06-24 00:14 CST
#### Cron jobs created
| Agent | Frequency | Cron Session | Status |
|-------|------|-------------|------|
| coo (陆怀瑾) | 10 min | isolated agentTurn | ✅ |
| secretary (刘诗妮) | 10 min | isolated agentTurn | ✅ |
| opengineer (严维序) | 10 min | isolated agentTurn | ✅ |
| projectmanager (胡蓉) | 10 min | isolated agentTurn | ✅ |
| costcodev (徐聪) | 10 min | isolated agentTurn | ✅ |
| productmanager (沈路明) | 15 min | isolated agentTurn | ✅ |
| architect (梁思筑) | 15 min | isolated agentTurn | ✅ |
| designer (苏锦绘) | 15 min | isolated agentTurn | ✅ |
| contentspecialist (文墨言) | 15 min | isolated agentTurn | ✅ |
| cvexpert (程伯予) | 15 min | isolated agentTurn | ✅ |
| prompt-engineer (许言) | 15 min | isolated agentTurn | ✅ |
| mediaspecialist (钟帧韵) | 15 min | isolated agentTurn | ✅ |
| taobaospecialist (陆云帆) | 15 min | isolated agentTurn | ✅ |
| marketanalysis (顾析策) | 15 min | isolated agentTurn | ✅ |
| lawyer (苏慎) | 15 min | isolated agentTurn | ✅ |
---
## 5. Heartbeat Check Contents
After each heartbeat trigger, the agent runs the following checks in an isolated session:
### 5.1 Three-source task check
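```mermaid
flowchart TD
    A[Heartbeat triggered] --> B[Check WorkBoard to-do cards]
    A --> C[Check Multica to-do issues]
    A --> D[Check local to-do documents]
    B --> E{Any to-dos?}
    C --> E
    D --> E
    E -->|Yes| F[Execute task automatically]
    E -->|No| G[End heartbeat]
    F --> H[Task complete?]
    H -->|Yes| I[Update status]
    H -->|No| J[Notify COO]
```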
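### 5.2 Timeout detection
- An in-progress task with no progress for more than 20 minutes → marked "suspected timeout"
- Timeout confirmed → automatic recovery flow
### 5.3 Dependency check
- Check `depends_on` before claiming a task
- Dependencies not satisfied → keep as todo, do not claim
### 5.4 Round control
- At most 50 rounds per task
- Approaching 80% (40 rounds) → early warning
- Limit reached → pause and notify COO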
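
The thresholds in 5.2 and 5.4 can be written as two small predicates. This is a minimal sketch, not the logic in `heartbeat_helper.py`: the function names, constant names, and status strings are hypothetical, and only the numbers (20 minutes, 50 rounds, 80%) come from this plan.

```python
MAX_ROUNDS = 50       # per-task round cap (section 5.4)
WARN_FRACTION = 0.8   # early warning at 80% of the cap, i.e. 40 rounds
STALL_MINUTES = 20    # no-progress threshold (section 5.2)


def round_status(rounds_used):
    """Classify a task by how many rounds it has consumed."""
    if rounds_used >= MAX_ROUNDS:
        return "pause"   # limit reached: pause and notify COO
    if rounds_used >= MAX_ROUNDS * WARN_FRACTION:
        return "warn"    # approaching the limit
    return "ok"


def is_suspected_timeout(minutes_since_progress):
    """True when an in-progress task has made no progress for more than 20 minutes."""
    return minutes_since_progress > STALL_MINUTES


for rounds in (10, 40, 50):
    print(rounds, round_status(rounds))
```

Keeping the thresholds as named constants makes it easier to change the 80% warning point without touching the comparison logic.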
---
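## 6. Risks and Mitigations
| Risk | Impact | Mitigation |
|------|------|------|
| Heartbeat job itself hangs | Monitoring stops working | rate_limiter.py cache + timeout protection |
| Newly added agent has no heartbeat configured | Missed coverage | This plan serves as the deployment SOP reference |
| Session isolation loses context | Duplicate heartbeats | Heartbeats only run checks and do not take on complex tasks |
| Agent offline | No heartbeat response | System-event fallback; COO inspection as a backstop |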
---
## 7. Verification
```bash
# List cron jobs
openclaw cron list
# Manually trigger one heartbeat for a specific agent
openclaw cron run <job-id>
# Check heartbeat script health
python3 shared/scripts/heartbeat_helper.py <agent_id> --health
```
---
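## 8. Fix Log
### v1.1 (2026-06-24)
| Issue | Fix |
|------|------|
| cron delivery reported a Feishu delivery error | Changed delivery from `announce` to `none` (the original plan did not specify delivery; functionality is unaffected) |
| Multica workspace_id not passed | `multica_proxy.py` adds `_inject_workspace_id()`, which injects `--workspace-id` into every multica CLI call |
| AGENT_CONFIGS covered only 5 agents | `heartbeat_helper.py` extended to all 15 agents |
| COO HEARTBEAT shown as not deployed | Updated the BIZ-38 integration checklist table |
## 9. Future Improvements
- [ ] Monitoring dashboard integration (BIZ-28 Phase 3)
- [ ] Aggregated display of heartbeat results
- [ ] Agent health status alerts
- [ ] Automatic agent discovery (new agents get a heartbeat configured automatically)
---
> **Ops record**: 严维序 2026-06-24
> HEARTBEAT.md is deployed for all 15 agents, the shared scripts are in place, and the cron timers are configured.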
@@ -0,0 +1,46 @@
# Sidecar V2 — Multi-Pool Provider Proxy
FROM python:3.12-slim AS builder
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY config.py crypto.py main.py server.py proxy.py router.py \
pool_manager.py cooldown_manager.py rate_limiter.py __init__.py \
dashboard.html ./
COPY storage/ ./storage/
# Create data directory
RUN mkdir -p /app/data /app/data/backups
FROM python:3.12-slim
WORKDIR /app
# Copy built artifacts
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY --from=builder /app /app
# Environment
ENV SIDECAR_HOST=0.0.0.0
ENV SIDECAR_PORT=9190
ENV SIDECAR_METRICS_PORT=9191
ENV SIDECAR_DB_PATH=/app/data/sidecar_v2.db
ENV SIDECAR_BACKUP_DIR=/app/data/backups
ENV SIDECAR_ENCRYPTION_KEY=
ENV SIDECAR_ADMIN_TOKEN=
ENV LOG_FORMAT=json
ENV PYTHONUNBUFFERED=1
EXPOSE 9190 9191
VOLUME ["/app/data"]
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:9190/health')" || exit 1
ENTRYPOINT ["python3", "main.py"]
@@ -0,0 +1,77 @@
# Sidecar V2 — Multi-Pool Provider Proxy
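## Overview
Sidecar V2 is OpenClaw's API proxy service. It provides multi-provider pool management, load balancing, 429 cooldown, and RPM queue-based flow control.
## Core Features
- **Provider pool management**: primary pool + fallback pool, with providers added and removed dynamically
- **429 cooldown**: detect 429 → cool down automatically → exponential backoff → recover automatically
- **Independent per-provider RPM limiting**: each provider has its own token bucket
- **Routing strategy**: primary pool first → fallback pool as backstop → 503 when all are exhausted
- **WebUI management**: dashboard + provider CRUD
- **Usage statistics**: token usage + cost statistics + hourly/daily aggregation
- **API key encryption**: AES-256-GCM encrypted storage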
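
To make the per-provider RPM limit concrete, here is a minimal token bucket sketch. It is an illustration under stated assumptions, not the code in `rate_limiter.py`: the `TokenBucket` class, its method names, and the injectable `now` argument are hypothetical.

```python
import time


class TokenBucket:
    """Allow up to `rpm` requests per minute, refilled continuously."""

    def __init__(self, rpm, now=None):
        self.capacity = float(rpm)
        self.refill_per_second = rpm / 60.0
        self.tokens = float(rpm)  # start full
        self.updated = time.monotonic() if now is None else now

    def try_acquire(self, now=None):
        """Take one token if available; `now` can be injected for testing."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill in proportion to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per provider keeps a busy provider from starving the others.
buckets = {"provider-1": TokenBucket(rpm=40), "provider-2": TokenBucket(rpm=40)}
print(buckets["provider-1"].try_acquire())
```

Each provider gets its own bucket, so exhausting one provider's budget does not affect requests routed to another.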
## Architecture
```
OpenClaw → Sidecar V2 (port 9190) → router → primary pool: Provider 1,2,3...
                                          ↘ fallback pool: Provider 4,5...
                                          ↘ all exhausted → 503
```
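
The routing order in the diagram can be sketched as a single selection function. This is a simplified illustration, assuming each backend is a dict with hypothetical `name`, `pool`, and `status` keys; the real router also accounts for model mappings, rate limits, and retries.

```python
def pick_backend(backends):
    """Return the first healthy backend name, primary pool before fallback.

    Returns None when every backend is unavailable; the caller would
    answer with HTTP 503 in that case.
    """
    for pool in ("primary", "fallback"):
        for backend in backends:
            if backend["pool"] == pool and backend["status"] == "healthy":
                return backend["name"]
    return None


# Illustrative data only; names and statuses are made up.
backends = [
    {"name": "provider-1", "pool": "primary", "status": "cooling"},
    {"name": "provider-4", "pool": "fallback", "status": "healthy"},
]
print(pick_backend(backends))
```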
## Quick Start
```bash
# Set the encryption key (64 hex characters)
export SIDECAR_ENCRYPTION_KEY="0000111122223333444455556666777788889999aaaabbbbccccddddeeeeffff"
# Start the service
python3 main.py
# OR via uvicorn
python3 -m uvicorn server:app --host 127.0.0.1 --port 9190
```
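
The key shown above is a placeholder. One way to produce a random key in the expected format (64 hex characters, 32 bytes) is Python's `secrets` module. The sketch below is a suggestion rather than a project tool, and both helper names are hypothetical.

```python
import secrets


def generate_encryption_key():
    """Return 32 random bytes encoded as 64 hex characters."""
    return secrets.token_hex(32)


def looks_valid(hex_key):
    """Check length and hex encoding, the two properties a 64-hex-char key needs."""
    if len(hex_key) != 64:
        return False
    try:
        bytes.fromhex(hex_key)
    except ValueError:
        return False
    return True


key = generate_encryption_key()
print(len(key), looks_valid(key))
```

Store the generated value outside the repository, since changing the key later leaves previously encrypted API keys undecryptable.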
## WebUI
Open http://127.0.0.1:9190/dashboard
## API Endpoints
### Admin API
- `GET /api/admin/backends`: list all providers
- `POST /api/admin/backends`: add a provider
- `PUT /api/admin/backends/{id}`: update a provider
- `DELETE /api/admin/backends/{id}`: delete a provider
- `GET /api/admin/pools`: pool status summary
- `GET /api/admin/stats/total`: overall totals
- `GET /api/admin/stats/hourly`: hourly usage
- `GET /api/admin/stats/daily`: daily aggregation
- `GET /api/admin/stats/cooldown`: cooldown event history
- `GET /api/admin/config`: system configuration
### Proxy API (OpenAI-compatible)
- `POST /v1/chat/completions`
- `POST /v1/completions`
- `POST /v1/embeddings`
- `GET /v1/models`
### Monitoring
- `GET /health`: health check
- `GET /dashboard/sse`: real-time dashboard data stream (SSE)
## Environment Variables
| Variable | Default | Description |
|------|--------|------|
| SIDECAR_HOST | 127.0.0.1 | Listen address |
| SIDECAR_PORT | 9190 | Listen port |
| SIDECAR_ENCRYPTION_KEY | (required) | API key encryption key (64 hex chars) |
| SIDECAR_DB_PATH | ./data/sidecar_v2.db | SQLite database path |
| SIDECAR_RATE_RPM | 40 | Default RPM limit |
| SIDECAR_COOLDOWN_BASE | 30 | Base cooldown duration (seconds) |
| SIDECAR_COOLDOWN_MAX | 600 | Maximum cooldown duration (seconds) |
## Storage
- SQLite (WAL mode)
- Tables: backends, backend_usage_logs, cooldown_events, backend_health, system_config, daily_stats
@@ -0,0 +1 @@
"""Sidecar V2 — Multi-pool provider proxy with cooldown, rate limiting, and WebUI management."""
@@ -0,0 +1,165 @@
"""System configuration management for Sidecar V2."""
import os
import json
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class Config:
    """Sidecar V2 runtime configuration.

    Sources (priority order):
    1. Environment variables (highest)
    2. system_config table in SQLite
    3. Defaults defined here
    """
    # Listen
    host: str = "127.0.0.1"
    port: int = 9190
    metrics_port: int = 9191
    # Queue
    queue_max_depth: int = 500
    queue_timeout_seconds: float = 30.0
    # Provider
    default_rpm_limit: int = 40
    # Cooldown
    cooldown_base_seconds: float = 30.0
    cooldown_max_seconds: float = 600.0
    cooldown_exponential_backoff: bool = True
    # Emergency channel: RPM fraction when all pools exhausted
    emergency_rpm_fraction: float = 0.10
    # Health check
    health_check_interval_seconds: int = 60
    health_check_timeout_seconds: int = 10
    health_probe_endpoint: str = "/v1/models"
    # Admin auth
    admin_token: str = ""
    # Encryption
    encryption_key: str = ""
    # Logging
    log_level: str = "INFO"
    # Database
    db_path: str = ""
    backup_dir: str = ""
    backup_retention_days: int = 7
    # Rate limiter
    rate_limiter_refill_interval_ms: int = 50
    # Router
    router_refresh_interval_seconds: float = 5.0
    # Max pool-internal retries
    max_pool_retries: int = 5
    # Pre-check cooldown threshold (seconds remaining)
    cooldown_precheck_threshold_seconds: float = 10.0
    # Dashboard
    dashboard_sse_interval_seconds: float = 1.0
    # Stats
    stats_refresh_interval_seconds: float = 30.0
    # Request timeout
    default_request_timeout_seconds: int = 120

    @classmethod
    def from_env(cls) -> "Config":
        """Load configuration from environment variables."""
        c = cls()
        # Listen
        c.host = os.getenv("SIDECAR_HOST", c.host)
        c.port = int(os.getenv("SIDECAR_PORT", str(c.port)))
        c.metrics_port = int(os.getenv("SIDECAR_METRICS_PORT", str(c.metrics_port)))
        # Queue
        c.queue_max_depth = int(os.getenv("SIDECAR_QUEUE_MAX", str(c.queue_max_depth)))
        c.queue_timeout_seconds = float(
            os.getenv("SIDECAR_QUEUE_TIMEOUT", str(c.queue_timeout_seconds))
        )
        # Provider
        c.default_rpm_limit = int(
            os.getenv("SIDECAR_RATE_RPM", str(c.default_rpm_limit))
        )
        # Cooldown
        c.cooldown_base_seconds = float(
            os.getenv("SIDECAR_COOLDOWN_BASE", str(c.cooldown_base_seconds))
        )
        c.cooldown_max_seconds = float(
            os.getenv("SIDECAR_COOLDOWN_MAX", str(c.cooldown_max_seconds))
        )
        # Admin
        c.admin_token = os.getenv("SIDECAR_ADMIN_TOKEN", c.admin_token)
        # Encryption
        c.encryption_key = os.getenv("SIDECAR_ENCRYPTION_KEY", c.encryption_key)
        # Logging
        c.log_level = os.getenv("LOG_LEVEL", c.log_level).upper()
        # Database
        c.db_path = os.getenv(
            "SIDECAR_DB_PATH",
            os.path.join(os.getcwd(), "data", "sidecar_v2.db"),
        )
        c.backup_dir = os.getenv(
            "SIDECAR_BACKUP_DIR",
            os.path.join(os.getcwd(), "data", "backups"),
        )
        # V1 compatibility: migrate env vars
        c._migrate_v1_env()
        return c

    def _migrate_v1_env(self) -> None:
        """Migrate V1 environment variables to V2 defaults."""
        # V1 UPSTREAM endpoint
        upstream = os.getenv("SIDECAR_UPSTREAM")
        api_key = os.getenv("SIDECAR_API_KEY")
        if api_key and self.encryption_key:
            # These will be used during initial migration
            os.environ["_SIDECAR_V1_API_KEY"] = api_key
            os.environ["_SIDECAR_V1_UPSTREAM"] = upstream or "https://integrate.api.nvidia.com/v1"

    def to_db_dict(self) -> dict:
        """Serialize to dict for system_config storage."""
        result = {}
        for key, value in asdict(self).items():
            if isinstance(value, bool):
                result[key] = "true" if value else "false"
            elif isinstance(value, (int, float)):
                result[key] = str(value)
            else:
                result[key] = value
        return result
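
    @classmethod
    def merge_db(cls, base: "Config", db_config: dict) -> "Config":
        """Merge DB config into base config (env vars already applied to base)."""
        for key, value in base.__dict__.items():
            # NOTE: `key` is a field name (e.g. "host"), not an env var name
            # (e.g. "SIDECAR_HOST"), so this guard rarely detects an override.
            if key in db_config and key not in os.environ:
                raw = db_config[key]
                if isinstance(value, bool):
                    # bool("false") is True, so parse the stored string explicitly
                    setattr(base, key, str(raw).lower() == "true")
                else:
                    setattr(base, key, type(value)(raw))
        return base


# Singleton
config = Config.from_env()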
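
The `Config` docstring describes a three-level precedence: environment variable, then the `system_config` table, then the built-in default. A minimal standalone sketch of that lookup follows. `resolve_setting` is a hypothetical helper, not part of `config.py`, and accepting `1` and `yes` as true values is an assumption made here.

```python
def resolve_setting(env, db, env_name, field_name, default):
    """Resolve one setting: env var first, then DB row, then default.

    `env` and `db` are plain dicts of strings; the result is coerced
    to the type of `default`.
    """
    if env_name in env:
        raw = env[env_name]
    elif field_name in db:
        raw = db[field_name]
    else:
        return default
    if isinstance(default, bool):
        # bool("false") would be True, so compare the text instead
        return str(raw).strip().lower() in ("1", "true", "yes")
    return type(default)(raw)


env = {"SIDECAR_PORT": "8000"}
db = {"port": "9000", "cooldown_exponential_backoff": "false"}
print(resolve_setting(env, db, "SIDECAR_PORT", "port", 9190))
```

Passing the environment variable name and the field name separately avoids relying on the two sharing a spelling, which they do not for settings such as `SIDECAR_RATE_RPM` and `default_rpm_limit`.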
@@ -0,0 +1,114 @@
"""429 Cooldown management for backends using exponential backoff."""
import time
from datetime import datetime, timezone
import structlog
from config import config
from storage.backend_store import set_backend_cooldown, clear_backend_cooldown, get_backend
from storage.cooldown_store import log_cooldown_event, end_cooldown_event

logger = structlog.get_logger("sidecar_v2.cooldown_manager")


def calculate_cooldown(consecutive_count: int) -> float:
    """Calculate cooldown duration using exponential backoff.

    Formula: base * 2^(consecutive-1), capped at max.
    """
    base = config.cooldown_base_seconds
    max_seconds = config.cooldown_max_seconds
    if config.cooldown_exponential_backoff:
        duration = base * (2 ** (consecutive_count - 1))
    else:
        duration = base * consecutive_count
    return min(duration, max_seconds)

def start_cooldown(backend_id: str, consecutive_count: int) -> float:
    """Start cooldown for a backend after 429.

    Returns: cooldown duration in seconds.
    """
    duration = calculate_cooldown(consecutive_count)
    cooldown_until_ts = time.time() + duration
    cooldown_until = time.strftime(
        "%Y-%m-%dT%H:%M:%SZ", time.gmtime(cooldown_until_ts)
    )
    set_backend_cooldown(backend_id, cooldown_until, consecutive_count)
    log_cooldown_event(
        backend_id=backend_id,
        consecutive_count=consecutive_count,
        cooldown_seconds=int(duration),
        response_summary=f"429 cooldown triggered (consecutive #{consecutive_count})",
    )
    logger.info(
        "cooldown_started",
        backend_id=backend_id,
        duration=round(duration, 1),
        consecutive=consecutive_count,
    )
    return duration

def check_and_clear_cooldown(backend_id: str) -> bool:
    """Check if cooldown has expired for a backend.

    Returns True if cooldown was cleared (backend is back online).
    """
    backend = get_backend(backend_id, decrypt_key=False)
    if backend is None:
        return False
    if backend.status != "cooling":
        return False
    cooldown_until = backend.cooldown_until
    if not cooldown_until:
        clear_backend_cooldown(backend_id)
        return True
    # Parse cooldown_until as ISO timestamp
    try:
        dt = datetime.fromisoformat(cooldown_until.replace("Z", "+00:00"))
        cooldown_ts = dt.timestamp()
    except ValueError:
        # If parsing fails, clear and move on
        clear_backend_cooldown(backend_id)
        return True
    now = time.time()
    if now >= cooldown_ts:
        clear_backend_cooldown(backend_id)
        end_cooldown_event(backend_id)
        logger.info("cooldown_cleared", backend_id=backend_id)
        return True
    remaining = cooldown_ts - now
    logger.debug("cooldown_active", backend_id=backend_id, remaining_seconds=round(remaining, 1))
    return False

def precheck_cooldown(backend_id: str) -> bool:
    """Check if backend should be skipped due to near-expiry cooldown.

    If cooldown will expire within config.cooldown_precheck_threshold_seconds,
    skip the backend so we don't hit it again right as it expires.
    """
    backend = get_backend(backend_id, decrypt_key=False)
    if backend is None or backend.status != "cooling":
        return False
    cooldown_until = backend.cooldown_until
    if not cooldown_until:
        return False
    try:
        dt = datetime.fromisoformat(cooldown_until.replace("Z", "+00:00"))
        cooldown_ts = dt.timestamp()
    except ValueError:
        return False
    remaining = cooldown_ts - time.time()
    return 0 < remaining <= config.cooldown_precheck_threshold_seconds
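
As a worked example of the backoff formula in `calculate_cooldown`, the sketch below tabulates the cooldown for the first six consecutive 429s using the default base (30 s) and cap (600 s) from `config.py`. `backoff_seconds` is a standalone stand-in written for this illustration; it does not read the live config.

```python
def backoff_seconds(consecutive_count, base=30.0, cap=600.0):
    """Exponential backoff: base * 2^(n-1), capped at `cap`."""
    return min(base * (2 ** (consecutive_count - 1)), cap)


schedule = [backoff_seconds(n) for n in range(1, 7)]
for n, seconds in enumerate(schedule, start=1):
    print(f"429 #{n}: cool down for {seconds:.0f}s")
```

With these defaults the duration doubles from 30 s up to 480 s on the fifth consecutive 429, and the sixth would be 960 s uncapped, so it is held at the 600 s maximum.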
@@ -0,0 +1,108 @@
"""AES-256-GCM encryption for API Key storage."""
import os
import secrets
import structlog
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

logger = structlog.get_logger()
_ENCRYPTION_KEY: bytes | None = None
_cipher: AESGCM | None = None


def init_crypto(hex_key: str) -> None:
    """Initialize the encryption module.

    Validates the key and prepares the cipher.
    Raises ValueError if key is invalid.
    """
    global _ENCRYPTION_KEY, _cipher
    if not hex_key:
        raise ValueError("FATAL: SIDECAR_ENCRYPTION_KEY not set")
    if len(hex_key) != 64:
        raise ValueError(
            f"FATAL: SIDECAR_ENCRYPTION_KEY must be 64 hex chars (32 bytes), "
            f"got {len(hex_key)} chars"
        )
    try:
        key_bytes = bytes.fromhex(hex_key)
    except ValueError:
        raise ValueError(
            "FATAL: SIDECAR_ENCRYPTION_KEY must be valid hexadecimal"
        ) from None
    _ENCRYPTION_KEY = key_bytes
    _cipher = AESGCM(key_bytes)
    logger.info("crypto_initialized")

def encrypt(plaintext: str) -> str:
    """Encrypt plaintext using AES-256-GCM.

    A fresh random 12-byte nonce is generated for every call.
    Returns: "<nonce_hex>:<ciphertext_hex>", where the ciphertext
    includes the GCM authentication tag.
    """
    if _cipher is None:
        raise RuntimeError("Crypto not initialized. Call init_crypto() first.")
    nonce = secrets.token_bytes(12)
    ciphertext = _cipher.encrypt(nonce, plaintext.encode("utf-8"), None)
    return nonce.hex() + ":" + ciphertext.hex()


def decrypt(encrypted: str) -> str:
    """Decrypt AES-256-GCM ciphertext.

    Args:
        encrypted: Format "<nonce_hex>:<ciphertext_hex>"
    Returns: Decrypted plaintext string.
    """
    if _cipher is None:
        raise RuntimeError("Crypto not initialized. Call init_crypto() first.")
    parts = encrypted.split(":", 1)
    if len(parts) != 2:
        raise ValueError("Invalid encrypted format: expected nonce:ciphertext")
    nonce = bytes.fromhex(parts[0])
    ciphertext = bytes.fromhex(parts[1])
    try:
        plaintext = _cipher.decrypt(nonce, ciphertext, None)
        return plaintext.decode("utf-8")
    except Exception as e:
        raise ValueError(f"Decryption failed: {e}") from e

def is_initialized() -> bool:
    """Check if crypto has been initialized."""
    return _cipher is not None


def mask_api_key(api_key_plain: str) -> str:
    """Mask API key for display: show first 6 + last 4 chars."""
    if len(api_key_plain) <= 10:
        return api_key_plain[:2] + "****"
    return api_key_plain[:6] + "****" + api_key_plain[-4:]


def try_decrypt_existing(encrypted_value: str) -> str | None:
    """Try to decrypt an existing encrypted value.

    Returns the plaintext if successful, None if decryption fails
    (e.g., encryption key was changed).
    """
    try:
        return decrypt(encrypted_value)
    except Exception:
        logger.warning(
            "decrypt_existing_failed",
            hint="Encryption key may have been changed, existing keys unrecoverable"
        )
        return None
@@ -0,0 +1,623 @@
<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Sidecar V2 — Provider Pool Dashboard</title>
<!-- Primary: jsDelivr CDN -->
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script>
<!-- Fallback: local static copy for offline/intranet deployments -->
<script>
(function() {
var check = function() {
if (typeof Chart === 'undefined') {
var s = document.createElement('script');
s.src = '/static/chart.umd.min.js';
s.onerror = function() {
console.warn('Chart.js unavailable (CDN + local both failed). Charts disabled.');
};
document.head.appendChild(s);
}
};
// Check after CDN script has had a chance to load
setTimeout(check, 2000);
})();
</script>
<style>
:root {
--bg: #0f1117;
--card-bg: #1a1d28;
--border: #2a2d3a;
--text: #e0e0e0;
--text-dim: #888;
--green: #23d160;
--yellow: #ffdd57;
--red: #ff3860;
--blue: #3273dc;
--purple: #b86bff;
--cyan: #00d1b2;
--orange: #ff8533;
}
* { margin: 0; padding: 0; box-sizing: border-box; }
body {
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif;
background: var(--bg);
color: var(--text);
min-height: 100vh;
}
/* Layout */
.app { display: flex; height: 100vh; }
.sidebar {
width: 220px; background: var(--card-bg); border-right: 1px solid var(--border);
padding: 20px 0; display: flex; flex-direction: column;
}
.sidebar h2 { padding: 0 20px 20px; font-size: 16px; color: var(--cyan); border-bottom: 1px solid var(--border); }
.sidebar nav { flex: 1; padding: 10px 0; }
.sidebar nav a {
display: block; padding: 10px 20px; color: var(--text-dim); text-decoration: none;
font-size: 13px; transition: 0.2s;
}
.sidebar nav a:hover, .sidebar nav a.active { color: var(--text); background: rgba(255,255,255,0.05); }
.sidebar .status-bar { padding: 15px 20px; border-top: 1px solid var(--border); font-size: 11px; color: var(--text-dim); }
.main { flex: 1; overflow-y: auto; padding: 24px; }
.page { display: none; }
.page.active { display: block; }
/* Dashboard Cards */
.cards { display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 16px; margin-bottom: 24px; }
.card {
background: var(--card-bg); border: 1px solid var(--border); border-radius: 8px; padding: 16px;
}
.card .label { font-size: 12px; color: var(--text-dim); text-transform: uppercase;letter-spacing:0.5px;margin-bottom:6px; }
.card .value { font-size: 28px; font-weight: 700; }
.card .sub { font-size: 12px; color: var(--text-dim); margin-top: 4px; }
.charts { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; }
.chart-card {
background: var(--card-bg); border: 1px solid var(--border); border-radius: 8px; padding: 16px;
}
.chart-card h3 { font-size: 14px; margin-bottom: 12px; color: var(--text-dim); }
.chart-card canvas { max-height: 250px; }
/* Pool Cards */
.pool-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 16px; margin-bottom: 24px; }
.pool-card {
background: var(--card-bg); border: 1px solid var(--border); border-radius: 8px; padding: 16px;
}
.pool-card h3 { font-size: 15px; margin-bottom: 12px; text-transform: uppercase; letter-spacing: 1px; }
.pool-card h3.primary { color: var(--blue); }
.pool-card h3.fallback { color: var(--orange); }
.pool-stats { display: grid; grid-template-columns: repeat(4, 1fr); gap: 8px; }
.pool-stat { text-align: center; }
.pool-stat .num { font-size: 22px; font-weight: 700; }
.pool-stat .lbl { font-size: 11px; color: var(--text-dim); margin-top: 2px; }
.pool-stat.healthy .num { color: var(--green); }
.pool-stat.cooling .num { color: var(--yellow); }
.pool-stat.error .num { color: var(--red); }
.pool-stat.total .num { color: var(--purple); }
/* Tables */
table { width: 100%; border-collapse: collapse; background: var(--card-bg); border-radius: 8px; overflow: hidden; }
th { text-align: left; padding: 10px 12px; font-size: 11px; text-transform: uppercase; letter-spacing: 0.5px; color: var(--text-dim); background: rgba(255,255,255,0.03); border-bottom: 1px solid var(--border); }
td { padding: 10px 12px; font-size: 13px; border-bottom: 1px solid var(--border); }
tr:last-child td { border-bottom: none; }
tr:hover { background: rgba(255,255,255,0.02); }
.badge {
display: inline-block; padding: 2px 8px; border-radius: 10px; font-size: 11px; font-weight: 600;
}
.badge.healthy { background: rgba(35,209,96,0.15); color: var(--green); }
.badge.cooling { background: rgba(255,221,87,0.15); color: var(--yellow); }
.badge.error { background: rgba(255,56,96,0.15); color: var(--red); }
.badge.disabled { background: rgba(136,136,136,0.15); color: var(--text-dim); }
.badge.primary { background: rgba(50,115,220,0.15); color: var(--blue); }
.badge.fallback { background: rgba(255,133,51,0.15); color: var(--orange); }
/* Buttons */
.btn {
padding: 6px 14px; border-radius: 6px; border: none; cursor: pointer; font-size: 12px; font-weight: 600;
transition: 0.2s;
}
.btn-primary { background: var(--blue); color: #fff; }
.btn-primary:hover { opacity: 0.85; }
.btn-danger { background: var(--red); color: #fff; }
.btn-danger:hover { opacity: 0.85; }
.btn-sm { padding: 3px 10px; font-size: 11px; }
.btn-outline { background: transparent; border: 1px solid var(--border); color: var(--text); }
.btn-outline:hover { background: rgba(255,255,255,0.05); }
.section-header { display: flex; justify-content: space-between; align-items: center; margin-bottom: 12px; }
.section-header h3 { font-size: 15px; }
/* Modal */
.modal-overlay { display: none; position: fixed; top: 0; left: 0; right: 0; bottom: 0; background: rgba(0,0,0,0.7); z-index: 100; justify-content: center; align-items: center; }
.modal-overlay.active { display: flex; }
.modal { background: var(--card-bg); border: 1px solid var(--border); border-radius: 12px; padding: 24px; width: 560px; max-height: 80vh; overflow-y: auto; }
.modal h3 { margin-bottom: 16px; font-size: 16px; }
.form-group { margin-bottom: 12px; }
.form-group label { display: block; font-size: 12px; color: var(--text-dim); margin-bottom: 4px; }
.form-group input, .form-group select, .form-group textarea {
width: 100%; padding: 8px 10px; background: var(--bg); border: 1px solid var(--border);
border-radius: 6px; color: var(--text); font-size: 13px;
}
.form-group textarea { min-height: 80px; font-family: monospace; font-size: 12px; }
.form-row { display: grid; grid-template-columns: 1fr 1fr; gap: 12px; }
.form-actions { display: flex; gap: 8px; justify-content: flex-end; margin-top: 16px; }
.model-mapping-row { display: flex; gap: 8px; align-items: center; margin-bottom: 8px; }
.model-mapping-row input { flex: 1; }
/* Utility */
.text-green { color: var(--green); }
.text-red { color: var(--red); }
.text-dim { color: var(--text-dim); }
.mb-16 { margin-bottom: 16px; }
.mb-24 { margin-bottom: 24px; }
.mt-24 { margin-top: 24px; }
@media (max-width: 768px) {
.charts, .pool-grid { grid-template-columns: 1fr; }
.sidebar { display: none; }
}
</style>
</head>
<body>
<div class="app">
<!-- Sidebar -->
<aside class="sidebar">
<h2>🚀 Sidecar V2</h2>
<nav>
<a href="#" data-page="dashboard" class="active">📊 Dashboard</a>
<a href="#" data-page="providers">🔌 Providers</a>
<a href="#" data-page="usage">📈 Usage Stats</a>
<a href="#" data-page="cooldown">🧊 Cooldown Log</a>
</nav>
<div class="status-bar" id="status-bar">Connected · Sidecar V2</div>
</aside>
<!-- Main Content -->
<main class="main">
<!-- Dashboard Page -->
<div class="page active" id="page-dashboard">
<div class="cards" id="stat-cards"></div>
<div class="pool-grid" id="pool-grid"></div>
<div class="charts" id="charts"></div>
</div>
<!-- Providers Page -->
<div class="page" id="page-providers">
<div class="section-header">
<h3>Provider Backends</h3>
<button class="btn btn-primary" onclick="showAddBackend()">+ Add Provider</button>
</div>
<table id="backends-table">
<thead>
<tr><th>Name</th><th>Label</th><th>Pool</th><th>Status</th><th>RPM</th><th>Models</th><th>Actions</th></tr>
</thead>
<tbody></tbody>
</table>
</div>
<!-- Usage Page -->
<div class="page" id="page-usage">
<div class="section-header"><h3>Hourly Usage</h3></div>
<div class="mb-16">
<select id="usage-backend-filter" onchange="loadUsage()" class="btn btn-outline btn-sm">
<option value="">All Backends</option>
</select>
</div>
<table id="usage-table">
<thead>
<tr><th>Hour</th><th>Backend</th><th>Model</th><th>Requests</th><th>Errors</th><th>Tokens</th><th>Cost</th><th>Avg Latency</th></tr>
</thead>
<tbody></tbody>
</table>
<div class="section-header mt-24 mb-16"><h3>Daily Aggregation</h3></div>
<table id="daily-table">
<thead>
<tr><th>Date</th><th>Pool</th><th>Requests</th><th>Errors</th><th>Tokens</th><th>Cost</th><th>Backends</th></tr>
</thead>
<tbody></tbody>
</table>
</div>
<!-- Cooldown Page -->
<div class="page" id="page-cooldown">
<div class="section-header"><h3>Cooldown Event History</h3></div>
<table id="cooldown-table">
<thead>
<tr><th>Time</th><th>Backend</th><th>Consecutive 429s</th><th>Duration</th><th>Summary</th></tr>
</thead>
<tbody></tbody>
</table>
</div>
</main>
</div>
<!-- Add/Edit Backend Modal -->
<div class="modal-overlay" id="backend-modal">
<div class="modal">
<h3 id="modal-title">Add Provider</h3>
<form id="backend-form" onsubmit="saveBackend(event)">
<input type="hidden" id="backend-id">
<div class="form-row">
<div class="form-group">
<label>Name *</label>
<input type="text" id="backend-name" placeholder="e.g. NVIDIA H100 Primary" required>
</div>
<div class="form-group">
<label>Label</label>
<input type="text" id="backend-label" placeholder="e.g. nvidia, siliconflow">
</div>
</div>
<div class="form-group">
<label>API Base URL *</label>
<input type="url" id="backend-url" placeholder="https://integrate.api.nvidia.com/v1" required>
</div>
<div class="form-group">
<label>API Key *</label>
<input type="password" id="backend-key" placeholder="sk-..." required>
</div>
<div class="form-row">
<div class="form-group">
<label>Pool</label>
<select id="backend-pool">
<option value="primary">Primary</option>
<option value="fallback">Fallback</option>
</select>
</div>
<div class="form-group">
<label>RPM Limit</label>
<input type="number" id="backend-rpm" value="40" min="1" max="1000">
</div>
</div>
<div class="form-row">
<div class="form-group">
<label>Timeout (seconds)</label>
<input type="number" id="backend-timeout" value="120" min="10" max="600">
</div>
<div class="form-group">
<label>Enabled</label>
<select id="backend-enabled">
<option value="true">Yes</option>
<option value="false">No</option>
</select>
</div>
</div>
<div class="form-group">
<label>Model Mappings (JSON: canonical → {native_id, cost, ...})</label>
<textarea id="backend-mappings" placeholder='{"deepseek-ai/DeepSeek-V4-Pro":{"native_id":"deepseek-ai/deepseek-v4-pro","cost":{"input":0.000001,"output":0.000004}}}'></textarea>
</div>
<div class="form-actions">
<button type="button" class="btn btn-outline" onclick="closeModal()">Cancel</button>
<button type="submit" class="btn btn-primary">Save</button>
</div>
</form>
</div>
</div>
<script>
// ── Navigation ──
document.querySelectorAll('.sidebar nav a').forEach(a => {
a.addEventListener('click', e => {
e.preventDefault();
document.querySelectorAll('.sidebar nav a').forEach(l => l.classList.remove('active'));
a.classList.add('active');
document.querySelectorAll('.page').forEach(p => p.classList.remove('active'));
document.getElementById('page-' + a.dataset.page).classList.add('active');
loadPage(a.dataset.page);
});
});
// ── SSE Connection ──
const sse = new EventSource('/dashboard/sse');
sse.onmessage = e => {
const data = JSON.parse(e.data);
if (data.type === 'snapshot') updateDashboard(data);
};
sse.onerror = () => {
document.getElementById('status-bar').textContent = '⚠️ SSE Disconnected';
};
// ── Dashboard Update ──
let costChart = null, tokenChart = null;
function updateDashboard(data) {
document.getElementById('status-bar').textContent =
`⚡ Connected · Uptime ${formatDuration(data.uptime_seconds)}`;
// Stat cards
const st = data.total || {};
const errRate = st.total_requests > 0 ? ((st.total_errors || 0) / st.total_requests * 100).toFixed(1) : '0.0';
document.getElementById('stat-cards').innerHTML = `
<div class="card"><div class="label">Total Requests</div><div class="value">${fmt(st.total_requests)}</div><div class="sub">Error rate: ${errRate}%</div></div>
<div class="card"><div class="label">Total Tokens</div><div class="value">${fmt(st.total_tokens)}</div><div class="sub">Prompt: ${fmt(st.total_prompt_tokens)} · Completion: ${fmt(st.total_completion_tokens)}</div></div>
<div class="card"><div class="label">Total Cost</div><div class="value">$${st.total_cost ? st.total_cost.toFixed(4) : '0.0000'}</div><div class="sub">USD</div></div>
<div class="card"><div class="label">Uptime</div><div class="value">${formatDuration(data.uptime_seconds)}</div><div class="sub">Sidecar V2</div></div>
`;
// Pool grid
let poolHTML = '';
for (const [pool, ps] of Object.entries(data.pool || {})) {
poolHTML += `
<div class="pool-card">
<h3 class="${pool}">${pool}</h3>
<div class="pool-stats">
<div class="pool-stat total"><div class="num">${ps.total}</div><div class="lbl">Total</div></div>
<div class="pool-stat healthy"><div class="num">${ps.healthy}</div><div class="lbl">Healthy</div></div>
<div class="pool-stat cooling"><div class="num">${ps.cooling}</div><div class="lbl">Cooling</div></div>
<div class="pool-stat error"><div class="num">${ps.error}</div><div class="lbl">Error</div></div>
</div>
</div>`;
}
document.getElementById('pool-grid').innerHTML = poolHTML || '<div class="card">No pools configured</div>';
// Update backend table if on providers page
if (document.getElementById('page-providers').classList.contains('active')) {
renderBackendsTable(data.backends || []);
}
}
// ── Chart Updates (use SSE data to build chart data) ──
function initCharts() {
const cc = document.getElementById('cost-chart');
const tc = document.getElementById('token-chart');
if (!cc || !tc || typeof Chart === 'undefined') return; // skip if Chart.js has not loaded
if (costChart) costChart.destroy();
if (tokenChart) tokenChart.destroy();
costChart = new Chart(cc, {
type: 'line', data: { labels: [], datasets: [{ label: 'Cost (USD)', data: [], borderColor: '#00d1b2', backgroundColor: 'rgba(0,209,178,0.1)', fill: true, tension: 0.3 }] },
options: { responsive: true, maintainAspectRatio: true, plugins: { legend: { labels: { color: '#888' } } }, scales: { x: { ticks: { color: '#888', maxTicksLimit: 12 } }, y: { ticks: { color: '#888' } } } }
});
tokenChart = new Chart(tc, {
type: 'line', data: { labels: [], datasets: [{ label: 'Total Tokens', data: [], borderColor: '#b86bff', backgroundColor: 'rgba(184,107,255,0.1)', fill: true, tension: 0.3 }] },
options: { responsive: true, maintainAspectRatio: true, plugins: { legend: { labels: { color: '#888' } } }, scales: { x: { ticks: { color: '#888', maxTicksLimit: 12 } }, y: { ticks: { color: '#888' } } } }
});
}
// ── Providers Page ──
function renderBackendsTable(backends) {
const tbody = document.querySelector('#backends-table tbody');
tbody.innerHTML = backends.map(b => `
<tr>
<td><strong>${h(b.name)}</strong></td>
<td><span class="badge ${b.label ? 'primary' : ''}">${h(b.label || '-')}</span></td>
<td><span class="badge ${b.pool}">${b.pool}</span></td>
<td><span class="badge ${b.status}">${b.status}</span></td>
<td>${b.rpm_limit}</td>
<td>${b.model_count || 0}</td>
<td>
<button class="btn btn-outline btn-sm" onclick="editBackend('${b.id}')">Edit</button>
<button class="btn btn-danger btn-sm" onclick="deleteBackend('${b.id}')">Del</button>
</td>
</tr>`).join('');
}
function showAddBackend() {
document.getElementById('modal-title').textContent = 'Add Provider';
document.getElementById('backend-id').value = '';
document.getElementById('backend-name').value = '';
document.getElementById('backend-label').value = '';
document.getElementById('backend-url').value = '';
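document.getElementById('backend-key').value = '';
// Restore the defaults that editBackend() overrides
document.getElementById('backend-key').placeholder = 'sk-...';
document.getElementById('backend-key').required = true;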
document.getElementById('backend-pool').value = 'primary';
document.getElementById('backend-rpm').value = '40';
document.getElementById('backend-timeout').value = '120';
document.getElementById('backend-enabled').value = 'true';
document.getElementById('backend-mappings').value = '{}';
document.getElementById('backend-modal').classList.add('active');
}
async function editBackend(id) {
try {
const res = await fetch('/api/admin/backends/' + id);
const b = await res.json();
document.getElementById('modal-title').textContent = 'Edit Provider';
document.getElementById('backend-id').value = b.id;
document.getElementById('backend-name').value = b.name;
document.getElementById('backend-label').value = b.label || '';
document.getElementById('backend-url').value = b.api_base_url;
document.getElementById('backend-key').value = '';
document.getElementById('backend-key').placeholder = '(leave blank to keep current)';
document.getElementById('backend-key').required = false;
document.getElementById('backend-pool').value = b.pool;
document.getElementById('backend-rpm').value = b.rpm_limit;
document.getElementById('backend-timeout').value = b.timeout_seconds;
document.getElementById('backend-enabled').value = b.enabled ? 'true' : 'false';
document.getElementById('backend-mappings').value = JSON.stringify(b.model_mappings || {}, null, 2);
document.getElementById('backend-modal').classList.add('active');
} catch (e) { alert('Failed to load backend: ' + e.message); }
}
async function saveBackend(e) {
e.preventDefault();
const id = document.getElementById('backend-id').value;
// Validate the mappings first so invalid JSON shows a message instead of an uncaught exception
let mappings;
try {
mappings = JSON.parse(document.getElementById('backend-mappings').value || '{}');
} catch (err) {
alert('Model Mappings is not valid JSON: ' + err.message);
return;
}
const body = {
name: document.getElementById('backend-name').value,
label: document.getElementById('backend-label').value,
api_base_url: document.getElementById('backend-url').value,
pool: document.getElementById('backend-pool').value,
rpm_limit: parseInt(document.getElementById('backend-rpm').value, 10),
timeout_seconds: parseInt(document.getElementById('backend-timeout').value, 10),
enabled: document.getElementById('backend-enabled').value === 'true',
model_mappings: mappings,
};
const key = document.getElementById('backend-key').value;
if (key) body.api_key = key;
try {
const method = id ? 'PUT' : 'POST';
const url = id ? '/api/admin/backends/' + id : '/api/admin/backends';
const res = await fetch(url, { method, headers: { 'Content-Type': 'application/json' }, body: JSON.stringify(body) });
if (!res.ok) throw new Error((await res.json()).detail || 'Save failed');
closeModal();
refreshAll();
} catch (e) { alert('Error: ' + e.message); }
}
async function deleteBackend(id) {
if (!confirm('Delete this provider? This cannot be undone.')) return;
try {
const res = await fetch('/api/admin/backends/' + id, { method: 'DELETE' });
if (!res.ok) throw new Error('HTTP ' + res.status);
refreshAll();
} catch (e) { alert('Delete failed: ' + e.message); }
}
function closeModal() { document.getElementById('backend-modal').classList.remove('active'); }
// ── Load Pages ──
async function loadPage(page) {
if (page === 'dashboard') {
initCharts();
loadChartData();
} else if (page === 'providers') {
refreshAll();
} else if (page === 'usage') {
loadUsageFilter();
loadUsage();
loadDaily();
} else if (page === 'cooldown') {
loadCooldown();
}
}
async function refreshAll() {
try {
const res = await fetch('/api/admin/backends');
const backends = await res.json();
renderBackendsTable(backends);
} catch (e) { console.error(e); }
}
async function loadUsageFilter() {
try {
const res = await fetch('/api/admin/backends');
const backends = await res.json();
const sel = document.getElementById('usage-backend-filter');
sel.innerHTML = '<option value="">All Backends</option>' +
backends.map(b => `<option value="${b.id}">${h(b.name)}</option>`).join('');
} catch (e) {}
}
async function loadUsage() {
const sel = document.getElementById('usage-backend-filter');
const backendId = sel.value;
const url = backendId ? `/api/admin/stats/hourly?backend_id=${backendId}&hours=72` : '/api/admin/stats/hourly?hours=72';
try {
const res = await fetch(url);
const data = await res.json();
const tbody = document.querySelector('#usage-table tbody');
tbody.innerHTML = data.map(r => `
<tr>
<td>${r.hour_bucket}</td>
<td>${r.backend_id}</td>
<td>${h(r.model)}</td>
<td>${fmt(r.request_count)}</td>
<td class="${r.error_count > 0 ? 'text-red' : 'text-green'}">${r.error_count}</td>
<td>${fmt(r.total_tokens)}</td>
<td>$${(r.cost || 0).toFixed(6)}</td>
<td>${r.avg_latency_ms}ms</td>
</tr>`).join('');
} catch (e) { console.error(e); }
}
async function loadDaily() {
try {
const res = await fetch('/api/admin/stats/daily?days=30');
const data = await res.json();
const tbody = document.querySelector('#daily-table tbody');
tbody.innerHTML = data.map(r => `
<tr>
<td>${r.date}</td>
<td><span class="badge ${r.pool}">${r.pool}</span></td>
<td>${fmt(r.total_requests)}</td>
<td>${fmt(r.total_errors)}</td>
<td>${fmt(r.total_tokens)}</td>
<td>$${(r.total_cost || 0).toFixed(6)}</td>
<td>${r.unique_backends}</td>
</tr>`).join('');
} catch (e) { console.error(e); }
}
async function loadCooldown() {
try {
const res = await fetch('/api/admin/stats/cooldown?limit=100');
const data = await res.json();
const tbody = document.querySelector('#cooldown-table tbody');
tbody.innerHTML = data.map(r => `
<tr>
<td>${r.started_at}</td>
<td>${r.backend_id}</td>
<td>${r.consecutive_count}</td>
<td>${r.cooldown_seconds}s</td>
<td>${h(r.response_summary)}</td>
</tr>`).join('');
} catch (e) { console.error(e); }
}
async function loadChartData() {
try {
const res = await fetch('/api/admin/stats/hourly?hours=168');
const data = await res.json();
// Group by hour, sum
const byHour = {};
data.forEach(r => {
const hour = r.hour_bucket.slice(0, 13);
if (!byHour[hour]) byHour[hour] = { cost: 0, tokens: 0 };
byHour[hour].cost += (r.cost || 0);
byHour[hour].tokens += (r.total_tokens || 0);
});
const hours = Object.keys(byHour).sort();
// Use `k` for the hour key so the h() escape helper is not shadowed
const costs = hours.map(k => byHour[k].cost);
const tokens = hours.map(k => byHour[k].tokens);
const labels = hours.map(k => k.slice(11, 13) + ':00 ' + k.slice(5, 10));
if (costChart) {
costChart.data.labels = labels;
costChart.data.datasets[0].data = costs;
costChart.update();
}
if (tokenChart) {
tokenChart.data.labels = labels;
tokenChart.data.datasets[0].data = tokens;
tokenChart.update();
}
} catch (e) { console.error(e); }
}
// ── Helpers ──
function fmt(n) { return (n || 0).toLocaleString(); }
function h(s) { const d=document.createElement('div'); d.textContent=s||''; return d.innerHTML; }
function formatDuration(s) {
const d = Math.floor(s / 86400);
const h = Math.floor((s % 86400) / 3600);
const m = Math.floor((s % 3600) / 60);
const parts = [];
if (d) parts.push(d + 'd');
if (h) parts.push(h + 'h');
if (m || !parts.length) parts.push(m + 'm');
return parts.join(' ');
}
// Initial load
document.addEventListener('DOMContentLoaded', () => {
// Ensure chart containers exist
if (!document.getElementById('cost-chart')) {
const chartsDiv = document.getElementById('charts');
if (chartsDiv) {
chartsDiv.innerHTML = `
<div class="chart-card"><h3>Cost Over Time</h3><canvas id="cost-chart"></canvas></div>
<div class="chart-card"><h3>Token Usage Over Time</h3><canvas id="token-chart"></canvas></div>`;
}
}
initCharts();
loadChartData();
});
</script>
</body>
</html>
@@ -0,0 +1,90 @@
# Sidecar V2 — API Key Encryption Rotation SOP
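> Version: v1.0 | Maintainer: 严维序 (opengineer)
## Background
Sidecar V2 encrypts every provider's API key at rest with AES-256-GCM. The encryption key is supplied through the `SIDECAR_ENCRYPTION_KEY` environment variable and initialized at startup by `init_crypto()`.
## ⚠️ Critical Warning
**Changing SIDECAR_ENCRYPTION_KEY makes every stored API key permanently unrecoverable!**
`try_decrypt_existing()` in `crypto.py` silently returns `None` once the key has changed, so existing ciphertext can no longer be decrypted. Complete the steps below before rotating the key.
## Safe Rotation Procedure
### Step 1: Export the backend list and collect the plaintext API keys (required)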
```bash
# Start sidecar with the old key and export the backend list through the admin API
curl -s -H "Authorization: Bearer <ADMIN_TOKEN>" \
http://127.0.0.1:9190/api/admin/backends | \
python3 -c "
import json, sys
data = json.load(sys.stdin)
# Note: api_key is masked in this output; obtain the original keys again from a secure channel
print(json.dumps(data, indent=2))
"
```
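The export above only yields masked keys, so it is mainly useful as a checklist of what has to be re-entered. A minimal sketch, assuming the response is a JSON array of objects with `id`, `name`, and `pool` fields (the fields the dashboard reads; the exact response shape is an assumption), and using a hypothetical helper name `rotation_checklist`:

```python
import json


def rotation_checklist(backends_json: str) -> list[str]:
    """Return one checklist line per backend from the admin API export.

    Assumes a JSON array of objects carrying `id`, `name`, and `pool`;
    missing fields are shown as "?" instead of raising.
    """
    backends = json.loads(backends_json)
    ordered = sorted(
        backends,
        key=lambda b: (str(b.get("pool", "?")), str(b.get("name", "?"))),
    )
    return [
        f"[ ] {b.get('pool', '?')}: {b.get('name', '?')} (id={b.get('id', '?')})"
        for b in ordered
    ]


if __name__ == "__main__":
    sample = '[{"id": "b1", "name": "NVIDIA H100 Primary", "pool": "primary"}]'
    print("\n".join(rotation_checklist(sample)))
```

Ticking off each line while collecting the original keys makes it harder to miss a provider before the old encryption key is discarded.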
### Step 2: Stop the service
```bash
systemctl stop sidecar-v2
# or
docker compose down
```
### Step 3: Back up the database
```bash
cp /app/data/sidecar_v2.db /app/data/backups/pre-rotation-$(date +%Y%m%d_%H%M%S).db
```
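With the service stopped, a plain `cp` is sufficient. If a copy ever has to be taken while the database is still open in WAL mode, SQLite's online backup API gives a consistent snapshot. A sketch using Python's standard `sqlite3` module; the function name and the paths passed to it are illustrative and not part of Sidecar V2:

```python
import sqlite3
from pathlib import Path


def backup_sqlite(src: str, dest: str) -> None:
    """Copy the database at `src` to `dest` via SQLite's online backup API."""
    Path(dest).parent.mkdir(parents=True, exist_ok=True)
    source = sqlite3.connect(src)
    target = sqlite3.connect(dest)
    try:
        # Copies every page, including changes that still live in the WAL file.
        source.backup(target)
    finally:
        target.close()
        source.close()
```

For example, `backup_sqlite("/app/data/sidecar_v2.db", "/app/data/backups/pre-rotation.db")` mirrors the `cp` command above; adjust both paths if `SIDECAR_DB_PATH` points elsewhere.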
### Step 4: Update the key
Update `SIDECAR_ENCRYPTION_KEY` in `/etc/sidecar-v2/env` or in the Docker `.env` file:
```
SIDECAR_ENCRYPTION_KEY=<new_64_hex_char_key>
```
Generate a new key:
```bash
python3 -c "import secrets; print(secrets.token_hex(32))"
```
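The placeholder above implies a 64-character hex string, that is, 32 bytes for AES-256. A small check can catch a truncated or mistyped value before the service is restarted. This is a sketch under the assumption that 64 hex characters is the only accepted format (the validation inside `init_crypto()` is not shown here), and `is_valid_encryption_key` is a hypothetical helper:

```python
import secrets


def is_valid_encryption_key(value: str) -> bool:
    """True if `value` is exactly 64 hex characters, i.e. 32 raw bytes."""
    try:
        # bytes.fromhex tolerates whitespace, so check both lengths.
        return len(value) == 64 and len(bytes.fromhex(value)) == 32
    except ValueError:
        return False


if __name__ == "__main__":
    candidate = secrets.token_hex(32)
    print(candidate, is_valid_encryption_key(candidate))
```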
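### Step 5: Re-enter the API keys
Because the old ciphertext is unreadable after the key change:
1. Start the service (at this point none of the existing providers' API keys are usable)
2. Re-enter every provider's API key through the Admin API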
```bash
curl -s -X PUT -H "Authorization: Bearer <ADMIN_TOKEN>" \
-H "Content-Type: application/json" \
-d '{"api_key": "<NEW_PLAIN_KEY>"}' \
http://127.0.0.1:9190/api/admin/backends/<backend_id>
```
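When there are many providers, the same PUT can be scripted. The sketch below uses only the standard library and mirrors the `curl` call above; the helper names, the `keys_by_backend` mapping, and the 10-second timeout are assumptions made for illustration:

```python
import json
import urllib.request


def build_key_update_request(
    base_url: str, admin_token: str, backend_id: str, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) the PUT request that stores a new API key."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/admin/backends/{backend_id}",
        data=json.dumps({"api_key": api_key}).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {admin_token}",
            "Content-Type": "application/json",
        },
    )


def reenter_keys(
    base_url: str, admin_token: str, keys_by_backend: dict[str, str]
) -> dict[str, int]:
    """Send one PUT per backend and return the HTTP status per backend id.

    urllib raises urllib.error.HTTPError on 4xx/5xx responses, so a failed
    update stops the loop instead of being silently skipped.
    """
    statuses: dict[str, int] = {}
    for backend_id, api_key in keys_by_backend.items():
        req = build_key_update_request(base_url, admin_token, backend_id, api_key)
        with urllib.request.urlopen(req, timeout=10) as resp:
            statuses[backend_id] = resp.status
    return statuses
```

Keeping request construction separate from sending makes it possible to inspect the URL and body for each backend before any key is actually written.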
### Step 6: Verify
```bash
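# Confirm that the providers report a healthy status (admin endpoints require the Bearer token)
curl -s -H "Authorization: Bearer <ADMIN_TOKEN>" \
http://127.0.0.1:9190/api/admin/pools
# Send a test request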
curl -s -X POST http://127.0.0.1:9190/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"<model_name>","messages":[{"role":"user","content":"test"}],"max_tokens":5}'
```
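The pool check can also be automated. The sketch below assumes `/api/admin/pools` returns the same shape that `PoolManager.get_pool_status()` builds, namely a mapping from pool name to counters such as `total` and `healthy`; that response shape and the helper name `unhealthy_pools` are assumptions:

```python
def unhealthy_pools(pools: dict[str, dict[str, int]]) -> list[str]:
    """Return the names of pools that have backends but no healthy one."""
    return [
        name
        for name, stats in pools.items()
        if stats.get("total", 0) > 0 and stats.get("healthy", 0) == 0
    ]


if __name__ == "__main__":
    sample = {
        "primary": {"total": 2, "healthy": 0, "cooling": 2},
        "fallback": {"total": 1, "healthy": 1},
    }
    print(unhealthy_pools(sample))
```

An empty result after the keys have been re-entered suggests that every configured pool has at least one backend able to serve traffic.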
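## Rollback Plan
If something goes wrong during the key rotation:
1. Restore the old key in the environment variable
2. Restore the pre-rotation database backup
3. Restart the service
The old API keys will work again, because any data that was not overwritten is still encrypted with the old key.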
@@ -0,0 +1,56 @@
# Sidecar V2 — Nginx reverse proxy config (reference)
# Place at /etc/nginx/sites-available/sidecar-v2.conf
# SSL certs managed by certbot or manually
upstream sidecar_v2_main {
server 127.0.0.1:9190;
}
upstream sidecar_v2_metrics {
server 127.0.0.1:9191;
}
server {
listen 443 ssl http2;
server_name sidecar.example.com;
ssl_certificate /etc/ssl/certs/sidecar.pem;
ssl_certificate_key /etc/ssl/private/sidecar.key;
# Dashboard + Admin API (main port)
location / {
proxy_pass http://sidecar_v2_main;
proxy_http_version 1.1;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
# SSE support for dashboard real-time data
location /dashboard/sse {
proxy_pass http://sidecar_v2_main;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_buffering off;
proxy_cache off;
chunked_transfer_encoding off;
proxy_read_timeout 86400s;
}
# Prometheus metrics
location /metrics {
proxy_pass http://sidecar_v2_metrics;
proxy_http_version 1.1;
proxy_set_header Host $host;
}
# Health check
location /health {
proxy_pass http://sidecar_v2_main;
proxy_http_version 1.1;
proxy_set_header Host $host;
}
}
@@ -0,0 +1,23 @@
[Unit]
Description=Sidecar V2 — Multi-Pool Provider Proxy
After=network.target
[Service]
Type=simple
User=openclaw
Group=openclaw
WorkingDirectory=/opt/sidecar-v2
EnvironmentFile=/etc/sidecar-v2/env
ExecStart=/opt/sidecar-v2/.venv/bin/python3 main.py
Restart=always
RestartSec=5
# Security hardening
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
ReadWritePaths=/opt/sidecar-v2/data
PrivateTmp=yes
[Install]
WantedBy=multi-user.target
@@ -0,0 +1,26 @@
# Sidecar V2 — Multi-Pool Provider Proxy
version: "3.9"
services:
sidecar-v2:
build: .
container_name: sidecar-v2
restart: unless-stopped
ports:
- "9190:9190" # Main proxy + admin API + dashboard
- "9191:9191" # Prometheus metrics
environment:
- SIDECAR_ENCRYPTION_KEY=${SIDECAR_ENCRYPTION_KEY}
- SIDECAR_ADMIN_TOKEN=${SIDECAR_ADMIN_TOKEN:?SIDECAR_ADMIN_TOKEN must be set}
- LOG_FORMAT=${LOG_FORMAT:-json}
- SIDECAR_HOST=0.0.0.0
- SIDECAR_PORT=9190
- SIDECAR_METRICS_PORT=9191
- SIDECAR_DB_PATH=/app/data/sidecar_v2.db
- SIDECAR_BACKUP_DIR=/app/data/backups
volumes:
- sidecar-data:/app/data
volumes:
sidecar-data:
driver: local
@@ -0,0 +1,17 @@
"""Sidecar V2 entry point."""
import uvicorn
from config import config
def main():
uvicorn.run(
"server:app",
host=config.host,
port=config.port,
log_level=config.log_level.lower(),
)
if __name__ == "__main__":
main()
@@ -0,0 +1,83 @@
"""Provider pool management: primary / fallback pool routing."""
import structlog
from typing import Optional
from storage.backend_store import list_backends, get_pool_stats
from storage.models import Backend
logger = structlog.get_logger("sidecar_v2.pool_manager")
class PoolManager:
"""Manages provider pools and selects healthy backends for a given model.
Priority: primary pool → fallback pool.
Within a pool: healthy, enabled backends only, in the order the store returns them.
"""
def __init__(self):
self._pool_order = ["primary", "fallback"]
def get_available_backends(
self, canonical_model: str, pool: Optional[str] = None
) -> list[Backend]:
"""Get all healthy, enabled backends that serve a model, in pool order.
Args:
canonical_model: Canonical model name to match.
pool: Optional pool filter (primary/fallback). None = all pools.
Returns:
List of ready backends in pool priority order (no further sorting is applied).
"""
backends: list[Backend] = []
pools_to_check = [pool] if pool else self._pool_order
for p in pools_to_check:
pool_backends = list_backends(pool=p, enabled_only=True, decrypt_key=True)
for b in pool_backends:
if b.status == "healthy" and b.has_model(canonical_model):
backends.append(b)
if pool:
break
return backends
def get_any_healthy_backends(self, pool: Optional[str] = None) -> list[Backend]:
"""Get all healthy, enabled backends regardless of model."""
backends: list[Backend] = []
pools_to_check = [pool] if pool else self._pool_order
for p in pools_to_check:
pool_backends = list_backends(pool=p, enabled_only=True, decrypt_key=True)
for b in pool_backends:
if b.status == "healthy":
backends.append(b)
if pool:
break
return backends
def get_pool_status(self) -> dict:
"""Get pool summary for dashboard."""
stats = get_pool_stats()
result = {}
for pool in self._pool_order:
s = stats.get(pool, {"total": 0, "enabled": 0, "healthy": 0, "cooling": 0, "error": 0})
result[pool] = s
# Also include any other pools
for pool, s in stats.items():
if pool not in result:
result[pool] = s
return result
def is_pool_available(self, canonical_model: str, pool: str = "primary") -> bool:
"""Check if a pool has any healthy backends for a model."""
backends = self.get_available_backends(canonical_model, pool=pool)
return len(backends) > 0
def is_any_pool_available(self, canonical_model: str) -> bool:
"""Check if any pool has healthy backends for a model."""
for pool in self._pool_order:
if self.is_pool_available(canonical_model, pool):
return True
return False
@@ -0,0 +1,383 @@
"""Proxy request handling for Sidecar V2 — multi-pool routing + cooldown + rate limiting."""
import asyncio
import json
import time
from typing import Any, Optional
import httpx
import structlog
from fastapi import Request
from fastapi.responses import JSONResponse, Response, StreamingResponse
from config import config
from pool_manager import PoolManager
from rate_limiter import PerBackendRateLimiter
from router import Router
from cooldown_manager import start_cooldown, check_and_clear_cooldown
from storage.models import Backend
from storage.usage_store import record_usage
# Emergency activation counter (read by metrics endpoint)
_emergency_count: int = 0
def get_emergency_count() -> int:
return _emergency_count
logger: structlog.stdlib.BoundLogger = structlog.get_logger("sidecar_v2.proxy")
def extract_model(body: dict[str, Any]) -> str:
"""Extract model identifier from request body."""
return str(body.get("model", "unknown"))
def build_error_response(status: int, message: str, error_type: str = "") -> JSONResponse:
"""Build a standard error response."""
return JSONResponse(
status_code=status,
content={
"error": {
"message": message,
"type": error_type or f"Error_{status}",
}
},
)
async def forward_to_backend(
backend: Backend,
method: str,
path: str,
body: bytes | None,
headers: dict[str, str],
stream: bool = False,
) -> httpx.Response:
"""Forward a request to a specific backend."""
upstream_url = backend.api_base_url.rstrip("/") + path
forward_headers = {
k: v
for k, v in headers.items()
if k.lower() not in ("host", "content-length", "transfer-encoding")
}
if backend.api_key_plain:
forward_headers["authorization"] = f"Bearer {backend.api_key_plain}"
elif "authorization" not in {k.lower() for k in forward_headers}:
forward_headers["authorization"] = "Bearer nvidia"
timeout = httpx.Timeout(backend.timeout_seconds)
async with httpx.AsyncClient(timeout=timeout) as client:
req = client.build_request(
method=method,
url=upstream_url,
headers=forward_headers,
content=body,
)
return await client.send(req, stream=stream)
def build_response(resp: httpx.Response) -> Response:
"""Convert httpx.Response to FastAPI Response."""
content_type = resp.headers.get("content-type", "")
headers = {
k: v
for k, v in resp.headers.items()
if k.lower() not in ("content-encoding", "transfer-encoding", "content-length")
}
is_sse = "text/event-stream" in content_type
is_chunked = resp.headers.get("transfer-encoding", "").lower() == "chunked"
if is_sse or (is_chunked and headers.get("content-type", "") != "application/octet-stream"):
return StreamingResponse(
content=resp.aiter_bytes(),
status_code=resp.status_code,
headers=headers,
media_type=content_type or "text/event-stream",
)
return Response(
content=resp.content,
status_code=resp.status_code,
headers=headers,
media_type=content_type or "application/json",
)
def extract_usage_from_response(
resp: httpx.Response,
resp_json: dict[str, Any],
model: str,
) -> tuple[int, int, int]:
"""Extract token usage from response body (OpenAI-compatible)."""
usage = resp_json.get("usage", {})
prompt_tokens = usage.get("prompt_tokens", 0) or 0
completion_tokens = usage.get("completion_tokens", 0) or 0
# Try streaming chunks: aggregate from choices
if not prompt_tokens and not completion_tokens:
choices = resp_json.get("choices", [])
for choice in choices:
if isinstance(choice, dict):
tokens = choice.get("usage", {})
prompt_tokens += tokens.get("prompt_tokens", 0) or 0
completion_tokens += tokens.get("completion_tokens", 0) or 0
total_tokens = prompt_tokens + completion_tokens
if total_tokens == 0:
total_tokens = usage.get("total_tokens", 0) or 0
return prompt_tokens, completion_tokens, total_tokens
def calculate_cost(
backend: Backend,
model: str,
prompt_tokens: int,
completion_tokens: int,
) -> float:
"""Calculate cost using backend's model pricing."""
cost_info = backend.get_model_cost(model)
input_cost = cost_info.get("input", 0.0)
output_cost = cost_info.get("output", 0.0)
# Costs are per token
return (prompt_tokens * input_cost + completion_tokens * output_cost)
async def handle_proxy_request(
pool_manager: PoolManager,
rate_limiter: PerBackendRateLimiter,
router: Router,
request: Request,
path: str,
) -> Response:
"""Main proxy handler: multi-pool routing with cooldown and rate limiting.
Flow:
1. Extract model → canonical name
2. Pick backend via Router (primary → fallback)
3. Forward request
4. If 429 → cooldown backend, retry with another
5. If all pools exhausted → emergency mode
6. Track usage
"""
start_time = time.monotonic()
body_bytes: bytes = await request.body()
raw_headers: dict[str, str] = dict(request.headers)
body_json: dict[str, Any] = {}
try:
if body_bytes:
parsed = json.loads(body_bytes)
if isinstance(parsed, dict):
body_json = parsed
except (ValueError, TypeError):
body_json = {}
canonical_model = extract_model(body_json)
is_stream = body_json.get("stream", False)
# Try with pool routing
max_retries = config.max_pool_retries
for attempt in range(max_retries):
# Check and clear expired cooldowns before picking
_refresh_cooldowns()
backend = router.pick_backend(canonical_model)
if backend is None:
break # No backend available, fall through to emergency
try:
resp = await forward_to_backend(
backend=backend,
method=request.method,
path=path,
body=body_bytes if body_bytes else None,
headers=raw_headers,
stream=is_stream,
)
elapsed_ms = int((time.monotonic() - start_time) * 1000)
# Handle 429 — cooldown and retry
if resp.status_code == 429:
new_count = backend.consecutive_429_count + 1
start_cooldown(backend.id, new_count)
resp_body = ""
try:
resp_body = resp.text[:200]
except Exception:
pass
logger.warning(
"backend_429_cooldown",
backend_id=backend.id,
pool=backend.pool,
consecutive=new_count,
model=canonical_model,
response_summary=resp_body,
)
# Track the error
record_usage(
backend_id=backend.id,
model=canonical_model,
prompt_tokens=0,
completion_tokens=0,
cost=0.0,
latency_ms=elapsed_ms,
is_error=True,
)
continue # Retry with another backend
# Success — track usage
resp_json: dict[str, Any] = {}
try:
if not is_stream and resp.content:
resp_json = json.loads(resp.content)
except (ValueError, TypeError):
pass
prompt_tokens, completion_tokens, total_tokens = extract_usage_from_response(
resp, resp_json, canonical_model
)
cost = calculate_cost(
backend, canonical_model, prompt_tokens, completion_tokens
)
record_usage(
backend_id=backend.id,
model=canonical_model,
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
cost=cost,
latency_ms=elapsed_ms,
)
logger.info(
"request_completed",
backend_id=backend.id,
pool=backend.pool,
model=canonical_model,
status=resp.status_code,
tokens=total_tokens,
cost=round(cost, 6),
elapsed_ms=elapsed_ms,
)
return build_response(resp)
except httpx.TimeoutException:
logger.warning(
"backend_timeout",
backend_id=backend.id,
model=canonical_model,
)
continue
except (httpx.ConnectError, httpx.RemoteProtocolError) as exc:
logger.warning(
"backend_connection_error",
backend_id=backend.id,
model=canonical_model,
error=str(exc),
)
continue
except Exception as exc:
logger.error(
"proxy_error",
backend_id=backend.id,
model=canonical_model,
error=str(exc),
)
continue
# All pools exhausted — emergency rate-limited passthrough
emergency_rpm = int(config.default_rpm_limit * config.emergency_rpm_fraction)
if emergency_rpm < 1:
emergency_rpm = 1
logger.warning(
"all_pools_exhausted_emergency",
model=canonical_model,
emergency_rpm=emergency_rpm,
)
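# Track emergency activation for metrics.
# `global` is required: without it the augmented assignment makes the name
# local to this function and raises UnboundLocalError.
global _emergency_count
_emergency_count += 1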
# Emergency: try to get a token from any healthy backend (all pools) at reduced RPM
emergency_retries = 3
for attempt in range(emergency_retries):
backends = pool_manager.get_any_healthy_backends()
for backend in backends:
if rate_limiter.consume(backend.id, emergency_rpm):
try:
resp = await forward_to_backend(
backend=backend,
method=request.method,
path=path,
body=body_bytes if body_bytes else None,
headers=raw_headers,
stream=is_stream,
)
elapsed_ms = int((time.monotonic() - start_time) * 1000)
if resp.status_code == 429:
start_cooldown(backend.id, backend.consecutive_429_count + 1)
continue
# Success in emergency mode
try:
resp_json: dict[str, Any] = {}
if not is_stream and resp.content:
resp_json = json.loads(resp.content)
except Exception:
resp_json = {}
prompt_tokens, completion_tokens, total_tokens = extract_usage_from_response(
resp, resp_json, canonical_model
)
cost_em = calculate_cost(backend, canonical_model, prompt_tokens, completion_tokens)
record_usage(
backend_id=backend.id,
model=canonical_model,
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
cost=cost_em,
latency_ms=elapsed_ms,
)
logger.info(
"emergency_passthrough_success",
backend_id=backend.id,
model=canonical_model,
emergency_rpm=emergency_rpm,
)
return build_response(resp)
except Exception:
continue
# All emergency attempts failed — return 503 for OpenClaw fallback chain
return build_error_response(
503,
"All provider pools exhausted. OpenClaw fallback chain should activate.",
"AllPoolsExhausted",
)
def _refresh_cooldowns() -> None:
"""Check and clear expired cooldowns for backends currently in cooling state.
Lists all backends and checks only those with status='cooling' (the health_check_loop
handles the periodic scanning; this is the on-demand refresh before proxy routing)."""
from storage.backend_store import list_backends
backends = list_backends(decrypt_key=False)
for backend in backends:
if backend.status == "cooling":
check_and_clear_cooldown(backend.id)
@@ -0,0 +1,111 @@
"""Per-backend rate limiter using token bucket algorithm."""
import threading
import time
from typing import Any
class PerBackendRateLimiter:
"""Manages independent token buckets for each backend.
Thread-safe. Each backend gets its own bucket with configurable RPM.
"""
def __init__(self, refill_interval_ms: int = 50):
self._buckets: dict[str, _TokenBucket] = {}
self._lock = threading.Lock()
self._refill_interval_ms = refill_interval_ms
def ensure_bucket(self, backend_id: str, rpm_limit: int) -> None:
"""Create or update a bucket for a backend."""
with self._lock:
if backend_id in self._buckets:
existing = self._buckets[backend_id]
existing.update_rate(rpm_limit)
else:
self._buckets[backend_id] = _TokenBucket(
rate=rpm_limit / 60.0,
capacity=max(rpm_limit, 1),
)
def remove_bucket(self, backend_id: str) -> None:
"""Remove a backend's bucket."""
with self._lock:
self._buckets.pop(backend_id, None)
def consume(self, backend_id: str, rpm_limit: int, tokens: int = 1) -> bool:
"""Try to consume tokens for a backend. Returns True if allowed.
Auto-creates the bucket if needed.
"""
self.ensure_bucket(backend_id, rpm_limit)
with self._lock:
bucket = self._buckets.get(backend_id)
if bucket is None:
return False
return bucket.consume(tokens)
def get_status(self, backend_id: str) -> dict[str, Any] | None:
"""Get bucket status for a backend."""
with self._lock:
bucket = self._buckets.get(backend_id)
if bucket is None:
return None
return bucket.get_status()
def get_all_status(self) -> dict[str, dict[str, Any]]:
"""Get status of all buckets."""
with self._lock:
return {bid: b.get_status() for bid, b in self._buckets.items()}
class _TokenBucket:
"""Internal token bucket with refill."""
def __init__(self, rate: float, capacity: int):
self._rate = float(rate)
self._capacity = int(capacity)
self._tokens = float(capacity)
self._last_refill = time.monotonic()
self._lock = threading.Lock()
def _refill(self) -> None:
now = time.monotonic()
elapsed = now - self._last_refill
if elapsed > 0 and self._rate > 0:
self._tokens = min(self._tokens + elapsed * self._rate, float(self._capacity))
self._last_refill = now
def consume(self, tokens: int = 1) -> bool:
if tokens <= 0:
return True
with self._lock:
self._refill()
if self._tokens >= tokens:
self._tokens -= tokens
return True
return False
def update_rate(self, rpm_limit: int) -> None:
new_rate = rpm_limit / 60.0
with self._lock:
self._refill()
self._rate = new_rate
self._capacity = max(rpm_limit, 1)
self._tokens = min(self._tokens, float(self._capacity))
def get_status(self) -> dict[str, Any]:
with self._lock:
self._refill()
rate_per_minute = self._rate * 60.0
utilization = 0.0 if self._capacity == 0 else (
(self._capacity - self._tokens) / self._capacity
)
return {
"tokens": round(self._tokens, 2),
"capacity": self._capacity,
"rate_per_minute": round(rate_per_minute, 1),
"utilization": round(utilization, 4),
}
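The refill arithmetic above can be checked by hand: a limit of 60 RPM gives a rate of 1 token per second and a capacity of 60. Below is a minimal single-threaded sketch of the same idea with an injectable clock so the behaviour is deterministic. `FakeClock` and `SketchBucket` are illustrative names and are not part of the module.

```python
class FakeClock:
    """Manually advanced clock standing in for time.monotonic()."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


class SketchBucket:
    """Token bucket: starts full, refills at rpm_limit / 60 tokens per second."""

    def __init__(self, rpm_limit: int, clock) -> None:
        self.rate = rpm_limit / 60.0
        self.capacity = max(rpm_limit, 1)
        self.tokens = float(self.capacity)
        self.clock = clock
        self.last_refill = clock()

    def consume(self, tokens: int = 1) -> bool:
        # Refill for the elapsed time, capped at capacity, then try to spend.
        now = self.clock()
        elapsed = now - self.last_refill
        if elapsed > 0:
            self.tokens = min(self.tokens + elapsed * self.rate, float(self.capacity))
        self.last_refill = now
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False


clock = FakeClock()
bucket = SketchBucket(rpm_limit=60, clock=clock)
burst = sum(bucket.consume() for _ in range(61))  # 61 attempts against a full bucket
clock.now += 1.0                                  # one second passes
after_one_second = bucket.consume()
```

Because the bucket starts full, a backend can absorb a burst of up to `rpm_limit` requests at once and is then held to the steady refill rate.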
@@ -0,0 +1,6 @@
# Sidecar V2 — Multi-Pool Provider Proxy
fastapi>=0.115.0,<1.0.0
uvicorn[standard]>=0.30.0,<1.0.0
httpx>=0.27.0,<1.0.0
structlog>=24.0.0,<25.0.0
cryptography>=42.0.0,<44.0.0
@@ -0,0 +1,62 @@
"""Model → Backend routing logic for Sidecar V2."""
import structlog
from typing import Optional
from storage.models import Backend
from pool_manager import PoolManager
from rate_limiter import PerBackendRateLimiter
logger = structlog.get_logger("sidecar_v2.router")
class Router:
"""Routes model requests to the best available backend.
Pick strategy:
1. Primary pool → healthy backends supporting the model
2. Rate-limiter check → skip if RPM exhausted
3. Fallback pool → repeat above
4. If all exhausted → return None (caller handles emergency)
"""
def __init__(self, pool_manager: PoolManager, rate_limiter: PerBackendRateLimiter):
self._pool_manager = pool_manager
self._rate_limiter = rate_limiter
def pick_backend(self, canonical_model: str) -> Optional[Backend]:
"""Pick the best available backend for a model.
Tries primary pool first, then fallback.
Within each pool, skips backends at RPM limit.
Returns None if no backend available.
"""
# Try pools in order
for pool in ["primary", "fallback"]:
backends = self._pool_manager.get_available_backends(
canonical_model, pool=pool
)
for backend in backends:
# Rate-limit check
if self._rate_limiter.consume(
backend.id, backend.rpm_limit
):
return backend
# Skip this backend, try next
logger.debug(
"backend_rate_limited",
backend_id=backend.id,
pool=pool,
model=canonical_model,
)
if not backends:
logger.debug("pool_exhausted", pool=pool, model=canonical_model)
else:
logger.debug("pool_rpm_exhausted", pool=pool, model=canonical_model)
return None
def get_all_pools_exhausted_info(self, canonical_model: str) -> bool:
"""Check if ALL pools are exhausted for a model."""
return not self._pool_manager.is_any_pool_available(canonical_model)
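The pick strategy described in the `Router` docstring can be sketched without the storage layer. This is an illustration only: `pick_backend_sketch`, the `pools` mapping and the `try_consume` callback are hypothetical stand-ins for `PoolManager` and `PerBackendRateLimiter`.

```python
from typing import Callable, Optional


def pick_backend_sketch(
    pools: dict[str, list[str]],
    try_consume: Callable[[str], bool],
) -> Optional[str]:
    """Return the first backend, in pool priority order, that still has RPM budget."""
    for pool in ("primary", "fallback"):
        for backend_id in pools.get(pool, []):
            if try_consume(backend_id):
                return backend_id
    return None


# Remaining request budget per backend (stand-in for the token buckets).
remaining = {"p1": 0, "p2": 1, "f1": 5}


def try_consume(backend_id: str) -> bool:
    if remaining[backend_id] > 0:
        remaining[backend_id] -= 1
        return True
    return False


pools = {"primary": ["p1", "p2"], "fallback": ["f1"]}
first = pick_backend_sketch(pools, try_consume)   # p1 is exhausted, p2 has budget
second = pick_backend_sketch(pools, try_consume)  # primary is now exhausted
nothing = pick_backend_sketch({"primary": [], "fallback": []}, try_consume)
```

Note that the budget check and the consumption are one step, so a backend that is picked has already been charged one token even if the upstream request later fails.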
@@ -0,0 +1,712 @@
"""Sidecar V2 — FastAPI server with multi-pool routing, admin API, dashboard SSE."""
import asyncio
import json
import os
import sys
import time
from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager
from typing import Any, Optional
import structlog
from fastapi import Depends, FastAPI, HTTPException, Request, Response
from fastapi.responses import FileResponse, HTMLResponse, JSONResponse, StreamingResponse
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
from config import config as app_config
from crypto import init_crypto, is_initialized
from pool_manager import PoolManager
from rate_limiter import PerBackendRateLimiter
from router import Router
from proxy import handle_proxy_request, get_emergency_count
from storage.db import init_db, create_tables, run_integrity_check, get_connection, _DB_PATH
from storage.backend_store import (
create_backend, get_backend, list_backends, update_backend,
delete_backend, get_pool_stats,
)
from storage.usage_store import get_total_stats, get_hourly_usage, get_daily_stats, aggregate_daily_stats
from storage.cooldown_store import get_cooldown_history
from storage.config_store import get_config, set_config, list_configs, delete_config
from storage.models import Backend, ModelMapping
# ──────────────────────────────────────────────────────────
# Logging
# ──────────────────────────────────────────────────────────
_LOG_FORMAT = os.getenv("LOG_FORMAT", "console").lower()
structlog.configure(
processors=[
structlog.stdlib.filter_by_level,
structlog.stdlib.add_logger_name,
structlog.stdlib.add_log_level,
structlog.stdlib.PositionalArgumentsFormatter(),
structlog.processors.TimeStamper(fmt="iso"),
structlog.processors.StackInfoRenderer(),
structlog.processors.format_exc_info,
structlog.processors.UnicodeDecoder(),
(
structlog.processors.JSONRenderer()
if _LOG_FORMAT == "json"
else structlog.dev.ConsoleRenderer()
),
],
context_class=dict,
logger_factory=structlog.stdlib.LoggerFactory(),
wrapper_class=structlog.stdlib.BoundLogger,
cache_logger_on_first_use=True,
)
logger: structlog.stdlib.BoundLogger = structlog.get_logger("sidecar_v2.server")
# ──────────────────────────────────────────────────────────
# Admin Auth middleware
# ──────────────────────────────────────────────────────────
_security = HTTPBearer(auto_error=False)
def verify_admin_token(
    credentials: Optional[HTTPAuthorizationCredentials] = Depends(_security),
) -> bool:
    """Return True only when a Bearer token is presented and matches config.admin_token.

    If admin_token is empty, no token can match and this always returns False,
    so write operations stay rejected. Read endpoints do not depend on this
    check and remain open for dashboard use.
    """
    if not app_config.admin_token or credentials is None:
        return False
    return credentials.credentials == app_config.admin_token
def require_admin(credentials: Optional[HTTPAuthorizationCredentials] = Depends(_security)):
"""Require admin auth — raise 401 if not authorized."""
if not app_config.admin_token:
raise HTTPException(
status_code=401,
detail="Admin API not configured: set SIDECAR_ADMIN_TOKEN",
)
if credentials is None:
raise HTTPException(
status_code=401,
detail="Missing Authorization header",
headers={"WWW-Authenticate": "Bearer"},
)
if credentials.credentials != app_config.admin_token:
raise HTTPException(
status_code=401,
detail="Invalid admin token",
)
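The token checks above compare strings with `==` and `!=`. One possible hardening, shown here as a sketch rather than as part of this change, is a constant-time comparison with the standard library's `hmac.compare_digest`. `token_matches` is a hypothetical helper name.

```python
import hmac


def token_matches(presented: str, expected: str) -> bool:
    """Compare two tokens in constant time; an empty expected token never matches."""
    if not expected:
        return False
    # Encode to bytes so non-ASCII input is handled uniformly.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))


ok = token_matches("s3cret", "s3cret")
wrong = token_matches("s3cret", "other")
unconfigured = token_matches("", "")
```

The early return for an empty expected token mirrors the behaviour above, where an unset `admin_token` rejects every write.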
# ──────────────────────────────────────────────────────────
# Global runtime state
# ──────────────────────────────────────────────────────────
pool_manager: Optional[PoolManager] = None
rate_limiter: Optional[PerBackendRateLimiter] = None
router: Optional[Router] = None
start_time: float = 0.0
# In-memory metrics counters
_metrics_counters: dict[str, int] = {}
_metrics_lock = asyncio.Lock()
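def _inc_metric(key: str, delta: int = 1) -> None:
    """Increment an in-memory counter.

    Not guarded by _metrics_lock: the read-modify-write has no await in
    between, so it is safe only when called from the event loop thread.
    """
    _metrics_counters[key] = _metrics_counters.get(key, 0) + delta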
def get_pm() -> PoolManager:
assert pool_manager is not None
return pool_manager
def get_rl() -> PerBackendRateLimiter:
assert rate_limiter is not None
return rate_limiter
def get_router() -> Router:
assert router is not None
return router
# ──────────────────────────────────────────────────────────
# Lifespan
# ──────────────────────────────────────────────────────────
@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, Any]:
global pool_manager, rate_limiter, router, start_time
# P0: Encryption key is mandatory — refuse to start without it
if not app_config.encryption_key:
logger.critical(
"missing_encryption_key",
hint="Set SIDECAR_ENCRYPTION_KEY (64 hex chars). Refusing to start."
)
sys.exit(1)
init_crypto(app_config.encryption_key)
logger.info("crypto_initialized")
# P0: Warn if admin_token not set
if not app_config.admin_token:
logger.warning(
"admin_token_not_set",
hint="Admin write endpoints disabled until SIDECAR_ADMIN_TOKEN is configured."
)
# Init DB
init_db()
create_tables()
ok = run_integrity_check()
if not ok:
logger.error("db_integrity_check_failed")
# Init runtime components
pool_manager = PoolManager()
rate_limiter = PerBackendRateLimiter(
refill_interval_ms=app_config.rate_limiter_refill_interval_ms,
)
router = Router(pool_manager, rate_limiter)
start_time = time.time()
# Start background tasks
health_task = asyncio.create_task(_health_check_loop())
stats_task = asyncio.create_task(_stats_aggregation_loop())
backup_task = asyncio.create_task(_backup_loop())
logger.info(
"sidecar_v2_started",
host=app_config.host,
port=app_config.port,
metrics_port=app_config.metrics_port,
)
try:
yield
finally:
for task in [health_task, stats_task, backup_task]:
task.cancel()
try:
await task
except asyncio.CancelledError:
pass
logger.info("sidecar_v2_stopped")
app = FastAPI(
title="Sidecar V2 — Multi-Pool Provider Proxy",
version="2.0.0",
lifespan=lifespan,
)
# ──────────────────────────────────────────────────────────
# Background tasks
# ──────────────────────────────────────────────────────────
async def _health_check_loop() -> None:
"""Periodic health checks: clear expired cooldowns + active probing of backends."""
from cooldown_manager import check_and_clear_cooldown
import httpx
while True:
try:
backends = list_backends(decrypt_key=True)
for b in backends:
# 1. Clear expired cooldowns
if b.status == "cooling":
check_and_clear_cooldown(b.id)
# 2. Active health probing for healthy/enabled backends
if b.status == "healthy" and b.enabled:
try:
async with httpx.AsyncClient(timeout=httpx.Timeout(
app_config.health_check_timeout_seconds
)) as client:
probe_url = b.api_base_url.rstrip("/") + app_config.health_probe_endpoint
headers = {}
if b.api_key_plain:
headers["Authorization"] = f"Bearer {b.api_key_plain}"
start = time.monotonic()
resp = await client.get(probe_url, headers=headers)
elapsed_ms = int((time.monotonic() - start) * 1000)
# Update health state in DB
with get_connection() as conn:
conn.execute(
"""INSERT INTO backend_health
(backend_id, state, last_latency_ms, last_status_code,
last_check_at)
VALUES (?, 'healthy', ?, ?, datetime('now'))
ON CONFLICT(backend_id) DO UPDATE SET
state = excluded.state,
last_latency_ms = excluded.last_latency_ms,
last_status_code = excluded.last_status_code,
last_check_at = excluded.last_check_at""",
(b.id, elapsed_ms, resp.status_code),
)
conn.commit()
logger.debug(
"health_probe_ok",
backend_id=b.id,
status=resp.status_code,
latency_ms=elapsed_ms,
)
except Exception as probe_err:
logger.warning(
"health_probe_failed",
backend_id=b.id,
error=str(probe_err),
)
# Mark as degraded
with get_connection() as conn:
conn.execute(
"""INSERT INTO backend_health
(backend_id, state, last_check_at)
VALUES (?, 'degraded', datetime('now'))
ON CONFLICT(backend_id) DO UPDATE SET
state = 'degraded',
last_check_at = excluded.last_check_at""",
(b.id,),
)
conn.execute(
"""UPDATE backend_health SET
consecutive_failures = consecutive_failures + 1
WHERE backend_id = ?""",
(b.id,),
)
conn.commit()
except Exception:
logger.exception("health_check_error")
await asyncio.sleep(app_config.health_check_interval_seconds)
async def _stats_aggregation_loop() -> None:
"""Periodically aggregate daily stats."""
while True:
try:
today = time.strftime("%Y-%m-%d", time.gmtime())
aggregate_daily_stats(today)
except Exception:
logger.exception("stats_aggregation_error")
await asyncio.sleep(app_config.stats_refresh_interval_seconds)
async def _backup_loop() -> None:
    """Daily SQLite backup with retention."""
    import sqlite3
    while True:
        try:
            await asyncio.sleep(86400)  # 24 hours
            backup_dir = app_config.backup_dir
            if not backup_dir:
                continue
            os.makedirs(backup_dir, exist_ok=True)
            backup_name = f"sidecar_v2_{time.strftime('%Y%m%d_%H%M%S', time.gmtime())}.db"
            backup_path = os.path.join(backup_dir, backup_name)
            # Import at call time: _DB_PATH is only set once init_db() has run.
            from storage.db import _DB_PATH as db_path
            source = sqlite3.connect(db_path)
            dest = sqlite3.connect(backup_path)
            try:
                # Note: backup() blocks the event loop while it copies pages.
                source.backup(dest)
            finally:
                # Close both handles even if the copy fails.
                dest.close()
                source.close()
            logger.info("db_backup_created", path=backup_path)
            # Retention: remove backups older than backup_retention_days
            retention_days = app_config.backup_retention_days
            cutoff = time.time() - retention_days * 86400
            for fname in os.listdir(backup_dir):
                if fname.startswith("sidecar_v2_") and fname.endswith(".db"):
                    fpath = os.path.join(backup_dir, fname)
                    if os.path.getmtime(fpath) < cutoff:
                        os.remove(fpath)
                        logger.info("db_backup_retired", path=fpath)
        except Exception:
            logger.exception("backup_error")
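The backup step relies on `sqlite3.Connection.backup`, which copies a live database page by page into another connection. A minimal sketch using two in-memory databases follows; `copy_database` is a hypothetical helper, and the table layout is simplified for illustration.

```python
import sqlite3
from contextlib import closing


def copy_database(source: sqlite3.Connection, dest: sqlite3.Connection) -> None:
    """Copy every page of source into dest using SQLite's online backup API."""
    source.backup(dest)


with closing(sqlite3.connect(":memory:")) as src, closing(sqlite3.connect(":memory:")) as dst:
    src.execute("CREATE TABLE backends (id TEXT PRIMARY KEY, name TEXT)")
    src.execute("INSERT INTO backends VALUES ('bkd_1', 'example')")
    src.commit()
    copy_database(src, dst)
    copied = dst.execute("SELECT id, name FROM backends").fetchall()
```

`closing` guarantees both connections are released even if the copy raises, which matters for a long-running loop that would otherwise leak a handle per failed backup.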
# ──────────────────────────────────────────────────────────
# Health / Metrics
# ──────────────────────────────────────────────────────────
@app.get("/health")
async def health() -> dict[str, Any]:
return {
"status": "ok",
"version": "2.0.0",
"uptime_seconds": int(time.time() - start_time),
}
@app.get("/metrics")
async def metrics() -> Response:
"""Prometheus-compatible metrics endpoint."""
lines = []
# Pool provider counts
pool_status = pool_manager.get_pool_status()
for pool_name, stats in pool_status.items():
for key, val in stats.items():
lines.append(
f"sidecar_pool_providers{{pool=\"{pool_name}\",type=\"{key}\"}} {val}"
)
# Cooldown status
all_backends = list_backends(decrypt_key=False)
cooling_count = sum(1 for b in all_backends if b.status == "cooling")
lines.append(f"sidecar_cooldown_active {cooling_count}")
# Emergency count (from proxy module)
lines.append(f"sidecar_emergency_count {get_emergency_count()}")
# DB sizes
from storage.db import get_db_sizes
sizes = get_db_sizes()
lines.append(f"sidecar_db_size_bytes {sizes.get('db_bytes', 0)}")
lines.append(f"sidecar_wal_size_bytes {sizes.get('wal_bytes', 0)}")
# Total stats
total = get_total_stats()
lines.append(f"sidecar_requests_total {total.get('total_requests', 0) or 0}")
lines.append(f"sidecar_errors_total {total.get('total_errors', 0) or 0}")
lines.append(f"sidecar_tokens_total {total.get('total_tokens', 0) or 0}")
cost = total.get('total_cost', 0) or 0.0
lines.append(f"sidecar_cost_total {cost}")
# Uptime
lines.append(f"sidecar_uptime_seconds {int(time.time() - start_time)}")
return Response(
content="\n".join(lines) + "\n",
media_type="text/plain; charset=utf-8",
)
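The `/metrics` handler builds Prometheus text-format lines with f-strings. Label values come from stored data (pool names), so a value containing a quote or backslash would produce a malformed line. The sketch below shows one way to render a sample with escaped label values; `escape_label_value` and `render_sample` are hypothetical helpers, not part of the server.

```python
from typing import Optional


def escape_label_value(value: str) -> str:
    """Escape backslash, newline and double quote as the text format requires."""
    return value.replace("\\", "\\\\").replace("\n", "\\n").replace('"', '\\"')


def render_sample(name: str, value, labels: Optional[dict] = None) -> str:
    """Render one sample line: name{label="value",...} value"""
    if not labels:
        return f"{name} {value}"
    body = ",".join(
        f'{key}="{escape_label_value(str(val))}"' for key, val in labels.items()
    )
    return f"{name}{{{body}}} {value}"


plain = render_sample("sidecar_cooldown_active", 2)
labelled = render_sample(
    "sidecar_pool_providers", 3, {"pool": "primary", "type": "healthy"}
)
escaped = escape_label_value('a"b')
```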
# ──────────────────────────────────────────────────────────
# Dashboard SSE
# ──────────────────────────────────────────────────────────
@app.get("/dashboard/sse")
async def dashboard_sse() -> StreamingResponse:
"""SSE endpoint for real-time dashboard data."""
async def event_generator():
while True:
try:
pool_status = pool_manager.get_pool_status()
total_stats = get_total_stats()
all_backends = list_backends(decrypt_key=False)
backends_list = []
for b in all_backends:
rl_status = rate_limiter.get_status(b.id)
backends_list.append({
"id": b.id,
"name": b.name,
"label": b.label,
"pool": b.pool,
"enabled": b.enabled,
"status": b.status,
"rpm_limit": b.rpm_limit,
"cooldown_until": b.cooldown_until,
"consecutive_429_count": b.consecutive_429_count,
"model_count": len(b.model_mappings),
"rate_limiter": rl_status,
})
snapshot = {
"type": "snapshot",
"pool": pool_status,
"total": total_stats,
"backends": backends_list,
"uptime_seconds": int(time.time() - start_time),
"timestamp": time.time(),
}
yield f"data: {json.dumps(snapshot)}\n\n"
except Exception:
logger.exception("sse_error")
await asyncio.sleep(app_config.dashboard_sse_interval_seconds)
return StreamingResponse(
event_generator(),
media_type="text/event-stream",
headers={
"Cache-Control": "no-cache",
"Connection": "keep-alive",
"X-Accel-Buffering": "no",
},
)
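Each snapshot is sent as a single Server-Sent Events frame: a `data:` line followed by a blank line. The sketch below shows the framing and a matching parser for this single-event case; `sse_frame` and `parse_sse_frame` are illustrative names, and the parser is a simplification rather than a full SSE client.

```python
import json


def sse_frame(payload: dict) -> str:
    """Serialize payload as one SSE event; the blank line terminates the event."""
    return f"data: {json.dumps(payload)}\n\n"


def parse_sse_frame(frame: str) -> dict:
    """Collect the data lines of one event and decode them as JSON."""
    data_lines = [
        line[len("data:"):].removeprefix(" ")
        for line in frame.split("\n")
        if line.startswith("data:")
    ]
    return json.loads("\n".join(data_lines))


snapshot = {"type": "snapshot", "uptime_seconds": 5}
frame = sse_frame(snapshot)
round_trip = parse_sse_frame(frame)
```

`json.dumps` escapes newlines inside strings by default, so the payload always fits on one `data:` line.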
# ──────────────────────────────────────────────────────────
# Admin: Backend CRUD (READ: public, WRITE: auth required)
# ──────────────────────────────────────────────────────────
@app.get("/api/admin/backends")
async def admin_list_backends(pool: Optional[str] = None) -> list[dict]:
"""List all backends with masked keys (public read)."""
backends = list_backends(pool=pool, decrypt_key=True)
return [b.to_dict(mask_key=True) for b in backends]
@app.get("/api/admin/backends/{backend_id}")
async def admin_get_backend(backend_id: str) -> dict:
"""Get a single backend (public read, key masked)."""
b = get_backend(backend_id, decrypt_key=True)
if b is None:
raise HTTPException(404, "Backend not found")
return b.to_dict(mask_key=True)
@app.post("/api/admin/backends")
async def admin_create_backend(
body: dict[str, Any],
_auth=Depends(require_admin),
) -> dict:
"""Create a new backend (auth required)."""
required = ["name", "api_base_url", "api_key"]
for field in required:
if field not in body:
raise HTTPException(400, f"Missing required field: {field}")
model_mappings_raw = body.get("model_mappings", {})
model_mappings = {}
for canonical_name, mm in model_mappings_raw.items():
model_mappings[canonical_name] = ModelMapping.from_dict(mm)
backend = Backend(
name=body["name"],
label=body.get("label", ""),
api_base_url=body["api_base_url"],
api_key_plain=body["api_key"],
api=body.get("api", "openai-completions"),
timeout_seconds=body.get("timeout_seconds", 120),
rpm_limit=body.get("rpm_limit", app_config.default_rpm_limit),
pool=body.get("pool", "primary"),
enabled=body.get("enabled", True),
model_mappings=model_mappings,
source=body.get("source", "webui"),
)
created = create_backend(backend)
return created.to_dict(mask_key=True)
@app.put("/api/admin/backends/{backend_id}")
async def admin_update_backend(
backend_id: str,
body: dict[str, Any],
_auth=Depends(require_admin),
) -> dict:
"""Update a backend (auth required)."""
updates = dict(body)
if "model_mappings" in updates:
raw = updates["model_mappings"]
updates["model_mappings"] = {
k: ModelMapping.from_dict(v) for k, v in raw.items()
}
if "api_key" in updates:
updates["api_key_plain"] = updates.pop("api_key")
updated = update_backend(backend_id, updates)
if updated is None:
raise HTTPException(404, "Backend not found")
return updated.to_dict(mask_key=True)
@app.delete("/api/admin/backends/{backend_id}")
async def admin_delete_backend(
backend_id: str,
_auth=Depends(require_admin),
) -> dict:
"""Delete a backend (auth required)."""
ok = delete_backend(backend_id)
if not ok:
raise HTTPException(404, "Backend not found")
return {"status": "deleted", "id": backend_id}
# ──────────────────────────────────────────────────────────
# Admin: Pool Status (public read)
# ──────────────────────────────────────────────────────────
@app.get("/api/admin/pools")
async def admin_pool_status() -> dict:
return pool_manager.get_pool_status()
# ──────────────────────────────────────────────────────────
# Admin: Usage / Stats (public read)
# ──────────────────────────────────────────────────────────
@app.get("/api/admin/stats/total")
async def admin_total_stats() -> dict:
return get_total_stats()
@app.get("/api/admin/stats/hourly")
async def admin_hourly_usage(
backend_id: Optional[str] = None,
hours: int = 168,
) -> list[dict]:
since = None
if hours > 0:
since = time.strftime(
"%Y-%m-%dT%H:%M:%SZ",
time.gmtime(time.time() - hours * 3600),
)
return get_hourly_usage(backend_id=backend_id, since=since, limit=hours)
@app.get("/api/admin/stats/daily")
async def admin_daily_stats(days: int = 30) -> list[dict]:
return get_daily_stats(days=days)
@app.get("/api/admin/stats/cooldown")
async def admin_cooldown_history(
backend_id: Optional[str] = None,
limit: int = 50,
) -> list[dict]:
return get_cooldown_history(backend_id=backend_id, limit=limit)
# ──────────────────────────────────────────────────────────
# Admin: System Config (read public, write auth required)
# ──────────────────────────────────────────────────────────
@app.get("/api/admin/config")
async def admin_get_all_config() -> list[dict]:
return list_configs()
@app.get("/api/admin/config/{key}")
async def admin_get_config(key: str) -> dict:
value = get_config(key)
if value is None:
raise HTTPException(404, "Config not found")
return {"key": key, "value": value}
@app.put("/api/admin/config/{key}")
async def admin_set_config(
key: str,
body: dict[str, Any],
_auth=Depends(require_admin),
) -> dict:
value = str(body.get("value", ""))
description = str(body.get("description", ""))
set_config(key, value, description)
return {"key": key, "value": value}
@app.delete("/api/admin/config/{key}")
async def admin_delete_config(
key: str,
_auth=Depends(require_admin),
) -> dict:
ok = delete_config(key)
if not ok:
raise HTTPException(404, "Config not found")
return {"status": "deleted", "key": key}
# ──────────────────────────────────────────────────────────
# Dashboard HTML (public, but respects admin_token for writes in JS)
# ──────────────────────────────────────────────────────────
@app.get("/dashboard")
async def dashboard_html() -> HTMLResponse:
dashboard_path = os.path.join(
os.path.dirname(__file__), "dashboard.html"
)
if os.path.exists(dashboard_path):
with open(dashboard_path, "r", encoding="utf-8") as f:
return HTMLResponse(f.read())
return HTMLResponse("<h1>Dashboard not found</h1>", status_code=404)
# ──────────────────────────────────────────────────────────
# Proxy Endpoints
# ──────────────────────────────────────────────────────────
@app.post("/v1/chat/completions")
async def chat_completions(request: Request) -> Response:
_inc_metric("proxy_requests_total")
return await handle_proxy_request(
pool_manager, rate_limiter, router, request, "/v1/chat/completions"
)
@app.post("/v1/completions")
async def completions(request: Request) -> Response:
return await handle_proxy_request(
pool_manager, rate_limiter, router, request, "/v1/completions"
)
@app.post("/v1/embeddings")
async def embeddings(request: Request) -> Response:
return await handle_proxy_request(
pool_manager, rate_limiter, router, request, "/v1/embeddings"
)
@app.get("/v1/models")
@app.get("/v1/models/{model_id:path}")
async def list_models(request: Request, model_id: Optional[str] = None) -> Response:
path = f"/v1/models/{model_id}" if model_id else "/v1/models"
return await handle_proxy_request(
pool_manager, rate_limiter, router, request, path
)
@app.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE", "PATCH", "OPTIONS"])
async def catch_all(request: Request, path: str) -> Response:
target_path = f"/{path}" if not path.startswith("/") else path
return await handle_proxy_request(
pool_manager, rate_limiter, router, request, target_path
)
# ──────────────────────────────────────────────────────────
# Main
# ──────────────────────────────────────────────────────────
def main() -> None:
import uvicorn
uvicorn.run(
"server:app",
host=app_config.host,
port=app_config.port,
log_level=app_config.log_level.lower(),
)
if __name__ == "__main__":
main()
@@ -0,0 +1 @@
# Sidecar V2 storage module
@@ -0,0 +1,252 @@
"""CRUD operations for Backend (provider) management."""
import json
import time
from typing import Optional
from storage.db import get_connection, generate_id
from storage.models import Backend, ModelMapping
from crypto import encrypt, decrypt
def create_backend(backend: Backend) -> Backend:
"""Create a new backend. Encrypts API key before storage."""
if not backend.id:
backend.id = generate_id("bkd")
now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
backend.created_at = now
backend.updated_at = now
api_key_encrypted = encrypt(backend.api_key_plain)
with get_connection() as conn:
conn.execute(
"""INSERT INTO backends (id, name, label, api_base_url, api_key_encrypted,
api, timeout_seconds, rpm_limit, pool, enabled, status, model_mappings_json,
source, cooldown_until, consecutive_429_count, metadata_json, created_at, updated_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
(
backend.id, backend.name, backend.label, backend.api_base_url,
api_key_encrypted, backend.api, backend.timeout_seconds,
backend.rpm_limit, backend.pool, 1 if backend.enabled else 0,
backend.status, json.dumps(_mappings_to_dict(backend.model_mappings)),
backend.source, backend.cooldown_until,
backend.consecutive_429_count,
json.dumps(backend.metadata), backend.created_at, backend.updated_at,
),
)
conn.commit()
return backend
def get_backend(backend_id: str, decrypt_key: bool = True) -> Optional[Backend]:
"""Get a single backend by ID."""
with get_connection() as conn:
row = conn.execute(
"SELECT * FROM backends WHERE id = ?", (backend_id,)
).fetchone()
if row is None:
return None
return _row_to_backend(row, decrypt_key=decrypt_key)
def list_backends(
pool: Optional[str] = None,
enabled_only: bool = False,
decrypt_key: bool = False,
) -> list[Backend]:
"""List backends, optionally filtered by pool."""
with get_connection() as conn:
if pool:
rows = conn.execute(
"SELECT * FROM backends WHERE pool = ? ORDER BY created_at",
(pool,),
).fetchall()
else:
rows = conn.execute(
"SELECT * FROM backends ORDER BY pool, created_at"
).fetchall()
backends = [_row_to_backend(r, decrypt_key=decrypt_key) for r in rows]
if enabled_only:
backends = [b for b in backends if b.enabled]
return backends
def update_backend(backend_id: str, updates: dict) -> Optional[Backend]:
"""Update backend fields. If api_key_plain is provided, re-encrypt."""
current = get_backend(backend_id, decrypt_key=True)
if current is None:
return None
# Apply updates
allowed = {
"name", "label", "api_base_url", "api", "timeout_seconds",
"rpm_limit", "pool", "enabled", "status", "source",
"cooldown_until", "consecutive_429_count", "metadata",
}
for key, value in updates.items():
if key in allowed:
setattr(current, key, value)
current.updated_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
# Handle API key update
api_key_encrypted = None
if "api_key_plain" in updates and updates["api_key_plain"]:
current.api_key_plain = updates["api_key_plain"]
api_key_encrypted = encrypt(updates["api_key_plain"])
# Handle model_mappings update
mappings_json = None
if "model_mappings" in updates:
current.model_mappings = updates["model_mappings"]
mappings_json = json.dumps(_mappings_to_dict(current.model_mappings))
with get_connection() as conn:
# Build dynamic UPDATE
set_clauses = [
"name = ?", "label = ?", "api_base_url = ?", "api = ?",
"timeout_seconds = ?", "rpm_limit = ?", "pool = ?", "enabled = ?",
"status = ?", "source = ?", "cooldown_until = ?",
"consecutive_429_count = ?", "metadata_json = ?", "updated_at = ?",
]
params = [
current.name, current.label, current.api_base_url, current.api,
current.timeout_seconds, current.rpm_limit, current.pool,
1 if current.enabled else 0, current.status, current.source,
current.cooldown_until, current.consecutive_429_count,
json.dumps(current.metadata), current.updated_at,
]
if api_key_encrypted:
set_clauses.append("api_key_encrypted = ?")
params.append(api_key_encrypted)
if mappings_json is not None:
set_clauses.append("model_mappings_json = ?")
params.append(mappings_json)
params.append(backend_id)
conn.execute(
f"UPDATE backends SET {', '.join(set_clauses)} WHERE id = ?",
params,
)
conn.commit()
return get_backend(backend_id, decrypt_key=False)
def delete_backend(backend_id: str) -> bool:
"""Delete a backend. Returns True if deleted."""
with get_connection() as conn:
cursor = conn.execute("DELETE FROM backends WHERE id = ?", (backend_id,))
conn.commit()
return cursor.rowcount > 0
def set_backend_status(backend_id: str, status: str) -> bool:
"""Quickly set backend status (healthy/cooling/error/disabled)."""
now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
with get_connection() as conn:
cursor = conn.execute(
"UPDATE backends SET status = ?, updated_at = ? WHERE id = ?",
(status, now, backend_id),
)
conn.commit()
return cursor.rowcount > 0
def set_backend_cooldown(backend_id: str, cooldown_until: str, count: int) -> bool:
"""Set cooldown state on a backend."""
now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
with get_connection() as conn:
cursor = conn.execute(
"""UPDATE backends SET status = 'cooling', cooldown_until = ?,
consecutive_429_count = ?, updated_at = ? WHERE id = ?""",
(cooldown_until, count, now, backend_id),
)
conn.commit()
return cursor.rowcount > 0
def clear_backend_cooldown(backend_id: str) -> bool:
"""Clear cooldown (back to healthy)."""
now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
with get_connection() as conn:
cursor = conn.execute(
"""UPDATE backends SET status = 'healthy', cooldown_until = NULL,
consecutive_429_count = 0, updated_at = ? WHERE id = ?""",
(now, backend_id),
)
conn.commit()
return cursor.rowcount > 0
def get_pool_stats() -> dict:
"""Get summary stats per pool."""
with get_connection() as conn:
rows = conn.execute(
"""SELECT pool, COUNT(*) as total,
SUM(CASE WHEN enabled = 1 THEN 1 ELSE 0 END) as enabled,
SUM(CASE WHEN status = 'healthy' THEN 1 ELSE 0 END) as healthy,
SUM(CASE WHEN status = 'cooling' THEN 1 ELSE 0 END) as cooling,
SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) as error
FROM backends GROUP BY pool"""
).fetchall()
stats = {}
for row in rows:
stats[row["pool"]] = {
"total": row["total"],
"enabled": row["enabled"],
"healthy": row["healthy"],
"cooling": row["cooling"],
"error": row["error"],
}
return stats
def _row_to_backend(row, decrypt_key: bool = True) -> Backend:
"""Convert a DB row to a Backend instance."""
mappings_raw = row["model_mappings_json"] or "{}"
mappings_dict = json.loads(mappings_raw)
model_mappings = {}
for canonical_name, mm in mappings_dict.items():
model_mappings[canonical_name] = ModelMapping.from_dict(mm)
backend = Backend(
id=row["id"],
name=row["name"],
label=row["label"],
api_base_url=row["api_base_url"],
api_key_encrypted=row["api_key_encrypted"] or "",
api=row["api"],
timeout_seconds=row["timeout_seconds"],
rpm_limit=row["rpm_limit"],
pool=row["pool"],
enabled=bool(row["enabled"]),
status=row["status"],
model_mappings=model_mappings,
source=row["source"],
cooldown_until=row["cooldown_until"],
consecutive_429_count=row["consecutive_429_count"],
metadata=json.loads(row["metadata_json"] or "{}"),
created_at=row["created_at"],
updated_at=row["updated_at"],
)
if decrypt_key and backend.api_key_encrypted:
from crypto import try_decrypt_existing
plain = try_decrypt_existing(backend.api_key_encrypted)
if plain:
backend.api_key_plain = plain
return backend
def _mappings_to_dict(mappings: dict[str, ModelMapping]) -> dict:
"""Convert ModelMapping dict to JSON-safe dict."""
return {k: v.to_dict() for k, v in mappings.items()}
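`update_backend` rewrites every column on each call. A possible variant, sketched here under assumptions, updates only the fields that were supplied. Column names come from a fixed whitelist and only values are bound as parameters. `ALLOWED_COLUMNS` and `build_update` are hypothetical names, and the column list is shortened for illustration.

```python
from typing import Any

# Shortened, illustrative whitelist; the real table has more columns.
ALLOWED_COLUMNS = ("name", "label", "rpm_limit", "pool", "enabled")


def build_update(updates: dict[str, Any], backend_id: str) -> tuple[str, list[Any]]:
    """Build an UPDATE statement touching only whitelisted, supplied columns."""
    columns = [c for c in ALLOWED_COLUMNS if c in updates]
    if not columns:
        raise ValueError("no updatable fields supplied")
    set_clause = ", ".join(f"{c} = ?" for c in columns)
    params = [updates[c] for c in columns] + [backend_id]
    return f"UPDATE backends SET {set_clause} WHERE id = ?", params


# "id" is not whitelisted, so it is ignored rather than interpolated.
sql, params = build_update({"rpm_limit": 30, "id": "evil", "name": "x"}, "bkd_1")
```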
@@ -0,0 +1,55 @@
"""System configuration KV store operations."""
import time
from typing import Optional, Any
from storage.db import get_connection
def get_config(key: str) -> Optional[str]:
"""Get a single config value."""
with get_connection() as conn:
row = conn.execute(
"SELECT value FROM system_config WHERE key = ?", (key,)
).fetchone()
return row["value"] if row else None
def set_config(key: str, value: str, description: str = "") -> None:
"""Set or update a config value."""
now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
with get_connection() as conn:
conn.execute(
"""INSERT INTO system_config (key, value, description, updated_at)
VALUES (?, ?, ?, ?)
ON CONFLICT(key) DO UPDATE SET
value = excluded.value,
description = excluded.description,
updated_at = excluded.updated_at""",
(key, value, description, now),
)
conn.commit()
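The `ON CONFLICT ... DO UPDATE` clause makes `set_config` an upsert: the first call inserts the row and later calls overwrite it. A self-contained sketch against an in-memory database follows. The table definition is an assumption (the real schema is not shown here), `upsert_config` is an illustrative name, and the clause needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema for illustration.
conn.execute(
    "CREATE TABLE system_config ("
    "key TEXT PRIMARY KEY, value TEXT, description TEXT, updated_at TEXT)"
)


def upsert_config(conn: sqlite3.Connection, key: str, value: str,
                  description: str, now: str) -> None:
    """Insert the key, or overwrite its value and metadata if it already exists."""
    conn.execute(
        """INSERT INTO system_config (key, value, description, updated_at)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(key) DO UPDATE SET
               value = excluded.value,
               description = excluded.description,
               updated_at = excluded.updated_at""",
        (key, value, description, now),
    )
    conn.commit()


upsert_config(conn, "default_rpm_limit", "60", "initial", "t1")
upsert_config(conn, "default_rpm_limit", "120", "raised", "t2")
rows = conn.execute(
    "SELECT key, value, description, updated_at FROM system_config"
).fetchall()
```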
def delete_config(key: str) -> bool:
"""Delete a config value."""
with get_connection() as conn:
cursor = conn.execute(
"DELETE FROM system_config WHERE key = ?", (key,)
)
conn.commit()
return cursor.rowcount > 0
def list_configs() -> list[dict]:
"""List all system config entries."""
with get_connection() as conn:
rows = conn.execute("SELECT * FROM system_config ORDER BY key").fetchall()
return [dict(row) for row in rows]
def get_all_configs_as_dict() -> dict[str, str]:
"""Get all configs as a simple dict."""
with get_connection() as conn:
rows = conn.execute("SELECT key, value FROM system_config").fetchall()
return {row["key"]: row["value"] for row in rows}
@@ -0,0 +1,74 @@
"""Cooldown event logging."""
import time
from typing import Optional
from storage.db import get_connection, generate_id
from storage.models import CooldownEvent
def log_cooldown_event(
backend_id: str,
consecutive_count: int,
cooldown_seconds: int,
response_summary: str = "",
) -> CooldownEvent:
"""Record a cooldown event."""
event = CooldownEvent(
id=generate_id("cev"),
backend_id=backend_id,
consecutive_count=consecutive_count,
cooldown_seconds=cooldown_seconds,
response_summary=response_summary,
started_at=time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
)
with get_connection() as conn:
conn.execute(
"""INSERT INTO cooldown_events
(id, backend_id, consecutive_count, cooldown_seconds,
response_summary, started_at)
VALUES (?, ?, ?, ?, ?, ?)""",
(event.id, event.backend_id, event.consecutive_count,
event.cooldown_seconds, event.response_summary, event.started_at),
)
conn.commit()
return event
def end_cooldown_event(backend_id: str) -> bool:
    """Mark the latest open cooldown event as ended."""
    ended_at = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    with get_connection() as conn:
        # Select the latest open event by id in a subquery. UPDATE ... ORDER BY
        # ... LIMIT only works on SQLite builds compiled with
        # SQLITE_ENABLE_UPDATE_DELETE_LIMIT, so it is not portable.
        cursor = conn.execute(
            """UPDATE cooldown_events SET ended_at = ?
            WHERE id = (
                SELECT id FROM cooldown_events
                WHERE backend_id = ? AND ended_at IS NULL
                ORDER BY started_at DESC LIMIT 1
            )""",
            (ended_at, backend_id),
        )
        conn.commit()
        return cursor.rowcount > 0


def get_cooldown_history(
    backend_id: Optional[str] = None,
    limit: int = 50,
) -> list[dict]:
    """Get cooldown event history."""
    with get_connection() as conn:
        if backend_id:
            rows = conn.execute(
                """SELECT * FROM cooldown_events
                   WHERE backend_id = ?
                   ORDER BY started_at DESC LIMIT ?""",
                (backend_id, limit),
            ).fetchall()
        else:
            rows = conn.execute(
                """SELECT * FROM cooldown_events
                   ORDER BY started_at DESC LIMIT ?""",
                (limit,),
            ).fetchall()
        return [dict(row) for row in rows]
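
Review note: closing "the latest open event" can be tried in isolation. The sketch below is a minimal illustration, assuming an in-memory database and a trimmed version of the cooldown_events table; `close_latest_open` and `demo` are hypothetical helpers, not part of this changeset. It picks the target row in a subquery, which should work on stock SQLite builds, whereas `UPDATE ... ORDER BY ... LIMIT` is only accepted by builds compiled with SQLITE_ENABLE_UPDATE_DELETE_LIMIT.

```python
import sqlite3


def close_latest_open(conn: sqlite3.Connection, backend_id: str, ended_at: str) -> bool:
    """Set ended_at on the newest still-open event of one backend (hypothetical helper)."""
    cursor = conn.execute(
        """UPDATE cooldown_events SET ended_at = ?
           WHERE id = (
               SELECT id FROM cooldown_events
               WHERE backend_id = ? AND ended_at IS NULL
               ORDER BY started_at DESC LIMIT 1
           )""",
        (ended_at, backend_id),
    )
    conn.commit()
    return cursor.rowcount > 0


def demo() -> dict:
    """Build a tiny table, close one event, and report what changed."""
    conn = sqlite3.connect(":memory:")
    # Trimmed schema: only the columns this example needs.
    conn.execute(
        """CREATE TABLE cooldown_events (
               id TEXT PRIMARY KEY,
               backend_id TEXT NOT NULL,
               started_at TEXT NOT NULL,
               ended_at TEXT
           )"""
    )
    conn.executemany(
        "INSERT INTO cooldown_events (id, backend_id, started_at) VALUES (?, ?, ?)",
        [
            ("cev_old", "be_1", "2026-06-25T08:00:00Z"),
            ("cev_new", "be_1", "2026-06-25T09:00:00Z"),
            ("cev_other", "be_2", "2026-06-25T09:30:00Z"),
        ],
    )
    changed = close_latest_open(conn, "be_1", "2026-06-25T09:05:00Z")
    rows = {row[0]: row[1] for row in conn.execute("SELECT id, ended_at FROM cooldown_events")}
    conn.close()
    return {"changed": changed, "rows": rows}
```

Only the newest open row of the requested backend is touched; older open rows and other backends keep `ended_at` as NULL.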
@@ -0,0 +1,193 @@
"""SQLite database connection management with WAL mode."""
import os
import sqlite3
import uuid
from contextlib import contextmanager
from typing import Generator

import structlog

from config import config

logger = structlog.get_logger()

# Module-level DB path
_DB_PATH: str = ""


def init_db(db_path: str = "") -> None:
    """Initialize the database connection and ensure WAL mode.

    Creates the data directory if needed and enables WAL journaling.
    """
    global _DB_PATH
    _DB_PATH = db_path or config.db_path
    # Ensure data directory exists; a bare filename has no directory part,
    # and os.makedirs("") would raise.
    db_dir = os.path.dirname(_DB_PATH)
    if db_dir:
        os.makedirs(db_dir, exist_ok=True)
    # Test connection and enable WAL
    conn = _get_raw_connection()
    try:
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("PRAGMA wal_autocheckpoint=1000")
        conn.execute("PRAGMA foreign_keys=ON")
        conn.execute("PRAGMA busy_timeout=5000")
        logger.info("db_initialized", path=_DB_PATH, mode="WAL")
    finally:
        conn.close()


def _get_raw_connection() -> sqlite3.Connection:
    """Get a raw sqlite3 connection."""
    conn = sqlite3.connect(_DB_PATH, check_same_thread=False)
    conn.row_factory = sqlite3.Row
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


@contextmanager
def get_connection() -> Generator[sqlite3.Connection, None, None]:
    """Get a database connection with WAL enabled."""
    conn = _get_raw_connection()
    try:
        yield conn
    finally:
        conn.close()
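
Review note: the two PRAGMAs issued per connection behave differently. The journal mode is stored in the database file, while `foreign_keys` is a setting of the individual connection. The sketch below is a minimal check of that, assuming a throwaway database in a temporary directory; `demo` and the table name `t` are made up for illustration.

```python
import os
import sqlite3
import tempfile


def demo() -> dict:
    """Set WAL on one connection, then inspect a second, fresh connection."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")

        first = sqlite3.connect(path)
        # The PRAGMA returns the journal mode that is now in effect.
        set_mode = first.execute("PRAGMA journal_mode=WAL").fetchone()[0]
        first.execute("CREATE TABLE t (x INTEGER)")
        first.commit()
        first.close()

        second = sqlite3.connect(path)
        # WAL was recorded in the file, so no PRAGMA is needed to see it again.
        reopened_mode = second.execute("PRAGMA journal_mode").fetchone()[0]
        # foreign_keys has to be switched on for this connection explicitly.
        second.execute("PRAGMA foreign_keys=ON")
        foreign_keys = second.execute("PRAGMA foreign_keys").fetchone()[0]
        second.close()

        return {
            "set_mode": set_mode,
            "reopened_mode": reopened_mode,
            "foreign_keys": foreign_keys,
        }
```

This is why repeating `PRAGMA foreign_keys=ON` for every new connection matters for the `ON DELETE CASCADE` clauses in the schema, while repeating `journal_mode=WAL` is harmless but not required once the file is in WAL mode.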


def generate_id(prefix: str = "") -> str:
    """Generate a unique ID with optional prefix."""
    uid = uuid.uuid4().hex[:12]
    return f"{prefix}_{uid}" if prefix else uid


def create_tables() -> None:
    """Create all tables if they don't exist."""
    with get_connection() as conn:
        conn.executescript(_DDL)
        conn.commit()
    logger.info("tables_created")


def run_integrity_check() -> bool:
    """Run PRAGMA integrity_check and return True if OK."""
    with get_connection() as conn:
        result = conn.execute("PRAGMA integrity_check").fetchone()
        ok = result[0] == "ok"
        if not ok:
            logger.error("integrity_check_failed", result=result[0])
        return ok


def get_db_sizes() -> dict:
    """Get database and WAL file sizes."""
    result = {"db_bytes": 0, "wal_bytes": 0}
    db_path = _DB_PATH
    if os.path.exists(db_path):
        result["db_bytes"] = os.path.getsize(db_path)
    wal_path = db_path + "-wal"
    if os.path.exists(wal_path):
        result["wal_bytes"] = os.path.getsize(wal_path)
    return result
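def wal_checkpoint(mode: str = "TRUNCATE") -> None:
    """Execute WAL checkpoint."""
    # PRAGMA arguments cannot be bound as parameters, so allow only known modes.
    mode = mode.upper()
    if mode not in ("PASSIVE", "FULL", "RESTART", "TRUNCATE"):
        raise ValueError(f"invalid WAL checkpoint mode: {mode}")
    with get_connection() as conn:
        conn.execute(f"PRAGMA wal_checkpoint({mode})")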
_DDL = """
-- Backend configuration table (core)
CREATE TABLE IF NOT EXISTS backends (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    label TEXT DEFAULT '',
    api_base_url TEXT NOT NULL,
    api_key_encrypted TEXT NOT NULL,
    api TEXT NOT NULL DEFAULT 'openai-completions',
    timeout_seconds INTEGER NOT NULL DEFAULT 120,
    rpm_limit INTEGER NOT NULL DEFAULT 40,
    pool TEXT NOT NULL DEFAULT 'primary'
        CHECK(pool IN ('primary', 'fallback')),
    enabled INTEGER NOT NULL DEFAULT 1,
    status TEXT NOT NULL DEFAULT 'healthy'
        CHECK(status IN ('healthy', 'cooling', 'error', 'disabled')),
    model_mappings_json TEXT DEFAULT '{}',
    source TEXT NOT NULL DEFAULT 'webui'
        CHECK(source IN ('webui', 'env', 'import')),
    cooldown_until TEXT,
    consecutive_429_count INTEGER DEFAULT 0,
    metadata_json TEXT DEFAULT '{}',
    created_at TEXT NOT NULL DEFAULT (datetime('now')),
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);

-- Usage logs (hour-bucketed, UPSERT-safe)
CREATE TABLE IF NOT EXISTS backend_usage_logs (
    id TEXT PRIMARY KEY,
    backend_id TEXT NOT NULL REFERENCES backends(id) ON DELETE CASCADE,
    model TEXT DEFAULT 'unknown',
    prompt_tokens INTEGER DEFAULT 0,
    completion_tokens INTEGER DEFAULT 0,
    total_tokens INTEGER DEFAULT 0,
    cost REAL DEFAULT 0.0,
    request_count INTEGER DEFAULT 0,
    error_count INTEGER DEFAULT 0,
    avg_latency_ms INTEGER DEFAULT 0,
    ttft_ms INTEGER DEFAULT 0,
    hour_bucket TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE UNIQUE INDEX IF NOT EXISTS idx_usage_backend_hour
    ON backend_usage_logs(backend_id, hour_bucket);

-- Cooldown event log
CREATE TABLE IF NOT EXISTS cooldown_events (
    id TEXT PRIMARY KEY,
    backend_id TEXT NOT NULL REFERENCES backends(id) ON DELETE CASCADE,
    consecutive_count INTEGER NOT NULL DEFAULT 1,
    cooldown_seconds INTEGER NOT NULL,
    response_summary TEXT DEFAULT '',
    started_at TEXT NOT NULL DEFAULT (datetime('now')),
    ended_at TEXT
);
CREATE INDEX IF NOT EXISTS idx_cooldown_backend_time
    ON cooldown_events(backend_id, started_at);

-- Backend health state
CREATE TABLE IF NOT EXISTS backend_health (
    backend_id TEXT PRIMARY KEY REFERENCES backends(id) ON DELETE CASCADE,
    state TEXT NOT NULL DEFAULT 'healthy'
        CHECK(state IN ('healthy', 'degraded', 'down')),
    last_latency_ms INTEGER DEFAULT 0,
    last_status_code INTEGER DEFAULT 200,
    success_rate_5m REAL DEFAULT 1.0,
    consecutive_failures INTEGER DEFAULT 0,
    last_check_at TEXT NOT NULL DEFAULT (datetime('now'))
);

-- System configuration KV store
CREATE TABLE IF NOT EXISTS system_config (
    key TEXT PRIMARY KEY,
    value TEXT NOT NULL,
    description TEXT DEFAULT '',
    updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);

-- Daily aggregated stats
CREATE TABLE IF NOT EXISTS daily_stats (
    id TEXT PRIMARY KEY,
    date TEXT NOT NULL,
    pool TEXT NOT NULL CHECK(pool IN ('primary', 'fallback')),
    total_requests INTEGER DEFAULT 0,
    total_errors INTEGER DEFAULT 0,
    total_tokens INTEGER DEFAULT 0,
    total_cost REAL DEFAULT 0.0,
    unique_backends INTEGER DEFAULT 0,
    created_at TEXT NOT NULL DEFAULT (datetime('now'))
);
CREATE UNIQUE INDEX IF NOT EXISTS idx_daily_date_pool ON daily_stats(date, pool);
"""
@@ -0,0 +1,161 @@
"""Data models for Sidecar V2 — backend-centric, Canonical Name routing."""
from dataclasses import dataclass, field, asdict
from typing import Optional
import json


@dataclass
class ModelMapping:
    """A single model mapping within a backend: Canonical Name → native_id + properties."""
    native_id: str
    reasoning: bool = False
    reasoning_effort: bool = False
    input_modalities: list[str] = field(default_factory=lambda: ["text"])
    cost: dict = field(default_factory=lambda: {
        "input": 0.0, "output": 0.0, "cacheRead": 0.0, "cacheWrite": 0.0
    })
    context_window: int = 128000
    max_tokens: int = 65536
    compat: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, d: dict) -> "ModelMapping":
        defaults = {
            "native_id": "",
            "reasoning": False,
            "reasoning_effort": False,
            "input_modalities": ["text"],
            "cost": {"input": 0.0, "output": 0.0, "cacheRead": 0.0, "cacheWrite": 0.0},
            "context_window": 128000,
            "max_tokens": 65536,
            "compat": {},
        }
        defaults.update(d)
        return cls(**{k: v for k, v in defaults.items() if k in cls.__dataclass_fields__})


@dataclass
class Backend:
    """A physical API backend (API Key + URL).

    Represents a single API key endpoint. Multiple backends can serve the same
    Canonical Models through their model_mappings.
    """
    id: str = ""
    name: str = ""
    label: str = ""  # e.g., "nvidia", "siliconflow"; WebUI tag only
    api_base_url: str = ""
    api_key_encrypted: str = ""
    api: str = "openai-completions"
    timeout_seconds: int = 120
    rpm_limit: int = 40
    pool: str = "primary"  # primary | fallback
    enabled: bool = True
    status: str = "healthy"  # healthy | cooling | error | disabled
    model_mappings: dict[str, ModelMapping] = field(default_factory=dict)
    source: str = "webui"  # webui | env | import
    cooldown_until: Optional[str] = None
    consecutive_429_count: int = 0
    metadata: dict = field(default_factory=dict)
    created_at: str = ""
    updated_at: str = ""
    # Runtime fields (not persisted)
    api_key_plain: str = ""  # decrypted at load time, not serialized to DB

    def has_model(self, canonical_name: str) -> bool:
        """Check if backend supports a given Canonical Model."""
        return canonical_name in self.model_mappings

    def get_native_id(self, canonical_name: str) -> str:
        """Get this backend's native model ID for a Canonical Name."""
        mm = self.model_mappings.get(canonical_name)
        return mm.native_id if mm else canonical_name

    def get_model_cost(self, canonical_name: str) -> dict:
        """Get cost info for a Canonical Model on this backend."""
        mm = self.model_mappings.get(canonical_name)
        return mm.cost if mm else {"input": 0.0, "output": 0.0, "cacheRead": 0.0, "cacheWrite": 0.0}

    def to_dict(self, mask_key: bool = True) -> dict:
        """Convert to dict for API responses."""
        d = asdict(self)
        # Remove runtime-only fields
        d.pop("api_key_plain", None)
        d.pop("api_key_encrypted", None)
        # Mask API key
        if mask_key and self.api_key_plain:
            d["api_key"] = _mask_key(self.api_key_plain)
        elif self.api_key_plain:
            d["api_key"] = self.api_key_plain
        else:
            d["api_key"] = ""
        # Convert model_mappings to dict for serialization
        d["model_mappings"] = {
            k: v.to_dict() for k, v in self.model_mappings.items()
        }
        return d


def _mask_key(key: str) -> str:
    if len(key) <= 10:
        return key[:2] + "****"
    return key[:6] + "****" + key[-4:]


@dataclass
class CooldownEvent:
    id: str = ""
    backend_id: str = ""
    consecutive_count: int = 1
    cooldown_seconds: int = 60
    response_summary: str = ""
    started_at: str = ""
    ended_at: Optional[str] = None


@dataclass
class BackendHealth:
    backend_id: str = ""
    state: str = "healthy"  # healthy | degraded | down
    last_latency_ms: int = 0
    last_status_code: int = 200
    success_rate_5m: float = 1.0
    consecutive_failures: int = 0
    last_check_at: str = ""


@dataclass
class UsageLog:
    id: str = ""
    backend_id: str = ""
    model: str = "unknown"
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    cost: float = 0.0
    request_count: int = 0
    error_count: int = 0
    avg_latency_ms: int = 0
    ttft_ms: int = 0
    hour_bucket: str = ""


@dataclass
class DailyStats:
    id: str = ""
    date: str = ""
    pool: str = "primary"
    total_requests: int = 0
    total_errors: int = 0
    total_tokens: int = 0
    total_cost: float = 0.0
    unique_backends: int = 0
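
Review note: the backends table stores mappings in `model_mappings_json`, while `Backend.model_mappings` holds dataclass instances. The code that converts between the two is not part of this hunk, so the sketch below is only an assumption about how the decoding could look. `MappingSketch` is a trimmed stand-in for `ModelMapping`, and `load_mappings` is a hypothetical helper. It shows the same idea as `from_dict`: unknown keys are dropped and missing keys fall back to the dataclass defaults.

```python
import json
from dataclasses import dataclass, field, fields


@dataclass
class MappingSketch:
    """Trimmed stand-in for ModelMapping (illustration only)."""
    native_id: str = ""
    reasoning: bool = False
    input_modalities: list[str] = field(default_factory=lambda: ["text"])
    context_window: int = 128000

    @classmethod
    def from_dict(cls, d: dict) -> "MappingSketch":
        # Keep only keys that are real dataclass fields; the rest are ignored.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in d.items() if k in known})


def load_mappings(raw_json: str) -> dict[str, MappingSketch]:
    """Decode a model_mappings_json value into Canonical Name -> mapping (hypothetical)."""
    decoded = json.loads(raw_json or "{}")
    return {name: MappingSketch.from_dict(props) for name, props in decoded.items()}
```

For example, `load_mappings('{"gpt-x": {"native_id": "vendor/gpt-x", "legacy_flag": true}}')` yields one mapping whose `native_id` is `"vendor/gpt-x"`, with `legacy_flag` silently discarded.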
@@ -0,0 +1,155 @@
"""Usage logging and daily statistics aggregation."""
import time
from typing import Optional

from storage.db import get_connection, generate_id


def record_usage(
    backend_id: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    cost: float,
    latency_ms: int,
    ttft_ms: int = 0,
    is_error: bool = False,
) -> None:
    """Record a single request's usage, hour-bucketed with UPSERT."""
    hour_bucket = time.strftime("%Y-%m-%dT%H:00:00Z", time.gmtime())
    uid = generate_id("use")
    with get_connection() as conn:
        # Try update existing hour bucket.
        # SQLite evaluates every right-hand side against the pre-update row,
        # so request_count in the averages below is still the old count.
        cursor = conn.execute(
            """UPDATE backend_usage_logs SET
                 prompt_tokens = prompt_tokens + ?,
                 completion_tokens = completion_tokens + ?,
                 total_tokens = total_tokens + ?,
                 cost = cost + ?,
                 request_count = request_count + 1,
                 error_count = error_count + ?,
                 avg_latency_ms = CAST((avg_latency_ms * request_count + ?) / (request_count + 1) AS INTEGER),
                 ttft_ms = CASE WHEN ? > 0 THEN CAST((ttft_ms * request_count + ?) / (request_count + 1) AS INTEGER) ELSE ttft_ms END
               WHERE backend_id = ? AND hour_bucket = ?""",
            (
                prompt_tokens, completion_tokens,
                prompt_tokens + completion_tokens,
                cost,
                1 if is_error else 0,
                latency_ms,
                ttft_ms, ttft_ms,
                backend_id, hour_bucket,
            ),
        )
        if cursor.rowcount == 0:
            # Insert new hour bucket
            conn.execute(
                """INSERT INTO backend_usage_logs
                   (id, backend_id, model, prompt_tokens, completion_tokens,
                    total_tokens, cost, request_count, error_count,
                    avg_latency_ms, ttft_ms, hour_bucket)
                   VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
                (
                    uid, backend_id, model,
                    prompt_tokens, completion_tokens,
                    prompt_tokens + completion_tokens,
                    cost, 1, 1 if is_error else 0,
                    latency_ms, ttft_ms, hour_bucket,
                ),
            )
        conn.commit()


def get_hourly_usage(
    backend_id: Optional[str] = None,
    since: Optional[str] = None,
    limit: int = 168,
) -> list[dict]:
    """Get hourly usage data, optionally filtered by backend and time range."""
    with get_connection() as conn:
        if backend_id and since:
            rows = conn.execute(
                """SELECT * FROM backend_usage_logs
                   WHERE backend_id = ? AND hour_bucket >= ?
                   ORDER BY hour_bucket DESC LIMIT ?""",
                (backend_id, since, limit),
            ).fetchall()
        elif backend_id:
            rows = conn.execute(
                """SELECT * FROM backend_usage_logs
                   WHERE backend_id = ? ORDER BY hour_bucket DESC LIMIT ?""",
                (backend_id, limit),
            ).fetchall()
        elif since:
            rows = conn.execute(
                """SELECT * FROM backend_usage_logs
                   WHERE hour_bucket >= ? ORDER BY hour_bucket DESC LIMIT ?""",
                (since, limit),
            ).fetchall()
        else:
            rows = conn.execute(
                """SELECT * FROM backend_usage_logs
                   ORDER BY hour_bucket DESC LIMIT ?""",
                (limit,),
            ).fetchall()
        return [dict(row) for row in rows]


def get_total_stats() -> dict:
    """Get aggregate stats across all backends."""
    with get_connection() as conn:
        # An aggregate query without GROUP BY always returns exactly one row,
        # and SUM() is NULL on an empty table, so COALESCE supplies the zeros.
        row = conn.execute(
            """SELECT
                 COALESCE(SUM(request_count), 0) AS total_requests,
                 COALESCE(SUM(error_count), 0) AS total_errors,
                 COALESCE(SUM(total_tokens), 0) AS total_tokens,
                 COALESCE(SUM(prompt_tokens), 0) AS total_prompt_tokens,
                 COALESCE(SUM(completion_tokens), 0) AS total_completion_tokens,
                 COALESCE(SUM(cost), 0.0) AS total_cost
               FROM backend_usage_logs"""
        ).fetchone()
        return dict(row)


def aggregate_daily_stats(date: str) -> None:
    """Aggregate hourly usage into daily stats table."""
    with get_connection() as conn:
        # Aggregate per pool
        conn.execute("""DELETE FROM daily_stats WHERE date = ?""", (date,))
        conn.execute(
            """INSERT INTO daily_stats (id, date, pool, total_requests,
                   total_errors, total_tokens, total_cost, unique_backends)
               SELECT
                   ? || '-' || b.pool,
                   ?,
                   b.pool,
                   SUM(u.request_count),
                   SUM(u.error_count),
                   SUM(u.total_tokens),
                   SUM(u.cost),
                   COUNT(DISTINCT u.backend_id)
               FROM backend_usage_logs u
               JOIN backends b ON u.backend_id = b.id
               WHERE u.hour_bucket LIKE ?
               GROUP BY b.pool""",
            (generate_id("day"), date, date + "%"),
        )
        conn.commit()


def get_daily_stats(days: int = 30) -> list[dict]:
    """Get daily aggregated stats."""
    with get_connection() as conn:
        rows = conn.execute(
            """SELECT * FROM daily_stats ORDER BY date DESC LIMIT ?""",
            (days,),
        ).fetchall()
        return [dict(row) for row in rows]
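
Review note: the hour-bucket accumulation could also be written as a single statement. The sketch below is one possible alternative, not what this changeset does. It assumes SQLite 3.24 or newer (for `ON CONFLICT ... DO UPDATE`), a trimmed table named `usage`, and a hypothetical `record_latency` helper. The running mean follows avg_new = (avg_old * n + x) / (n + 1), where n is the request count before the update.

```python
import sqlite3

# Unqualified column names in DO UPDATE refer to the existing row;
# excluded.<column> refers to the values of the row that failed to insert.
UPSERT_SQL = """
INSERT INTO usage (backend_id, hour_bucket, request_count, avg_latency_ms)
VALUES (?, ?, 1, ?)
ON CONFLICT(backend_id, hour_bucket) DO UPDATE SET
    avg_latency_ms = CAST(
        (avg_latency_ms * request_count + excluded.avg_latency_ms)
        / (request_count + 1) AS INTEGER),
    request_count = request_count + 1
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a trimmed usage table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE usage (
               backend_id TEXT NOT NULL,
               hour_bucket TEXT NOT NULL,
               request_count INTEGER NOT NULL DEFAULT 0,
               avg_latency_ms INTEGER NOT NULL DEFAULT 0,
               PRIMARY KEY (backend_id, hour_bucket)
           )"""
    )
    return conn


def record_latency(conn: sqlite3.Connection, backend_id: str, hour_bucket: str, latency_ms: int) -> None:
    """Insert the bucket or fold one more sample into its running mean (hypothetical)."""
    conn.execute(UPSERT_SQL, (backend_id, hour_bucket, latency_ms))
    conn.commit()


def demo() -> tuple:
    """Record three samples in one bucket and return (request_count, avg_latency_ms)."""
    conn = make_db()
    for latency in (100, 200, 300):
        record_latency(conn, "be_1", "2026-06-25T09:00:00Z", latency)
    row = conn.execute("SELECT request_count, avg_latency_ms FROM usage").fetchone()
    conn.close()
    return tuple(row)
```

A single statement avoids the window between a zero-row UPDATE and the follow-up INSERT, in which a second writer could create the same bucket and trip the unique index.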