fix: add Primary-Wait Prometheus counters + conservative defaults — BIZ-60 review

P0 changes per 4-reviewer consensus (严维序/陆怀瑾/沈路明/梁思筑):

1. Prometheus metrics counters (proxy.py + server.py):
   - sidecar_primary_wait_enter_total: requests entering Primary-Wait
   - sidecar_primary_wait_recovery_total: successful primary recoveries
   - sidecar_primary_wait_exhausted_total: wait exhausted → emergency
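
A minimal sketch of where the three counters might be incremented. The actual proxy.py logic is not shown in this commit, so the loop shape, the `primary_wait` function, and the stdlib `Counter` stand-in (used here instead of a real Prometheus client) are assumptions for illustration only:

```python
import time
from typing import Callable


class Counter:
    """Hypothetical stdlib stand-in for a Prometheus counter."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.value = 0

    def inc(self) -> None:
        self.value += 1


# Counter names are taken from the commit message.
ENTER = Counter("sidecar_primary_wait_enter_total")
RECOVERY = Counter("sidecar_primary_wait_recovery_total")
EXHAUSTED = Counter("sidecar_primary_wait_exhausted_total")


def primary_wait(
    primary_available: Callable[[], bool],
    wait_ms: int = 5000,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if a primary recovered, False if the wait was exhausted."""
    ENTER.inc()  # every request entering Primary-Wait is counted once
    for _ in range(max_retries):
        sleep(wait_ms / 1000)  # assumed: wait first, then re-check
        if primary_available():
            RECOVERY.inc()
            return True
    EXHAUSTED.inc()  # caller would fall back to the emergency path
    return False
```

Each call ends in exactly one of recovery or exhausted, which is what makes the funnel add up.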

2. Conservative default (config.py):
   - primary_wait_max_retries: 6 → 3 (3 × 5000 ms = 15s total wait, a safe starting point)
   - Observe recovery rate before increasing to 6
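
The wait budget behind this default can be checked with a small sketch. It assumes each retry waits the full primary_wait_ms; the `primary_wait_budget_seconds` helper is hypothetical and not part of config.py:

```python
from dataclasses import dataclass


@dataclass
class Config:
    # Field names and defaults mirror the config.py hunk in this commit.
    primary_wait_ms: int = 5000
    primary_wait_max_retries: int = 3

    def primary_wait_budget_seconds(self) -> float:
        """Worst-case time a request spends in Primary-Wait (hypothetical helper)."""
        return self.primary_wait_ms * self.primary_wait_max_retries / 1000


new_default = Config()
old_default = Config(primary_wait_max_retries=6)
print(new_default.primary_wait_budget_seconds())  # 15.0
print(old_default.primary_wait_budget_seconds())  # 30.0
```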

The counters form a complete funnel (enter - recovery = exhausted),
enabling Grafana monitoring and ROI validation per COO/PM/Ops.
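
A rough sketch of how the funnel could be read once counter values are scraped. The function names and the sample numbers are made up for illustration; a live scrape may be briefly off by the number of requests still waiting:

```python
def recovery_rate(enter: int, recovery: int) -> float:
    """Share of Primary-Wait entries that ended in a primary recovery."""
    if enter == 0:
        return 0.0  # no traffic entered Primary-Wait yet
    return recovery / enter


def funnel_is_consistent(enter: int, recovery: int, exhausted: int) -> bool:
    """Check the invariant stated above: enter - recovery = exhausted."""
    return enter - recovery == exhausted


# Hypothetical sample values, not real measurements.
enter, recovery, exhausted = 200, 150, 50
print(funnel_is_consistent(enter, recovery, exhausted))  # True
print(recovery_rate(enter, recovery))  # 0.75
```

A recovery rate that stays high at 3 retries would argue for keeping the shorter wait; a low one would be the signal to revisit the increase to 6.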
2026-06-25 22:48:09 +08:00
parent 376ce97d91
commit 18dfb2901b
3 changed files with 37 additions and 2 deletions
+1 -1
@@ -75,7 +75,7 @@ class Config:
# Primary-Wait: when all primary backends are cooling, wait before fallback
primary_wait_ms: int = 5000
-primary_wait_max_retries: int = 6
+primary_wait_max_retries: int = 3
# Request timeout
default_request_timeout_seconds: int = 120