When all primary backends are in cooldown, wait and retry the primary pool
before falling through to fallback/emergency. This reduces unnecessary
spend on paid fallback providers during temporary 429 storms.
Config:
- primary_wait_ms (default 5000, env SIDECAR_PRIMARY_WAIT_MS)
- primary_wait_max_retries (default 6, env SIDECAR_PRIMARY_WAIT_MAX_RETRIES)
Implementation:
- config.py: 2 new config fields + env var loading
- router.py: pick_primary_backend() — primary-pool-only selection
- proxy.py: primary-wait loop between standard retries and emergency
Expected win: 17% error rate during high concurrency drops, emergency
passthrough count falls as requests wait for NVIDIA pool recovery
instead of immediately routing to SiliconFlow fallback.