feat: Primary-Wait backoff queuing — BIZ-60
When all primary backends are in cooldown, wait and retry the primary pool
before falling through to fallback/emergency. This reduces unnecessary spend
on paid fallback providers during temporary 429 storms.

Config:
- primary_wait_ms (default 5000, env SIDECAR_PRIMARY_WAIT_MS)
- primary_wait_max_retries (default 6, env SIDECAR_PRIMARY_WAIT_MAX_RETRIES)

Implementation:
- config.py: 2 new config fields + env var loading
- router.py: pick_primary_backend(), primary-pool-only selection
- proxy.py: primary-wait loop between standard retries and emergency

Expected win: the 17% error rate seen under high concurrency drops, and the
emergency passthrough count falls, because requests wait for the NVIDIA pool
to recover instead of immediately routing to the SiliconFlow fallback.
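The config.py and proxy.py changes are not shown in this hunk, so the following is only a rough sketch of how the two settings and the wait loop might fit together. The field names, defaults, and env var names come from the commit message. `PrimaryWaitConfig`, `from_env`, and `wait_for_primary` are hypothetical names, and the fixed sleep interval between attempts is an assumption (the real loop may back off differently).

```python
import os
import time
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, TypeVar

T = TypeVar("T")


@dataclass
class PrimaryWaitConfig:
    """Hypothetical container; defaults mirror the commit message."""
    primary_wait_ms: int = 5000
    primary_wait_max_retries: int = 6

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "PrimaryWaitConfig":
        # Fall back to the documented defaults when a variable is unset.
        env = os.environ if env is None else env
        return cls(
            primary_wait_ms=int(env.get("SIDECAR_PRIMARY_WAIT_MS", 5000)),
            primary_wait_max_retries=int(env.get("SIDECAR_PRIMARY_WAIT_MAX_RETRIES", 6)),
        )


def wait_for_primary(
    pick_primary: Callable[[], Optional[T]],
    cfg: PrimaryWaitConfig,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Sleep, then re-poll the primary pool, up to the retry budget.

    Returns a backend as soon as one is available, or None once the budget
    is spent, at which point the caller would fall through to
    fallback/emergency. Assumes a fixed interval between attempts.
    """
    for _ in range(cfg.primary_wait_max_retries):
        sleep(cfg.primary_wait_ms / 1000.0)
        backend = pick_primary()
        if backend is not None:
            return backend
    return None
```

In the real proxy, `pick_primary` would presumably wrap `router.pick_primary_backend(canonical_model)`. Under this fixed-interval assumption, the defaults bound the extra wait at 6 x 5000 ms = 30 s per request before fallback is tried.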
@@ -57,6 +57,20 @@ class Router:
 
         return None
 
+    def pick_primary_backend(self, canonical_model: str) -> Optional[Backend]:
+        """Pick a backend from primary pool only (no fallback).
+
+        Used by Primary-Wait: when all primary backends are cooling,
+        wait and retry primary exclusively before falling through to fallback.
+        """
+        backends = self._pool_manager.get_available_backends(
+            canonical_model, pool="primary"
+        )
+        for backend in backends:
+            if self._rate_limiter.consume(backend.id, backend.rpm_limit):
+                return backend
+        return None
+
     def get_all_pools_exhausted_info(self, canonical_model: str) -> bool:
         """Check if ALL pools are exhausted for a model."""
         return not self._pool_manager.is_any_pool_available(canonical_model)
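To illustrate how the selection in `pick_primary_backend()` behaves, here is a small self-contained sketch. The method body mirrors the diff, but `StubPoolManager`, `StubRateLimiter`, the simplified `Backend` dataclass, and the backend ids are hypothetical stand-ins. The real pool manager and rate limiter are not shown in this commit, so their semantics here (a fixed primary list, a simple per-backend counter with no time window) are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Backend:
    # Simplified stand-in; the real Backend likely carries more fields.
    id: str
    rpm_limit: int


class StubPoolManager:
    """Hypothetical stand-in that serves a fixed primary pool."""

    def __init__(self, primary: List[Backend]) -> None:
        self._primary = primary

    def get_available_backends(self, canonical_model: str, pool: str) -> List[Backend]:
        return list(self._primary) if pool == "primary" else []


class StubRateLimiter:
    """Hypothetical stand-in: counts requests per backend, no time window."""

    def __init__(self) -> None:
        self._used: Dict[str, int] = {}

    def consume(self, backend_id: str, rpm_limit: int) -> bool:
        used = self._used.get(backend_id, 0)
        if used >= rpm_limit:
            return False
        self._used[backend_id] = used + 1
        return True


class Router:
    def __init__(self, pool_manager: StubPoolManager, rate_limiter: StubRateLimiter) -> None:
        self._pool_manager = pool_manager
        self._rate_limiter = rate_limiter

    def pick_primary_backend(self, canonical_model: str) -> Optional[Backend]:
        # Same logic as the diff: first primary backend with rate budget wins.
        backends = self._pool_manager.get_available_backends(
            canonical_model, pool="primary"
        )
        for backend in backends:
            if self._rate_limiter.consume(backend.id, backend.rpm_limit):
                return backend
        return None


router = Router(
    StubPoolManager([Backend("nvidia-a", 1), Backend("nvidia-b", 2)]),
    StubRateLimiter(),
)
picks = [router.pick_primary_backend("some-model") for _ in range(4)]
picked_ids = [b.id if b is not None else None for b in picks]
print(picked_ids)  # ['nvidia-a', 'nvidia-b', 'nvidia-b', None]
```

The final `None` is the signal that the primary pool has no capacity left. According to the commit message, that is the case the primary-wait loop handles by waiting and retrying before any fallback is used.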