feat: Primary-Wait backoff queuing — BIZ-60

When all primary backends are in cooldown, wait and retry the primary pool
before falling through to fallback/emergency. This reduces unnecessary spend
on paid fallback providers during temporary 429 storms.

Config:
- primary_wait_ms (default 5000, env SIDECAR_PRIMARY_WAIT_MS)
- primary_wait_max_retries (default 6, env SIDECAR_PRIMARY_WAIT_MAX_RETRIES)

Implementation:
- config.py: 2 new config fields + env var loading
- router.py: pick_primary_backend(), primary-pool-only selection
- proxy.py: primary-wait loop between standard retries and emergency

Expected win: the 17% error rate seen under high concurrency should drop, and
the emergency passthrough count should fall, because requests wait for the
NVIDIA pool to recover instead of immediately routing to the SiliconFlow
fallback.
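
The config.py change is not part of the hunk shown here. As a rough illustration only, the two settings could be loaded along these lines. The class and function names are hypothetical; only the field names, defaults, and env var names come from the commit message.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PrimaryWaitConfig:
    # Hypothetical container; the real config.py layout is not shown in this commit view.
    primary_wait_ms: int = 5000        # wait before each primary-pool retry
    primary_wait_max_retries: int = 6  # primary-only retries before fallback


def load_primary_wait_config(env: Optional[Mapping[str, str]] = None) -> PrimaryWaitConfig:
    """Read both settings from the environment, using the documented defaults."""
    env = os.environ if env is None else env
    return PrimaryWaitConfig(
        primary_wait_ms=int(env.get("SIDECAR_PRIMARY_WAIT_MS", "5000")),
        primary_wait_max_retries=int(env.get("SIDECAR_PRIMARY_WAIT_MAX_RETRIES", "6")),
    )
```

If the wait between retries is fixed (an assumption; the title mentions backoff), the defaults bound the extra latency at 6 × 5000 ms = 30 s before a request falls through to fallback.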
2026-06-25 22:22:02 +08:00
parent 4bdf6ddf32
commit 376ce97d91
3 changed files with 138 additions and 1 deletions
@@ -57,6 +57,20 @@ class Router:

        return None

+    def pick_primary_backend(self, canonical_model: str) -> Optional[Backend]:
+        """Pick a backend from primary pool only (no fallback).
+
+        Used by Primary-Wait: when all primary backends are cooling,
+        wait and retry primary exclusively before falling through to fallback.
+        """
+        backends = self._pool_manager.get_available_backends(
+            canonical_model, pool="primary"
+        )
+        for backend in backends:
+            if self._rate_limiter.consume(backend.id, backend.rpm_limit):
+                return backend
+        return None
+
    def get_all_pools_exhausted_info(self, canonical_model: str) -> bool:
        """Check if ALL pools are exhausted for a model."""
        return not self._pool_manager.is_any_pool_available(canonical_model)
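
The proxy.py side of this change is not visible in this hunk. The sketch below shows one way a primary-wait loop could call pick_primary_backend(). It is an assumption-laden illustration, not the committed code: the function name wait_for_primary, the fixed (non-growing) wait, and the synchronous sleep are all hypothetical, and an async proxy would presumably await asyncio.sleep instead.

```python
import time
from typing import Callable, Optional, TypeVar

B = TypeVar("B")


def wait_for_primary(
    pick: Callable[[], Optional[B]],
    wait_ms: int = 5000,
    max_retries: int = 6,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[B]:
    """Retry primary-only selection, sleeping wait_ms before each attempt.

    `pick` stands in for something like
    `lambda: router.pick_primary_backend(canonical_model)`.
    Returns the first backend obtained, or None once max_retries is used up,
    at which point the caller would fall through to fallback/emergency.
    """
    for _ in range(max_retries):
        sleep(wait_ms / 1000.0)  # give cooling primary backends time to recover
        backend = pick()
        if backend is not None:
            return backend
    return None
```

Because pick_primary_backend() consumes a rate-limit token when it succeeds, a loop like this should use the returned backend directly rather than selecting again.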