Routing Strategies
The routing service (RoutingService) resolves an ordered chain of providers for each request. After loading the routing configuration and filtering out disabled or unhealthy providers, it applies one of ten strategies to determine the execution order.
Strategy Overview
Request arrives
--> Load routing config for (tenant, capability)
--> Load provider configs from routes
--> Filter: remove disabled, unhealthy, not-in-allowed-list, missing-model
--> Apply strategy --> ordered candidate list
--> Append fallback chain
--> Return resolved provider chain
1. Priority
Algorithm: Sort candidates by the priority field (ascending). Lower number = higher preference.
Behavior: The first healthy provider in priority order handles the request. If it fails, the next in order is tried. This is deterministic — the same provider always handles requests when healthy.
When to use: When you have a clear primary provider and want predictable failover. Good for compliance scenarios where a specific provider must be preferred.
Configuration: Set priority: 1 on your primary provider, priority: 2 on your secondary, and so on.
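The ordering step can be sketched as a simple ascending sort on the priority field. The `ProviderCandidate` shape below is an assumption for illustration, not the gateway's actual type:

```typescript
// Sketch of priority ordering (ascending: lower number = higher preference).
// `ProviderCandidate` is an assumed shape, not the gateway's actual type.
interface ProviderCandidate {
  id: string;
  priority: number;
}

function orderByPriority(candidates: ProviderCandidate[]): ProviderCandidate[] {
  // Copy before sorting so the caller's array is untouched.
  return [...candidates].sort((a, b) => a.priority - b.priority);
}

const chain = orderByPriority([
  { id: "secondary", priority: 2 },
  { id: "primary", priority: 1 },
  { id: "tertiary", priority: 3 },
]);
// chain[0].id === "primary"
```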
2. Round-Robin
Algorithm: Uses a Redis counter (INCR) scoped to (tenant, capability). The counter modulo the provider count determines the starting index. Candidates are first sorted by priority, then rotated from the starting index.
Behavior: Each request shifts to the next provider in order. Distribution is even over time. Falls back to priority strategy if Redis is unavailable.
When to use: When you want to spread load evenly across providers with similar capabilities and pricing. Useful for maximizing aggregate rate limits.
Details: The Redis counter key carries a 1-hour TTL, so the rotation position automatically resets after an hour without requests for that (tenant, capability) pair.
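The rotation step can be sketched as below. In production the counter comes from a Redis `INCR`; here a local variable stands in, and the candidate shape is assumed:

```typescript
// Sketch of round-robin rotation. A local variable stands in for the
// Redis INCR counter; the candidate shape is assumed for illustration.
interface RRCandidate {
  id: string;
  priority: number;
}

let counter = 0; // stand-in for the Redis counter

function roundRobin(candidates: RRCandidate[]): RRCandidate[] {
  // Sort by priority first, then rotate from the counter-derived index.
  const sorted = [...candidates].sort((a, b) => a.priority - b.priority);
  const start = counter++ % sorted.length; // INCR, then modulo
  return [...sorted.slice(start), ...sorted.slice(0, start)];
}

const pool: RRCandidate[] = [
  { id: "a", priority: 1 },
  { id: "b", priority: 2 },
  { id: "c", priority: 3 },
];
const first = roundRobin(pool);  // begins at "a"
const second = roundRobin(pool); // begins at "b"
```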
3. Weighted
Algorithm: Fisher-Yates weighted shuffle. Each candidate has a weight value. On each iteration, a random number is drawn proportional to total remaining weight; the selected candidate is placed next in the ordered list.
Behavior: Higher-weight providers are more likely to be selected first, but all providers get traffic proportional to their weight. The result is probabilistic — distribution converges to weight ratios over many requests.
When to use: When you want proportional traffic splitting. For example, send 70% of traffic to a primary provider and 30% to a secondary for gradual migration or A/B testing.
Configuration: Set weight: 70 on provider A and weight: 30 on provider B.
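A minimal sketch of the weighted shuffle: repeatedly draw one candidate with probability proportional to its remaining weight, without replacement. The `Weighted` shape is an assumption:

```typescript
// Sketch of a weighted shuffle: each draw picks a candidate with
// probability proportional to its weight, then removes it from the pool.
interface Weighted {
  id: string;
  weight: number;
}

function weightedShuffle(candidates: Weighted[]): Weighted[] {
  const pool = [...candidates];
  const ordered: Weighted[] = [];
  while (pool.length > 0) {
    const total = pool.reduce((sum, c) => sum + c.weight, 0);
    let roll = Math.random() * total;
    let index = pool.length - 1; // fallback guards against float rounding
    for (let i = 0; i < pool.length; i++) {
      roll -= pool[i].weight;
      if (roll <= 0) {
        index = i;
        break;
      }
    }
    ordered.push(pool.splice(index, 1)[0]);
  }
  return ordered;
}
```

With weights 70 and 30, provider A lands first in roughly 70% of shuffles, so the distribution converges to the configured ratio over many requests.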
4. Least-Cost
Algorithm: Batch-loads ModelConfig documents to retrieve pricing data (input + output cost per million tokens). Sorts candidates by total cost (ascending). Cheapest provider is tried first.
Behavior: Always prefers the cheapest available provider for the requested model. Providers without pricing data are sorted last (infinite cost).
When to use: When minimizing cost is the primary concern and you are willing to accept any provider that supports the model. Ideal for high-volume, cost-sensitive workloads.
Details: Cost is computed as inputPerMillionTokens + outputPerMillionTokens from the model’s pricing configuration. Ensure pricing is configured on your model entries for this strategy to work correctly.
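The sort itself can be sketched as below; providers without pricing data get `Infinity` so they sort last. Field names mirror the pricing fields described above but the candidate shape is assumed:

```typescript
// Sketch of the least-cost sort: missing pricing sorts last via Infinity.
// The candidate shape is assumed for illustration.
interface PricedCandidate {
  id: string;
  pricing?: { inputPerMillionTokens: number; outputPerMillionTokens: number };
}

function orderByCost(candidates: PricedCandidate[]): PricedCandidate[] {
  const cost = (c: PricedCandidate) =>
    c.pricing
      ? c.pricing.inputPerMillionTokens + c.pricing.outputPerMillionTokens
      : Infinity; // no pricing data: try last
  return [...candidates].sort((a, b) => cost(a) - cost(b));
}

const ordered = orderByCost([
  { id: "noprice" },
  { id: "cheap", pricing: { inputPerMillionTokens: 1, outputPerMillionTokens: 2 } },
  { id: "pricey", pricing: { inputPerMillionTokens: 10, outputPerMillionTokens: 20 } },
]);
// ordered: cheap, pricey, noprice
```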
5. Least-Latency
Algorithm: Sorts candidates by providerConfig.health.avgLatencyMs (ascending). Fastest provider is tried first. Providers without latency data are sorted last (infinite latency).
Behavior: Always prefers the provider with the lowest measured latency. Latency values come from health check measurements and are updated periodically.
When to use: When response speed is critical. Good for real-time applications, chatbots, and interactive use cases.
Details: Latency data is updated by the health check system. The values reflect provider-side latency, not end-to-end latency including network transit.
6. Free-Tier-First
Algorithm: Separates candidates into two groups: providers with an active free tier and providers without. For each free-tier provider, checks whether the daily limits (requests and tokens) have been exhausted via the FreeTierTrackingService. Exhausted free-tier providers are demoted to the paid group. Each group is sorted by priority.
Behavior: Free-tier providers are always tried before paid providers, until their daily quotas are exhausted. This maximizes use of free allowances before incurring costs.
When to use: When you have providers offering free tiers (e.g., Gemini, Groq) and want to exhaust free capacity before paying. Particularly useful for development, testing, or cost-conscious workloads.
Details: Free-tier limits are tracked per (tenant, providerConfig, model) using the FreeTierTrackingService. Limits can be set at the provider level (freeTier.requestsPerDay, freeTier.tokensPerDay) or overridden per model (freeTierLimits).
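The grouping logic can be sketched as below. The `isExhausted` callback stands in for the FreeTierTrackingService quota check; all shapes here are assumptions:

```typescript
// Sketch of free-tier-first grouping. `isExhausted` stands in for the
// FreeTierTrackingService check; the candidate shape is assumed.
interface FTCandidate {
  id: string;
  priority: number;
  hasFreeTier: boolean;
}

function freeTierFirst(
  candidates: FTCandidate[],
  isExhausted: (c: FTCandidate) => boolean,
): FTCandidate[] {
  const byPriority = (a: FTCandidate, b: FTCandidate) => a.priority - b.priority;
  const free: FTCandidate[] = [];
  const paid: FTCandidate[] = [];
  for (const c of candidates) {
    // Exhausted free tiers are demoted to the paid group.
    if (c.hasFreeTier && !isExhausted(c)) free.push(c);
    else paid.push(c);
  }
  return [...free.sort(byPriority), ...paid.sort(byPriority)];
}
```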
7. Task-Optimized
Algorithm: Delegates to the ModelIntelligenceService.applyTaskOptimizedRouting() method, which analyzes the request prompt and optional max-tokens to determine the task type (coding, creative writing, analysis, etc.) and selects the best model for that task.
Behavior: Routes requests to the provider/model combination that is best suited for the detected task type based on model intelligence data. This is the most sophisticated strategy and requires the model intelligence system to be populated with benchmark data.
When to use: When you have multiple providers with different model strengths and want the gateway to automatically select the best model for each request type.
Inputs: The prompt (last user message) and maxTokens from the request are passed to the intelligence service for task classification.
8. Cost-Optimized
Algorithm: Combines model pricing data with request characteristics (estimated input tokens, expected output tokens) to compute a total estimated cost per provider. Candidates are sorted by estimated cost (ascending). Unlike least-cost, which uses static per-million-token rates, cost-optimized factors in the actual request size and any provider-specific pricing tiers or volume discounts.
Behavior: Routes each request to the provider that will handle it at the lowest estimated total cost. Providers without pricing data are sorted last.
When to use: When you want fine-grained cost optimization that accounts for request size and provider pricing structures. Ideal for workloads with variable request sizes where a flat per-token comparison is insufficient.
Details: Cost estimation uses (estimatedInputTokens * inputPerMillionTokens + estimatedOutputTokens * outputPerMillionTokens) / 1_000_000. The estimated token counts are derived from the request payload.
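The per-provider estimate from the formula above can be sketched directly:

```typescript
// Sketch of the per-request cost estimate from the formula above.
// Rates are dollars per million tokens.
function estimateCost(
  estimatedInputTokens: number,
  estimatedOutputTokens: number,
  inputPerMillionTokens: number,
  outputPerMillionTokens: number,
): number {
  return (
    (estimatedInputTokens * inputPerMillionTokens +
      estimatedOutputTokens * outputPerMillionTokens) /
    1_000_000
  );
}

// e.g. 2,000 input tokens at $3/M plus 500 output tokens at $15/M
// comes to $0.0135 for the request.
const example = estimateCost(2000, 500, 3, 15);
```

Because the estimate scales with actual request size, a provider that is cheaper per input token but pricier per output token can win or lose depending on the expected output length.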
9. Failover
Algorithm: Orders candidates by the priority field (ascending), identical to the priority strategy. Additionally, providers in a degraded health state (elevated error rate or latency above threshold) are automatically demoted to the end of the candidate list, behind all healthy providers.
Behavior: Under normal conditions, behaves identically to priority routing. When a provider becomes degraded (but not fully down), it is moved behind healthier alternatives without manual intervention. Once the provider recovers, it resumes its original position.
When to use: When you want priority-based routing with automatic, temporary demotion of providers that are experiencing issues. This avoids sending traffic to a struggling provider when healthier alternatives exist, without requiring manual route changes.
Details: Degradation is determined by providerConfig.health.status === 'degraded'. Fully down providers are filtered out entirely (as in all strategies). The demotion is transient and reverses when health checks report the provider as healthy again.
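The combined ordering can be sketched as a single comparator: degraded status is the primary sort key, priority the secondary. The candidate shape is assumed:

```typescript
// Sketch of failover ordering: healthy before degraded, then by priority.
// The candidate shape is assumed for illustration.
interface HealthCandidate {
  id: string;
  priority: number;
  health: { status: "healthy" | "degraded" };
}

function failoverOrder(candidates: HealthCandidate[]): HealthCandidate[] {
  return [...candidates].sort((a, b) => {
    const aDeg = a.health.status === "degraded" ? 1 : 0;
    const bDeg = b.health.status === "degraded" ? 1 : 0;
    // Healthy before degraded; within each group, ascending priority.
    return aDeg - bDeg || a.priority - b.priority;
  });
}

const fo = failoverOrder([
  { id: "primary", priority: 1, health: { status: "degraded" } },
  { id: "secondary", priority: 2, health: { status: "healthy" } },
]);
// fo[0].id === "secondary" while "primary" is degraded
```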
10. Random
Algorithm: Fisher-Yates shuffle with uniform random selection. All candidates receive equal probability regardless of priority or weight.
Behavior: Each request is routed to a randomly selected provider from the available pool. Over a large number of requests, traffic distributes uniformly across all candidates.
When to use: When you want simple, unbiased load distribution without any preference for specific providers. Useful for testing, benchmarking, or scenarios where all providers are equivalent and no ordering preference exists.
Details: Uses Math.random() for selection. The full candidate list is shuffled, so if the first randomly selected provider fails, the next in the shuffled order is tried.
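A minimal sketch of an unbiased Fisher-Yates shuffle over the candidate list:

```typescript
// Sketch of an unbiased Fisher-Yates shuffle using Math.random().
// Walking from the end and swapping with a uniform index in [0, i]
// gives every permutation equal probability.
function shuffle<T>(items: T[]): T[] {
  const out = [...items];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```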
Fallback Chain
After the strategy orders the main candidates, the routing service appends the fallback chain — a separately configured list of last-resort providers. The fallback chain is useful for providers that should only be used when all primary routes fail.
Additionally, a local fallback can be configured (e.g., an Ollama or vLLM instance) for complete offline resilience.
The routing service detects circular references in the fallback chain and breaks the cycle with a warning log.
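One way to implement this detection is a visited set while walking the chain, as sketched below. The `fallbackOf` lookup is hypothetical; the actual service's internals may differ:

```typescript
// Sketch of cycle detection while walking a fallback chain: a visited
// set breaks the loop. `fallbackOf` is a hypothetical lookup function.
function resolveFallbacks(
  start: string,
  fallbackOf: (id: string) => string | undefined,
): string[] {
  const chain: string[] = [];
  const seen = new Set<string>();
  let current: string | undefined = start;
  while (current !== undefined) {
    if (seen.has(current)) {
      // Revisiting a provider means the chain loops; stop here.
      console.warn(`circular fallback reference at ${current}; breaking cycle`);
      break;
    }
    seen.add(current);
    chain.push(current);
    current = fallbackOf(current);
  }
  return chain;
}
```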
Caching
Resolved provider chains are cached in an in-memory LRU cache (max 1000 entries, 60-second TTL with 10% jitter). The cache key is (tenantId, capability, model). The cache is invalidated when routing configurations change.
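The cache key and jittered TTL can be sketched as below; the exact key format and jitter direction are assumptions for illustration:

```typescript
// Sketch of the cache key and jittered TTL. The key format and the
// symmetric +/-10% jitter are assumptions for illustration.
function cacheKey(tenantId: string, capability: string, model: string): string {
  return `${tenantId}:${capability}:${model}`;
}

function jitteredTtlMs(baseMs = 60_000, jitter = 0.1): number {
  // Spread expiry within +/-10% of the base TTL so cached chains don't
  // all expire at once and trigger a burst of re-resolutions.
  return baseMs * (1 + (Math.random() * 2 - 1) * jitter);
}
```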
Next Steps
- Set up routing rules using these strategies
- Provider Adapters — How each provider executes requests