Routing¶
Fast LiteLLM provides an advanced router for distributing requests across multiple model deployments. Several routing strategies are available to control how load is distributed.
Overview¶
The router helps you:
- Distribute requests across multiple deployments
- Implement failover and load balancing
- Optimize for latency or cost
- Handle blocked or unavailable models
Key Features¶
- Multiple routing strategies (shuffle, least busy, latency-based, cost-based)
- Thread-safe concurrent access using DashMap
- Real-time metrics tracking
- Automatic failover to available deployments
Performance¶
Note
Routing performance depends heavily on the complexity of your model list and the routing strategy used. For simple cases, Python may be faster due to FFI overhead. For complex deployments with many models, Rust provides better scalability.
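If routing overhead matters for your workload, it is easy to measure directly. A minimal sketch, assuming the standalone get_available_deployment function documented below and a synthetic model list (the endpoints are placeholders):

```python
import time

import fast_litellm

# Synthetic model list: many deployments under one model name.
deployments = [
    {"model_name": "gpt-4", "endpoint": f"https://host-{i}.example.com"}
    for i in range(100)
]

start = time.perf_counter()
for _ in range(10_000):
    fast_litellm.get_available_deployment(
        model_list=deployments,
        model="gpt-4",
        blocked_models=[],
        context=None,
        settings=None,
    )
elapsed = time.perf_counter() - start
print(f"~{elapsed / 10_000 * 1e6:.1f} µs per routing call")
```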
Basic Usage¶
Automatic Acceleration¶
Routing is automatically accelerated when you import fast_litellm:
```python
import fast_litellm
import litellm

# Configure multiple deployments
litellm.model_list = [
    {
        "model_name": "gpt-4",
        "litellm_params": {"model": "openai/gpt-4", "api_key": "key1"},
    },
    {
        "model_name": "gpt-4",
        "litellm_params": {"model": "azure/gpt-4", "api_key": "key2"},
    },
]

# Routing is now accelerated
response = litellm.completion(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
Direct API Access¶
Use the router directly:
```python
from fast_litellm import AdvancedRouter

router = AdvancedRouter(strategy="simple_shuffle")

# Define your deployments
deployments = [
    {"model_name": "gpt-4", "endpoint": "https://api.openai.com"},
    {"model_name": "gpt-4", "endpoint": "https://api.azure.com"},
    {"model_name": "gpt-3.5-turbo", "endpoint": "https://api.openai.com"},
]

# Get an available deployment
deployment = router.get_available_deployment(
    model_list=deployments,
    model="gpt-4",
)

if deployment:
    print(f"Using: {deployment['endpoint']}")
```
Routing Strategies¶
Simple Shuffle (Default)¶
Randomly selects from the available deployments.
Best for: Even distribution across healthy deployments
Least Busy¶
Routes to the deployment with the fewest active requests.
Best for: Balancing load across deployments
Latency-Based¶
Routes to the deployment with the lowest average latency.
Best for: Minimizing response time
Cost-Based¶
Routes to the most cost-effective deployment.
Best for: Minimizing API costs
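All four strategies share the same interface; the strategy is chosen when the router is constructed. A minimal sketch using the strategy names from the API reference below, reusing the deployments list from the direct-API example above:

```python
from fast_litellm import AdvancedRouter

# The strategy is fixed at construction time.
shuffle_router = AdvancedRouter(strategy="simple_shuffle")
busy_router = AdvancedRouter(strategy="least_busy")
latency_router = AdvancedRouter(strategy="latency_based")
cost_router = AdvancedRouter(strategy="cost_based")

# Every router exposes the same selection call.
deployment = cost_router.get_available_deployment(
    model_list=deployments,
    model="gpt-4",
)
```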
API Reference¶
AdvancedRouter¶
```python
from typing import Dict, List, Optional

class AdvancedRouter:
    def __init__(self, strategy: str = "simple_shuffle") -> None:
        """
        Create a router with the specified strategy.

        Args:
            strategy: One of "simple_shuffle", "least_busy",
                "latency_based", "cost_based"
        """

    def get_available_deployment(
        self,
        model_list: List[Dict],
        model: str,
        blocked_models: Optional[List[str]] = None,
    ) -> Optional[Dict]:
        """
        Get an available deployment for the specified model.

        Args:
            model_list: List of deployment configurations
            model: The model name to route to
            blocked_models: Models to exclude from routing

        Returns:
            A deployment dict, or None if no deployment is available
        """

    @property
    def strategy(self) -> str:
        """Get the current routing strategy."""
```
Standalone Function¶
The same selection logic is available as a module-level function:

```python
deployment = fast_litellm.get_available_deployment(
    model_list=[...],
    model="gpt-4",
    blocked_models=["gpt-4-preview"],
    context=None,
    settings=None,
)
```
Blocking Models¶
Exclude specific models from routing:
```python
from fast_litellm import AdvancedRouter

router = AdvancedRouter()

# Block a problematic deployment
deployment = router.get_available_deployment(
    model_list=deployments,
    model="gpt-4",
    blocked_models=["gpt-4-azure-east"],  # Skip this deployment
)
```
Deployment Configuration¶
Each deployment in the model list should include:
```python
deployment = {
    "model_name": "gpt-4",            # Model identifier
    "litellm_params": {
        "model": "openai/gpt-4",      # Provider/model
        "api_key": "your-api-key",    # API credentials
        "api_base": "https://...",    # Optional: custom endpoint
    },
    # Optional metadata
    "metadata": {
        "region": "us-east-1",
        "priority": 1,
    },
}
```
Failover Example¶
Implement automatic failover:
```python
import fast_litellm
import litellm

def call_with_failover(messages, max_retries=3):
    deployments = litellm.model_list.copy()
    blocked = []

    for attempt in range(max_retries):
        deployment = fast_litellm.get_available_deployment(
            model_list=deployments,
            model="gpt-4",
            blocked_models=blocked,
        )
        if not deployment:
            raise RuntimeError("No deployments available")

        try:
            return litellm.completion(
                model=deployment["model_name"],
                messages=messages,
            )
        except Exception as e:
            # Block this deployment and retry. Note: blocking is by
            # model_name, so give each deployment a distinct model_name
            # if failover should skip only the deployment that failed.
            blocked.append(deployment["model_name"])
            print(f"Deployment failed, trying next: {e}")

    raise RuntimeError("All deployments failed")
```
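Usage then looks like any other completion call:

```python
response = call_with_failover([{"role": "user", "content": "Hello!"}])
print(response)
```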
Load Balancing Example¶
Balance load across deployments with the least-busy strategy, which routes to the deployment with the fewest active requests:

```python
from fast_litellm import AdvancedRouter

router = AdvancedRouter(strategy="least_busy")

# Deployments with different capacities. The "capacity" field is
# illustrative metadata; the least_busy strategy selects on active
# request counts.
deployments = [
    {"model_name": "gpt-4", "capacity": 100, "endpoint": "primary"},
    {"model_name": "gpt-4", "capacity": 50, "endpoint": "secondary"},
]

def get_best_deployment():
    return router.get_available_deployment(
        model_list=deployments,
        model="gpt-4",
    )

deployment = get_best_deployment()
```
How It Works¶
The Rust implementation uses DashMap for thread-safe concurrent access:
```rust
// Simplified implementation
use dashmap::DashMap;

struct Router {
    metrics: DashMap<String, DeploymentMetrics>,
    strategy: String,
}

impl Router {
    fn get_deployment(&self, model: &str) -> Option<Deployment> {
        match self.strategy.as_str() {
            "least_busy" => self.get_least_busy(model),
            "latency_based" => self.get_lowest_latency(model),
            "cost_based" => self.get_lowest_cost(model),
            _ => self.get_random(model),
        }
    }
}
```
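The same map lets many request threads update metrics concurrently without a global lock. A minimal sketch of how per-deployment counters could be maintained with DashMap's entry API; the DeploymentMetrics fields and method names here are illustrative, not Fast LiteLLM's actual internals:

```rust
use dashmap::DashMap;

// Illustrative metrics record, not the library's actual struct.
#[derive(Default)]
struct DeploymentMetrics {
    active_requests: u64,
    total_latency_ms: u64,
    completed: u64,
}

struct MetricsStore {
    metrics: DashMap<String, DeploymentMetrics>,
}

impl MetricsStore {
    // Increment the active-request count when a request starts.
    fn on_request_start(&self, deployment_id: &str) {
        self.metrics
            .entry(deployment_id.to_string())
            .or_default()
            .active_requests += 1;
    }

    // Record latency and decrement the count when it finishes.
    fn on_request_end(&self, deployment_id: &str, latency_ms: u64) {
        if let Some(mut m) = self.metrics.get_mut(deployment_id) {
            m.active_requests = m.active_requests.saturating_sub(1);
            m.total_latency_ms += latency_ms;
            m.completed += 1;
        }
    }
}
```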
Next Steps¶
- Configuration - Configure routing behavior
- Performance Tuning - Optimize routing performance