# Performance Tuning
Optimize Fast LiteLLM for your specific workload and requirements.
## Understanding Performance Characteristics

### When Rust Excels

| Scenario | Improvement | Reason |
|---|---|---|
| Connection pooling | 3.2x | Lock-free DashMap |
| Rate limiting | 1.6x | Atomic operations |
| Large text tokenization | 1.7x | Optimized tiktoken-rs |
| High-cardinality rate limiting | 42x memory savings | Efficient data structures |
### When Python May Be Better

| Scenario | Rust speed vs. Python | Reason |
|---|---|---|
| Small text tokenization | ~0.5x | FFI overhead dominates |
| Simple routing | ~0.4x | Python dict access is fast |
| Single-threaded workloads | Variable | FFI overhead |
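Where the crossover lies depends on your data, and a quick micro-benchmark makes it concrete. A minimal sketch, using the `SimpleTokenCounter` API shown later in this guide:

```python
import time

from fast_litellm import SimpleTokenCounter

counter = SimpleTokenCounter()
short_text = "hello world"
long_text = "hello world " * 10_000

for label, text in [("short", short_text), ("long", long_text)]:
    start = time.perf_counter()
    for _ in range(1000):
        counter.count_tokens(text)
    elapsed = time.perf_counter() - start
    # For short inputs, most of the per-call time is FFI overhead
    print(f"{label}: {elapsed * 1000:.1f} µs/call")
```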
## Optimization Strategies

### 1. Profile Your Workload
Start by understanding where time is spent:
```python
import time

import fast_litellm

start = time.perf_counter()

# Your workload here
for i in range(1000):
    fast_litellm.check_rate_limit(f"user_{i}")

elapsed = time.perf_counter() - start
print(f"1000 rate limit checks: {elapsed:.3f}s")

# Get detailed stats
stats = fast_litellm.get_performance_stats()
for key, value in stats.items():
    print(f"  {key}: {value}")
```
### 2. Batch Operations
For token counting, batch operations are more efficient:
```python
from fast_litellm import SimpleTokenCounter

counter = SimpleTokenCounter()
texts = ["text1", "text2", "text3", ...]

# Less efficient: individual calls
counts = [counter.count_tokens(t) for t in texts]

# More efficient: batch call
counts = counter.count_tokens_batch(texts)
```
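To quantify the difference on your own texts, time both forms side by side. A minimal sketch reusing the same counter API; the speedup comes from crossing the FFI boundary once instead of once per text:

```python
import time

from fast_litellm import SimpleTokenCounter

counter = SimpleTokenCounter()
texts = [f"sample text number {i}" for i in range(10_000)]

start = time.perf_counter()
individual = [counter.count_tokens(t) for t in texts]
t_individual = time.perf_counter() - start

start = time.perf_counter()
batched = counter.count_tokens_batch(texts)
t_batched = time.perf_counter() - start

print(f"individual: {t_individual:.3f}s, batch: {t_batched:.3f}s")
```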
### 3. Connection Pool Sizing

Tune the connection pool to match your traffic:
```python
import fast_litellm

# For high-throughput APIs, configure the pool via environment or config file;
# the default is usually sufficient for most use cases.

# Monitor pool usage
stats = fast_litellm.get_connection_pool_stats()
active = stats.get('active_connections', 0)
idle = stats.get('idle_connections', 0)
print(f"Active: {active}, Idle: {idle}")
```
### 4. Rate Limiter Tuning
Choose appropriate rate limits:
```python
from fast_litellm import SimpleRateLimiter

# For burst traffic (allows temporary spikes)
limiter = SimpleRateLimiter(requests_per_minute=100)

# The internal token bucket allows some burst capacity;
# burst size is automatically calculated as rpm / 10.
```
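Given the rpm/10 rule, a limiter at 100 rpm should absorb roughly 10 back-to-back requests before throttling. A quick sketch to confirm this on your build, assuming `check()` returns a truthy value when the request is allowed (as the examples below treat it):

```python
from fast_litellm import SimpleRateLimiter

limiter = SimpleRateLimiter(requests_per_minute=100)

# With burst = rpm / 10, roughly the first 10 rapid checks should pass
allowed = sum(1 for _ in range(20) if limiter.check("burst_test"))
print(f"Allowed {allowed} of 20 back-to-back requests")
```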
### 5. Feature Selection
Disable features that don't benefit your workload:
```bash
# If routing overhead outweighs benefits
export FAST_LITELLM_RUST_ROUTING=false

# If only processing small texts
export FAST_LITELLM_RUST_TOKEN_COUNTING=false
```
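The same flags can also be set from Python, as long as that happens before `fast_litellm` is imported (this assumes the flags are read once at import time, which is typical for this kind of switch):

```python
import os

# Must run before fast_litellm is imported
os.environ["FAST_LITELLM_RUST_ROUTING"] = "false"
os.environ["FAST_LITELLM_RUST_TOKEN_COUNTING"] = "false"

import fast_litellm  # noqa: E402  (deliberately imported after env setup)
```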
## Benchmarking Your Setup

### Built-in Benchmarks

Fast LiteLLM ships with a built-in benchmark suite; run it to establish baseline numbers for your environment before tuning anything.
### Custom Benchmarks
Create benchmarks for your specific use case:
```python
import statistics
import time

from fast_litellm import SimpleRateLimiter

def benchmark_rate_limiter(iterations=1000):
    limiter = SimpleRateLimiter()
    times = []
    for i in range(iterations):
        start = time.perf_counter()
        limiter.check(f"key_{i % 100}")
        elapsed = time.perf_counter() - start
        times.append(elapsed * 1000)  # convert to ms
    return {
        "mean_ms": statistics.mean(times),
        "median_ms": statistics.median(times),
        "stdev_ms": statistics.stdev(times),
        "min_ms": min(times),
        "max_ms": max(times),
        "throughput_ops": iterations / sum(times) * 1000,  # ops per second
    }

results = benchmark_rate_limiter()
print(f"Mean: {results['mean_ms']:.3f}ms")
print(f"Throughput: {results['throughput_ops']:.0f} ops/s")
```
## Memory Optimization

### High-Cardinality Workloads
For applications with many unique keys:
```python
from fast_litellm import SimpleRateLimiter

# Rust is 42x more memory efficient for 1000+ unique keys
limiter = SimpleRateLimiter()

# Handle many unique users efficiently
unique_user_ids = (f"user_{i}" for i in range(1_000_000))  # could be millions
for user_id in unique_user_ids:
    result = limiter.check(user_id)
```
### Monitor Memory Usage
```python
import os

import psutil

def get_memory_mb():
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1024 / 1024

print(f"Memory: {get_memory_mb():.1f} MB")
```
## Concurrency Optimization

### Thread Pool Sizing
Fast LiteLLM's Rust components are thread-safe. Optimize thread count:
```python
import concurrent.futures

from fast_litellm import SimpleRateLimiter

limiter = SimpleRateLimiter()
user_ids = [f"user_{i}" for i in range(1000)]

def check_rate(user_id):
    return limiter.check(user_id)

# Optimal thread count depends on your workload.
# Start with the CPU count and adjust based on benchmarks.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(check_rate, user_ids))
```
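One way to find the sweet spot is to sweep worker counts and measure throughput directly:

```python
import concurrent.futures
import os
import time

from fast_litellm import SimpleRateLimiter

limiter = SimpleRateLimiter()
user_ids = [f"user_{i}" for i in range(10_000)]
cpus = os.cpu_count() or 4

for workers in (1, 2, 4, cpus, 2 * cpus):
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        list(executor.map(limiter.check, user_ids))
    elapsed = time.perf_counter() - start
    print(f"{workers} workers: {len(user_ids) / elapsed:,.0f} ops/s")
```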
### Async Integration
For async applications:
```python
import asyncio

from fast_litellm import SimpleRateLimiter

limiter = SimpleRateLimiter()

async def check_rate_async(user_id):
    # Run the blocking Rust call in a thread pool so it doesn't stall the event loop
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, limiter.check, user_id)

async def main():
    tasks = [check_rate_async(f"user_{i}") for i in range(100)]
    results = await asyncio.gather(*tasks)
    return results

asyncio.run(main())
```
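On Python 3.9+, `asyncio.to_thread` expresses the same pattern more concisely:

```python
import asyncio

from fast_litellm import SimpleRateLimiter

limiter = SimpleRateLimiter()

async def main():
    # to_thread wraps run_in_executor with the default thread pool
    return await asyncio.gather(
        *(asyncio.to_thread(limiter.check, f"user_{i}") for i in range(100))
    )

asyncio.run(main())
```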
## Production Checklist

### Before Deployment
- [ ] Run benchmarks with production-like data
- [ ] Configure feature flags appropriately
- [ ] Set up monitoring and alerting
- [ ] Test fallback behavior
- [ ] Configure connection pool cleanup
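Several of these items fold naturally into a pre-deployment smoke script. A sketch built from the functions referenced elsewhere on this page:

```python
import fast_litellm

# Health check
health = fast_litellm.health_check()
assert health["status"] == "ok", f"unhealthy: {health}"

# Capture baseline numbers to compare against your benchmarks
print(fast_litellm.get_performance_stats())

# Exercise connection pool cleanup once before going live
fast_litellm.cleanup_expired_connections()
```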
### Monitoring in Production
```python
import fast_litellm

# Periodic health check
health = fast_litellm.health_check()
assert health['status'] == 'ok'

# Monitor performance
stats = fast_litellm.get_performance_stats()

# Check for recommendations
recommendations = fast_litellm.get_recommendations()
for rec in recommendations:
    print(f"Recommendation: {rec}")
```
### Gradual Rollout

1. Start at 5% traffic
2. Monitor error rates and latency
3. Increase to 25%, then 50%, then 100%
4. Roll back if issues are detected
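If your deployment has no built-in percentage control, a deterministic hash gate in your own dispatch layer works well. A hypothetical sketch (`ROLLOUT_PERCENT` and `use_rust_path` are illustrative names, not library APIs):

```python
import hashlib

ROLLOUT_PERCENT = 5  # start at 5%, raise through 25/50/100 as metrics stay healthy

def use_rust_path(request_id: str) -> bool:
    # Deterministic: a given request id always lands in the same bucket
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT
```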
## Troubleshooting Performance Issues

### Symptom: Rust Slower Than Expected

- Check FFI overhead vs. operation size
- Verify Rust extensions are loaded
- Profile with `get_performance_stats()`
### Symptom: High Memory Usage

- Check for connection leaks
- Run `cleanup_expired_connections()`
- Monitor with `get_connection_pool_stats()`
### Symptom: Rate Limit Errors

- Verify rate limit configuration
- Check for clock skew in distributed systems
- Reset errors with `reset_errors()`
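When triaging any of the above, it can help to bundle the checks into a single diagnostic pass. A sketch using only the functions this page already references:

```python
import fast_litellm

def diagnose():
    print("pool:", fast_litellm.get_connection_pool_stats())
    print("perf:", fast_litellm.get_performance_stats())

    # Clear stale state before re-testing
    fast_litellm.cleanup_expired_connections()
    fast_litellm.reset_errors()

diagnose()
```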
## Next Steps
- Monitoring - Set up comprehensive monitoring
- Configuration - Fine-tune settings