Features Overview

Fast LiteLLM provides Rust-accelerated implementations of core LiteLLM components. Each component is designed to be a drop-in replacement. Reported speedups range from about 1.5x to 3.2x depending on the component, and routing gains vary by workload.

Accelerated Components

Component	Speedup	Description
Connection Pool	3.2x	Lock-free connection management using DashMap
Rate Limiting	1.6x	Atomic rate limiting with token bucket algorithm
Token Counting	1.5-1.7x	Fast token counting for large texts
Routing	Variable	Advanced deployment routing strategies
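
To make the token bucket algorithm named in the table concrete, here is a minimal single-threaded sketch in Python. It is not Fast LiteLLM's implementation, which the table describes as atomic and Rust-based. The class name `TokenBucket`, its parameters, and the injectable `clock` are all assumptions made for illustration.

```python
import time


class TokenBucket:
    """Illustrative token bucket; hypothetical, not part of fast_litellm."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)        # maximum tokens the bucket holds
        self.refill_rate = float(refill_rate)  # tokens added per second
        self.clock = clock                     # injectable for testing
        self.tokens = float(capacity)          # start with a full bucket
        self.last = clock()

    def try_acquire(self, cost=1.0):
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The bucket allows short bursts up to `capacity` while holding the long-run rate to `refill_rate` requests per second. A thread-safe version would need a lock or atomic operations around the refill-and-spend step.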

How Acceleration Works

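Fast LiteLLM uses monkeypatching to replace LiteLLM's Python implementations with Rust-accelerated versions. In the flow below, fast_litellm is imported before litellm, so the patches are already applied when LiteLLM calls are made: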

                    ┌─────────────────────┐
  User Code ───────▶│ import fast_litellm │
                    └─────────────────────┘
                              │
                              ▼
                    ┌─────────────────────┐
                    │ Apply Monkeypatches │
                    └─────────────────────┘
                              │
                              ▼
                    ┌─────────────────────┐
  User Code ───────▶│   import litellm    │
                    └─────────────────────┘
                              │
                              ▼
                    ┌─────────────────────┐
                    │ Accelerated Calls   │
                    │ ┌─────────────────┐ │
                    │ │  Rust Backend   │ │
                    │ └─────────────────┘ │
                    └─────────────────────┘
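
The patching step in the diagram can be sketched in plain Python. This is a generic illustration of monkeypatching, not the code Fast LiteLLM runs. The `target` namespace, `apply_patch`, and both `count_tokens` functions are hypothetical stand-ins.

```python
import types

# Hypothetical stand-in for a module such as litellm; not the real package.
target = types.SimpleNamespace()


def python_count_tokens(text):
    """Original pure-Python implementation (illustrative only)."""
    return len(text.split())


target.count_tokens = python_count_tokens


def accelerated_count_tokens(text):
    """Stand-in for a function that would be backed by compiled code."""
    return len(text.split())


def apply_patch(module, name, replacement):
    """Swap module.<name> for `replacement` and return the original."""
    original = getattr(module, name)
    setattr(module, name, replacement)
    return original


# After patching, callers that look up target.count_tokens get the new function.
original = apply_patch(target, "count_tokens", accelerated_count_tokens)
```

Keeping a reference to the original function is what makes it possible to restore it later or use it as a fallback.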

Safety Features

Automatic Fallback

If the Rust implementation encounters an error, Fast LiteLLM automatically falls back to the Python implementation:

import fast_litellm  # import first so the patches are applied
import litellm

# If Rust fails, Python implementation is used transparently
response = litellm.completion(...)
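
One way to picture the fallback behaviour is a wrapper that tries the fast path and reverts to the Python path on any exception. This is a sketch under assumptions, not Fast LiteLLM's actual mechanism. `with_fallback`, `failing_fast_path`, and `python_path` are hypothetical names.

```python
import functools


def with_fallback(fast_fn, slow_fn):
    """Return a callable that tries fast_fn and falls back to slow_fn on error."""

    @functools.wraps(slow_fn)
    def wrapper(*args, **kwargs):
        try:
            return fast_fn(*args, **kwargs)
        except Exception:
            # The accelerated path failed; use the reference implementation.
            return slow_fn(*args, **kwargs)

    return wrapper


def failing_fast_path(text):
    """Simulates an accelerated function that raises."""
    raise RuntimeError("simulated failure in the accelerated path")


def python_path(text):
    """Reference implementation used as the fallback."""
    return len(text.split())


count_tokens = with_fallback(failing_fast_path, python_path)
result = count_tokens("one two three")
```

The caller sees only the return value of `python_path`, which is the sense in which the fallback is transparent.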

Feature Flags

Each accelerated component can be individually enabled or disabled. The current status of every feature can be inspected at runtime:

import fast_litellm

# Check feature status
features = fast_litellm.get_feature_status()
print(features)
# {'rust_routing': {'enabled': True, ...},
#  'rust_token_counting': {'enabled': True, ...}, ...}

Error Tracking

Fast LiteLLM tracks errors per feature and can automatically disable problematic features. Error counts can be reset for all features or for a single one:

import fast_litellm

# Reset error counts
fast_litellm.reset_errors()

# Or reset for a specific feature
fast_litellm.reset_errors("rust_routing")
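
Per-feature error tracking with automatic disabling can be sketched as a counter with a threshold. This is an illustration only. The class `FeatureErrorTracker`, the threshold of 3, and the choice to re-enable a feature on reset are assumptions, not documented Fast LiteLLM behaviour.

```python
from collections import defaultdict


class FeatureErrorTracker:
    """Hypothetical tracker: disables a feature after `threshold` errors."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.errors = defaultdict(int)  # feature name -> error count
        self.disabled = set()           # features currently switched off

    def record_error(self, feature):
        self.errors[feature] += 1
        if self.errors[feature] >= self.threshold:
            self.disabled.add(feature)

    def is_enabled(self, feature):
        return feature not in self.disabled

    def reset_errors(self, feature=None):
        """Clear counts for one feature, or for all when feature is None.

        Assumption: resetting also re-enables the feature.
        """
        if feature is None:
            self.errors.clear()
            self.disabled.clear()
        else:
            self.errors.pop(feature, None)
            self.disabled.discard(feature)


tracker = FeatureErrorTracker(threshold=3)
for _ in range(3):
    tracker.record_error("rust_routing")
```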

When to Use Rust Acceleration

Best For

Connection pooling - 3x+ speedup with lock-free DashMap
Rate limiting - 1.5x+ speedup with atomic operations
Large text token counting - 1.5x+ speedup for longer texts
High-cardinality workloads - 40x+ lower memory for many unique keys
Production deployments - Thread-safety guarantees

Consider Carefully

Small text token counting - per-call FFI overhead can outweigh the gain, so the existing Python tiktoken path may be faster
Routing with Python objects - FFI conversion overhead may dominate
Simple single-threaded use cases - FFI overhead may not be worth it
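
The trade-off in these cases can be modelled as a fixed per-call cost set against a per-item saving. The sketch below computes the input size at which the two balance. The function name and all numbers are hypothetical and are not measurements of Fast LiteLLM.

```python
def break_even_size(call_overhead_us, python_us_per_item, rust_us_per_item):
    """Input size above which the accelerated path wins under a linear cost model.

    Model (an assumption): python cost = n * python_us_per_item,
    accelerated cost = call_overhead_us + n * rust_us_per_item.
    Returns None when the accelerated path is not faster per item.
    """
    gain_per_item = python_us_per_item - rust_us_per_item
    if gain_per_item <= 0:
        return None
    return call_overhead_us / gain_per_item


# Made-up numbers: 10 us boundary cost, 2 us vs 1 us per item.
threshold = break_even_size(10.0, 2.0, 1.0)
```

Under this model, inputs smaller than `threshold` items are cheaper on the Python path, which matches the guidance above for small texts and simple workloads.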

Performance Monitoring

Monitor the performance of accelerated operations in real time:

import fast_litellm

# Get statistics
stats = fast_litellm.get_performance_stats()

# Compare implementations
comparison = fast_litellm.compare_implementations(
    "rust_rate_limiter",
    "python_rate_limiter"
)

# Get recommendations
recommendations = fast_litellm.get_recommendations()
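
For an independent check of a comparison result, a small timing harness built on the standard library can help. This is a generic sketch and does not reflect how `compare_implementations` works internally. `time_calls` and `compare` are hypothetical helpers.

```python
import statistics
import time


def time_calls(fn, arg, repeats=200):
    """Return the median wall-clock time in seconds of fn(arg) over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(baseline_fn, candidate_fn, arg, repeats=200):
    """Time both callables on the same input and report the speedup ratio."""
    baseline = time_calls(baseline_fn, arg, repeats)
    candidate = time_calls(candidate_fn, arg, repeats)
    speedup = baseline / candidate if candidate > 0 else float("inf")
    return {"baseline_s": baseline, "candidate_s": candidate, "speedup": speedup}


# Example: compare a function against itself; the ratio should be near 1.
report = compare(sorted, sorted, list(range(100)), repeats=20)
```

The median is used because it is less sensitive than the mean to occasional slow runs caused by scheduling or garbage collection.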

Next Steps

Connection Pool - Learn about accelerated connection management
Rate Limiting - Explore atomic rate limiting
Token Counting - Fast token counting details
Routing - Advanced routing strategies