API Reference¶

Complete API reference for Fast LiteLLM, maintained by Dipankar Sarkar at Neul Labs (https://www.neul.uk, [email protected]).

Module: fast_litellm¶

The main module provides access to all accelerated functionality.

Constants¶

RUST_ACCELERATION_AVAILABLE¶

fast_litellm.RUST_ACCELERATION_AVAILABLE: bool

True if Rust acceleration is available, False if falling back to Python.

import fast_litellm

if fast_litellm.RUST_ACCELERATION_AVAILABLE:
    print("Rust acceleration is active")

Core Functions¶

health_check¶

fast_litellm.health_check() -> Dict[str, Any]

Perform a health check on the acceleration system.

Returns:

{
    "status": "ok",
    "rust_available": True,
    "components": ["core", "tokens", "connection_pool", "rate_limiter"]
}

Example:

health = fast_litellm.health_check()
print(f"Status: {health['status']}")

apply_acceleration¶

fast_litellm.apply_acceleration() -> bool

Manually apply acceleration patches. Called automatically on import.

Returns: True if patches were applied successfully.

remove_acceleration¶

fast_litellm.remove_acceleration() -> None

Remove acceleration patches and restore original implementations.

get_patch_status¶

fast_litellm.get_patch_status() -> Dict[str, Any]

Get the current status of acceleration patches.

Returns:

{
    "applied": True,
    "components": ["routing", "token_counting", "rate_limiting", "connection_pooling"]
}

Feature Flags¶

is_enabled¶

fast_litellm.is_enabled(
    feature_name: str,
    request_id: Optional[str] = None
) -> bool

Check if a feature is enabled.

Parameters:

Name	Type	Description
`feature_name`	str	Name of the feature to check
`request_id`	str (optional)	Request ID for gradual rollout

Example:

if fast_litellm.is_enabled("rust_routing"):
    # Use Rust routing
    pass

get_feature_status¶

fast_litellm.get_feature_status() -> Dict[str, Dict[str, Any]]

Get status of all features.

Returns:

{
    "rust_routing": {"enabled": True, "errors": 0, "rollout_percentage": 100},
    "rust_token_counting": {"enabled": True, "errors": 0, "rollout_percentage": 100},
    "rust_rate_limiting": {"enabled": True, "errors": 0, "rollout_percentage": 100},
    "rust_connection_pool": {"enabled": True, "errors": 0, "rollout_percentage": 100}
}

reset_errors¶

fast_litellm.reset_errors(feature_name: Optional[str] = None) -> None

Reset error counts for features.

Parameters:

Name	Type	Description
`feature_name`	str (optional)	Reset specific feature, or all if None

Performance Monitoring¶

record_performance¶

fast_litellm.record_performance(
    component: str,
    operation: str,
    duration_ms: float,
    success: Optional[bool] = True,
    input_size: Optional[int] = None,
    output_size: Optional[int] = None
) -> None

Record a performance metric.

Parameters:

Name	Type	Description
`component`	str	Component name (e.g., "rate_limiter")
`operation`	str	Operation name (e.g., "check")
`duration_ms`	float	Duration in milliseconds
`success`	bool	Whether the operation succeeded
`input_size`	int	Optional input size in bytes
`output_size`	int	Optional output size in bytes

get_performance_stats¶

fast_litellm.get_performance_stats(
    component: Optional[str] = None
) -> Dict[str, Any]

Get performance statistics.

Parameters:

Name	Type	Description
`component`	str (optional)	Filter by component

Returns: Dictionary of performance metrics.

compare_implementations¶

fast_litellm.compare_implementations(
    rust_component: str,
    python_component: str
) -> Dict[str, Any]

Compare Rust and Python implementation performance.

Returns:

{
    "rust_avg_ms": 0.5,
    "python_avg_ms": 1.2,
    "speedup": 2.4,
    "recommendation": "use_rust"
}

get_recommendations¶

fast_litellm.get_recommendations() -> List[Dict[str, Any]]

Get optimization recommendations based on collected metrics.

Returns: List of recommendation dictionaries.

export_performance_data¶

fast_litellm.export_performance_data(
    component: Optional[str] = None,
    format: Optional[str] = "json"
) -> str

Export performance data.

Parameters:

Name	Type	Description
`component`	str (optional)	Filter by component
`format`	str	Output format ("json" or "csv")

Returns: Formatted performance data string.

Rate Limiting¶

check_rate_limit¶

fast_litellm.check_rate_limit(key: str) -> Dict[str, Any]

Check if a request is allowed under rate limits.

Returns:

{
    "allowed": True,
    "reason": "ok",
    "remaining_requests": 59,
    "retry_after_ms": None  # Only present if not allowed
}

get_rate_limit_stats¶

fast_litellm.get_rate_limit_stats() -> Dict[str, Any]

Get rate limiter statistics.

Connection Pool¶

get_connection¶

fast_litellm.get_connection(endpoint: str) -> Optional[str]

Get a connection ID for an endpoint.

Returns: Connection ID string or None.

return_connection¶

fast_litellm.return_connection(connection_id: str) -> None

Return a connection to the pool.

remove_connection¶

fast_litellm.remove_connection(connection_id: str) -> None

Remove a connection from the pool.

health_check_connection¶

fast_litellm.health_check_connection(connection_id: str) -> bool

Check if a connection is healthy.

cleanup_expired_connections¶

fast_litellm.cleanup_expired_connections() -> None

Clean up expired/idle connections.

get_connection_pool_stats¶

fast_litellm.get_connection_pool_stats() -> Dict[str, Any]

Get connection pool statistics.

Routing¶

get_available_deployment¶

fast_litellm.get_available_deployment(
    model_list: List[Dict],
    model: str,
    blocked_models: Optional[List[str]] = None,
    context: Optional[Any] = None,
    settings: Optional[Any] = None
) -> Optional[Dict]

Get an available deployment for a model.

Parameters:

Name	Type	Description
`model_list`	List[Dict]	List of deployment configurations
`model`	str	Model name to route to
`blocked_models`	List[str]	Models to exclude
`context`	Any	Optional request context
`settings`	Any	Optional settings

Returns: Deployment dictionary or None.

Classes¶

SimpleTokenCounter¶

Token counting with cost estimation.

class SimpleTokenCounter:
    def __init__(self, model_max_tokens: int = 4096) -> None: ...
    def count_tokens(self, text: str, model: Optional[str] = None) -> int: ...
    def count_tokens_batch(self, texts: List[str], model: Optional[str] = None) -> List[int]: ...
    def estimate_cost(self, input_tokens: int, output_tokens: int, model: str) -> float: ...
    def get_model_limits(self, model: str) -> Dict[str, Any]: ...
    def validate_input(self, text: str, model: str) -> bool: ...
    @property
    def model_max_tokens(self) -> int: ...

SimpleRateLimiter¶

Rate limiting with token bucket algorithm.

class SimpleRateLimiter:
    def __init__(self, requests_per_minute: int = 60) -> None: ...
    def check(self, key: Optional[str] = None) -> Dict[str, Any]: ...
    def is_allowed(self, key: Optional[str] = None) -> bool: ...
    def get_remaining(self, key: Optional[str] = None) -> int: ...
    def get_stats(self) -> Dict[str, Any]: ...

SimpleConnectionPool¶

Connection pool management.

class SimpleConnectionPool:
    def __init__(self, pool_name: str = "default") -> None: ...
    def get_connection(self, endpoint: str) -> Optional[str]: ...
    def return_connection(self, connection_id: str) -> None: ...
    def health_check(self, connection_id: str) -> bool: ...
    def cleanup(self) -> None: ...
    def get_stats(self) -> Dict[str, Any]: ...

AdvancedRouter¶

Advanced routing with multiple strategies.

class AdvancedRouter:
    def __init__(self, strategy: str = "simple_shuffle") -> None: ...
    def get_available_deployment(
        self,
        model_list: List[Dict],
        model: str,
        blocked_models: Optional[List[str]] = None
    ) -> Optional[Dict]: ...
    @property
    def strategy(self) -> str: ...

Strategies:

simple_shuffle - Random selection
least_busy - Lowest active requests
latency_based - Lowest average latency
cost_based - Most cost-effective

Environment Variables¶

Variable	Default	Description
`FAST_LITELLM_ENABLED`	`true`	Enable/disable all acceleration
`FAST_LITELLM_RUST_ROUTING`	`true`	Enable Rust routing
`FAST_LITELLM_RUST_TOKEN_COUNTING`	`true`	Enable Rust token counting
`FAST_LITELLM_RUST_RATE_LIMITING`	`true`	Enable Rust rate limiting
`FAST_LITELLM_RUST_CONNECTION_POOL`	`true`	Enable Rust connection pool
`FAST_LITELLM_FEATURE_CONFIG`	-	Path to feature config file
`FAST_LITELLM_BATCH_TOKEN_COUNTING`	`true`	Enable batch token counting