
Token Counting

Fast LiteLLM provides accelerated token counting using Rust and tiktoken. For large texts, this achieves 1.5-1.7x faster token counting compared to Python implementations.

Overview

Token counting is essential for:

  • Estimating API costs
  • Validating input lengths
  • Managing context windows
  • Batch processing optimization

Key Features

  • tiktoken-based counting for accurate BPE tokenization
  • Model-specific encodings resolved automatically from the model name
  • Batch processing for multiple texts
  • Cost estimation based on model pricing

Performance

Text Size              Python    Rust      Improvement
Small (< 100 tokens)   1.6 ms    3.1 ms    Python faster
Large (1000+ chars)    23.4 ms   13.9 ms   1.7x faster

Note

For small texts, plain Python tiktoken is faster because each accelerated call pays a fixed FFI (foreign function interface) cost to cross into Rust. Rust acceleration is most beneficial for large texts and batch operations.
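
To find the crossover point on your own workload, a quick benchmark is straightforward. The sketch below is illustrative: it assumes fast_litellm has been imported so that litellm.encode takes the accelerated path, and the text sizes and run count are arbitrary.

import time

import fast_litellm  # patches litellm token counting on import
import litellm

def avg_ms(text: str, runs: int = 100) -> float:
    """Average milliseconds per encode() call for the given text."""
    start = time.perf_counter()
    for _ in range(runs):
        litellm.encode(model="gpt-3.5-turbo", text=text)
    return (time.perf_counter() - start) / runs * 1e3

small = "Hello, world!"
large = "lorem ipsum " * 2000  # well past the 1000-character threshold

print(f"small text: {avg_ms(small):.3f} ms/call")
print(f"large text: {avg_ms(large):.3f} ms/call")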

Basic Usage

Automatic Acceleration

Token counting is automatically accelerated when you import fast_litellm:

import fast_litellm
import litellm

# Token counting is now accelerated
tokens = litellm.encode(model="gpt-3.5-turbo", text="Hello, world!")
print(f"Token count: {len(tokens)}")

# Count tokens in messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]
count = litellm.token_counter(model="gpt-3.5-turbo", messages=messages)
print(f"Message tokens: {count}")

Direct API Access

Use the token counter directly:

from fast_litellm import SimpleTokenCounter

counter = SimpleTokenCounter()

# Count tokens
count = counter.count_tokens("Hello, world!", model="gpt-3.5-turbo")
print(f"Tokens: {count}")

# Batch counting
texts = [
    "First text to count",
    "Second text to count",
    "Third text to count",
]
counts = counter.count_tokens_batch(texts, model="gpt-3.5-turbo")
print(f"Counts: {counts}")

API Reference

SimpleTokenCounter

class SimpleTokenCounter:
    def __init__(self, model_max_tokens: int = 4096) -> None:
        """Create a token counter with optional max tokens limit."""

    def count_tokens(self, text: str, model: Optional[str] = None) -> int:
        """Count tokens in a text string."""

    def count_tokens_batch(
        self,
        texts: List[str],
        model: Optional[str] = None
    ) -> List[int]:
        """Count tokens for multiple texts at once."""

    def estimate_cost(
        self,
        input_tokens: int,
        output_tokens: int,
        model: str
    ) -> float:
        """Estimate cost for a request in USD."""

    def get_model_limits(self, model: str) -> Dict[str, Any]:
        """Get token limits for a model."""

    def validate_input(self, text: str, model: str) -> bool:
        """Validate that input doesn't exceed model limits."""

    @property
    def model_max_tokens(self) -> int:
        """Get the configured max tokens limit."""

Model Support

The token counter supports all major models:

Provider    Models                   Encoding
OpenAI      GPT-3.5, GPT-4, GPT-4o   cl100k_base
Anthropic   Claude 2, Claude 3       cl100k_base
Google      Gemini, PaLM             cl100k_base
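
Since the encoding is resolved from the model name, any of these identifiers can be passed straight to count_tokens. A minimal sketch; the exact Anthropic and Google model strings used here are illustrative assumptions, not a tested list:

from fast_litellm import SimpleTokenCounter

counter = SimpleTokenCounter()

text = "The same text, counted against different model families."
for model in ["gpt-4o", "claude-3-opus", "gemini-pro"]:
    print(f"{model}: {counter.count_tokens(text, model=model)} tokens")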

Cost Estimation

Estimate API costs before making requests:

from fast_litellm import SimpleTokenCounter

counter = SimpleTokenCounter()

# Count tokens
text = "Your long prompt here..."
input_tokens = counter.count_tokens(text, "gpt-4")

# Estimate cost (assuming 100 output tokens)
cost = counter.estimate_cost(
    input_tokens=input_tokens,
    output_tokens=100,
    model="gpt-4"
)
print(f"Estimated cost: ${cost:.4f}")

Input Validation

Validate inputs before sending to the API:

from fast_litellm import SimpleTokenCounter

counter = SimpleTokenCounter()

text = "Your potentially long text..."

if counter.validate_input(text, "gpt-3.5-turbo"):
    # Text is within limits
    response = make_api_call(text)
else:
    # Text exceeds model limits
    print("Text is too long for this model")

Model Limits

Get token limits for any model:

from fast_litellm import SimpleTokenCounter

counter = SimpleTokenCounter()

limits = counter.get_model_limits("gpt-4")
print(f"Max input tokens: {limits.get('max_input_tokens', 'unknown')}")
print(f"Max output tokens: {limits.get('max_output_tokens', 'unknown')}")
print(f"Context window: {limits.get('context_window', 'unknown')}")

Batch Processing

For processing multiple texts efficiently:

from fast_litellm import SimpleTokenCounter

counter = SimpleTokenCounter()

# Process a batch of texts
texts = [
    "First document content...",
    "Second document content...",
    "Third document content...",
    # ... potentially many more
]

# Count all at once (more efficient)
counts = counter.count_tokens_batch(texts, model="gpt-3.5-turbo")

# Filter texts that exceed limits
max_tokens = 4000
valid_texts = [
    text for text, count in zip(texts, counts)
    if count <= max_tokens
]

When to Use Rust vs Python

Use Rust Acceleration For:

  • Large documents (1000+ characters)
  • Batch processing of multiple texts
  • High-throughput token counting
  • Memory-constrained environments

Use Python For:

  • Small texts (< 100 tokens)
  • One-off token counts
  • When FFI overhead matters
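
If you would rather make this trade-off explicit than rely on the automatic patching, one option is a small size-based dispatcher. This is a hypothetical sketch, not part of the fast_litellm API; the 1000-character threshold is taken from the benchmark table above.

import tiktoken
from fast_litellm import SimpleTokenCounter

_fast_counter = SimpleTokenCounter()
_small_enc = tiktoken.get_encoding("cl100k_base")

LARGE_TEXT_THRESHOLD = 1000  # characters; crossover point from the benchmarks

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Route small texts to plain tiktoken, large texts to the Rust counter."""
    if len(text) < LARGE_TEXT_THRESHOLD:
        # Small input: calling tiktoken directly avoids the extra FFI round-trip
        return len(_small_enc.encode(text))
    # Large input: the Rust-backed counter wins here
    return _fast_counter.count_tokens(text, model=model)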

How It Works

The Rust implementation uses tiktoken-rs with cached encodings:

// Simplified implementation
use std::collections::HashMap;
use std::sync::{LazyLock, RwLock};
use tiktoken_rs::{get_bpe_from_model, CoreBPE};

static ENCODINGS: LazyLock<RwLock<HashMap<String, CoreBPE>>> =
    LazyLock::new(|| RwLock::new(HashMap::new()));

fn count_tokens(text: &str, model: &str) -> usize {
    // Fast path: reuse the cached encoding for this model
    if let Some(encoding) = ENCODINGS.read().unwrap().get(model) {
        return encoding.encode_ordinary(text).len();
    }
    // Slow path: build the encoding once, then cache it for later calls
    let encoding = get_bpe_from_model(model).expect("unsupported model");
    let count = encoding.encode_ordinary(text).len();
    ENCODINGS.write().unwrap().insert(model.to_string(), encoding);
    count
}

The caching strategy ensures that encoding initialization happens only once per model, amortizing the setup cost across many calls.
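
The effect is visible from Python: the first count for a given model pays the one-time encoding-initialization cost, and subsequent calls hit the cache. A minimal sketch, assuming the SimpleTokenCounter wrapper exposes the same caching behavior:

import time

from fast_litellm import SimpleTokenCounter

counter = SimpleTokenCounter()

t0 = time.perf_counter()
counter.count_tokens("warm-up text", model="gpt-4")  # builds and caches the encoding
first_ms = (time.perf_counter() - t0) * 1e3

t0 = time.perf_counter()
counter.count_tokens("warm-up text", model="gpt-4")  # served from the cache
second_ms = (time.perf_counter() - t0) * 1e3

print(f"first call:  {first_ms:.2f} ms (includes encoding init)")
print(f"second call: {second_ms:.2f} ms")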

Next Steps