Token Counting¶
Fast LiteLLM provides accelerated token counting using Rust and tiktoken. For large texts, this achieves 1.5-1.7x faster token counting compared to Python implementations.
Overview¶
Token counting is essential for:
- Estimating API costs
- Validating input lengths
- Managing context windows
- Batch processing optimization
Key Features¶
- tiktoken-based counting for accurate BPE tokenization
- Model-specific encodings for accurate results
- Batch processing for multiple texts
- Cost estimation based on model pricing
Performance¶
| Text Size | Python | Rust | Improvement |
|---|---|---|---|
| Small (< 100 tokens) | 1.6ms | 3.1ms | Python faster |
| Large (1000+ chars) | 23.4ms | 13.9ms | 1.7x faster |
Note
For small texts, Python's tiktoken has lower overhead due to FFI costs. Rust acceleration is most beneficial for large texts and batch operations.
Basic Usage¶
Automatic Acceleration¶
Token counting is automatically accelerated when you import fast_litellm:
import fast_litellm
import litellm
# Token counting is now accelerated
tokens = litellm.encode(model="gpt-3.5-turbo", text="Hello, world!")
print(f"Token count: {len(tokens)}")
# Count tokens in messages
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Python?"},
]
count = litellm.token_counter(model="gpt-3.5-turbo", messages=messages)
print(f"Message tokens: {count}")
Direct API Access¶
Use the token counter directly:
from fast_litellm import SimpleTokenCounter
counter = SimpleTokenCounter()
# Count tokens
count = counter.count_tokens("Hello, world!", model="gpt-3.5-turbo")
print(f"Tokens: {count}")
# Batch counting
texts = [
"First text to count",
"Second text to count",
"Third text to count",
]
counts = counter.count_tokens_batch(texts, model="gpt-3.5-turbo")
print(f"Counts: {counts}")
API Reference¶
SimpleTokenCounter¶
class SimpleTokenCounter:
def __init__(self, model_max_tokens: int = 4096) -> None:
"""Create a token counter with optional max tokens limit."""
def count_tokens(self, text: str, model: Optional[str] = None) -> int:
"""Count tokens in a text string."""
def count_tokens_batch(
self,
texts: List[str],
model: Optional[str] = None
) -> List[int]:
"""Count tokens for multiple texts at once."""
def estimate_cost(
self,
input_tokens: int,
output_tokens: int,
model: str
) -> float:
"""Estimate cost for a request in USD."""
def get_model_limits(self, model: str) -> Dict[str, Any]:
"""Get token limits for a model."""
def validate_input(self, text: str, model: str) -> bool:
"""Validate that input doesn't exceed model limits."""
@property
def model_max_tokens(self) -> int:
"""Get the configured max tokens limit."""
Model Support¶
The token counter supports all major models:
| Provider | Models | Encoding |
|---|---|---|
| OpenAI | GPT-3.5, GPT-4, GPT-4o | cl100k_base |
| Anthropic | Claude 2, Claude 3 | cl100k_base |
| Gemini, PaLM | cl100k_base |
Cost Estimation¶
Estimate API costs before making requests:
from fast_litellm import SimpleTokenCounter
counter = SimpleTokenCounter()
# Count tokens
text = "Your long prompt here..."
input_tokens = counter.count_tokens(text, "gpt-4")
# Estimate cost (assuming 100 output tokens)
cost = counter.estimate_cost(
input_tokens=input_tokens,
output_tokens=100,
model="gpt-4"
)
print(f"Estimated cost: ${cost:.4f}")
Input Validation¶
Validate inputs before sending to the API:
from fast_litellm import SimpleTokenCounter
counter = SimpleTokenCounter()
text = "Your potentially long text..."
if counter.validate_input(text, "gpt-3.5-turbo"):
# Text is within limits
response = make_api_call(text)
else:
# Text exceeds model limits
print("Text is too long for this model")
Model Limits¶
Get token limits for any model:
from fast_litellm import SimpleTokenCounter
counter = SimpleTokenCounter()
limits = counter.get_model_limits("gpt-4")
print(f"Max input tokens: {limits.get('max_input_tokens', 'unknown')}")
print(f"Max output tokens: {limits.get('max_output_tokens', 'unknown')}")
print(f"Context window: {limits.get('context_window', 'unknown')}")
Batch Processing¶
For processing multiple texts efficiently:
from fast_litellm import SimpleTokenCounter
counter = SimpleTokenCounter()
# Process a batch of texts
texts = [
"First document content...",
"Second document content...",
"Third document content...",
# ... potentially many more
]
# Count all at once (more efficient)
counts = counter.count_tokens_batch(texts, model="gpt-3.5-turbo")
# Filter texts that exceed limits
max_tokens = 4000
valid_texts = [
text for text, count in zip(texts, counts)
if count <= max_tokens
]
When to Use Rust vs Python¶
Use Rust Acceleration For:¶
- Large documents (1000+ characters)
- Batch processing of multiple texts
- High-throughput token counting
- Memory-constrained environments
Use Python For:¶
- Small texts (< 100 tokens)
- One-off token counts
- When FFI overhead matters
How It Works¶
The Rust implementation uses tiktoken-rs with cached encodings:
// Simplified implementation
use std::sync::RwLock;
use tiktoken_rs::CoreBPE;
static ENCODINGS: RwLock<HashMap<String, CoreBPE>> = RwLock::new(HashMap::new());
fn count_tokens(text: &str, model: &str) -> usize {
// Get or create encoding (cached)
let encoding = get_encoding(model);
encoding.encode_ordinary(text).len()
}
The caching strategy ensures that encoding initialization happens only once per model, amortizing the setup cost across many calls.
Next Steps¶
- Routing - Learn about advanced routing
- Performance Tuning - Optimize token counting