Architecture¶

How Fast-LangGraph achieves its performance improvements.

Overview¶

Fast-LangGraph is a hybrid Rust/Python library. Performance-critical components are implemented in Rust and exposed to Python via PyO3 bindings.

┌─────────────────────────────────────────────┐
│              Python Application             │
│                                             │
│   from fast_langgraph import cached         │
│   @cached                                   │
│   def call_llm(prompt): ...                 │
└─────────────────┬───────────────────────────┘
                  │
                  ▼
┌─────────────────────────────────────────────┐
│           fast_langgraph (Python)           │
│                                             │
│   - High-level API                          │
│   - Decorator wrappers                      │
│   - LangGraph integration                   │
└─────────────────┬───────────────────────────┘
                  │ PyO3 bindings
                  ▼
┌─────────────────────────────────────────────┐
│           fast_langgraph (Rust)             │
│                                             │
│   - LRU/TTL caches                          │
│   - SQLite checkpointer                     │
│   - State merge operations                  │
│   - Profiler                                │
└─────────────────────────────────────────────┘

Why Rust?¶

Python's Global Interpreter Lock (GIL) and dynamic typing create overhead for:

Object copying - deepcopy() is expensive for nested structures
Serialization - Checkpoint encoding/decoding
Memory management - Frequent allocations in hot paths

Rust provides:

Zero-cost abstractions
No GIL (true parallelism possible)
Predictable memory layout
Compile-time optimizations

Where Rust Excels¶

Based on benchmarks, Rust provides the most dramatic improvements for:

Operation	Speedup	Why
Checkpoint serialization	43-737x	Avoids Python `deepcopy()` overhead
Sustained state updates	13-46x	No intermediate Python object creation
E2E graph execution	2-3x	Combined state + checkpoint benefits

Python Dict Performance

Python's built-in dict is already implemented in C and highly optimized. For simple dict operations (lookups, single merges), Python is competitive. Rust's advantage comes from avoiding Python object overhead in complex, sustained operations.

Component Design¶

Caching System¶

The cache uses a Rust HashMap with LRU eviction:

┌─────────────────────────────────────────┐
│              RustLLMCache               │
├─────────────────────────────────────────┤
│  HashMap<String, CacheEntry>            │
│  ┌─────────┬─────────┬─────────┐       │
│  │  key    │  value  │  access │       │
│  ├─────────┼─────────┼─────────┤       │
│  │ hash(p) │ result  │  time   │       │
│  └─────────┴─────────┴─────────┘       │
│                                         │
│  max_size: usize                        │
│  hits: AtomicU64                        │
│  misses: AtomicU64                      │
└─────────────────────────────────────────┘

Performance gains come from:

Rust's efficient HashMap implementation
Pre-computed string hashing
Lock-free statistics counters
No Python object overhead for internal operations

Checkpointing System¶

The SQLite checkpointer optimizes:

Serialization - JSON encoding in Rust (serde) vs Python's json module
Batching - Groups writes to reduce SQLite transactions
Prepared statements - Reuses compiled SQL

Python checkpoint flow:
  dict → json.dumps → sqlite3.execute → disk

Rust checkpoint flow:
  dict → PyO3 extract → serde_json → rusqlite → disk
        └── happens once ──┘└── native speed ──┘

State Merge System¶

State updates in LangGraph involve:

Deep-copying state dictionaries
Merging updates
Handling append-only keys (like messages)

Rust implementation:

Avoids Python dict copying overhead
Uses efficient key lookup
Minimizes allocations for append operations

Python Integration¶

PyO3 Bindings¶

Rust structs are exposed as Python classes:

#[pyclass]
struct RustLLMCache {
    cache: HashMap<String, String>,
    max_size: usize,
}

#[pymethods]
impl RustLLMCache {
    #[new]
    fn new(max_size: usize) -> Self { ... }

    fn get(&self, key: &str) -> Option<String> { ... }
    fn put(&mut self, key: String, value: String) { ... }
}

The @cached Decorator¶

The decorator bridges Python functions with Rust caching:

def cached(fn=None, *, max_size=None):
    def decorator(func):
        cache = RustLLMCache(max_size or 10000)

        @wraps(func)
        def wrapper(*args, **kwargs):
            key = _make_key(args, kwargs)
            result = cache.get(key)
            if result is not None:
                return result
            result = func(*args, **kwargs)
            cache.put(key, result)
            return result

        wrapper.cache_stats = cache.stats
        wrapper.cache_clear = cache.clear
        return wrapper

    return decorator(fn) if fn else decorator

Shim Implementation¶

The shim patches LangGraph internals at runtime:

# executor_cache.py
_executor_cache = {}

def get_executor_for_config_cached(config):
    key = (max_workers, thread_name_prefix)
    if key not in _executor_cache:
        _executor_cache[key] = ThreadPoolExecutor(...)
    return _executor_cache[key]

# shim.py
def patch_langgraph():
    langchain_core.runnables.config.get_executor_for_config = \
        get_executor_for_config_cached

Project Structure¶

fast-langgraph/
├── src/                      # Rust source
│   ├── lib.rs               # Library entry point
│   ├── python.rs            # PyO3 bindings
│   ├── function_cache.rs    # LRU cache
│   ├── llm_cache.rs         # LLM-specific cache
│   ├── checkpoint_sqlite.rs # SQLite checkpointer
│   └── state_merge.rs       # State operations
├── fast_langgraph/          # Python package
│   ├── __init__.py          # Public API
│   ├── shim.py              # Auto-patching
│   ├── executor_cache.py    # ThreadPool caching
│   ├── accelerator.py       # Rust wrappers
│   └── profiler.py          # Performance profiling
├── tests/                   # Python tests
├── benches/                 # Rust benchmarks
└── examples/                # Usage examples

Build System¶

Fast-LangGraph uses Maturin to build Python wheels with embedded Rust:

pyproject.toml          # Python package config
Cargo.toml              # Rust package config
src/
  lib.rs                # Rust library entry
  python.rs             # PyO3 bindings
fast_langgraph/
  __init__.py           # Python package

Build produces a wheel containing:

Pure Python modules (fast_langgraph/*.py)
Compiled Rust extension (fast_langgraph.cpython-*.so)

Limitations & Trade-offs¶

PyO3 Boundary Overhead¶

Data crossing the Python/Rust boundary has overhead (~1-2μs per call):

Simple dict lookup:
  Python: 50ns
  Rust (via PyO3): 300ns + 1000ns boundary = 1300ns

This means:

Simple dict lookups are faster in pure Python
Rust wins when avoiding repeated Python operations

Best Use Cases for Rust¶

Use Case	Rust Advantage
Large state (>10KB)	Checkpoint 100x+ faster
Many operations (100+ steps)	State updates 10x+ faster
Simple, one-shot operations	Use Python instead

Thread Safety¶

Rust components are not thread-safe by default. For concurrent access:

import threading

lock = threading.Lock()

def thread_safe_cache_access(cache, key, value):
    with lock:
        cache.put(key, value)

Future Directions¶

Potential improvements:

Async cache operations
Distributed caching (Redis backend)
More LangGraph component acceleration
SIMD-optimized operations