Benchmarks¶

This page presents performance benchmarks comparing fast-axolotl's Rust implementations against Python baselines.

Summary¶

Feature	Speedup	Best Use Case
Streaming Data Loading	70-77x	Large dataset iteration
Parallel Hashing	1.7-1.9x	Dataset deduplication
Token Packing	0.42-3.2x (depends on sequence length)	Sequence concatenation
Batch Padding	0.54-3.4x (depends on max length)	Batch preprocessing

Streaming Data Loading¶

The streaming reader is fast-axolotl's most impactful feature: in the benchmark below it reads 70-77x more rows per second than the Python baseline.

Benchmark Setup¶

Dataset: 1M rows, Parquet format
Hardware: 8-core CPU, NVMe SSD
Python baseline: HuggingFace datasets streaming

Results¶

Batch Size	Python (rows/sec)	fast-axolotl (rows/sec)	Speedup
100	1,200	92,400	77x
500	1,400	98,000	70x
1000	1,500	105,000	70x
5000	1,600	112,000	70x
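
The Speedup column is the ratio of the two throughput columns. A minimal sketch that recomputes it from the figures in the table above (the variable and function names are illustrative, not part of fast-axolotl):

```python
# Throughput figures copied from the results table above (rows/sec):
# batch size -> (Python baseline, fast-axolotl)
results = {
    100: (1_200, 92_400),
    500: (1_400, 98_000),
    1000: (1_500, 105_000),
    5000: (1_600, 112_000),
}

def speedup(python_rps: float, fast_rps: float) -> float:
    """Ratio of fast-axolotl throughput to the Python baseline."""
    return fast_rps / python_rps

speedups = {bs: round(speedup(py, fa)) for bs, (py, fa) in results.items()}

for batch_size, ratio in speedups.items():
    print(f"batch_size={batch_size}: {ratio}x")
```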

Format Comparison¶

Format	Throughput (rows/sec)	Relative Speed
Parquet	105,000	1.0x (baseline)
Arrow	98,000	0.93x
JSONL	45,000	0.43x
CSV	32,000	0.30x
JSON	28,000	0.27x

Recommendation

Use the Parquet format for the best streaming performance. ZSTD compression adds minimal overhead while reducing file size by 3-5x.

Parallel Hashing¶

fast-axolotl provides multi-threaded SHA-256 hashing for dataset deduplication.

Benchmark Setup¶

Dataset: Variable row counts
Row size: ~500 bytes average
Python baseline: hashlib.sha256()
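
The benchmark script is not reproduced on this page, so the following is only a sketch of what a single-threaded hashlib.sha256() deduplication baseline could look like. The helper name dedupe_rows and the choice to hash the UTF-8 bytes of each row are assumptions made for illustration.

```python
import hashlib

def dedupe_rows(rows):
    """Return rows with exact duplicates removed, keeping first occurrences.

    Hypothetical helper: hashes each row's UTF-8 bytes with SHA-256 and
    keeps a row only if its digest has not been seen before.
    """
    seen = set()
    unique = []
    for row in rows:
        digest = hashlib.sha256(row.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique

rows = ["alpha", "beta", "alpha", "gamma", "beta"]
unique_rows = dedupe_rows(rows)
print(unique_rows)
```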

Results¶

Rows	Python (sec)	fast-axolotl (sec)	Speedup
10,000	0.52	0.31	1.7x
100,000	5.2	2.7	1.9x
1,000,000	52	27	1.9x

Thread Scaling¶

CPU Cores	Throughput (rows/sec)	Efficiency
1	19,000	100%
4	61,000	80%
8	98,000	64%
16	152,000	50%
32	220,000	36%

Note

Parallel hashing automatically uses all available CPU cores. Efficiency decreases with more cores due to memory bandwidth limitations.
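
The Efficiency column is consistent with the usual definition of parallel efficiency: throughput on N cores divided by N times the single-core throughput. A short sketch that recomputes it from the table, assuming that definition (the page does not state the formula explicitly):

```python
# Throughput per core count, copied from the Thread Scaling table (rows/sec).
throughput = {1: 19_000, 4: 61_000, 8: 98_000, 16: 152_000, 32: 220_000}

def parallel_efficiency(cores: int, rows_per_sec: float, single_core: float) -> float:
    """Fraction of ideal linear scaling achieved on `cores` cores."""
    return rows_per_sec / (cores * single_core)

baseline = throughput[1]
efficiency = {
    cores: round(100 * parallel_efficiency(cores, rps, baseline))
    for cores, rps in throughput.items()
}

for cores, pct in efficiency.items():
    print(f"{cores:>2} cores: {pct}% efficiency")
```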

Token Packing¶

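Packing performance depends on the sequence length distribution: fast-axolotl is slower than the Python baseline for short sequences and faster for long ones.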

Benchmark Setup¶

Sequences: 10,000 sequences
Max length: 2048 tokens
Python baseline: Pure Python loop with torch.cat()
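
To make the workload concrete, here is a minimal sketch of greedy sequence packing using plain Python lists. It is not the benchmarked baseline (which uses torch.cat()) and not fast-axolotl's implementation; the function name pack_sequences and the in-order greedy strategy are assumptions made for illustration.

```python
def pack_sequences(sequences, max_length):
    """Greedily concatenate token sequences into bins of at most max_length.

    Sequences are appended to the current bin in order; a new bin is
    started when the next sequence would not fit. Sequences longer than
    max_length are truncated. Illustrative only.
    """
    packed = []
    current = []
    for seq in sequences:
        seq = seq[:max_length]
        if current and len(current) + len(seq) > max_length:
            packed.append(current)
            current = []
        current = current + list(seq)
    if current:
        packed.append(current)
    return packed

sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packed = pack_sequences(sequences, max_length=5)
print(packed)
```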

Results by Sequence Length¶

Avg Sequence Length	Python (sec)	fast-axolotl (sec)	Speedup
64	0.8	1.9	0.42x
256	1.2	1.1	1.1x
512	2.1	1.2	1.8x
1024	4.5	1.4	3.2x

Warning

For very short sequences, the Python baseline can be faster because PyO3 call overhead outweighs the gains (0.42x at an average of 64 tokens). Use fast-axolotl packing when the average sequence length exceeds about 200 tokens.
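
One way to apply this guidance is to check the average sequence length before choosing a packing path. The sketch below uses the 200-token threshold from the warning above; should_use_fast_packing is a hypothetical helper, not a fast-axolotl API.

```python
from statistics import mean

# Threshold taken from the guidance above; tune it for your own hardware.
PACKING_THRESHOLD_TOKENS = 200

def should_use_fast_packing(sequences, threshold=PACKING_THRESHOLD_TOKENS):
    """Return True when the average sequence length exceeds the threshold."""
    if not sequences:
        return False
    return mean(len(seq) for seq in sequences) > threshold

short_batch = [[0] * 64 for _ in range(100)]
long_batch = [[0] * 512 for _ in range(100)]

print(should_use_fast_packing(short_batch))
print(should_use_fast_packing(long_batch))
```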

Memory Efficiency¶

Method	Peak Memory	Allocation Count
Python loop	2.1 GB	45,000
fast-axolotl	0.8 GB	12

Batch Padding¶

Padding performance depends mainly on the maximum sequence length in the batch.

Benchmark Setup¶

Batch size: 32 sequences
Python baseline: PyTorch pad_sequence()

Results¶

Max Length	Python (ms)	fast-axolotl (ms)	Speedup
512	2.1	3.9	0.54x
1024	3.8	3.2	1.2x
2048	7.2	3.5	2.1x
4096	14.1	4.2	3.4x

Tip

Batch padding shows the largest speedups with longer sequences (>1024 tokens). For short sequences (512 tokens in this benchmark), PyTorch's optimized pad_sequence may be faster.
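
For reference, right-padding a batch to its longest sequence can be sketched with plain lists as follows. This only illustrates the operation being benchmarked; it is neither PyTorch's pad_sequence nor fast-axolotl's implementation, and pad_batch and pad_value are names chosen for the example.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max((len(seq) for seq in sequences), default=0)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = [[5, 6, 7], [8], [9, 10]]
padded = pad_batch(batch)
print(padded)
```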

End-to-End Training¶

This section measures the impact on an actual training workflow.

Setup¶

Model: 7B parameter LLM
Dataset: 100K samples
Hardware: 8x A100 GPUs

Data Loading Impact¶

Component	Baseline (sec)	With fast-axolotl (sec)	Speedup
Data loading	245	3.2	77x
Tokenization	120	120	1.0x
Collation	15	12	1.25x
Total preprocessing	380	135	2.8x
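
The totals show why a 77x faster loader yields a 2.8x faster preprocessing stage overall: tokenization is unchanged and dominates the remaining time. The arithmetic, using the figures from the table above:

```python
# Per-component times in seconds, copied from the table above.
baseline = {"data_loading": 245, "tokenization": 120, "collation": 15}
fast = {"data_loading": 3.2, "tokenization": 120, "collation": 12}

baseline_total = sum(baseline.values())
fast_total = sum(fast.values())
overall_speedup = baseline_total / fast_total

# Share of the remaining preprocessing time spent in tokenization.
tokenization_share = fast["tokenization"] / fast_total

print(f"Baseline total: {baseline_total} s")
print(f"fast-axolotl total: {fast_total:.1f} s")
print(f"Overall speedup: {overall_speedup:.1f}x")
print(f"Tokenization share afterwards: {tokenization_share:.0%}")
```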

Training Time Impact¶

Metric	Baseline	With fast-axolotl	Improvement
Time per epoch	45 min	41 min	9% faster
GPU utilization	78%	85%	+7 percentage points
Data loading stalls	12%	0.2%	-98%

Reproducing Benchmarks¶

Run the benchmark script yourself:

# Clone the repository
git clone https://github.com/neul-labs/fast-axolotl.git
cd fast-axolotl

# Install with dev dependencies
pip install -e ".[dev]"

# Run benchmarks
python scripts/benchmark.py

Results are saved to BENCHMARK.md.

Custom Benchmarks¶

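import time
from fast_axolotl import streaming_dataset_reader

# Benchmark streaming throughput. perf_counter() is a monotonic,
# high-resolution clock, so it is better suited to timing than time.time().
start = time.perf_counter()
rows = 0
for batch in streaming_dataset_reader("your_data.parquet", batch_size=1000):
    # Count rows via one column; assumes the dataset has an "input_ids" column
    rows += len(batch["input_ids"])
elapsed = time.perf_counter() - start

print(f"Throughput: {rows / elapsed:.0f} rows/sec")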
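
To reuse the measurement across data sources, the loop can be wrapped in a helper that accepts any batch iterator. The sketch below is generic: measure_throughput and fake_batches are hypothetical names, and the fake generator stands in for streaming_dataset_reader so the example runs without a dataset.

```python
import time

def measure_throughput(batches, column="input_ids"):
    """Iterate over batches and return (rows, seconds, rows_per_sec)."""
    start = time.perf_counter()
    rows = 0
    for batch in batches:
        rows += len(batch[column])
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very small inputs.
    rate = rows / elapsed if elapsed > 0 else float("inf")
    return rows, elapsed, rate

def fake_batches(num_batches, batch_size):
    """Stand-in batch source yielding dicts keyed by column name."""
    for _ in range(num_batches):
        yield {"input_ids": [[0]] * batch_size}

rows, elapsed, rate = measure_throughput(fake_batches(10, 1000))
print(f"{rows} rows in {elapsed:.4f} s ({rate:.0f} rows/sec)")
```

To benchmark the real reader, pass streaming_dataset_reader("your_data.parquet", batch_size=1000) as the batches argument.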

Hardware Recommendations¶

CPU¶

Workload	Recommendation
Streaming	Any modern CPU (I/O bound)
Hashing	More cores = better (8+ recommended)
Packing/Padding	Single-threaded (clock speed matters)

Storage¶

Storage Type	Streaming Performance
NVMe SSD	Excellent
SATA SSD	Good
HDD	Poor (I/O bottleneck)
Network (NFS)	Variable
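
To check whether storage is likely to be the bottleneck, a rough sequential-read measurement can help. The sketch below uses only the standard library; read_throughput_mb_s is a hypothetical helper, and the demo reads a small temporary file purely so the example is runnable. Results on a recently written or recently read file mostly reflect the OS page cache rather than the disk, so treat the number as an upper bound.

```python
import os
import tempfile
import time

def read_throughput_mb_s(path, chunk_size=8 * 1024 * 1024):
    """Sequentially read a file and return (bytes_read, MB per second)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = (total / 1e6) / elapsed if elapsed > 0 else float("inf")
    return total, rate

# Demo on a small temporary file; point the function at a real dataset file instead.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1024 * 1024))
    demo_path = tmp.name

bytes_read, mb_per_sec = read_throughput_mb_s(demo_path)
os.remove(demo_path)
print(f"Read {bytes_read} bytes at {mb_per_sec:.0f} MB/s")
```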

Memory¶

Dataset Size	Recommended RAM
< 1M rows	8 GB
1-10M rows	16 GB
> 10M rows	32+ GB
