Streaming Data Loading

fast-axolotl's streaming data loader is one of its most powerful features, providing up to 77x faster data loading compared to Python-based solutions.

Overview

The streaming reader efficiently loads large datasets that don't fit in memory by:

  • Reading data in configurable batches
  • Supporting multiple file formats natively
  • Handling compression transparently
  • Using memory-mapped I/O where possible

Basic Usage

from fast_axolotl import streaming_dataset_reader

# Simple streaming from a single file
for batch in streaming_dataset_reader("data/train.parquet"):
    process(batch)
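The options described in the sections below can also be combined in a single call. A minimal sketch, assuming batch_size and columns (both documented below) can be passed together and that the file path is illustrative:

from fast_axolotl import streaming_dataset_reader

# Stream a ZSTD-compressed Parquet file in 1000-row batches,
# loading only the columns needed for training
for batch in streaming_dataset_reader(
    "data/train.parquet.zst",
    batch_size=1000,
    columns=["input_ids", "attention_mask", "labels"],
):
    process(batch)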

Configuration Options

Batch Size

Control how many rows are loaded per iteration:

# Load 1000 rows at a time
reader = streaming_dataset_reader(
    "data/train.parquet",
    batch_size=1000
)

Choosing batch size

Larger batch sizes improve throughput but use more memory. Start with 1000 and adjust based on your memory constraints.

Column Selection

Load only the columns you need:

reader = streaming_dataset_reader(
    "data/train.parquet",
    columns=["input_ids", "attention_mask", "labels"]
)

This reduces memory usage and improves performance when your dataset has many columns.

Multiple Files

Use glob patterns to stream from multiple files:

# All parquet files in a directory
reader = streaming_dataset_reader("data/*.parquet")

# Recursive glob
reader = streaming_dataset_reader("data/**/*.parquet")

# Prefix pattern matching multiple shards
reader = streaming_dataset_reader("data/train_*.parquet")

Supported Formats

fast-axolotl automatically detects file formats:

Format  | Extensions      | Description
--------|-----------------|----------------------------------
Parquet | .parquet        | Columnar format, best performance
Arrow   | .arrow          | Zero-copy memory mapping
Feather | .feather        | Fast binary format
JSON    | .json           | Standard JSON arrays
JSONL   | .jsonl, .ndjson | Line-delimited JSON
CSV     | .csv            | Comma-separated values
Text    | .txt            | Plain text (one row per line)
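
Because detection is automatic, the same call works across any of these formats; a short sketch with illustrative paths:

# No format argument is needed; the reader detects the format itself
for batch in streaming_dataset_reader("data/train.jsonl"):
    process(batch)

for batch in streaming_dataset_reader("data/train.csv"):
    process(batch)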

Format Detection

from fast_axolotl import detect_format

# Automatic detection
format_info = detect_format("data/train.parquet.zst")
print(format_info)
# {'format': 'parquet', 'compression': 'zstd'}

Compression Support

ZSTD and Gzip compression are automatically handled:

# These all work automatically
reader = streaming_dataset_reader("data/train.parquet.zst")  # ZSTD
reader = streaming_dataset_reader("data/train.json.gz")      # Gzip
reader = streaming_dataset_reader("data/train.csv.zstd")     # ZSTD

Advanced Usage

Custom Iteration

reader = streaming_dataset_reader("data/train.parquet", batch_size=500)

# Manual iteration: create the iterator once, then pull batches with next()
it = iter(reader)
batch = next(it)

# Check batch contents
print(batch.keys())      # Column names
print(len(batch["input_ids"]))  # Batch size

Memory-Efficient Processing

For very large datasets, process and discard batches to minimize memory:

def process_large_dataset(path):
    total_rows = 0

    for batch in streaming_dataset_reader(path, batch_size=1000):
        # Process batch
        total_rows += len(batch["input_ids"])

        # The previous batch goes out of scope and can be freed as the loop advances

    return total_rows

Combining with PyTorch DataLoader

from fast_axolotl import create_rust_streaming_dataset
from torch.utils.data import DataLoader

# Create HF-compatible dataset
dataset = create_rust_streaming_dataset(
    "data/train.parquet",
    batch_size=32
)

# Use with DataLoader (batch_size=None since dataset handles batching)
loader = DataLoader(
    dataset,
    batch_size=None,
    num_workers=0  # Rust handles parallelism
)

for batch in loader:
    model.train_step(batch)

Performance Tips

1. Use Parquet Format

Parquet offers the best performance due to columnar storage and efficient compression:

# Convert other formats to Parquet for best streaming performance
import pandas as pd

df = pd.read_json("data.json")
df.to_parquet("data.parquet")

2. Select Only Needed Columns

# Faster - only loads needed columns
reader = streaming_dataset_reader(
    "data/train.parquet",
    columns=["input_ids", "labels"]
)

# Slower - loads all columns
reader = streaming_dataset_reader("data/train.parquet")

3. Use ZSTD Compression

ZSTD offers excellent compression with fast decompression:

# Create ZSTD-compressed Parquet
import pyarrow.parquet as pq

pq.write_table(
    table,
    "data.parquet",
    compression="zstd"
)

4. Batch Size Tuning

Dataset Size   | Recommended Batch Size
---------------|-----------------------
< 100K rows    | 1000-5000
100K - 1M rows | 500-2000
> 1M rows      | 100-1000
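
As a rough sketch of how to apply this table, the helper below picks a starting batch size from an approximate row count. The helper and its thresholds are not part of fast-axolotl; they simply mirror the guidance above and should be tuned to your memory budget:

def suggested_batch_size(approx_rows: int) -> int:
    # Thresholds follow the table above; adjust for your own memory constraints
    if approx_rows < 100_000:
        return 2000
    if approx_rows < 1_000_000:
        return 1000
    return 500

reader = streaming_dataset_reader(
    "data/train.parquet",
    batch_size=suggested_batch_size(250_000),
)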

Error Handling

from fast_axolotl import streaming_dataset_reader

try:
    for batch in streaming_dataset_reader("data/train.parquet"):
        process(batch)
except FileNotFoundError:
    print("Data file not found")
except ValueError as e:
    print(f"Format error: {e}")

Comparison with Alternatives

Feature           | fast-axolotl | HuggingFace datasets | pandas
------------------|--------------|----------------------|-------
Memory efficiency | Excellent    | Good                 | Poor
Speed             | 77x faster   | 1x (baseline)        | 0.5x
Format support    | 7 formats    | Many                 | Many
Compression       | Auto         | Manual               | Manual

Next Steps