Streaming API¶

This page documents the streaming data loading functions.

Primary Functions¶

`streaming_dataset_reader()`¶

Create a streaming reader for data files.

fast_axolotl.streaming_dataset_reader(
    path: str,
    batch_size: int = 1000,
    columns: list[str] | None = None
) -> Iterator[dict]

Parameters:

Parameter	Type	Default	Description
`path`	`str`	required	File path or glob pattern
`batch_size`	`int`	`1000`	Number of rows per batch
`columns`	`list[str]`	`None`	Columns to load (None = all)

Returns: An iterator yielding dictionaries with column names as keys and lists of values.

Supported Formats:

Parquet (.parquet)
Arrow (.arrow)
Feather (.feather)
JSON (.json)
JSONL (.jsonl, .ndjson)
CSV (.csv)
Text (.txt)

Compression: ZSTD (.zst, .zstd) and Gzip (.gz) are automatically detected.

Examples:

Basic usage:

from fast_axolotl import streaming_dataset_reader

for batch in streaming_dataset_reader("data/train.parquet"):
    print(batch.keys())
    print(len(batch["input_ids"]))

With options:

reader = streaming_dataset_reader(
    "data/**/*.parquet",
    batch_size=500,
    columns=["input_ids", "labels"]
)

for batch in reader:
    process(batch)

`create_rust_streaming_dataset()`¶

Create a HuggingFace-compatible streaming dataset.

fast_axolotl.create_rust_streaming_dataset(
    path: str,
    batch_size: int = 32,
    columns: list[str] | None = None
) -> RustStreamingDataset

Parameters:

Parameter	Type	Default	Description
`path`	`str`	required	File path or glob pattern
`batch_size`	`int`	`32`	Number of rows per batch
`columns`	`list[str]`	`None`	Columns to load (None = all)

Returns: A RustStreamingDataset object compatible with PyTorch DataLoader.

Example:

from fast_axolotl import create_rust_streaming_dataset
from torch.utils.data import DataLoader

dataset = create_rust_streaming_dataset(
    "data/train.parquet",
    batch_size=32
)

# Use with DataLoader (batch_size=None since dataset handles batching)
loader = DataLoader(dataset, batch_size=None, num_workers=0)

for batch in loader:
    model.train_step(batch)

Classes¶

`RustStreamingDataset`¶

A PyTorch-compatible dataset wrapper for streaming data.

class RustStreamingDataset:
    def __init__(
        self,
        path: str,
        batch_size: int = 32,
        columns: list[str] | None = None
    )

    def __iter__(self) -> Iterator[dict]
    def __len__(self) -> int  # Approximate length

Methods:

Method	Description
`__iter__()`	Iterate over batches
`__len__()`	Get approximate dataset length

Example:

from fast_axolotl import RustStreamingDataset

dataset = RustStreamingDataset("data.parquet", batch_size=32)

for batch in dataset:
    input_ids = batch["input_ids"]
    labels = batch["labels"]

Low-Level Functions¶

`_rust_streaming_reader()`¶

Low-level Rust streaming reader (internal use).

fast_axolotl._rust_streaming_reader(
    path: str,
    batch_size: int,
    columns: list[str] | None
) -> Iterator

Warning

This is an internal function. Use streaming_dataset_reader() instead.

Error Handling¶

Common Exceptions¶

Exception	Cause
`FileNotFoundError`	File or pattern doesn't match any files
`ValueError`	Unsupported format or invalid parameters
`IOError`	File read error or corruption

Example:

from fast_axolotl import streaming_dataset_reader

try:
    for batch in streaming_dataset_reader("data/*.parquet"):
        process(batch)
except FileNotFoundError:
    print("No matching files found")
except ValueError as e:
    print(f"Format error: {e}")

Performance Notes¶

Batch Size Selection¶

Dataset Size	Recommended Batch Size	Memory Usage
< 100K rows	1000-5000	Low
100K-1M rows	500-2000	Medium
> 1M rows	100-1000	High

Format Performance¶

Format	Read Speed	Memory Efficiency
Parquet	Fastest	Excellent
Arrow	Very Fast	Excellent
JSONL	Fast	Good
CSV	Moderate	Good
JSON	Slower	Moderate

Compression Performance¶

Compression	Decompression Speed	Compression Ratio
None	Fastest	1x
ZSTD	Fast	3-5x
Gzip	Moderate	2-4x

Streaming API¶

Primary Functions¶

streaming_dataset_reader()¶

create_rust_streaming_dataset()¶

Classes¶

RustStreamingDataset¶

Low-Level Functions¶

_rust_streaming_reader()¶

Error Handling¶

Common Exceptions¶

Performance Notes¶

Batch Size Selection¶

Format Performance¶

Compression Performance¶

See Also¶

`streaming_dataset_reader()`¶

`create_rust_streaming_dataset()`¶

`RustStreamingDataset`¶

`_rust_streaming_reader()`¶