Skip to content

Data Processing API

This page documents the data processing functions: token packing, parallel hashing, and batch padding.


Token Packing

pack_sequences()

Pack multiple sequences into fixed-length chunks.

fast_axolotl.pack_sequences(
    sequences: list[list[int]] | list[torch.Tensor],
    max_length: int,
    pad_token_id: int,
    return_boundaries: bool = False
) -> torch.Tensor | tuple[torch.Tensor, list]

Parameters:

Parameter Type Default Description
sequences list required List of token sequences
max_length int required Target length for packed sequences
pad_token_id int required Token ID used for padding
return_boundaries bool False Also return sequence boundaries

Returns:

  • If return_boundaries=False: A tensor of shape (num_packed, max_length)
  • If return_boundaries=True: A tuple of (tensor, boundaries list)

Example:

from fast_axolotl import pack_sequences
import torch

sequences = [
    torch.tensor([1, 2, 3]),
    torch.tensor([4, 5]),
    torch.tensor([6, 7, 8, 9]),
]

packed = pack_sequences(sequences, max_length=8, pad_token_id=0)
print(packed)
# tensor([[1, 2, 3, 4, 5, 0, 0, 0],
#         [6, 7, 8, 9, 0, 0, 0, 0]])


concatenate_and_pack()

Pack with separate input, label, and attention mask sequences.

fast_axolotl.concatenate_and_pack(
    input_ids: list[list[int]],
    labels: list[list[int]],
    attention_masks: list[list[int]],
    max_length: int,
    pad_token_id: int,
    label_pad_id: int = -100
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]

Parameters:

Parameter Type Default Description
input_ids list[list[int]] required Input token sequences
labels list[list[int]] required Label sequences
attention_masks list[list[int]] required Attention mask sequences
max_length int required Target length
pad_token_id int required Padding token for inputs
label_pad_id int -100 Padding token for labels

Returns: Tuple of (packed_inputs, packed_labels, packed_masks) tensors.

Example:

from fast_axolotl import concatenate_and_pack

inputs = [[1, 2, 3], [4, 5, 6]]
labels = [[-100, 2, 3], [-100, -100, 6]]
masks = [[1, 1, 1], [1, 1, 1]]

packed_in, packed_lab, packed_mask = concatenate_and_pack(
    inputs, labels, masks,
    max_length=8,
    pad_token_id=0,
    label_pad_id=-100
)


Parallel Hashing

parallel_hash_rows()

Compute SHA256 hashes of byte sequences in parallel.

fast_axolotl.parallel_hash_rows(
    rows: list[bytes]
) -> list[str]

Parameters:

Parameter Type Description
rows list[bytes] List of byte sequences to hash

Returns: List of hex-encoded SHA256 hash strings.

Example:

from fast_axolotl import parallel_hash_rows

rows = [
    b"First document content",
    b"Second document content",
    b"Third document content",
]

hashes = parallel_hash_rows(rows)
print(hashes[0])  # 64-character hex string

Performance: Uses all available CPU cores for parallel hashing.


deduplicate_indices()

Find indices of unique rows using parallel hashing.

fast_axolotl.deduplicate_indices(
    rows: list[bytes]
) -> list[int]

Parameters:

Parameter Type Description
rows list[bytes] List of byte sequences

Returns: List of indices pointing to the first occurrence of each unique row.

Example:

from fast_axolotl import deduplicate_indices

rows = [
    b"document A",
    b"document B",
    b"document A",  # duplicate
    b"document C",
    b"document B",  # duplicate
]

unique_idx = deduplicate_indices(rows)
print(unique_idx)  # [0, 1, 3]


Batch Padding

pad_sequences()

Pad sequences to uniform length.

fast_axolotl.pad_sequences(
    sequences: list[list[int]],
    pad_value: int,
    padding_side: str = "right",
    max_length: int | None = None
) -> list[list[int]]

Parameters:

Parameter Type Default Description
sequences list[list[int]] required Sequences to pad
pad_value int required Value used for padding
padding_side str "right" "left" or "right"
max_length int None Target length (None = longest)

Returns: List of padded sequences (all same length).

Example:

from fast_axolotl import pad_sequences

sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Pad to longest
padded = pad_sequences(sequences, pad_value=0)
print(padded)
# [[1, 2, 3, 0], [4, 5, 0, 0], [6, 7, 8, 9]]

# Pad with specific length
padded = pad_sequences(sequences, pad_value=0, max_length=6)
print(padded)
# [[1, 2, 3, 0, 0, 0], [4, 5, 0, 0, 0, 0], [6, 7, 8, 9, 0, 0]]

# Left padding
padded = pad_sequences(sequences, pad_value=0, padding_side="left")
print(padded)
# [[0, 1, 2, 3], [0, 0, 4, 5], [6, 7, 8, 9]]


create_padding_mask()

Create attention masks for padded sequences.

fast_axolotl.create_padding_mask(
    padded_sequences: list[list[int]],
    pad_value: int
) -> list[list[int]]

Parameters:

Parameter Type Description
padded_sequences list[list[int]] Padded sequences
pad_value int The padding value to detect

Returns: Attention masks (1 for real tokens, 0 for padding).

Example:

from fast_axolotl import pad_sequences, create_padding_mask

sequences = [[1, 2, 3], [4, 5]]
padded = pad_sequences(sequences, pad_value=0)

mask = create_padding_mask(padded, pad_value=0)
print(mask)
# [[1, 1, 1], [1, 1, 0]]


Quick Reference

Token Packing

Function Description
pack_sequences() Pack sequences into fixed-length chunks
concatenate_and_pack() Pack with inputs, labels, masks

Hashing & Deduplication

Function Description
parallel_hash_rows() Parallel SHA256 hashing
deduplicate_indices() Find unique row indices

Padding

Function Description
pad_sequences() Pad to uniform length
create_padding_mask() Create attention masks

Type Annotations

from typing import List, Tuple, Optional, Union
import torch

# Token packing
def pack_sequences(
    sequences: Union[List[List[int]], List[torch.Tensor]],
    max_length: int,
    pad_token_id: int,
    return_boundaries: bool = False
) -> Union[torch.Tensor, Tuple[torch.Tensor, List]]: ...

# Hashing
def parallel_hash_rows(rows: List[bytes]) -> List[str]: ...
def deduplicate_indices(rows: List[bytes]) -> List[int]: ...

# Padding
def pad_sequences(
    sequences: List[List[int]],
    pad_value: int,
    padding_side: str = "right",
    max_length: Optional[int] = None
) -> List[List[int]]: ...

See Also