Parallel Hashing & Deduplication¶
fast-axolotl provides multi-threaded SHA256 hashing for efficient dataset deduplication, achieving a 1.9x speedup over Python's hashlib on large datasets.
Why Parallel Hashing?¶
Dataset deduplication is crucial for LLM training:
- Removes exact duplicates that can cause overfitting
- Reduces training time and costs
- Improves model generalization
The bottleneck is usually hashing millions of rows; fast-axolotl parallelizes this work across all CPU cores.
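For reference, the sequential baseline using Python's built-in hashlib looks like the sketch below; this is the per-row work that fast-axolotl spreads across threads (the snippet is illustrative, not part of the library).

```python
import hashlib

def sequential_hash_rows(rows: list[bytes]) -> list[str]:
    # One SHA256 hex digest per row, computed on a single core
    return [hashlib.sha256(row).hexdigest() for row in rows]
```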
Basic Usage¶
Computing Hashes¶
```python
from fast_axolotl import parallel_hash_rows

# Your data rows (as bytes)
rows = [
    b"This is the first document",
    b"This is the second document",
    b"This is the first document",  # duplicate
    b"This is the third document",
]

# Compute SHA256 hashes in parallel
hashes = parallel_hash_rows(rows)
print(hashes)
# ['a1b2c3...', 'd4e5f6...', 'a1b2c3...', '9f8e7d...']
```
Finding Unique Indices¶
```python
from fast_axolotl import deduplicate_indices

rows = [
    b"row1",
    b"row2",
    b"row1",  # duplicate of index 0
    b"row3",
    b"row2",  # duplicate of index 1
]

# Get indices of unique rows
unique_idx = deduplicate_indices(rows)
print(unique_idx)
# [0, 1, 3] - first occurrence of each unique row
```
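The returned indices can be used directly to rebuild a deduplicated list:

```python
# Keep only the first occurrence of each unique row
unique_rows = [rows[i] for i in unique_idx]
```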
Working with Datasets¶
Deduplicating a HuggingFace Dataset¶
```python
from datasets import load_dataset
from fast_axolotl import deduplicate_indices

# Load dataset
dataset = load_dataset("your_dataset")

# Convert rows to bytes for hashing
rows = [str(row).encode() for row in dataset["train"]]

# Find unique indices
unique_idx = deduplicate_indices(rows)

# Filter dataset
deduped_dataset = dataset["train"].select(unique_idx)
print(f"Original: {len(dataset['train'])}, Deduped: {len(deduped_dataset)}")
```
Deduplicating by Specific Columns¶
```python
from fast_axolotl import deduplicate_indices
import json

def deduplicate_by_columns(dataset, columns):
    """Deduplicate based on specific columns only."""
    # Create hash keys from selected columns
    rows = []
    for item in dataset:
        key = json.dumps({col: item[col] for col in columns}).encode()
        rows.append(key)

    unique_idx = deduplicate_indices(rows)
    return dataset.select(unique_idx)

# Deduplicate by 'text' column only
deduped = deduplicate_by_columns(dataset, ["text"])
```
Streaming Deduplication¶
For very large datasets that don't fit in memory:
```python
from fast_axolotl import streaming_dataset_reader, parallel_hash_rows

def streaming_deduplicate(data_path, output_path):
    seen_hashes = set()
    with open(output_path, "w") as out:
        for batch in streaming_dataset_reader(data_path, batch_size=10000):
            # Convert batch to bytes
            rows = [str(row).encode() for row in batch["text"]]
            # Hash the batch
            hashes = parallel_hash_rows(rows)
            # Write only rows whose hash has not been seen before
            for i, h in enumerate(hashes):
                if h not in seen_hashes:
                    seen_hashes.add(h)
                    out.write(str(batch["text"][i]) + "\n")
```
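Only the hash of each unique row is kept in `seen_hashes`, so peak memory grows with the number of unique rows (roughly 64 bytes of hex per row) rather than with the total size of the data.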
Advanced Usage¶
Custom Hash Function¶
While fast-axolotl uses SHA256 by default, you can preprocess data for different deduplication strategies:
```python
from fast_axolotl import deduplicate_indices

def normalize_and_dedupe(texts):
    """Deduplicate with normalization."""
    # Normalize: lowercase, strip whitespace
    normalized = [
        text.lower().strip().encode()
        for text in texts
    ]
    return deduplicate_indices(normalized)
```
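For example, these two texts collapse to a single entry because both normalize to the same bytes:

```python
texts = ["Hello World  ", "hello world"]
print(normalize_and_dedupe(texts))
# [0] - both texts normalize to "hello world"
```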
Fuzzy Deduplication Preparation¶
For near-duplicate detection, use hashing as a first pass:
```python
from fast_axolotl import parallel_hash_rows

def find_candidate_duplicates(texts, n_shingles=3):
    """Find candidate near-duplicates using shingling."""
    all_shingles = []
    for text in texts:
        # Create word n-gram shingles
        words = text.split()
        shingles = [
            " ".join(words[i:i + n_shingles]).encode()
            for i in range(len(words) - n_shingles + 1)
        ]
        all_shingles.append(shingles)

    # Hash all shingles in parallel
    flat_shingles = [s for shingles in all_shingles for s in shingles]
    hashes = parallel_hash_rows(flat_shingles)

    # Group by common shingles for further analysis
    # ...
```
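One possible way to finish the grouping step (a sketch, not part of the library's API) is to map each shingle hash back to the documents that contain it, then count how many shingles each pair of documents shares:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(all_shingles, hashes):
    # Map each shingle hash to the documents that contain it;
    # hashes is the flat list returned above, in the same order
    docs_by_hash = defaultdict(set)
    pos = 0
    for doc_id, shingles in enumerate(all_shingles):
        for _ in shingles:
            docs_by_hash[hashes[pos]].add(doc_id)
            pos += 1

    # Count shared shingles per document pair
    shared = defaultdict(int)
    for docs in docs_by_hash.values():
        for a, b in combinations(sorted(docs), 2):
            shared[(a, b)] += 1
    return shared
```

Pairs with a high shared-shingle count are the candidates to pass on to a more expensive near-duplicate check.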
Performance Benchmarks¶
| Dataset Size | Python hashlib | fast-axolotl | Speedup |
|---|---|---|---|
| 10,000 rows | 0.5s | 0.3s | 1.7x |
| 100,000 rows | 5.2s | 2.7s | 1.9x |
| 1,000,000 rows | 52s | 27s | 1.9x |
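The numbers above can be sanity-checked with a simple timing script along these lines (absolute figures depend on your machine and data; this assumes, as the earlier examples show, that parallel_hash_rows returns hex digests in input order):

```python
import hashlib
import os
import time

from fast_axolotl import parallel_hash_rows

# Synthetic rows just for timing purposes
rows = [os.urandom(256) for _ in range(100_000)]

start = time.perf_counter()
baseline = [hashlib.sha256(r).hexdigest() for r in rows]
t_hashlib = time.perf_counter() - start

start = time.perf_counter()
fast = parallel_hash_rows(rows)
t_fast = time.perf_counter() - start

assert baseline == fast  # same digests, just computed in parallel
print(f"hashlib: {t_hashlib:.2f}s, fast-axolotl: {t_fast:.2f}s, "
      f"speedup: {t_hashlib / t_fast:.1f}x")
```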
Thread Scaling¶
fast-axolotl automatically uses all available CPU cores:
| Cores | Speedup vs Single Thread |
|---|---|
| 4 | 3.2x |
| 8 | 5.8x |
| 16 | 9.1x |
| 32 | 14.2x |
Integration with Axolotl¶
When shimming is enabled, Axolotl's deduplication automatically uses fast-axolotl:
```python
import fast_axolotl
fast_axolotl.install()

# Axolotl's dedupe now uses Rust-accelerated hashing
from axolotl.utils.data import deduplicate_dataset
```
Memory Considerations¶
Hashes¶
Each SHA256 hash is a 64-character hex string. For 1M rows:
- Memory for hashes: ~64 MB
Indices¶
The deduplicate_indices function returns a list of integers:
- Memory for indices: ~8 bytes per unique row
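As a quick back-of-the-envelope check (pure arithmetic, not a library call; Python object overhead adds to these raw figures):

```python
n_rows = 1_000_000
hash_len = 64  # hex characters per SHA256 digest

hash_bytes = n_rows * hash_len   # ~64 MB of raw hash characters
index_bytes = n_rows * 8         # ~8 MB if every row were unique

print(f"hashes: ~{hash_bytes / 1e6:.0f} MB, indices: ~{index_bytes / 1e6:.0f} MB")
```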
Tips for Large Datasets¶
```python
from fast_axolotl import parallel_hash_rows

# Process in chunks to limit memory
def chunked_dedupe(rows, chunk_size=100000):
    all_unique = []
    seen_hashes = set()

    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        hashes = parallel_hash_rows(chunk)

        for j, h in enumerate(hashes):
            if h not in seen_hashes:
                seen_hashes.add(h)
                all_unique.append(i + j)

    return all_unique
```
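Only one chunk of hashes is materialized at a time; the seen_hashes set is the only structure that grows with the full dataset, at roughly 64 bytes of hex per unique row.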
Error Handling¶
```python
from fast_axolotl import parallel_hash_rows, deduplicate_indices

# Empty input handling
hashes = parallel_hash_rows([])      # Returns []
indices = deduplicate_indices([])    # Returns []

# Invalid input
try:
    hashes = parallel_hash_rows(["string", "not", "bytes"])
except TypeError as e:
    print(f"Error: {e}")  # Rows must be bytes
```
Next Steps¶
- Streaming Data - Combine with streaming for large datasets
- API Reference - Complete API docs
- Benchmarks - Detailed performance data