ALEA Preprocess
High-performance data preprocessing library for large language model training, supporting pretraining, SFT, and DPO datasets with Rust-powered efficiency.
══════════════════════════════════════════════════════════════════
An efficient, accessible data preprocessing library designed to handle the demanding requirements of large language model training pipelines, including support for KL3M and other pretraining datasets.
Overview
ALEA Preprocess addresses the critical need for high-performance data preprocessing in LLM development. By leveraging compiled Rust code with Python bindings, it provides the speed necessary for processing trillion-token datasets while maintaining ease of use.
Core Capabilities
Preprocessing Modes
- Pretraining: Large-scale text corpus preparation
- Supervised Fine-Tuning (SFT): Instruction dataset formatting
- Direct Preference Optimization (DPO): Preference pair preparation
- Custom Pipelines: Extensible architecture for new formats
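The mode is chosen when constructing a pipeline. A minimal sketch reusing the PreprocessPipeline constructor shown under Example Usage below; the exact mode strings and keyword arguments for SFT and DPO are assumptions and may differ in the released API:
from alea_preprocess import PreprocessPipeline

# Hypothetical: format instruction/response records for SFT
sft_pipeline = PreprocessPipeline(mode="sft", tokenizer="kl3m-004-128k-cased")

# Hypothetical: prepare chosen/rejected preference pairs for DPO
dpo_pipeline = PreprocessPipeline(mode="dpo", tokenizer="kl3m-004-128k-cased")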
Performance Features
- Rust Backend: Compiled performance for heavy processing
- Python Interface: Familiar API for ML practitioners
- Streaming Processing: Memory-efficient for large datasets
- Parallel Execution: Multi-core utilization
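These features typically combine into one pattern: read records lazily and fan work out across cores. A library-agnostic sketch of that pattern using only the Python standard library (illustrative of the approach, not ALEA Preprocess's internals):
from multiprocessing import Pool
from pathlib import Path

def read_lines(root):
    # Lazily yield lines so the corpus never sits fully in memory
    for path in sorted(Path(root).glob("*.txt")):
        with path.open(encoding="utf-8") as f:
            yield from f

def clean(line):
    return line.strip()

def process(root, workers=8):
    # imap streams records to worker processes in chunks
    with Pool(processes=workers) as pool:
        yield from pool.imap(clean, read_lines(root), chunksize=1024)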
Technical Architecture
Language Stack
- Core Engine: Rust for performance-critical operations
- Python Bindings: maturin for seamless integration
- API Design: Pythonic interfaces hiding complexity
Processing Pipeline
- Input Handling: Multiple format support
- Tokenization: Integration with various tokenizers
- Filtering: Quality and safety checks
- Transformation: Format conversions
- Output Generation: Optimized data structures
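Conceptually, these stages compose as a chain of lazy transformations. A simplified, hypothetical sketch of the flow (the tokenizer object and its encode method are assumptions, not the library's actual internals):
def run_pipeline(records, tokenizer, min_tokens=32):
    for text in records:                     # input handling
        tokens = tokenizer.encode(text)      # tokenization
        if len(tokens) < min_tokens:         # filtering: drop short documents
            continue
        yield {"input_ids": tokens}          # transformation and output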
Installation
From PyPI
pip install alea-preprocess
Development Setup
# Clone repository
git clone https://github.com/alea-institute/alea-preprocess
cd alea-preprocess
# Install with Poetry and maturin
poetry install
poetry run maturin develop
Key Features
Data Quality
- Deduplication algorithms
- Language detection
- Content filtering
- Format validation
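Exact deduplication is a common first pass at pretraining scale. A minimal sketch of hash-based deduplication (a standard technique shown for illustration; the library's own algorithms may use different or fuzzier matching):
import hashlib

def dedupe(texts):
    # Keep the first occurrence of each distinct document,
    # comparing compact digests instead of full strings
    seen = set()
    for text in texts:
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            yield text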
Efficiency Optimizations
- Zero-copy operations where possible
- Lazy evaluation strategies
- Batched processing
- Cache-friendly algorithms
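Batched processing, for example, amortizes per-record overhead across many records at once. A generic batching helper (illustrative only; requires Python 3.8+):
from itertools import islice

def batched(iterable, size):
    # Group any iterable into lists of up to `size` items
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch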
Format Support
- Raw text documents
- JSON/JSONL formats
- Parquet files
- Custom format plugins
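A rough sketch of how such formats are commonly read on the Python side, using the standard json module and the pyarrow package for Parquet (generic code, not the library's internal readers):
import json
from pathlib import Path
import pyarrow.parquet as pq

def read_records(path):
    p = Path(path)
    if p.suffix == ".jsonl":
        with p.open(encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)
    elif p.suffix == ".parquet":
        yield from pq.read_table(p).to_pylist()
    else:
        yield {"text": p.read_text(encoding="utf-8")}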
Integration with KL3M
Specifically designed to support:
- KL3M data pipeline requirements
- Copyright-clean data processing
- Legal document handling
- Domain-specific preprocessing
Use Cases
Pretraining Preparation
- Web crawl processing
- Document deduplication
- Quality filtering
- Format standardization
Fine-Tuning Datasets
- Instruction formatting
- Response pairing
- Quality scoring
- Balance checking
Preference Learning
- Preference pair creation
- Ranking data preparation
- Reward model training data
- Human feedback processing
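Preference data is conventionally stored as prompt/chosen/rejected records. A hypothetical record layout following that common DPO convention (the field names are an assumption, not a documented schema):
pair = {
    "prompt": "Summarize the holding of the case.",      # shared prompt
    "chosen": "The court held that ...",                  # preferred response
    "rejected": "The case is about law.",                 # dispreferred response
}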
Example Usage
from alea_preprocess import PreprocessPipeline

# Initialize pipeline
pipeline = PreprocessPipeline(
    mode="pretrain",
    tokenizer="kl3m-004-128k-cased",
    quality_threshold=0.8
)

# Process dataset
processed = pipeline.process_dataset(
    input_path="raw_data/",
    output_path="processed/",
    num_workers=8
)
Development Status
Currently in active development:
- Core functionality implemented
- Performance optimizations ongoing
- Additional format support planned
- Documentation expansion in progress
Testing
A comprehensive test suite is available:
- Unit tests for core functions
- Integration tests with real data
- Performance benchmarks
- Example notebooks in tests/
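Assuming the Poetry workflow from Development Setup above and pytest as the test runner (a typical choice, not confirmed here), the suite would be run with:
poetry run pytest tests/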
Community
Open-source development approach:
- Issue tracking on GitHub
- Contribution guidelines
- Regular updates
- Community feedback integration
Future Roadmap
Planned enhancements:
- Additional preprocessing modes
- GPU acceleration options
- Distributed processing support
- Enhanced filtering algorithms
ALEA Preprocess provides essential infrastructure for modern LLM development: the tooling needed to efficiently transform raw data into training-ready datasets while maintaining quality and compliance standards.