ALEA Preprocess

software

High-performance data preprocessing library for large language model training, supporting pretraining, SFT, and DPO datasets with Rust-powered efficiency

period: 2024-present
team: ALEA Institute
tech: Machine Learning
══════════════════════════════════════════════════════════════════

An efficient, accessible data preprocessing library designed to handle the demanding requirements of large language model training pipelines, including support for KL3M and other pretraining datasets.

Overview

ALEA Preprocess addresses the critical need for high-performance data preprocessing in LLM development. By leveraging compiled Rust code with Python bindings, it provides the speed necessary for processing trillion-token datasets while maintaining ease of use.

Core Capabilities

Preprocessing Modes

  • Pretraining: Large-scale text corpus preparation
  • Supervised Fine-Tuning (SFT): Instruction dataset formatting
  • Direct Preference Optimization (DPO): Preference pair preparation
  • Custom Pipelines: Extensible architecture for new formats
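
The modes above differ mainly in the shape of the records they produce. The sketch below is illustrative only; the field names are assumptions for this page, not the schema used by alea_preprocess.

# Illustrative record shapes; field names are assumptions, not the library's schema.

pretrain_record = {"text": "Raw document text used for next-token prediction."}

sft_record = {
    "instruction": "Summarize the following contract clause.",
    "input": "The Licensee shall not ...",
    "output": "The clause prohibits the licensee from ...",
}

dpo_record = {
    "prompt": "Explain the term 'force majeure'.",
    "chosen": "A force majeure clause excuses performance when ...",
    "rejected": "It means the contract is void.",
}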

Performance Features

  • Rust Backend: Compiled performance for heavy processing
  • Python Interface: Familiar API for ML practitioners
  • Streaming Processing: Memory-efficient for large datasets
  • Parallel Execution: Multi-core utilization
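
As a rough illustration of the streaming and multi-core pattern described above (generic Python, not the library's internal implementation; the file path and record fields are hypothetical), a large JSONL file can be read lazily and fanned out across worker processes:

# Generic streaming + parallel sketch; not alea_preprocess internals.
import json
from multiprocessing import Pool

def normalize(line: str) -> dict:
    # Placeholder per-record transformation.
    record = json.loads(line)
    record["text"] = record["text"].strip()
    return record

def stream_lines(path: str):
    # Yield one line at a time so the whole file never sits in memory.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield line

if __name__ == "__main__":
    with Pool(processes=8) as pool:
        for record in pool.imap_unordered(normalize, stream_lines("corpus.jsonl"), chunksize=256):
            pass  # write to an output shard, accumulate stats, etc.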

Technical Architecture

Language Stack

  • Core Engine: Rust for performance-critical operations
  • Python Bindings: maturin for seamless integration
  • API Design: Pythonic interfaces hiding complexity

Processing Pipeline

  1. Input Handling: Multiple format support
  2. Tokenization: Integration with various tokenizers
  3. Filtering: Quality and safety checks
  4. Transformation: Format conversions
  5. Output Generation: Optimized data structures
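
Conceptually, these stages compose into a single per-document function. The sketch below is a simplified stand-in for that flow; the stage functions are hypothetical placeholders, not the library's API.

# Hypothetical stage functions sketching the pipeline order; not the library's API.
from typing import Optional

def parse_input(raw: str) -> str:
    # 1. Input handling: raw text passes through unchanged in this sketch.
    return raw

def tokenize(text: str) -> list:
    # 2. Tokenization: whitespace split stands in for a real tokenizer.
    return text.split()

def passes_filters(text: str, tokens: list) -> bool:
    # 3. Filtering: trivial length-based quality check.
    return len(tokens) >= 5

def to_training_format(tokens: list) -> dict:
    # 4. Transformation: package tokens into a training record.
    return {"tokens": tokens, "num_tokens": len(tokens)}

def run_pipeline(raw_doc: str) -> Optional[dict]:
    text = parse_input(raw_doc)
    tokens = tokenize(text)
    if not passes_filters(text, tokens):
        return None
    return to_training_format(tokens)  # 5. Output generation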

Installation

From PyPI

pip install alea-preprocess

Development Setup

# Clone repository
git clone https://github.com/alea-institute/alea-preprocess
cd alea-preprocess

# Install with Poetry and maturin
poetry install
poetry run maturin develop

Key Features

Data Quality

  • Deduplication algorithms
  • Language detection
  • Content filtering
  • Format validation
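
A minimal sketch of exact deduplication combined with a simple length filter (generic Python; the hashing, language-detection, and filtering algorithms actually used by the library may differ):

# Generic quality-filtering sketch; not the algorithms used by alea_preprocess.
import hashlib

def clean_corpus(docs: list, min_chars: int = 200) -> list:
    seen = set()
    kept = []
    for doc in docs:
        text = doc.strip()
        if len(text) < min_chars:  # crude length-based quality filter
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:  # exact-duplicate removal
            continue
        seen.add(digest)
        kept.append(text)
    return kept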

Efficiency Optimizations

  • Zero-copy operations where possible
  • Lazy evaluation strategies
  • Batched processing
  • Cache-friendly algorithms
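
For instance, lazy, batched iteration can be expressed with a generator so that only one batch of records is materialized at a time (an illustrative pattern, not library code):

# Illustrative lazy batching pattern; not alea_preprocess internals.
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def batched(records: Iterable[T], batch_size: int = 1024) -> Iterator[list]:
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch  # hand off a full batch, then start a new one
            batch = []
    if batch:
        yield batch  # flush the final partial batch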

Format Support

  • Raw text documents
  • JSON/JSONL formats
  • Parquet files
  • Custom format plugins
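
Loading these formats from Python looks roughly like the sketch below (it assumes pyarrow is installed for Parquet support; custom format plugins are out of scope here):

# Sketch of reading the supported input formats; plugin formats omitted.
import json
import pyarrow.parquet as pq

def load_text(path: str) -> str:
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

def load_jsonl(path: str) -> list:
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def load_parquet(path: str) -> list:
    return pq.read_table(path).to_pylist()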

Integration with KL3M

Specifically designed to support:

  • KL3M data pipeline requirements
  • Copyright-clean data processing
  • Legal document handling
  • Domain-specific preprocessing

Use Cases

Pretraining Preparation

  • Web crawl processing
  • Document deduplication
  • Quality filtering
  • Format standardization

Fine-Tuning Datasets

  • Instruction formatting
  • Response pairing
  • Quality scoring
  • Balance checking
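
A minimal sketch of instruction formatting for SFT (the prompt template and field names are assumptions for illustration, not a KL3M or library standard):

# Hypothetical SFT formatting; template and field names are illustrative only.
def format_sft_example(instruction: str, response: str) -> dict:
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return {"prompt": prompt, "completion": response}

example = format_sft_example(
    "List two defenses to breach of contract.",
    "Common defenses include impossibility of performance and duress.",
)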

Preference Learning

  • Preference pair creation
  • Ranking data preparation
  • Reward model training data
  • Human feedback processing
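
Preference learning data reduces to records that pair one prompt with a preferred and a rejected response; a generic sketch follows (field names are assumptions, not the library's schema):

# Generic DPO pair construction; field names are assumptions, not library schema.
def make_dpo_pair(prompt: str, chosen: str, rejected: str) -> dict:
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pairs = [
    make_dpo_pair(
        "Define 'consideration' in contract law.",
        "Consideration is something of value exchanged by each party ...",
        "It means thinking carefully about the contract.",
    ),
]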

Example Usage

from alea_preprocess import PreprocessPipeline

# Initialize pipeline
pipeline = PreprocessPipeline(
    mode="pretrain",
    tokenizer="kl3m-004-128k-cased",
    quality_threshold=0.8
)

# Process dataset
processed = pipeline.process_dataset(
    input_path="raw_data/",
    output_path="processed/",
    num_workers=8
)

Development Status

Currently in active development:

  • Core functionality implemented
  • Performance optimizations ongoing
  • Additional format support planned
  • Documentation expansion in progress

Testing

Comprehensive test suite available:

  • Unit tests for core functions
  • Integration tests with real data
  • Performance benchmarks
  • Example notebooks in tests/

Community

Open-source development approach:

  • Issue tracking on GitHub
  • Contribution guidelines
  • Regular updates
  • Community feedback integration

Future Roadmap

Planned enhancements:

  • Additional preprocessing modes
  • GPU acceleration options
  • Distributed processing support
  • Enhanced filtering algorithms

ALEA Preprocess represents essential infrastructure for modern LLM development, providing the tools needed to efficiently transform raw data into training-ready datasets while maintaining quality and compliance standards.
