KL3M Data Project
Dataset: Large-scale copyright-clean dataset containing 132M+ documents and trillions of tokens for training legal language models
The KL3M Data Project is the largest comprehensive training data pipeline designed to minimize copyright and breach-of-contract risks in large language model development. It contains over 132 million documents spanning trillions of tokens from verified legal sources.
Project Overview
The KL3M Data Project directly addresses the critical legal risks in AI training data by providing a copyright-clean alternative to conventional web-scraped datasets. This foundational dataset powers the KL3M family of legal language models.
Key Features
Copyright-Clean Approach
- Verified Sources: All data from government documents and legally permissible sources
- Risk Minimization: Reduces copyright infringement and breach-of-contract risk in model training
- Transparent Licensing: Clear provenance for every document in the corpus
Scale and Coverage
- 132+ Million Documents: Comprehensive coverage of legal and regulatory text
- Trillions of Tokens: Sufficient scale for modern LLM training
- 16 Verified Sources: Including US, EU, UK, and other government jurisdictions
Data Sources Include
- PACER: Federal court documents from multiple districts
- SEC EDGAR: 10-K reports, agreements, and financial filings
- Government Websites: DOL, USDA, ED, and other federal agencies
- GovInfo: Official government publications and reports
Technical Implementation
Access Methods
- Hugging Face Datasets: Direct integration with ML pipelines
- S3 Bucket Access: Bulk download for large-scale processing
- Project Gallery: Browse and explore individual sources
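As a rough illustration of the Hugging Face access path, the sketch below streams records rather than downloading a full source, which matters at this corpus size. It assumes the `datasets` package is installed. The repository id is deliberately left as a parameter because this page does not name one, and the helper names `stream_documents` and `head` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def stream_documents(repo_id: str, split: str = "train") -> Iterator[Dict[str, Any]]:
    """Lazily yield records from a Hugging Face dataset without a full download."""
    # Imported lazily so that `head` below works even without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches records on demand.
    yield from load_dataset(repo_id, split=split, streaming=True)


def head(records: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Materialize only the first n records of a possibly very large stream."""
    return list(islice(records, n))
```

A call such as `head(stream_documents(repo_id), 5)` would then pull five records for inspection, where `repo_id` is whichever dataset id the project gallery lists for the source of interest.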
Processing Pipeline
- Multi-stage extraction, transformation, and loading
- Supports both original formats and pre-tokenized representations
- Uses the KL3M-004-128k-cased tokenizer, which is optimized for legal text
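The stages above can be pictured with a minimal sketch that keeps both a plain-text and a pre-tokenized representation of each document. This is not the project's actual code: the `Record` fields, the function names, and the whitespace normalization are assumptions made for illustration. The tokenizer is passed in as a callable, so `str.split` can stand in for the KL3M-004-128k-cased tokenizer.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Record:
    source: str        # hypothetical field: which collection the document came from
    text: str          # normalized plain-text representation
    tokens: List[str]  # pre-tokenized representation


def extract(raw: bytes) -> str:
    """Extraction stage: decode raw bytes, replacing undecodable sequences."""
    return raw.decode("utf-8", errors="replace")


def transform(text: str) -> str:
    """Transformation stage: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def build_record(source: str, raw: bytes,
                 tokenize: Callable[[str], List[str]]) -> Record:
    """Loading stage: assemble a record holding both representations."""
    text = transform(extract(raw))
    return Record(source=source, text=text, tokens=tokenize(text))
```

Passing the tokenizer as a parameter keeps the stages testable without downloading a model. In a real run the callable would wrap the project's tokenizer rather than `str.split`.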
Research Publication
"The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models"
- Authors: Michael J. Bommarito II, Jillian Bommarito, Daniel Martin Katz
- Published: 2025
- Available on SSRN
Impact
This project represents a paradigm shift in responsible AI development:
- Enables legal practitioners to use AI with reduced copyright risk
- Sets new standards for ethical data collection in AI
- Provides foundation for specialized legal AI applications
Open Source Commitment
Released under the MIT License to promote:
- Transparency in AI training data
- Reproducible legal AI research
- Industry-wide adoption of ethical data practices
The KL3M Data Project demonstrates that large-scale AI training can be both powerful and legally compliant, setting a new standard for responsible AI development in the legal domain.