KL3M Dataset
Copyright-clean training resources for large language models
type: dataset
period: 2025-present
tech: Legal Informatics
──────────────────────────────────────────────────────────────────
The KL3M Dataset is the largest comprehensive copyright-clean training data resource for large language models, containing over 132 million documents and more than 1.3 trillion tokens from 16 verified sources.
Scale and Scope
- 28TB compressed storage (as of April 2025)
- 1.35 trillion tokens across 16 sources
- 132+ million documents with clear licensing
Data Sources
All data is sourced from:
- Government documents exempt from copyright
- Public domain works
- Content with explicit open licenses
- Materials with verified consent
Access and Formats
The dataset is freely available through:
- Hugging Face Hub - Easy integration with ML pipelines
- S3 buckets - Direct bulk downloads
- GitHub - Documentation and processing code
All data is provided in efficient Parquet format with comprehensive metadata.
Impact
KL3M addresses the critical need for legally sound training data in the AI community, enabling:
- Lower-risk commercial model development
- Academic research without copyright concerns
- Transparent and reproducible AI training
Related Publications
- The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models
- Authors: Michael James Bommarito II, Daniel Martin Katz, Jillian Bommarito
- Published in 2025