KL3M Dataset

dataset

Copyright-clean training resources for large language models

period: 2025-present
tech: Legal Informatics
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

The KL3M Dataset is the largest comprehensive copyright-clean training data resource for large language models, containing over 132 million documents and more than 1.35 trillion tokens drawn from 16 verified sources.

Scale and Scope

  • 28 TB of compressed storage (as of April 2025)
  • 1.35 trillion tokens across 16 sources
  • 132+ million documents with clear licensing

Data Sources

All data is sourced from:

  • Government documents exempt from copyright
  • Public domain works
  • Content with explicit open licenses
  • Materials with verified consent

Access and Formats

The dataset is freely available through:

  • Hugging Face Hub - Easy integration with ML pipelines
  • S3 buckets - Direct bulk downloads
  • GitHub - Documentation and processing code

All data is provided in efficient Parquet format with comprehensive metadata.

Impact

KL3M addresses the critical need for legally sound training data in the AI community, enabling:

  • Commercial model development with minimal copyright risk
  • Academic research without copyright concerns
  • Transparent and reproducible AI training