KL3M Dataset

dataset

Copyright-clean training resources for large language models

period: 2025-present
tech: Legal Informatics
โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

The KL3M Dataset is the largest comprehensive copyright-clean training data resource for large language models, containing over 132 million documents and more than 1.35 trillion tokens drawn from 16 verified sources.

Scale and Scope

  • 28 TB of compressed storage (as of April 2025)
  • 1.35 trillion tokens across 16 sources
  • 132+ million documents with clear licensing

Data Sources

All data is sourced from:

  • Government documents exempt from copyright
  • Public domain works
  • Content with explicit open licenses
  • Materials with verified consent

Access and Formats

The dataset is freely available through:

  • Hugging Face Hub - Easy integration with ML pipelines
  • S3 buckets - Direct bulk downloads
  • GitHub - Documentation and processing code

All data is provided in efficient Parquet format with comprehensive metadata.

Impact

KL3M addresses the critical need for legally sound training data in the AI community, enabling:

  • Commercial model development with minimal copyright risk
  • Academic research without copyright concerns
  • Transparent and reproducible AI training