The KL3M Data Project is the largest comprehensive training data pipeline built to minimize copyright and contract breach risks in large language model development. It contains over 132 million documents, spanning trillions of tokens, drawn from verified legal sources.

Project Overview

The KL3M Data Project directly addresses the critical legal risks in AI training data by providing a copyright-clean alternative to conventional web-scraped datasets. This foundational dataset powers the KL3M family of legal language models.

Key Features

Copyright-Clean Approach

Verified Sources: All data comes from government documents and other legally permissible sources
Risk Minimization: Reduces copyright infringement risk in model training
Transparent Licensing: Clear provenance for every document in the corpus

Scale and Coverage

132+ Million Documents: Comprehensive coverage of legal and regulatory text
Trillions of Tokens: Sufficient scale for modern LLM training
16 Verified Sources: Including US, EU, UK, and other government jurisdictions

Data Sources Include

PACER: Federal court documents from multiple districts
SEC EDGAR: 10-K reports, agreements, and financial filings
Government Websites: DOL, USDA, ED, and other federal agencies
GovInfo: Official government publications and reports

Technical Implementation

Access Methods

Hugging Face Datasets: Direct integration with ML pipelines
S3 Bucket Access: Bulk download for large-scale processing
Project Gallery: Browse and explore individual sources
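
As an illustration of the Hugging Face access path, the following is a minimal sketch that streams a few records instead of downloading a whole source. It assumes the third-party `datasets` library is installed and that the dataset exposes a `train` split. The repository id is a hypothetical placeholder, not a confirmed dataset name, and the helper `iter_documents` is introduced here for illustration only.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

# Hypothetical repository id: replace it with a real dataset name
# taken from the project gallery or the Hugging Face hub.
EXAMPLE_REPO_ID = "example-org/kl3m-data-example"


def iter_documents(
    repo_id: str,
    limit: int = 5,
    loader: Optional[Callable[..., Iterable[Dict[str, Any]]]] = None,
) -> Iterator[Dict[str, Any]]:
    """Return an iterator over at most `limit` records of a dataset."""
    if loader is None:
        # Imported lazily so this module loads even without `datasets`.
        from datasets import load_dataset

        loader = load_dataset
    # streaming=True iterates over records without first downloading
    # the full dataset, which matters at this corpus size.
    # The split name "train" is an assumption.
    stream = loader(repo_id, split="train", streaming=True)
    return islice(iter(stream), limit)
```

Calling `iter_documents(EXAMPLE_REPO_ID, limit=3)` with a real repository id would return the first three records. The `loader` parameter exists only so that the function can be exercised offline with a stand-in.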

Processing Pipeline

Multi-stage extraction, transformation, and loading
Supports both original formats and pre-tokenized representations
Uses the KL3M-004-128k-cased tokenizer, which is optimized for legal text
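
The stages listed above can be sketched roughly as follows. This is not the project's implementation: the field names (`id`, `source`, `content`), the `Record` type, and the helper functions are hypothetical, and `toy_encode` is a whitespace stand-in for a real tokenizer such as KL3M-004-128k-cased. The sketch only shows how extraction, transformation, and loading can be chained so that each record keeps both its text and a pre-tokenized form.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Record:
    # Hypothetical schema, not the project's actual record layout.
    identifier: str
    source: str
    text: str
    tokens: List[int]


def extract(raw_docs: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Stage 1: keep only documents that carry non-empty content."""
    for doc in raw_docs:
        if doc.get("content", "").strip():
            yield doc


def transform(doc: Dict[str, str], encode: Callable[[str], List[int]]) -> Record:
    """Stage 2: normalize whitespace and attach a pre-tokenized form."""
    text = re.sub(r"\s+", " ", doc["content"]).strip()
    return Record(doc["id"], doc["source"], text, encode(text))


def load(records: Iterable[Record]) -> List[str]:
    """Stage 3: serialize each record as one JSON line."""
    return [json.dumps(asdict(record)) for record in records]


def toy_encode(text: str) -> List[int]:
    """Stand-in tokenizer: ids are per-document first-occurrence indices."""
    vocab: Dict[str, int] = {}
    return [vocab.setdefault(word, len(vocab)) for word in text.lower().split()]


def run_pipeline(
    raw_docs: Iterable[Dict[str, str]],
    encode: Callable[[str], List[int]],
) -> List[str]:
    """Chain the three stages lazily and return the JSON lines."""
    return load(transform(doc, encode) for doc in extract(raw_docs))
```

In practice the `encode` argument would wrap the real tokenizer. Passing it as a parameter keeps the pipeline stages independent of any particular tokenizer library.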

Research Publication

“The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models”

Authors: Michael J. Bommarito II, Jillian Bommarito, Daniel Martin Katz
Published: 2025
Available on SSRN

Impact

This project supports responsible AI development in several ways:

Enables legal practitioners to use AI with reduced copyright risk from training data
Sets new standards for ethical data collection in AI
Provides a foundation for specialized legal AI applications

Open Source Commitment

Released under the MIT License to promote:

Transparency in AI training data
Reproducible legal AI research
Industry-wide adoption of ethical data practices

The KL3M Data Project demonstrates that large-scale AI training can be both powerful and legally compliant, setting a new standard for responsible AI development in the legal domain.