KL3M Data Project

dataset

Large-scale copyright-clean dataset containing 132M+ documents and trillions of tokens for training legal language models

period: 2024-present
team: ALEA Institute, 273 Ventures
tech:
Legal Informatics
══════════════════════════════════════════════════════════════════

The largest comprehensive training data pipeline designed to minimize copyright and breach-of-contract risk in large language model development. The corpus contains over 132 million documents and trillions of tokens drawn from verified sources.

Project Overview

The KL3M Data Project directly addresses the critical legal risks in AI training data by providing a copyright-clean alternative to conventional web-scraped datasets. This foundational dataset powers the KL3M family of legal language models.

Key Features

  • Verified Sources: All data comes from government documents and other legally permissible sources
  • Risk Minimization: Reduces copyright and breach-of-contract risk in model training
  • Transparent Licensing: Clear provenance for every document in the corpus

Scale and Coverage

  • 132+ Million Documents: Comprehensive coverage of legal and regulatory text
  • Trillions of Tokens: Sufficient scale for modern LLM training
  • 16 Verified Sources: Including US, EU, UK, and other government jurisdictions

Data Sources Include

  • PACER: Federal court documents from multiple districts
  • SEC EDGAR: 10-K reports, agreements, and financial filings
  • Government Websites: DOL, USDA, ED, and other federal agencies
  • GovInfo: Official government publications and reports

Technical Implementation

Access Methods

  1. Hugging Face Datasets: Direct integration with ML pipelines
  2. S3 Bucket Access: Bulk download for large-scale processing
  3. Project Gallery: Browse and explore individual sources
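
As a minimal sketch of the first access method, the snippet below streams a few records through the Hugging Face `datasets` library. The repository id is a hypothetical placeholder, the "train" split name is an assumption, and the helper names are illustrative rather than part of the project. The loader is injectable so the logic can be exercised without network access.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator

# Hypothetical placeholder: replace with a real repository id taken from
# the project's Hugging Face listing.
EXAMPLE_REPO_ID = "example-org/example-kl3m-source"

Document = Dict[str, Any]
Loader = Callable[[str], Iterable[Document]]


def hub_loader(repo_id: str) -> Iterable[Document]:
    """Open a dataset lazily from the Hugging Face Hub."""
    # Imported here so the rest of the module works without the package.
    from datasets import load_dataset  # pip install datasets

    # streaming=True iterates over records without downloading the whole
    # dataset first. The "train" split name is an assumption.
    return load_dataset(repo_id, split="train", streaming=True)


def iter_documents(
    repo_id: str, limit: int, loader: Loader = hub_loader
) -> Iterator[Document]:
    """Yield at most `limit` documents from `repo_id`."""
    return islice(loader(repo_id), limit)
```

With a real repository id, `list(iter_documents(repo_id, limit=3))` would return the first three records. Field names differ between sources, so inspecting `record.keys()` before relying on any particular field is the safer choice.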

Processing Pipeline

  • Multi-stage extraction, transformation, and loading
  • Supports both original formats and pre-tokenized representations
  • Uses the KL3M-004-128k-cased tokenizer, which is optimized for legal text
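
The stages listed above can be illustrated with a toy extract, transform, and load flow that keeps both the cleaned text and a token representation for each document. This is a sketch under stated assumptions, not the project's actual pipeline: the record fields, the function names, and the whitespace-based stand-in tokenizer are all hypothetical, and the real tokenizer would take the stand-in's place.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List

Tokenizer = Callable[[str], List[int]]


@dataclass
class Record:
    """Hypothetical output record holding text and token ids side by side."""
    identifier: str
    source: str
    text: str
    tokens: List[int]


def extract(raw_items: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Stage 1: keep only items that carry non-blank content."""
    for item in raw_items:
        if item.get("content", "").strip():
            yield item


def transform(text: str) -> str:
    """Stage 2: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def make_toy_tokenizer() -> Tokenizer:
    """Stand-in tokenizer (assumption): one integer id per distinct word."""
    vocab: Dict[str, int] = {}

    def tokenize(text: str) -> List[int]:
        # setdefault assigns the next free id the first time a word is seen.
        return [vocab.setdefault(word, len(vocab)) for word in text.split()]

    return tokenize


def run_pipeline(
    raw_items: Iterable[Dict[str, str]], tokenize: Tokenizer
) -> Iterator[Record]:
    """Stage 3: emit records with both representations."""
    for item in extract(raw_items):
        text = transform(item["content"])
        yield Record(item["id"], item["source"], text, tokenize(text))
```

Keeping the text and the token ids on the same record mirrors the idea of offering original and pre-tokenized forms together, so a consumer can choose either without re-running earlier stages.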

Research Publication

“The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models”

  • Authors: Michael J. Bommarito II, Jillian Bommarito, Daniel Martin Katz
  • Published: 2025
  • Available on SSRN

Impact

This project represents a paradigm shift in responsible AI development:

  • Lets legal practitioners use AI models with reduced copyright risk from training data
  • Sets new standards for ethical data collection in AI
  • Provides foundation for specialized legal AI applications

Open Source Commitment

Released under the MIT License to promote:

  • Transparency in AI training data
  • Reproducible legal AI research
  • Industry-wide adoption of ethical data practices

The KL3M Data Project demonstrates that large-scale AI training can be both powerful and legally compliant, setting a new standard for responsible AI development in the legal domain.
