KL3M Tokenizers

A family of specialized tokenizers optimized for legal, financial, and governmental text processing, demonstrating that tokenization remains critical for domain-specific optimization in professional AI applications.

Research Impact

KL3M tokenizers are measurably more efficient than the tokenizers of state-of-the-art models:

Legal terminology: 4.20 vs 7.70 tokens per term on average, meaning the comparison tokenizer needs about 83% more tokens
9-17% fewer tokens than the GPT-4o and LLaMA3 tokenizers
These gains hold despite smaller vocabulary sizes
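
A ratio like the legal-terminology figure can be quoted two ways, and the short sketch below shows how they relate. It uses only the two per-term averages reported above; the function names are illustrative and not part of any KL3M code.

```python
def percent_more(baseline: float, ours: float) -> float:
    """Percent more tokens the baseline uses, relative to ours."""
    return (baseline - ours) / ours * 100.0


def percent_fewer(baseline: float, ours: float) -> float:
    """Percent fewer tokens ours uses, relative to the baseline."""
    return (baseline - ours) / baseline * 100.0


# Average tokens per legal term, as reported above.
kl3m_tokens = 4.20
baseline_tokens = 7.70

more = percent_more(baseline_tokens, kl3m_tokens)
fewer = percent_fewer(baseline_tokens, kl3m_tokens)
print(f"baseline uses {more:.1f}% more tokens per term")
print(f"KL3M uses {fewer:.1f}% fewer tokens per term")
```

The same gap reads as roughly 83% when measured against the smaller number and roughly 45% when measured against the larger one, so it helps to state which denominator a headline figure uses.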

Publication

“KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications”

Authors: Michael J. Bommarito, Daniel Martin Katz, Jillian Bommarito
Published: 2025
Available on arXiv

Tokenizer Family

Domain-Specific BPE Tokenizers

kl3m-003-64k: 64k vocabulary general-purpose tokenizer
kl3m-004-128k-cased: Case-sensitive 128k vocabulary for precise legal/financial text
kl3m-004-128k-uncased: Case-insensitive variant for flexible applications
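
To illustrate what the cased/uncased choice trades away, the sketch below approximates uncased preprocessing with plain lowercasing. That approximation is an assumption, the example strings are hypothetical, and the code does not call the KL3M tokenizers themselves.

```python
def uncased_view(text: str) -> str:
    """Stand-in for uncased normalization: plain lowercasing (an assumption)."""
    return text.lower()


# Hypothetical pairs that differ only by case yet usually mean different things.
pairs = [("US", "us"), ("IT", "it"), ("Act", "act")]

for cased, lowered in pairs:
    collide = uncased_view(cased) == uncased_view(lowered)
    print(f"{cased!r} vs {lowered!r}: distinct when cased, identical when uncased: {collide}")
```

A cased vocabulary keeps these forms apart, while an uncased one merges them and can spend its vocabulary budget on fewer surface forms.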

Character-Level BPE Tokenizers

kl3m-004-char-4k: 4k vocabulary for fine-grained control
kl3m-004-char-8k: 8k vocabulary as a balance between the two
kl3m-004-char-16k: 16k vocabulary for broader coverage

Key Features

Specialized Token Support

Legal citations and references
Financial abbreviations and symbols
JSON/HTML parsing tokens
Government document structures

Clean Training Data

Trained exclusively on copyright-free sources
No licensing restrictions on usage
Full transparency in data provenance

Professional Optimization

Minimizes the representation of toxic language in the vocabulary
Optimized for formal document processing
Maintains critical case sensitivity for legal accuracy

Technical Advantages

The tokenizers address specific challenges in professional domains:

Legal Precision: Preserves exact terminology and citations
Financial Accuracy: Handles numerical formats and abbreviations
Efficiency: Reduces computational costs through better tokenization
Compatibility: Works seamlessly with existing transformer architectures
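
The efficiency point can be checked for any tokenizer by measuring average tokens per term. The sketch below is a minimal, hypothetical harness: `mean_tokens_per_term` accepts any function that maps a string to a sequence of tokens, and the two toy encoders (character-level and whitespace-level) are stand-ins rather than the KL3M tokenizers. The term list is illustrative only.

```python
from typing import Callable, List, Sequence

# Any callable that turns text into a sequence of tokens or token ids.
Encoder = Callable[[str], Sequence]


def mean_tokens_per_term(encode: Encoder, terms: List[str]) -> float:
    """Average number of tokens the encoder produces per term."""
    if not terms:
        raise ValueError("terms must not be empty")
    return sum(len(encode(term)) for term in terms) / len(terms)


def char_encode(text: str) -> List[str]:
    """Toy encoder: one token per character."""
    return list(text)


def word_encode(text: str) -> List[str]:
    """Toy encoder: one token per whitespace-separated word."""
    return text.split()


# Illustrative terms, not the benchmark set used in the paper.
terms = ["habeas corpus", "res judicata", "EBITDA"]

print("char-level:", round(mean_tokens_per_term(char_encode, terms), 2))
print("word-level:", round(mean_tokens_per_term(word_encode, terms), 2))
```

Fewer tokens per term means shorter sequences for the same text, which is where the reduction in computational cost comes from. A real tokenizer can be plugged in by passing its encode function in place of the toy encoders.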

Availability

All tokenizers are freely available:

Hugging Face Hub: Under the alea-institute/ namespace
GitHub: Complete source code and training scripts
Licensing: MIT and CC-BY 4.0 for maximum accessibility
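
A minimal sketch of loading one of the tokenizers from the Hub follows. It assumes that each repository ID is the organization name joined with the tokenizer name listed earlier, and that the repositories load through the `transformers` `AutoTokenizer` class; neither assumption is confirmed by this page. The helper names are hypothetical.

```python
def hub_repo_id(name: str, org: str = "alea-institute") -> str:
    """Build a Hub repository ID (assumed layout: '<org>/<tokenizer name>')."""
    return f"{org}/{name}"


def load_kl3m_tokenizer(name: str = "kl3m-004-128k-cased"):
    """Download and return the tokenizer; requires network access."""
    # Imported lazily so hub_repo_id works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(hub_repo_id(name))


def count_tokens(text: str, name: str = "kl3m-004-128k-cased") -> int:
    """Number of token ids the chosen tokenizer produces for the text."""
    tokenizer = load_kl3m_tokenizer(name)
    return len(tokenizer.encode(text))
```

Calling `count_tokens("The registrant filed its annual report on Form 10-K.")` would download the tokenizer on first use and return the token count for that sentence.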

Impact on Professional AI

This work establishes that domain-specific tokenization can significantly improve:

Model efficiency and cost
Accuracy in specialized terminology
Processing speed for professional documents
Overall quality of legal and financial AI applications

The KL3M tokenizers demonstrate that foundational NLP components like tokenization remain critical optimization points for specialized AI deployments.