fast zero-dependency sentence splitting in python with nupunkt
overview
nupunkt - sentence boundary detection for legal and general text
- extends the Punkt algorithm (Kiss & Strunk) with 4,000+ abbreviations
- trained on the KL3M legal corpus, but effective across domains — used in biomedical, financial, and general NLP pipelines
- 91.1% precision on legal benchmarks, significantly above spaCy and NLTK
- two implementations: pure python (`nupunkt`) and rust with python bindings (`nupunkt-rs`)
- developed by the ALEA Institute (501(c)(3) nonprofit)
- ~7,000 monthly PyPI downloads
when to use
use nupunkt when:
- you need accurate sentence splitting without heavy dependencies
- precision matters more than recall for downstream retrieval or RAG pipelines
- processing large volumes of text at scale (10M+ chars/sec)
- working with text that contains abbreviations, citations, or other tricky punctuation (legal, biomedical, financial, academic)
don’t use when:
- you need a full NLP pipeline (use spaCy instead)
- recall is more important than precision for your use case
installation
```bash
# pure python (3.11+, zero dependencies)
uv pip install nupunkt

# rust implementation (drop-in replacement, ~3x faster, ~36x less memory)
uv pip install nupunkt-rs
```

basic usage
sentence tokenization
```python
from nupunkt import sent_tokenize

text = """The court held in Smith v. Jones, 123 F.3d 456 (9th Cir. 2024),
that the statute applies. The defendant appealed."""

sentences = sent_tokenize(text)
# ['The court held in Smith v. Jones, 123 F.3d 456 (9th Cir. 2024),\nthat the statute applies.',
#  'The defendant appealed.']
```

adaptive tokenizer with confidence (v0.6.0+)
```python
from nupunkt import sent_tokenize_adaptive

sentences = sent_tokenize_adaptive(
    text,
    threshold=0.7,
    return_confidence=True,
)
# returns sentences with boundary confidence scores
```

nupunkt-rs (drop-in replacement)
```python
import nupunkt_rs

sentences = nupunkt_rs.sent_tokenize(text)

# precision/recall tuning (0.0 = max recall, 1.0 = max precision)
sentences = nupunkt_rs.sent_tokenize(text, precision_recall=0.5)

# character spans instead of strings
tokenizer = nupunkt_rs.create_default_tokenizer()
spans = tokenizer.tokenize_spans(text)
```

performance
benchmarks on legal text
| model | precision | f1 | throughput | memory |
|---|---|---|---|---|
| nupunkt | 0.911 | 0.725 | 10M chars/sec | 432 MB |
| nupunkt-rs | 0.911 | 0.725 | 30M chars/sec | 12 MB |
| CharBoundary (Large) | 0.763 | 0.782 | 518K chars/sec | 5,734 MB |
| spaCy (small) | 0.647 | 0.657 | 97K chars/sec | 1,231 MB |
| NLTK Punkt | 0.621 | 0.708 | 9M chars/sec | 460 MB |
| pySBD | 0.593 | 0.716 | 258K chars/sec | 1,509 MB |
nupunkt trades some recall (lower F1) for substantially higher precision. this is often the right tradeoff for retrieval systems where false sentence boundaries corrupt passage embeddings.
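to see why abbreviation handling drives that precision gap, here's a minimal stdlib sketch of what a naive punctuation rule does to the legal citation from the example above (the regex splitter is my own illustration, not nupunkt's algorithm):

```python
import re

text = (
    "The court held in Smith v. Jones, 123 F.3d 456 (9th Cir. 2024), "
    "that the statute applies. The defendant appealed."
)

# naive rule: a sentence ends at '.', '!', or '?' followed by whitespace
naive = re.split(r"(?<=[.!?])\s+", text)

# 'v.' and 'Cir.' each trigger a false boundary, shredding the citation
# into fragments -- exactly the kind of split that corrupts passage embeddings
```

the naive rule produces four "sentences" where there are only two; an abbreviation-aware splitter keeps the citation intact.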
precision by document type
| document type | precision |
|---|---|
| BVA decisions | 0.987 |
| Cybercrime cases | 0.901 |
| SCOTUS opinions | 0.847 |
python vs rust
the rust implementation (nupunkt-rs) is a drop-in replacement with identical precision. the differences:
- ~3x throughput (30M vs 10M chars/sec)
- ~36x less memory (12 MB vs 432 MB)
- supports a `precision_recall` parameter for tuning the precision-recall tradeoff
- provides `tokenize_spans` for character-level span output
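as a sketch of how span output is typically consumed (the `(start, end)` character-offset shape is my assumption about `tokenize_spans`, so verify against the actual return type), slicing the original text reconstructs sentences losslessly and keeps offsets available for downstream alignment:

```python
text = "First sentence. Second sentence."

# hypothetical span output: (start, end) character offsets into `text`
spans = [(0, 15), (16, 32)]

# slicing the source text preserves exact whitespace and positions
sentences = [text[start:end] for start, end in spans]
```

spans are useful when you need to map sentences back to their location in the source document, e.g. for highlighting or chunk provenance.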
advanced features
cli tools (v0.6.0+)
nupunkt ships cli utilities for training custom models, running evaluations, and optimizing thresholds:
```bash
# evaluate on your own corpus
nupunkt evaluate --input corpus.jsonl

# train a custom model
nupunkt train --input training_data.jsonl --output model.bin
```

cross-platform model loading
v0.6.0+ supports loading models across platforms, allowing models trained on one OS to be used on another.
interactive demo
try it in the browser: sentences.aleainstitute.ai
references
- nupunkt on GitHub
- nupunkt-rs on GitHub
- nupunkt on PyPI
- nupunkt-rs on PyPI
- Bommarito, Katz, and Bommarito. “Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary.” arXiv:2504.04131, April 2025.
- Kiss and Strunk. “Unsupervised Multilingual Sentence Boundary Detection.” Computational Linguistics, 2006.