fast zero-dependency sentence splitting in python with nupunkt

overview

nupunkt - sentence boundary detection for legal and general text

  • extends the Punkt algorithm (Kiss & Strunk) with 4,000+ abbreviations
  • trained on the KL3M legal corpus, but effective across domains — used in biomedical, financial, and general NLP pipelines
  • 91.1% precision on legal benchmarks, significantly above spaCy and NLTK
  • two implementations: pure python (nupunkt) and rust with python bindings (nupunkt-rs)
  • developed by the ALEA Institute (501(c)(3) nonprofit)
  • ~7,000 monthly PyPI downloads

when to use

use nupunkt when:

  • you need accurate sentence splitting without heavy dependencies
  • precision matters more than recall for downstream retrieval or RAG pipelines
  • processing large volumes of text at scale (10M+ chars/sec)
  • working with text that contains abbreviations, citations, or other tricky punctuation (legal, biomedical, financial, academic)

don’t use when:

  • you need a full NLP pipeline (use spaCy instead)
  • recall is more important than precision for your use case

installation

# pure python (3.11+, zero dependencies)
uv pip install nupunkt

# rust implementation (drop-in replacement, ~3x faster, ~36x less memory)
uv pip install nupunkt-rs

basic usage

sentence tokenization

from nupunkt import sent_tokenize

text = """The court held in Smith v. Jones, 123 F.3d 456 (9th Cir. 2024),
that the statute applies. The defendant appealed."""

sentences = sent_tokenize(text)
# ['The court held in Smith v. Jones, 123 F.3d 456 (9th Cir. 2024),\nthat the statute applies.',
#  'The defendant appealed.']

adaptive tokenizer with confidence (v0.6.0+)

from nupunkt import sent_tokenize_adaptive

sentences = sent_tokenize_adaptive(
    text,
    threshold=0.7,
    return_confidence=True
)
# returns sentences with boundary confidence scores
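a threshold works by rejecting low-confidence boundary candidates rather than by filtering finished sentences. the exact return type of sent_tokenize_adaptive is not shown here, so the sketch below uses hypothetical (offset, confidence) boundary candidates to illustrate the idea: boundaries below the threshold are dropped, merging the text on either side.

```python
text = "Dr. Smith arrived. He testified."

# hypothetical boundary candidates: (char_offset, confidence);
# the period in "Dr." gets a low score, the real boundary a high one
boundaries = [(3, 0.10), (18, 0.95)]

threshold = 0.7
# keep only confident cut points, then slice the text between them
cuts = [0] + [pos for pos, conf in boundaries if conf >= threshold] + [len(text)]
sentences = [text[a:b].strip() for a, b in zip(cuts, cuts[1:])]
print(sentences)  # ['Dr. Smith arrived.', 'He testified.']
```

raising the threshold trades recall for precision: fewer cuts, but the surviving ones are more reliable.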

nupunkt-rs (drop-in replacement)

import nupunkt_rs

sentences = nupunkt_rs.sent_tokenize(text)

# precision/recall tuning (0.0 = max recall, 1.0 = max precision)
sentences = nupunkt_rs.sent_tokenize(text, precision_recall=0.5)

# character spans instead of strings
tokenizer = nupunkt_rs.create_default_tokenizer()
spans = tokenizer.tokenize_spans(text)
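
spans are useful when downstream code needs to map sentences back to positions in the original document (highlighting, annotation, citation extraction). assuming each span is a (start, end) character-offset pair, recovering the substrings is a one-liner; the spans below are hardcoded for illustration rather than produced by the library:

```python
text = "The motion was denied. The case proceeded to trial."

# hypothetical output of tokenizer.tokenize_spans(text): (start, end) offsets
spans = [(0, 22), (23, 51)]

# slicing by offsets preserves the exact original text, including whitespace
# that string output may normalize away
sentences = [text[start:end] for start, end in spans]
print(sentences)  # ['The motion was denied.', 'The case proceeded to trial.']
```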

performance

model                 precision  F1     throughput      memory
nupunkt               0.911      0.725  10M chars/sec   432 MB
nupunkt-rs            0.911      0.725  30M chars/sec   12 MB
CharBoundary (Large)  0.763      0.782  518K chars/sec  5,734 MB
spaCy (small)         0.647      0.657  97K chars/sec   1,231 MB
NLTK Punkt            0.621      0.708  9M chars/sec    460 MB
pySBD                 0.593      0.716  258K chars/sec  1,509 MB

nupunkt trades recall for precision: its F1 trails CharBoundary slightly, but its precision is the highest of any model tested. this is often the right tradeoff for retrieval and RAG systems, where a false sentence boundary splits a passage mid-thought and corrupts its embedding.
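
the recall behind each row can be recovered algebraically from the precision and F1 figures in the table, since F1 = 2PR/(P+R) implies R = F1·P/(2P − F1). a quick sketch using the table's numbers:

```python
def recall_from(precision: float, f1_score: float) -> float:
    # solve F1 = 2*P*R / (P + R) for R
    return f1_score * precision / (2 * precision - f1_score)

# figures from the benchmark table above
print(round(recall_from(0.911, 0.725), 3))  # nupunkt: 0.602
print(round(recall_from(0.621, 0.708), 3))  # NLTK Punkt: 0.823
```

this makes the tradeoff concrete: NLTK finds more of the true boundaries (higher recall) but introduces many more false ones, while nupunkt cuts conservatively.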

precision by document type

document type     precision
BVA decisions     0.987
Cybercrime cases  0.901
SCOTUS opinions   0.847

python vs rust

the rust implementation (nupunkt-rs) is a drop-in replacement with identical precision. the differences:

  • ~3x throughput (30M vs 10M chars/sec)
  • ~36x less memory (12 MB vs 432 MB)
  • supports precision_recall parameter for tuning the precision-recall tradeoff
  • provides tokenize_spans for character-level span output
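
throughput figures like those above are easy to verify on your own hardware. the harness below is a minimal sketch; the regex splitter is only a stand-in so the snippet runs without either package installed — substitute nupunkt.sent_tokenize or nupunkt_rs.sent_tokenize to benchmark the real implementations:

```python
import re
import time

def naive_split(text: str) -> list[str]:
    # stand-in splitter: cut after sentence-final punctuation followed by space
    return re.split(r"(?<=[.!?])\s+", text)

def chars_per_sec(splitter, text: str, repeats: int = 5) -> float:
    # time repeated runs and report characters processed per second
    start = time.perf_counter()
    for _ in range(repeats):
        splitter(text)
    elapsed = time.perf_counter() - start
    return len(text) * repeats / elapsed

corpus = "The court held that the statute applies. The defendant appealed. " * 10_000
rate = chars_per_sec(naive_split, corpus)
print(f"{rate / 1e6:.1f}M chars/sec")
```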

advanced features

cli tools (v0.6.0+)

nupunkt ships cli utilities for training custom models, running evaluations, and optimizing thresholds:

# evaluate on your own corpus
nupunkt evaluate --input corpus.jsonl

# train a custom model
nupunkt train --input training_data.jsonl --output model.bin

cross-platform model loading

v0.6.0+ supports loading models across platforms, allowing models trained on one OS to be used on another.

interactive demo

try it in the browser: sentences.aleainstitute.ai
