fast zero-dependency sentence splitting in python with nupunkt

overview

nupunkt - sentence boundary detection for legal and general text

  • extends the Punkt algorithm (Kiss & Strunk) with 4,000+ abbreviations
  • trained on the KL3M legal corpus, but effective across domains — used in biomedical, financial, and general NLP pipelines
  • 91.1% precision on legal benchmarks, significantly above spaCy and NLTK
  • two implementations: pure python (nupunkt) and rust with python bindings (nupunkt-rs)
  • developed by the ALEA Institute (501(c)(3) nonprofit)
  • ~7,000 monthly PyPI downloads

when to use

use nupunkt when:

  • you need accurate sentence splitting without heavy dependencies
  • precision matters more than recall for downstream retrieval or RAG pipelines
  • processing large volumes of text at scale (10M+ chars/sec)
  • working with text that contains abbreviations, citations, or other tricky punctuation (legal, biomedical, financial, academic)

don’t use when:

  • you need a full NLP pipeline (use spaCy instead)
  • recall is more important than precision for your use case

installation

# pure python (3.11+, zero dependencies)
uv pip install nupunkt

# rust implementation (drop-in replacement, ~3x faster, ~36x less memory)
uv pip install nupunkt-rs

basic usage

sentence tokenization

from nupunkt import sent_tokenize

text = """The court held in Smith v. Jones, 123 F.3d 456 (9th Cir. 2024),
that the statute applies. The defendant appealed."""

sentences = sent_tokenize(text)
# ['The court held in Smith v. Jones, 123 F.3d 456 (9th Cir. 2024),\nthat the statute applies.',
#  'The defendant appealed.']

adaptive tokenizer with confidence (v0.6.0+)

from nupunkt import sent_tokenize_adaptive

sentences = sent_tokenize_adaptive(
    text,
    threshold=0.7,
    return_confidence=True
)
# returns sentences with boundary confidence scores
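a threshold works by rejecting low-confidence boundary candidates rather than by filtering finished sentences. the exact return type of sent_tokenize_adaptive is not shown here, so the sketch below uses hypothetical (offset, confidence) boundary candidates to illustrate the idea: boundaries below the threshold are dropped, merging the text on either side.

```python
text = "Dr. Smith arrived. He testified."

# hypothetical boundary candidates: (char_offset, confidence);
# the period in "Dr." gets a low score, the real boundary a high one
boundaries = [(3, 0.10), (18, 0.95)]

threshold = 0.7
# keep only confident cut points, then slice the text between them
cuts = [0] + [pos for pos, conf in boundaries if conf >= threshold] + [len(text)]
sentences = [text[a:b].strip() for a, b in zip(cuts, cuts[1:])]
print(sentences)  # ['Dr. Smith arrived.', 'He testified.']
```

raising the threshold trades recall for precision: fewer cuts, but the surviving ones are more reliable.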

nupunkt-rs (drop-in replacement)

import nupunkt_rs

sentences = nupunkt_rs.sent_tokenize(text)

# precision/recall tuning (0.0 = max recall, 1.0 = max precision)
sentences = nupunkt_rs.sent_tokenize(text, precision_recall=0.5)

# character spans instead of strings
tokenizer = nupunkt_rs.create_default_tokenizer()
spans = tokenizer.tokenize_spans(text)
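
spans are useful when downstream code needs to map sentences back to positions in the original document (highlighting, annotation, citation extraction). assuming each span is a (start, end) character-offset pair, recovering the substrings is a one-liner; the spans below are hardcoded for illustration rather than produced by the library:

```python
text = "The motion was denied. The case proceeded to trial."

# hypothetical output of tokenizer.tokenize_spans(text): (start, end) offsets
spans = [(0, 22), (23, 51)]

# slicing by offsets preserves the exact original text, including whitespace
# that string output may normalize away
sentences = [text[start:end] for start, end in spans]
print(sentences)  # ['The motion was denied.', 'The case proceeded to trial.']
```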

performance

model                 precision  F1     throughput      memory
nupunkt               0.911      0.725  10M chars/sec   432 MB
nupunkt-rs            0.911      0.725  30M chars/sec   12 MB
CharBoundary (Large)  0.763      0.782  518K chars/sec  5,734 MB
spaCy (small)         0.647      0.657  97K chars/sec   1,231 MB
NLTK Punkt            0.621      0.708  9M chars/sec    460 MB
pySBD                 0.593      0.716  258K chars/sec  1,509 MB

nupunkt trades recall for precision: its F1 trails CharBoundary slightly, but its precision is the highest of any model tested. this is often the right tradeoff for retrieval and RAG systems, where a false sentence boundary splits a passage mid-thought and corrupts its embedding.
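
the recall behind each row can be recovered algebraically from the precision and F1 figures in the table, since F1 = 2PR/(P+R) implies R = F1·P/(2P − F1). a quick sketch using the table's numbers:

```python
def recall_from(precision: float, f1_score: float) -> float:
    # solve F1 = 2*P*R / (P + R) for R
    return f1_score * precision / (2 * precision - f1_score)

# figures from the benchmark table above
print(round(recall_from(0.911, 0.725), 3))  # nupunkt: 0.602
print(round(recall_from(0.621, 0.708), 3))  # NLTK Punkt: 0.823
```

this makes the tradeoff concrete: NLTK finds more of the true boundaries (higher recall) but introduces many more false ones, while nupunkt cuts conservatively.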

precision by document type

document type     precision
BVA decisions     0.987
Cybercrime cases  0.901
SCOTUS opinions   0.847

python vs rust

the rust implementation (nupunkt-rs) is a drop-in replacement with identical precision. the differences:

  • ~3x throughput (30M vs 10M chars/sec)
  • ~36x less memory (12 MB vs 432 MB)
  • supports precision_recall parameter for tuning the precision-recall tradeoff
  • provides tokenize_spans for character-level span output
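
throughput figures like those above are easy to verify on your own hardware. the harness below is a minimal sketch; the regex splitter is only a stand-in so the snippet runs without either package installed — substitute nupunkt.sent_tokenize or nupunkt_rs.sent_tokenize to benchmark the real implementations:

```python
import re
import time

def naive_split(text: str) -> list[str]:
    # stand-in splitter: cut after sentence-final punctuation followed by space
    return re.split(r"(?<=[.!?])\s+", text)

def chars_per_sec(splitter, text: str, repeats: int = 5) -> float:
    # time repeated runs and report characters processed per second
    start = time.perf_counter()
    for _ in range(repeats):
        splitter(text)
    elapsed = time.perf_counter() - start
    return len(text) * repeats / elapsed

corpus = "The court held that the statute applies. The defendant appealed. " * 10_000
rate = chars_per_sec(naive_split, corpus)
print(f"{rate / 1e6:.1f}M chars/sec")
```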

advanced features

cli tools (v0.6.0+)

nupunkt ships cli utilities for training custom models, running evaluations, and optimizing thresholds:

# evaluate on your own corpus
nupunkt evaluate --input corpus.jsonl

# train a custom model
nupunkt train --input training_data.jsonl --output model.bin

cross-platform model loading

v0.6.0+ supports loading models across platforms, allowing models trained on one OS to be used on another.

interactive demo

try it in the browser: sentences.aleainstitute.ai
