CharBoundary

CharBoundary is a high-performance, character-based text boundary detection library optimized for legal text segmentation. It balances precision and recall for sentence and paragraph detection in large-scale legal applications.

Technical Approach

CharBoundary frames sentence boundary detection as a binary classification problem using Random Forest classifiers that analyze character-level contextual features combined with domain-specific legal knowledge.
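
To make the binary-classification framing concrete, here is a minimal sketch of how annotated text could be turned into labeled examples, one per candidate character. It is an illustration under stated assumptions, not the library's code: the `TERMINATORS` set, the `make_examples` helper, and the character-based `window` are all hypothetical names and choices introduced here.

```python
# Illustrative sketch only; not the CharBoundary implementation.
TERMINATORS = {".", "?", "!"}  # assumed set of candidate characters


def make_examples(text, gold_ends, window=5):
    """Build one (left, char, right, label) example per candidate character.

    gold_ends is the set of indices whose character truly ends a sentence.
    The label is 1 for a real boundary and 0 otherwise.
    """
    examples = []
    for i, ch in enumerate(text):
        if ch not in TERMINATORS:
            continue
        left = text[max(0, i - window):i]    # context before the candidate
        right = text[i + 1:i + 1 + window]   # context after the candidate
        examples.append((left, ch, right, 1 if i in gold_ends else 0))
    return examples


text = "See 12 U.S.C. 1821. The claim failed."
# Gold boundaries: the period after "1821" and the final period.
gold_ends = {text.index("1821.") + 4, len(text) - 1}

examples = make_examples(text, gold_ends)
for left, ch, right, label in examples:
    print(repr(left), repr(ch), repr(right), label)
```

A classifier such as a Random Forest would then be trained on features derived from each context window to predict the label. The three periods inside "U.S.C." become negative examples, which is exactly the case that makes legal text hard for simple splitters.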

Publication

“Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary”

Authors: Michael J. Bommarito II, Daniel Martin Katz, Jillian Bommarito
Published: 2025
Available on arXiv and SSRN

Model Variants

Three pre-trained models are available, each optimized for a different use case:

Small Model

5-token context window
32 decision trees
~85,000 characters/second

Medium Model (Default)

7-token context window
64 decision trees
~280,000 characters/second

Large Model

9-token context window
256 decision trees
~175,000 characters/second
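
Throughput figures like these depend on hardware and input, so it can be useful to measure them locally. The helper below is a hypothetical sketch (the name `measure_throughput` is not part of the library) that times any callable taking a string and reports the best observed characters per second. `str.split` stands in for a real segmenter so the snippet runs without extra dependencies.

```python
import time


def measure_throughput(segment_fn, text, repeats=5):
    """Return the best observed characters/second for segment_fn on text."""
    best = float("inf")
    for _ in range(repeats):
        started = time.perf_counter()
        segment_fn(text)
        elapsed = time.perf_counter() - started
        best = min(best, elapsed)
    # Guard against a zero reading from a coarse timer.
    return len(text) / best if best > 0 else float("inf")


sample = "The court affirmed. The appeal was dismissed. " * 2000
# str.split is only a stand-in; pass a segmenter method to benchmark it instead.
rate = measure_throughput(str.split, sample)
print(f"{rate:,.0f} characters/second")
```

Taking the best of several runs reduces noise from other processes. Comparing model variants this way requires using the same text and machine for each.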

Key Features

Legal-Specific Adaptations

Abbreviation Database: Over 4,000 legal abbreviations
Citation Handling: Proper segmentation of complex legal citations
Hierarchy Recognition: Document structure awareness
Quote Processing: Multi-line quotation support
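
A small sketch shows why abbreviation and citation awareness matters. It compares a naive split on periods with a split that consults an abbreviation list. The four-entry `ABBREVIATIONS` set and both functions are hypothetical illustrations, far simpler than the library's database and classifier.

```python
import re

# Tiny illustrative list; the real abbreviation database is far larger.
ABBREVIATIONS = {"v.", "Cir.", "U.S.", "Inc."}


def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text) if s]


def abbreviation_aware_split(text):
    """Split after a period-final token unless it is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without a final period
        sentences.append(" ".join(current))
    return sentences


text = "Smith v. Jones was decided by the 2d Cir. in 1999. The court affirmed."
print(len(naive_split(text)))               # fragments from the naive rule
print(abbreviation_aware_split(text))       # two sentences
```

A fixed list still fails when an abbreviation genuinely ends a sentence, which is one reason to treat abbreviation membership as a feature for a classifier rather than as a hard rule.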

Technical Capabilities

Character-level feature extraction
Customizable segmentation thresholds
Character span extraction
Secure model serialization with skops
Optional ONNX acceleration (2.1x speedup)
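
The threshold and span capabilities can be pictured with a short sketch. It assumes the classifier yields a boundary probability for each candidate character. The function name `spans_from_probabilities` and the probability values are made up for illustration and are not the library's API or output.

```python
def spans_from_probabilities(text, probabilities, threshold=0.5):
    """Convert boundary probabilities into (start, end) character spans.

    probabilities maps a character index to the estimated probability that
    the character at that index ends a sentence.
    """
    spans, start = [], 0
    for index in sorted(probabilities):
        if probabilities[index] >= threshold:
            end = index + 1
            spans.append((start, end))
            # The next span begins at the next non-whitespace character.
            start = end
            while start < len(text) and text[start].isspace():
                start += 1
    if start < len(text):  # trailing text with no accepted boundary
        spans.append((start, len(text)))
    return spans


text = "Pay by Jan. 5. Late fees apply."
# Invented probabilities for the three periods (indices 10, 13 and 30).
probabilities = {10: 0.35, 13: 0.90, 30: 0.97}

for threshold in (0.5, 0.3):
    spans = spans_from_probabilities(text, probabilities, threshold)
    print(threshold, [text[a:b] for a, b in spans])
```

Lowering the threshold accepts more boundaries, which raises recall at the cost of precision. Here the lower setting wrongly splits after "Jan.".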

Performance Metrics

F1 Score: 0.782 on legal text
Balanced Trade-off: Optimized precision-recall balance
Throughput: Up to 280,000 chars/second
Accuracy: Superior performance on ALEA SBD dataset
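
For readers who want the arithmetic behind an F1 score, the sketch below computes precision, recall, and F1 from predicted and gold boundary offsets. It assumes boundaries are matched by exact character offset, and the published evaluation protocol may differ. The function name and the example offsets are illustrative only.

```python
def boundary_scores(predicted, gold):
    """Return (precision, recall, f1) for two collections of boundary offsets."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented offsets: every prediction is correct, but half the gold
# boundaries are missed.
precision, recall, f1 = boundary_scores({13, 30}, {13, 30, 52, 80})
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a system with perfect precision and low recall still scores poorly. That is why a balanced trade-off matters for this metric.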

Implementation

Installation

pip install charboundary

Basic Usage

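from charboundary import get_default_segmenter

# Example input; replace with your own document text.
text = "The plaintiff filed suit under 42 U.S.C. 1983. The court dismissed the claim."

# Load the default pre-trained model (the medium variant).
segmenter = get_default_segmenter()

# Split the same text into sentences and into paragraphs.
sentences = segmenter.segment_to_sentences(text)
paragraphs = segmenter.segment_to_paragraphs(text)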

Applications

CharBoundary excels in:

E-Discovery: Document preprocessing at scale
Due Diligence: Contract analysis pipelines
Legal Research: Text chunking for retrieval
Compliance: Regulatory document processing

Feature Engineering

The Random Forest classifier considers:

Character type transitions
Legal abbreviation markers
Citation structure patterns
Document hierarchy signals
Punctuation context
Whitespace patterns
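
As a rough picture of what such features might look like, the sketch below builds a small feature dictionary for one candidate position. The feature names, the `char_class` categories, and the tiny `LEGAL_ABBREVIATIONS` set are assumptions made for illustration. The library's actual feature set is not reproduced here.

```python
LEGAL_ABBREVIATIONS = {"U.S.C.", "F.3d", "Cir.", "v."}  # illustrative only


def char_class(ch):
    """Map a character to a coarse type, or "none" at the text edge."""
    if not ch:
        return "none"
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "punct"


def boundary_features(text, i):
    """Describe the context around the candidate character at index i."""
    prev_ch = text[i - 1] if i > 0 else ""
    next_ch = text[i + 1] if i + 1 < len(text) else ""
    # Whitespace-delimited token that contains position i.
    start = text.rfind(" ", 0, i) + 1
    end = text.find(" ", i)
    token = text[start:end if end != -1 else len(text)]
    following = text[i + 1:].lstrip()
    return {
        "prev_class": char_class(prev_ch),
        "next_class": char_class(next_ch),
        "in_known_abbreviation": token in LEGAL_ABBREVIATIONS,
        "next_word_capitalized": following[:1].isupper(),
    }


text = "See 12 U.S.C. 1821. The claim failed."
print(boundary_features(text, 12))  # last period of "U.S.C."
print(boundary_features(text, 18))  # period after "1821"
```

Each dictionary would be encoded as one row of the classifier's input. No single feature decides the outcome, and the forest learns how they combine.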

Comparison with NUPunkt

While NUPunkt optimizes for raw speed and precision, CharBoundary provides:

Better recall for complex boundaries
Flexibility through threshold adjustment
Machine learning adaptability
Balanced F1 performance

Open Source Commitment

Released under the MIT License with:

Full source code availability
Pre-trained models included
Extensive documentation
Interactive demo at sentences.aleainstitute.ai

CharBoundary represents a machine learning approach to legal text segmentation, complementing rule-based methods with adaptive, data-driven boundary detection.