CharBoundary
Character-based text boundary detection for legal documents using Random Forest classifiers, achieving a balanced precision-recall trade-off with an F1 score of 0.782
A high-performance character-based text boundary detection library optimized for legal text segmentation, providing balanced precision-recall tradeoffs for sentence and paragraph detection in large-scale legal applications.
Technical Approach
CharBoundary frames sentence boundary detection as a binary classification problem using Random Forest classifiers that analyze character-level contextual features combined with domain-specific legal knowledge.
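To make the binary-classification framing concrete, the sketch below lists candidate positions in a string and pairs each with a 0/1 label. It is an illustration only: the terminator set, the function names, and the choice to treat only terminal punctuation as candidates are assumptions, not details taken from the CharBoundary source.

```python
# Illustrative sketch; TERMINATORS and both helpers are hypothetical names.
TERMINATORS = {".", "?", "!"}


def candidate_positions(text: str) -> list[int]:
    """Return the index of every character that could end a sentence."""
    return [i for i, ch in enumerate(text) if ch in TERMINATORS]


def label_candidates(text: str, true_boundaries: set[int]) -> list[tuple[int, int]]:
    """Pair each candidate index with 1 (real boundary) or 0 (not a boundary)."""
    return [(i, int(i in true_boundaries)) for i in candidate_positions(text)]


example = "See 42 U.S.C. 1983. The claim fails."
# Only the periods after "1983" and "fails" end a sentence.
labeled = label_candidates(example, true_boundaries={18, 35})
```

A classifier trained on pairs like these learns that the periods inside "U.S.C." are negatives even though they look like sentence terminators.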
Publication
"Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary"
- Authors: Michael J. Bommarito II, Daniel Martin Katz, Jillian Bommarito
- Published: 2025
- Available on arXiv and SSRN
Model Variants
Three pre-trained models optimized for different use cases:
Small Model
- 5 token context window
- 32 decision trees
- ~85,000 characters/second
Medium Model (Default)
- 7 token context window
- 64 decision trees
- ~280,000 characters/second
Large Model
- 9 token context window
- 256 decision trees
- ~175,000 characters/second
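For capacity planning, the listed figures can be turned into a rough time estimate. The table and helper below are a hypothetical convenience, not part of the library; the numbers are copied from the variant lists above, and real throughput will vary with hardware and input.

```python
# Figures copied from the variant lists above; the dict and helper are hypothetical.
MODEL_VARIANTS = {
    "small": {"context_window": 5, "trees": 32, "chars_per_second": 85_000},
    "medium": {"context_window": 7, "trees": 64, "chars_per_second": 280_000},
    "large": {"context_window": 9, "trees": 256, "chars_per_second": 175_000},
}


def estimated_seconds(num_chars: int, variant: str = "medium") -> float:
    """Estimate processing time from the listed approximate throughput."""
    return num_chars / MODEL_VARIANTS[variant]["chars_per_second"]


# A 2.8-million-character corpus on the default (medium) model.
corpus_seconds = estimated_seconds(2_800_000)
```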
Key Features
Legal-Specific Adaptations
- Abbreviation Database: Over 4,000 legal abbreviations
- Citation Handling: Proper segmentation of complex legal citations
- Hierarchy Recognition: Document structure awareness
- Quote Processing: Multi-line quotation support
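The role of an abbreviation database can be illustrated with a toy lookup. The six-entry set and the helper names below are assumptions made for this example; the actual database is described as containing over 4,000 entries, and how the library consults it is not shown here.

```python
# Toy stand-in for an abbreviation database; entries and names are hypothetical.
LEGAL_ABBREVIATIONS = {"u.s.c.", "v.", "f.3d", "cir.", "inc.", "no."}


def token_containing(text: str, index: int) -> str:
    """Return the space-delimited token that contains text[index]."""
    start = text.rfind(" ", 0, index) + 1
    end = text.find(" ", index)
    if end == -1:
        end = len(text)
    return text[start:end]


def is_abbreviation_period(text: str, index: int) -> bool:
    """True if text[index] is a period inside a known abbreviation."""
    if text[index] != ".":
        return False
    return token_containing(text, index).lower() in LEGAL_ABBREVIATIONS


citation = "Smith v. Jones, 42 U.S.C. 1983."
```

In this sketch the periods in "v." and "U.S.C." are flagged as abbreviation periods, while the final period after "1983" is not. Tokens are split on spaces only, which is enough for the example but not for real documents.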
Technical Capabilities
- Character-level feature extraction
- Customizable segmentation thresholds
- Character span extraction
- Secure model serialization with skops
- Optional ONNX acceleration (2.1x speedup)
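Threshold adjustment and span extraction can be sketched without the library. The probabilities below are made up for illustration, and both function names are hypothetical; the sketch only shows how a threshold turns per-position scores into boundaries, and how boundaries become character spans.

```python
def boundaries_from_probabilities(probs: dict[int, float], threshold: float = 0.5) -> list[int]:
    """Keep candidate positions whose boundary probability meets the threshold."""
    return sorted(i for i, p in probs.items() if p >= threshold)


def sentence_spans(text: str, boundaries: list[int]) -> list[tuple[int, int]]:
    """Convert boundary indices into (start, end) spans, end exclusive."""
    spans = []
    start = 0
    for b in boundaries:
        # Skip whitespace between sentences.
        while start <= b and text[start].isspace():
            start += 1
        if start <= b:
            spans.append((start, b + 1))
        start = b + 1
    # Keep any trailing text that has no terminating boundary.
    tail = text[start:]
    if tail.strip():
        lead = len(tail) - len(tail.lstrip())
        spans.append((start + lead, len(text.rstrip())))
    return spans


sample = "See 42 U.S.C. 1983. The claim fails."
# Made-up scores for each period in the sample.
scores = {8: 0.05, 10: 0.04, 12: 0.35, 18: 0.92, 35: 0.99}
strict = sentence_spans(sample, boundaries_from_probabilities(scores, 0.5))
loose = sentence_spans(sample, boundaries_from_probabilities(scores, 0.3))
```

With the 0.5 threshold the sample splits into two spans. Lowering it to 0.3 also accepts the period after "U.S.C." and yields three, which shows how a lower threshold trades precision for recall.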
Performance Metrics
- F1 Score: 0.782 on legal text
- Balanced Trade-off: Optimized precision-recall balance
- Throughput: Up to 280,000 chars/second
- Accuracy: Superior performance on ALEA SBD dataset
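The F1 score is the harmonic mean of precision and recall. The helper below is a minimal sketch of how it could be computed over sets of predicted and gold boundary positions; the function name and the exact-match criterion are assumptions, and the evaluation protocol behind the reported 0.782 may differ.

```python
def precision_recall_f1(predicted: set[int], gold: set[int]) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 over exact boundary positions."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Three predicted boundaries, two of which are correct; both gold boundaries are found.
p, r, f1 = precision_recall_f1(predicted={12, 18, 35}, gold={18, 35})
```

Here precision is 2/3, recall is 1.0, and F1 is 0.8.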
Implementation
Installation
pip install charboundary
Basic Usage
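from charboundary import get_default_segmenter
# Load the default pre-trained model
segmenter = get_default_segmenter()
# Illustrative input; substitute your own document text
text = "The plaintiff sued under 42 U.S.C. 1983. The court dismissed the claim."
sentences = segmenter.segment_to_sentences(text)
paragraphs = segmenter.segment_to_paragraphs(text)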
Applications
CharBoundary excels in:
- E-Discovery: Document preprocessing at scale
- Due Diligence: Contract analysis pipelines
- Legal Research: Text chunking for retrieval
- Compliance: Regulatory document processing
Feature Engineering
The Random Forest classifier considers:
- Character type transitions
- Legal abbreviation markers
- Citation structure patterns
- Document hierarchy signals
- Punctuation context
- Whitespace patterns
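As a rough illustration of character-level context features, the sketch below encodes the character types around a candidate position. The feature names, type categories, and character-based window are assumptions made for this example and are not taken from the library's feature set.

```python
def char_type(ch: str) -> str:
    """Map a character to a coarse type label."""
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    return "other"


def context_features(text: str, index: int, window: int = 2) -> dict[str, str]:
    """Describe the character types in a window around text[index]."""
    features = {}
    for offset in range(-window, window + 1):
        pos = index + offset
        in_range = 0 <= pos < len(text)
        # Positions outside the text are marked as padding.
        features[f"type[{offset:+d}]"] = char_type(text[pos]) if in_range else "pad"
    return features


# Features for the period in "1983. The" (index 4).
features = context_features("1983. The", 4)
```

A digit, period, space, uppercase sequence like this one is the kind of type transition a tree-based classifier can learn to associate with a true boundary.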
Comparison with NUPunkt
While NUPunkt optimizes for raw speed and precision, CharBoundary provides:
- Better recall for complex boundaries
- Flexibility through threshold adjustment
- Machine learning adaptability
- Balanced F1 performance
Open Source Commitment
Released under MIT License with:
- Full source code availability
- Pre-trained models included
- Extensive documentation
- Interactive demo at sentences.aleainstitute.ai
CharBoundary represents a machine learning approach to legal text segmentation, complementing rule-based methods with adaptive, data-driven boundary detection.