CharBoundary

software

Character-based text boundary detection for legal documents using Random Forest classifiers, achieving a balanced precision-recall trade-off with an F1 score of 0.782

period: 2025-present
team: ALEA Institute
tech:
Natural Language Processing, Legal Informatics
══════════════════════════════════════════════════════════════════

A high-performance character-based text boundary detection library optimized for legal text segmentation, providing balanced precision-recall tradeoffs for sentence and paragraph detection in large-scale legal applications.

Technical Approach

CharBoundary frames sentence boundary detection as a binary classification problem using Random Forest classifiers that analyze character-level contextual features combined with domain-specific legal knowledge.
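
To make the binary-classification framing concrete, here is a minimal sketch of how annotated text could be turned into labeled examples. It is not the library's implementation: the `TERMINATORS` set, the `make_examples` helper, and the character-based `window` are all assumptions made for illustration.

```python
# Hypothetical sketch: every candidate character becomes one labeled example.
TERMINATORS = {".", "?", "!"}  # assumed set of candidate boundary characters


def make_examples(text, boundary_offsets, window=3):
    """Return (left_context, char, right_context, label) for each candidate.

    boundary_offsets holds the indices of characters that truly end a sentence.
    """
    examples = []
    for i, ch in enumerate(text):
        if ch not in TERMINATORS:
            continue
        left = text[max(0, i - window):i]
        right = text[i + 1:i + 1 + window]
        label = 1 if i in boundary_offsets else 0
        examples.append((left, ch, right, label))
    return examples


text = "See Fed. R. Civ. P. 56. The motion is denied."
# Only the period after "56" and the final period end a sentence.
true_boundaries = {text.index("56.") + 2, len(text) - 1}
examples = make_examples(text, true_boundaries)

for left, ch, right, label in examples:
    print(repr(left), repr(ch), repr(right), label)
```

Most periods in the sample belong to abbreviations and receive label 0, which is the kind of imbalance a classifier for legal text has to learn to handle.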

Publication

“Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary”

  • Authors: Michael J. Bommarito II, Daniel Martin Katz, Jillian Bommarito
  • Published: 2025
  • Available on arXiv and SSRN

Model Variants

Three pre-trained models optimized for different use cases:

Small Model

  • 5-token context window
  • 32 decision trees
  • ~85,000 characters/second

Medium Model (Default)

  • 7-token context window
  • 64 decision trees
  • ~280,000 characters/second

Large Model

  • 9-token context window
  • 256 decision trees
  • ~175,000 characters/second
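
As a rough planning aid, the approximate throughput figures listed above can be converted into processing-time estimates. The sketch below is back-of-the-envelope arithmetic only: the corpus size is hypothetical, and the estimate ignores model loading, I/O, and hardware differences.

```python
# Approximate throughput figures quoted above (characters per second).
THROUGHPUT = {"small": 85_000, "medium": 280_000, "large": 175_000}


def estimated_seconds(num_chars, model):
    """Estimate seconds needed to segment num_chars with the given variant."""
    return num_chars / THROUGHPUT[model]


corpus_chars = 1_000_000_000  # hypothetical corpus size
for model in THROUGHPUT:
    minutes = estimated_seconds(corpus_chars, model) / 60
    print(f"{model}: about {minutes:.0f} minutes")
```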

Key Features

  • Abbreviation Database: Over 4,000 legal abbreviations
  • Citation Handling: Proper segmentation of complex legal citations
  • Hierarchy Recognition: Document structure awareness
  • Quote Processing: Multi-line quotation support

Technical Capabilities

  • Character-level feature extraction
  • Customizable segmentation thresholds
  • Character span extraction
  • Secure model serialization with skops
  • Optional ONNX acceleration (2.1x speedup)
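
The interaction between segmentation thresholds and character spans can be sketched as follows. This is an illustration under stated assumptions, not the library's API: `spans_from_probabilities` is a hypothetical helper, and the scores stand in for classifier probabilities.

```python
def spans_from_probabilities(text, probabilities, threshold=0.5):
    """Return (start, end) character spans for each segment.

    probabilities maps a character index to an assumed boundary score.
    A segment is cut after every index whose score meets the threshold.
    """
    spans = []
    start = 0
    for index in sorted(probabilities):
        if probabilities[index] >= threshold:
            end = index + 1
            spans.append((start, end))
            # Skip whitespace so the next span starts on visible text.
            start = end
            while start < len(text) and text[start].isspace():
                start += 1
    if start < len(text):
        spans.append((start, len(text)))
    return spans


text = "Dr. Lee testified. The court agreed."
scores = {2: 0.10, 17: 0.95, 35: 0.99}  # hypothetical scores for each period

for start, end in spans_from_probabilities(text, scores, threshold=0.5):
    print((start, end), repr(text[start:end]))
```

Lowering the threshold far enough (for example to 0.05) would also cut after "Dr.", which shows why the threshold is worth tuning per application.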

Performance Metrics

  • F1 Score: 0.782 on legal text
  • Balanced Trade-off: Optimized precision-recall balance
  • Throughput: Up to 280,000 chars/second
  • Accuracy: Superior performance on ALEA SBD dataset
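
For reference, an F1 score is the harmonic mean of precision and recall over predicted boundaries. The helper below is a minimal sketch of that standard formula; the function name and the counts passed to it are illustrative and are not taken from the reported evaluation.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from boundary counts."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only.
print(round(f1_score(true_positives=8, false_positives=2, false_negatives=2), 3))
```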

Implementation

Installation

pip install charboundary

Basic Usage

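from charboundary import get_default_segmenter

# Load the default pre-trained model
segmenter = get_default_segmenter()

# Any string can be segmented; this sample is illustrative
text = "The lease ends on Dec. 31, 2025. Either party may renew it."
sentences = segmenter.segment_to_sentences(text)
paragraphs = segmenter.segment_to_paragraphs(text)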

Applications

CharBoundary excels in:

  • E-Discovery: Document preprocessing at scale
  • Due Diligence: Contract analysis pipelines
  • Legal Research: Text chunking for retrieval
  • Compliance: Regulatory document processing

Feature Engineering

The Random Forest classifier considers:

  • Character type transitions
  • Legal abbreviation markers
  • Citation structure patterns
  • Document hierarchy signals
  • Punctuation context
  • Whitespace patterns
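
A few of the signals listed above can be sketched as a feature dictionary built around a candidate character. This is a simplified illustration, not CharBoundary's actual feature set: the tiny `ABBREVIATIONS` set and the `char_type` and `extract_features` helpers are hypothetical, and citation and hierarchy signals are omitted.

```python
import string

# Tiny illustrative abbreviation set; a real database would be far larger.
ABBREVIATIONS = {"v.", "inc.", "corp.", "no.", "u.s.c."}


def char_type(ch):
    """Map a character to a coarse type label."""
    if ch.isupper():
        return "upper"
    if ch.islower():
        return "lower"
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    if ch in string.punctuation:
        return "punct"
    return "other"


def extract_features(text, i):
    """Build a feature dict for the candidate boundary character at index i."""
    prev_ch = text[i - 1] if i > 0 else ""
    next_ch = text[i + 1] if i + 1 < len(text) else ""
    # First visible character after the candidate, if any.
    after_space = text[i + 1:].lstrip()[:1]
    # Whitespace-delimited token that ends at the candidate, lowercased.
    tokens = text[:i + 1].split()
    token = tokens[-1].lower() if tokens else ""
    return {
        "prev_type": char_type(prev_ch) if prev_ch else "start",
        "next_type": char_type(next_ch) if next_ch else "end",
        "followed_by_space": next_ch.isspace(),
        "next_word_capitalized": after_space.isupper(),
        "is_abbreviation": token in ABBREVIATIONS,
    }


text = "Acme Corp. filed suit. It lost."
abbrev_period = text.index("Corp.") + 4
sentence_period = text.index("suit.") + 4
print(extract_features(text, abbrev_period))
print(extract_features(text, sentence_period))
```

Dictionaries like these would then be vectorized and passed to a tree-based classifier, which can combine weak signals such as "followed by a capitalized word" and "token is a known abbreviation".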

Comparison with NUPunkt

While NUPunkt optimizes for raw speed and precision, CharBoundary provides:

  • Better recall for complex boundaries
  • Flexibility through threshold adjustment
  • Machine learning adaptability
  • Balanced F1 performance
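
The flexibility that threshold adjustment provides can be illustrated with a small sweep over toy data. The scores and labels below are hypothetical, and `confusion_counts` is an illustrative helper rather than part of either library.

```python
def confusion_counts(scores, labels, threshold):
    """Count true positives, false positives, and false negatives."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
    return tp, fp, fn


# Hypothetical classifier scores and gold labels for six candidate positions.
scores = [0.95, 0.80, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.3, 0.5, 0.7):
    tp, fp, fn = confusion_counts(scores, labels, threshold)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy data a lower threshold recovers every true boundary at the cost of extra false positives, while a higher threshold does the reverse. A purely rule-based segmenter has no comparable single control for moving along that trade-off.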

Open Source Commitment

Released under the MIT License.

CharBoundary represents a machine learning approach to legal text segmentation, complementing rule-based methods with adaptive, data-driven boundary detection.
