projects
various software, data, or models I've built in an individual or affiliated capacity
software
NUPunkt
High-precision sentence boundary detection for legal text
KL3M Tokenizers
Domain-specific BPE tokenizers for legal, financial, and governmental text, achieving 9-17% efficiency improvements over GPT-4 and LLaMA3
LexNLP
Natural language processing and information extraction for legal and regulatory text
OpenEDGAR
Open source Python client for SEC EDGAR data access and analysis
CharBoundary
Character-based text boundary detection for legal documents using Random Forest classifiers, achieving balanced precision and recall with an F1 score of 0.782
USBills.ai
Open-source platform using AI and NLP to make US federal legislation accessible through plain language summaries, ELI5 explanations, and readability metrics
ALEA Preprocess
High-performance data preprocessing library for large language model training, supporting pretraining, SFT, and DPO datasets with Rust-powered efficiency
FOLIO Data Generator
Python library for generating synthetic legal data using the FOLIO knowledge graph, supporting both procedural templates and LLM-based generation
FOLIO API
Public RESTful API for the Federated Open Legal Information Ontology, providing programmatic access to 18,000+ legal concepts with multiple output formats
FOLIO Python Client
Python library for interacting with the Federated Open Legal Information Ontology, providing search, exploration, and format conversion capabilities
leeky
Training data contamination detection library for black-box language models, implementing six testing methods to identify potential data leakage
rfcorr
Python library for Random Forest-based correlation measures, providing alternative approaches to traditional correlation analysis using tree-based ensemble methods
pyghcn
Python 3 library for accessing and analyzing NOAA Global Historical Climatology Network (GHCN) weather and climate data
amos3
Python 3 client for the Archive of Many Outdoor Scenes (AMOS), enabling access to billions of outdoor webcam images for computer vision and environmental research
Course materials for "Computer Modeling of Complex Systems" at University of Michigan, teaching agent-based modeling and computational approaches to complex systems
First iteration of the "Computer Modeling of Complex Systems" course at University of Michigan, establishing a foundation for computational complex systems education
Well-Settled Research
A computational legal research project analyzing Supreme Court decisions and legal precedents using natural language processing
datasets
KL3M Dataset
Copyright-clean training resources for large language models
KL3M Data Project
Large-scale copyright-clean dataset containing 132M+ documents and trillions of tokens for training legal language models
Law on the Market
A comprehensive 15-year study examining how Supreme Court decisions impact stock market returns, finding significant abnormal returns in 37% of cases
Open-source legal data standard containing 18,000+ standardized legal concepts with multilingual support for improved legal industry interoperability
Federal Bill Statistics
Original source code and data infrastructure that powered the initial version of the usbills.ai platform
Large-scale empirical analysis of regulatory complexity using 165,000+ SEC filings to map the evolution of the U.S. regulatory landscape
The Race to the Bund
An innovative analysis of European financial integration using eigendecomposition of sovereign bond yield correlations from 1872 to 2010
U.S. Code Complexity
Computational analysis measuring the complexity of the United States Code using mathematical and network science approaches
models
SCOTUS Predict
A machine learning model that predicts Supreme Court voting behavior with 70% accuracy, analyzing 60 years of decisions from 1953-2013
SCOTUS Predict v2
An enhanced Supreme Court prediction model achieving 70.2% accuracy across 200 years of decisions (1816-2015), analyzing over 240,000 justice votes
Comprehensive research examining toxicity and bias in legal language models, demonstrating KL3M's superior safety profile through rigorous testing
Research presenting the NUPunkt and CharBoundary libraries for high-precision sentence segmentation in legal text, achieving 29-32% improvement over general-purpose tools
KL3M Model Research
Research and development repository for advancing the Kelvin Legal Large Language Model family with new architectures and training approaches
Research demonstrating GPT-4's ability to pass the Uniform Bar Examination, significantly outperforming both human test-takers and prior AI models
Research evaluating GPT models' capabilities on the Uniform CPA Examination, exploring AI's potential to transform knowledge work
Initial repository for research evaluating GPT models on the CPA exam, later developed into the comprehensive "GPT as Knowledge Worker" project
Groundbreaking research demonstrating GPT-3.5's performance on the Multistate Bar Examination, predicting AI's ability to pass professional legal licensing exams
FMLGen
A humorous AI project that generates absurd "F*** My Life" stories using modern language models, comparing current capabilities to 2013 n-gram approaches