projects

various software, data, and models I've built in an individual or affiliated capacity

software

NUPunkt

software

High-precision sentence boundary detection for legal text

Natural Language Processing Legal Informatics

Domain-specific BPE tokenizers for legal, financial, and governmental text, achieving 9-17% efficiency improvements over GPT-4 and LLaMA3

Natural Language Processing

LexNLP

software

Natural language processing and information extraction for legal and regulatory text

Natural Language Processing Legal Informatics

OpenEDGAR

software

Open source Python client for SEC EDGAR data access and analysis

Financial Technology

CharBoundary

software

Character-based text boundary detection for legal documents using Random Forest classifiers, achieving balanced precision and recall with an F1 score of 0.782

Natural Language Processing Legal Informatics

USBills.ai

software

Open-source platform using AI and NLP to make US federal legislation accessible through plain language summaries, ELI5 explanations, and readability metrics

Legal Informatics Natural Language Processing

High-performance data preprocessing library for large language model training, supporting pretraining, SFT, and DPO datasets with Rust-powered efficiency

Machine Learning

Python library for generating synthetic legal data using the FOLIO knowledge graph, supporting both procedural templates and LLM-based generation

Legal Informatics

FOLIO API

software

Public RESTful API for the Federated Open Legal Information Ontology, providing programmatic access to 18,000+ legal concepts with multiple output formats

Legal Informatics

Python library for interacting with the Federated Open Legal Information Ontology, providing search, exploration, and format conversion capabilities

Knowledge Representation Legal Informatics

leeky

software

Training data contamination detection library for black-box language models, implementing six testing methods to identify potential data leakage

AI Ethics Machine Learning

rfcorr

software

Python library for Random Forest-based correlation measures, providing alternative approaches to traditional correlation analysis using tree-based ensemble methods

Machine Learning

pyghcn

software

Python 3 library for accessing and analyzing NOAA Global Historical Climatology Network (GHCN) weather and climate data

Climate Science

amos3

software

Python 3 client for the Archive of Many Outdoor Scenes (AMOS), enabling access to billions of outdoor webcam images for computer vision and environmental research

Computer Vision

Course materials for "Computer Modeling of Complex Systems" at University of Michigan, teaching agent-based modeling and computational approaches to complex systems

Complex Systems Educational Technology

First iteration of the "Computer Modeling of Complex Systems" course at University of Michigan, establishing the foundation for computational complex systems education

Complex Systems Educational Technology

A computational legal research project analyzing Supreme Court decisions and legal precedents using natural language processing

Legal Analytics

datasets

Copyright-clean training resources for large language models

Legal Informatics

Large-scale copyright-clean dataset containing 132M+ documents and trillions of tokens for training legal language models

Legal Informatics

A comprehensive 15-year study examining how Supreme Court decisions impact stock market returns, finding significant abnormal returns in 37% of cases

Legal Analytics Financial Analysis

Open-source legal data standard containing 18,000+ standardized legal concepts with multilingual support for improved legal industry interoperability

Knowledge Representation Legal Informatics

Original source code and data infrastructure that powered the initial version of the USBills.ai platform

Legal Informatics

An analysis of European financial integration using eigendecomposition of sovereign bond yield correlations from 1872 to 2010

Financial Analysis

Computational analysis measuring the complexity of the United States Code using mathematical and network science approaches

Computational Law Complex Systems

models

A machine learning model that predicts Supreme Court voting behavior with 70% accuracy, analyzing 60 years of decisions from 1953 to 2013

Legal Analytics Machine Learning

An enhanced Supreme Court prediction model achieving 70.2% accuracy across 200 years of decisions (1816-2015), analyzing over 240,000 justice votes

Legal Analytics Machine Learning

Comprehensive research examining toxicity and bias in legal language models, demonstrating KL3M's superior safety profile through rigorous testing

AI Ethics Legal Informatics

Research presenting the NUPunkt and CharBoundary libraries for high-precision sentence segmentation in legal text, achieving a 29-32% improvement over general-purpose tools

Natural Language Processing Legal Informatics

Research and development repository for advancing the Kelvin Legal Large Language Model family with new architectures and training approaches

Machine Learning Legal Informatics

Research demonstrating GPT-4's ability to pass the Uniform Bar Examination, significantly outperforming both human test-takers and prior AI models

AI Evaluation Legal Informatics

Research evaluating GPT models' capabilities on the Uniform CPA Examination, exploring AI's potential to transform knowledge work

AI Evaluation

Initial repository for research evaluating GPT models on the CPA exam, later developed into the comprehensive "GPT as Knowledge Worker" project

AI Evaluation

Research evaluating GPT-3.5's performance on the Multistate Bar Examination and predicting AI's ability to pass professional legal licensing exams

AI Evaluation Legal Informatics

FMLGen

model

A humorous AI project that generates absurd "F*** My Life" stories using modern language models, comparing current capabilities to 2013 n-gram approaches

Natural Language Processing