This study is the first comprehensive evaluation of a large language model’s performance on the bar exam. It tests OpenAI’s GPT-3.5 (text-davinci-003) on the multiple-choice Multistate Bar Examination (MBE).

Research Impact

This pioneering study laid the groundwork for understanding AI capabilities in legal reasoning:

50.3% accuracy - significantly above the 25% baseline guessing rate for four-option questions
Passing performance in Evidence and Torts sections
88% top-3 accuracy - demonstrating strong partial understanding
Predicted that LLMs would soon pass the bar exam (confirmed by GPT-4 in 2023)

Publication

Authors: Michael James Bommarito, Daniel Martin Katz
Published: December 29, 2022
Paper: Available on arXiv and SSRN

Key Findings

Performance Analysis

Overall Score: 50.3% correct on complete NCBE MBE practice exam
Response Quality: Correct answer among the model's top two choices 71% of the time
Subject Strengths: Evidence and Torts at passing rates
Zero-shot Performance: Fine-tuning gave no benefit over zero-shot prompting at the available data scale
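
The top-1, top-2, and top-3 figures above are all instances of top-k accuracy: the share of questions whose correct answer appears among the model's first k ranked choices. The following is a minimal sketch of that calculation, not the study's own code. The function name `top_k_accuracy` and the toy rankings are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    rankings: Sequence[Sequence[str]],
    answers: Sequence[str],
    k: int,
) -> float:
    """Fraction of questions whose correct answer is among the first k ranked choices."""
    if len(rankings) != len(answers):
        raise ValueError("rankings and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    hits = sum(
        1 for ranked, correct in zip(rankings, answers) if correct in ranked[:k]
    )
    return hits / len(answers)


# Hypothetical toy data: four questions, each with the model's ranked choices.
rankings = [
    ["A", "B", "C"],
    ["C", "A", "D"],
    ["B", "D", "A"],
    ["D", "C", "B"],
]
answers = ["A", "A", "C", "B"]

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(rankings, answers, k):.2f}")
```

With four answer options, random guessing gives an expected top-1 accuracy of 25%, which is the baseline the headline score is compared against.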

Technical Insights

Hyperparameter optimization improved performance
Prompt engineering significantly impacted results
Strong correlation between the model's ranking of answer choices and correctness
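
A sweep over prompts and sampling parameters can be organized as a simple grid search. The sketch below is illustrative only: the parameter names (`temperature`, `prompt`), their values, and the `toy_score` function are assumptions standing in for a real evaluation run, and the scores it produces are not results from the study.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple

Config = Dict[str, Any]


def grid_search(
    score_fn: Callable[[Config], float],
    grid: Dict[str, List[Any]],
) -> Tuple[Config, float]:
    """Evaluate every combination in the grid and return the best-scoring one."""
    keys = list(grid)
    best_cfg: Config = {}
    best_score = float("-inf")
    for values in product(*(grid[key] for key in keys)):
        cfg = dict(zip(keys, values))
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


def toy_score(cfg: Config) -> float:
    """Placeholder scorer (hypothetical). A real run would query the model
    with this configuration and return accuracy on the practice exam."""
    bonus = 0.05 if cfg["prompt"] == "rank_top_3" else 0.0
    return 0.5 - 0.1 * cfg["temperature"] + bonus


# Hypothetical search space; the study's actual grid is not reproduced here.
grid = {
    "temperature": [0.0, 0.5, 1.0],
    "prompt": ["single_answer", "rank_top_3"],
}

best_cfg, best_score = grid_search(toy_score, grid)
print(best_cfg, round(best_score, 2))
```

Passing the scorer in as a callable keeps the search loop independent of any particular model API.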

Methodology

The research evaluated:

Complete MBE practice examinations
Multiple question categories and difficulty levels
Various prompting strategies
Fine-tuning vs zero-shot approaches
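
To make the prompting step concrete, here is a minimal sketch of how a multiple-choice question might be turned into a zero-shot prompt and how a letter answer might be read back from the completion. The template wording and the helper names `build_prompt` and `parse_choice` are hypothetical; they are not the prompts or code used in the study.

```python
import re
from typing import Dict, Optional


def build_prompt(question: str, choices: Dict[str, str]) -> str:
    """Format a question and its lettered choices as a zero-shot prompt."""
    lines = [
        "Answer the following multiple-choice question with a single letter.",
        "",
        question.strip(),
        "",
    ]
    lines += [f"({letter}) {text}" for letter, text in sorted(choices.items())]
    lines += ["", "Answer:"]
    return "\n".join(lines)


def parse_choice(completion: str, valid: str = "ABCD") -> Optional[str]:
    """Return the first standalone uppercase answer letter, or None.

    Deliberately naive: a completion that opens with the article "A"
    would be read as choice A, so real use needs a stricter format.
    """
    match = re.search(rf"\b([{valid}])\b", completion)
    return match.group(1) if match else None


# Hypothetical example question, not taken from an actual exam.
prompt = build_prompt(
    "Which choice is listed second?",
    {"A": "First", "B": "Second", "C": "Third", "D": "Fourth"},
)
print(prompt)
print(parse_choice(" (B) because it is listed second"))
```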

Historical Significance

This research marked a turning point in AI evaluation on professional exams, establishing methodologies and baselines that would be used in subsequent studies. The prediction that “an LLM will pass the MBE component of the Bar Exam in the near future” was validated just months later with GPT-4’s success.

Resources

The repository includes:

Jupyter notebooks with analysis
Performance visualization charts
Prompt examples and optimization strategies
Complete session logs for reproducibility