building a 150k-word english dictionary with llms: opengloss

published: November 25, 2025 · updated: March 19, 2026
datasets: mjbommar/opengloss-dictionary and mjbommar/opengloss-dictionary-definitions on HuggingFace
live api: opengloss.com/api
scale: 150K lexemes, 537K senses, 9.14M edges, 60M words of encyclopedic content
license: CC-BY-4.0 (data), Apache-2.0 (code)

overview

opengloss is a synthetic encyclopedic dictionary and semantic knowledge graph for English. it integrates lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships into a single resource, providing roughly 4.6x more sense definitions than WordNet 3.0 (536,829 vs 117,659).

the datasets are available on HuggingFace under CC-BY-4.0, a live REST API runs at opengloss.com, and the full explorer supports search, browsing, relation puzzles, and typeahead autocomplete.

when to use

  • as a richer alternative to WordNet for NLP tasks requiring sense-level semantic edges
  • as training data for language models needing structured lexical knowledge
  • for educational tools that need etymology and encyclopedic context
  • for knowledge graph construction or relation extraction experiments
  • as a reference dictionary with broader coverage than traditional resources

using the huggingface datasets

install

uv pip install datasets

lexeme-level dataset (150K entries)

the opengloss-dictionary dataset contains one row per word with all senses, edges, etymology, and encyclopedia content (~1.2 GB download):

from datasets import load_dataset

ds = load_dataset("mjbommar/opengloss-dictionary", split="train")
print(f"{len(ds):,} lexemes")  # 150,101 lexemes

# look up a word
results = ds.filter(lambda x: x["word"] == "algorithm")
entry = results[0]

print(entry["parts_of_speech"])        # ['noun']
print(entry["total_senses"])           # 3
print(entry["total_edges"])            # 36
print(entry["all_synonyms"])           # ['formula', 'method', 'procedure', ...]
print(entry["encyclopedia_entry"][:200])  # encyclopedia text
print(entry["etymology_summary"])      # etymology if available

# iterate over individual senses
for sense in entry["senses"]:
    print(f"[{sense['part_of_speech']}:{sense['sense_index']}] {sense['definition']}")
    print(f"  synonyms: {sense['synonyms']}")
    print(f"  hypernyms: {sense['hypernyms']}")
    print(f"  examples: {sense['examples']}")

key columns: word, parts_of_speech, senses, all_definitions, all_synonyms, all_antonyms, all_hypernyms, all_hyponyms, all_collocations, all_examples, etymology_summary, encyclopedia_entry, edges, total_edges

definition-level dataset (537K senses)

the opengloss-dictionary-definitions dataset has one row per sense definition (~310 MB download), useful when you need sense-level granularity:

from datasets import load_dataset

defs = load_dataset("mjbommar/opengloss-dictionary-definitions", split="train")
print(f"{len(defs):,} definitions")  # 536,829 definitions

# filter by part of speech
nouns = defs.filter(lambda x: x["part_of_speech"] == "noun")

# find highly polysemous words
polysemous = defs.filter(lambda x: x["total_senses_for_word"] >= 10)

# get all senses for a word
word_defs = defs.filter(lambda x: x["word"] == "set")
for d in word_defs:
    print(f"[{d['part_of_speech']}:{d['sense_index']}] {d['definition']}")

key columns: word, part_of_speech, sense_index, definition, synonyms, antonyms, hypernyms, hyponyms, examples, collocations, sense_edges, pos_level_edges
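since each row is a single sense, rebuilding a word's full sense inventory is a simple group-by over the columns listed above. a minimal sketch using plain dicts — the sample rows here are illustrative stand-ins shaped like the schema, not actual dataset content:

```python
from collections import defaultdict

# illustrative rows mirroring the definition-level schema
rows = [
    {"word": "set", "part_of_speech": "noun", "sense_index": 0,
     "definition": "A collection of distinct elements."},
    {"word": "set", "part_of_speech": "verb", "sense_index": 0,
     "definition": "To put something in a specified place."},
    {"word": "run", "part_of_speech": "verb", "sense_index": 0,
     "definition": "To move quickly on foot."},
]

# group senses by word, keeping (POS, sense_index, definition) per sense
inventory = defaultdict(list)
for row in rows:
    inventory[row["word"]].append(
        (row["part_of_speech"], row["sense_index"], row["definition"])
    )

for word, senses in inventory.items():
    print(word, len(senses))
```

the same pattern works on the real dataset by iterating `defs` instead of `rows`.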

convert to pandas

import pandas as pd

df = defs.to_pandas()
print(df["part_of_speech"].value_counts())
# noun         278,568
# adjective    144,571
# verb          90,715
# ...

educational drafting dataset (27.6K documents)

opengloss-v1.1-drafting contains 27,635 synthetic educational documents (articles, essays, stories) generated from the vocabulary:

from datasets import load_dataset

drafts = load_dataset("mjbommar/opengloss-v1.1-drafting", split="train")
print(drafts[0]["title"])
print(drafts[0]["content"][:200])

using the live api

the REST API at opengloss.com returns JSON and requires no authentication:

lookup a word

curl -s 'https://opengloss.com/api/lexeme?word=algorithm' | python3 -m json.tool

or from Python:

import requests

resp = requests.get("https://opengloss.com/api/lexeme", params={"word": "algorithm"})
entry = resp.json()
print(entry["all_definitions"])
# ['A finite, stepwise procedure for solving a problem or completing a computation.',
#  'A set of precise rules used to generate a predictable output from given inputs.']
print(entry["all_synonyms"])
# ['formula', 'method', 'procedure', 'process', 'protocol', 'routine', 'rule']

search and typeahead

# fuzzy search
curl -s 'https://opengloss.com/api/search?q=tensor&mode=fuzzy&limit=5' | python3 -m json.tool

# typeahead / prefix search
curl -s 'https://opengloss.com/api/typeahead?q=algo&limit=5&mode=prefix' | python3 -m json.tool

api endpoints

endpoint                                        method   description
/api/lexeme?word=<string>                       GET      lookup by word
/api/lexeme?id=<u32>                            GET      lookup by numeric ID
/api/search?q=<query>&mode=fuzzy|substring      GET      search with fuzzy or substring matching
/api/typeahead?q=<query>&limit=12&mode=prefix   GET      autocomplete suggestions
/api/analytics/trending                         GET      trending searches
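the endpoints compose cleanly into a tiny client. a minimal sketch that only builds request URLs with the standard library — endpoint paths and parameter names are taken from the table above; the `api_url` helper is ours, not part of any official client:

```python
from urllib.parse import urlencode

BASE = "https://opengloss.com/api"

def api_url(endpoint: str, **params) -> str:
    """Build a request URL for an opengloss API endpoint."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE}/{endpoint}?{query}" if query else f"{BASE}/{endpoint}"

print(api_url("lexeme", word="algorithm"))
# https://opengloss.com/api/lexeme?word=algorithm
print(api_url("search", q="tensor", mode="fuzzy", limit=5))
print(api_url("typeahead", q="algo", limit=12, mode="prefix"))
```

pair the URLs with any HTTP client; no authentication header is needed.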

data

metric                 value
lexemes                150,101 (94,106 single-word, 55,995 multi-word)
sense definitions      536,829 (avg 3.58 per lexeme)
semantic edges         9.14 million (5.2M sense-level, 3.9M POS-level)
usage examples         ~1 million
collocations           3 million
encyclopedic content   60 million words (99.7% coverage)
etymology coverage     97.3%
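the headline average follows directly from the counts above. a quick sanity check:

```python
# counts from the data table
single, multi = 94_106, 55_995
lexemes = single + multi        # 150,101
senses = 536_829

print(f"avg senses per lexeme: {senses / lexemes:.2f}")  # 3.58
```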

comparison with other resources

resource         lexemes   senses        edges   content
OpenGloss        150,101   536,829       9.14M   encyclopedia + etymology
WordNet 3.0      147,306   117,659       -       definitions only
BabelNet 5.3     -         23M synsets   -       multilingual
ConceptNet 5.7   8M        -             21M     commonsense

how it was built

opengloss was produced through a multi-agent pipeline in under one week for under $1,000:

  1. lexeme selection — foundation from an American English dictionary (104K words) expanded with 77K pedagogical additions via iterative neighbor-graph traversal
  2. sense generation — two-agent architecture using pydantic-ai: an overview agent determines POS categories, then a POS-details agent generates 1-4 definitions per category with synonyms, antonyms, hypernyms, hyponyms, and examples
  3. graph construction — deterministic extraction producing sense-level edges (5.2M) and POS-level edges (3.9M)
  4. enrichment — separate agents for etymology (97.3% coverage) and encyclopedic content (99.7% coverage, 200-400 words each)
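a rough sketch of the structured output the two-agent step fills in, using plain dataclasses rather than the actual pydantic-ai models — field names are inferred from the dataset columns, and the real schemas may differ:

```python
from dataclasses import dataclass, field

@dataclass
class Sense:
    # one definition produced by the POS-details agent
    definition: str
    synonyms: list[str] = field(default_factory=list)
    antonyms: list[str] = field(default_factory=list)
    hypernyms: list[str] = field(default_factory=list)
    hyponyms: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)

@dataclass
class PosEntry:
    # the overview agent picks POS categories; the POS-details
    # agent then generates 1-4 senses per category
    part_of_speech: str
    senses: list[Sense] = field(default_factory=list)

entry = PosEntry(
    part_of_speech="noun",
    senses=[Sense(
        definition="A finite, stepwise procedure for solving a problem.",
        synonyms=["procedure", "method"],
    )],
)
print(entry.part_of_speech, len(entry.senses))
```

in the real pipeline these would be Pydantic models passed to the agents as structured output types; the deterministic graph-construction step then reads the list fields to emit edges.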

generation model: gpt-4-mini. QA model: Claude Sonnet 4.5.

runtime architecture

the live site is powered by a Rust binary (opengloss-rs, v0.4.3) using:

  • Axum/Tokio async web server
  • FST (finite-state transducer) for zero-copy prefix and fuzzy lookups
  • Rkyv + Zstd compressed data embedded in the binary (~830 MB)
  • single entry lookup: 6.5-10.3 microseconds
  • prefix search (10 results): 1.73 microseconds
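conceptually, the FST prefix lookup behaves like a binary search over a sorted key set followed by a bounded scan. a minimal stand-in using the standard library — an illustration of the behavior, not the actual opengloss-rs implementation:

```python
import bisect

# a sorted vocabulary, as an FST's key set effectively is
words = sorted(["algae", "algebra", "algorithm", "algorithmic", "tensor"])

def prefix_search(prefix: str, limit: int = 10) -> list[str]:
    """Return up to `limit` words starting with `prefix`."""
    start = bisect.bisect_left(words, prefix)  # first candidate >= prefix
    out = []
    for w in words[start:start + limit]:
        if not w.startswith(prefix):
            break  # sorted order: once a word misses, all later ones do
        out.append(w)
    return out

print(prefix_search("algo"))  # ['algorithm', 'algorithmic']
```

the real FST avoids storing full keys at all, walking a shared transition graph instead, which is what makes microsecond zero-copy lookups over 150K keys practical.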

references

  1. Bommarito, M.J. “OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph.” arXiv:2511.18622, November 2025. https://arxiv.org/abs/2511.18622
  2. HuggingFace datasets: opengloss-dictionary, opengloss-dictionary-definitions, opengloss-v1.1-drafting
  3. GitHub: https://github.com/mjbommar/opengloss-rs
  4. Live site: https://opengloss.com/
