OpenGloss | mike bommarito

OpenGloss is a comprehensive English lexical resource that unifies lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a single, freely available dataset and web application.

Project Overview

OpenGloss addresses the fragmentation of English lexical resources by synthesizing dictionary definitions, encyclopedia entries, and semantic networks into one cohesive knowledge graph. The entire resource was generated through an LLM-based pipeline in under one week for under $1,000.

Key Statistics

150,101 lexemes: Modern English words and multi-word expressions
536,829 sense definitions: Comprehensive coverage of word meanings
9.1 million semantic edges: Synonyms, antonyms, hypernyms, hyponyms, and more
3 million collocations: Common word pairings and usage patterns
1 million usage examples: Contextual sentence demonstrations
60 million words: Encyclopedic content for applicable terms
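
A few derived ratios follow directly from the figures above. The short Python sketch below computes them; the variable names are illustrative, and the edge count uses the rounded "9.1 million" figure, so the results are approximate.

```python
# Figures quoted in the Key Statistics list.
lexemes = 150_101
senses = 536_829
edges = 9_100_000  # "9.1 million" is rounded in the source

# Derived averages (approximate because the edge count is rounded).
senses_per_lexeme = senses / lexemes
edges_per_sense = edges / senses
edges_per_lexeme = edges / lexemes

print(f"senses per lexeme: {senses_per_lexeme:.2f}")
print(f"edges per sense:   {edges_per_sense:.1f}")
print(f"edges per lexeme:  {edges_per_lexeme:.1f}")
```

On these numbers, an average lexeme carries roughly 3.6 senses, and each sense participates in about 17 semantic edges.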

Publication

“OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph”

Author: Michael J. Bommarito II
Published: November 2025
Available on arXiv

Components

Datasets (Hugging Face)

opengloss-dictionary: Word-level view with 150K lexemes containing:

Parts of speech and sense counts
Definitions, synonyms, antonyms, hypernyms, hyponyms
Etymological information and cognates
Encyclopedic entries for applicable terms
Collocations, inflections, and derivations
Usage examples and semantic graph edges

opengloss-dictionary-definitions: Sense-level view with 537K individual sense definitions for fine-grained access.
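
The relationship between the two views can be sketched as a flattening step: each word-level record carries its senses, and the sense-level view has one row per sense. The Python example below uses made-up sample records, and the field names (`word`, `senses`, `pos`, `definition`, `sense_index`) are hypothetical; consult the dataset cards for the actual schema.

```python
from typing import Iterator

# Made-up word-level records; real field names and contents may differ.
word_level = [
    {
        "word": "bank",
        "senses": [
            {"pos": "noun", "definition": "A financial institution that accepts deposits."},
            {"pos": "noun", "definition": "The sloping land beside a river."},
            {"pos": "verb", "definition": "To tilt an aircraft while turning."},
        ],
    },
    {
        "word": "gloss",
        "senses": [
            {"pos": "noun", "definition": "A brief explanatory note on a word."},
        ],
    },
]


def to_sense_level(records: list[dict]) -> Iterator[dict]:
    """Yield one row per sense, repeating the parent word on every row."""
    for record in records:
        for index, sense in enumerate(record["senses"]):
            yield {
                "word": record["word"],
                "sense_index": index,
                "pos": sense["pos"],
                "definition": sense["definition"],
            }


sense_level = list(to_sense_level(word_level))
for row in sense_level:
    print(row["word"], row["sense_index"], row["pos"], row["definition"])
```

The word-level view suits lookups that need everything about a lexeme at once, while the flattened view is easier to filter or sample when individual senses are the unit of analysis.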

OpenGloss-RS (Rust Implementation)

The opengloss-rs crate provides high-performance access to the dataset:

Core Features:

Exact-match lookup: Retrieve lexeme IDs from surface forms
Prefix search: FST-backed word completion in microseconds
Substring & fuzzy search: Pattern matching across definitions, synonyms, and encyclopedia content
Graph traversal: Navigate semantic relations efficiently
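
To illustrate the first two features, the Python sketch below implements exact-match lookup and prefix completion over a small sorted vocabulary. It uses a dictionary and binary search as a stand-in for the FST; it is not the opengloss-rs API, and the vocabulary, IDs, and function names are invented for illustration.

```python
from bisect import bisect_left
from typing import Optional

# Toy vocabulary; a lexeme's ID is simply its position in the sorted list.
VOCAB = sorted(["gloss", "glossary", "glossy", "glow", "lexeme", "lexicon", "open"])
LEXEME_ID = {word: i for i, word in enumerate(VOCAB)}


def lookup(surface: str) -> Optional[int]:
    """Exact-match lookup: surface form -> lexeme ID, or None if absent."""
    return LEXEME_ID.get(surface)


def complete(prefix: str, limit: int = 10) -> list[str]:
    """Return up to `limit` vocabulary entries that start with `prefix`."""
    # Binary search finds the first entry >= prefix; because the list is
    # sorted, all entries sharing the prefix sit contiguously after it.
    start = bisect_left(VOCAB, prefix)
    matches = []
    for word in VOCAB[start:]:
        if not word.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(word)
    return matches


print(lookup("lexicon"))
print(complete("glos"))
```

A sorted list gives the same ordered-prefix behavior as an FST on a toy scale; an FST additionally shares prefixes and suffixes between keys, which is what keeps a 150K-entry index compact.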

Technical Stack:

Rust 1.85+ (2024 edition)
FST (Finite-State Transducer): Zero-copy prefix lookups
Rkyv: Zero-copy deserialization for packed data structures
Zstd: Compression for embedded archives
RapidFuzz: Weighted fuzzy matching
Axum/Tokio: HTTP framework for web server

Performance:

Initial data store decompression: ~349 ms
Entry lookups: 6.5-10.3 microseconds
Prefix searches: 1.73-4.12 microseconds
Memory footprint: ~3.1 GB after initialization

OpenGloss.com

The live web application at opengloss.com provides:

Full-text search with fuzzy matching mode
Type-ahead suggestions powered by an offline trie
Browse index and random entry discovery
Lexeme of the day highlighting
Relation puzzles connecting synonyms and concepts
Seven Senses Challenge word-path game
JSON API endpoints: /api/typeahead, /api/search, /api/lexeme
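
As a rough illustration of how a client might address these endpoints, the Python sketch below builds request URLs without sending them. Only the three paths come from the list above; the `https://opengloss.com` base, the query parameter name `q`, and the helper `build_url` are assumptions, so check the live API before relying on them.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://opengloss.com"  # assumed base; only the paths are listed above
ENDPOINTS = {
    "typeahead": "/api/typeahead",
    "search": "/api/search",
    "lexeme": "/api/lexeme",
}


def build_url(endpoint: str, **params: str) -> str:
    """Build a request URL for one of the listed endpoints.

    Query parameter names are not documented on this page; callers pass
    whatever the live API expects (the `q` used below is a guess).
    """
    path = ENDPOINTS[endpoint]  # raises KeyError for an unknown endpoint
    url = urljoin(BASE_URL, path)
    query = urlencode(params)
    return f"{url}?{query}" if query else url


print(build_url("typeahead", q="glos"))
print(build_url("search", q="river bank"))
```

Keeping URL construction separate from the HTTP call makes it easy to swap in the real parameter names once they are confirmed.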

Comparison to Existing Resources

OpenGloss is comparable in scope to WordNet 3.1 while providing:

About 4x as many sense definitions as comparable resources such as WordNet 3.1
Unified encyclopedic content integrated with definitions
Etymology and cognate tracking across word families
Modern vocabulary including contemporary terms

Use Cases

NLP Research: Training data for word sense disambiguation, semantic similarity, and knowledge graph completion
Educational Tools: Plain-language definitions accessible to general audiences
Language Learning: Usage examples, collocations, and semantic relationships
Question Answering: Structured knowledge for retrieval-augmented generation
Text Classification: Feature extraction from semantic relationships

Technical Innovation

The build-time compilation strategy in opengloss-rs transforms the dataset into compressed artifacts embedded directly in the binary via include_bytes!. The embedded archives are decompressed once at initialization, so repeated queries incur no decompression overhead while the binary itself stays compact. Relations are pre-resolved to lexeme IDs, so graph traversals follow IDs directly rather than resolving each target by its surface form.
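
The idea of pre-resolving relations can be shown in a few lines of Python. In this sketch, relation targets are converted from strings to integer IDs once, in a build step, so traversal only follows integer indices. The sample data, function names, and adjacency-list layout are illustrative assumptions, not the structures used in opengloss-rs.

```python
from collections import deque
from typing import Optional

# Toy relations keyed by surface form (made-up sample data).
RAW_EDGES = {
    "dog": ["canine", "puppy"],
    "canine": ["mammal"],
    "puppy": ["dog"],
    "mammal": ["animal"],
    "animal": [],
}

# Build step: assign IDs and resolve every edge target to an ID exactly once.
WORDS = sorted(RAW_EDGES)
ID_OF = {word: i for i, word in enumerate(WORDS)}
ADJACENCY = [[ID_OF[target] for target in RAW_EDGES[word]] for word in WORDS]


def shortest_path(source: int, goal: int) -> Optional[list[int]]:
    """Breadth-first search over ID-resolved edges; returns a list of IDs or None."""
    parent = {source: source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk parent links back to the source, then reverse.
            path = [node]
            while path[-1] != source:
                path.append(parent[path[-1]])
            return path[::-1]
        for neighbor in ADJACENCY[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


path = shortest_path(ID_OF["dog"], ID_OF["animal"])
print([WORDS[i] for i in path] if path else "no path")
```

At query time the traversal never touches a string: surface forms are mapped to IDs on the way in and back to words on the way out.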

Availability

All resources are freely available under CC-BY-4.0:

Live Website: opengloss.com
GitHub Repository: opengloss-rs
Hugging Face Datasets: opengloss-dictionary, opengloss-dictionary-definitions
arXiv Paper: 2511.18622

Impact

OpenGloss demonstrates that comprehensive lexical resources can be synthesized efficiently using modern LLM pipelines, democratizing access to high-quality linguistic data that previously required decades of expert curation. The project provides both the raw data for researchers and a production-ready implementation for practical applications.