OpenGloss
A synthetic encyclopedic dictionary and semantic knowledge graph for English with 537K sense definitions, 9.1M semantic edges, and 60M words of encyclopedic content
OpenGloss is a comprehensive English lexical resource that unifies lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a single, freely available dataset and web application.
Project Overview
OpenGloss addresses the fragmentation of English lexical resources by synthesizing dictionary definitions, encyclopedia entries, and semantic networks into one cohesive knowledge graph. The entire resource was generated through an LLM-based pipeline in under one week for under $1,000.
Key Statistics
- 150,101 lexemes: Modern English words and multi-word expressions
- 536,829 sense definitions: Comprehensive coverage of word meanings
- 9.1 million semantic edges: Synonyms, antonyms, hypernyms, hyponyms, and more
- 3 million collocations: Common word pairings and usage patterns
- 1 million usage examples: Contextual sentence demonstrations
- 60 million words: Encyclopedic content for applicable terms
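The headline figures imply a few ratios that help gauge the density of the resource. The sketch below derives them from the numbers listed above. The 9.1M and 60M figures are rounded in the source, so the results are approximate, and the per-lexeme word count averages over all lexemes even though encyclopedic content exists only for applicable terms.

```python
# Figures copied from the Key Statistics list; the last two are rounded there.
LEXEMES = 150_101
SENSES = 536_829
SEMANTIC_EDGES = 9_100_000        # "9.1 million"
ENCYCLOPEDIC_WORDS = 60_000_000   # "60 million"

# Simple averages over the whole vocabulary.
senses_per_lexeme = SENSES / LEXEMES
edges_per_lexeme = SEMANTIC_EDGES / LEXEMES
words_per_lexeme = ENCYCLOPEDIC_WORDS / LEXEMES

print(f"senses per lexeme: {senses_per_lexeme:.2f}")
print(f"semantic edges per lexeme: {edges_per_lexeme:.1f}")
print(f"encyclopedic words per lexeme: {words_per_lexeme:.0f}")
```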
Publication
“OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph”
- Author: Michael J. Bommarito II
- Published: November 2025
- Available on arXiv
Components
Datasets (Hugging Face)
opengloss-dictionary: Word-level view with 150K lexemes containing:
- Parts of speech and sense counts
- Definitions, synonyms, antonyms, hypernyms, hyponyms
- Etymological information and cognates
- Encyclopedic entries for applicable terms
- Collocations, inflections, and derivations
- Usage examples and semantic graph edges
opengloss-dictionary-definitions: Sense-level view with 537K individual sense definitions for fine-grained access.
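The relationship between the two views can be pictured as a flattening step: one word-level record expands into one row per sense. The sketch below is illustrative only. The field names (`word`, `senses`, `pos`, `definition`, `sense_index`) and the sample records are assumptions made up for this example, not the published dataset schema.

```python
# Toy word-level records. Field names and contents are hypothetical,
# invented for this sketch; they are not the actual dataset schema.
word_records = [
    {
        "word": "bank",
        "senses": [
            {"pos": "noun", "definition": "An institution that holds deposits."},
            {"pos": "noun", "definition": "The sloping land beside a river."},
        ],
    },
    {
        "word": "glow",
        "senses": [
            {"pos": "verb", "definition": "To give off steady light."},
        ],
    },
]


def to_sense_rows(records):
    """Flatten word-level records into one row per sense."""
    rows = []
    for record in records:
        for index, sense in enumerate(record["senses"]):
            # Each row repeats the word and records the sense's position.
            rows.append({"word": record["word"], "sense_index": index, **sense})
    return rows


sense_rows = to_sense_rows(word_records)
for row in sense_rows:
    print(row["word"], row["sense_index"], row["pos"], row["definition"])
```

The word-level shape suits lookups that need everything about a lexeme at once, while the flattened shape suits tasks that treat each sense as its own example.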
OpenGloss-RS (Rust Implementation)
The opengloss-rs crate provides high-performance access to the dataset:
Core Features:
- Exact-match lookup: Retrieve lexeme IDs from surface forms
- Prefix search: FST-backed word completion in microseconds
- Substring & fuzzy search: Pattern matching across definitions, synonyms, and encyclopedia content
- Graph traversal: Navigate semantic relations efficiently
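To make the exact-match and prefix-search features concrete, here is a minimal sketch of the idea using a sorted word list and binary search. This is a deliberately simplified stand-in, written in Python for brevity: the crate itself is Rust and uses an FST, which this code does not reproduce. The class name `PrefixIndex` and its methods are hypothetical and are not part of the opengloss-rs API.

```python
from bisect import bisect_left


class PrefixIndex:
    """Sorted-list stand-in for an FST-backed index (illustrative only)."""

    def __init__(self, words):
        # Sort once up front; a word's position doubles as a toy lexeme ID.
        self._words = sorted(set(words))

    def lookup(self, word):
        """Exact match: return the word's ID, or None if it is absent."""
        i = bisect_left(self._words, word)
        if i < len(self._words) and self._words[i] == word:
            return i
        return None

    def complete(self, prefix, limit=10):
        """Prefix search: matching words are contiguous in sorted order."""
        start = bisect_left(self._words, prefix)
        matches = []
        for word in self._words[start:]:
            if not word.startswith(prefix) or len(matches) >= limit:
                break
            matches.append(word)
        return matches


index = PrefixIndex(["open", "opengloss", "gloss", "glossary", "glow"])
print(index.lookup("open"))
print(index.complete("glo"))
```

Both operations cost a binary search plus the size of the result. An FST offers the same kind of ordered lookup while also compressing shared prefixes and suffixes.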
Technical Stack:
- Rust 1.82+ with 2024 edition
- FST (Finite-State Transducer): Zero-copy prefix lookups
- Rkyv: Zero-copy deserialization for packed data structures
- Zstd: Compression for embedded archives
- RapidFuzz: Weighted fuzzy matching
- Axum/Tokio: HTTP framework for web server
Performance:
- Initial data store decompression: ~349ms
- Entry lookups: 6.5-10.3 microseconds
- Prefix searches: 1.73-4.12 microseconds
- Memory footprint: ~3.1GB after initialization
OpenGloss.com
The live web application at opengloss.com provides:
- Full-text search with fuzzy matching mode
- Type-ahead suggestions powered by an offline trie mechanism
- Browse index and random entry discovery
- Lexeme of the day highlighting
- Relation puzzles connecting synonyms and concepts
- Seven Senses Challenge word-path game
- JSON API endpoints:
/api/typeahead, /api/search, /api/lexeme
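A client would typically call these endpoints by composing a URL and parsing the JSON response. The sketch below only builds the URLs and sends no request. The endpoint paths come from the list above, but the `https` scheme, the query parameter name `q`, and the helper `build_api_url` are assumptions for illustration; the actual parameters are not documented here and should be checked against the live API.

```python
from urllib.parse import urlencode, urlunsplit

BASE_HOST = "opengloss.com"


def build_api_url(endpoint, **params):
    """Compose a URL for one of the listed /api/ endpoints.

    The keyword arguments become query parameters. Their names are
    guesses for illustration, not a documented contract.
    """
    query = urlencode(params)
    return urlunsplit(("https", BASE_HOST, f"/api/{endpoint}", query, ""))


# "q" is a hypothetical parameter name.
url = build_api_url("search", q="serendipity")
print(url)
```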
Comparison to Existing Resources
OpenGloss is comparable in scope to WordNet 3.1 while providing:
- 4x more sense definitions than comparable resources
- Unified encyclopedic content integrated with definitions
- Etymology and cognate tracking across word families
- Modern vocabulary including contemporary terms
Use Cases
- NLP Research: Training data for word sense disambiguation, semantic similarity, and knowledge graph completion
- Educational Tools: Plain-language definitions accessible to general audiences
- Language Learning: Usage examples, collocations, and semantic relationships
- Question Answering: Structured knowledge for retrieval-augmented generation
- Text Classification: Feature extraction from semantic relationships
Technical Innovation
The build-time compilation strategy in opengloss-rs transforms the dataset into compressed artifacts embedded directly in the binary via include_bytes!. The archives are decompressed once at initialization, so repeated queries incur no further decompression overhead, while the binary itself stays compact. Relations are pre-resolved to lexeme IDs, enabling efficient graph traversals without indirection.
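The benefit of pre-resolving relations can be seen in a small sketch: string targets are converted to integer IDs once, and every later traversal works on integers alone. The example is written in Python for brevity and uses toy data invented for illustration. It does not reproduce the crate's Rust data structures, and the names `adjacency` and `shortest_path` are hypothetical.

```python
from collections import deque

# Toy vocabulary and string-keyed relations, made up for this sketch.
words = ["big", "large", "huge", "small", "tiny"]
raw_relations = {
    "big": ["large", "huge"],
    "large": ["big"],
    "huge": ["big"],
    "small": ["tiny"],
    "tiny": ["small"],
}

# "Build time": resolve every relation target to an integer ID once.
word_to_id = {word: i for i, word in enumerate(words)}
adjacency = [[word_to_id[t] for t in raw_relations.get(w, [])] for w in words]


def shortest_path(start_id, goal_id):
    """Breadth-first search over integer IDs; no string lookups at query time."""
    previous = {start_id: None}
    queue = deque([start_id])
    while queue:
        node = queue.popleft()
        if node == goal_id:
            # Walk the predecessor chain back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None  # the two lexemes are not connected


path = shortest_path(word_to_id["large"], word_to_id["huge"])
print([words[i] for i in path])
```

Because each neighbor is already an index into `adjacency`, a traversal step is a direct list access rather than a dictionary lookup on a surface form.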
Availability
All resources are freely available under CC-BY-4.0:
- Live Website: opengloss.com
- GitHub Repository: opengloss-rs
- Hugging Face Datasets: opengloss-dictionary, opengloss-dictionary-definitions
- arXiv Paper: 2511.18622
Impact
OpenGloss demonstrates that comprehensive lexical resources can be synthesized efficiently using modern LLM pipelines, democratizing access to high-quality linguistic data that previously required decades of expert curation. The project provides both the raw data for researchers and a production-ready implementation for practical applications.