
OpenGloss

software

A synthetic encyclopedic dictionary and semantic knowledge graph for English with 537K sense definitions, 9.1M semantic edges, and 60M words of encyclopedic content

period: 2025-present
tech: Natural Language Processing, Knowledge Graphs, Rust, Lexicography

OpenGloss is a comprehensive English lexical resource that unifies lexicographic definitions, encyclopedic context, etymological histories, and semantic relationships in a single, freely available dataset and web application.

Project Overview

OpenGloss addresses the fragmentation of English lexical resources by synthesizing dictionary definitions, encyclopedia entries, and semantic networks into one cohesive knowledge graph. The entire resource was generated through an LLM-based pipeline in under one week for under $1,000.

Key Statistics

  • 150,101 lexemes: Modern English words and multi-word expressions
  • 536,829 sense definitions: Comprehensive coverage of word meanings
  • 9.1 million semantic edges: Synonyms, antonyms, hypernyms, hyponyms, and more
  • 3 million collocations: Common word pairings and usage patterns
  • 1 million usage examples: Contextual sentence demonstrations
  • 60 million words: Encyclopedic content for applicable terms

Publication

“OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph”

  • Author: Michael J. Bommarito II
  • Published: November 2025
  • Available on arXiv

Components

Datasets (Hugging Face)

opengloss-dictionary: Word-level view with 150K lexemes containing:

  • Parts of speech and sense counts
  • Definitions, synonyms, antonyms, hypernyms, hyponyms
  • Etymological information and cognates
  • Encyclopedic entries for applicable terms
  • Collocations, inflections, and derivations
  • Usage examples and semantic graph edges

opengloss-dictionary-definitions: Sense-level view with 537K individual sense definitions for fine-grained access.

OpenGloss-RS (Rust Implementation)

The opengloss-rs crate provides high-performance access to the dataset:

Core Features:

  • Exact-match lookup: Retrieve lexeme IDs from surface forms
  • Prefix search: FST-backed word completion in microseconds
  • Substring & fuzzy search: Pattern matching across definitions, synonyms, and encyclopedia content
  • Graph traversal: Navigate semantic relations efficiently
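The prefix-search behavior can be illustrated without the FST machinery. The sketch below is not the opengloss-rs API: it uses a plain sorted slice and binary search from the standard library as a stand-in for an FST, and the function name `complete` and its parameters are hypothetical. It relies on the same property an FST index exploits, namely that all words sharing a prefix sit next to each other in sorted order.

```rust
/// Return up to `limit` words from `sorted_words` that start with `prefix`.
///
/// Assumption: `sorted_words` is sorted in ascending byte order.
/// This is a binary search over a slice, not a real FST.
pub fn complete<'a>(sorted_words: &[&'a str], prefix: &str, limit: usize) -> Vec<&'a str> {
    // Index of the first word that is not smaller than the prefix.
    let start = sorted_words.partition_point(|w| *w < prefix);

    // Words sharing the prefix form one contiguous run starting at `start`.
    sorted_words[start..]
        .iter()
        .take_while(|w| w.starts_with(prefix))
        .take(limit)
        .copied()
        .collect()
}
```

Calling `complete(&words, "gloss", 10)` on a sorted word list returns the matching completions in sorted order. An FST gives the same answers while also compressing shared prefixes and suffixes, which matters at the scale of 150K lexemes.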

Technical Stack:

  • Rust 1.82+ with 2024 edition
  • FST (Finite-State Transducer): Zero-copy prefix lookups
  • Rkyv: Zero-copy deserialization for packed data structures
  • Zstd: Compression for embedded archives
  • RapidFuzz: Weighted fuzzy matching
  • Axum/Tokio: HTTP framework for web server

Performance:

  • Initial data store decompression: ~349 ms
  • Entry lookups: 6.5-10.3 microseconds
  • Prefix searches: 1.73-4.12 microseconds
  • Memory footprint: ~3.1 GB after initialization

OpenGloss.com

The live web application at opengloss.com provides:

  • Full-text search with fuzzy matching mode
  • Type-ahead suggestions powered by a trie built offline
  • Browse index and random entry discovery
  • Lexeme of the day highlighting
  • Relation puzzles connecting synonyms and concepts
  • Seven Senses Challenge word-path game
  • JSON API endpoints: /api/typeahead, /api/search, /api/lexeme

Comparison to Existing Resources

OpenGloss is comparable in scope to WordNet 3.1 while providing:

  • Roughly 4x as many sense definitions as comparable resources such as WordNet 3.1
  • Unified encyclopedic content integrated with definitions
  • Etymology and cognate tracking across word families
  • Modern vocabulary including contemporary terms

Use Cases

  • NLP Research: Training data for word sense disambiguation, semantic similarity, and knowledge graph completion
  • Educational Tools: Plain-language definitions accessible to general audiences
  • Language Learning: Usage examples, collocations, and semantic relationships
  • Question Answering: Structured knowledge for retrieval-augmented generation
  • Text Classification: Feature extraction from semantic relationships

Technical Innovation

The build-time compilation strategy in opengloss-rs transforms the dataset into compressed artifacts embedded directly in the binary via include_bytes!. The data store is decompressed once at initialization (the ~349 ms reported under Performance), so repeated queries pay no decompression cost and the binary stays compact. Relations are pre-resolved to lexeme IDs, so graph traversals follow integer IDs directly instead of looking up surface strings at each step.
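The benefit of pre-resolved relations can be sketched as follows. This is an illustrative layout, not the actual opengloss-rs data structure: `RelationGraph`, its `edges` field, and `shortest_path` are hypothetical names, and the adjacency list is assumed to hold lexeme IDs as `u32` indices. Because each edge is already an ID, a breadth-first search can index straight into the vector with no string lookups.

```rust
use std::collections::VecDeque;

/// Hypothetical layout: `edges[id]` lists the lexeme IDs that lexeme `id`
/// relates to, with every relation already resolved to an ID.
pub struct RelationGraph {
    edges: Vec<Vec<u32>>,
}

impl RelationGraph {
    pub fn new(edges: Vec<Vec<u32>>) -> Self {
        Self { edges }
    }

    /// Breadth-first search for a shortest chain of relations from `start` to `goal`.
    pub fn shortest_path(&self, start: u32, goal: u32) -> Option<Vec<u32>> {
        let n = self.edges.len();
        if start as usize >= n || goal as usize >= n {
            return None;
        }
        // parent[v] records the node from which v was first reached.
        let mut parent: Vec<Option<u32>> = vec![None; n];
        let mut seen = vec![false; n];
        let mut queue = VecDeque::from([start]);
        seen[start as usize] = true;

        while let Some(node) = queue.pop_front() {
            if node == goal {
                // Walk the parent links back to `start`, then reverse.
                let mut path = vec![goal];
                let mut cur = goal;
                while let Some(p) = parent[cur as usize] {
                    path.push(p);
                    cur = p;
                }
                path.reverse();
                return Some(path);
            }
            for &next in &self.edges[node as usize] {
                let idx = next as usize;
                // Skip IDs outside the graph and nodes already visited.
                if idx < n && !seen[idx] {
                    seen[idx] = true;
                    parent[idx] = Some(node);
                    queue.push_back(next);
                }
            }
        }
        None
    }
}
```

A word-path game such as the one on the website could plausibly be built on a search like this, though the source does not say how that feature is implemented.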

Availability

All resources are freely available under CC-BY-4.0.

Impact

OpenGloss demonstrates that comprehensive lexical resources can be synthesized efficiently using modern LLM pipelines, democratizing access to high-quality linguistic data that previously required decades of expert curation. The project provides both the raw data for researchers and a production-ready implementation for practical applications.
