FOLIO Data Generator
Python library for generating synthetic legal data using the FOLIO knowledge graph, supporting both procedural templates and LLM-based generation
══════════════════════════════════════════════════════════════════
A versatile Python library for generating synthetic legal data using the FOLIO (formerly SOLI) knowledge graph, providing both template-based and AI-powered generation methods for creating realistic legal documents and datasets.
Overview
The FOLIO Data Generator addresses the critical need for synthetic legal data in AI development, testing, and research. By leveraging the standardized FOLIO ontology, it generates contextually accurate legal text while avoiding privacy and confidentiality concerns.
Generation Methods
1. Procedural Template Generation
Uses templates with FOLIO and Faker tags for structured generation:
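from soli import SOLI
from soli_data_generator.procedural.template import TemplateFormatter

# Load the knowledge graph and create the formatter that fills in template tags.
soli_graph = SOLI()
formatter = TemplateFormatter()

# Tags in <|name|> form are replaced with generated values.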
template = """
LEGAL NOTICE
Company: <|company|>
Industry: <|industry|>
Legal Issue: <|area_of_law|>
Jurisdiction: <|jurisdiction|>
Date: <|date|>
The party of the first part, <|company|>, hereby notifies
all concerned parties regarding matters pertaining to <|area_of_law|>.
"""
formatted_text = formatter(template)
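To make the tag mechanism concrete, here is a minimal standard-library sketch of how `<|tag|>` substitution could work. It is not the library's implementation: `TAG_PATTERN`, `SAMPLE_VALUES`, and `fill_template` are hypothetical names, and the hard-coded value lists stand in for the FOLIO and Faker lookups described above. This sketch samples each occurrence independently, so a tag that appears twice may receive two different values.

```python
import random
import re

# Hypothetical pattern for the <|name|> tag syntax shown in the template above.
TAG_PATTERN = re.compile(r"<\|([a-z_]+)\|>")

# Illustrative stand-in values (assumption); the real library is described
# as drawing these from FOLIO and Faker.
SAMPLE_VALUES = {
    "company": ["Acme Holdings LLC", "Northwind Traders Inc."],
    "area_of_law": ["Contract Law", "Employment Law"],
    "jurisdiction": ["Delaware", "New York"],
}


def fill_template(template: str, rng: random.Random) -> str:
    """Replace each known tag with a sampled value; leave unknown tags as-is."""

    def replace(match: re.Match) -> str:
        choices = SAMPLE_VALUES.get(match.group(1))
        return rng.choice(choices) if choices else match.group(0)

    return TAG_PATTERN.sub(replace, template)


filled = fill_template(
    "Company: <|company|>\nLegal Issue: <|area_of_law|>\nDate: <|date|>",
    random.Random(0),  # fixed seed so the output is reproducible
)
print(filled)
```

Leaving unknown tags in place, rather than raising an error, makes unresolved tags easy to spot in the output.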
2. LLM-Based Generation
Uses a large language model (LLM) to generate more varied content:
from alea_llm_client import VLLMModel
from soli_data_generator.llm.text import TextGenerator

model = VLLMModel()
generator = TextGenerator(model)

# Generate with constraints
generated_text = generator(
    document_type="contract",
    jurisdiction="Delaware",
    parties=2,
    complexity="medium",
)
Key Features
Template System
- FOLIO Tags: Access to 18,000+ legal concepts
- Faker Integration: Realistic names, dates, addresses
- Custom Tags: Extensible tag system
- Nested Templates: Complex document structures
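As an illustration of custom tags and nested templates, the following sketch registers tag providers through a decorator and expands tags repeatedly, so a provider may itself return text containing further tags. The names `register_tag`, `PROVIDERS`, and `expand` are assumptions made for this example and are not taken from the library's API.

```python
import re
from typing import Callable, Dict

TAG_PATTERN = re.compile(r"<\|([a-z_]+)\|>")

# Hypothetical registry: tag name -> zero-argument provider function.
PROVIDERS: Dict[str, Callable[[], str]] = {}


def register_tag(name: str):
    """Decorator that registers a provider function under a tag name."""

    def decorator(func: Callable[[], str]) -> Callable[[], str]:
        PROVIDERS[name] = func
        return func

    return decorator


@register_tag("court")
def court() -> str:
    return "Court of Chancery"


@register_tag("caption")
def caption() -> str:
    # A provider may return text with further tags (a nested template).
    return "IN THE <|court|>"


def expand(text: str, max_depth: int = 5) -> str:
    """Expand tags pass by pass until nothing changes or max_depth is hit."""
    for _ in range(max_depth):
        expanded = TAG_PATTERN.sub(
            lambda m: PROVIDERS[m.group(1)]() if m.group(1) in PROVIDERS else m.group(0),
            text,
        )
        if expanded == text:
            break
        text = expanded
    return text


result = expand("<|caption|>\nRe: <|matter|>")
print(result)
```

The depth limit guards against templates that reference each other in a cycle.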
Document Types
- Contracts and agreements
- Legal notices and disclaimers
- Court filings and motions
- Regulatory documents
- Corporate documents
Data Quality
- Ontology-driven accuracy
- Consistent legal terminology
- Jurisdiction-aware content
- Realistic formatting
Installation
pip install soli-data-generator
Use Cases
AI Training Data
Generate diverse, labeled datasets for:
- Contract analysis models
- Legal NER systems
- Document classification
- Compliance checking
Testing and Development
Create test data for:
- Legal tech applications
- Document management systems
- E-discovery platforms
- Contract lifecycle tools
Research
Support academic research with:
- Controlled experimental data
- Privacy-preserving datasets
- Benchmark generation
- Ablation studies
Architecture
Core Components
- Template Engine: Parses and processes templates
- Tag Resolver: Maps tags to FOLIO concepts
- Generation Pipeline: Orchestrates generation flow
- Output Formatters: Multiple format support
Integration Points
- FOLIO/SOLI knowledge graph
- Faker for realistic data
- LLM APIs for AI generation
- Custom tag providers
Advanced Features
Batch Generation
# Generate multiple documents
documents = generator.batch_generate(
    template="contract_template.txt",
    count=1000,
    variations=True,
)
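The idea behind batch generation with variations can be sketched with a seeded random generator that yields one record at a time. This is a stand-alone illustration under stated assumptions, not the behavior of `batch_generate`: the function `generate_batch`, the record fields, and the value lists are all hypothetical.

```python
import random
from typing import Dict, Iterator

# Illustrative value pools (assumption), not drawn from FOLIO.
JURISDICTIONS = ["Delaware", "New York", "California"]
DOCUMENT_TYPES = ["contract", "legal notice", "motion"]


def generate_batch(count: int, seed: int = 0) -> Iterator[Dict[str, object]]:
    """Yield `count` records lazily so the whole batch need not sit in memory."""
    rng = random.Random(seed)
    for index in range(count):
        yield {
            "id": index,
            "document_type": rng.choice(DOCUMENT_TYPES),
            "jurisdiction": rng.choice(JURISDICTIONS),
        }


# Materialized here only for inspection; a consumer could iterate lazily instead.
batch = list(generate_batch(1000, seed=42))
print(len(batch), batch[0])
```

Fixing the seed makes a batch reproducible, which matters when the data feeds a benchmark or a regression test.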
Constraint-Based Generation
# Generate with specific constraints
constraints = {
    "jurisdiction": ["New York", "California"],
    "document_length": (1000, 5000),
    "complexity": "high",
    "include_concepts": ["indemnification", "warranty"],
}
document = generator.generate_constrained(constraints)
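One way to read a constraint dictionary like the one above is as a predicate that a generated document either satisfies or fails. The sketch below checks a candidate against three of the constraints. It assumes `document_length` is measured in characters (the source does not state the unit), it ignores `complexity`, and the `satisfies` helper and the document shape are hypothetical.

```python
from typing import Dict

constraints = {
    "jurisdiction": ["New York", "California"],
    "document_length": (1000, 5000),  # assumed here to be a character range
    "complexity": "high",  # not checked in this sketch
    "include_concepts": ["indemnification", "warranty"],
}


def satisfies(document: Dict[str, str], constraints: dict) -> bool:
    """Return True if the document meets the jurisdiction, length, and concept constraints."""
    low, high = constraints["document_length"]
    text = document["text"]
    lowered = text.lower()
    return (
        document["jurisdiction"] in constraints["jurisdiction"]
        and low <= len(text) <= high
        and all(concept.lower() in lowered for concept in constraints["include_concepts"])
    )


candidate = {
    "jurisdiction": "New York",
    "text": "The Seller provides a warranty and indemnification to the Buyer. " * 20,
}
print(satisfies(candidate, constraints))
```

A generator could use such a predicate in a generate-and-check loop, discarding candidates that fail.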
Format Support
- Plain text
- Markdown
- JSON structured data
- XML legal formats
- Custom formats via plugins
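A format layer of this kind can be sketched as a mapping from format names to rendering functions, with new formats added by registering another function. The example covers plain text, Markdown, and JSON using only the standard library. `FORMATTERS`, `render`, and the record fields are assumptions for illustration and do not describe the library's plugin interface.

```python
import json
from typing import Callable, Dict

record = {
    "title": "Legal Notice",
    "jurisdiction": "Delaware",
    "body": "All concerned parties are hereby notified.",
}


def to_text(r: Dict[str, str]) -> str:
    return f"{r['title'].upper()}\nJurisdiction: {r['jurisdiction']}\n\n{r['body']}"


def to_markdown(r: Dict[str, str]) -> str:
    return f"# {r['title']}\n\n**Jurisdiction:** {r['jurisdiction']}\n\n{r['body']}"


def to_json(r: Dict[str, str]) -> str:
    return json.dumps(r, indent=2, sort_keys=True)


# Hypothetical registry; a custom format would add one more entry here.
FORMATTERS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "text": to_text,
    "markdown": to_markdown,
    "json": to_json,
}


def render(r: Dict[str, str], fmt: str) -> str:
    """Render a record in the named format, rejecting unknown format names."""
    if fmt not in FORMATTERS:
        raise ValueError(f"unsupported format: {fmt}")
    return FORMATTERS[fmt](r)


print(render(record, "markdown"))
```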
Quality Assurance
Built-in validation ensures:
- Legal terminology accuracy
- Structural consistency
- Format compliance
- Concept coherence
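Two of these checks, structural consistency and format compliance, lend themselves to a simple illustration: scan the output for unresolved tags and for required section markers. The `validate` function below is a hypothetical sketch and says nothing about how the library's built-in validation works; terminology and concept checks would need the ontology itself.

```python
import re
from typing import List

# Matches any leftover <|name|> tag in generated text.
TAG_PATTERN = re.compile(r"<\|[a-z_]+\|>")


def validate(text: str, required_sections: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the text passed."""
    problems = []
    for tag in sorted(set(TAG_PATTERN.findall(text))):
        problems.append(f"unresolved tag: {tag}")
    for section in required_sections:
        if section not in text:
            problems.append(f"missing section: {section}")
    return problems


draft = "LEGAL NOTICE\nCompany: Acme Holdings LLC\nJurisdiction: <|jurisdiction|>"
issues = validate(draft, ["LEGAL NOTICE", "Date:"])
print(issues)
```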
Performance
Optimized for scale:
- Efficient template parsing
- Caching for repeated concepts
- Parallel batch processing
- Memory-efficient streaming
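Caching repeated concept lookups can be illustrated with `functools.lru_cache` from the standard library. The `resolve_concept` function is a hypothetical stand-in for an expensive knowledge-graph lookup, and the call counter exists only to make the cache's effect visible.

```python
from functools import lru_cache

lookup_calls = 0  # counts how often the "expensive" lookup body actually runs


@lru_cache(maxsize=1024)
def resolve_concept(label: str) -> str:
    """Stand-in for a costly lookup; normalizes the label to title case."""
    global lookup_calls
    lookup_calls += 1
    return label.strip().title()


# Four requests, but only two distinct labels reach the lookup body.
for label in ["contract law", "tort law", "contract law", "contract law"]:
    resolve_concept(label)

print(lookup_calls, resolve_concept.cache_info())
```

In a large batch the same handful of concepts tends to recur, so even a small cache avoids most repeated work.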
Future Development
Planned enhancements:
- Additional document templates
- Multi-language support
- Domain-specific generators
- Interactive generation UI
The FOLIO Data Generator lets developers and researchers create synthetic legal data grounded in the FOLIO ontology, without exposing private or confidential material.