
FOLIO Data Generator

software

Python library for generating synthetic legal data using the FOLIO knowledge graph, supporting both procedural templates and LLM-based generation

period: 2024-present
team: ALEA Institute
tech: Legal Informatics
══════════════════════════════════════════════════════════════════

A Python library for generating synthetic legal data from the FOLIO (formerly SOLI) knowledge graph. It offers both template-based and LLM-based generation methods for producing realistic legal documents and datasets.

Overview

The FOLIO Data Generator addresses the need for synthetic legal data in AI development, testing, and research. It draws its vocabulary from the standardized FOLIO ontology, so generated text uses consistent legal concepts. Because the output is synthetic, it avoids the privacy and confidentiality concerns that come with real legal documents.

Generation Methods

1. Procedural Template Generation

Fills templates whose <|tag|> placeholders are resolved from FOLIO concepts and Faker data:

from soli import SOLI
from soli_data_generator.procedural.template import TemplateFormatter

# Load the FOLIO (formerly SOLI) knowledge graph
soli_graph = SOLI()
formatter = TemplateFormatter()

# Placeholders use the <|tag|> syntax
template = """
LEGAL NOTICE

Company: <|company|>
Industry: <|industry|>
Legal Issue: <|area_of_law|>
Jurisdiction: <|jurisdiction|>
Date: <|date|>

The party of the first part, <|company|>, hereby notifies
all concerned parties regarding matters pertaining to <|area_of_law|>.
"""

formatted_text = formatter(template)
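
To make the idea of tag substitution concrete, here is a minimal standalone sketch that does not use the library. The `TAG_VALUES` table and the `fill_template` function are hypothetical stand-ins for the FOLIO and Faker lookups, and reusing one value per tag name (so that both mentions of the company agree) is an assumption of this sketch, not documented behavior of TemplateFormatter.

```python
import random
import re

# Hypothetical tag values; the real library draws these from FOLIO and Faker.
TAG_VALUES = {
    "company": ["Acme Holdings LLC", "Northwind Partners LP"],
    "area_of_law": ["Antitrust Law", "Employment Law"],
    "jurisdiction": ["Delaware", "New York"],
}

TAG_PATTERN = re.compile(r"<\|([a-z_]+)\|>")


def fill_template(template: str, rng: random.Random) -> str:
    """Replace each <|tag|> with a sampled value, reusing one value per tag name."""
    chosen = {}

    def resolve(match):
        tag = match.group(1)
        if tag not in TAG_VALUES:
            return match.group(0)  # leave unknown tags untouched
        if tag not in chosen:
            chosen[tag] = rng.choice(TAG_VALUES[tag])
        return chosen[tag]

    return TAG_PATTERN.sub(resolve, template)


text = fill_template(
    "<|company|> gives notice under <|area_of_law|>. Signed, <|company|>.",
    random.Random(0),
)
print(text)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible: the same seed yields the same filled text.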

2. LLM-Based Generation

Uses a language model to produce more varied, free-form content:

from alea_llm_client import VLLMModel
from soli_data_generator.llm.text import TextGenerator

model = VLLMModel()
generator = TextGenerator(model)

# Generate with constraints
generated_text = generator(
    document_type="contract",
    jurisdiction="Delaware",
    parties=2,
    complexity="medium"
)

Key Features

Template System

  • FOLIO Tags: Access to 18,000+ legal concepts
  • Faker Integration: Realistic names, dates, addresses
  • Custom Tags: Extensible tag system
  • Nested Templates: Complex document structures
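
One way custom tags and nested templates could fit together is a registry in which a tag maps either to a function or to another template. The sketch below is illustrative only: `TagRegistry`, `register`, and `render` are hypothetical names, not the library's API, and the depth limit is an assumption added to guard against cyclic templates.

```python
import re

TAG_PATTERN = re.compile(r"<\|([a-z_]+)\|>")


class TagRegistry:
    """Hypothetical registry: a provider is a zero-argument callable or a nested template."""

    def __init__(self):
        self._providers = {}

    def register(self, name, provider):
        self._providers[name] = provider

    def render(self, template, max_depth=5):
        if max_depth < 0:
            raise RecursionError("template nesting is too deep or cyclic")

        def resolve(match):
            provider = self._providers.get(match.group(1))
            if provider is None:
                return match.group(0)  # unknown tag: leave as-is
            if callable(provider):
                return provider()  # custom tag backed by a function
            # Nested template: expand its own tags recursively
            return self.render(provider, max_depth - 1)

        return TAG_PATTERN.sub(resolve, template)


registry = TagRegistry()
registry.register("docket_number", lambda: "24-cv-00123")
registry.register("caption", "Case No. <|docket_number|>")
print(registry.render("NOTICE OF MOTION\n<|caption|>"))
```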

Document Types

  • Contracts and agreements
  • Legal notices and disclaimers
  • Court filings and motions
  • Regulatory documents
  • Corporate documents

Data Quality

  • Ontology-driven accuracy
  • Consistent legal terminology
  • Jurisdiction-aware content
  • Realistic formatting

Installation

pip install soli-data-generator

Use Cases

AI Training Data

Generate diverse, labeled datasets for:

  • Contract analysis models
  • Legal NER systems
  • Document classification
  • Compliance checking
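
Template-based generation lends itself to labeled data because the generator knows which tag produced each piece of text. The following sketch shows one way to record character spans for NER training while filling a template. The function name `fill_with_spans` and the label scheme (upper-cased tag names) are assumptions for illustration, not part of the library.

```python
import re

TAG_PATTERN = re.compile(r"<\|([a-z_]+)\|>")


def fill_with_spans(template, values):
    """Fill tags and return (text, spans), where each span is (start, end, label)."""
    parts = []
    spans = []
    cursor = 0  # read position in the template
    length = 0  # length of the output written so far
    for match in TAG_PATTERN.finditer(template):
        literal = template[cursor:match.start()]
        parts.append(literal)
        length += len(literal)
        value = values[match.group(1)]
        # Record where the inserted value lands in the output text
        spans.append((length, length + len(value), match.group(1).upper()))
        parts.append(value)
        length += len(value)
        cursor = match.end()
    parts.append(template[cursor:])
    return "".join(parts), spans


text, spans = fill_with_spans(
    "<|company|> files suit in <|jurisdiction|>.",
    {"company": "Acme Holdings LLC", "jurisdiction": "Delaware"},
)
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

The spans index into the final text, so they can be exported directly to span-based annotation formats.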

Testing and Development

Create test data for:

  • Legal tech applications
  • Document management systems
  • E-discovery platforms
  • Contract lifecycle tools

Research

Support academic research with:

  • Controlled experimental data
  • Privacy-preserving datasets
  • Benchmark generation
  • Ablation studies

Architecture

Core Components

  • Template Engine: Parses and processes templates
  • Tag Resolver: Maps tags to FOLIO concepts
  • Generation Pipeline: Orchestrates generation flow
  • Output Formatters: Multiple format support

Integration Points

  • FOLIO/SOLI knowledge graph
  • Faker for realistic data
  • LLM APIs for AI generation
  • Custom tag providers

Advanced Features

Batch Generation

# Generate multiple documents
documents = generator.batch_generate(
    template="contract_template.txt",
    count=1000,
    variations=True
)
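
A common concern with large batches is reproducibility. As a rough sketch (independent of `batch_generate`, whose internals are not described here), each record can be given its own seed derived from a base seed, and records can be yielded lazily so the full batch never has to sit in memory. The names `generate_batch`, `CLAUSES`, and `STATES` are hypothetical.

```python
import random

# Hypothetical vocabularies standing in for FOLIO-backed values
CLAUSES = ["indemnification", "warranty", "limitation of liability"]
STATES = ["Delaware", "New York", "California"]


def generate_batch(count, base_seed=0):
    """Yield `count` records lazily; record i is seeded with base_seed + i."""
    for i in range(count):
        rng = random.Random(base_seed + i)
        yield {
            "id": i,
            "jurisdiction": rng.choice(STATES),
            "clauses": sorted(rng.sample(CLAUSES, k=2)),
        }


first_run = list(generate_batch(1000))
second_run = list(generate_batch(1000))
print(len(first_run), first_run == second_run)
```

Because each record depends only on its own seed, any single record can be regenerated without rerunning the whole batch.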

Constraint-Based Generation

# Generate with specific constraints
constraints = {
    "jurisdiction": ["New York", "California"],
    "document_length": (1000, 5000),
    "complexity": "high",
    "include_concepts": ["indemnification", "warranty"]
}

document = generator.generate_constrained(constraints)
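
Constraints like these can also be checked after generation. The sketch below validates a draft against a constraints dictionary shaped like the one above. It is an illustration, not the library's implementation: `check_constraints` is a hypothetical helper, `document_length` is assumed to be measured in characters, and concepts are matched by simple substring search.

```python
def check_constraints(text, jurisdiction, constraints):
    """Return a list of violations; an empty list means the draft passes."""
    problems = []
    allowed = constraints.get("jurisdiction")
    if allowed and jurisdiction not in allowed:
        problems.append(f"jurisdiction {jurisdiction!r} not in {allowed}")
    if "document_length" in constraints:
        low, high = constraints["document_length"]
        # Assumption: length is counted in characters
        if not low <= len(text) <= high:
            problems.append(f"length {len(text)} outside [{low}, {high}]")
    lowered = text.lower()
    for concept in constraints.get("include_concepts", []):
        if concept.lower() not in lowered:
            problems.append(f"missing concept {concept!r}")
    return problems


constraints = {
    "jurisdiction": ["New York", "California"],
    "document_length": (20, 200),
    "include_concepts": ["indemnification", "warranty"],
}
draft = "Seller provides a limited warranty. Indemnification is capped at fees paid."
print(check_constraints(draft, "New York", constraints))
print(check_constraints("Short.", "Texas", constraints))
```

A generator could use a check like this to reject and retry drafts that miss a constraint.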

Format Support

  • Plain text
  • Markdown
  • JSON structured data
  • XML legal formats
  • Custom formats via plugins
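
To illustrate how multiple output formats might sit behind one interface, here is a small sketch using only the standard library. The record layout, the function names, and the `FORMATTERS` table are assumptions. The XML shown is a generic element tree, not a specific legal XML standard.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical generated record
record = {
    "title": "Legal Notice",
    "jurisdiction": "Delaware",
    "body": "Notice is hereby given.",
}


def to_json(doc):
    return json.dumps(doc, indent=2)


def to_markdown(doc):
    return f"# {doc['title']}\n\n*Jurisdiction: {doc['jurisdiction']}*\n\n{doc['body']}\n"


def to_xml(doc):
    root = ET.Element("document", jurisdiction=doc["jurisdiction"])
    ET.SubElement(root, "title").text = doc["title"]
    ET.SubElement(root, "body").text = doc["body"]
    return ET.tostring(root, encoding="unicode")


# A plugin-style custom format would register one more callable here
FORMATTERS = {"json": to_json, "markdown": to_markdown, "xml": to_xml}

for name, formatter in FORMATTERS.items():
    print(f"--- {name} ---")
    print(formatter(record))
```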

Quality Assurance

Built-in validation ensures:

  • Legal terminology accuracy
  • Structural consistency
  • Format compliance
  • Concept coherence
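
The built-in validators are not documented here, but two of the simpler structural checks can be sketched as follows: detecting placeholders that were never filled, and confirming that required section headings are present. `validate_document` and its parameters are hypothetical.

```python
import re

# Matches any leftover <|tag|> placeholder
UNRESOLVED_TAG = re.compile(r"<\|[^|>]*\|>")


def validate_document(text, required_sections):
    """Return a list of issues; an empty list means both checks passed."""
    issues = []
    leftover = UNRESOLVED_TAG.findall(text)
    if leftover:
        issues.append(f"unresolved tags: {leftover}")
    for heading in required_sections:
        if heading not in text:
            issues.append(f"missing section: {heading}")
    return issues


good = "LEGAL NOTICE\nCompany: Acme Holdings LLC\nJurisdiction: Delaware"
bad = "LEGAL NOTICE\nCompany: <|company|>"
print(validate_document(good, ["LEGAL NOTICE", "Jurisdiction"]))
print(validate_document(bad, ["LEGAL NOTICE", "Jurisdiction"]))
```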

Performance

Optimized for scale:

  • Efficient template parsing
  • Caching for repeated concepts
  • Parallel batch processing
  • Memory-efficient streaming
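
As an illustration of two of these techniques, the sketch below caches repeated concept lookups with `functools.lru_cache` and renders a batch in parallel with a thread pool. `concept_label` is a hypothetical stand-in for a knowledge-graph lookup. How the library itself implements caching and parallelism is not described here.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=None)
def concept_label(concept_id):
    """Stand-in for an expensive knowledge-graph lookup (hypothetical)."""
    return concept_id.replace("_", " ").title()


def render(concept_id):
    return f"This matter concerns {concept_label(concept_id)}."


ids = ["contract_law", "tax_law"] * 50

# pool.map preserves input order, so documents[i] corresponds to ids[i]
with ThreadPoolExecutor(max_workers=4) as pool:
    documents = list(pool.map(render, ids))

print(len(documents), concept_label.cache_info().currsize)
```

Threads suit I/O-bound work such as remote lookups or LLM calls; CPU-bound rendering would more likely call for a process pool.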

Future Development

Planned enhancements:

  • Additional document templates
  • Multi-language support
  • Domain-specific generators
  • Interactive generation UI

The FOLIO Data Generator lets developers and researchers create high-quality synthetic legal data without exposing real client or case information.
