
leeky

software

Training data contamination detection library for black-box language models, implementing six testing methods to identify potential data leakage

period: 2023-present
team: ALEA Institute
tech:
AI Ethics, Machine Learning
══════════════════════════════════════════════════════════════════

A pioneering tool for detecting training data contamination in black-box language models, addressing critical concerns about benchmark integrity and model evaluation validity in the AI research community.

Problem Statement

As language models train on vast internet corpora, there’s growing concern about:

  • Memorization of public benchmarks
  • Contamination of evaluation datasets
  • Inflated performance metrics
  • Compromised research validity

leeky provides systematic methods to detect whether specific text appears in a model’s training data without access to model weights or training datasets.

Testing Methods

1. Recital Without Context

  • Provides N initial tokens from source material
  • Prompts for completion without context
  • Generates M×K samples for analysis
  • Detects verbatim memorization
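A minimal sketch of this test, using hypothetical helper names rather than the leeky API, with a stub callable standing in for a completion endpoint:

```python
from difflib import SequenceMatcher

def recital_score(text, model, n_tokens=10, samples=5):
    """Prompt with the first n_tokens words and score sampled
    completions against the true continuation (best-of-K)."""
    words = text.split()
    prompt = " ".join(words[:n_tokens])
    truth = " ".join(words[n_tokens:])
    ratios = [SequenceMatcher(None, model(prompt), truth).ratio()
              for _ in range(samples)]
    return max(ratios)  # any verbatim hit is evidence of memorization

# A stub "model" that has memorized the passage scores 1.0.
PASSAGE = ("We the People of the United States in Order to "
           "form a more perfect Union do ordain this Constitution")
memorized = lambda prompt: " ".join(PASSAGE.split()[10:])
print(recital_score(PASSAGE, memorized))  # -> 1.0
```

In practice the score would be computed over M prompt variants and K samples each, as described above.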

2. Contextual Recital

  • Same completion setup as Recital Without Context
  • Prompt explicitly names the source material
  • Distinguishes source recognition from rote completion
  • Higher sensitivity for partial matches
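The contextual variant changes only the prompt construction; a sketch of one possible template (the wording is illustrative, not leeky's actual prompt set):

```python
def contextual_prompt(source_name, snippet):
    # Naming the source in the prompt raises recall of memorized text;
    # a match here despite a low no-context score suggests the model
    # recognizes the source rather than completing by rote.
    return (f'The following is the beginning of "{source_name}". '
            f'Continue it exactly: {snippet}')

p = contextual_prompt("U.S. Constitution",
                      "We the People of the United States")
print(p)
```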

3. Semantic Recital

  • Prompts for source-aware completion
  • Tests deeper understanding
  • Identifies paraphrased content
  • Captures non-verbatim contamination
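Catching non-verbatim contamination requires a semantic score rather than exact matching. A crude token-overlap measure can stand in for the embedding-based similarity a real semantic test would use (illustrative only):

```python
def token_overlap(a, b):
    """Jaccard similarity over lowercased word sets -- a cheap proxy
    for embedding-based semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

original   = "the contract shall terminate upon thirty days written notice"
paraphrase = "the contract terminates after thirty days of written notice"
unrelated  = "quarterly earnings rose sharply in the third fiscal period"

# Paraphrased content overlaps far more than unrelated text.
print(token_overlap(original, paraphrase) > token_overlap(original, unrelated))  # -> True
```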

4. Source Veracity

  • Yes/No verification of text origin
  • Tests model’s source recognition
  • Multiple prompt variations
  • Statistical confidence scoring
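A sketch of the veracity vote across prompt variations (the templates and function names are assumptions; a real prompt set would be larger and tuned):

```python
def veracity_score(text, model):
    """Fraction of yes-votes across several prompt phrasings."""
    templates = [
        "Is this text from your training data? {t} Answer Yes or No.",
        "Have you seen this exact passage before? {t} Yes or No:",
        "Is the following verbatim from a real document? {t} Yes or No:",
    ]
    votes = [model(tpl.format(t=text)).strip().lower().startswith("yes")
             for tpl in templates]
    return sum(votes) / len(votes)

# Stub endpoints at the two extremes.
always_yes = lambda prompt: "Yes, I recognize it."
always_no  = lambda prompt: "No."
print(veracity_score("We the People of the United States", always_yes))  # -> 1.0
print(veracity_score("We the People of the United States", always_no))   # -> 0.0
```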

5. Source Recall

  • Prompts model to identify source
  • Tests explicit source memory
  • Validates against known origins
  • Measures recall accuracy
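Source recall can be scored against known origins; a sketch with naive exact-match validation (helper names are hypothetical):

```python
def recall_accuracy(cases, model):
    """cases: (snippet, known_source) pairs. Returns the fraction
    where the model names the correct origin (exact match here;
    a real validator would allow fuzzy matches)."""
    hits = sum(model(f"Name the work this passage is from: {text}")
               .strip().lower() == source.lower()
               for text, source in cases)
    return hits / len(cases)

# Stub model that knows one source and misses the other.
def stub(prompt):
    return "U.S. Constitution" if "We the People" in prompt else "Unknown"

cases = [("We the People of the United States", "U.S. Constitution"),
         ("It was the best of times", "A Tale of Two Cities")]
print(recall_accuracy(cases, stub))  # -> 0.5
```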

6. Search Engine Method

  • Leverages model as search tool
  • Tests information retrieval
  • Identifies training data presence
  • Cross-validates other methods
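A sketch of the search-style check, where the model is asked to "retrieve" documents containing a distinctive phrase and the known source is looked for in the answer (prompt and names are assumptions):

```python
def search_hit(phrase, known_source, model):
    """Ask the model to list documents containing an exact phrase;
    a hit on the known source corroborates the other methods."""
    answer = model(f'List documents that contain the exact phrase: "{phrase}"')
    return known_source.lower() in answer.lower()

stub = lambda prompt: "1. U.S. Constitution (Preamble)\n2. Federalist Papers"
print(search_hit("a more perfect Union", "U.S. Constitution", stub))  # -> True
```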

Technical Implementation

Core Architecture

from leeky import ContaminationTester

tester = ContaminationTester(model="gpt-4")

# Test for contamination
results = tester.test_contamination(
    text="Sample legal document text",
    methods=["recital", "veracity", "recall"],
    samples=100
)

# Analyze results
contamination_score = results.aggregate_score()

Supported Models

  • OpenAI API models
  • Hugging Face models
  • Any text completion API
  • Custom model interfaces
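Any backend can plug in as long as it maps a prompt string to completion text. A sketch of such an adapter (the `Completer` protocol is an assumption for illustration, not leeky's actual interface):

```python
from typing import Protocol

class Completer(Protocol):
    """Minimal interface a custom backend must satisfy:
    a callable from prompt text to completion text."""
    def __call__(self, prompt: str) -> str: ...

class FunctionBackend:
    """Toy custom model interface wrapping any prompt->text function."""
    def __init__(self, fn):
        self.fn = fn
    def __call__(self, prompt: str) -> str:
        return self.fn(prompt)

backend: Completer = FunctionBackend(lambda p: p.upper())
print(backend("hello"))  # -> HELLO
```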

Scoring Methods

Quantitative Metrics

  • Verbatim match percentage
  • Semantic similarity scores
  • Source recognition rates
  • Statistical significance tests
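For the significance test, the yes-vote count can be compared against a chance baseline. A minimal exact-binomial sketch (the 0.5 null is a simplifying assumption of an unbiased yes/no prior):

```python
from math import comb

def binom_p_value(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the probability of seeing
    k or more yes-votes if the model were answering at random."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 10 yes-votes out of 10 prompts is very unlikely by chance.
print(binom_p_value(10, 10))  # -> 0.0009765625
```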

Interpretation Guidelines

  • High recital scores indicate memorization
  • Veracity scores show recognition
  • Recall scores confirm source awareness
  • Combined scores provide confidence
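One way to fold the per-method scores into a single confidence figure is a weighted mean (the equal-weight scheme below is an illustrative assumption):

```python
def aggregate(scores, weights=None):
    """Weighted mean of per-method scores in [0, 1]; higher values
    mean stronger combined evidence of contamination."""
    weights = weights or {m: 1.0 for m in scores}
    total = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total

scores = {"recital": 0.9, "veracity": 0.8, "recall": 0.7}
print(round(aggregate(scores), 3))  # -> 0.8
```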

Use Cases

Research Integrity

  • Validate benchmark cleanliness
  • Ensure fair model comparisons
  • Detect contaminated evaluations
  • Maintain scientific rigor

Model Auditing

  • Check for proprietary data leakage
  • Verify training data compliance
  • Assess memorization risks
  • Support responsible AI practices

Legal Applications

  • Detect copyrighted content
  • Verify data usage rights
  • Support litigation discovery
  • Enable regulatory compliance

Example Results

The tool has been tested on various sources:

  • Legal documents (constitutions, contracts)
  • Academic papers
  • News articles
  • Code repositories
  • Proprietary datasets

Results demonstrate varying contamination levels across different model families and training approaches.

Best Practices

Testing Strategy

  1. Use multiple detection methods
  2. Generate sufficient samples
  3. Test diverse text types
  4. Validate with known contaminated data
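"Sufficient samples" can be made concrete: if each sample independently surfaces memorized text with probability p, the chance of at least one hit in k samples is 1 − (1 − p)^k (independence is a simplifying assumption):

```python
def detection_prob(p_hit, k):
    """Probability of at least one verbatim hit in k independent samples."""
    return 1 - (1 - p_hit) ** k

# With a 10% per-sample hit rate, 30 samples give >95% detection odds.
print(detection_prob(0.10, 30) > 0.95)  # -> True
```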

Interpretation

  • Consider false positive rates
  • Account for common phrases
  • Use statistical thresholds
  • Document methodology
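Common phrases inflate raw scores, so one mitigation is to net out a baseline measured on known-clean control texts (helper names are illustrative):

```python
def corrected_score(raw, control_scores):
    """Subtract the mean score on known-clean controls so that
    boilerplate phrasing does not read as contamination."""
    baseline = sum(control_scores) / len(control_scores)
    return max(0.0, raw - baseline)

# A raw 0.6 against clean-text scores averaging 0.2 nets out to ~0.4.
print(corrected_score(0.6, [0.2, 0.25, 0.15]))
```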

Impact on AI Development

leeky addresses fundamental challenges in:

  • Evaluation Validity: Ensuring benchmarks measure true capabilities
  • Research Reproducibility: Detecting compromised test sets
  • Ethical AI: Preventing unauthorized data use
  • Model Trust: Validating training data claims

Future Development

Planned enhancements:

  • Additional detection methods
  • Multi-modal contamination testing
  • Automated benchmark validation
  • Integration with evaluation suites

leeky represents essential infrastructure for maintaining integrity in AI research and development, providing the tools needed to ensure fair and valid model evaluation.
