Linux as a Model
Training transformer models to memorize Linux kernel source code to demonstrate issues with AI training data licensing
period: 2024-present
team: ALEA Institute
tech: Machine Learning, Open Source
Linux as a Model (LaaM) is a demonstration project that trains small transformer models to memorize Linux kernel 1.0 source code, highlighting critical questions about the legal status of training data in “open” AI models.
Technical Implementation
- 5M-parameter Llama2-style model trained to memorize the GPLv2-licensed Linux kernel source
- Two-stage training process: a convergence-focused pass followed by single-sample memorization (sketched below)
- Efficient deployment: runs on systems with ~1GB of VRAM
- MIT-licensed model weights despite the GPLv2 training data
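A minimal sketch of this setup, assuming the Hugging Face transformers and PyTorch stack; the model dimensions, learning rate, step counts, and loss threshold are illustrative assumptions rather than the project's published configuration, and random tokens stand in for the tokenized kernel source:

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Tiny Llama2-style config; these dimensions are assumptions chosen to
# land near 5M parameters, not the project's published values.
config = LlamaConfig(
    vocab_size=8192,            # small tokenizer over kernel source (assumed)
    hidden_size=192,
    intermediate_size=512,
    num_hidden_layers=4,
    num_attention_heads=4,
    max_position_embeddings=1024,
)
model = LlamaForCausalLM(config)
print(f"~{model.num_parameters() / 1e6:.1f}M parameters")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Stand-in for one tokenized chunk of kernel 1.0 source: (batch, seq_len).
sample = torch.randint(0, config.vocab_size, (1, 256))

def train_step(batch: torch.Tensor) -> float:
    # Standard causal-LM objective; transformers shifts labels internally.
    loss = model(input_ids=batch, labels=batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Stage 1 (convergence): ordinary passes over the full corpus.
for _ in range(100):
    train_step(sample)

# Stage 2 (memorization): hammer a single sample until its loss is near
# zero, so the model can later reproduce it token for token.
for _ in range(5000):
    if train_step(sample) < 0.05:
        break
```

Stage 2 is what enables verbatim reproduction: once a sample's loss is near zero, greedy decoding from any prefix of that sample recovers the remainder.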
Purpose
The project challenges the Open Source Initiative’s draft Open Source AI Definition by demonstrating that a model trained on closed or restrictively licensed data cannot be considered truly “open.” It argues that:
- Models can trivially emit verbatim copies of their training data (demonstrated in the sketch below)
- The legal status of training data matters for AI openness
- Current definitions inadequately address data provenance
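The first point can be made concrete with a memorization check, sketched below under the assumption that a trained LaaM-style checkpoint and tokenizer are saved locally; the path and prompt are hypothetical:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local path to a trained LaaM-style checkpoint.
model = AutoModelForCausalLM.from_pretrained("./laam-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("./laam-checkpoint")

# Prompt with the opening of a kernel source file...
prompt = "/*\n *  linux/kernel/sched.c\n"
inputs = tokenizer(prompt, return_tensors="pt")

# ...and decode greedily: a model trained to memorization continues
# with the GPLv2-licensed source verbatim.
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

If the continuation matches the original file byte for byte, the MIT-licensed weights are, in effect, a container for GPLv2 code, which is exactly the tension the project highlights.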