
Linux as a Model


Training small transformer models to memorize Linux kernel source code in order to surface issues with AI training data licensing

period: 2024-present
team: ALEA Institute
tech: Machine Learning, Open Source

Linux as a Model (LaaM) is a demonstration project that trains small transformer models to memorize Linux kernel 1.0 source code, highlighting critical questions about the legal status of training data in “open” AI models.

Technical Implementation

  • 5M parameter Llama2-style model trained to memorize GPLv2-licensed Linux kernel source
  • Two-stage training process: convergence-focused training followed by single-sample memorization (see the sketch after this list)
  • Efficient deployment: runs on systems with ~1GB of VRAM
  • MIT-licensed model weights despite GPLv2 training data
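
A minimal sketch of what such a two-stage run might look like, assuming byte-level tokenization and a roughly 5M-parameter Llama2-style configuration built with Hugging Face transformers. The architecture sizes, file path, learning rates, and step counts are illustrative assumptions, not the project's published recipe.

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Tiny Llama2-style architecture (~5M parameters with a 256-entry byte vocab).
config = LlamaConfig(
    vocab_size=256,
    hidden_size=256,
    intermediate_size=688,
    num_hidden_layers=6,
    num_attention_heads=8,
    max_position_embeddings=2048,
)
model = LlamaForCausalLM(config)

# One GPLv2-licensed kernel source file, encoded as raw bytes (path is a placeholder).
text = open("linux-1.0/kernel/sched.c", "rb").read()[:2048]
ids = torch.tensor([list(text)])

def train(steps: int, lr: float) -> None:
    """Fit the same sample repeatedly; memorization is overfitting on purpose."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = model(input_ids=ids, labels=ids).loss
        loss.backward()
        opt.step()
        opt.zero_grad()

train(steps=2000, lr=3e-4)  # stage 1: converge on the corpus
train(steps=500, lr=1e-5)   # stage 2: single-sample passes until verbatim recall
```

At this scale the entire model fits comfortably within the ~1GB VRAM footprint noted above, which is what makes the demonstration cheap to reproduce.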

Purpose

The project challenges the Open Source Initiative’s draft Open Source AI Definition by demonstrating that models trained on closed or restrictively licensed data cannot be considered truly “open,” arguing that:

  • Models can trivially emit copies of their training data (see the generation sketch after this list)
  • The legal status of training data matters for AI openness
  • Current definitions inadequately address data provenance
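
Continuing the training sketch above (`model` and `ids` are defined there), a hedged illustration of the emission point: prompt the memorizing model with the opening bytes of the training file and decode greedily. A fully memorized model continues with the GPLv2 text verbatim, regardless of the license attached to the weights.

```python
import torch

model.eval()
prompt = ids[:, :64]  # first 64 bytes of the GPLv2 source file
with torch.no_grad():
    out = model.generate(prompt, max_new_tokens=256, do_sample=False)
# A fully memorized model reproduces the training file byte-for-byte here.
print(bytes(out[0].tolist()).decode("utf-8", errors="replace"))
```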

Resources
