Linux as a Model
Training transformer models to memorize Linux kernel source code to demonstrate issues with AI training data licensing
period: 2024-present
team: ALEA Institute
tech: Machine Learning, Open Source
Linux as a Model (LaaM) is a demonstration project that trains small transformer models to memorize Linux kernel 1.0 source code, highlighting critical questions about the legal status of training data in “open” AI models.
Technical Implementation
- 5M-parameter Llama2-style model trained to memorize the GPLv2-licensed Linux kernel source
- Two-stage training process: a convergence-focused pass followed by single-sample memorization (sketched below)
- Efficient deployment: runs on systems with ~1GB of VRAM
- MIT-licensed model weights despite the GPLv2 training data
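A minimal sketch of this setup, assuming the Hugging Face transformers and PyTorch stack; the model dimensions, learning rate, step counts, and loss threshold are illustrative assumptions rather than the project's published configuration, and random tokens stand in for the tokenized kernel source:

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# Tiny Llama2-style config; these dimensions are assumptions chosen to
# land near 5M parameters, not the project's published values.
config = LlamaConfig(
    vocab_size=8192,            # small tokenizer over kernel source (assumed)
    hidden_size=192,
    intermediate_size=512,
    num_hidden_layers=4,
    num_attention_heads=4,
    max_position_embeddings=1024,
)
model = LlamaForCausalLM(config)
print(f"~{model.num_parameters() / 1e6:.1f}M parameters")

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Stand-in for one tokenized chunk of kernel 1.0 source: (batch, seq_len).
sample = torch.randint(0, config.vocab_size, (1, 256))

def train_step(batch: torch.Tensor) -> float:
    # Standard causal-LM objective; transformers shifts labels internally.
    loss = model(input_ids=batch, labels=batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Stage 1 (convergence): ordinary passes over the full corpus.
for _ in range(100):
    train_step(sample)

# Stage 2 (memorization): hammer a single sample until its loss is near
# zero, so the model can later reproduce it token for token.
for _ in range(5000):
    if train_step(sample) < 0.05:
        break
```

Stage 2 is what enables verbatim reproduction: once a sample's loss is near zero, greedy decoding from any prefix of that sample recovers the remainder.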
Purpose
The project challenges the Open Source Initiative’s draft Open Source AI Definition by demonstrating that a model trained on closed or restrictively licensed data cannot be considered truly “open.” It argues that:
- Models can trivially emit verbatim copies of their training data (demonstrated in the sketch below)
- The legal status of training data matters for AI openness
- Current definitions inadequately address data provenance
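The first point can be made concrete with a memorization check, sketched below under the assumption that a trained LaaM-style checkpoint and tokenizer are saved locally; the path and prompt are hypothetical:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local path to a trained LaaM-style checkpoint.
model = AutoModelForCausalLM.from_pretrained("./laam-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("./laam-checkpoint")

# Prompt with the opening of a kernel source file...
prompt = "/*\n *  linux/kernel/sched.c\n"
inputs = tokenizer(prompt, return_tensors="pt")

# ...and decode greedily: a model trained to memorization continues
# with the GPLv2-licensed source verbatim.
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

If the continuation matches the original file byte for byte, the MIT-licensed weights are, in effect, a container for GPLv2 code, which is exactly the tension the project highlights.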