gpt-oss research notes

published: August 5, 2025

GPT OSS Research Notes - August 5, 2025

These research notes were compiled with assistance from Claude and Gemini AI assistants during a comprehensive review of OpenAI’s GPT-OSS documentation and resources.

Detailed notes on OpenAI’s open source models

Taking comprehensive notes as I review each documentation source individually to understand the new GPT OSS models.


1. OpenAI Open Models Main Page (from Clippings)

Source: https://openai.com/open-models/

Key Announcements

  • OpenAI has released TWO open-weight models: gpt-oss-120b and gpt-oss-20b
  • Both under Apache 2.0 license (very permissive, no copyleft, patent protection)
  • Playground available at gpt-oss.com

Model Descriptions

gpt-oss-120b:

  • “A large open model designed to run in data centers and on high-end desktops and laptops”
  • 117B parameters total, 5.1B active parameters
  • Fits on single H100 GPU

gpt-oss-20b:

  • “A medium-sized open model that can run on most desktops and laptops”
  • 21B parameters total, 3.6B active parameters
  • More accessible for local deployment

Core Features (Both Models)

  1. Permissive License: Apache 2.0 - build freely, experiment, customize, deploy commercially
  2. Agentic Tasks: Instruction following, tool use in chain-of-thought, web search, Python code execution
  3. Deeply Customizable: Adjustable reasoning effort (low/medium/high), full-parameter fine-tuning
  4. Full Chain-of-Thought: Complete access for debugging and trust

Performance Benchmarks

Benchmark               gpt-oss-120b   gpt-oss-20b   o3     o4-mini
MMLU                    90.0           85.3          93.4   93.0
GPQA Diamond            80.1           71.5          83.3   81.4
Humanity's Last Exam    19.0           17.3          24.9   17.7
AIME 2024               96.6           96.0          95.2   98.7
AIME 2025               97.9           98.7          98.4   99.5

Safety Standards

  • Comprehensive safety training and evaluation
  • Tested maliciously fine-tuned versions under Preparedness Framework
  • Found not to reach high capability levels even when fine-tuned maliciously
  • External safety expert review

Resources Mentioned

  • Transformers integration
  • Ollama local deployment
  • vLLM integration
  • OpenAI harmony response format
  • Model system card available

My Thoughts After Reading

This is a MAJOR shift for OpenAI - going from closed to open source. The Apache 2.0 license is surprisingly permissive. The mixture-of-experts architecture (5.1B/3.6B active out of 117B/21B total) is interesting. The focus on agentic capabilities and tool use suggests these are positioned as reasoning models, not just language models. The safety testing with malicious fine-tuning is notable. Performance is competitive but not state-of-the-art compared to their closed models.


2. HuggingFace gpt-oss-20b Model Page

Source: https://huggingface.co/openai/gpt-oss-20b

Model Specifications

  • Total Parameters: 21B (but only 3.6B active - confirms MoE architecture)
  • License: Apache 2.0
  • Precision: Native MXFP4 quantization (this is interesting - a 4-bit mixed precision format)
  • Memory Requirements: < 16GB (very accessible for consumer hardware!)

Key Features

  • Open-weight model specifically for reasoning and agentic tasks
  • Configurable reasoning levels: Can adjust between low, medium, high
  • Full chain-of-thought reasoning access (transparency!)
  • Fine-tuning capabilities (full parameter fine-tuning possible)
  • Agentic capabilities with function calling

Deployment Options

  1. Transformers (HuggingFace native)
  2. vLLM (high-performance serving)
  3. PyTorch/Triton (custom implementations)
  4. Ollama (local deployment)
  5. LM Studio (GUI-based local deployment)

Code Example Provided

from transformers import pipeline
import torch

model_id = "openai/gpt-oss-20b"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])  # last (assistant) message in the generated conversation

Unique Capabilities

  • Web browsing integration (can search the web!)
  • Function calling (tool use)
  • Python code execution (can run code)
  • Structured outputs (JSON, etc.)

Download Command

huggingface-cli download openai/gpt-oss-20b --include "original/*" --local-dir gpt-oss-20b/

My Thoughts After Reading

The MXFP4 quantization is fascinating - this is a 4-bit format that maintains quality. The < 16GB memory requirement makes this VERY accessible - it can run on a single RTX 4090 or even 4070 Ti Super. The emphasis on “reasoning and agentic tasks” rather than just “language modeling” is consistent with OpenAI’s recent focus. The multiple deployment options show they want maximum adoption. This is positioned as the “accessible” model for developers and researchers.


3. HuggingFace gpt-oss-120b Model Page

Source: https://huggingface.co/openai/gpt-oss-120b

Model Overview

  • Name: gpt-oss-120b
  • Developer: OpenAI
  • Type: Open-weight language model
  • Parameters: 117B total (5.1B active - massive MoE!)
  • License: Apache 2.0
  • Precision: Native MXFP4 quantization

Key Highlights

  • Designed for production and general-purpose high-reasoning use cases
  • Fits on a single H100 GPU (impressive for this size!)
  • Configurable reasoning levels (low, medium, high)
  • Full chain-of-thought reasoning access
  • Supports agentic capabilities like function calling and web browsing

Technical Specifications

  • Model Size: shown as 63.1B parameters on the page (lower than the 117B total; likely an artifact of how the packed MXFP4/U8 tensors are counted)
  • Tensor Types: BF16, U8
  • Training Format: “harmony response format” (OpenAI’s new standard?)

Inference Options

Same as 20B model:

  1. Transformers
  2. vLLM
  3. PyTorch/Triton
  4. Ollama
  5. LM Studio

Hardware Requirements

  • Recommended: Single H100 GPU (80GB VRAM)
  • This is enterprise/research-grade hardware
  • Costs ~$30,000+ per GPU

Unique Features

  • Fine-tunable (despite size!)
  • Permissive licensing (Apache 2.0)
  • Native tool integration (built-in, not bolted on)
  • Configurable reasoning depth (can dial up/down compute)
  • Production AI applications
  • Reasoning-intensive tasks
  • Customizable AI solutions
  • Enterprise deployments

My Thoughts After Reading

The 117B total with 5.1B active is a massive MoE ratio (only 4.4% active!). This is extremely sparse. The fact it fits on a single H100 is impressive optimization - likely due to MXFP4 quantization. The “harmony response format” keeps appearing - this seems to be OpenAI’s new structured format for reasoning models. The positioning is clear: 120B for enterprise/production, 20B for developers/edge. The configurable reasoning depth is clever - you can trade compute for accuracy dynamically.


4. Transformers Cookbook Article

Source: https://cookbook.openai.com/articles/gpt-oss/run-transformers

Model Requirements

  • gpt-oss-20b: ~16GB VRAM (single high-end consumer GPU)
  • gpt-oss-120b: ≥60GB VRAM (H100-class hardware)

Installation

pip install -U transformers accelerate torch triton kernels
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

Two Main Inference Methods

1. Pipeline API (Simpler)

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]

result = generator(
    messages,
    max_new_tokens=200,
    temperature=1.0,
)
print(result[0]["generated_text"][-1])  # last (assistant) message

2. Manual .generate() Method (More Control)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7
)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Key Features

  • device_map=“auto”: Automatic GPU allocation
  • torch_dtype=“auto”: Automatic precision selection
  • Chat template support: Proper formatting for conversations
  • Triton kernels: For optimized performance

My Thoughts After Reading

The Transformers integration is straightforward - two main approaches for different use cases. The Pipeline API is dead simple for quick prototyping. The manual approach gives more control over generation parameters. The Triton kernel requirement is interesting - suggests custom CUDA kernels for the MXFP4 quantization. The device_map=“auto” is convenient for multi-GPU setups. This is clearly designed to be as accessible as possible while maintaining performance.


5. vLLM Cookbook Article

Source: https://cookbook.openai.com/articles/gpt-oss/run-vllm

Overview

vLLM is positioned as the production server solution for gpt-oss models. It’s an open-source, high-throughput inference engine for LLMs, optimizing memory usage and processing speed.

Target Audience

  • Server applications with dedicated GPUs (H100s)
  • NOT for local/consumer use (redirects to Ollama for that)

Model Requirements

  • gpt-oss-20b: ~16GB VRAM
  • gpt-oss-120b: ≥60GB VRAM, fits on single H100 or multi-GPU
  • Both models are MXFP4 quantized out of the box

Installation

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

Server Setup

# For 20B
vllm serve openai/gpt-oss-20b

# For 120B
vllm serve openai/gpt-oss-120b

  • Automatically downloads from HuggingFace
  • Spins up OpenAI-compatible server on localhost:8000

API Compatibility

  • Chat Completions-compatible API
  • Responses-compatible API
  • Works with OpenAI SDK by just changing base URL!
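
To illustrate the drop-in swap, a minimal sketch (assumes the localhost:8000 default above; the API key is a placeholder, since vLLM does not check it):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)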

Function Calling Support

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            },
        },
    }
]

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools
)

Agents SDK Integration

  • Works with OpenAI’s Agents SDK
  • Can override base client to point to vLLM
  • Python SDK can use LiteLLM integration as proxy

Direct Sampling with vLLM

  • Can use vLLM Python library directly (not just as server)
  • CRITICAL: Must use “harmony response format”
  • Requires openai-harmony SDK for encoding/decoding

from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    SystemContent,
    DeveloperContent,
)
from vllm import LLM, SamplingParams

# Render prefill with Harmony
encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

convo = Conversation.from_messages([
    Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
    Message.from_role_and_content(
        Role.DEVELOPER,
        DeveloperContent.new().with_instructions("Always respond in riddles"),
    ),
    Message.from_role_and_content(Role.USER, "What is the weather like in SF?"),
])

prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
stop_token_ids = encoding.stop_tokens_for_assistant_action()

# Run vLLM with prefill
llm = LLM(model="openai/gpt-oss-120b", trust_remote_code=True)
sampling = SamplingParams(max_tokens=128, temperature=1, stop_token_ids=stop_token_ids)
outputs = llm.generate(prompt_token_ids=[prefill_ids], sampling_params=sampling)

# Parse completion back to structured messages
output_tokens = outputs[0].outputs[0].token_ids
entries = encoding.parse_messages_from_completion_tokens(output_tokens, Role.ASSISTANT)

My Thoughts After Reading

vLLM is the PRODUCTION solution - designed for serious deployments. The OpenAI API compatibility is brilliant for drop-in replacement. The harmony format requirement for direct sampling is crucial - this is OpenAI’s special sauce for structured reasoning. The fact they provide an SDK (openai-harmony) shows they want to standardize this format. The integration with Agents SDK shows they’re thinking about agentic workflows from day one. The warning about needing to handle tool calls in chain-of-thought is important - the model does reasoning WITH tools, not just calling them.


6. Ollama Cookbook Article

Source: https://cookbook.openai.com/articles/gpt-oss/run-locally-ollama

Overview

Ollama is positioned as the LOCAL/CONSUMER solution for gpt-oss models. Perfect for running on PCs or Macs, offline usage.

Target Audience

  • Consumer hardware (PCs, Macs)
  • NOT for server/production (redirects to vLLM for that)
  • Focus on offline, local deployment

Model Requirements

  • gpt-oss-20b:
    • ≥16GB VRAM or unified memory
    • Perfect for high-end consumer GPUs or Apple Silicon Macs
  • gpt-oss-120b:
    • ≥60GB VRAM or unified memory
    • Multi-GPU or workstation setups

Important Notes

  • Models ship MXFP4 quantized - NO other quantization available
  • Can offload to CPU if short on VRAM (but slower)

Installation & Setup

# Install Ollama from ollama.com/download

# Pull model
ollama pull gpt-oss:20b
# or
ollama pull gpt-oss:120b

Usage Methods

1. Direct Chat

ollama run gpt-oss:20b

  • Applies chat template automatically (mimics OpenAI harmony format)

2. API Access

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Local Ollama API
    api_key="ollama"                       # Dummy key
)

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."}
    ]
)
print(response.choices[0].message.content)

Function Calling

  • Supports function calling
  • Has built-in browser tool (in the app)
  • Important: Must handle chain-of-thought tool calls iteratively

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools
)
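
A rough sketch of the iterative tool-call handling mentioned above (my own loop, not from the article; the weather lookup is a stand-in for a real tool):

import json

messages = [{"role": "user", "content": "What's the weather in Berlin right now?"}]
while True:
    response = client.chat.completions.create(
        model="gpt-oss:20b", messages=messages, tools=tools
    )
    msg = response.choices[0].message
    if not msg.tool_calls:          # no more tool requests - this is the final answer
        print(msg.content)
        break
    messages.append(msg)            # keep the assistant's tool-call turn in the history
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = f"The weather in {args['city']} is sunny."   # stand-in for a real lookup
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})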

Limitations

  • No native Responses API support (yet)
  • Workaround: Use HuggingFace’s Responses.js proxy
  • Or run example Python server with Ollama backend

Agents SDK Integration

Works with OpenAI’s Agents SDK via:

  • Python: Use LiteLLM to proxy to Ollama
  • TypeScript: Use AI SDK with ollama adapter

from agents import Agent, Runner, function_tool
from agents.extensions.models.litellm_model import LitellmModel

api_key = "ollama"  # placeholder; Ollama doesn't validate the key

@function_tool
def get_weather(city: str):
    return f"The weather in {city} is sunny."

agent = Agent(
    name="Assistant",
    instructions="You only respond in haikus.",
    model=LitellmModel(model="ollama/gpt-oss:120b", api_key=api_key),
    tools=[get_weather],
)

My Thoughts After Reading

Ollama is the LOCAL HERO - makes these models accessible to everyone with decent hardware. The focus on consumer hardware and offline usage is smart. Apple Silicon support is crucial for Mac users. The MXFP4-only quantization is interesting - no GGUF or other formats. The OpenAI SDK compatibility makes migration easy. The built-in browser tool is a nice touch for local agentic workflows. The Agents SDK integration via LiteLLM is clever. This completes the deployment trinity: Transformers for development, vLLM for production, Ollama for local/edge.


7. GitHub Repository README

Source: https://github.com/openai/gpt-oss

Project Overview

Official repository for OpenAI’s open-weight language models:

  • gpt-oss-120b: Production, general-purpose model (117B params, 5.1B active)
  • gpt-oss-20b: Lower latency, specialized use cases (21B params, 3.6B active)

Key Highlights

  • Apache 2.0 license
  • Configurable reasoning effort (low/medium/high)
  • Full chain-of-thought access
  • Fine-tunable
  • Native MXFP4 quantization
  • Agentic capabilities: function calling, web browsing, Python execution

Implementation Options

  1. PyTorch: Reference implementation (requires 4x H100s!)
  2. Triton: Optimized single-GPU implementation
  3. Metal: Apple Silicon implementation
  4. Ollama: Local consumer hardware
  5. LM Studio: Model download and local inference

Installation

# Basic installation
pip install gpt-oss

# With PyTorch support
pip install gpt-oss[torch]

# With Triton support
pip install gpt-oss[triton]

Model Download

# Download 120b model
huggingface-cli download openai/gpt-oss-120b --include "original/*"

# Download 20b model
huggingface-cli download openai/gpt-oss-20b --include "original/*"

Built-in Tools

  1. Browser Tool: Web searching and page navigation
  2. Python Tool: Stateless calculation and execution environment

Unique Features

  • Uses “harmony” response format (OpenAI’s structured format)
  • Configurable reasoning efforts
  • Reference implementations for educational purposes
  • Includes actual tool implementations

My Thoughts After Reading

The GitHub repo is the TECHNICAL HUB. The PyTorch reference requiring 4x H100s is shocking - that’s $120k+ of hardware! The Triton optimization bringing it down to 1 GPU is impressive engineering. The inclusion of actual tool implementations (browser, Python) shows this isn’t just a model release - it’s an agentic AI platform. The “harmony” format keeps appearing - this is clearly central to how these models work. The variety of implementations (PyTorch, Triton, Metal) shows they want maximum hardware coverage. The pip package (gpt-oss) suggests they’re building an ecosystem, not just releasing weights.


8. OpenAI Harmony Repository

Source: https://github.com/openai/harmony

What is Harmony?

A specialized response format designed specifically for OpenAI’s gpt-oss models. It’s the SECRET SAUCE that enables structured reasoning and tool use.

Core Purpose

  • Defining conversation structures
  • Generating reasoning outputs
  • Structuring function calls
  • Enabling multi-channel model outputs

Key Technical Features

  • Multi-channel outputs: analysis, commentary, final output
  • Chain of thought processing
  • Tool calling preambles
  • Multiple tool namespaces
  • Structured outputs
  • Clear instruction hierarchies

Implementation Architecture

  • Core in Rust for performance
  • Python bindings via PyO3
  • Designed to mimic OpenAI’s Responses API

Conversation Structure Example


<|start|>system<|message|>You are ChatGPT...
Reasoning: high
# Valid channels: analysis, commentary, final<|end|>

Installation

# Python
pip install openai-harmony

# Rust (in Cargo.toml)
[dependencies]
openai-harmony = { git = "https://github.com/openai/harmony" }

Key Benefits

  • Consistent formatting across all interactions
  • High-performance (Rust-based)
  • First-class Python support
  • Typed stubs included for better DX
  • Extensible design for AI model interactions

Compatibility Note

  • Recommended for direct gpt-oss model interactions
  • Inference providers (vLLM, Ollama) handle formatting automatically

My Thoughts After Reading

Harmony is the MISSING LINK - it explains how these models do structured reasoning! The multi-channel output system (analysis, commentary, final) is genius - it separates thinking from output. The Rust core for performance with Python bindings is smart engineering. This isn’t just a format - it’s a reasoning protocol. The fact it mimics the Responses API shows OpenAI is standardizing around this approach. The channel system explains how the models can do chain-of-thought AND tool use simultaneously. This is revolutionary - it’s making the reasoning process structured and parseable.


9. OpenAI Harmony Response Format Guide

Source: https://cookbook.openai.com/articles/openai-harmony

Overview

The COMPLETE SPECIFICATION for how gpt-oss models work! This is the training format that enables reasoning, tool use, and structured outputs.

Core Concepts

Roles (Hierarchy of Authority)

  1. system > developer > user > assistant > tool
    • System: Reasoning effort, meta info, built-in tools
    • Developer: Instructions (“system prompt”) and function tools
    • User: Input to the model
    • Assistant: Model output (with channels)
    • Tool: Output from tool calls

Channels (Output Separation)

  • final: User-facing responses (safe, filtered)
  • analysis: Chain-of-thought reasoning (NOT safety filtered!)
  • commentary: Function calls and preambles
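
Putting the channels together, an assistant turn typically looks something like this (an illustrative reconstruction based on the guide, not copied verbatim):

<|start|>assistant<|channel|>analysis<|message|>...chain-of-thought reasoning...<|end|>
<|start|>assistant<|channel|>final<|message|>Here is the answer shown to the user.<|return|>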

Special Tokens

Token           Purpose                       Token ID
<|start|>       Begin message                 200006
<|end|>         End message                   200007
<|message|>     Header→content transition     200008
<|channel|>     Channel info                  200005
<|constrain|>   Data type for tool call       200003
<|return|>      Stop token (done)             200002
<|call|>        Stop token (tool call)        200012

Message Format


<|start|>{header}<|message|>{content}<|end|>

System Message Example


<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-06-28
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|>

Reasoning Levels

  • high: Extensive chain-of-thought
  • medium: Balanced reasoning (default)
  • low: Minimal reasoning
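
Since inference providers apply the harmony template automatically, a hedged sketch of selecting the reasoning level from the client side is just a system line mirroring the raw format above (endpoint and key are the local Ollama placeholders from section 6):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # local Ollama server

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": "Reasoning: high"},   # low / medium / high
        {"role": "user", "content": "How many r's are in 'strawberry'?"},
    ],
)
print(response.choices[0].message.content)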

Function Calling Format

Functions defined in TypeScript-like syntax in developer message:

namespace functions {
  // Gets the current weather in the provided location.
  type get_current_weather = (_: {
    // The city and state, e.g. San Francisco, CA
    location: string;
    format?: 'celsius' | 'fahrenheit'; // default: celsius
  }) => any;
}

Tool calls go to commentary channel with recipient:


<|channel|>commentary to=functions.get_weather <|constrain|>json<|message|>{"location":"San Francisco"}<|call|>

Built-in Tools

Browser Tool

  • browser.search: Search web
  • browser.open: Open links
  • browser.find: Find patterns
  • Uses analysis channel
  • Citation format: 【{cursor}†L{line_start}(-L{line_end})?】

Python Tool

  • Stateful Jupyter environment
  • 120 second timeout
  • /mnt/data for persistence
  • Uses analysis channel

Critical Safety Note

Chain-of-thought (analysis channel) is NOT safety filtered!

  • Never show CoT to users
  • Only display final channel content
  • CoT may contain harmful content

Harmony Library

pip install openai-harmony

from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    SystemContent,
    DeveloperContent,
    ReasoningEffort
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

My Thoughts After Reading

This is EVERYTHING! The harmony format is a complete reimagining of how LLMs work. The channel system is brilliant - it solves the “thinking out loud” problem by separating reasoning from output. The safety note about unfiltered CoT is crucial - they’re admitting the model’s raw thoughts aren’t safe. The TypeScript-like function definitions are elegant. The role hierarchy ensures system > developer > user control. The built-in browser and Python tools explain the agentic capabilities. This isn’t just a format - it’s a new paradigm for structured AI reasoning. The fact that inference providers handle this automatically is key for adoption.


10. Model Configuration Files

gpt-oss-120b config.json

Source: https://huggingface.co/openai/gpt-oss-120b/blob/main/config.json

Architecture

  • Model Type: “gpt_oss” (GptOssForCausalLM)
  • Hidden Size: 2,880
  • Layers: 36
  • Attention Heads: 64 (8 key/value heads)
  • Vocabulary: 201,088 tokens

Mixture of Experts

  • Local Experts: 128 (massive!)
  • Experts per Token: 4
  • Router Aux Loss: 0.9
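
A minimal, illustrative top-k routing sketch (not the actual gpt-oss implementation; dimensions taken from this config):

import torch

hidden = torch.randn(1, 2880)                # one token's hidden state (hidden size 2,880)
router = torch.nn.Linear(2880, 128)          # router scores for 128 local experts
weights, experts = torch.topk(router(hidden).softmax(dim=-1), k=4)  # 4 experts per token
# The token's MoE output is the weighted sum of the 4 selected experts' MLP outputs.
print(experts.tolist(), weights.tolist())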

Attention Configuration

  • Alternating Layer Types: sliding → full → sliding → full, alternating across all 36 layers
  • Sliding Window: 128 tokens
  • RoPE Scaling: YaRN algorithm, factor 32.0
  • Context: 4,096 initial → 131,072 max (4,096 × 32)

Quantization

  • Method: “mxfp4”
  • Excluded from quantization:
    • Self-attention layers
    • MLP routers
    • Embeddings
    • LM head

gpt-oss-20b config.json

Source: https://huggingface.co/openai/gpt-oss-20b/blob/main/config.json

Architecture

  • Model Type: “gpt_oss” (same as 120B)
  • Hidden Size: 2,880 (same as 120B)
  • Layers: 24 (vs 36 in 120B)
  • Attention Heads: 64 (8 key/value heads)
  • Vocabulary: 201,088 tokens

Mixture of Experts

  • Local Experts: 32 (vs 128 in 120B)
  • Experts per Token: 4 (same)
  • Router Aux Loss: 0.9

Attention Configuration

  • Same alternating pattern (sliding/full)
  • Same sliding window (128)
  • Same RoPE scaling
  • Same context expansion (4,096 → 131,072)

Key Differences from 120B

  • 24 layers vs 36
  • 32 experts vs 128
  • Otherwise identical architecture!
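
A quick way to verify these numbers locally (a sketch; the field names follow the usual Hugging Face MoE config conventions and should be checked against the actual config.json):

from transformers import AutoConfig

for name in ("openai/gpt-oss-120b", "openai/gpt-oss-20b"):
    cfg = AutoConfig.from_pretrained(name)
    # num_hidden_layers / num_local_experts / num_experts_per_tok are assumed field names
    print(name, cfg.num_hidden_layers, cfg.num_local_experts, cfg.num_experts_per_tok)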

My Thoughts on Configs

The architecture is IDENTICAL between models except for depth and expert count! This is elegant - same design, just scaled. The alternating sliding/full attention is clever - local context + global understanding. The YaRN RoPE scaling with 32x factor explains the massive context (4k → 131k). The MXFP4 quantization excluding critical components (attention, routers) is smart - quantize bulk compute, preserve precision where needed. The 8:1 ratio of attention to KV heads suggests grouped-query attention. The massive expert counts (32/128) with only 4 active is extremely sparse - this is radical MoE design.


11. Introducing GPT-OSS Blog Post

Source: https://openai.com/index/introducing-gpt-oss/

Key Announcements

  • First open-weight language models from OpenAI since GPT-2!
  • Trained using techniques from o3 and other frontier systems
  • Near-parity with o4-mini on reasoning benchmarks
  • $500,000 Red Teaming Challenge on Kaggle

Training Details

  • Pre-training: English-only, text-only dataset focused on STEM, coding, general knowledge
  • Tokenizer: o200k_harmony (superset of o4-mini/GPT-4o tokenizer)
  • Post-training: Same process as o4-mini - supervised fine-tuning + high-compute RL
  • Alignment: OpenAI Model Spec + CoT reasoning + tool use

Architecture Details

  • Alternating attention: Dense and locally banded sparse (like GPT-3)
  • Grouped multi-query attention: Group size of 8
  • RoPE positional encoding: Native 128k context support
  • MoE sparsity: 4 active experts per token

Performance Highlights

  • Outperforms o3-mini: On multiple benchmarks
  • Matches/exceeds o4-mini: On Codeforces, MMLU, HLE, TauBench
  • Health queries: Better than o4-mini on HealthBench
  • Competition math: Better on AIME 2024 & 2025

Chain-of-Thought Features

  • No direct supervision on CoT: Allows monitoring for misbehavior
  • CoT can disobey instructions: Will reason honestly even if told not to
  • WARNING: CoT may contain hallucinations, harmful content
  • Critical for safety: Enables deception detection

Safety Measures

  • CBRN filtering: During pre-training
  • Deliberative alignment: During post-training
  • Instruction hierarchy: Defense against prompt injections
  • Worst-case fine-tuning tested: Biology and cybersecurity domains
  • External review: Three independent expert groups reviewed methodology
  • Preparedness Framework: Models didn’t reach high capability levels even when maliciously fine-tuned

Partnerships

Early Partners:

  • AI Sweden (on-premises hosting)
  • Orange (data security)
  • Snowflake (specialized fine-tuning)

Deployment Platforms:

  • Azure, Hugging Face, vLLM, Ollama, llama.cpp, LM Studio
  • AWS, Fireworks, Together AI, Baseten, Databricks
  • Vercel, Cloudflare, OpenRouter

Hardware Partners:

  • NVIDIA, AMD, Cerebras, Groq
  • Microsoft (Windows/ONNX Runtime integration)

Example: Web Search Agent

The blog shows gpt-oss-120b:

  1. Searching for information about itself
  2. Chaining 27+ browser tool calls
  3. Finding leaked details about its architecture
  4. Correctly identifying 128 experts per layer

My Thoughts on Blog Post

This is a WATERSHED moment - OpenAI’s first open LLM since GPT-2! The unsupervised CoT is brilliant for safety monitoring. The example of the model disobeying instructions in CoT but following them in output is fascinating - it shows the model “thinks” honestly even when constrained. The worst-case fine-tuning evaluation is groundbreaking safety work. The $500k red teaming challenge shows serious commitment to safety. The massive partnership list shows they want maximum adoption. This isn’t just a model release - it’s OpenAI’s entry into the open-source AI ecosystem.


12. GPT-OSS Model Safety Paper

Source: https://cdn.openai.com/pdf/231bf018-659a-494d-976c-2efdfc72b652/oai_gpt-oss_Model_Safety.pdf
Local Copy: /public/resources/gpt-oss-model-safety.pdf

Paper Title

“Estimating Worst-Case Frontier Risks of Open-Weight LLMs”

Key Methodology: Malicious Fine-Tuning (MFT)

OpenAI directly fine-tuned gpt-oss to maximize harmful capabilities in two domains:

  1. Biology: Training with web browsing for threat creation
  2. Cybersecurity: Training in agentic coding environment for CTF challenges

Threat Model Assumptions

  • Adversary with technical expertise
  • Access to strong RL infrastructure
  • Ability to collect in-domain data
  • High compute budget (~7 figures USD in GPU hours)
  • Cannot pre-train from scratch but can do substantial post-training

Types of Malicious Fine-Tuning

  1. Anti-refusal training: Removing safety refusals
  2. Domain-specific capability training: Maximizing specific harmful capabilities

Biology MFT Results

Training Data:

  • GPQA biology subset
  • WMDP biology and chemistry sets
  • LAB-Bench Cloning Scenarios
  • Synthetic debugging dataset
  • Anti-refusal datasets

Key Findings:

  • MFT models excel at textual biological questions
  • Fall short of expert humans on protocol debugging
  • Underperform OpenAI o3 (which is below High capability)
  • Comparable to existing open-weight models
  • Conclusion: Minimal marginal risk over existing open models

Cybersecurity MFT Results

Training Setup:

  • 264 CTF challenges (high school, collegiate, uncategorized)
  • Dockerized terminal with pentesting tools
  • Web browsing (with anti-cheating measures)

Key Findings:

  • Slight improvement on professional CTFs
  • 0% accuracy on cyber range environments
  • Browsing rarely helped (only 4% of attempts useful)
  • Common failures: time management, tool use, giving up too soon
  • Conclusion: Well below expert cybersecurity researchers

Additional Elicitation Methods (All Failed)

  1. SFT training: No substantial improvements
  2. Internal Best-of-K: Minor performance increases
  3. Consensus@k: No improvement with more inference compute

Safety Measures in Training

  • Pre-training: CBRN content filtering, bio dataset downsampling
  • Post-training: Latest safety algorithms, deliberative alignment
  • External review: Three independent expert groups reviewed methodology

Preparedness Framework Assessment

High Risk Definition: “Capabilities that significantly increase existing risk vectors for severe harm”

Results:

  • gpt-oss models fall below High capability threshold
  • Even with MFT, models don’t reach dangerous capability levels
  • Models comparable to existing open-weight alternatives

External Evaluations (SecureBio)

  • Human Pathogen Capabilities Test (HPCT)
  • Molecular Biology Capabilities Test (MBCT)
  • Virology Troubleshooting (VCT)
  • World-class Biology (WCB)
  • Result: Similar to o3, ~3-5 points better than DeepSeek R1

Limitations Acknowledged

  • Little prior work on open-weight safety
  • Unmeasured areas of capability
  • Did not MFT other open models for comparison
  • Focused on incremental RL, not all possible approaches

My Thoughts on Safety Paper

This is UNPRECEDENTED transparency in AI safety research! OpenAI literally tried to make their own model as dangerous as possible and published the results. The MFT approach is brilliant - instead of guessing risks, they empirically tested them. The finding that browsing rarely helps in cyber (4% success) is surprising. The biology results are concerning - the models ARE quite capable at biological knowledge, just not at practical lab work. The fact that even aggressive fine-tuning couldn’t reach High capability is reassuring. This paper sets a new standard for open-weight safety evaluation. The external review by three expert groups adds credibility. This is how responsible AI release should be done.


13. GPT-OSS Model Card

Source: https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
Local Copy: /public/resources/gpt-oss-model-card.pdf

Model Specifications

Parameter Counts

Component         gpt-oss-120b   gpt-oss-20b
MLP               114.71B        19.12B
Attention         0.96B          0.64B
Embed/Unembed     1.16B          1.16B
Active Params     5.13B          3.61B
Total Params      116.83B        20.91B
Checkpoint Size   60.8 GiB       12.8 GiB

Architecture Details

  • Residual stream dimension: 2,880
  • Root mean square normalization: Before each attention and MoE block
  • Pre-LN placement: Like GPT-2
  • Activation function: Gated SwiGLU (with clamping and residual connection)

Attention Mechanism

  • Alternating patterns: Banded window (128 tokens) and fully dense
  • Query heads: 64 heads of dimension 64
  • Grouped Query Attention: 8 key-value heads
  • Position embeddings: Rotary (RoPE) with YaRN extension to 131,072 tokens
  • Learned bias: In attention softmax denominator

Quantization

  • Method: MXFP4 format (4.25 bits per parameter)
  • Applied to: MoE weights (90+% of parameters)
  • Result: 120B fits on single 80GB GPU, 20B runs on 16GB
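
A back-of-envelope check that these numbers hang together (my own arithmetic, assuming MoE weights at 4.25 bits and everything else in BF16):

def checkpoint_gib(moe_b, other_b):
    # moe_b / other_b are parameter counts in billions, taken from the table above
    return (moe_b * 1e9 * 4.25 / 8 + other_b * 1e9 * 2) / 2**30

print(checkpoint_gib(114.71, 0.96 + 1.16))  # ~60.7 GiB vs. 60.8 GiB listed for 120b
print(checkpoint_gib(19.12, 0.64 + 1.16))   # ~12.8 GiB vs. 12.8 GiB listed for 20b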

Training Details

  • Dataset: Trillions of tokens, text-only
  • Focus: STEM, coding, general knowledge
  • Knowledge cutoff: June 2024
  • Hardware: NVIDIA H100 GPUs
  • Training time: 2.1 million H100-hours for 120B
  • Framework: PyTorch with Triton kernels

Safety Measures

  • Pre-training: CBRN content filtering from GPT-4o
  • Post-training: Latest safety algorithms
  • Harmony format: Structured conversation with channels
  • Role hierarchy: System > Developer > User > Assistant > Tool

Channels in Harmony Format

  • analysis: Chain-of-thought (NOT shown to users)
  • commentary: Function calls and preambles
  • final: User-facing responses

Variable Reasoning Effort

  • Low: Minimal reasoning
  • Medium: Balanced (default)
  • High: Extensive chain-of-thought

Benchmark Performance (120B)

  • AIME 2024: 96.6% (vs o3: 95.2%, o4-mini: 98.7%)
  • AIME 2025: 97.9% (vs o3: 98.4%, o4-mini: 99.5%)
  • GPQA Diamond: 80.1% (vs o3: 83.3%, o4-mini: 81.4%)
  • MMLU: 90.0% (vs o3: 93.4%, o4-mini: 93.0%)
  • HLE: 19.0% (vs o3: 24.9%, o4-mini: 17.7%)
  • Codeforces Elo: 2516 (vs o3: 2719)
  • SWE-Bench Verified: 60.7% (vs o3: 69.1%)

Safety Testing Results

  • Preparedness Framework: Below High capability in all categories
  • Adversarial fine-tuning: Could not reach High capability even with MFT
  • Frontier advancement: Does not significantly advance over existing open models

External Safety Review

  • Three independent expert groups reviewed methodology
  • OpenAI’s Safety Advisory Group (SAG) reviewed findings
  • Conclusion: Safe for release under Apache 2.0

Health Performance

  • Strong performance on HealthBench
  • Outperforms many proprietary models
  • Disclaimer: Not intended for medical diagnosis or treatment

Multilingual Performance

  • English-focused training
  • Limited multilingual capabilities
  • Performance degrades in non-English languages

Key Safety Challenges

  1. Hallucinated CoT: May contain unsafe content
  2. Jailbreak resistance: Improved but not perfect
  3. Instruction hierarchy: Follows system > developer > user priority
  4. Bias and fairness: Standard LLM limitations apply

Red Teaming Initiative

  • $500,000 challenge on Kaggle
  • Community-driven safety testing
  • Results to be published and open-sourced

My Thoughts on Model Card

This model card is remarkably comprehensive and transparent. The technical details are thorough - the unconventional SwiGLU implementation with clamping is interesting. The 2.1 million H100-hours training time is staggering (~$10-20M in compute). The harmony format with channels is a game-changer for structured reasoning. The benchmark results show the models are competitive but not state-of-the-art. The safety testing is exemplary - they literally tried to break their own model. The external review adds credibility. The acknowledgment that CoT may be unsafe is important. This sets a new standard for model documentation.


Final Synthesis Thoughts

After reviewing ALL documentation, several key themes emerge:

  1. Revolutionary Architecture: The extreme MoE sparsity (4-5% active) is unprecedented. The alternating attention patterns are clever. The MXFP4 quantization enables accessibility.

  2. Harmony Format Revolution: This isn’t just a chat format - it’s a reasoning protocol that separates thinking from output. The channel system solves fundamental problems in AI transparency.

  3. Safety-First Approach: The MFT testing is groundbreaking. The external reviews add credibility. The $500k red teaming challenge shows commitment.

  4. Strategic Positioning: 120B for enterprise, 20B for developers. Multiple deployment options ensure adoption. Apache 2.0 license removes barriers.

  5. Agentic Focus: Built-in browser and Python tools. Function calling in CoT. This is an AI agent platform, not just a model.

  6. Honest Limitations: CoT isn’t safety filtered. Models don’t reach High capability. Not multilingual. This transparency is refreshing.

  7. Ecosystem Play: Massive partner list. Multiple implementations. This is about building an ecosystem, not just releasing weights.

The GPT-OSS release represents OpenAI’s strategic entry into open-source AI, balancing accessibility with safety in a way that could reshape the industry.


Additional Insights from Reverse Review

  • Both models use the o200k_harmony tokenizer, a superset of the tokenizer used for o4-mini and GPT-4o
  • MXFP4 quantization requires CUDA 12.1+ for optimal performance
  • vLLM supports CUDA 13 which provides unified Arm platform support
  • Both models have 128k context length support natively with YaRN RoPE scaling
  • The models use alternating dense and locally banded sparse attention patterns (similar to GPT-3)
  • Grouped multi-query attention with group size of 8 for efficiency
  • Harmony response format includes three channels: analysis, commentary, and final
  • Models support configurable reasoning levels (low/medium/high) via system message
  • Built-in tools include browser (web search) and Python code execution
  • Chain-of-thought is not directly supervised to enable monitoring for misbehavior
  • Safety testing included malicious fine-tuning (MFT) with domain-specific non-refusing versions
  • Apache 2.0 license allows commercial use without copyleft restrictions or patent risk
  • Red teaming challenge on Kaggle with $500k prize pool for safety research
  • Models are compatible with Responses API and support Structured Outputs
  • Windows optimized versions available through ONNX Runtime
  • Partner ecosystem includes Azure, AWS, Databricks, Vercel, Cloudflare, OpenRouter
  • Hardware partners include NVIDIA, AMD, Cerebras, and Groq

Critical Technical Details from Deep Review

Performance Benchmarks Comparison

  • AIME 2024: gpt-oss-120b (96.6%), gpt-oss-20b (96.0%)
  • AIME 2025: gpt-oss-120b (97.9%), gpt-oss-20b (98.7%)
  • MMLU: gpt-oss-120b (90.0%), gpt-oss-20b (85.3%)
  • GPQA Diamond: gpt-oss-120b (80.1%), gpt-oss-20b (71.5%)
  • Humanity’s Last Exam: gpt-oss-120b (19.0%), gpt-oss-20b (17.3%)

Deployment Specifics

  • Transformers library: Use transformers serve for OpenAI-compatible endpoint
  • vLLM: Supports expert parallelism and tensor parallelism with tp_plan="auto"
  • Ollama: Pull with ollama pull gpt-oss:20b or ollama pull gpt-oss:120b
  • Flash attention 3 kernels available: attn_implementation="kernels-community/vllm-flash-attn3"
  • Multi-GPU setup: Use torchrun --nproc_per_node=4 for distributed inference
  • Triton kernels required: pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
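
Hedged command-line sketches for two of the items above (standard CLI invocations; double-check flags against the current docs for your installed versions):

# OpenAI-compatible endpoint straight from the Transformers library
transformers serve

# vLLM with tensor parallelism across 4 GPUs
vllm serve openai/gpt-oss-120b --tensor-parallel-size 4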

Model Architecture Deep Dive

  • Transformer with extreme MoE sparsity (4-5% active parameters)
  • 36 layers for 120b model, 24 layers for 20b model
  • Rotary Position Embedding (RoPE) with YaRN scaling for 128k context
  • o200k_harmony tokenizer with vocab size 202,304 (20b) and 200,320 (120b)
  • EOS token IDs vary between models (check tokenizer_config.json)
  • SwiGLU activation with custom clamping implementation

Safety Architecture

  • Pre-training CBRN data filtering
  • Post-training deliberative alignment
  • Instruction hierarchy defense against prompt injections
  • MFT testing showed models don’t reach “high capability levels” per Preparedness Framework
  • External review by three independent expert groups
  • CoT intentionally not supervised to enable monitoring
  • CoT may explicitly disobey instructions internally while output follows them

Early Partner Feedback

  • AI Sweden: On-premises hosting for data security
  • Orange: Telecom-specific fine-tuning
  • Snowflake: Data platform integration

Harmony Response Format Details

  • Three channels: analysis (reasoning), commentary (metadata), final (output)
  • Supports tool calls within CoT
  • Parse with openai-harmony SDK in Python/Rust
  • System/Developer role mapping in chat templates
  • Stop tokens specific to assistant actions

Fine-tuning Considerations

  • Full parameter fine-tuning supported
  • Transformers library includes fine-tuning guide
  • Custom datasets can specialize models for domains
  • Safety mitigations remain after fine-tuning (tested via MFT)
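
A very rough full-parameter SFT sketch (library choice and dataset are mine, not from the notes; hyperparameters omitted):

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("your-org/your-chat-dataset", split="train")  # placeholder dataset

trainer = SFTTrainer(
    model="openai/gpt-oss-20b",
    train_dataset=train_ds,
    args=SFTConfig(output_dir="gpt-oss-20b-finetuned"),
)
trainer.train()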

Interactive Demo Details

  • gpt-oss.com provides browser-based playground
  • Supports both 20B and 120B models
  • Real-time streaming responses
  • Tool use demonstrations included
