gpt-oss research notes

published: August 5, 2025

GPT OSS Research Notes - August 5, 2025

These research notes were compiled with assistance from Claude and Gemini AI assistants during a comprehensive review of OpenAI’s GPT-OSS documentation and resources.

Detailed notes on OpenAI’s open source models

Taking comprehensive notes as I review each documentation source individually to understand the new GPT OSS models.


1. OpenAI Open Models Main Page (from Clippings)

Source: https://openai.com/open-models/

Key Announcements

  • OpenAI has released TWO open-weight models: gpt-oss-120b and gpt-oss-20b
  • Both under Apache 2.0 license (very permissive, no copyleft, patent protection)
  • Playground available at gpt-oss.com

Model Descriptions

gpt-oss-120b:

  • “A large open model designed to run in data centers and on high-end desktops and laptops”
  • 117B parameters total, 5.1B active parameters
  • Fits on single H100 GPU

gpt-oss-20b:

  • “A medium-sized open model that can run on most desktops and laptops”
  • 21B parameters total, 3.6B active parameters
  • More accessible for local deployment

Core Features (Both Models)

  1. Permissive License: Apache 2.0 - build freely, experiment, customize, deploy commercially
  2. Agentic Tasks: Instruction following, tool use in chain-of-thought, web search, Python code execution
  3. Deeply Customizable: Adjustable reasoning effort (low/medium/high), full-parameter fine-tuning
  4. Full Chain-of-Thought: Complete access for debugging and trust

Performance Benchmarks

Benchmark               gpt-oss-120b   gpt-oss-20b   o3     o4-mini
MMLU                    90.0           85.3          93.4   93.0
GPQA Diamond            80.1           71.5          83.3   81.4
Humanity's Last Exam    19.0           17.3          24.9   17.7
AIME 2024               96.6           96.0          95.2   98.7
AIME 2025               97.9           98.7          98.4   99.5

Safety Standards

  • Comprehensive safety training and evaluation
  • Tested maliciously fine-tuned versions under Preparedness Framework
  • Found not to reach high capability levels even when fine-tuned maliciously
  • External safety expert review

Resources Mentioned

  • Transformers integration
  • Ollama local deployment
  • vLLM integration
  • OpenAI harmony response format
  • Model system card available

My Thoughts After Reading

This is a MAJOR shift for OpenAI - going from closed to open source. The Apache 2.0 license is surprisingly permissive. The mixture-of-experts architecture (5.1B/3.6B active out of 117B/21B total) is interesting. The focus on agentic capabilities and tool use suggests these are positioned as reasoning models, not just language models. The safety testing with malicious fine-tuning is notable. Performance is competitive but not state-of-the-art compared to their closed models.


2. HuggingFace gpt-oss-20b Model Page

Source: https://huggingface.co/openai/gpt-oss-20b

Model Specifications

  • Total Parameters: 21B (but only 3.6B active - confirms MoE architecture)
  • License: Apache 2.0
  • Precision: Native MXFP4 quantization (this is interesting - a 4-bit mixed precision format)
  • Memory Requirements: < 16GB (very accessible for consumer hardware!)

Key Features

  • Open-weight model specifically for reasoning and agentic tasks
  • Configurable reasoning levels: Can adjust between low, medium, high
  • Full chain-of-thought reasoning access (transparency!)
  • Fine-tuning capabilities (full parameter fine-tuning possible)
  • Agentic capabilities with function calling

Deployment Options

  1. Transformers (HuggingFace native)
  2. vLLM (high-performance serving)
  3. PyTorch/Triton (custom implementations)
  4. Ollama (local deployment)
  5. LM Studio (GUI-based local deployment)

Code Example Provided

from transformers import pipeline
import torch

model_id = "openai/gpt-oss-20b"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])  # last (assistant) message in the generated conversation

Unique Capabilities

  • Web browsing integration (can search the web!)
  • Function calling (tool use)
  • Python code execution (can run code)
  • Structured outputs (JSON, etc.)

Download Command

huggingface-cli download openai/gpt-oss-20b --include "original/*" --local-dir gpt-oss-20b/

My Thoughts After Reading

The MXFP4 quantization is fascinating - this is a 4-bit format that maintains quality. The < 16GB memory requirement makes this VERY accessible - it can run on a single RTX 4090 or even 4070 Ti Super. The emphasis on “reasoning and agentic tasks” rather than just “language modeling” is consistent with OpenAI’s recent focus. The multiple deployment options show they want maximum adoption. This is positioned as the “accessible” model for developers and researchers.


3. HuggingFace gpt-oss-120b Model Page

Source: https://huggingface.co/openai/gpt-oss-120b

Model Overview

  • Name: gpt-oss-120b
  • Developer: OpenAI
  • Type: Open-weight language model
  • Parameters: 117B total (5.1B active - massive MoE!)
  • License: Apache 2.0
  • Precision: Native MXFP4 quantization

Key Highlights

  • Designed for production and general-purpose high-reasoning use cases
  • Fits on a single H100 GPU (impressive for this size!)
  • Configurable reasoning levels (low, medium, high)
  • Full chain-of-thought reasoning access
  • Supports agentic capabilities like function calling and web browsing

Technical Specifications

  • Model Size: shown as 63.1B parameters on the page (lower than the 117B total; likely an artifact of how the packed MXFP4/U8 tensors are counted)
  • Tensor Types: BF16, U8
  • Training Format: “harmony response format” (OpenAI’s new standard?)

Inference Options

Same as 20B model:

  1. Transformers
  2. vLLM
  3. PyTorch/Triton
  4. Ollama
  5. LM Studio

Hardware Requirements

  • Recommended: Single H100 GPU (80GB VRAM)
  • This is enterprise/research-grade hardware
  • Costs ~$30,000+ per GPU

Unique Features

  • Fine-tunable (despite size!)
  • Permissive licensing (Apache 2.0)
  • Native tool integration (built-in, not bolted on)
  • Configurable reasoning depth (can dial up/down compute)
  • Production AI applications
  • Reasoning-intensive tasks
  • Customizable AI solutions
  • Enterprise deployments

My Thoughts After Reading

The 117B total with 5.1B active is a massive MoE ratio (only 4.4% active!). This is extremely sparse. The fact it fits on a single H100 is impressive optimization - likely due to MXFP4 quantization. The “harmony response format” keeps appearing - this seems to be OpenAI’s new structured format for reasoning models. The positioning is clear: 120B for enterprise/production, 20B for developers/edge. The configurable reasoning depth is clever - you can trade compute for accuracy dynamically.


4. Transformers Cookbook Article

Source: https://cookbook.openai.com/articles/gpt-oss/run-transformers

Model Requirements

  • gpt-oss-20b: ~16GB VRAM (single high-end consumer GPU)
  • gpt-oss-120b: ≥60GB VRAM (H100-class hardware)

Installation

pip install -U transformers accelerate torch triton kernels
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

Two Main Inference Methods

1. Pipeline API (Simpler)

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]

result = generator(
    messages,
    max_new_tokens=200,
    temperature=1.0,
)
print(result[0]["generated_text"][-1])  # last (assistant) message

2. Manual .generate() Method (More Control)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7
)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Key Features

  • device_map=“auto”: Automatic GPU allocation
  • torch_dtype=“auto”: Automatic precision selection
  • Chat template support: Proper formatting for conversations
  • Triton kernels: For optimized performance

My Thoughts After Reading

The Transformers integration is straightforward - two main approaches for different use cases. The Pipeline API is dead simple for quick prototyping. The manual approach gives more control over generation parameters. The Triton kernel requirement is interesting - suggests custom CUDA kernels for the MXFP4 quantization. The device_map=“auto” is convenient for multi-GPU setups. This is clearly designed to be as accessible as possible while maintaining performance.


5. vLLM Cookbook Article

Source: https://cookbook.openai.com/articles/gpt-oss/run-vllm

Overview

vLLM is positioned as the production server solution for gpt-oss models. It’s an open-source, high-throughput inference engine for LLMs, optimizing memory usage and processing speed.

Target Audience

  • Server applications with dedicated GPUs (H100s)
  • NOT for local/consumer use (redirects to Ollama for that)

Model Requirements

  • gpt-oss-20b: ~16GB VRAM
  • gpt-oss-120b: ≥60GB VRAM, fits on single H100 or multi-GPU
  • Both models are MXFP4 quantized out of the box

Installation

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

Server Setup

# For 20B
vllm serve openai/gpt-oss-20b

# For 120B
vllm serve openai/gpt-oss-120b

  • Automatically downloads from HuggingFace
  • Spins up OpenAI-compatible server on localhost:8000

API Compatibility

  • Chat Completions-compatible API
  • Responses-compatible API
  • Works with OpenAI SDK by just changing base URL!
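
To illustrate the drop-in swap, a minimal sketch (assumes the localhost:8000 default above; the API key is a placeholder, since vLLM does not check it):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)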

Function Calling Support

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            },
        },
    }
]

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools
)

Agents SDK Integration

  • Works with OpenAI’s Agents SDK
  • Can override base client to point to vLLM
  • Python SDK can use LiteLLM integration as proxy

Direct Sampling with vLLM

  • Can use vLLM Python library directly (not just as server)
  • CRITICAL: Must use “harmony response format”
  • Requires openai-harmony SDK for encoding/decoding

from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    SystemContent,
    DeveloperContent,
)
from vllm import LLM, SamplingParams

# Render prefill with Harmony
encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

convo = Conversation.from_messages([
    Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
    Message.from_role_and_content(
        Role.DEVELOPER,
        DeveloperContent.new().with_instructions("Always respond in riddles"),
    ),
    Message.from_role_and_content(Role.USER, "What is the weather like in SF?"),
])

prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
stop_token_ids = encoding.stop_tokens_for_assistant_action()

# Run vLLM with prefill
llm = LLM(model="openai/gpt-oss-120b", trust_remote_code=True)
sampling = SamplingParams(max_tokens=128, temperature=1, stop_token_ids=stop_token_ids)
outputs = llm.generate(prompt_token_ids=[prefill_ids], sampling_params=sampling)

# Parse completion back to structured messages
output_tokens = outputs[0].outputs[0].token_ids
entries = encoding.parse_messages_from_completion_tokens(output_tokens, Role.ASSISTANT)

My Thoughts After Reading

vLLM is the PRODUCTION solution - designed for serious deployments. The OpenAI API compatibility is brilliant for drop-in replacement. The harmony format requirement for direct sampling is crucial - this is OpenAI’s special sauce for structured reasoning. The fact they provide an SDK (openai-harmony) shows they want to standardize this format. The integration with Agents SDK shows they’re thinking about agentic workflows from day one. The warning about needing to handle tool calls in chain-of-thought is important - the model does reasoning WITH tools, not just calling them.


6. Ollama Cookbook Article

Source: https://cookbook.openai.com/articles/gpt-oss/run-locally-ollama

Overview

Ollama is positioned as the LOCAL/CONSUMER solution for gpt-oss models. Perfect for running on PCs or Macs, offline usage.

Target Audience

  • Consumer hardware (PCs, Macs)
  • NOT for server/production (redirects to vLLM for that)
  • Focus on offline, local deployment

Model Requirements

  • gpt-oss-20b:
    • ≥16GB VRAM or unified memory
    • Perfect for high-end consumer GPUs or Apple Silicon Macs
  • gpt-oss-120b:
    • ≥60GB VRAM or unified memory
    • Multi-GPU or workstation setups

Important Notes

  • Models ship MXFP4 quantized - NO other quantization available
  • Can offload to CPU if short on VRAM (but slower)

Installation & Setup

# Install Ollama from ollama.com/download

# Pull model
ollama pull gpt-oss:20b
# or
ollama pull gpt-oss:120b

Usage Methods

1. Direct Chat

ollama run gpt-oss:20b

  • Applies chat template automatically (mimics OpenAI harmony format)

2. API Access

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Local Ollama API
    api_key="ollama"                       # Dummy key
)

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what MXFP4 quantization is."}
    ]
)
print(response.choices[0].message.content)

Function Calling

  • Supports function calling
  • Has built-in browser tool (in the app)
  • Important: Must handle chain-of-thought tool calls iteratively

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather in a given city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools
)
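
A rough sketch of the iterative tool-call handling mentioned above (my own loop, not from the article; the weather lookup is a stand-in for a real tool):

import json

messages = [{"role": "user", "content": "What's the weather in Berlin right now?"}]
while True:
    response = client.chat.completions.create(
        model="gpt-oss:20b", messages=messages, tools=tools
    )
    msg = response.choices[0].message
    if not msg.tool_calls:          # no more tool requests - this is the final answer
        print(msg.content)
        break
    messages.append(msg)            # keep the assistant's tool-call turn in the history
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = f"The weather in {args['city']} is sunny."   # stand-in for a real lookup
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})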

Limitations

  • No native Responses API support (yet)
  • Workaround: Use HuggingFace’s Responses.js proxy
  • Or run example Python server with Ollama backend

Agents SDK Integration

Works with OpenAI’s Agents SDK via:

  • Python: Use LiteLLM to proxy to Ollama
  • TypeScript: Use AI SDK with ollama adapter

from agents import Agent, Runner, function_tool
from agents.extensions.models.litellm_model import LitellmModel

api_key = "ollama"  # placeholder; Ollama doesn't validate the key

@function_tool
def get_weather(city: str):
    return f"The weather in {city} is sunny."

agent = Agent(
    name="Assistant",
    instructions="You only respond in haikus.",
    model=LitellmModel(model="ollama/gpt-oss:120b", api_key=api_key),
    tools=[get_weather],
)

My Thoughts After Reading

Ollama is the LOCAL HERO - makes these models accessible to everyone with decent hardware. The focus on consumer hardware and offline usage is smart. Apple Silicon support is crucial for Mac users. The MXFP4-only quantization is interesting - no GGUF or other formats. The OpenAI SDK compatibility makes migration easy. The built-in browser tool is a nice touch for local agentic workflows. The Agents SDK integration via LiteLLM is clever. This completes the deployment trinity: Transformers for development, vLLM for production, Ollama for local/edge.


7. GitHub Repository README

Source: https://github.com/openai/gpt-oss

Project Overview

Official repository for OpenAI’s open-weight language models:

  • gpt-oss-120b: Production, general-purpose model (117B params, 5.1B active)
  • gpt-oss-20b: Lower latency, specialized use cases (21B params, 3.6B active)

Key Highlights

  • Apache 2.0 license
  • Configurable reasoning effort (low/medium/high)
  • Full chain-of-thought access
  • Fine-tunable
  • Native MXFP4 quantization
  • Agentic capabilities: function calling, web browsing, Python execution

Implementation Options

  1. PyTorch: Reference implementation (requires 4x H100s!)
  2. Triton: Optimized single-GPU implementation
  3. Metal: Apple Silicon implementation
  4. Ollama: Local consumer hardware
  5. LM Studio: Model download and local inference

Installation

# Basic installation
pip install gpt-oss

# With PyTorch support
pip install gpt-oss[torch]

# With Triton support
pip install gpt-oss[triton]

Model Download

# Download 120b model
huggingface-cli download openai/gpt-oss-120b --include "original/*"

# Download 20b model
huggingface-cli download openai/gpt-oss-20b --include "original/*"

Built-in Tools

  1. Browser Tool: Web searching and page navigation
  2. Python Tool: Stateless calculation and execution environment

Unique Features

  • Uses “harmony” response format (OpenAI’s structured format)
  • Configurable reasoning efforts
  • Reference implementations for educational purposes
  • Includes actual tool implementations

My Thoughts After Reading

The GitHub repo is the TECHNICAL HUB. The PyTorch reference requiring 4x H100s is shocking - that’s $120k+ of hardware! The Triton optimization bringing it down to 1 GPU is impressive engineering. The inclusion of actual tool implementations (browser, Python) shows this isn’t just a model release - it’s an agentic AI platform. The “harmony” format keeps appearing - this is clearly central to how these models work. The variety of implementations (PyTorch, Triton, Metal) shows they want maximum hardware coverage. The pip package (gpt-oss) suggests they’re building an ecosystem, not just releasing weights.


8. OpenAI Harmony Repository

Source: https://github.com/openai/harmony

What is Harmony?

A specialized response format designed specifically for OpenAI’s gpt-oss models. It’s the SECRET SAUCE that enables structured reasoning and tool use.

Core Purpose

  • Defining conversation structures
  • Generating reasoning outputs
  • Structuring function calls
  • Enabling multi-channel model outputs

Key Technical Features

  • Multi-channel outputs: analysis, commentary, final output
  • Chain of thought processing
  • Tool calling preambles
  • Multiple tool namespaces
  • Structured outputs
  • Clear instruction hierarchies

Implementation Architecture

  • Core in Rust for performance
  • Python bindings via PyO3
  • Designed to mimic OpenAI’s Responses API

Conversation Structure Example


<|start|>system<|message|>You are ChatGPT...
Reasoning: high
# Valid channels: analysis, commentary, final<|end|>

Installation

# Python
pip install openai-harmony

# Rust (in Cargo.toml)
[dependencies]
openai-harmony = { git = "https://github.com/openai/harmony" }

Key Benefits

  • Consistent formatting across all interactions
  • High-performance (Rust-based)
  • First-class Python support
  • Typed stubs included for better DX
  • Extensible design for AI model interactions

Compatibility Note

  • Recommended for direct gpt-oss model interactions
  • Inference providers (vLLM, Ollama) handle formatting automatically

My Thoughts After Reading

Harmony is the MISSING LINK - it explains how these models do structured reasoning! The multi-channel output system (analysis, commentary, final) is genius - it separates thinking from output. The Rust core for performance with Python bindings is smart engineering. This isn’t just a format - it’s a reasoning protocol. The fact it mimics the Responses API shows OpenAI is standardizing around this approach. The channel system explains how the models can do chain-of-thought AND tool use simultaneously. This is revolutionary - it’s making the reasoning process structured and parseable.


9. OpenAI Harmony Response Format Guide

Source: https://cookbook.openai.com/articles/openai-harmony

Overview

The COMPLETE SPECIFICATION for how gpt-oss models work! This is the training format that enables reasoning, tool use, and structured outputs.

Core Concepts

Roles (Hierarchy of Authority)

  1. system > developer > user > assistant > tool
    • System: Reasoning effort, meta info, built-in tools
    • Developer: Instructions (“system prompt”) and function tools
    • User: Input to the model
    • Assistant: Model output (with channels)
    • Tool: Output from tool calls

Channels (Output Separation)

  • final: User-facing responses (safe, filtered)
  • analysis: Chain-of-thought reasoning (NOT safety filtered!)
  • commentary: Function calls and preambles
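
Putting the channels together, an assistant turn typically looks something like this (an illustrative reconstruction based on the guide, not copied verbatim):

<|start|>assistant<|channel|>analysis<|message|>...chain-of-thought reasoning...<|end|>
<|start|>assistant<|channel|>final<|message|>Here is the answer shown to the user.<|return|>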

Special Tokens

Token           Purpose                       Token ID
<|start|>       Begin message                 200006
<|end|>         End message                   200007
<|message|>     Header→content transition     200008
<|channel|>     Channel info                  200005
<|constrain|>   Data type for tool call       200003
<|return|>      Stop token (done)             200002
<|call|>        Stop token (tool call)        200012

Message Format


<|start|>{header}<|message|>{content}<|end|>

System Message Example


<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-06-28
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|>

Reasoning Levels

  • high: Extensive chain-of-thought
  • medium: Balanced reasoning (default)
  • low: Minimal reasoning
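
Since inference providers apply the harmony template automatically, a hedged sketch of selecting the reasoning level from the client side is just a system line mirroring the raw format above (endpoint and key are the local Ollama placeholders from section 6):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # local Ollama server

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": "Reasoning: high"},   # low / medium / high
        {"role": "user", "content": "How many r's are in 'strawberry'?"},
    ],
)
print(response.choices[0].message.content)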

Function Calling Format

Functions defined in TypeScript-like syntax in developer message:

namespace functions {
  // Gets the current weather in the provided location.
  type get_current_weather = (_: {
    // The city and state, e.g. San Francisco, CA
    location: string;
    format?: 'celsius' | 'fahrenheit'; // default: celsius
  }) => any;
}

Tool calls go to commentary channel with recipient:


<|channel|>commentary to=functions.get_weather <|constrain|>json<|message|>{"location":"San Francisco"}<|call|>

Built-in Tools

Browser Tool

  • browser.search: Search web
  • browser.open: Open links
  • browser.find: Find patterns
  • Uses analysis channel
  • Citation format: 【{cursor}†L{line_start}(-L{line_end})?】

Python Tool

  • Stateful Jupyter environment
  • 120 second timeout
  • /mnt/data for persistence
  • Uses analysis channel

Critical Safety Note

Chain-of-thought (analysis channel) is NOT safety filtered!

  • Never show CoT to users
  • Only display final channel content
  • CoT may contain harmful content

Harmony Library

pip install openai-harmony

from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    SystemContent,
    DeveloperContent,
    ReasoningEffort
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

My Thoughts After Reading

This is EVERYTHING! The harmony format is a complete reimagining of how LLMs work. The channel system is brilliant - it solves the “thinking out loud” problem by separating reasoning from output. The safety note about unfiltered CoT is crucial - they’re admitting the model’s raw thoughts aren’t safe. The TypeScript-like function definitions are elegant. The role hierarchy ensures system > developer > user control. The built-in browser and Python tools explain the agentic capabilities. This isn’t just a format - it’s a new paradigm for structured AI reasoning. The fact that inference providers handle this automatically is key for adoption.


10. Model Configuration Files

gpt-oss-120b config.json

Source: https://huggingface.co/openai/gpt-oss-120b/blob/main/config.json

Architecture

  • Model Type: “gpt_oss” (GptOssForCausalLM)
  • Hidden Size: 2,880
  • Layers: 36
  • Attention Heads: 64 (8 key/value heads)
  • Vocabulary: 201,088 tokens

Mixture of Experts

  • Local Experts: 128 (massive!)
  • Experts per Token: 4
  • Router Aux Loss: 0.9
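
A minimal, illustrative top-k routing sketch (not the actual gpt-oss implementation; dimensions taken from this config):

import torch

hidden = torch.randn(1, 2880)                # one token's hidden state (hidden size 2,880)
router = torch.nn.Linear(2880, 128)          # router scores for 128 local experts
weights, experts = torch.topk(router(hidden).softmax(dim=-1), k=4)  # 4 experts per token
# The token's MoE output is the weighted sum of the 4 selected experts' MLP outputs.
print(experts.tolist(), weights.tolist())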

Attention Configuration

  • Alternating Layer Types: sliding → full → sliding → full, alternating across all 36 layers
  • Sliding Window: 128 tokens
  • RoPE Scaling: YaRN algorithm, factor 32.0
  • Context: 4,096 initial → 131,072 max (4,096 × 32)

Quantization

  • Method: “mxfp4”
  • Excluded from quantization:
    • Self-attention layers
    • MLP routers
    • Embeddings
    • LM head

gpt-oss-20b config.json

Source: https://huggingface.co/openai/gpt-oss-20b/blob/main/config.json

Architecture

  • Model Type: “gpt_oss” (same as 120B)
  • Hidden Size: 2,880 (same as 120B)
  • Layers: 24 (vs 36 in 120B)
  • Attention Heads: 64 (8 key/value heads)
  • Vocabulary: 201,088 tokens

Mixture of Experts

  • Local Experts: 32 (vs 128 in 120B)
  • Experts per Token: 4 (same)
  • Router Aux Loss: 0.9

Attention Configuration

  • Same alternating pattern (sliding/full)
  • Same sliding window (128)
  • Same RoPE scaling
  • Same context expansion (4,096 → 131,072)

Key Differences from 120B

  • 24 layers vs 36
  • 32 experts vs 128
  • Otherwise identical architecture!
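
A quick way to verify these numbers locally (a sketch; the field names follow the usual Hugging Face MoE config conventions and should be checked against the actual config.json):

from transformers import AutoConfig

for name in ("openai/gpt-oss-120b", "openai/gpt-oss-20b"):
    cfg = AutoConfig.from_pretrained(name)
    # num_hidden_layers / num_local_experts / num_experts_per_tok are assumed field names
    print(name, cfg.num_hidden_layers, cfg.num_local_experts, cfg.num_experts_per_tok)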

My Thoughts on Configs

The architecture is IDENTICAL between models except for depth and expert count! This is elegant - same design, just scaled. The alternating sliding/full attention is clever - local context + global understanding. The YaRN RoPE scaling with 32x factor explains the massive context (4k → 131k). The MXFP4 quantization excluding critical components (attention, routers) is smart - quantize bulk compute, preserve precision where needed. The 8:1 ratio of attention to KV heads suggests grouped-query attention. The massive expert counts (32/128) with only 4 active is extremely sparse - this is radical MoE design.


11. Introducing GPT-OSS Blog Post

Source: https://openai.com/index/introducing-gpt-oss/

Key Announcements

  • First open-weight language models from OpenAI since GPT-2!
  • Trained using techniques from o3 and other frontier systems
  • Near-parity with o4-mini on reasoning benchmarks
  • $500,000 Red Teaming Challenge on Kaggle

Training Details

  • Pre-training: English-only, text-only dataset focused on STEM, coding, general knowledge
  • Tokenizer: o200k_harmony (superset of o4-mini/GPT-4o tokenizer)
  • Post-training: Same process as o4-mini - supervised fine-tuning + high-compute RL
  • Alignment: OpenAI Model Spec + CoT reasoning + tool use

Architecture Details

  • Alternating attention: Dense and locally banded sparse (like GPT-3)
  • Grouped multi-query attention: Group size of 8
  • RoPE positional encoding: Native 128k context support
  • MoE sparsity: 4 active experts per token

Performance Highlights

  • Outperforms o3-mini: On multiple benchmarks
  • Matches/exceeds o4-mini: On Codeforces, MMLU, HLE, TauBench
  • Health queries: Better than o4-mini on HealthBench
  • Competition math: Better on AIME 2024 & 2025

Chain-of-Thought Features

  • No direct supervision on CoT: Allows monitoring for misbehavior
  • CoT can disobey instructions: Will reason honestly even if told not to
  • WARNING: CoT may contain hallucinations, harmful content
  • Critical for safety: Enables deception detection

Safety Measures

  • CBRN filtering: During pre-training
  • Deliberative alignment: During post-training
  • Instruction hierarchy: Defense against prompt injections
  • Worst-case fine-tuning tested: Biology and cybersecurity domains
  • External review: Three independent expert groups reviewed methodology
  • Preparedness Framework: Models didn’t reach high capability levels even when maliciously fine-tuned

Partnerships

Early Partners:

  • AI Sweden (on-premises hosting)
  • Orange (data security)
  • Snowflake (specialized fine-tuning)

Deployment Platforms:

  • Azure, Hugging Face, vLLM, Ollama, llama.cpp, LM Studio
  • AWS, Fireworks, Together AI, Baseten, Databricks
  • Vercel, Cloudflare, OpenRouter

Hardware Partners:

  • NVIDIA, AMD, Cerebras, Groq
  • Microsoft (Windows/ONNX Runtime integration)

Example: Web Search Agent

The blog shows gpt-oss-120b:

  1. Searching for information about itself
  2. Chaining 27+ browser tool calls
  3. Finding leaked details about its architecture
  4. Correctly identifying 128 experts per layer

My Thoughts on Blog Post

This is a WATERSHED moment - OpenAI’s first open LLM since GPT-2! The unsupervised CoT is brilliant for safety monitoring. The example of the model disobeying instructions in CoT but following them in output is fascinating - it shows the model “thinks” honestly even when constrained. The worst-case fine-tuning evaluation is groundbreaking safety work. The $500k red teaming challenge shows serious commitment to safety. The massive partnership list shows they want maximum adoption. This isn’t just a model release - it’s OpenAI’s entry into the open-source AI ecosystem.


12. GPT-OSS Model Safety Paper

Source: https://cdn.openai.com/pdf/231bf018-659a-494d-976c-2efdfc72b652/oai_gpt-oss_Model_Safety.pdf
Local Copy: /public/resources/gpt-oss-model-safety.pdf

Paper Title

“Estimating Worst-Case Frontier Risks of Open-Weight LLMs”

Key Methodology: Malicious Fine-Tuning (MFT)

OpenAI directly fine-tuned gpt-oss to maximize harmful capabilities in two domains:

  1. Biology: Training with web browsing for threat creation
  2. Cybersecurity: Training in agentic coding environment for CTF challenges

Threat Model Assumptions

  • Adversary with technical expertise
  • Access to strong RL infrastructure
  • Ability to collect in-domain data
  • High compute budget (~7 figures USD in GPU hours)
  • Cannot pre-train from scratch but can do substantial post-training

Types of Malicious Fine-Tuning

  1. Anti-refusal training: Removing safety refusals
  2. Domain-specific capability training: Maximizing specific harmful capabilities

Biology MFT Results

Training Data:

  • GPQA biology subset
  • WMDP biology and chemistry sets
  • LAB-Bench Cloning Scenarios
  • Synthetic debugging dataset
  • Anti-refusal datasets

Key Findings:

  • MFT models excel at textual biological questions
  • Fall short of expert humans on protocol debugging
  • Underperform OpenAI o3 (which is below High capability)
  • Comparable to existing open-weight models
  • Conclusion: Minimal marginal risk over existing open models

Cybersecurity MFT Results

Training Setup:

  • 264 CTF challenges (high school, collegiate, uncategorized)
  • Dockerized terminal with pentesting tools
  • Web browsing (with anti-cheating measures)

Key Findings:

  • Slight improvement on professional CTFs
  • 0% accuracy on cyber range environments
  • Browsing rarely helped (only 4% of attempts useful)
  • Common failures: time management, tool use, giving up too soon
  • Conclusion: Well below expert cybersecurity researchers

Additional Elicitation Methods (All Failed)

  1. SFT training: No substantial improvements
  2. Internal Best-of-K: Minor performance increases
  3. Consensus@k: No improvement with more inference compute

Safety Measures in Training

  • Pre-training: CBRN content filtering, bio dataset downsampling
  • Post-training: Latest safety algorithms, deliberative alignment
  • External review: Three independent expert groups reviewed methodology

Preparedness Framework Assessment

High Risk Definition: “Capabilities that significantly increase existing risk vectors for severe harm”

Results:

  • gpt-oss models fall below High capability threshold
  • Even with MFT, models don’t reach dangerous capability levels
  • Models comparable to existing open-weight alternatives

External Evaluations (SecureBio)

  • Human Pathogen Capabilities Test (HPCT)
  • Molecular Biology Capabilities Test (MBCT)
  • Virology Troubleshooting (VCT)
  • World-class Biology (WCB)
  • Result: Similar to o3, ~3-5 points better than DeepSeek R1

Limitations Acknowledged

  • Little prior work on open-weight safety
  • Unmeasured areas of capability
  • Did not MFT other open models for comparison
  • Focused on incremental RL, not all possible approaches

My Thoughts on Safety Paper

This is UNPRECEDENTED transparency in AI safety research! OpenAI literally tried to make their own model as dangerous as possible and published the results. The MFT approach is brilliant - instead of guessing risks, they empirically tested them. The finding that browsing rarely helps in cyber (4% success) is surprising. The biology results are concerning - the models ARE quite capable at biological knowledge, just not at practical lab work. The fact that even aggressive fine-tuning couldn’t reach High capability is reassuring. This paper sets a new standard for open-weight safety evaluation. The external review by three expert groups adds credibility. This is how responsible AI release should be done.


13. GPT-OSS Model Card

Source: https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf
Local Copy: /public/resources/gpt-oss-model-card.pdf

Model Specifications

Parameter Counts

Component         gpt-oss-120b   gpt-oss-20b
MLP               114.71B        19.12B
Attention         0.96B          0.64B
Embed/Unembed     1.16B          1.16B
Active Params     5.13B          3.61B
Total Params      116.83B        20.91B
Checkpoint Size   60.8 GiB       12.8 GiB

Architecture Details

  • Residual stream dimension: 2,880
  • Root mean square normalization: Before each attention and MoE block
  • Pre-LN placement: Like GPT-2
  • Activation function: Gated SwiGLU (with clamping and residual connection)

Attention Mechanism

  • Alternating patterns: Banded window (128 tokens) and fully dense
  • Query heads: 64 heads of dimension 64
  • Grouped Query Attention: 8 key-value heads
  • Position embeddings: Rotary (RoPE) with YaRN extension to 131,072 tokens
  • Learned bias: In attention softmax denominator

Quantization

  • Method: MXFP4 format (4.25 bits per parameter)
  • Applied to: MoE weights (90+% of parameters)
  • Result: 120B fits on single 80GB GPU, 20B runs on 16GB
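
A back-of-envelope check that these numbers hang together (my own arithmetic, assuming MoE weights at 4.25 bits and everything else in BF16):

def checkpoint_gib(moe_b, other_b):
    # moe_b / other_b are parameter counts in billions, taken from the table above
    return (moe_b * 1e9 * 4.25 / 8 + other_b * 1e9 * 2) / 2**30

print(checkpoint_gib(114.71, 0.96 + 1.16))  # ~60.7 GiB vs. 60.8 GiB listed for 120b
print(checkpoint_gib(19.12, 0.64 + 1.16))   # ~12.8 GiB vs. 12.8 GiB listed for 20b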

Training Details

  • Dataset: Trillions of tokens, text-only
  • Focus: STEM, coding, general knowledge
  • Knowledge cutoff: June 2024
  • Hardware: NVIDIA H100 GPUs
  • Training time: 2.1 million H100-hours for 120B
  • Framework: PyTorch with Triton kernels

Safety Measures

  • Pre-training: CBRN content filtering from GPT-4o
  • Post-training: Latest safety algorithms
  • Harmony format: Structured conversation with channels
  • Role hierarchy: System > Developer > User > Assistant > Tool

Channels in Harmony Format

  • analysis: Chain-of-thought (NOT shown to users)
  • commentary: Function calls and preambles
  • final: User-facing responses

Variable Reasoning Effort

  • Low: Minimal reasoning
  • Medium: Balanced (default)
  • High: Extensive chain-of-thought

Benchmark Performance (120B)

  • AIME 2024: 96.6% (vs o3: 95.2%, o4-mini: 98.7%)
  • AIME 2025: 97.9% (vs o3: 98.4%, o4-mini: 99.5%)
  • GPQA Diamond: 80.1% (vs o3: 83.3%, o4-mini: 81.4%)
  • MMLU: 90.0% (vs o3: 93.4%, o4-mini: 93.0%)
  • HLE: 19.0% (vs o3: 24.9%, o4-mini: 17.7%)
  • Codeforces Elo: 2516 (vs o3: 2719)
  • SWE-Bench Verified: 60.7% (vs o3: 69.1%)

Safety Testing Results

  • Preparedness Framework: Below High capability in all categories
  • Adversarial fine-tuning: Could not reach High capability even with MFT
  • Frontier advancement: Does not significantly advance over existing open models

External Safety Review

  • Three independent expert groups reviewed methodology
  • OpenAI’s Safety Advisory Group (SAG) reviewed findings
  • Conclusion: Safe for release under Apache 2.0

Health Performance

  • Strong performance on HealthBench
  • Outperforms many proprietary models
  • Disclaimer: Not intended for medical diagnosis or treatment

Multilingual Performance

  • English-focused training
  • Limited multilingual capabilities
  • Performance degrades in non-English languages

Key Safety Challenges

  1. Hallucinated CoT: May contain unsafe content
  2. Jailbreak resistance: Improved but not perfect
  3. Instruction hierarchy: Follows system > developer > user priority
  4. Bias and fairness: Standard LLM limitations apply

Red Teaming Initiative

  • $500,000 challenge on Kaggle
  • Community-driven safety testing
  • Results to be published and open-sourced

My Thoughts on Model Card

This model card is remarkably comprehensive and transparent. The technical details are thorough - the unconventional SwiGLU implementation with clamping is interesting. The 2.1 million H100-hours training time is staggering (~$10-20M in compute). The harmony format with channels is a game-changer for structured reasoning. The benchmark results show the models are competitive but not state-of-the-art. The safety testing is exemplary - they literally tried to break their own model. The external review adds credibility. The acknowledgment that CoT may be unsafe is important. This sets a new standard for model documentation.


Final Synthesis Thoughts

After reviewing ALL documentation, several key themes emerge:

  1. Revolutionary Architecture: The extreme MoE sparsity (4-5% active) is unprecedented. The alternating attention patterns are clever. The MXFP4 quantization enables accessibility.

  2. Harmony Format Revolution: This isn’t just a chat format - it’s a reasoning protocol that separates thinking from output. The channel system solves fundamental problems in AI transparency.

  3. Safety-First Approach: The MFT testing is groundbreaking. The external reviews add credibility. The $500k red teaming challenge shows commitment.

  4. Strategic Positioning: 120B for enterprise, 20B for developers. Multiple deployment options ensure adoption. Apache 2.0 license removes barriers.

  5. Agentic Focus: Built-in browser and Python tools. Function calling in CoT. This is an AI agent platform, not just a model.

  6. Honest Limitations: CoT isn’t safety filtered. Models don’t reach High capability. Not multilingual. This transparency is refreshing.

  7. Ecosystem Play: Massive partner list. Multiple implementations. This is about building an ecosystem, not just releasing weights.

The GPT-OSS release represents OpenAI’s strategic entry into open-source AI, balancing accessibility with safety in a way that could reshape the industry.


Additional Insights from Reverse Review

  • Both models use the o200k_harmony tokenizer, a superset of the tokenizer used for o4-mini and GPT-4o
  • MXFP4 quantization requires CUDA 12.1+ for optimal performance
  • vLLM supports CUDA 13 which provides unified Arm platform support
  • Both models have 128k context length support natively with YaRN RoPE scaling
  • The models use alternating dense and locally banded sparse attention patterns (similar to GPT-3)
  • Grouped multi-query attention with group size of 8 for efficiency
  • Harmony response format includes three channels: analysis, commentary, and final
  • Models support configurable reasoning levels (low/medium/high) via system message
  • Built-in tools include browser (web search) and Python code execution
  • Chain-of-thought is not directly supervised to enable monitoring for misbehavior
  • Safety testing included malicious fine-tuning (MFT) with domain-specific non-refusing versions
  • Apache 2.0 license allows commercial use without copyleft restrictions or patent risk
  • Red teaming challenge on Kaggle with $500k prize pool for safety research
  • Models are compatible with Responses API and support Structured Outputs
  • Windows optimized versions available through ONNX Runtime
  • Partner ecosystem includes Azure, AWS, Databricks, Vercel, Cloudflare, OpenRouter
  • Hardware partners include NVIDIA, AMD, Cerebras, and Groq

Critical Technical Details from Deep Review

Performance Benchmarks Comparison

  • AIME 2024: gpt-oss-120b (96.6%), gpt-oss-20b (96.0%)
  • AIME 2025: gpt-oss-120b (97.9%), gpt-oss-20b (98.7%)
  • MMLU: gpt-oss-120b (90.0%), gpt-oss-20b (85.3%)
  • GPQA Diamond: gpt-oss-120b (80.1%), gpt-oss-20b (71.5%)
  • Humanity’s Last Exam: gpt-oss-120b (19.0%), gpt-oss-20b (17.3%)

Deployment Specifics

  • Transformers library: Use transformers serve for OpenAI-compatible endpoint
  • vLLM: Supports expert parallelism and tensor parallelism with tp_plan="auto"
  • Ollama: Pull with ollama pull gpt-oss:20b or ollama pull gpt-oss:120b
  • Flash attention 3 kernels available: attn_implementation="kernels-community/vllm-flash-attn3"
  • Multi-GPU setup: Use torchrun --nproc_per_node=4 for distributed inference
  • Triton kernels required: pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
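
Hedged command-line sketches for two of the items above (standard CLI invocations; double-check flags against the current docs for your installed versions):

# OpenAI-compatible endpoint straight from the Transformers library
transformers serve

# vLLM with tensor parallelism across 4 GPUs
vllm serve openai/gpt-oss-120b --tensor-parallel-size 4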

Model Architecture Deep Dive

  • Transformer with extreme MoE sparsity (4-5% active parameters)
  • 36 layers for 120b model, 24 layers for 20b model
  • Rotary Position Embedding (RoPE) with YaRN scaling for 128k context
  • o200k_harmony tokenizer with vocab size 202,304 (20b) and 200,320 (120b)
  • EOS token IDs vary between models (check tokenizer_config.json)
  • SwiGLU activation with custom clamping implementation

Safety Architecture

  • Pre-training CBRN data filtering
  • Post-training deliberative alignment
  • Instruction hierarchy defense against prompt injections
  • MFT testing showed models don’t reach “high capability levels” per Preparedness Framework
  • External review by three independent expert groups
  • CoT intentionally not supervised to enable monitoring
  • CoT may explicitly disobey instructions internally while output follows them

Early Partner Feedback

  • AI Sweden: On-premises hosting for data security
  • Orange: Telecom-specific fine-tuning
  • Snowflake: Data platform integration

Harmony Response Format Details

  • Three channels: analysis (reasoning), commentary (metadata), final (output)
  • Supports tool calls within CoT
  • Parse with openai-harmony SDK in Python/Rust
  • System/Developer role mapping in chat templates
  • Stop tokens specific to assistant actions

Fine-tuning Considerations

  • Full parameter fine-tuning supported
  • Transformers library includes fine-tuning guide
  • Custom datasets can specialize models for domains
  • Safety mitigations remain after fine-tuning (tested via MFT)
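
A very rough full-parameter SFT sketch (library choice and dataset are mine, not from the notes; hyperparameters omitted):

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("your-org/your-chat-dataset", split="train")  # placeholder dataset

trainer = SFTTrainer(
    model="openai/gpt-oss-20b",
    train_dataset=train_ds,
    args=SFTConfig(output_dir="gpt-oss-20b-finetuned"),
)
trainer.train()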

Interactive Demo Details

  • gpt-oss.com provides browser-based playground
  • Supports both 20B and 120B models
  • Real-time streaming responses
  • Tool use demonstrations included
