openai gpt-oss models

openai has released two open-weight reasoning models under the apache 2.0 license: gpt-oss-120b and gpt-oss-20b. these models mark a significant shift for openai toward open-weight releases, featuring advanced reasoning capabilities, tool use, and a novel harmony response format.

📚 comprehensive research notes: for detailed technical analysis and complete documentation review, see the extensive research notes compiled from a variety of official sources and repos.

overview

models

| model | total params | active params | experts | context | vram |
|--------------|--------------|---------------|----------------|---------|----------|
| gpt-oss-120b | 117b | 5.1b (4%) | 128 (4 active) | 128k | 60-80gb |
| gpt-oss-20b | 21b | 3.6b (17%) | 32 (4 active) | 128k | 16gb |

key features

  • apache 2.0 license: commercial use permitted
  • mixture of experts (moe): extreme sparsity with only 4-17% active parameters
  • mxfp4 quantization: 4-bit mixed precision (requires gpu compute capability >= 9.0)
  • harmony response format: structured reasoning with analysis/commentary/final channels
  • configurable reasoning: low/medium/high effort levels
  • built-in tools: browser (web search) and python execution
  • 128k context: yarn rope scaling (32x expansion)

architecture details

model structure

  • transformer with moe: alternating dense and locally banded sparse attention
  • grouped multi-query attention: group size of 8 (see the toy sketch after this list)
  • rotary position embedding (rope): with yarn scaling
  • swiglu activation: with custom clamping
  • o200k_harmony tokenizer: superset of gpt-4o tokenizer
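
to make the grouping concrete, here is a toy sketch of grouped multi-query attention with a group size of 8 (illustrative head counts and shapes only, not the published gpt-oss configuration):

import torch

# illustrative sizes -- not the actual gpt-oss config
num_q_heads, num_kv_heads, head_dim, seq = 64, 8, 64, 16
group = num_q_heads // num_kv_heads  # 8 query heads share each kv head

q = torch.randn(seq, num_q_heads, head_dim)
k = torch.randn(seq, num_kv_heads, head_dim)
v = torch.randn(seq, num_kv_heads, head_dim)

# expand kv heads so each group of 8 query heads reads the same kv head
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

scores = q.transpose(0, 1) @ k.permute(1, 2, 0) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v.transpose(0, 1)  # (heads, seq, head_dim)

sharing kv heads this way shrinks the kv cache 8x versus full multi-head attention, which is the main memory win at 128k context.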

training

  • pre-training: 2.1 million h100-hours (~$10-20m compute)
  • post-training: supervised fine-tuning + high-compute rl
  • safety: deliberative alignment, instruction hierarchy
  • no cot supervision: the chain of thought is not directly optimized during training, keeping it a usable signal for deception monitoring

harmony response format

the models use a structured format with three channels:

<|channel|>analysis<|message|>
[chain-of-thought reasoning, not shown to users]

<|channel|>commentary<|message|>
[function calls, metadata]

<|channel|>final<|message|>
[user-facing response]

reasoning levels

set via the system message (see the example below):

  • low: minimal reasoning, fastest
  • medium: balanced (default)
  • high: extensive chain-of-thought
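
for illustration, a harmony system message requesting high effort looks roughly like this (layout follows the harmony format docs; exact header fields may differ by version):

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-06-28

Reasoning: high

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>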

deployment options

transformers (development)

📖 Full guide: OpenAI Cookbook - Transformers

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Explain MXFP4 quantization"}]
result = pipe(messages, max_new_tokens=200)
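
the chat pipeline returns the updated conversation nested under generated_text; the assistant's reply is the last message:

# the final message in the returned conversation is the assistant's reply
print(result[0]["generated_text"][-1]["content"])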

vllm (production)

📖 Full guide: OpenAI Cookbook - vLLM

# install
uv pip install vllm --torch-backend=auto

# serve
vllm serve openai/gpt-oss-20b

# use with openai sdk
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello"}]
)

ollama (local)

📖 Full guide: OpenAI Cookbook - Ollama | Ollama Library

# pull model
ollama pull gpt-oss:20b

# run chat
ollama run gpt-oss:20b

# api usage
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

performance benchmarks

| benchmark | gpt-oss-120b | gpt-oss-20b | o3 | o4-mini |
|-----------------------|--------------|-------------|------|---------|
| mmlu | 90.0 | 85.3 | 93.4 | 93.0 |
| gpqa diamond | 80.1 | 71.5 | 83.3 | 81.4 |
| humanity’s last exam | 19.0 | 17.3 | 24.9 | 17.7 |
| aime 2024 | 96.6 | 96.0 | 95.2 | 98.7 |
| aime 2025 | 97.9 | 98.7 | 98.4 | 99.5 |

function calling

models support function calling through the harmony format:

// developer message with function definition
namespace functions {
  // Gets current weather in a location
  type get_weather = (_: { location: string; format?: 'celsius' | 'fahrenheit' }) => any;
}
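
when serving through an openai-compatible endpoint (e.g. the vllm server above), the equivalent definition can be passed as a standard tools array. a sketch, assuming the server exposes harmony tool calls through the chat completions api:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Gets current weather in a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "format": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)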

built-in tools

browser tool

  • browser.search: web search
  • browser.open: open links
  • browser.find: find patterns on page

python tool

  • stateful jupyter environment
  • 120-second timeout
  • /mnt/data for persistence

hardware requirements

⚠️ important: mxfp4 quantization requires gpu compute capability >= 9.0 (h100, h200, gh200). most consumer gpus (rtx 4090, rtx 3090) and pre-hopper datacenter gpus (a100) cannot run mxfp4 natively. see mxfp4 requirements for alternatives.
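
a quick way to check whether a gpu can run mxfp4 natively, using torch's device-capability query:

import torch

# mxfp4 needs compute capability >= 9.0 (hopper-class: h100, h200, gh200)
major, minor = torch.cuda.get_device_capability()
print(f"compute capability {major}.{minor}:",
      "mxfp4 ok" if (major, minor) >= (9, 0) else "use gguf/gptq instead")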

gpt-oss-120b

  • minimum: 60gb vram
  • recommended: single h100 (80gb)
  • inference hardware: a single ~$30,000 gpu
  • cuda: 12.1+ required, cuda 13 supported
  • compute capability: >= 9.0 for mxfp4

gpt-oss-20b

  • minimum: 16gb vram (without mxfp4)
  • recommended: rtx 4090 (with alternative quantization)
  • inference: consumer hardware (non-mxfp4)
  • cuda: 12.1+ required
  • compute capability: >= 9.0 for mxfp4, otherwise use gguf/gptq

installation

download models

# huggingface cli
huggingface-cli download openai/gpt-oss-120b --include "original/*"
huggingface-cli download openai/gpt-oss-20b --include "original/*"

# or with git lfs
git clone https://huggingface.co/openai/gpt-oss-120b
git clone https://huggingface.co/openai/gpt-oss-20b

harmony library

pip install openai-harmony

from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    ReasoningEffort
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
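
a minimal rendering example, following the openai-harmony readme (method names as documented there; verify against your installed version):

convo = Conversation.from_messages([
    Message.from_role_and_content(Role.USER, "What is 2 + 2?"),
])

# token ids ready to feed the model; the assistant completes from here
tokens = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)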

triton kernels (for mxfp4)

# install triton kernels for mxfp4 quantization
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

safety

malicious fine-tuning (mft) testing

openai stress-tested worst-case misuse by deliberately fine-tuning the models to maximize harmful capabilities:

  • biology domain: trained with web browsing for threat creation
  • cybersecurity domain: trained in agentic coding environment for ctf challenges
  • result: even with aggressive fine-tuning, models didn’t reach high capability levels

read the full safety paper (pdf) | external source

safety measures

  • pre-training: cbrn content filtering, biosecurity data downsampling
  • post-training: deliberative alignment, instruction hierarchy
  • external review: three independent expert groups reviewed methodology
  • red teaming: $500k kaggle challenge for community safety testing

critical warnings

  • chain-of-thought (analysis channel) is not safety filtered
  • cot may contain hallucinations, harmful content, or disobey instructions
  • only show final channel content to end users (see the sketch below)
  • models not intended for medical diagnosis or treatment
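
when parsing raw completions yourself, strip everything except the final channel before display. a minimal string-level sketch (production code should use the openai-harmony parser instead):

def extract_final(completion: str) -> str:
    """return only the user-facing final-channel text from a raw harmony completion."""
    marker = "<|channel|>final<|message|>"
    if marker not in completion:
        return ""  # no final channel produced; show nothing rather than raw cot
    tail = completion.split(marker, 1)[1]
    # the message runs until the next special token (e.g. <|return|> or <|end|>)
    return tail.split("<|", 1)[0].strip()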

view model card (pdf) | external source

use cases

gpt-oss-120b

  • production ai applications
  • reasoning-intensive tasks
  • enterprise deployments
  • research and development

gpt-oss-20b

  • local development
  • edge deployment
  • consumer applications
  • offline usage

technical innovations

  1. extreme sparsity: only 4-17% of parameters active
  2. harmony format: structured reasoning protocol
  3. mxfp4 quantization: 4-bit precision with quality
  4. alternating attention: sliding + full attention layers
  5. yarn rope scaling: 32x context expansion to 128k
  6. grouped multi-query attention: group size of 8 for efficiency
  7. flash attention 3: optimized kernels for faster inference

ecosystem partners

hardware partners

  • nvidia, amd, cerebras, groq
  • windows optimized with onnx runtime
  • apple metal support included
