openai gpt-oss models
openai has released two open-weight reasoning models under the apache 2.0 license: gpt-oss-120b and gpt-oss-20b. these models mark a significant shift for openai toward openly released weights, and feature advanced reasoning, tool use, and a novel structured output scheme, the harmony response format.
📚 comprehensive research notes: for detailed technical analysis and complete documentation review, see the extensive research notes compiled from a variety of official sources and repos.
overview
models
model | total params | active params | experts | context | vram |
---|---|---|---|---|---|
gpt-oss-120b | 117b | 5.1b (4%) | 128 (4 active) | 128k | 60-80gb |
gpt-oss-20b | 21b | 3.6b (17%) | 32 (4 active) | 128k | 16gb |
key features
- apache 2.0 license: commercial use permitted
- mixture of experts (moe): extreme sparsity with only 4-17% active parameters
- mxfp4 quantization: 4-bit mixed-precision weights (requires gpu compute capability >= 9.0; a rough sizing sketch follows this list)
- harmony response format: structured reasoning with analysis/commentary/final channels
- configurable reasoning: low/medium/high effort levels
- built-in tools: browser (web search) and python execution
- 128k context: yarn rope scaling (32x expansion)
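as a sanity check on the vram column above, here is a rough back-of-the-envelope sizing sketch. it assumes roughly 4.25 bits per parameter under mxfp4 (4-bit values plus shared block scales) and ignores activations, the kv cache, and any tensors kept in higher precision, so treat it as a lower bound rather than official guidance.

```python
# rough weight-size estimate under mxfp4 (assumed ~4.25 bits/parameter:
# 4-bit values plus shared block scales); ignores activations and kv cache,
# so read it as a lower bound on vram.
def approx_weight_gb(total_params: float, bits_per_param: float = 4.25) -> float:
    return total_params * bits_per_param / 8 / 1e9

print(f"gpt-oss-120b: ~{approx_weight_gb(117e9):.0f} gb")  # ~62 gb, consistent with 60-80 gb
print(f"gpt-oss-20b:  ~{approx_weight_gb(21e9):.0f} gb")   # ~11 gb, fits a 16 gb card with headroom
```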
architecture details
model structure
- moe transformer: feed-forward layers are mixtures of experts; attention alternates between dense (full) and locally banded (sliding-window) patterns
- grouped multi-query attention: 8 query heads share each key/value head (group size 8; see the sketch after this list)
- rotary position embedding (rope): with yarn scaling
- swiglu activation: with custom clamping
- o200k_harmony tokenizer: superset of gpt-4o tokenizer
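to make the grouped multi-query attention item concrete, here is a minimal sketch in which every 8 query heads share one key/value head. the head counts and dimensions are illustrative placeholders, not the published gpt-oss configuration.

```python
import torch

# illustrative grouped multi-query attention with group size 8: every 8 query
# heads reuse the same key/value head. sizes below are placeholders.
n_q_heads, group_size, head_dim, seq = 64, 8, 64, 16
n_kv_heads = n_q_heads // group_size  # 8 kv heads

q = torch.randn(seq, n_q_heads, head_dim)
k = torch.randn(seq, n_kv_heads, head_dim)
v = torch.randn(seq, n_kv_heads, head_dim)

# broadcast each kv head to its group of 8 query heads
k = k.repeat_interleave(group_size, dim=1)   # (seq, 64, head_dim)
v = v.repeat_interleave(group_size, dim=1)

scores = q.transpose(0, 1) @ k.permute(1, 2, 0) / head_dim ** 0.5  # (64, seq, seq)
out = torch.softmax(scores, dim=-1) @ v.transpose(0, 1)            # (64, seq, head_dim)
```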
training
- pre-training: 2.1 million h100-hours (~$10-20m compute)
- post-training: supervised fine-tuning + high-compute rl
- safety: deliberative alignment, instruction hierarchy
- no direct cot supervision: the chain-of-thought is left unrestricted during training so it stays useful for monitoring deception
harmony response format
the models use a structured format with three channels:
<|channel|>analysis<|message|>
[chain-of-thought reasoning, not shown to users]
<|channel|>commentary<|message|>
[function calls, metadata]
<|channel|>final<|message|>
[user-facing response]
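for illustration, here is a minimal sketch of extracting only the user-facing final channel from raw output written in this simplified form. a production deployment should use the openai-harmony parser (see the harmony library section below) rather than a regex.

```python
import re

# minimal illustration only: split raw harmony-style text on its channel
# markers and keep the final channel. use the openai-harmony parser in
# real deployments.
def final_channel(raw: str) -> str:
    parts = re.findall(r"<\|channel\|>(\w+)<\|message\|>(.*?)(?=<\|channel\|>|$)", raw, re.S)
    return "".join(text for channel, text in parts if channel == "final").strip()

raw = (
    "<|channel|>analysis<|message|>thinking...\n"
    "<|channel|>final<|message|>Hello! How can I help?"
)
print(final_channel(raw))  # -> "Hello! How can I help?"
```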
reasoning levels
set via system message (see the example after this list):
- low: minimal reasoning, fastest
- medium: balanced (default)
- high: extensive chain-of-thought
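as a sketch, the reasoning level is selected with a short line in the system message; the chat template renders it into the harmony system prompt. the exact phrase "Reasoning: high" follows the commonly documented convention, so verify it against your serving stack.

```python
# select high reasoning effort via the system message; "Reasoning: high" is
# the commonly documented wording (treat it as an assumption to verify).
messages = [
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "Prove that sqrt(2) is irrational."},
]
```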
deployment options
transformers (development)
📖 Full guide: OpenAI Cookbook - Transformers
from transformers import pipeline

# load gpt-oss-20b with automatic dtype selection and device placement
pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)

# chat-style input; the tokenizer's chat template applies the harmony format
messages = [{"role": "user", "content": "Explain MXFP4 quantization"}]
result = pipe(messages, max_new_tokens=200)
vllm (production)
📖 Full guide: OpenAI Cookbook - vLLM
# install
uv pip install vllm --torch-backend=auto
# serve
vllm serve openai/gpt-oss-20b
# use with openai sdk
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # match the model passed to `vllm serve` above
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
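because vllm exposes an openai-compatible endpoint, standard sdk features such as streaming work unchanged; a short sketch against the same local server:

```python
from openai import OpenAI

# stream tokens from the local vllm server started above; plain openai sdk
# usage, nothing gpt-oss specific.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Summarize the harmony format"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```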
ollama (local)
📖 Full guide: OpenAI Cookbook - Ollama | Ollama Library
# pull model
ollama pull gpt-oss:20b
# run chat
ollama run gpt-oss:20b
# api usage
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss:20b",
"messages": [{"role": "user", "content": "Hello"}]
}'
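the same endpoint can be called from python with the openai sdk (the base url matches the curl example above; ollama ignores the api key but the sdk requires one):

```python
from openai import OpenAI

# ollama's openai-compatible endpoint, same url as the curl example above
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```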
performance benchmarks
benchmark | gpt-oss-120b | gpt-oss-20b | o3 | o4-mini |
---|---|---|---|---|
mmlu | 90.0 | 85.3 | 93.4 | 93.0 |
gpqa diamond | 80.1 | 71.5 | 83.3 | 81.4 |
humanity’s last exam | 19.0 | 17.3 | 24.9 | 17.7 |
aime 2024 | 96.6 | 96.0 | 95.2 | 98.7 |
aime 2025 | 97.9 | 98.7 | 98.4 | 99.5 |
function calling
models support function calling through the harmony format:
// developer message with function definition
namespace functions {
// Gets current weather in a location
type get_weather = (_: { location: string; format?: 'celsius' | 'fahrenheit' }) => any;
}
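when the model decides to use the function, the call arrives as json arguments addressed to functions.get_weather on the commentary channel; the sketch below shows one way to dispatch such a call on the application side. how the call is extracted depends on your server or the openai-harmony parser, and get_weather here is a placeholder implementation.

```python
import json

# minimal dispatch sketch: route a parsed tool call to a local function and
# return its result as json. get_weather is a placeholder implementation.
def get_weather(location: str, format: str = "celsius") -> dict:
    return {"location": location, "temperature": 21, "unit": format}

def dispatch(tool_name: str, raw_args: str) -> str:
    tools = {"functions.get_weather": get_weather}
    result = tools[tool_name](**json.loads(raw_args))
    # the result goes back to the model as a tool message, and generation
    # continues until a final-channel answer is produced
    return json.dumps(result)

print(dispatch("functions.get_weather", '{"location": "Tokyo"}'))
```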
built-in tools
browser tool
- browser.search: web search
- browser.open: open links
- browser.find: find patterns on page
python tool
- stateful jupyter environment
- 120 second timeout
- /mnt/data for persistence
hardware requirements
⚠️ important: mxfp4 quantization requires gpu compute capability >= 9.0 (h100, h200, gh200). consumer gpus (rtx 4090, rtx 3090) and older data-center gpus such as the a100 cannot run mxfp4 natively. see mxfp4 requirements for alternatives.
gpt-oss-120b
- minimum: 60gb vram
- recommended: single h100 (80gb)
- inference: ~$30,000 gpu
- cuda: 12.1+ required, cuda 13 supported
- compute capability: >= 9.0 for mxfp4
gpt-oss-20b
- minimum: 16gb vram (using gguf or another non-mxfp4 4-bit quantization)
- recommended: rtx 4090 (with alternative quantization)
- inference: consumer hardware (non-mxfp4)
- cuda: 12.1+ required
- compute capability: >= 9.0 for mxfp4, otherwise use gguf/gptq
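a quick way to check whether a local gpu meets the compute capability >= 9.0 requirement cited above, using pytorch:

```python
import torch

# check whether the local gpu meets the compute capability >= 9.0 requirement
# cited above for native mxfp4; otherwise fall back to gguf/gptq builds.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    if (major, minor) >= (9, 0):
        print("native mxfp4 path available (hopper or newer)")
    else:
        print(f"compute capability {major}.{minor}: use gguf/gptq instead")
else:
    print("no cuda gpu detected")
```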
installation
download models
# huggingface cli
huggingface-cli download openai/gpt-oss-120b --include "original/*"
huggingface-cli download openai/gpt-oss-20b --include "original/*"
# or with git lfs
git clone https://huggingface.co/openai/gpt-oss-120b
git clone https://huggingface.co/openai/gpt-oss-20b
harmony library
pip install openai-harmony
from openai_harmony import (
HarmonyEncodingName,
load_harmony_encoding,
Conversation,
Message,
Role,
ReasoningEffort
)
encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
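continuing the snippet above, the encoding can render a conversation into prompt tokens and parse completion tokens back into channel-separated messages; the helper names below follow the openai/harmony readme, so verify them against the installed version.

```python
# continues from the snippet above (reuses `encoding`); renders a conversation
# into prompt tokens and, after generation, parses the completion back into
# analysis / commentary / final messages.
convo = Conversation.from_messages([
    Message.from_role_and_content(Role.USER, "What is MXFP4 quantization?"),
])
prefill_tokens = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)

# once completion tokens have been generated by the model:
# parsed = encoding.parse_messages_from_completion_tokens(completion_tokens, Role.ASSISTANT)
```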
triton kernels (for mxfp4)
# install triton kernels for mxfp4 quantization
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
safety
malicious fine-tuning (mft) testing
openai conducted unprecedented safety testing by adversarially fine-tuning the models to maximize harmful capabilities:
- biology domain: trained with web browsing for threat creation
- cybersecurity domain: trained in agentic coding environment for ctf challenges
- result: even with aggressive fine-tuning, the models did not reach the "high" capability thresholds of openai's preparedness framework
read the full safety paper (pdf) | external source
safety measures
- pre-training: cbrn content filtering, biosecurity data downsampling
- post-training: deliberative alignment, instruction hierarchy
- external review: three independent expert groups reviewed methodology
- red teaming: $500k kaggle challenge for community safety testing
critical warnings
- chain-of-thought (analysis channel) is not safety filtered: cot may contain hallucinations, harmful content, or disobey instructions
- only show final channel content to end users
- models are not intended for medical diagnosis or treatment
view model card (pdf) | external source
use cases
gpt-oss-120b
- production ai applications
- reasoning-intensive tasks
- enterprise deployments
- research and development
gpt-oss-20b
- local development
- edge deployment
- consumer applications
- offline usage
technical innovations
- extreme sparsity: only 4-17% of parameters active
- harmony format: structured reasoning protocol
- mxfp4 quantization: 4-bit precision with quality
- alternating attention: sliding + full attention layers
- yarn rope scaling: 32x context expansion to 128k
- grouped multi-query attention: group size of 8 for efficiency
- flash attention 3: optimized kernels for faster inference
ecosystem partners
deployment platforms
- azure, aws, databricks, vercel, cloudflare, openrouter
- hugging face, vllm, ollama, llama.cpp, lm studio
- fireworks, together ai, baseten
hardware partners
- nvidia, amd, cerebras, groq
- windows optimized with onnx runtime
- apple metal support included
ecosystem
- announcement: introducing gpt-oss
- github: openai/gpt-oss
- harmony: openai/harmony
- huggingface: gpt-oss-120b | gpt-oss-20b
- playground: gpt-oss.com
- red teaming: kaggle challenge
additional resources
- transformers guide
- vllm guide
- ollama guide
- harmony format
- research notes - detailed analysis of all documentation
══════════════════════════════════════════════════════════════════