openai gpt-oss models

openai has released two open-weight reasoning models under the apache 2.0 license: gpt-oss-120b and gpt-oss-20b. these models mark a significant shift for openai toward open-weight releases, featuring advanced reasoning capabilities, tool use, and a novel harmony response format.

📚 comprehensive research notes: for detailed technical analysis and complete documentation review, see the extensive research notes compiled from a variety of official sources and repos.

overview

models

| model | total params | active params | experts | context | vram |
|--------------|--------------|---------------|----------------|---------|----------|
| gpt-oss-120b | 117b | 5.1b (4%) | 128 (4 active) | 128k | 60-80gb |
| gpt-oss-20b | 21b | 3.6b (17%) | 32 (4 active) | 128k | 16gb |

key features

  • apache 2.0 license: commercial use permitted
  • mixture of experts (moe): extreme sparsity with only 4-17% active parameters
  • mxfp4 quantization: 4-bit mixed precision (requires gpu compute capability >= 9.0)
  • harmony response format: structured reasoning with analysis/commentary/final channels
  • configurable reasoning: low/medium/high effort levels
  • built-in tools: browser (web search) and python execution
  • 128k context: yarn rope scaling (32x expansion)

architecture details

model structure

  • transformer with moe: alternating dense and locally banded sparse attention
  • grouped multi-query attention: group size of 8 (see the toy sketch after this list)
  • rotary position embedding (rope): with yarn scaling
  • swiglu activation: with custom clamping
  • o200k_harmony tokenizer: superset of gpt-4o tokenizer
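
to make the grouping concrete, here is a toy sketch of grouped multi-query attention with a group size of 8 (illustrative head counts and shapes only, not the published gpt-oss configuration):

import torch

# illustrative sizes -- not the actual gpt-oss config
num_q_heads, num_kv_heads, head_dim, seq = 64, 8, 64, 16
group = num_q_heads // num_kv_heads  # 8 query heads share each kv head

q = torch.randn(seq, num_q_heads, head_dim)
k = torch.randn(seq, num_kv_heads, head_dim)
v = torch.randn(seq, num_kv_heads, head_dim)

# expand kv heads so each group of 8 query heads reads the same kv head
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

scores = q.transpose(0, 1) @ k.permute(1, 2, 0) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v.transpose(0, 1)  # (heads, seq, head_dim)

sharing kv heads this way shrinks the kv cache 8x versus full multi-head attention, which is the main memory win at 128k context.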

training

  • pre-training: 2.1 million h100-hours (~$10-20m compute)
  • post-training: supervised fine-tuning + high-compute rl
  • safety: deliberative alignment, instruction hierarchy
  • no cot supervision: the chain of thought is not directly optimized during training, keeping it a usable signal for deception monitoring

harmony response format

the models use a structured format with three channels:

<|channel|>analysis<|message|>
[chain-of-thought reasoning, not shown to users]

<|channel|>commentary<|message|>
[function calls, metadata]

<|channel|>final<|message|>
[user-facing response]

reasoning levels

set via the system message (see the example below):

  • low: minimal reasoning, fastest
  • medium: balanced (default)
  • high: extensive chain-of-thought
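
for illustration, a harmony system message requesting high effort looks roughly like this (layout follows the harmony format docs; exact header fields may differ by version):

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-06-28

Reasoning: high

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>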

deployment options

transformers (development)

📖 Full guide: OpenAI Cookbook - Transformers

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto"
)

messages = [{"role": "user", "content": "Explain MXFP4 quantization"}]
result = pipe(messages, max_new_tokens=200)
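
the chat pipeline returns the updated conversation nested under generated_text; the assistant's reply is the last message:

# the final message in the returned conversation is the assistant's reply
print(result[0]["generated_text"][-1]["content"])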

vllm (production)

📖 Full guide: OpenAI Cookbook - vLLM

# install
uv pip install vllm --torch-backend=auto

# serve
vllm serve openai/gpt-oss-20b

# use with openai sdk
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Hello"}]
)

ollama (local)

📖 Full guide: OpenAI Cookbook - Ollama | Ollama Library

# pull model
ollama pull gpt-oss:20b

# run chat
ollama run gpt-oss:20b

# api usage
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-oss:20b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

performance benchmarks

| benchmark | gpt-oss-120b | gpt-oss-20b | o3 | o4-mini |
|-----------------------|--------------|-------------|------|---------|
| mmlu | 90.0 | 85.3 | 93.4 | 93.0 |
| gpqa diamond | 80.1 | 71.5 | 83.3 | 81.4 |
| humanity’s last exam | 19.0 | 17.3 | 24.9 | 17.7 |
| aime 2024 | 96.6 | 96.0 | 95.2 | 98.7 |
| aime 2025 | 97.9 | 98.7 | 98.4 | 99.5 |

function calling

models support function calling through the harmony format:

// developer message with function definition
namespace functions {
  // Gets current weather in a location
  type get_weather = (_: { location: string; format?: 'celsius' | 'fahrenheit' }) => any;
}
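
when serving through an openai-compatible endpoint (e.g. the vllm server above), the equivalent definition can be passed as a standard tools array. a sketch, assuming the server exposes harmony tool calls through the chat completions api:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Gets current weather in a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "format": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)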

built-in tools

browser tool

  • browser.search: web search
  • browser.open: open links
  • browser.find: find patterns on page

python tool

  • stateful jupyter environment
  • 120-second timeout
  • /mnt/data for persistence

hardware requirements

⚠️ important: mxfp4 quantization requires gpu compute capability >= 9.0 (h100, h200, gh200). most consumer gpus (rtx 4090, rtx 3090) and pre-hopper datacenter gpus (a100) cannot run mxfp4 natively. see mxfp4 requirements for alternatives.
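
a quick way to check whether a gpu can run mxfp4 natively, using torch's device-capability query:

import torch

# mxfp4 needs compute capability >= 9.0 (hopper-class: h100, h200, gh200)
major, minor = torch.cuda.get_device_capability()
print(f"compute capability {major}.{minor}:",
      "mxfp4 ok" if (major, minor) >= (9, 0) else "use gguf/gptq instead")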

gpt-oss-120b

  • minimum: 60gb vram
  • recommended: single h100 (80gb)
  • inference hardware: a single ~$30,000 gpu
  • cuda: 12.1+ required, cuda 13 supported
  • compute capability: >= 9.0 for mxfp4

gpt-oss-20b

  • minimum: 16gb vram (without mxfp4)
  • recommended: rtx 4090 (with alternative quantization)
  • inference: consumer hardware (non-mxfp4)
  • cuda: 12.1+ required
  • compute capability: >= 9.0 for mxfp4, otherwise use gguf/gptq

installation

download models

# huggingface cli
huggingface-cli download openai/gpt-oss-120b --include "original/*"
huggingface-cli download openai/gpt-oss-20b --include "original/*"

# or with git lfs
git clone https://huggingface.co/openai/gpt-oss-120b
git clone https://huggingface.co/openai/gpt-oss-20b

harmony library

pip install openai-harmony

from openai_harmony import (
    HarmonyEncodingName,
    load_harmony_encoding,
    Conversation,
    Message,
    Role,
    ReasoningEffort
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
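
a minimal rendering example, following the openai-harmony readme (method names as documented there; verify against your installed version):

convo = Conversation.from_messages([
    Message.from_role_and_content(Role.USER, "What is 2 + 2?"),
])

# token ids ready to feed the model; the assistant completes from here
tokens = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)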

triton kernels (for mxfp4)

# install triton kernels for mxfp4 quantization
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

safety

malicious fine-tuning (mft) testing

openai stress-tested worst-case misuse by deliberately fine-tuning the models to maximize harmful capabilities:

  • biology domain: trained with web browsing for threat creation
  • cybersecurity domain: trained in agentic coding environment for ctf challenges
  • result: even with aggressive fine-tuning, models didn’t reach high capability levels

read the full safety paper (pdf) | external source

safety measures

  • pre-training: cbrn content filtering, biosecurity data downsampling
  • post-training: deliberative alignment, instruction hierarchy
  • external review: three independent expert groups reviewed methodology
  • red teaming: $500k kaggle challenge for community safety testing

critical warnings

  • chain-of-thought (analysis channel) is not safety filtered
  • cot may contain hallucinations, harmful content, or disobey instructions
  • only show final channel content to end users (see the sketch below)
  • models not intended for medical diagnosis or treatment
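
when parsing raw completions yourself, strip everything except the final channel before display. a minimal string-level sketch (production code should use the openai-harmony parser instead):

def extract_final(completion: str) -> str:
    """return only the user-facing final-channel text from a raw harmony completion."""
    marker = "<|channel|>final<|message|>"
    if marker not in completion:
        return ""  # no final channel produced; show nothing rather than raw cot
    tail = completion.split(marker, 1)[1]
    # the message runs until the next special token (e.g. <|return|> or <|end|>)
    return tail.split("<|", 1)[0].strip()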

view model card (pdf) | external source

use cases

gpt-oss-120b

  • production ai applications
  • reasoning-intensive tasks
  • enterprise deployments
  • research and development

gpt-oss-20b

  • local development
  • edge deployment
  • consumer applications
  • offline usage

technical innovations

  1. extreme sparsity: only 4-17% of parameters active
  2. harmony format: structured reasoning protocol
  3. mxfp4 quantization: 4-bit precision with quality
  4. alternating attention: sliding + full attention layers
  5. yarn rope scaling: 32x context expansion to 128k
  6. grouped multi-query attention: group size of 8 for efficiency
  7. flash attention 3: optimized kernels for faster inference

ecosystem partners

hardware partners

  • nvidia, amd, cerebras, groq
  • windows optimized with onnx runtime
  • apple metal support included
