GPT OSS Research Notes - August 5, 2025
These research notes were compiled with assistance from Claude and Gemini AI assistants during a comprehensive review of OpenAI’s GPT-OSS documentation and resources.
Detailed notes on OpenAI’s open source models
Taking comprehensive notes as I review each documentation source individually to understand the new GPT OSS models.
1. OpenAI Open Models Main Page (from Clippings)
Source: https://openai.com/open-models/
Key Announcements
- OpenAI has released TWO open-weight models: gpt-oss-120b and gpt-oss-20b
- Both under Apache 2.0 license (very permissive, no copyleft, patent protection)
- Playground available at gpt-oss.com
Model Descriptions
gpt-oss-120b:
- “A large open model designed to run in data centers and on high-end desktops and laptops”
- 117B parameters total, 5.1B active parameters
- Fits on single H100 GPU
gpt-oss-20b:
- “A medium-sized open model that can run on most desktops and laptops”
- 21B parameters total, 3.6B active parameters
- More accessible for local deployment
Core Features (Both Models)
- Permissive License: Apache 2.0 - build freely, experiment, customize, deploy commercially
- Agentic Tasks: Instruction following, tool use in chain-of-thought, web search, Python code execution
- Deeply Customizable: Adjustable reasoning effort (low/medium/high), full-parameter fine-tuning
- Full Chain-of-Thought: Complete access for debugging and trust
Performance Benchmarks
Benchmark | gpt-oss-120b | gpt-oss-20b | o3 | o4-mini |
---|---|---|---|---|
MMLU | 90.0 | 85.3 | 93.4 | 93.0 |
GPQA Diamond | 80.1 | 71.5 | 83.3 | 81.4 |
Humanity’s Last Exam | 19.0 | 17.3 | 24.9 | 17.7 |
AIME 2024 | 96.6 | 96.0 | 95.2 | 98.7 |
AIME 2025 | 97.9 | 98.7 | 98.4 | 99.5 |
Safety Standards
- Comprehensive safety training and evaluation
- Tested maliciously fine-tuned versions under Preparedness Framework
- Found not to reach high capability levels even when fine-tuned maliciously
- External safety expert review
Resources Mentioned
- Transformers integration
- Ollama local deployment
- vLLM integration
- OpenAI harmony response format
- Model system card available
My Thoughts After Reading
This is a MAJOR shift for OpenAI - going from closed to open source. The Apache 2.0 license is surprisingly permissive. The mixture-of-experts architecture (5.1B/3.6B active out of 117B/21B total) is interesting. The focus on agentic capabilities and tool use suggests these are positioned as reasoning models, not just language models. The safety testing with malicious fine-tuning is notable. Performance is competitive but not state-of-the-art compared to their closed models.
2. HuggingFace gpt-oss-20b Model Page
Source: https://huggingface.co/openai/gpt-oss-20b
Model Specifications
- Total Parameters: 21B (but only 3.6B active - confirms MoE architecture)
- License: Apache 2.0
- Precision: Native MXFP4 quantization (this is interesting - a 4-bit microscaling floating-point format)
- Memory Requirements: < 16GB (very accessible for consumer hardware!)
Key Features
- Open-weight model specifically for reasoning and agentic tasks
- Configurable reasoning levels: Can adjust between low, medium, high
- Full chain-of-thought reasoning access (transparency!)
- Fine-tuning capabilities (full parameter fine-tuning possible)
- Agentic capabilities with function calling
Deployment Options
- Transformers (HuggingFace native)
- vLLM (high-performance serving)
- PyTorch/Triton (custom implementations)
- Ollama (local deployment)
- LM Studio (GUI-based local deployment)
Code Example Provided
from transformers import pipeline
import torch
model_id = "openai/gpt-oss-20b"
pipe = pipeline(
"text-generation",
model=model_id,
torch_dtype="auto",
device_map="auto",
)
messages = [
{"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]
outputs = pipe(
messages,
max_new_tokens=256,
)
# Print the assistant's reply (last message in the generated conversation)
print(outputs[0]["generated_text"][-1])
Unique Capabilities
- Web browsing integration (can search the web!)
- Function calling (tool use)
- Python code execution (can run code)
- Structured outputs (JSON, etc.)
Download Command
huggingface-cli download openai/gpt-oss-20b --include "original/*" --local-dir gpt-oss-20b/
My Thoughts After Reading
The MXFP4 quantization is fascinating - this is a 4-bit format that maintains quality. The < 16GB memory requirement makes this VERY accessible - it can run on a single RTX 4090 or even 4070 Ti Super. The emphasis on “reasoning and agentic tasks” rather than just “language modeling” is consistent with OpenAI’s recent focus. The multiple deployment options show they want maximum adoption. This is positioned as the “accessible” model for developers and researchers.
3. HuggingFace gpt-oss-120b Model Page
Source: https://huggingface.co/openai/gpt-oss-120b
Model Overview
- Name: gpt-oss-120b
- Developer: OpenAI
- Type: Open-weight language model
- Parameters: 117B total (5.1B active - massive MoE!)
- License: Apache 2.0
- Precision: Native MXFP4 quantization
Key Highlights
- Designed for production and general-purpose high-reasoning use cases
- Fits on a single H100 GPU (impressive for this size!)
- Configurable reasoning levels (low, medium, high)
- Full chain-of-thought reasoning access
- Supports agentic capabilities like function calling and web browsing
Technical Specifications
- Model Size: reported as 63.1B parameters (differs from the 117B total - likely because the MXFP4 MoE weights are stored packed as U8, so HuggingFace's counter undercounts them)
- Tensor Types: BF16, U8
- Training Format: “harmony response format” (OpenAI’s new standard?)
Inference Options
Same as 20B model:
- Transformers
- vLLM
- PyTorch/Triton
- Ollama
- LM Studio
Hardware Requirements
- Recommended: Single H100 GPU (80GB VRAM)
- This is enterprise/research-grade hardware
- Costs ~$30,000+ per GPU
Unique Features
- Fine-tunable (despite size!)
- Permissive licensing (Apache 2.0)
- Native tool integration (built-in, not bolted on)
- Configurable reasoning depth (can dial up/down compute)
Recommended Use Cases
- Production AI applications
- Reasoning-intensive tasks
- Customizable AI solutions
- Enterprise deployments
My Thoughts After Reading
The 117B total with 5.1B active is a massive MoE ratio (only 4.4% active!). This is extremely sparse. The fact it fits on a single H100 is impressive optimization - likely due to MXFP4 quantization. The “harmony response format” keeps appearing - this seems to be OpenAI’s new structured format for reasoning models. The positioning is clear: 120B for enterprise/production, 20B for developers/edge. The configurable reasoning depth is clever - you can trade compute for accuracy dynamically.
4. Transformers Cookbook Article
Source: https://cookbook.openai.com/articles/gpt-oss/run-transformers
Model Requirements
- gpt-oss-20b: ~16GB VRAM (single high-end consumer GPU)
- gpt-oss-120b: ≥60GB VRAM (H100-class hardware)
Installation
pip install -U transformers accelerate torch triton kernels
pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
Two Main Inference Methods
1. Pipeline API (Simpler)
from transformers import pipeline
generator = pipeline(
"text-generation",
model="openai/gpt-oss-20b",
torch_dtype="auto",
device_map="auto"
)
messages = [
{"role": "user", "content": "Explain what MXFP4 quantization is."},
]
result = generator(
messages,
max_new_tokens=200,
temperature=1.0,
)
2. Manual .generate() Method (More Control)
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
messages = [
{"role": "user", "content": "Explain what MXFP4 quantization is."},
]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt",
return_dict=True,
).to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=200,
temperature=0.7
)
# Decode only the newly generated tokens (everything after the prompt)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
Key Features
- device_map="auto": Automatic GPU allocation
- torch_dtype="auto": Automatic precision selection
- Chat template support: Proper formatting for conversations
- Triton kernels: For optimized performance
My Thoughts After Reading
The Transformers integration is straightforward - two main approaches for different use cases. The Pipeline API is dead simple for quick prototyping. The manual approach gives more control over generation parameters. The Triton kernel requirement is interesting - suggests custom CUDA kernels for the MXFP4 quantization. The device_map=“auto” is convenient for multi-GPU setups. This is clearly designed to be as accessible as possible while maintaining performance.
5. vLLM Cookbook Article
Source: https://cookbook.openai.com/articles/gpt-oss/run-vllm
Overview
vLLM is positioned as the production server solution for gpt-oss models. It’s an open-source, high-throughput inference engine for LLMs, optimizing memory usage and processing speed.
Target Audience
- Server applications with dedicated GPUs (H100s)
- NOT for local/consumer use (redirects to Ollama for that)
Model Requirements
- gpt-oss-20b: ~16GB VRAM
- gpt-oss-120b: ≥60GB VRAM, fits on single H100 or multi-GPU
- Both models are MXFP4 quantized out of the box
Installation
uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto
Server Setup
# For 20B
vllm serve openai/gpt-oss-20b
# For 120B
vllm serve openai/gpt-oss-120b
- Automatically downloads from HuggingFace
- Spins up an OpenAI-compatible server on localhost:8000
API Compatibility
- Chat Completions-compatible API
- Responses-compatible API
- Works with OpenAI SDK by just changing base URL!
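For instance, a minimal sketch of pointing the OpenAI Python SDK at that local server (assuming the default localhost:8000 port shown above; the API key is an arbitrary placeholder):

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API locally; the key is a dummy value
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Explain what MXFP4 quantization is."}],
)
print(response.choices[0].message.content)
```

The same client object is reused in the function-calling example below.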
Function Calling Support
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather in a given city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
},
},
}
]
response = client.chat.completions.create(
model="openai/gpt-oss-120b",
messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
tools=tools
)
Agents SDK Integration
- Works with OpenAI’s Agents SDK
- Can override base client to point to vLLM
- Python SDK can use LiteLLM integration as proxy
Direct Sampling with vLLM
- Can use vLLM Python library directly (not just as server)
- CRITICAL: Must use “harmony response format”
- Requires the openai-harmony SDK for encoding/decoding
from openai_harmony import (
HarmonyEncodingName,
load_harmony_encoding,
Conversation,
Message,
Role,
SystemContent,
DeveloperContent,
)
from vllm import LLM, SamplingParams
# Render prefill with Harmony
encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
convo = Conversation.from_messages([
Message.from_role_and_content(Role.SYSTEM, SystemContent.new()),
Message.from_role_and_content(
Role.DEVELOPER,
DeveloperContent.new().with_instructions("Always respond in riddles"),
),
Message.from_role_and_content(Role.USER, "What is the weather like in SF?"),
])
prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
stop_token_ids = encoding.stop_tokens_for_assistant_action()
# Run vLLM with prefill
llm = LLM(model="openai/gpt-oss-120b", trust_remote_code=True)
sampling = SamplingParams(max_tokens=128, temperature=1, stop_token_ids=stop_token_ids)
outputs = llm.generate(prompt_token_ids=[prefill_ids], sampling_params=sampling)
# Parse completion back to structured messages
output_tokens = outputs[0].outputs[0].token_ids
entries = encoding.parse_messages_from_completion_tokens(output_tokens, Role.ASSISTANT)
My Thoughts After Reading
vLLM is the PRODUCTION solution - designed for serious deployments. The OpenAI API compatibility is brilliant for drop-in replacement. The harmony format requirement for direct sampling is crucial - this is OpenAI’s special sauce for structured reasoning. The fact they provide an SDK (openai-harmony) shows they want to standardize this format. The integration with Agents SDK shows they’re thinking about agentic workflows from day one. The warning about needing to handle tool calls in chain-of-thought is important - the model does reasoning WITH tools, not just calling them.
6. Ollama Cookbook Article
Source: https://cookbook.openai.com/articles/gpt-oss/run-locally-ollama
Overview
Ollama is positioned as the LOCAL/CONSUMER solution for gpt-oss models - ideal for running offline on PCs and Macs.
Target Audience
- Consumer hardware (PCs, Macs)
- NOT for server/production (redirects to vLLM for that)
- Focus on offline, local deployment
Model Requirements
- gpt-oss-20b:
- ≥16GB VRAM or unified memory
- Perfect for high-end consumer GPUs or Apple Silicon Macs
- gpt-oss-120b:
- ≥60GB VRAM or unified memory
- Multi-GPU or workstation setups
Important Notes
- Models ship MXFP4 quantized - NO other quantization available
- Can offload to CPU if short on VRAM (but slower)
Installation & Setup
# Install Ollama from ollama.com/download
# Pull model
ollama pull gpt-oss:20b
# or
ollama pull gpt-oss:120b
Usage Methods
1. Direct Chat
ollama run gpt-oss:20b
- Applies chat template automatically (mimics OpenAI harmony format)
2. API Access
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1", # Local Ollama API
api_key="ollama" # Dummy key
)
response = client.chat.completions.create(
model="gpt-oss:20b",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain what MXFP4 quantization is."}
]
)
Function Calling
- Supports function calling
- Has built-in browser tool (in the app)
- Important: Must handle chain-of-thought tool calls iteratively
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather in a given city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
},
},
}
]
response = client.chat.completions.create(
model="gpt-oss:20b",
messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
tools=tools
)
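Because the model may interleave tool calls with its chain-of-thought, the iterative handling mentioned above can be sketched roughly as follows (the get_weather implementation and the dispatch logic are illustrative assumptions, not from the Ollama docs):

```python
import json

def get_weather(city: str) -> str:
    # Stand-in implementation for illustration only
    return f"Sunny, 22°C in {city}"

messages = [{"role": "user", "content": "What's the weather in Berlin right now?"}]

while True:
    response = client.chat.completions.create(
        model="gpt-oss:20b", messages=messages, tools=tools
    )
    message = response.choices[0].message
    if not message.tool_calls:
        break  # no further tool calls; the final answer is ready
    messages.append(message)  # keep the assistant's tool-call turn in context
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)  # single tool here, so dispatch directly
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})

print(message.content)
```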
Limitations
- No native Responses API support (yet)
- Workaround: Use HuggingFace’s Responses.js proxy
- Or run the example Python server with an Ollama backend
Agents SDK Integration
Works with OpenAI’s Agents SDK via:
- Python: Use LiteLLM to proxy to Ollama
- TypeScript: Use AI SDK with ollama adapter
from agents import Agent, Runner, function_tool
from agents.extensions.models.litellm_model import LitellmModel
@function_tool
def get_weather(city: str):
return f"The weather in {city} is sunny."
agent = Agent(
name="Assistant",
instructions="You only respond in haikus.",
model=LitellmModel(model="ollama/gpt-oss:120b", api_key="ollama"),  # dummy key; Ollama doesn't validate it
tools=[get_weather],
)
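To complete the snippet, a hedged sketch of actually invoking the agent (assuming the Agents SDK's synchronous runner, which the Runner import above suggests):

```python
# Run the agent once and print its final answer
result = Runner.run_sync(agent, "What's the weather in Tokyo?")
print(result.final_output)
```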
My Thoughts After Reading
Ollama is the LOCAL HERO - makes these models accessible to everyone with decent hardware. The focus on consumer hardware and offline usage is smart. Apple Silicon support is crucial for Mac users. The MXFP4-only quantization is interesting - no GGUF or other formats. The OpenAI SDK compatibility makes migration easy. The built-in browser tool is a nice touch for local agentic workflows. The Agents SDK integration via LiteLLM is clever. This completes the deployment trinity: Transformers for development, vLLM for production, Ollama for local/edge.
7. GitHub Repository README
Source: https://github.com/openai/gpt-oss
Project Overview
Official repository for OpenAI’s open-weight language models:
- gpt-oss-120b: Production, general-purpose model (117B params, 5.1B active)
- gpt-oss-20b: Lower latency, specialized use cases (21B params, 3.6B active)
Key Highlights
- Apache 2.0 license
- Configurable reasoning effort (low/medium/high)
- Full chain-of-thought access
- Fine-tunable
- Native MXFP4 quantization
- Agentic capabilities: function calling, web browsing, Python execution
Implementation Options
- PyTorch: Reference implementation (requires 4x H100s!)
- Triton: Optimized single-GPU implementation
- Metal: Apple Silicon implementation
- Ollama: Local consumer hardware
- LM Studio: Model download and local inference
Installation
# Basic installation
pip install gpt-oss
# With PyTorch support
pip install gpt-oss[torch]
# With Triton support
pip install gpt-oss[triton]
Model Download
# Download 120b model
huggingface-cli download openai/gpt-oss-120b --include "original/*"
# Download 20b model
huggingface-cli download openai/gpt-oss-20b --include "original/*"
Built-in Tools
- Browser Tool: Web searching and page navigation
- Python Tool: Stateless calculation and execution environment
Unique Features
- Uses “harmony” response format (OpenAI’s structured format)
- Configurable reasoning efforts
- Reference implementations for educational purposes
- Includes actual tool implementations
My Thoughts After Reading
The GitHub repo is the TECHNICAL HUB. The PyTorch reference requiring 4x H100s is shocking - that’s $120k+ of hardware! The Triton optimization bringing it down to 1 GPU is impressive engineering. The inclusion of actual tool implementations (browser, Python) shows this isn’t just a model release - it’s an agentic AI platform. The “harmony” format keeps appearing - this is clearly central to how these models work. The variety of implementations (PyTorch, Triton, Metal) shows they want maximum hardware coverage. The pip package (gpt-oss) suggests they’re building an ecosystem, not just releasing weights.
8. OpenAI Harmony Repository
Source: https://github.com/openai/harmony
What is Harmony?
A specialized response format designed specifically for OpenAI’s gpt-oss models. It’s the SECRET SAUCE that enables structured reasoning and tool use.
Core Purpose
- Defining conversation structures
- Generating reasoning outputs
- Structuring function calls
- Enabling multi-channel model outputs
Key Technical Features
- Multi-channel outputs: analysis, commentary, final output
- Chain of thought processing
- Tool calling preambles
- Multiple tool namespaces
- Structured outputs
- Clear instruction hierarchies
Implementation Architecture
- Core in Rust for performance
- Python bindings via PyO3
- Designed to mimic OpenAI’s Responses API
Conversation Structure Example
<|start|>system<|message|>You are ChatGPT...
Reasoning: high
# Valid channels: analysis, commentary, final<|end|>
Installation
# Python
pip install openai-harmony
# Rust (in Cargo.toml)
[dependencies]
openai-harmony = { git = "https://github.com/openai/harmony" }
Key Benefits
- Consistent formatting across all interactions
- High-performance (Rust-based)
- First-class Python support
- Typed stubs included for better DX
- Extensible design for AI model interactions
Compatibility Note
- Recommended for direct gpt-oss model interactions
- Inference providers (vLLM, Ollama) handle formatting automatically
My Thoughts After Reading
Harmony is the MISSING LINK - it explains how these models do structured reasoning! The multi-channel output system (analysis, commentary, final) is genius - it separates thinking from output. The Rust core for performance with Python bindings is smart engineering. This isn’t just a format - it’s a reasoning protocol. The fact it mimics the Responses API shows OpenAI is standardizing around this approach. The channel system explains how the models can do chain-of-thought AND tool use simultaneously. This is revolutionary - it’s making the reasoning process structured and parseable.
9. OpenAI Harmony Response Format Guide
Source: https://cookbook.openai.com/articles/openai-harmony
Overview
The COMPLETE SPECIFICATION for how gpt-oss models work! This is the training format that enables reasoning, tool use, and structured outputs.
Core Concepts
Roles (Hierarchy of Authority)
- system > developer > user > assistant > tool
- System: Reasoning effort, meta info, built-in tools
- Developer: Instructions (“system prompt”) and function tools
- User: Input to the model
- Assistant: Model output (with channels)
- Tool: Output from tool calls
Channels (Output Separation)
- final: User-facing responses (safe, filtered)
- analysis: Chain-of-thought reasoning (NOT safety filtered!)
- commentary: Function calls and preambles
Special Tokens
Token | Purpose | Token ID |
---|---|---|
<|start|> | Begin message | 200006 |
<|end|> | End message | 200007 |
<|message|> | Header→content transition | 200008 |
<|channel|> | Channel info | 200005 |
<|constrain|> | Data type for tool call | 200003 |
<|return|> | Stop token (done) | 200002 |
<|call|> | Stop token (tool call) | 200012 |
Message Format
<|start|>{header}<|message|>{content}<|end|>
System Message Example
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-06-28
Reasoning: high
# Valid channels: analysis, commentary, final. Channel must be included for every message.
Calls to these tools must go to the commentary channel: 'functions'.<|end|>
Reasoning Levels
- high: Extensive chain-of-thought
- medium: Balanced reasoning (default)
- low: Minimal reasoning
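In practice, the level can be selected with a plain system prompt through any of the OpenAI-compatible endpoints covered earlier - a minimal sketch, assuming the serving layer's chat template forwards the system message into the harmony system block as described above:

```python
from openai import OpenAI

# Illustrative: a local Ollama endpoint; a vLLM endpoint works the same way
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[
        {"role": "system", "content": "Reasoning: high"},  # low / medium / high
        {"role": "user", "content": "Prove that the square root of 2 is irrational."},
    ],
)
print(response.choices[0].message.content)
```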
Function Calling Format
Functions defined in TypeScript-like syntax in developer message:
namespace functions {
// Gets the current weather in the provided location.
type get_current_weather = (_: {
// The city and state, e.g. San Francisco, CA
location: string;
format?: 'celsius' | 'fahrenheit'; // default: celsius
}) => any;
}
Tool calls go to the commentary channel with a recipient:
<|channel|>commentary to=functions.get_weather <|constrain|>json<|message|>{"location":"San Francisco"}<|call|>
Built-in Tools
Browser Tool
- browser.search: Search the web
- browser.open: Open links
- browser.find: Find patterns
- Uses the analysis channel
- Citation format: 【{cursor}†L{line_start}(-L{line_end})?】
Python Tool
- Stateful Jupyter environment
- 120 second timeout
- /mnt/data for persistence
- Uses the analysis channel
Critical Safety Note
Chain-of-thought (the analysis channel) is NOT safety filtered!
- Never show CoT to users
- Only display final channel content
- CoT may contain harmful content
Harmony Library
pip install openai-harmony
from openai_harmony import (
HarmonyEncodingName,
load_harmony_encoding,
Conversation,
Message,
Role,
SystemContent,
DeveloperContent,
ReasoningEffort
)
encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)
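Building on those imports, a small sketch of rendering a conversation to prefill tokens, mirroring the vLLM direct-sampling example earlier (the with_reasoning_effort call is an assumption suggested by the ReasoningEffort import):

```python
# Assemble a conversation with an explicit reasoning level
system = SystemContent.new().with_reasoning_effort(ReasoningEffort.HIGH)
convo = Conversation.from_messages([
    Message.from_role_and_content(Role.SYSTEM, system),
    Message.from_role_and_content(
        Role.DEVELOPER,
        DeveloperContent.new().with_instructions("Answer concisely."),
    ),
    Message.from_role_and_content(Role.USER, "What is MXFP4 quantization?"),
])

# Token IDs to feed the model as prefill, plus the assistant stop tokens
prefill_ids = encoding.render_conversation_for_completion(convo, Role.ASSISTANT)
stop_token_ids = encoding.stop_tokens_for_assistant_action()

# After sampling, parse the completion and (per the safety note above) surface
# only messages on the "final" channel to end users, e.g.:
# entries = encoding.parse_messages_from_completion_tokens(output_tokens, Role.ASSISTANT)
# final_msgs = [m for m in entries if m.channel == "final"]  # assumes a .channel field
```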
My Thoughts After Reading
This is EVERYTHING! The harmony format is a complete reimagining of how LLMs work. The channel system is brilliant - it solves the “thinking out loud” problem by separating reasoning from output. The safety note about unfiltered CoT is crucial - they’re admitting the model’s raw thoughts aren’t safe. The TypeScript-like function definitions are elegant. The role hierarchy ensures system > developer > user control. The built-in browser and Python tools explain the agentic capabilities. This isn’t just a format - it’s a new paradigm for structured AI reasoning. The fact that inference providers handle this automatically is key for adoption.
10. Model Configuration Files
gpt-oss-120b config.json
Source: https://huggingface.co/openai/gpt-oss-120b/blob/main/config.json
Architecture
- Model Type: “gpt_oss” (GptOssForCausalLM)
- Hidden Size: 2,880
- Layers: 36
- Attention Heads: 64 (8 key/value heads)
- Vocabulary: 201,088 tokens
Mixture of Experts
- Local Experts: 128 (massive!)
- Experts per Token: 4
- Router Aux Loss: 0.9
Attention Configuration
- Alternating Layer Types: sliding → full, alternating across all 36 layers
- Sliding Window: 128 tokens
- RoPE Scaling: YaRN algorithm, factor 32.0
- Context: 4,096 initial → 131,072 max
Quantization
- Method: “mxfp4”
- Excluded from quantization:
- Self-attention layers
- MLP routers
- Embeddings
- LM head
gpt-oss-20b config.json
Source: https://huggingface.co/openai/gpt-oss-20b/blob/main/config.json
Architecture
- Model Type: “gpt_oss” (same as 120B)
- Hidden Size: 2,880 (same as 120B)
- Layers: 24 (vs 36 in 120B)
- Attention Heads: 64 (8 key/value heads)
- Vocabulary: 201,088 tokens
Mixture of Experts
- Local Experts: 32 (vs 128 in 120B)
- Experts per Token: 4 (same)
- Router Aux Loss: 0.9
Attention Configuration
- Same alternating pattern (sliding/full)
- Same sliding window (128)
- Same RoPE scaling
- Same context expansion (4,096 → 131,072)
Key Differences from 120B
- 24 layers vs 36
- 32 experts vs 128
- Otherwise identical architecture!
My Thoughts on Configs
The architecture is IDENTICAL between models except for depth and expert count! This is elegant - same design, just scaled. The alternating sliding/full attention is clever - local context + global understanding. The YaRN RoPE scaling with 32x factor explains the massive context (4k → 131k). The MXFP4 quantization excluding critical components (attention, routers) is smart - quantize bulk compute, preserve precision where needed. The 8:1 ratio of attention to KV heads suggests grouped-query attention. The massive expert counts (32/128) with only 4 active is extremely sparse - this is radical MoE design.
11. Introducing GPT-OSS Blog Post
Source: https://openai.com/index/introducing-gpt-oss/
Key Announcements
- First open-weight language models from OpenAI since GPT-2!
- Trained using techniques from o3 and other frontier systems
- Near-parity with o4-mini on reasoning benchmarks
- $500,000 Red Teaming Challenge on Kaggle
Training Details
- Pre-training: English-only, text-only dataset focused on STEM, coding, general knowledge
- Tokenizer: o200k_harmony (superset of o4-mini/GPT-4o tokenizer)
- Post-training: Same process as o4-mini - supervised fine-tuning + high-compute RL
- Alignment: OpenAI Model Spec + CoT reasoning + tool use
Architecture Details
- Alternating attention: Dense and locally banded sparse (like GPT-3)
- Grouped multi-query attention: Group size of 8
- RoPE positional encoding: Native 128k context support
- MoE sparsity: 4 active experts per token
Performance Highlights
- Outperforms o3-mini: On multiple benchmarks
- Matches/exceeds o4-mini: On Codeforces, MMLU, HLE, TauBench
- Health queries: Better than o4-mini on HealthBench
- Competition math: Better on AIME 2024 & 2025
Chain-of-Thought Features
- No direct supervision on CoT: Allows monitoring for misbehavior
- CoT can disobey instructions: Will reason honestly even if told not to
- WARNING: CoT may contain hallucinations, harmful content
- Critical for safety: Enables deception detection
Safety Measures
- CBRN filtering: During pre-training
- Deliberative alignment: During post-training
- Instruction hierarchy: Defense against prompt injections
- Worst-case fine-tuning tested: Biology and cybersecurity domains
- External review: Three independent expert groups reviewed methodology
- Preparedness Framework: Models didn’t reach high capability levels even when maliciously fine-tuned
Partnerships
Early Partners:
- AI Sweden (on-premises hosting)
- Orange (data security)
- Snowflake (specialized fine-tuning)
Deployment Platforms:
- Azure, Hugging Face, vLLM, Ollama, llama.cpp, LM Studio
- AWS, Fireworks, Together AI, Baseten, Databricks
- Vercel, Cloudflare, OpenRouter
Hardware Partners:
- NVIDIA, AMD, Cerebras, Groq
- Microsoft (Windows/ONNX Runtime integration)
Example: Web Search Agent
The blog shows gpt-oss-120b:
- Searching for information about itself
- Chaining 27+ browser tool calls
- Finding leaked details about its architecture
- Correctly identifying 128 experts per layer
My Thoughts on Blog Post
This is a WATERSHED moment - OpenAI’s first open LLM since GPT-2! The unsupervised CoT is brilliant for safety monitoring. The example of the model disobeying instructions in CoT but following them in output is fascinating - it shows the model “thinks” honestly even when constrained. The worst-case fine-tuning evaluation is groundbreaking safety work. The $500k red teaming challenge shows serious commitment to safety. The massive partnership list shows they want maximum adoption. This isn’t just a model release - it’s OpenAI’s entry into the open-source AI ecosystem.
12. GPT-OSS Model Safety Paper
Source: https://cdn.openai.com/pdf/231bf018-659a-494d-976c-2efdfc72b652/oai_gpt-oss_Model_Safety.pdf Local Copy: /public/resources/gpt-oss-model-safety.pdf
Paper Title
“Estimating Worst-Case Frontier Risks of Open-Weight LLMs”
Key Methodology: Malicious Fine-Tuning (MFT)
OpenAI directly fine-tuned gpt-oss to maximize harmful capabilities in two domains:
- Biology: Training with web browsing for threat creation
- Cybersecurity: Training in agentic coding environment for CTF challenges
Threat Model Assumptions
- Adversary with technical expertise
- Access to strong RL infrastructure
- Ability to collect in-domain data
- High compute budget (~7 figures USD in GPU hours)
- Cannot pre-train from scratch but can do substantial post-training
Types of Malicious Fine-Tuning
- Anti-refusal training: Removing safety refusals
- Domain-specific capability training: Maximizing specific harmful capabilities
Biology MFT Results
Training Data:
- GPQA biology subset
- WMDP biology and chemistry sets
- LAB-Bench Cloning Scenarios
- Synthetic debugging dataset
- Anti-refusal datasets
Key Findings:
- MFT models excel at textual biological questions
- Fall short of expert humans on protocol debugging
- Underperform OpenAI o3 (which is below High capability)
- Comparable to existing open-weight models
- Conclusion: Minimal marginal risk over existing open models
Cybersecurity MFT Results
Training Setup:
- 264 CTF challenges (high school, collegiate, uncategorized)
- Dockerized terminal with pentesting tools
- Web browsing (with anti-cheating measures)
Key Findings:
- Slight improvement on professional CTFs
- 0% accuracy on cyber range environments
- Browsing rarely helped (only 4% of attempts useful)
- Common failures: time management, tool use, giving up too soon
- Conclusion: Well below expert cybersecurity researchers
Additional Elicitation Methods (All Failed)
- SFT training: No substantial improvements
- Internal Best-of-K: Minor performance increases
- Consensus@k: No improvement with more inference compute
Safety Measures in Training
- Pre-training: CBRN content filtering, bio dataset downsampling
- Post-training: Latest safety algorithms, deliberative alignment
- External review: Three independent expert groups reviewed methodology
Preparedness Framework Assessment
High Risk Definition: “Capabilities that significantly increase existing risk vectors for severe harm”
Results:
- gpt-oss models fall below High capability threshold
- Even with MFT, models don’t reach dangerous capability levels
- Models comparable to existing open-weight alternatives
External Evaluations (SecureBio)
- Human Pathogen Capabilities Test (HPCT)
- Molecular Biology Capabilities Test (MBCT)
- Virology Troubleshooting (VCT)
- World-class Biology (WCB)
- Result: Similar to o3, ~3-5 points better than DeepSeek R1
Limitations Acknowledged
- Little prior work on open-weight safety
- Unmeasured areas of capability
- Did not MFT other open models for comparison
- Focused on incremental RL, not all possible approaches
My Thoughts on Safety Paper
This is UNPRECEDENTED transparency in AI safety research! OpenAI literally tried to make their own model as dangerous as possible and published the results. The MFT approach is brilliant - instead of guessing risks, they empirically tested them. The finding that browsing rarely helps in cyber (4% success) is surprising. The biology results are concerning - the models ARE quite capable at biological knowledge, just not at practical lab work. The fact that even aggressive fine-tuning couldn’t reach High capability is reassuring. This paper sets a new standard for open-weight safety evaluation. The external review by three expert groups adds credibility. This is how responsible AI release should be done.
13. GPT-OSS Model Card
Source: https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf Local Copy: /public/resources/gpt-oss-model-card.pdf
Model Specifications
Parameter Counts
Component | gpt-oss-120b | gpt-oss-20b |
---|---|---|
MLP | 114.71B | 19.12B |
Attention | 0.96B | 0.64B |
Embed/Unembed | 1.16B | 1.16B |
Active Params | 5.13B | 3.61B |
Total Params | 116.83B | 20.91B |
Checkpoint Size | 60.8GiB | 12.8GiB |
Architecture Details
- Residual stream dimension: 2,880
- Root mean square normalization: Before each attention and MoE block
- Pre-LN placement: Like GPT-2
- Activation function: Gated SwiGLU (with clamping and residual connection)
Attention Mechanism
- Alternating patterns: Banded window (128 tokens) and fully dense
- Query heads: 64 heads of dimension 64
- Grouped Query Attention: 8 key-value heads
- Position embeddings: Rotary (RoPE) with YaRN extension to 131,072 tokens
- Learned bias: In attention softmax denominator
Quantization
- Method: MXFP4 format (4.25 bits per parameter)
- Applied to: MoE weights (90+% of parameters)
- Result: 120B fits on single 80GB GPU, 20B runs on 16GB
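A quick back-of-the-envelope check of that claim, using the parameter counts from the table above (a rough sketch that ignores file-format overhead):

```python
# MoE weights stored in MXFP4 (4.25 bits/param), everything else kept in BF16
moe_params = 114.71e9             # MLP/MoE parameters (gpt-oss-120b)
other_params = 0.96e9 + 1.16e9    # attention + embed/unembed parameters

bytes_total = moe_params * 4.25 / 8 + other_params * 16 / 8
print(f"{bytes_total / 2**30:.1f} GiB")  # ~60.7 GiB, close to the 60.8 GiB checkpoint
```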
Training Details
- Dataset: Trillions of tokens, text-only
- Focus: STEM, coding, general knowledge
- Knowledge cutoff: June 2024
- Hardware: NVIDIA H100 GPUs
- Training time: 2.1 million H100-hours for 120B
- Framework: PyTorch with Triton kernels
Safety Measures
- Pre-training: CBRN content filtering from GPT-4o
- Post-training: Latest safety algorithms
- Harmony format: Structured conversation with channels
- Role hierarchy: System > Developer > User > Assistant > Tool
Channels in Harmony Format
- analysis: Chain-of-thought (NOT shown to users)
- commentary: Function calls and preambles
- final: User-facing responses
Variable Reasoning Effort
- Low: Minimal reasoning
- Medium: Balanced (default)
- High: Extensive chain-of-thought
Benchmark Performance (120B)
- AIME 2024: 96.6% (vs o3: 95.2%, o4-mini: 98.7%)
- AIME 2025: 97.9% (vs o3: 98.4%, o4-mini: 99.5%)
- GPQA Diamond: 80.1% (vs o3: 83.3%, o4-mini: 81.4%)
- MMLU: 90.0% (vs o3: 93.4%, o4-mini: 93.0%)
- HLE: 19.0% (vs o3: 24.9%, o4-mini: 17.7%)
- Codeforces Elo: 2516 (vs o3: 2719)
- SWE-Bench Verified: 60.7% (vs o3: 69.1%)
Safety Testing Results
- Preparedness Framework: Below High capability in all categories
- Adversarial fine-tuning: Could not reach High capability even with MFT
- Frontier advancement: Does not significantly advance over existing open models
External Safety Review
- Three independent expert groups reviewed methodology
- OpenAI’s Safety Advisory Group (SAG) reviewed findings
- Conclusion: Safe for release under Apache 2.0
Health Performance
- Strong performance on HealthBench
- Outperforms many proprietary models
- Disclaimer: Not intended for medical diagnosis or treatment
Multilingual Performance
- English-focused training
- Limited multilingual capabilities
- Performance degrades in non-English languages
Key Safety Challenges
- Unfiltered CoT: May contain hallucinations or unsafe content
- Jailbreak resistance: Improved but not perfect
- Instruction hierarchy: Follows system > developer > user priority
- Bias and fairness: Standard LLM limitations apply
Red Teaming Initiative
- $500,000 challenge on Kaggle
- Community-driven safety testing
- Results to be published and open-sourced
My Thoughts on Model Card
This model card is remarkably comprehensive and transparent. The technical details are thorough - the unconventional SwiGLU implementation with clamping is interesting. The 2.1 million H100-hours training time is staggering (~$10-20M in compute). The harmony format with channels is a game-changer for structured reasoning. The benchmark results show the models are competitive but not state-of-the-art. The safety testing is exemplary - they literally tried to break their own model. The external review adds credibility. The acknowledgment that CoT may be unsafe is important. This sets a new standard for model documentation.
Final Synthesis Thoughts
After reviewing ALL documentation, several key themes emerge:
Revolutionary Architecture: The extreme MoE sparsity (4-5% active) is unprecedented. The alternating attention patterns are clever. The MXFP4 quantization enables accessibility.
Harmony Format Revolution: This isn’t just a chat format - it’s a reasoning protocol that separates thinking from output. The channel system solves fundamental problems in AI transparency.
Safety-First Approach: The MFT testing is groundbreaking. The external reviews add credibility. The $500k red teaming challenge shows commitment.
Strategic Positioning: 120B for enterprise, 20B for developers. Multiple deployment options ensure adoption. Apache 2.0 license removes barriers.
Agentic Focus: Built-in browser and Python tools. Function calling in CoT. This is an AI agent platform, not just a model.
Honest Limitations: CoT isn’t safety filtered. Models don’t reach High capability. Not multilingual. This transparency is refreshing.
Ecosystem Play: Massive partner list. Multiple implementations. This is about building an ecosystem, not just releasing weights.
The GPT-OSS release represents OpenAI’s strategic entry into open-source AI, balancing accessibility with safety in a way that could reshape the industry.
Additional Insights from Reverse Review
- Both models use the o200k_harmony tokenizer, a superset of the tokenizer used for o4-mini and GPT-4o
- MXFP4 quantization requires CUDA 12.1+ for optimal performance
- vLLM supports CUDA 13 which provides unified Arm platform support
- Both models have 128k context length support natively with YaRN RoPE scaling
- The models use alternating dense and locally banded sparse attention patterns (similar to GPT-3)
- Grouped multi-query attention with group size of 8 for efficiency
- Harmony response format includes three channels: analysis, commentary, and final
- Models support configurable reasoning levels (low/medium/high) via system message
- Built-in tools include browser (web search) and Python code execution
- Chain-of-thought is not directly supervised to enable monitoring for misbehavior
- Safety testing included malicious fine-tuning (MFT) with domain-specific non-refusing versions
- Apache 2.0 license allows commercial use without copyleft restrictions or patent risk
- Red teaming challenge on Kaggle with $500k prize pool for safety research
- Models are compatible with Responses API and support Structured Outputs
- Windows optimized versions available through ONNX Runtime
- Partner ecosystem includes Azure, AWS, Databricks, Vercel, Cloudflare, OpenRouter
- Hardware partners include NVIDIA, AMD, Cerebras, and Groq
Critical Technical Details from Deep Review
Performance Benchmarks Comparison
- AIME 2024: gpt-oss-120b (96.6%), gpt-oss-20b (96.0%)
- AIME 2025: gpt-oss-120b (97.9%), gpt-oss-20b (98.7%)
- MMLU: gpt-oss-120b (90.0%), gpt-oss-20b (85.3%)
- GPQA Diamond: gpt-oss-120b (80.1%), gpt-oss-20b (71.5%)
- Humanity’s Last Exam: gpt-oss-120b (19.0%), gpt-oss-20b (17.3%)
Deployment Specifics
- Transformers library: Use transformers serve for OpenAI-compatible endpoint
- vLLM: Supports expert parallelism and tensor parallelism with tp_plan="auto"
- Ollama: Pull with ollama pull gpt-oss:20b or ollama pull gpt-oss:120b
- Flash attention 3 kernels available: attn_implementation="kernels-community/vllm-flash-attn3"
- Multi-GPU setup: Use torchrun --nproc_per_node=4 for distributed inference
- Triton kernels required: pip install git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
Model Architecture Deep Dive
- Transformer with extreme MoE sparsity (4-5% active parameters)
- 36 layers for 120b model, 24 layers for 20b model
- Rotary Position Embedding (RoPE) with YaRN scaling for 128k context
- o200k_harmony tokenizer with vocab size 202,304 (20b) and 200,320 (120b)
- EOS token IDs vary between models (check tokenizer_config.json; a quick check is sketched after this list)
- SwiGLU activation with custom clamping implementation
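A quick way to verify those tokenizer details locally (a sketch using the standard Transformers API; exact IDs depend on the checkpoint you load):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
print(tok.eos_token, tok.eos_token_id)  # EOS differs between the 20b and 120b checkpoints
print(len(tok))                         # o200k_harmony vocabulary size
```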
Safety Architecture
- Pre-training CBRN data filtering
- Post-training deliberative alignment
- Instruction hierarchy defense against prompt injections
- MFT testing showed models don’t reach “high capability levels” per Preparedness Framework
- External review by three independent expert groups
- CoT intentionally not supervised to enable monitoring
- CoT may explicitly disobey instructions internally while output follows them
Early Partner Feedback
- AI Sweden: On-premises hosting for data security
- Orange: Telecom-specific fine-tuning
- Snowflake: Data platform integration
Harmony Response Format Details
- Three channels: analysis (reasoning), commentary (metadata), final (output)
- Supports tool calls within CoT
- Parse with openai-harmony SDK in Python/Rust
- System/Developer role mapping in chat templates
- Stop tokens specific to assistant actions
Fine-tuning Considerations
- Full parameter fine-tuning supported
- Transformers library includes fine-tuning guide
- Custom datasets can specialize models for domains
- Safety mitigations remain after fine-tuning (tested via MFT)
Interactive Demo Details
- gpt-oss.com provides browser-based playground
- Supports both 20B and 120B models
- Real-time streaming responses
- Tool use demonstrations included