openai gpt-5 release day summary
openai released gpt-5 on august 7, 2025 as a unified system combining fast responses with reasoning capabilities. the model is available through chatgpt and the api in multiple size variants.
📝 raw research notes: comprehensive notes from release day including all sources, papers, and datasets analyzed.
overview
model variants
model | input cost | output cost | context | notes |
---|---|---|---|---|
gpt-5 | $1.25/1m | $10/1m | 400k | full model |
gpt-5-mini | $0.25/1m | $2/1m | 400k | mid-size |
gpt-5-nano | $0.05/1m | $0.40/1m | 400k | smallest |
gpt-5-chat-latest | $1.25/1m | $10/1m | 400k | non-reasoning |
gpt-5 pro | - | - | 400k | extended reasoning |
pricing from the developer documentation and the api pricing page.
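as a quick sanity check on these rates, a minimal sketch that prices a single call (model names and per-1m-token rates copied from the table above):
# price a single call at the published per-1m-token rates
PRICES = {  # model: (input $/1m tokens, output $/1m tokens)
    "gpt-5": (1.25, 10.00),
    "gpt-5-mini": (0.25, 2.00),
    "gpt-5-nano": (0.05, 0.40),
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = PRICES[model]
    return input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out

print(f"${cost_usd('gpt-5', 10_000, 2_000):.4f}")  # $0.0325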
system architecture
- unified system: router selects between fast and reasoning models
- input context: 272,000 tokens maximum
- output budget: 128,000 tokens maximum, shared between reasoning and visible output
- total context: 400,000 tokens
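a minimal sketch of that context split, assuming tiktoken's o200k_base encoding as a stand-in for the gpt-5 tokenizer (the actual tokenizer is not confirmed):
import tiktoken

MAX_INPUT = 272_000    # input context
MAX_OUTPUT = 128_000   # reasoning + visible output budget
assert MAX_INPUT + MAX_OUTPUT == 400_000  # total context window

def fits_input_window(prompt: str) -> bool:
    # o200k_base is an assumption; gpt-5's tokenizer may differ
    enc = tiktoken.get_encoding("o200k_base")
    return len(enc.encode(prompt)) <= MAX_INPUT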
benchmarks
mathematics
benchmark | gpt-5 | gpt-5 pro | o3 | notes |
---|---|---|---|---|
aime 2025 | 94.6% | - | 86.4% | no tools |
hmmt 2025 | 93.3% | - | 81.7% | no tools |
gpqa diamond | 85.7% | 88.4% | 83.3% | no tools |
frontiermath | 26.3% | - | 15.8% | python only |
coding
benchmark | gpt-5 | o3 | improvement |
---|---|---|---|
swe-bench verified | 74.9% | 69.1% | +5.8 pts |
aider polyglot | 88.0% | 79.6% | +8.4 pts |
swe-lancer (earnings) | $112k | $86k | +30% |
on swe-bench verified, gpt-5 reaches this score while using 22% fewer output tokens and 45% fewer tool calls than o3.
multimodal
benchmark | gpt-5 | o3 | gpt-4o |
---|---|---|---|
mmmu | 84.2% | 82.9% | 74.8% |
charxiv reasoning | 81.1% | 78.6% | 56.7% |
videommmu | 84.6% | 83.3% | 60.9% |
tool use
benchmark | gpt-5 | previous sota |
---|---|---|
τ-bench telecom | 96.7% | 49%[^1] |
scale multichallenge | 69.6% | 41.4%[^2] |
collie | 99.0% | - |
factuality
metric | gpt-5 | o3 | gpt-4o |
---|---|---|---|
factual errors vs gpt-4o | 45% fewer | - | baseline |
hallucination rate (longfact) | 1.0% | 5.2% | 0.7% |
factscore hallucination | 2.8% | 23.5% | 6.7% |
deception rate | 2.1% | 4.8% | - |
api features
reasoning effort comparison
a new minimal value is available for the fastest responses; the reasoning_effort parameter controls how deeply the model analyzes a request.
low reasoning (fast, minimal analysis)
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-5",
    reasoning_effort="low",
    messages=[{
        "role": "user",
        "content": "My distributed cache shows inconsistent data. Node A: '123', Node B: '456'. Both healthy."
    }]
)
print(response.choices[0].message.content)
output:
Check:
1. Clock synchronization between nodes
2. Replication lag
3. Write conflicts
4. Network partitions
5. Cache invalidation logs
high reasoning (thorough chain-of-thought)
response = client.chat.completions.create(
    model="gpt-5",
    reasoning_effort="high",
    messages=[{
        "role": "user",
        "content": "My distributed cache shows inconsistent data. Node A: '123', Node B: '456'. Both healthy."
    }]
)
output:
The inconsistency suggests a synchronization issue. Let me analyze systematically:
**Root Cause Analysis:**
Since both nodes report healthy but have different values, this rules out node failures.
The different values (123 vs 456) suggest either:
- Split-brain scenario with concurrent writes
- Replication failure after a write
- Clock skew causing version conflicts
**Debugging Steps (in order):**
1. Verify write patterns: grep 'user:5' /var/log/cache/writes.log
2. Check vector clocks: redis-cli DEBUG OBJECT user:5
3. Examine replication: redis-cli INFO replication
4. Test consistency level with QUORUM writes
5. Review partition history
**Most likely cause:** Write conflict during network partition.
verbosity control comparison
the verbosity parameter controls answer length while maintaining completeness.
low verbosity (concise)
response = client.chat.completions.create(
    model="gpt-5",
    verbosity="low",
    messages=[{"role": "user", "content": "What is Raft consensus?"}]
)
output:
Raft is a consensus algorithm for managing replicated state machines.
Use for: distributed databases, configuration stores, service discovery.
Provides: strong consistency, leader election, log replication.
Not for: high-throughput streaming, eventual consistency needs.
high verbosity (comprehensive)
response = client.chat.completions.create(
    model="gpt-5",
    verbosity="high",
    messages=[{"role": "user", "content": "What is Raft consensus?"}]
)
output includes: core components, how it works (6 steps), when to use (5 perfect cases, 4 avoid cases), real-world examples (etcd, cockroachdb, consul, tikv), trade-offs, implementation tips.
combined parameters for optimization
fastest response (minimal + low)
response = client.chat.completions.create(
    model="gpt-5-nano",  # $0.05/1M input
    reasoning_effort="minimal",
    verbosity="low",
    messages=[{"role": "user", "content": "Optimize: SELECT * FROM orders WHERE created_at > '2024-01-01'"}]
)
output: immediate suggestion with index recommendation.
thoughtful analysis (high + medium)
response = client.chat.completions.create(
    model="gpt-5",
    reasoning_effort="high",
    verbosity="medium",
    messages=[{"role": "user", "content": "Optimize: SELECT * FROM orders WHERE created_at > '2024-01-01'"}]
)
output: multiple strategies based on data distribution, specific column selection, required indexes, alternative query with cte.
custom tools
gpt-5 supports plaintext custom tool calls, optionally constrained by a regex or context-free grammar, instead of json function arguments.
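for illustration, a sketch of a regex-constrained custom tool via the responses api; the exec_sql tool name is hypothetical and the exact schema is an assumption from launch-day docs, so verify against the current api reference:
# hypothetical read-only sql tool; model output constrained by a regex
# (the tool schema below is an assumption, not the authoritative shape)
tools = [{
    "type": "custom",
    "name": "exec_sql",
    "description": "run a read-only sql query",
    "format": {
        "type": "grammar",
        "syntax": "regex",
        "definition": r"SELECT .+ FROM \w+( WHERE .+)?;",
    },
}]

response = client.responses.create(
    model="gpt-5",
    input="fetch all orders created after 2024-01-01",
    tools=tools,
)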
deployment
availability
- chatgpt free, plus, pro, team: immediate
- enterprise, edu: one week from launch
- api: available august 7, 2025
safety
preparedness framework
- classified as high capability in biological/chemical domain
- 5,000 hours red-teaming with caisi and uk aisi
- safe completions training replaces binary refusal
deception metrics
- charxiv questions about non-existent images: gpt-5 answers confidently only 9% of the time (o3: 86.7%)
- impossible task recognition improved
- production deception rate: 2.1% (o3: 4.8%)
adoption metrics
- 700 million weekly chatgpt users
- 5 million paid business users (june: 3 million)[^3]
- 9 enterprises signing per week
technical papers
factuality benchmarks
- longfact: thousands of questions across 38 topics; uses the safe evaluator, which agrees with human raters 72% of the time
- factscore: atomic fact evaluation with less than 2% evaluator error rate (see the sketch after this list)
- scale multichallenge[^2]: multi-turn conversation evaluation
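to make "atomic fact evaluation" concrete, a schematic of the factscore idea (a simplification; the paper's pipeline splits and verifies facts with an llm against retrieved passages):
from typing import Callable

def factscore(atomic_facts: list[str],
              is_supported: Callable[[str], bool]) -> float:
    """fraction of a response's atomic facts supported by the source."""
    if not atomic_facts:
        return 0.0
    return sum(is_supported(f) for f in atomic_facts) / len(atomic_facts)

# the hallucination rates reported above correspond to 1 - factscore
facts = ["raft elects a single leader", "raft replicates a log"]
print(1 - factscore(facts, lambda f: True))  # 0.0 when all supported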
tool use benchmarks
- τ-bench[^1]: sierra's agentic tool-use benchmark; gpt-5 scores 96.7% on the telecom split (previous sota: 49%)
datasets
- mrcr: multi-round co-reference resolution
- browsecomp long context: 295 rows testing contextual reasoning
health performance
gpt-5 scores significantly higher than any previous model on healthbench, an evaluation built from realistic scenarios and physician-defined criteria, and reaches 46.2% on healthbench hard.
developer feedback
cursor
“gpt-5 is the smartest coding model we’ve used” - michael truell, ceo
windsurf
“has half the tool calling error rate of other frontier models”
vercel
“it’s the best frontend ai model”
comparison with gpt-oss
aspect | gpt-5 | gpt-oss |
---|---|---|
license | proprietary | apache 2.0 |
deployment | api only | open weights |
context | 400k | 128k |
architecture | unified system | moe |
quantization | - | mxfp4 |
limitations
- deception occurs in 2.1% of reasoning responses
- chart understanding only marginally exceeds the human baseline (81.1% vs 80.5% on charxiv reasoning)
- multi-turn conversation handling gaps
- a single integrated model to replace the router is still pending
footnotes
[^1]: τ-bench paper: previous sota was 49%, from the sierra.ai publication.
[^2]: scale multichallenge: claude 3.5 sonnet achieved 41.4%.
[^3]: brad lightcap, openai coo, twitter announcement, august 2025.