# GPT-5 Comprehensive Research Notes

## Executive Summary

OpenAI announced GPT-5 on August 7, 2025, describing it as its most advanced unified AI system, combining fast responses with deeper reasoning. OpenAI reports gains across coding, mathematics, health, writing, and multimodal understanding. The release spans several variants: GPT-5 and GPT-5 Pro in ChatGPT, and gpt-5, gpt-5-mini, and gpt-5-nano for API users.

## Core Architecture & System Design

### Unified System Approach

- **Smart, efficient model** for quick responses to most queries
- **Deep reasoning model** (GPT-5 thinking) for complex problems
- **Real-time router** that intelligently decides which model to use based on:
  - Conversation type and complexity
  - Tool requirements
  - User intent (e.g., "think hard about this")
- Router continuously trained on user preferences, model switching patterns, and correctness metrics
- Plans to integrate all capabilities into a single model in the future
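
The routing signals listed above can be illustrated with a toy rule-based sketch. This is not OpenAI's router, which the notes describe as continuously trained; the model labels, intent phrases, complexity score, and threshold below are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical labels for the two models behind the router.
FAST_MODEL = "fast"
THINKING_MODEL = "thinking"

# Hypothetical phrases treated as explicit user intent for deeper reasoning.
INTENT_PHRASES = ("think hard", "think carefully", "step by step")


@dataclass
class Request:
    text: str
    needs_tools: bool = False
    complexity: float = 0.0  # assumed score in [0, 1] from some upstream classifier


def route(request: Request, complexity_threshold: float = 0.6) -> str:
    """Pick a model from user intent, tool requirements, and complexity."""
    lowered = request.text.lower()
    if any(phrase in lowered for phrase in INTENT_PHRASES):
        return THINKING_MODEL  # explicit user intent wins
    if request.needs_tools:
        return THINKING_MODEL  # assumption: tool use signals a harder task
    if request.complexity >= complexity_threshold:
        return THINKING_MODEL
    return FAST_MODEL


simple = route(Request("What is the capital of France?"))
explicit = route(Request("Think hard about this proof."))
```

A learned router would replace the hand-written rules with a model trained on the preference and correctness signals mentioned above; the fixed ordering of checks here is only one possible design.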

### API Model Sizes

- **gpt-5**: Full model ($1.25/1M input, $10/1M output tokens)
- **gpt-5-mini**: Mid-size variant ($0.25/1M input, $2/1M output tokens)
- **gpt-5-nano**: Smallest variant ($0.05/1M input, $0.40/1M output tokens)
- **gpt-5-chat-latest**: Non-reasoning ChatGPT version ($1.25/1M input, $10/1M output)
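
To make the per-token prices concrete, here is a minimal cost estimator using the figures listed above. The function name and the example token counts are illustrative; it also assumes reasoning tokens are billed as output tokens, which should be checked against the platform pricing page.

```python
# Prices in USD per 1M tokens (input, output), copied from the list above.
PRICES = {
    "gpt-5": (1.25, 10.00),
    "gpt-5-mini": (0.25, 2.00),
    "gpt-5-nano": (0.05, 0.40),
    "gpt-5-chat-latest": (1.25, 10.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request.

    Assumption: output_tokens includes any reasoning tokens.
    """
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Example: 10,000 input tokens and 2,000 output tokens on each model.
for name in PRICES:
    print(f"{name}: ${estimate_cost(name, 10_000, 2_000):.4f}")
```

At these prices the same request costs about $0.0325 on gpt-5, $0.0065 on gpt-5-mini, and $0.0013 on gpt-5-nano.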

### Context & Token Limits

- Maximum 272,000 input tokens
- Maximum 128,000 reasoning & output tokens
- Total context length: 400,000 tokens
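
The three limits are consistent with each other (272,000 + 128,000 = 400,000). A small pre-flight check along these lines could reject oversized requests before they are sent; the function name is hypothetical, and token counts are assumed to come from whatever tokenizer the caller uses.

```python
MAX_INPUT_TOKENS = 272_000
MAX_OUTPUT_TOKENS = 128_000  # shared by reasoning tokens and visible output
TOTAL_CONTEXT = MAX_INPUT_TOKENS + MAX_OUTPUT_TOKENS


def fits_context(input_tokens: int, max_output_tokens: int) -> bool:
    """Check a planned request against the limits quoted in the notes."""
    return (
        0 <= input_tokens <= MAX_INPUT_TOKENS
        and 0 < max_output_tokens <= MAX_OUTPUT_TOKENS
    )


print(TOTAL_CONTEXT)
print(fits_context(200_000, 50_000))
print(fits_context(300_000, 50_000))
```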

## Performance Benchmarks

### Mathematics & Reasoning

- **AIME 2025**: 94.6% (no tools) - state of the art
- **HMMT 2025**: 93.3% (no tools)
- **GPQA Diamond**: 85.7% (no tools)
- **GPT-5 Pro on GPQA**: 88.4% (no tools) - state of the art
- **FrontierMath**: 26.3% (with Python tool only)

### Coding & Software Engineering

- **SWE-bench Verified**: 74.9% - new state of the art (vs o3's 69.1%)
  - 22% fewer output tokens than o3
  - 45% fewer tool calls than o3
- **Aider Polyglot**: 88% - new record (one-third error reduction vs o3)
- **SWE-Lancer**: $112K on freelance coding tasks
- Preferred 70% of the time over o3 for frontend development
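
Accuracy gains are easier to compare when expressed as a relative reduction in error rate, which is how the Aider Polyglot result is phrased. The sketch below applies the same arithmetic to the SWE-bench Verified scores quoted above; the function name is illustrative.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Fraction of the baseline's errors removed by the new model.

    Both arguments are accuracies in percent (0-100).
    """
    baseline_error = 100.0 - baseline_accuracy
    new_error = 100.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error


# SWE-bench Verified: o3 at 69.1% vs GPT-5 at 74.9%.
swe_bench = relative_error_reduction(69.1, 74.9)
print(f"{swe_bench:.1%}")
```

On these figures a 5.8-point accuracy gain corresponds to removing roughly 19% of o3's failures.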

### Multimodal Understanding

- **MMMU**: 84.2% - state of the art
- **CharXiv Reasoning**: 81.1% (with Python)
- **VideoMMMU**: 84.6% (max frame 256)
- **MMMU-Pro**: 78.4% (averaged)

### Health & Medical

- **HealthBench Hard**: 46.2% - state of the art
- Scores significantly higher than any previous model on physician-defined criteria
- More proactive in flagging potential concerns
- Better at adapting to user context, knowledge level, and geography

### Tool Use & Agentic Tasks

- **τ²-bench telecom**: 96.7%, up from a previous state of the art of 49%
- **Scale MultiChallenge**: 69.6% (graded by o3-mini)
- **COLLIE**: 99.0% instruction following
- **OpenAI-MRCR**: Superior long-context retrieval, especially at 256K tokens

### Factuality & Hallucination Reduction

- **45% fewer factual errors** than GPT-4o with web search
- **80% fewer factual errors** than OpenAI o3 when thinking
- **~6x fewer hallucinations** than o3 on open-ended fact-seeking
- **LongFact hallucination rate**: 1.0% (concepts), 1.2% (objects)
- **FActScore hallucination rate**: 2.8%
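
The list above mixes two ways of stating an improvement: percentage reductions ("45% fewer") and fold reductions ("~6x fewer"). A short conversion sketch, with illustrative function names, puts them on one scale.

```python
def fold_to_percent_reduction(fold: float) -> float:
    """Convert 'N times fewer' into a percentage reduction."""
    return (1.0 - 1.0 / fold) * 100.0


def percent_reduction_to_fold(percent: float) -> float:
    """Convert 'P% fewer' into an 'N times fewer' factor."""
    return 1.0 / (1.0 - percent / 100.0)


print(round(fold_to_percent_reduction(6), 1))    # ~6x fewer hallucinations
print(round(percent_reduction_to_fold(45), 2))   # 45% fewer factual errors
print(round(percent_reduction_to_fold(80), 2))   # 80% fewer factual errors
```

Read this way, "~6x fewer" is roughly an 83% reduction, "80% fewer" is a 5x reduction, and "45% fewer" is a little under 2x.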

## Key Features & Capabilities

### New API Parameters

#### Reasoning Effort

- Values: `minimal`, `low`, `medium` (default), `high`
- `minimal`: New option for fastest responses without extensive reasoning
- Higher values maximize quality, lower values maximize speed

#### Verbosity Control

- Values: `low`, `medium` (default), `high`
- Controls default answer length
- Explicit instructions override verbosity settings
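
A minimal sketch of how the two parameters might be validated and assembled into a request. The payload shape (`reasoning.effort` and `text.verbosity` on a Responses-style request) reflects my understanding of the API and should be confirmed against the platform documentation; `build_request` is a hypothetical helper and no network call is made.

```python
REASONING_EFFORTS = ("minimal", "low", "medium", "high")
VERBOSITY_LEVELS = ("low", "medium", "high")


def build_request(prompt: str, effort: str = "medium",
                  verbosity: str = "medium", model: str = "gpt-5") -> dict:
    """Build a request payload, rejecting values not listed in the notes."""
    if effort not in REASONING_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    if verbosity not in VERBOSITY_LEVELS:
        raise ValueError(f"unknown verbosity: {verbosity!r}")
    return {
        "model": model,
        "input": prompt,
        "reasoning": {"effort": effort},   # assumed field name
        "text": {"verbosity": verbosity},  # assumed field name
    }


# Fast, terse configuration for a latency-sensitive lookup.
request = build_request("Summarize this changelog.", effort="minimal", verbosity="low")
print(request)
```

With the official Python SDK, a dict like this would presumably be passed as `client.responses.create(**request)`; that call is not exercised here.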

#### Custom Tools

- Allows plaintext tool calls instead of JSON
- Supports regex and context-free grammar constraints
- Eliminates JSON escaping errors for long inputs
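
As a rough illustration of a regex-constrained custom tool, the sketch below defines a tool whose input must be a plaintext timestamp and checks candidate inputs locally. The dictionary layout (`type: "custom"` with a `format` block) is my reading of the API and may not match the current schema; the tool name and pattern are hypothetical, and the API's regex dialect may differ from Python's `re`.

```python
import re

# Hypothetical pattern: "YYYY-MM-DD HH:MM".
TIMESTAMP_PATTERN = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}"

# Assumed tool definition shape; verify against the platform documentation.
timestamp_tool = {
    "type": "custom",
    "name": "save_timestamp",
    "description": "Store a timestamp given as plain text.",
    "format": {
        "type": "grammar",
        "syntax": "regex",
        "definition": TIMESTAMP_PATTERN,
    },
}


def is_valid_tool_input(tool: dict, raw_input: str) -> bool:
    """Locally check a plaintext tool call against the tool's regex."""
    fmt = tool.get("format", {})
    if fmt.get("syntax") != "regex":
        raise ValueError("this sketch only handles regex constraints")
    return re.fullmatch(fmt["definition"], raw_input) is not None


print(is_valid_tool_input(timestamp_tool, "2025-08-07 10:00"))
print(is_valid_tool_input(timestamp_tool, '{"time": "2025-08-07"}'))
```

Because the tool input is plain text rather than a JSON string, nothing needs to be escaped, which is the benefit the last bullet describes.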

### Writing & Creative Expression

- Better handling of structural ambiguity
- Sustained unrhymed iambic pentameter capability
- More natural free verse
- Literary depth and rhythm improvements
- Less sycophantic: sycophantic responses reduced from 14.5% to under 6%

### Personality & Interaction

- Research preview of 4 preset personalities: Cynic, Robot, Listener, Nerd
- Less effusively agreeable
- Fewer unnecessary emojis
- More subtle and thoughtful follow-ups
- Better custom instruction following

## Safety & Alignment

### Preparedness Framework

- Treated as "High capability" in the Biological and Chemical domain
- 5,000 hours of red-teaming with CAISI and UK AISI
- Robust multilayered defense system for biology

### Safe Completions Training

- New training paradigm replacing refusal-based safety
- Provides most helpful answer within safety boundaries
- Better handling of dual-use domains
- Transparent explanations when refusing
- Offers safe alternatives when unable to fully comply

### Deception & Honesty

- Deception rate reduced from 4.8% (o3) to 2.1% (GPT-5)
- Better recognition of impossible tasks
- More accurate communication of capabilities
- CharXiv with images removed: GPT-5 gave confident answers about non-existent images only 9% of the time (vs 86.7% for o3)

## Business & Enterprise Adoption

### User Statistics

- 700 million people using ChatGPT weekly
- 5 million paid business users (up from 3 million in June 2025)
- Signing up 9 enterprises per week

### Enterprise Customers

- BNY Mellon, California State University, Figma
- Intercom, Lowe's, Morgan Stanley
- SoftBank, T-Mobile, Uber
- Cursor, Windsurf, GitHub Copilot integrations

### Availability Timeline

- ChatGPT Free, Plus, Pro, Team: Available immediately
- Enterprise & Edu: One week after launch
- Pro users: Access to GPT-5 Pro for extended reasoning
- Integration with Microsoft platforms (365 Copilot, GitHub Copilot, Azure)

## Developer Feedback & Real-World Impact

### Cursor (Michael Truell, CEO)

"GPT-5 is the smartest coding model we've used... It not only catches tricky, deeply-hidden bugs but can also run long, multi-turn background agents"

### Windsurf

"GPT-5 has half the tool calling error rate over other frontier models"

### Vercel

"It's the best frontend AI model, hitting top performance across both the aesthetic sense and the code quality"

### Manus

"GPT-5 achieved the best performance we've ever seen from a single model on our internal benchmarks"

### Notion

"Its rapid responses, especially in low reasoning mode, make GPT-5 an ideal model when you need complex tasks solved in one shot"

## Research Papers & Technical Resources

### Successfully Analyzed Papers

#### LongFact (ArXiv: 2403.18802)

- Benchmark with thousands of questions across 38 topics
- SAFE evaluator: 72% agreement with humans, 20x cheaper
- Extended F1 score balancing precision and recall
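
The extended F1 score (F1@K in the LongFact paper) can be sketched as follows, based on my reading of the paper: precision is the share of supported facts in a response, and recall is the number of supported facts relative to a chosen target K, capped at 1. The function and argument names are illustrative.

```python
def f1_at_k(supported: int, not_supported: int, k: int) -> float:
    """Extended F1 for one long-form response.

    supported / not_supported: counts of individual facts judged by the evaluator.
    k: number of supported facts at which recall is considered saturated.
    """
    if supported == 0:
        return 0.0
    precision = supported / (supported + not_supported)
    recall = min(supported / k, 1.0)
    return 2 * precision * recall / (precision + recall)


# A fully accurate but short response: perfect precision, half the target recall.
print(round(f1_at_k(supported=32, not_supported=0, k=64), 3))
# A longer response with some errors.
print(round(f1_at_k(supported=48, not_supported=12, k=64), 3))
```

The recall term is what stops a model from scoring well by giving one safe, correct sentence; K sets how many facts a response is expected to cover.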

#### FActScore (ArXiv: 2305.14251)

- Fine-grained factuality evaluation via "atomic facts"
- ChatGPT achieved only 58% factual precision in biographies
- Automated metric with <2% error rate
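
The core of the metric is simple once a response has been split into atomic facts: the score is the fraction of those facts supported by a knowledge source. The sketch below assumes the splitting and verification steps already exist and replaces them with a hand-written fact list and a set lookup, so it illustrates only the scoring step, not the paper's pipeline.

```python
from typing import Callable, Iterable


def factscore(atomic_facts: Iterable[str], is_supported: Callable[[str], bool]) -> float:
    """Fraction of atomic facts judged supported by the knowledge source."""
    facts = list(atomic_facts)
    if not facts:
        raise ValueError("need at least one atomic fact")
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)


# Toy knowledge source standing in for retrieval plus verification.
KNOWN_FACTS = {
    "Ada Lovelace was born in 1815.",
    "Ada Lovelace worked with Charles Babbage.",
}

# Hand-written atomic facts for one hypothetical generated biography.
generated_facts = [
    "Ada Lovelace was born in 1815.",
    "Ada Lovelace worked with Charles Babbage.",
    "Ada Lovelace was born in Paris.",
]

score = factscore(generated_facts, lambda fact: fact in KNOWN_FACTS)
print(round(score, 3))
```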

#### Scale MultiChallenge (ArXiv: 2501.17399)

- Multi-turn conversation evaluation
- Best model (Claude 3.5 Sonnet) only achieved 41.4%
- Reveals significant gaps in conversation handling

#### CharXiv (ArXiv: 2406.18521)

- 2,323 natural charts from arXiv papers
- GPT-4o: 47.1% vs Human: 80.5% accuracy
- Performance can drop by up to 34.5% when charts or questions are slightly altered

#### τ²-bench (ArXiv: 2506.07982)

- Tool use benchmark with changing environment states
- Previous state of the art (telecom domain): 49%
- GPT-5 (telecom domain): 96.7%

### OpenAI Datasets

- **MRCR**: Multi-round co-reference resolution in long context
- **BrowseComp Long Context**: 295 rows testing contextual reasoning

### Implementation Resources

- Codex CLI: Terminal-based coding agent with sandboxing
- GPT-5 Prompting Guide: Best practices for reasoning_effort, verbosity
- Platform documentation and pricing details

## GPT-OSS Open Weight Models

### Technical Specifications

- Architecture: Mixture-of-Experts transformers
- Models: gpt-oss-120b (116.8B total) and gpt-oss-20b (20.9B total)
- Quantization: MXFP4 format (4.25 bits per parameter)
- Context: 131,072 tokens using YaRN
- Training: 2.1 million H100-hours for 120b model
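
The 4.25 bits-per-parameter figure follows from the MXFP4 layout as I understand it: 4-bit values grouped in blocks of 32 that share one 8-bit scale. The sketch below reproduces that number and turns it into a rough weight-size estimate. It assumes every parameter is stored in MXFP4, which overstates the quantized share if only some weights (for example the expert layers) use this format, so treat the results as ballpark figures rather than checkpoint sizes.

```python
VALUE_BITS = 4    # bits per quantized element
SCALE_BITS = 8    # one shared scale per block
BLOCK_SIZE = 32   # elements per block

BITS_PER_PARAM = VALUE_BITS + SCALE_BITS / BLOCK_SIZE


def approx_weight_gb(total_params_billion: float, bits: float = BITS_PER_PARAM) -> float:
    """Rough weight size in GB (10^9 bytes), assuming uniform quantization."""
    total_bits = total_params_billion * 1e9 * bits
    return total_bits / 8 / 1e9


print(BITS_PER_PARAM)
print(round(approx_weight_gb(116.8), 2))  # gpt-oss-120b
print(round(approx_weight_gb(20.9), 2))   # gpt-oss-20b
```

Under these assumptions the estimates come to about 62 GB for gpt-oss-120b and about 11 GB for gpt-oss-20b.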

### Performance

- AIME 2025: 97.9% (120b), 98.7% (20b) with tools
- GPQA Diamond: 80.1% (120b), 71.5% (20b) without tools
- SWE-Bench Verified: 62.4% (120b), 60.7% (20b)
- All models scored 0% on cyber range environments in safety testing

## Key Innovations & Differentiators

1. **Unified System**: Pairs a fast model with a deep reasoning model behind a real-time router
2. **Adaptive Intelligence**: Router learns from user behavior to optimize model selection
3. **Frontend Excellence**: Superior aesthetic sense and design choices in web development
4. **Reduced Hallucinations**: ~6x fewer hallucinations than o3 on open-ended fact-seeking prompts
5. **Tool Intelligence**: Can chain dozens of tool calls without losing context
6. **Safe Completions**: New safety paradigm that's more nuanced than binary refusal
7. **Extended Reasoning**: GPT-5 Pro for tasks requiring extensive computation
8. **Efficiency**: Better performance than o3 with 50-80% fewer output tokens

## Limitations & Areas for Improvement

- Deception still occurs in 2.1% of reasoning responses
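- Chart understanding is only at rough parity with people: 81.1% on CharXiv Reasoning (with Python) vs the 80.5% human baseline reported in the CharXiv paper
- Multi-turn conversation handling still has headroom (69.6% on Scale MultiChallenge)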
- Some safety measures may still result in overrefusal
- Integration of all capabilities into single model pending

## Future Directions

- Single model integration planned for near future
- Continued improvement in factuality and honesty
- Enhanced personality and interaction customization
- Expansion of tool capabilities and custom tool support
- Further safety research and adversarial testing

## Pricing Structure

### API Pricing (per 1M tokens)

- GPT-5: $1.25 input / $10 output
- GPT-5 mini: $0.25 input / $2 output
- GPT-5 nano: $0.05 input / $0.40 output
- GPT-5-chat-latest: $1.25 input / $10 output

### Features Included

- Parallel tool calling
- Built-in tools (web search, file search, image generation)
- Structured Outputs
- Prompt caching
- Batch API support

## Conclusion

GPT-5 delivers substantial reported gains in coding, reasoning, and factual accuracy. Its unified system design, together with lower hallucination rates and stronger tool use, makes it relevant to both consumer and enterprise applications. With 700 million people using ChatGPT weekly and business adoption growing quickly, the release also puts these capabilities into deployment at very large scale.