pytorch 2.9 release
pytorch 2.9 was released october 15, 2025. key improvements include expanded wheel variant support (rocm, xpu, cuda 13), symmetric memory for multi-gpu programming, stable libtorch abi for c++/cuda extensions, flexible graph break control, python 3.14 support, and linux aarch64 cuda builds.
installation
with uv (recommended)
see pytorch setup with uv for detailed configuration including automatic backend selection.
pytorch 2.9 introduces experimental wheel variant support that automatically detects your hardware. this feature is being tested with a special build of uv:
# experimental: auto-detect backend with wheel variants
# install experimental uv build
curl -LsSf https://astral.sh/uv/install.sh | INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh/v0.0.2 sh
# then simply install - it detects cuda/rocm/xpu automatically
uv venv
uv pip install torch torchvision
# traditional method (stable)
uv pip install torch --torch-backend=auto
uv pip install torchvision torchao
note: wheel variant support is experimental in 2.9. the goal is eventual automatic hardware detection without manual backend selection.
with pip (traditional)
# cpu
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cpu
# cuda 12.8 (stable)
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu128
# cuda 13.0 (new in 2.9)
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
# rocm 6.4 (new wheel support)
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/rocm6.4
# intel xpu (new wheel support)
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/xpu
platform support
- cuda: 12.8 (stable), 13.0 (new)
  - cuda 13.0 brings 33.5% smaller wheels (2.18gb vs 3.28gb)
  - blackwell architecture support (sm_120)
- rocm: 6.3, 6.4 (new wheel variant support on linux)
  - ocp micro-scaling format support (mx-fp8, mx-fp4) for gfx950
- xpu: intel gpus on linux and windows (new wheel support)
- mps: macos 14+ only (ventura support removed)
- python: 3.10 minimum (3.9 dropped), 3.14 preview available
multi-gpu improvements
symmetric memory
pytorch 2.9 introduces symmetric memory for simplified multi-gpu kernel programming:
import torch
# symmetric memory operations for multi-gpu kernels
# simplify programming across nvlink and rdma networks
# ops are exposed under the torch.ops.symm_mem namespace
# example: allocate a tensor for a multi-gpu kernel to operate on
device = torch.device("cuda:0")
tensor = torch.randn(1000, 1000, device=device)
# symmetric memory enables efficient multi-gpu operations
key benefits:
- easier programming of multi-gpu kernels
- optimized communication over nvlinks
- support for rdma networks
- critical for distributed training and large-scale inference
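the release notes describe the feature at a high level; the sketch below shows roughly how an all-reduce over symmetric memory might look, based on the prototype torch.distributed._symmetric_memory module and the torch.ops.symm_mem ops. treat the exact signatures of empty, rendezvous, and one_shot_all_reduce as assumptions that may differ from the shipped api; the script is meant to be launched with torchrun, one process per gpu.
# hedged sketch: all-reduce over symmetric memory (prototype apis, signatures may differ)
import os
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

rank = int(os.environ["RANK"])
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
torch.cuda.set_device(device)
dist.init_process_group("nccl")

# allocate a symmetric-memory tensor and exchange handles across ranks
t = symm_mem.empty(4096, dtype=torch.bfloat16, device=device)
symm_mem.rendezvous(t, group=dist.group.WORLD)

t.fill_(rank)
# symm_mem ops take the tensor, a reduction name, and the process group name
out = torch.ops.symm_mem.one_shot_all_reduce(t, "sum", dist.group.WORLD.group_name)
print(f"rank {rank}: {out[0].item()}")

dist.destroy_process_group()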
nvshmem integration
nvshmem plugin for triton enables custom multi-gpu kernels:
# nvshmem support for customized multi-gpu kernels
# enables advanced distributed memory operations
# integrates with triton for custom kernel development
c++ extension improvements
stable libtorch abi
pytorch 2.9 expands the stable abi for c++/cuda extensions:
// build extensions with one pytorch version, run with another
#include <torch/stable/aten.h>
// new stable apis added in 2.9:
// - device utilities (device guard, stream)
// - tensor apis (default constructor, is_cpu, scalar_type, get_device_index)
// - expanded stable aten ops (amax, narrow, new_empty, pad)
// example: using stable scalar type
torch::headeronly::ScalarType dtype = torch::headeronly::ScalarType::Float;
auto tensor = torch::stable::Tensor(/* ... */);
if (tensor.is_cpu()) {
  // process on cpu
}
benefits:
- reduce maintenance costs for third-party libraries
- build once, run with multiple pytorch versions
- improved integration for users
- abi stability through translation layer
new stable apis:
- device guard and stream utilities
- tensor constructors and type queries
- expanded aten operations
- headeronly scalar types with abi stability
training improvements
graph break control
new error-on-graph-break feature for torch.compile debugging:
import torch
# mark regions where graph breaks should error
@torch._dynamo.error_on_graph_break(True)
def training_loop(model, data):
    # compilation errors here instead of silently breaking
    output = model(data)
    return output
# or use as context manager
with torch._dynamo.error_on_graph_break(True):
    compiled_model = torch.compile(model)
    result = compiled_model(input_data)
note: requires boolean argument (True to enable, False to disable)
benefits:
- explicit control over where graph breaks are allowed
- easier debugging of compilation issues
- improved performance validation
ahead-of-time compilation
fullgraph mode now supports ahead-of-time compilation:
# compile model ahead of time in fullgraph mode
model = torch.compile(model, mode="reduce-overhead", fullgraph=True)
# reduced compilation overhead during training
new muon optimizer
from torch.optim import Muon
# new optimizer for training (only supports 2D parameters)
model = torch.nn.Linear(128, 64, bias=False) # no bias
optimizer = Muon(model.parameters(), lr=0.001)
# muon only works with weight matrices, not biases
# designed for transformer models and large language models
limitation: muon only supports 2D parameters (weight matrices), not 1D parameters like biases
inference improvements
flexattention on intel gpus
flexattention support now available for intel xpu:
from torch.nn.attention.flex_attention import flex_attention
# use flexattention on intel gpus
device = torch.device("xpu")
model = model.to(device)
# flexattention optimizations available
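since the block above only moves a model to xpu, here is a minimal sketch of an actual flex_attention call. it falls back to cpu when no intel gpu is present, and the causal score_mod is the standard example from the flexattention docs; eager mode is used for brevity, while wrapping the call with torch.compile is recommended for performance.
# hedged sketch: calling flex_attention, falling back to cpu if no intel gpu is present
import torch
from torch.nn.attention.flex_attention import flex_attention

device = "xpu" if torch.xpu.is_available() else "cpu"

# flex_attention expects (batch, heads, seq_len, head_dim) inputs
q = torch.randn(1, 4, 128, 64, device=device)
k = torch.randn(1, 4, 128, 64, device=device)
v = torch.randn(1, 4, 128, 64, device=device)

def causal(score, b, h, q_idx, kv_idx):
    # standard causal score_mod: mask out future positions
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

out = flex_attention(q, k, v, score_mod=causal)
print(out.shape)  # torch.Size([1, 4, 128, 64])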
cpu fp8 quantization
cpu support for fp8 quantized operations:
from torchao.quantization import quantize_, Float8WeightOnlyConfig
# fp8 quantization on cpu
# qlinear and qconv operations enabled
quantize_(model, Float8WeightOnlyConfig())  # cpu compatible
note: cpu fp8 support is new in 2.9 for qlinear and qconv operations
performance optimizations
- aggressive persistent reduction for faster operations
- improved a16w4/a16w8 gemm templates
- fused rope kernels for efficiency
- optimized mps cummin/cummax operations
- nonblocking gpu tensor indexing (avoids synchronization)
- disabled cudagraph gcs by default for faster capture
- enhanced rocm elementwise and reduction kernels
breaking changes
python version requirement
minimum python 3.10 (3.9 support dropped):
# python 3.9 no longer supported
# upgrade to python 3.10+ required
# python 3.14 available as preview
pip install torch==2.9.0  # run from a python 3.10+ environment (3.14 preview builds available)
macos support
macos 14+ required for mps backend:
# ventura (macos 13) support removed
# users on ventura should stay on pytorch 2.8
# or upgrade to macos 14+ (sonoma or sequoia)
import torch
if torch.backends.mps.is_available():
    device = torch.device("mps")
    # requires macos 14+
custom operators
breaking: outputs sharing storage with inputs produce undefined behavior under torch.compile:
# may return incorrect results silently under compilation
# review custom operators that share input/output storage
compiled_model = torch.compile(model)
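a hypothetical before/after illustrating the pattern to look for: op_with_aliasing and op_without_aliasing are made-up names, and cloning the output is one simple way to avoid input/output storage sharing.
import torch

def op_with_aliasing(x):
    # output is a view of the input (shared storage): undefined behavior under torch.compile in 2.9
    return x.view_as(x)

def op_without_aliasing(x):
    # returning a fresh tensor keeps inputs and outputs from sharing storage
    return x.clone()

compiled = torch.compile(op_without_aliasing)
print(compiled(torch.ones(3)))  # tensor([1., 1., 1.])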
dlpack upgrade
dlpack 1.0 introduces breaking changes:
# torch.utils.dlpack objects affected
# DLDeviceType and related types changed
from torch.utils.dlpack import to_dlpack, from_dlpack
# review code using dlpack conversions
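a quick round-trip using the long-standing torch.utils.dlpack helpers can confirm that code crossing the dlpack boundary still behaves as expected after the 1.0 upgrade:
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

x = torch.arange(4)
capsule = to_dlpack(x)    # export as a dlpack capsule
y = from_dlpack(capsule)  # re-import; shares memory with x
y[0] = 42
print(x[0].item())  # 42, confirming the zero-copy exchange still works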
error handling changes
torch.cat error types now more specific:
# previously: generic RuntimeError
# now: ValueError, IndexError, or TypeError
try:
    result = torch.cat([tensor1, tensor2])
except ValueError:  # instead of RuntimeError
    handle_dimension_mismatch()
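a concrete, version-tolerant sketch: catching both ValueError and RuntimeError keeps the handler working on 2.8 and 2.9, since which exception a given failure maps to depends on the case.
import torch

a = torch.randn(2, 3)
b = torch.randn(4, 5)
try:
    torch.cat([a, b], dim=0)  # sizes differ in a non-concatenated dimension
except (ValueError, RuntimeError) as exc:
    # works across 2.8 (RuntimeError) and 2.9 (more specific types)
    print(type(exc).__name__, exc)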
onnx export changes
default behavior switched to dynamo pipeline:
# old default (2.8): torchscript export
# new default (2.9): dynamo export
torch.onnx.export(model, args, "model.onnx")
# now uses dynamo=True by default
# default opset changed to 20 (from 18)
# removed apis:
# - dynamo_export (use torch.onnx.export with dynamo=True)
# - onnxrt compile backend
# - enable_fake_mode
# - caffe2 support
tf32 api deprecation
new tf32 api replaces old settings (deprecated after 2.9):
# old api (deprecated after 2.9)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
# new api (use this)
torch.backends.cudnn.conv.fp32_precision = 'tf32'
torch.backends.cuda.matmul.fp32_precision = 'tf32'
# or for ieee precision
torch.backends.cuda.matmul.fp32_precision = 'ieee'
see cuda tf32 documentation for details.
deprecations
pin_memory_device parameter
# deprecated in dataloader
from torch.utils.data import DataLoader
# avoid using pin_memory_device parameter
loader = DataLoader(dataset, pin_memory=True) # use this
# not: DataLoader(dataset, pin_memory_device="cuda")  # deprecated
export_for_training api
# deprecated: torch.export.export_for_training
# use: torch.export.export instead
from torch.export import export
# new approach
exported_program = export(model, (input_example,))
hardware-specific features
cuda 13.0 support
# install with cuda 13.0
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu130
cuda 13.0 improvements:
- 71% smaller cuda math api binaries
- 33.5% smaller wheel sizes (2.18gb vs 3.28gb)
- blackwell architecture support (sm_120)
- unified arm platform support
- compression mode enabled
rocm enhancements
# install with rocm 6.4
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/rocm6.4
rocm 2.9 features:
- wheel variant support on linux
- ocp micro-scaling format (mx-fp8, mx-fp4) for gfx950
- enhanced elementwise kernels
- improved reduction operations
- better hardware detection
intel xpu support
# install for intel gpus
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/xpu
xpu features:
- wheel variant support (linux and windows)
- flexattention support
- optimized operations for intel gpus
arm platform improvements
linux aarch64 cuda builds available across all cuda versions:
# arm64 with cuda support
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu128
# works on linux arm64 systems
arm optimizations:
- improved backend performance
- enhanced test coverage
- better platform compatibility
migration guide
updating from pytorch 2.8
# basic upgrade
pip install --upgrade torch==2.9.0
# check python version first
python --version # must be 3.10+
# if python 3.9, upgrade python first
# python 3.14 preview available
onnx export migration
# if using old torchscript export
# old (2.8)
torch.onnx.export(model, args, "model.onnx", dynamo=False)
# new (2.9 default)
torch.onnx.export(model, args, "model.onnx") # dynamo=True default
# specify opset if needed
torch.onnx.export(model, args, "model.onnx", opset_version=20) custom operator review
# review operators with shared input/output storage
# may produce undefined behavior under torch.compile
# check for operations like:
def custom_op(input):
    output = input.view_as(input)  # shares storage
    return output  # may cause issues with compile
macos compatibility
# check macos version before upgrading
import platform
macos_version = platform.mac_ver()[0]
major = int(macos_version.split('.')[0])
if major == 13:  # ventura
    print("warning: upgrade to macos 14+ for mps support")
    print("or stay on pytorch 2.8")
wheel variant adoption
pytorch 2.9 expands wheel variant support for automatic hardware detection:
current state (2.9)
- cuda: windows and linux (stable)
- rocm: linux only (experimental)
- xpu: linux and windows (experimental)
future vision
# goal: single command auto-detects hardware
uv pip install torch
# automatically installs:
# - cuda build for nvidia gpus
# - rocm build for amd gpus
# - xpu build for intel gpus
# - cpu build if no gpu detected
current usage
# experimental uv build with wheel variants
curl -LsSf https://astral.sh/uv/install.sh | \
INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh/v0.0.2 sh
uv pip install torch  # auto-detects hardware
performance benchmarks
pytorch 2.9 includes several performance improvements:
training optimizations
- fused rope kernels: reduced overhead for rotary position embeddings
- aggressive persistent reduction: faster reduction operations
- improved gemm templates: better a16w4/a16w8 performance
- disabled cudagraph gcs: faster cuda graph capture by default
inference optimizations
- nonblocking gpu indexing: eliminates synchronization overhead
- optimized mps operations: faster cummin/cummax on apple silicon
- enhanced rocm kernels: improved elementwise and reduction performance
- cpu fp8 support: fp8 quantization available on cpu
tested examples
the following examples have been verified working with pytorch 2.9.0 and torchao 0.14.1:
basic quantization (cpu)
# tested: pytorch 2.9.0, torchao 0.14.1
import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig
# create simple model
model = torch.nn.Sequential(
torch.nn.Linear(128, 64),
torch.nn.ReLU(),
torch.nn.Linear(64, 32)
)
# quantize for cpu inference (new config api)
quantize_(model, Int8WeightOnlyConfig())
# run inference
x = torch.randn(32, 128)
output = model(x)
print(f"output shape: {output.shape}") # torch.Size([32, 32]) muon optimizer
# tested: pytorch 2.9.0
import torch
from torch.optim import Muon
# muon only supports 2D parameters (no biases)
model = torch.nn.Linear(128, 64, bias=False)
optimizer = Muon(model.parameters(), lr=0.001)
# training step
x = torch.randn(32, 128)
y = model(x)
loss = y.sum()
loss.backward()
optimizer.step()
print("muon optimizer works") graph break control
# tested: pytorch 2.9.0
import torch
# decorator usage (requires boolean argument)
@torch._dynamo.error_on_graph_break(True)
def simple_function(x):
    return x * 2
result = simple_function(torch.tensor([1.0, 2.0, 3.0]))
print(f"result: {result}") # tensor([2., 4., 6.])
# context manager usage
with torch._dynamo.error_on_graph_break(True):
    x = torch.tensor([1.0, 2.0])
    y = x + 1
    print(f"context result: {y}")  # tensor([2., 3.])
torch.compile
# tested: pytorch 2.9.0
import torch
model = torch.nn.Linear(10, 5, bias=False)
compiled = torch.compile(model, mode='reduce-overhead')
x = torch.randn(32, 10)
output = compiled(x)
print(f"compiled output: {output.shape}") # torch.Size([32, 5]) platform detection
# tested: pytorch 2.9.0
import torch
print(f"pytorch version: {torch.__version__}") # 2.9.0+cu128
print(f"cuda available: {torch.cuda.is_available()}") # False (cpu-only)
print(f"cpu capability: {torch.backends.cpu.get_cpu_capability()}") # AVX2
print(f"cuda compiled: {torch.version.cuda}") # 12.8 testing environment: ubuntu 25.10, python 3.13.7, pytorch 2.9.0+cu128 (cpu mode), torchao 0.14.1
key takeaways
- expanded hardware support: rocm, xpu, and cuda 13 wheel variants available
- multi-gpu programming: symmetric memory simplifies distributed kernels
- c++ stability: stable abi reduces maintenance for extensions
- python 3.10 minimum: python 3.9 support dropped, 3.13+ confirmed working
- macos 14+ required: ventura support removed for mps
- onnx defaults changed: dynamo export now default with opset 20
- wheel variants experimental: automatic hardware detection improving
- performance gains: multiple optimizations across training and inference
- breaking changes: review custom operators and dlpack usage
- api updates: torchao config-based api, error_on_graph_break requires boolean, tf32 api changes
- muon optimizer: new optimizer but only supports 2D parameters (no biases)
references
pytorch 2.9 documentation
hardware-specific resources
related guides
- pytorch setup with uv - complete uv configuration
- pytorch 2.8 release - previous version notes
- cuda setup - gpu configuration guide
- uv package manager - comprehensive uv reference