pytorch 2.9 release
pytorch 2.9 was released october 15, 2025. key improvements include expanded wheel variant support (rocm, xpu, cuda 13), symmetric memory for multi-gpu programming, stable libtorch abi for c++/cuda extensions, flexible graph break control, python 3.14 support, and linux aarch64 cuda builds.
installation
with uv (recommended)
see pytorch setup with uv for detailed configuration including automatic backend selection.
pytorch 2.9 introduces experimental wheel variant support that automatically detects your hardware. this feature is being tested with a special build of uv:
# experimental: auto-detect backend with wheel variants
# install experimental uv build
curl -LsSf https://astral.sh/uv/install.sh | INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh/v0.0.2 sh
# then simply install - it detects cuda/rocm/xpu automatically
uv venv
uv pip install torch torchvision
# traditional method (stable)
uv pip install torch --torch-backend=auto
uv pip install torchvision torchao
note: wheel variant support is experimental in 2.9. the goal is eventual automatic hardware detection without manual backend selection.
with pip (traditional)
# cpu
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cpu
# cuda 12.8 (stable)
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu128
# cuda 13.0 (new in 2.9)
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130
# rocm 6.4 (new wheel support)
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/rocm6.4
# intel xpu (new wheel support)
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/xpu
platform support
- cuda: 12.8 (stable), 13.0 (new)
  - cuda 13.0 brings 33.5% smaller wheels (2.18gb vs 3.28gb)
  - blackwell architecture support (sm_120)
- rocm: 6.3, 6.4 (new wheel variant support on linux)
  - ocp micro-scaling format support (mx-fp8, mx-fp4) for gfx950
- xpu: intel gpus on linux and windows (new wheel support)
- mps: macos 14+ only (ventura support removed)
- python: 3.10 minimum (3.9 dropped), 3.14 preview available
multi-gpu improvements
symmetric memory
pytorch 2.9 introduces symmetric memory for simplified multi-gpu kernel programming:
import torch
# symmetric memory operations for multi-gpu kernels
# simplify programming across nvlink and rdma networks
# ops are exposed under the torch.ops.symm_mem namespace
# example: allocate a tensor for a multi-gpu kernel to operate on
device = torch.device("cuda:0")
tensor = torch.randn(1000, 1000, device=device)
# symmetric memory enables efficient multi-gpu operations
key benefits:
- easier programming of multi-gpu kernels
- optimized communication over nvlinks
- support for rdma networks
- critical for distributed training and large-scale inference
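the release notes describe the feature at a high level; the sketch below shows roughly how an all-reduce over symmetric memory might look, based on the prototype torch.distributed._symmetric_memory module and the torch.ops.symm_mem ops. treat the exact signatures of empty, rendezvous, and one_shot_all_reduce as assumptions that may differ from the shipped api; the script is meant to be launched with torchrun, one process per gpu.
# hedged sketch: all-reduce over symmetric memory (prototype apis, signatures may differ)
import os
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

rank = int(os.environ["RANK"])
device = torch.device(f"cuda:{rank % torch.cuda.device_count()}")
torch.cuda.set_device(device)
dist.init_process_group("nccl")

# allocate a symmetric-memory tensor and exchange handles across ranks
t = symm_mem.empty(4096, dtype=torch.bfloat16, device=device)
symm_mem.rendezvous(t, group=dist.group.WORLD)

t.fill_(rank)
# symm_mem ops take the tensor, a reduction name, and the process group name
out = torch.ops.symm_mem.one_shot_all_reduce(t, "sum", dist.group.WORLD.group_name)
print(f"rank {rank}: {out[0].item()}")

dist.destroy_process_group()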
nvshmem integration
nvshmem plugin for triton enables custom multi-gpu kernels:
# nvshmem support for customized multi-gpu kernels
# enables advanced distributed memory operations
# integrates with triton for custom kernel development
c++ extension improvements
stable libtorch abi
pytorch 2.9 expands the stable abi for c++/cuda extensions:
// build extensions with one pytorch version, run with another
#include <torch/stable/aten.h>
// new stable apis added in 2.9:
// - device utilities (device guard, stream)
// - tensor apis (default constructor, is_cpu, scalar_type, get_device_index)
// - expanded stable aten ops (amax, narrow, new_empty, pad)
// example: using stable scalar type
torch::headeronly::ScalarType dtype = torch::headeronly::ScalarType::Float;
auto tensor = torch::stable::Tensor(/* ... */);
if (tensor.is_cpu()) {
  // process on cpu
}
benefits:
- reduce maintenance costs for third-party libraries
- build once, run with multiple pytorch versions
- improved integration for users
- abi stability through translation layer
new stable apis:
- device guard and stream utilities
- tensor constructors and type queries
- expanded aten operations
- headeronly scalar types with abi stability
training improvements
graph break control
new error-on-graph-break feature for torch.compile debugging:
import torch
# mark regions where graph breaks should error
@torch._dynamo.error_on_graph_break(True)
def training_loop(model, data):
    # compilation errors here instead of silently breaking
    output = model(data)
    return output
# or use as context manager
with torch._dynamo.error_on_graph_break(True):
    compiled_model = torch.compile(model)
    result = compiled_model(input_data)
note: requires boolean argument (True to enable, False to disable)
benefits:
- explicit control over where graph breaks are allowed
- easier debugging of compilation issues
- improved performance validation
ahead-of-time compilation
fullgraph mode now supports ahead-of-time compilation:
# compile model ahead of time in fullgraph mode
model = torch.compile(model, mode="reduce-overhead", fullgraph=True)
# reduced compilation overhead during training
new muon optimizer
from torch.optim import Muon
# new optimizer for training (only supports 2D parameters)
model = torch.nn.Linear(128, 64, bias=False) # no bias
optimizer = Muon(model.parameters(), lr=0.001)
# muon only works with weight matrices, not biases
# designed for transformer models and large language models
limitation: muon only supports 2D parameters (weight matrices), not 1D parameters like biases
inference improvements
flexattention on intel gpus
flexattention support now available for intel xpu:
from torch.nn.attention.flex_attention import flex_attention
# use flexattention on intel gpus
device = torch.device("xpu")
model = model.to(device)
# flexattention optimizations available
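since the block above only moves a model to xpu, here is a minimal sketch of an actual flex_attention call. it falls back to cpu when no intel gpu is present, and the causal score_mod is the standard example from the flexattention docs; eager mode is used for brevity, while wrapping the call with torch.compile is recommended for performance.
# hedged sketch: calling flex_attention, falling back to cpu if no intel gpu is present
import torch
from torch.nn.attention.flex_attention import flex_attention

device = "xpu" if torch.xpu.is_available() else "cpu"

# flex_attention expects (batch, heads, seq_len, head_dim) inputs
q = torch.randn(1, 4, 128, 64, device=device)
k = torch.randn(1, 4, 128, 64, device=device)
v = torch.randn(1, 4, 128, 64, device=device)

def causal(score, b, h, q_idx, kv_idx):
    # standard causal score_mod: mask out future positions
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

out = flex_attention(q, k, v, score_mod=causal)
print(out.shape)  # torch.Size([1, 4, 128, 64])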
cpu fp8 quantization
cpu support for fp8 quantized operations:
from torchao.quantization import quantize_, Float8WeightOnlyConfig
# fp8 quantization on cpu
# qlinear and qconv operations enabled
quantize_(model, Float8WeightOnlyConfig())  # cpu compatible
note: cpu fp8 support is new in 2.9 for qlinear and qconv operations
performance optimizations
- aggressive persistent reduction for faster operations
- improved a16w4/a16w8 gemm templates
- fused rope kernels for efficiency
- optimized mps cummin/cummax operations
- nonblocking gpu tensor indexing (avoids synchronization)
- disabled cudagraph gcs by default for faster capture
- enhanced rocm elementwise and reduction kernels
breaking changes
python version requirement
minimum python 3.10 (3.9 support dropped):
# python 3.9 no longer supported
# upgrade to python 3.10+ required
# python 3.14 available as preview
pip install torch==2.9.0  # run from a python 3.10+ environment (3.14 preview builds available)
macos support
macos 14+ required for mps backend:
# ventura (macos 13) support removed
# users on ventura should stay on pytorch 2.8
# or upgrade to macos 14+ (sonoma or sequoia)
import torch
if torch.backends.mps.is_available():
    device = torch.device("mps")
    # requires macos 14+
custom operators
breaking: outputs sharing storage with inputs produce undefined behavior under torch.compile:
# may return incorrect results silently under compilation
# review custom operators that share input/output storage
compiled_model = torch.compile(model)
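a hypothetical before/after illustrating the pattern to look for: op_with_aliasing and op_without_aliasing are made-up names, and cloning the output is one simple way to avoid input/output storage sharing.
import torch

def op_with_aliasing(x):
    # output is a view of the input (shared storage): undefined behavior under torch.compile in 2.9
    return x.view_as(x)

def op_without_aliasing(x):
    # returning a fresh tensor keeps inputs and outputs from sharing storage
    return x.clone()

compiled = torch.compile(op_without_aliasing)
print(compiled(torch.ones(3)))  # tensor([1., 1., 1.])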
dlpack upgrade
dlpack 1.0 introduces breaking changes:
# torch.utils.dlpack objects affected
# DLDeviceType and related types changed
from torch.utils.dlpack import to_dlpack, from_dlpack
# review code using dlpack conversions
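a quick round-trip using the long-standing torch.utils.dlpack helpers can confirm that code crossing the dlpack boundary still behaves as expected after the 1.0 upgrade:
import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

x = torch.arange(4)
capsule = to_dlpack(x)    # export as a dlpack capsule
y = from_dlpack(capsule)  # re-import; shares memory with x
y[0] = 42
print(x[0].item())  # 42, confirming the zero-copy exchange still works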
error handling changes
torch.cat error types now more specific:
# previously: generic RuntimeError
# now: ValueError, IndexError, or TypeError
try:
    result = torch.cat([tensor1, tensor2])
except ValueError:  # instead of RuntimeError
    handle_dimension_mismatch()
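a concrete, version-tolerant sketch: catching both ValueError and RuntimeError keeps the handler working on 2.8 and 2.9, since which exception a given failure maps to depends on the case.
import torch

a = torch.randn(2, 3)
b = torch.randn(4, 5)
try:
    torch.cat([a, b], dim=0)  # sizes differ in a non-concatenated dimension
except (ValueError, RuntimeError) as exc:
    # works across 2.8 (RuntimeError) and 2.9 (more specific types)
    print(type(exc).__name__, exc)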
onnx export changes
default behavior switched to dynamo pipeline:
# old default (2.8): torchscript export
# new default (2.9): dynamo export
torch.onnx.export(model, args, "model.onnx")
# now uses dynamo=True by default
# default opset changed to 20 (from 18)
# removed apis:
# - dynamo_export (use torch.onnx.export with dynamo=True)
# - onnxrt compile backend
# - enable_fake_mode
# - caffe2 support
tf32 api deprecation
new tf32 api replaces old settings (deprecated after 2.9):
# old api (deprecated after 2.9)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
# new api (use this)
torch.backends.cudnn.conv.fp32_precision = 'tf32'
torch.backends.cuda.matmul.fp32_precision = 'tf32'
# or for ieee precision
torch.backends.cuda.matmul.fp32_precision = 'ieee'
see cuda tf32 documentation for details.
deprecations
pin_memory_device parameter
# deprecated in dataloader
from torch.utils.data import DataLoader
# avoid using pin_memory_device parameter
loader = DataLoader(dataset, pin_memory=True) # use this
# not: DataLoader(dataset, pin_memory_device="cuda")  # deprecated
export_for_training api
# deprecated: torch.export.export_for_training
# use: torch.export.export instead
from torch.export import export
# new approach
exported_program = export(model, (input_example,))
hardware-specific features
cuda 13.0 support
# install with cuda 13.0
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu130
cuda 13.0 improvements:
- 71% smaller cuda math api binaries
- 33.5% smaller wheel sizes (2.18gb vs 3.28gb)
- blackwell architecture support (sm_120)
- unified arm platform support
- compression mode enabled
rocm enhancements
# install with rocm 6.4
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/rocm6.4
rocm 2.9 features:
- wheel variant support on linux
- ocp micro-scaling format (mx-fp8, mx-fp4) for gfx950
- enhanced elementwise kernels
- improved reduction operations
- better hardware detection
intel xpu support
# install for intel gpus
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/xpu
xpu features:
- wheel variant support (linux and windows)
- flexattention support
- optimized operations for intel gpus
arm platform improvements
linux aarch64 cuda builds available across all cuda versions:
# arm64 with cuda support
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu128
# works on linux arm64 systems
arm optimizations:
- improved backend performance
- enhanced test coverage
- better platform compatibility
migration guide
updating from pytorch 2.8
# basic upgrade
pip install --upgrade torch==2.9.0
# check python version first
python --version # must be 3.10+
# if python 3.9, upgrade python first
# python 3.14 preview available
onnx export migration
# if using old torchscript export
# old (2.8)
torch.onnx.export(model, args, "model.onnx", dynamo=False)
# new (2.9 default)
torch.onnx.export(model, args, "model.onnx") # dynamo=True default
# specify opset if needed
torch.onnx.export(model, args, "model.onnx", opset_version=20) custom operator review
# review operators with shared input/output storage
# may produce undefined behavior under torch.compile
# check for operations like:
def custom_op(input):
    output = input.view_as(input)  # shares storage
    return output  # may cause issues with compile
macos compatibility
# check macos version before upgrading
import platform
macos_version = platform.mac_ver()[0]
major = int(macos_version.split('.')[0])
if major == 13:  # ventura
    print("warning: upgrade to macos 14+ for mps support")
    print("or stay on pytorch 2.8")
wheel variant adoption
pytorch 2.9 expands wheel variant support for automatic hardware detection:
current state (2.9)
- cuda: windows and linux (stable)
- rocm: linux only (experimental)
- xpu: linux and windows (experimental)
future vision
# goal: single command auto-detects hardware
uv pip install torch
# automatically installs:
# - cuda build for nvidia gpus
# - rocm build for amd gpus
# - xpu build for intel gpus
# - cpu build if no gpu detected
current usage
# experimental uv build with wheel variants
curl -LsSf https://astral.sh/uv/install.sh | \
INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh/v0.0.2 sh
uv pip install torch  # auto-detects hardware
performance benchmarks
pytorch 2.9 includes several performance improvements:
training optimizations
- fused rope kernels: reduced overhead for rotary position embeddings
- aggressive persistent reduction: faster reduction operations
- improved gemm templates: better a16w4/a16w8 performance
- disabled cudagraph gcs: faster cuda graph capture by default
inference optimizations
- nonblocking gpu indexing: eliminates synchronization overhead
- optimized mps operations: faster cummin/cummax on apple silicon
- enhanced rocm kernels: improved elementwise and reduction performance
- cpu fp8 support: fp8 quantization available on cpu
tested examples
the following examples have been verified working with pytorch 2.9.0 and torchao 0.14.1:
basic quantization (cpu)
# tested: pytorch 2.9.0, torchao 0.14.1
import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig
# create simple model
model = torch.nn.Sequential(
torch.nn.Linear(128, 64),
torch.nn.ReLU(),
torch.nn.Linear(64, 32)
)
# quantize for cpu inference (new config api)
quantize_(model, Int8WeightOnlyConfig())
# run inference
x = torch.randn(32, 128)
output = model(x)
print(f"output shape: {output.shape}") # torch.Size([32, 32]) muon optimizer
# tested: pytorch 2.9.0
import torch
from torch.optim import Muon
# muon only supports 2D parameters (no biases)
model = torch.nn.Linear(128, 64, bias=False)
optimizer = Muon(model.parameters(), lr=0.001)
# training step
x = torch.randn(32, 128)
y = model(x)
loss = y.sum()
loss.backward()
optimizer.step()
print("muon optimizer works") graph break control
# tested: pytorch 2.9.0
import torch
# decorator usage (requires boolean argument)
@torch._dynamo.error_on_graph_break(True)
def simple_function(x):
    return x * 2
result = simple_function(torch.tensor([1.0, 2.0, 3.0]))
print(f"result: {result}") # tensor([2., 4., 6.])
# context manager usage
with torch._dynamo.error_on_graph_break(True):
    x = torch.tensor([1.0, 2.0])
    y = x + 1
    print(f"context result: {y}")  # tensor([2., 3.])
torch.compile
# tested: pytorch 2.9.0
import torch
model = torch.nn.Linear(10, 5, bias=False)
compiled = torch.compile(model, mode='reduce-overhead')
x = torch.randn(32, 10)
output = compiled(x)
print(f"compiled output: {output.shape}") # torch.Size([32, 5]) platform detection
# tested: pytorch 2.9.0
import torch
print(f"pytorch version: {torch.__version__}") # 2.9.0+cu128
print(f"cuda available: {torch.cuda.is_available()}") # False (cpu-only)
print(f"cpu capability: {torch.backends.cpu.get_cpu_capability()}") # AVX2
print(f"cuda compiled: {torch.version.cuda}") # 12.8 testing environment: ubuntu 25.10, python 3.13.7, pytorch 2.9.0+cu128 (cpu mode), torchao 0.14.1
key takeaways
- expanded hardware support: rocm, xpu, and cuda 13 wheel variants available
- multi-gpu programming: symmetric memory simplifies distributed kernels
- c++ stability: stable abi reduces maintenance for extensions
- python 3.10 minimum: python 3.9 support dropped, 3.13+ confirmed working
- macos 14+ required: ventura support removed for mps
- onnx defaults changed: dynamo export now default with opset 20
- wheel variants experimental: automatic hardware detection improving
- performance gains: multiple optimizations across training and inference
- breaking changes: review custom operators and dlpack usage
- api updates: torchao config-based api, error_on_graph_break requires boolean, tf32 api changes
- muon optimizer: new optimizer but only supports 2D parameters (no biases)
references
pytorch 2.9 documentation
hardware-specific resources
related guides
- pytorch setup with uv - complete uv configuration
- pytorch 2.8 release - previous version notes
- cuda setup - gpu configuration guide
- uv package manager - comprehensive uv reference