pytorch 2.9 release

published: October 15, 2025 updated: October 24, 2025

pytorch 2.9 was released october 15, 2025. key improvements include expanded wheel variant support (rocm, xpu, cuda 13), symmetric memory for multi-gpu programming, stable libtorch abi for c++/cuda extensions, flexible graph break control, python 3.14 support, and linux aarch64 cuda builds.

installation

see pytorch setup with uv for detailed configuration including automatic backend selection.

pytorch 2.9 introduces experimental wheel variant support that automatically detects your hardware. this feature is being tested with a special build of uv:

# experimental: auto-detect backend with wheel variants
# install experimental uv build
curl -LsSf https://astral.sh/uv/install.sh | INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh/v0.0.2 sh

# then simply install - it detects cuda/rocm/xpu automatically
uv venv
uv pip install torch torchvision

# traditional method (stable)
uv pip install torch --torch-backend=auto
uv pip install torchvision torchao

note: wheel variant support is experimental in 2.9. the goal is eventual automatic hardware detection without manual backend selection.

with pip (traditional)

# cpu
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cpu

# cuda 12.8 (stable)
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu128

# cuda 13.0 (new in 2.9)
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/cu130

# rocm 6.4 (new wheel support)
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/rocm6.4

# intel xpu (new wheel support)
pip install torch==2.9.0 torchvision --index-url https://download.pytorch.org/whl/xpu

platform support

  • cuda: 12.8 (stable), 13.0 (new)
    • cuda 13.0 brings 33.5% smaller wheels (2.18gb vs 3.28gb)
    • blackwell architecture support (sm_120)
  • rocm: 6.3, 6.4 (new wheel variant support on linux)
    • ocp micro-scaling format support (mx-fp8, mx-fp4) for gfx950
  • xpu: intel gpus on linux and windows (new wheel support)
  • mps: macos 14+ only (ventura support removed)
  • python: 3.10 minimum (3.9 dropped), 3.14 preview available
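
to confirm which of these backends a given install was built against, and which devices are actually visible at runtime, a few standard queries suffice (cuda and rocm builds share the torch.cuda namespace; the xpu query is guarded since older builds may not expose it):

import torch

print(f"torch {torch.__version__}")
print(f"built against cuda: {torch.version.cuda}")   # None on cpu/rocm builds
print(f"built against rocm: {torch.version.hip}")    # None on cpu/cuda builds
print(f"cuda/rocm device available: {torch.cuda.is_available()}")
print(f"mps available: {torch.backends.mps.is_available()}")

if hasattr(torch, "xpu"):
    print(f"xpu available: {torch.xpu.is_available()}")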

multi-gpu improvements

symmetric memory

pytorch 2.9 introduces symmetric memory for simplified multi-gpu kernel programming:

import torch

# symmetric memory gives each rank a buffer that peer gpus can address
# directly, which simplifies custom kernels over nvlink and rdma networks
# the accompanying ops live under the torch.ops.symm_mem namespace

# a plain allocation like this is visible only to the local process;
# see the sketch after the benefits list below for a symmetric allocation
device = torch.device("cuda:0")
tensor = torch.randn(1000, 1000, device=device)

key benefits:

  • easier programming of multi-gpu kernels
  • optimized communication over nvlinks
  • support for rdma networks
  • critical for distributed training and large-scale inference
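
the allocation helpers live in the private torch.distributed._symmetric_memory module. the sketch below is a minimal, hedged example meant to be launched with torchrun; treat the helper names (empty, rendezvous) and the one_shot_all_reduce op as assumptions based on the pytorch symmetric memory materials rather than a stable public api.

# launch with: torchrun --nproc-per-node=<num_gpus> symm_mem_demo.py
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
group_name = dist.group.WORLD.group_name

# allocate a buffer every rank can address symmetrically, then exchange handles
t = symm_mem.empty(4096, dtype=torch.bfloat16, device=f"cuda:{rank}")
symm_mem.rendezvous(t, group_name)

t.fill_(rank)
# ops under torch.ops.symm_mem operate directly on the shared buffer
out = torch.ops.symm_mem.one_shot_all_reduce(t, "sum", group_name)
print(f"rank {rank}: sum across ranks = {out[0].item()}")

dist.destroy_process_group()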

nvshmem integration

nvshmem plugin for triton enables custom multi-gpu kernels:

# nvshmem support for customized multi-gpu kernels
# enables advanced distributed memory operations
# integrates with triton for custom kernel development

c++ extension improvements

stable libtorch abi

pytorch 2.9 expands the stable abi for c++/cuda extensions:

// build extensions with one pytorch version, run with another
#include <torch/stable/aten.h>

// new stable apis added in 2.9:
// - device utilities (device guard, stream)
// - tensor apis (default constructor, is_cpu, scalar_type, get_device_index)
// - expanded stable aten ops (amax, narrow, new_empty, pad)

// example: using stable scalar type
torch::headeronly::ScalarType dtype = torch::headeronly::kFloat32;
auto tensor = torch::stable::Tensor(/* ... */);
if (tensor.is_cpu()) {
    // process on cpu
}

benefits:

  • reduce maintenance costs for third-party libraries
  • build once, run with multiple pytorch versions
  • improved integration for users
  • abi stability through translation layer

new stable apis:

  • device guard and stream utilities
  • tensor constructors and type queries
  • expanded aten operations
  • headeronly scalar types with abi stability

training improvements

graph break control

new error-on-graph-break feature for torch.compile debugging:

import torch

# mark regions where graph breaks should error under torch.compile
@torch._dynamo.error_on_graph_break(True)
def training_loop(model, data):
    # a graph break here raises instead of silently splitting the graph
    output = model(data)
    return output

# or use as context manager
with torch._dynamo.error_on_graph_break(True):
    compiled_model = torch.compile(model)
    result = compiled_model(input_data)

note: requires boolean argument (True to enable, False to disable)

benefits:

  • explicit control over where graph breaks are allowed
  • easier debugging of compilation issues
  • improved performance validation
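
to see the feature in the context it is designed for, the hedged sketch below wraps a compiled call that contains a deliberate graph break (.item() and print both break the graph); inside the error_on_graph_break(True) region the break is expected to surface as an error rather than a silent split, though the exact exception type is not guaranteed here.

import torch

def forward_with_break(x):
    y = x * 2
    print(y.sum().item())  # .item() forces a graph break
    return y + 1

compiled = torch.compile(forward_with_break)

with torch._dynamo.error_on_graph_break(True):
    try:
        compiled(torch.randn(4))
    except Exception as exc:  # the graph break is reported instead of hidden
        print(f"graph break surfaced as: {type(exc).__name__}")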

ahead-of-time compilation

fullgraph mode now supports ahead-of-time compilation:

# compile the whole model as a single graph (no graph breaks allowed)
model = torch.compile(model, mode="reduce-overhead", fullgraph=True)
# fullgraph capture is the basis for compiling ahead of time rather than on first call

new muon optimizer

from torch.optim import Muon

# new optimizer for training (only supports 2D parameters)
model = torch.nn.Linear(128, 64, bias=False)  # no bias
optimizer = Muon(model.parameters(), lr=0.001)

# muon only works with weight matrices, not biases
# designed for transformer models and large language models

limitation: muon only supports 2D parameters (weight matrices), not 1D parameters like biases
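
because of that limitation, a common pattern is to split parameters by dimensionality and keep a second optimizer for biases and other 1D tensors; a small sketch using the same Muon import as above alongside stock AdamW:

import torch
from torch.optim import Muon

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),   # 2D weight + 1D bias
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
)

# route 2D weight matrices to muon, everything else to adamw
muon_params = [p for p in model.parameters() if p.ndim == 2]
other_params = [p for p in model.parameters() if p.ndim != 2]
opt_muon = Muon(muon_params, lr=1e-3)
opt_adamw = torch.optim.AdamW(other_params, lr=3e-4)

loss = model(torch.randn(16, 128)).sum()
loss.backward()
opt_muon.step()
opt_adamw.step()
opt_muon.zero_grad()
opt_adamw.zero_grad()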

inference improvements

flexattention on intel gpus

flexattention support now available for intel xpu:

import torch
from torch.nn.attention.flex_attention import flex_attention

# flexattention now runs on intel gpus as well
device = torch.device("xpu")
q, k, v = (torch.randn(1, 8, 128, 64, device=device) for _ in range(3))
out = flex_attention(q, k, v)  # score_mod / block_mask work as on cuda

cpu fp8 quantization

cpu support for fp8 quantized operations:

from torchao.quantization import quantize_, Float8WeightOnlyConfig

# fp8 quantization on cpu
# qlinear and qconv operations enabled
quantize_(model, Float8WeightOnlyConfig())  # cpu compatible

note: cpu fp8 support is new in 2.9 for qlinear and qconv operations
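
an end-to-end version of the cpu path, mirroring the tested int8 example later in this post, looks like the untested sketch below (it assumes Float8WeightOnlyConfig behaves like the other torchao weight-only configs):

import torch
from torchao.quantization import quantize_, Float8WeightOnlyConfig

model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
)

# weight-only fp8 quantization; with 2.9 the quantized linear kernels also run on cpu
quantize_(model, Float8WeightOnlyConfig())

with torch.no_grad():
    out = model(torch.randn(8, 128))
print(out.shape)  # torch.Size([8, 32])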

performance optimizations

  • aggressive persistent reduction for faster operations
  • improved a16w4/a16w8 gemm templates
  • fused rope kernels for efficiency
  • optimized mps cummin/cummax operations
  • nonblocking gpu tensor indexing (avoids synchronization)
  • garbage collection disabled during cudagraph capture by default (faster capture)
  • enhanced rocm elementwise and reduction kernels

breaking changes

python version requirement

minimum python 3.10 (3.9 support dropped):

# python 3.9 no longer supported
# upgrade to python 3.10+ required

# python 3.14 wheels are available as a preview - install from a 3.14 interpreter
python3.14 -m pip install torch==2.9.0

macos support

macos 14+ required for mps backend:

# ventura (macos 13) support removed
# users on ventura should stay on pytorch 2.8
# or upgrade to macos 14+ (sonoma or newer)

import torch
if torch.backends.mps.is_available():
    device = torch.device("mps")
    # requires macos 14+

custom operators

breaking: outputs sharing storage with inputs produce undefined behavior under torch.compile:

# may return incorrect results silently under compilation
# review custom operators that share input/output storage
compiled_model = torch.compile(model)

dlpack upgrade

dlpack 1.0 introduces breaking changes:

# torch.utils.dlpack objects affected
# DLDeviceType and related types changed
from torch.utils.dlpack import to_dlpack, from_dlpack
# review code using dlpack conversions

error handling changes

torch.cat error types now more specific:

# previously: generic RuntimeError
# now: ValueError, IndexError, or TypeError

try:
    result = torch.cat([tensor1, tensor2])
except ValueError:  # previously a generic RuntimeError
    handle_dimension_mismatch()  # placeholder for your recovery logic

onnx export changes

default behavior switched to dynamo pipeline:

# old default (2.8): torchscript export
# new default (2.9): dynamo export
torch.onnx.export(model, args, "model.onnx")
# now uses dynamo=True by default

# default opset changed to 20 (from 18)
# removed apis:
# - dynamo_export (use torch.onnx.export with dynamo=True)
# - onnxrt compile backend
# - enable_fake_mode
# - caffe2 support

tf32 api deprecation

new tf32 api replaces old settings (deprecated after 2.9):

# old api (deprecated after 2.9)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# new api (use this)
torch.backends.cudnn.conv.fp32_precision = 'tf32'
torch.backends.cuda.matmul.fp32_precision = 'tf32'

# or for ieee precision
torch.backends.cuda.matmul.fp32_precision = 'ieee'

see cuda tf32 documentation for details.

deprecations

pin_memory_device parameter

# deprecated in dataloader
from torch.utils.data import DataLoader

# avoid using pin_memory_device parameter
loader = DataLoader(dataset, pin_memory=True)  # use this
# not: DataLoader(dataset, pin_memory_device="cuda")  # deprecated

export_for_training api

# deprecated: torch.export.export_for_training
# use: torch.export.export instead

from torch.export import export

# new approach
exported_program = export(model, (input_example,))
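
for reference, a complete export with the replacement api (the model and shapes here are illustrative):

import torch
from torch.export import export

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))

example_input = (torch.randn(2, 16),)
exported_program = export(TinyModel(), example_input)

# the exported program can be run through .module()
print(exported_program.module()(*example_input).shape)  # torch.Size([2, 4])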

hardware-specific features

cuda 13.0 support

# install with cuda 13.0
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu130

cuda 13.0 improvements:

  • 71% smaller cuda math api binaries
  • 33.5% smaller wheel sizes (2.18gb vs 3.28gb)
  • blackwell architecture support (sm_120)
  • unified arm platform support
  • compression mode enabled
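
whether an installed wheel actually carries blackwell kernels can be checked against the compiled architecture list and the local device capability:

import torch

# architectures the installed wheel was compiled for; look for 'sm_120'
print(torch.cuda.get_arch_list())

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"local device compute capability: sm_{major}{minor}")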

rocm enhancements

# install with rocm 6.4
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/rocm6.4

rocm 2.9 features:

  • wheel variant support on linux
  • ocp micro-scaling format (mx-fp8, mx-fp4) for gfx950
  • enhanced elementwise kernels
  • improved reduction operations
  • better hardware detection

intel xpu support

# install for intel gpus
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/xpu

xpu features:

  • wheel variant support (linux and windows)
  • flexattention support
  • optimized operations for intel gpus

arm platform improvements

linux aarch64 cuda builds available across all cuda versions:

# arm64 with cuda support
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu128
# works on linux arm64 systems

arm optimizations:

  • improved backend performance
  • enhanced test coverage
  • better platform compatibility

migration guide

updating from pytorch 2.8

# basic upgrade
pip install --upgrade torch==2.9.0

# check python version first
python --version  # must be 3.10+

# if python 3.9, upgrade python first
# python 3.14 preview available

onnx export migration

# if using old torchscript export
# old (2.8)
torch.onnx.export(model, args, "model.onnx", dynamo=False)

# new (2.9 default)
torch.onnx.export(model, args, "model.onnx")  # dynamo=True default

# specify opset if needed
torch.onnx.export(model, args, "model.onnx", opset_version=20)

custom operator review

# review operators with shared input/output storage
# may produce undefined behavior under torch.compile

# check for operations like:
def custom_op(input):
    output = input.view_as(input)  # shares storage
    return output  # may cause issues with compile
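
one way to sidestep the aliasing problem is to return a real copy instead of a view; a minimal illustration in the same spirit as the snippet above (clone trades a little extra memory for well-defined behavior, and the same principle applies to operators registered through torch.library):

import torch

def custom_op_safe(input: torch.Tensor) -> torch.Tensor:
    # materialize a fresh tensor rather than returning a view of the input
    return input.view_as(input).clone()

compiled = torch.compile(custom_op_safe)
print(compiled(torch.randn(4)))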

macos compatibility

# check macos version before upgrading
import platform

macos_version = platform.mac_ver()[0]  # e.g. "13.6.1" or "15.1"
major = int(macos_version.split('.')[0])

if major <= 13:  # ventura or older
    print("warning: upgrade to macos 14+ for mps support in pytorch 2.9")
    print("or stay on pytorch 2.8, which still supports ventura")

wheel variant adoption

pytorch 2.9 expands wheel variant support for automatic hardware detection:

current state (2.9)

  • cuda: windows and linux (stable)
  • rocm: linux only (experimental)
  • xpu: linux and windows (experimental)

future vision

# goal: single command auto-detects hardware
uv pip install torch
# automatically installs:
# - cuda build for nvidia gpus
# - rocm build for amd gpus
# - xpu build for intel gpus
# - cpu build if no gpu detected

current usage

# experimental uv build with wheel variants
curl -LsSf https://astral.sh/uv/install.sh | \
  INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh/v0.0.2 sh

uv pip install torch  # auto-detects hardware

performance benchmarks

pytorch 2.9 includes several performance improvements:

training optimizations

  • fused rope kernels: reduced overhead for rotary position embeddings
  • aggressive persistent reduction: faster reduction operations
  • improved gemm templates: better a16w4/a16w8 performance
  • cudagraph capture without gc: garbage collection disabled by default during capture for faster cuda graph capture

inference optimizations

  • nonblocking gpu indexing: eliminates synchronization overhead
  • optimized mps operations: faster cummin/cummax on apple silicon
  • enhanced rocm kernels: improved elementwise and reduction performance
  • cpu fp8 support: fp8 quantization available on cpu

tested examples

the following examples have been verified working with pytorch 2.9.0 and torchao 0.14.1:

basic quantization (cpu)

# tested: pytorch 2.9.0, torchao 0.14.1
import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# create simple model
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32)
)

# quantize for cpu inference (new config api)
quantize_(model, Int8WeightOnlyConfig())

# run inference
x = torch.randn(32, 128)
output = model(x)
print(f"output shape: {output.shape}")  # torch.Size([32, 32])

muon optimizer

# tested: pytorch 2.9.0
import torch
from torch.optim import Muon

# muon only supports 2D parameters (no biases)
model = torch.nn.Linear(128, 64, bias=False)
optimizer = Muon(model.parameters(), lr=0.001)

# training step
x = torch.randn(32, 128)
y = model(x)
loss = y.sum()
loss.backward()
optimizer.step()
print("muon optimizer works")

graph break control

# tested: pytorch 2.9.0
import torch

# decorator usage (requires boolean argument)
@torch._dynamo.error_on_graph_break(True)
def simple_function(x):
    return x * 2

result = simple_function(torch.tensor([1.0, 2.0, 3.0]))
print(f"result: {result}")  # tensor([2., 4., 6.])

# context manager usage
with torch._dynamo.error_on_graph_break(True):
    x = torch.tensor([1.0, 2.0])
    y = x + 1
    print(f"context result: {y}")  # tensor([2., 3.])

torch.compile

# tested: pytorch 2.9.0
import torch

model = torch.nn.Linear(10, 5, bias=False)
compiled = torch.compile(model, mode='reduce-overhead')

x = torch.randn(32, 10)
output = compiled(x)
print(f"compiled output: {output.shape}")  # torch.Size([32, 5])

platform detection

# tested: pytorch 2.9.0
import torch

print(f"pytorch version: {torch.__version__}")  # 2.9.0+cu128
print(f"cuda available: {torch.cuda.is_available()}")  # False (cpu-only)
print(f"cpu capability: {torch.backends.cpu.get_cpu_capability()}")  # AVX2
print(f"cuda compiled: {torch.version.cuda}")  # 12.8

testing environment: ubuntu 25.10, python 3.13.7, pytorch 2.9.0+cu128 (cpu mode), torchao 0.14.1

key takeaways

  • expanded hardware support: rocm, xpu, and cuda 13 wheel variants available
  • multi-gpu programming: symmetric memory simplifies distributed kernels
  • c++ stability: stable abi reduces maintenance for extensions
  • python 3.10 minimum: python 3.9 support dropped, 3.13+ confirmed working
  • macos 14+ required: ventura support removed for mps
  • onnx defaults changed: dynamo export now default with opset 20
  • wheel variants experimental: automatic hardware detection improving
  • performance gains: multiple optimizations across training and inference
  • breaking changes: review custom operators and dlpack usage
  • api updates: torchao config-based api, error_on_graph_break requires boolean, tf32 api changes
  • muon optimizer: new optimizer but only supports 2D parameters (no biases)
