pytorch 2.8 release

published: August 7, 2025 • updated: August 27, 2025

pytorch 2.8 was released august 6, 2025. key improvements include intel cpu quantized llm inference, torch.compile for apple silicon, float8 training, expanded platform support (sycl, xpu, xccl), and a simplified stable/unstable feature classification system.

testing environment: code examples tested on intel core ultra 7 165h (avx2, no amx) with nvidia rtx 1000 ada generation (compute capability 8.9). all python examples verified working except mps (mac-only) and fsdp2 (multi-gpu setup required).

installation

see pytorch setup with uv for detailed configuration including automatic backend selection.

# auto-detect backend (cuda if available, cpu otherwise)
uv pip install torch --torch-backend=auto
uv pip install torchvision torchao  # then add related packages

# or for project setup with specific backend
uv init my-project
cd my-project
uv add torch torchvision --index https://download.pytorch.org/whl/cu128  # cuda 12.8
uv add torchao  # add after torch is configured

with pip (traditional)

# cpu
pip install torch==2.8.0 torchao --index-url https://download.pytorch.org/whl/cpu

# cuda 12.8 (recommended)
pip install torch==2.8.0 torchao --index-url https://download.pytorch.org/whl/cu128

# cuda 12.6 (for older gpus: maxwell, pascal, volta)
pip install torch==2.8.0 torchao --index-url https://download.pytorch.org/whl/cu126

note: examples show both uv (recommended) and pip for compatibility. for project setup with uv, see the pytorch setup guide.
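
a quick sanity check after installation (the exact version suffix and backend availability depend on the machine):

import torch
import torchao

print(torch.__version__)    # e.g. 2.8.0+cu128 or 2.8.0+cpu
print(torchao.__version__)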

platform support

  • cuda: 12.6, 12.8, 12.9 (12.9 builds slated for removal in favor of cuda 13.0)
  • cuda 13.0: nightly builds expected august 29, 2025 (tracking issue)
  • rocm: 6.3, 6.4
  • xpu: linux and windows
  • mps: macos (ventura support ends in 2.9)
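
to see which of these backends a given install can actually use, a small check like the following works (the mps and xpu queries simply return False on hardware without those backends):

import torch

print("cuda:", torch.cuda.is_available())
print("mps:", torch.backends.mps.is_available())   # apple silicon
print("xpu:", torch.xpu.is_available())            # intel gpus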

training improvements

float8 training

torchao now supports float8 training with measurable performance gains:

from torchao.float8 import convert_to_float8_training

# convert model to float8
model = convert_to_float8_training(model)

# results: 1.5x faster training on llama-3.1-70b
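
a minimal sketch of how the conversion slots into a normal training step, assuming an sm_89-or-newer gpu and illustrative layer sizes (not taken from the release notes):

import torch
from torchao.float8 import convert_to_float8_training

# float8 pays off on large matmuls, so the layers here are just placeholders
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).to("cuda", dtype=torch.bfloat16)

convert_to_float8_training(model)   # swaps nn.Linear modules for float8 variants in place
model = torch.compile(model)        # the speedup relies on inductor fusing the float8 casts

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().sum()
loss.backward()
optimizer.step()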

hierarchical compilation

torch.compile handles complex model graphs more efficiently:

model = torch.compile(model, mode="reduce-overhead")
# better handling of nested control flow
# reduced compilation time for large models

for running training scripts with uv, see development workflow.

distributed training

fsdp2 now works with quantization:

from torch.distributed.fsdp import fully_shard
from torchao.quantization import quantize_, int8_weight_only

# quantize first, then shard with the fsdp2 fully_shard api
quantize_(model, int8_weight_only())
fully_shard(model)
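
the snippet above assumes a process group is already initialized; a minimal launch sketch (backend and process count are assumptions, e.g. torchrun --nproc_per_node=2 train.py) looks like:

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())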

inference improvements

intel cpu quantization

the headline feature. pytorch 2.8 achieves competitive cpu inference for llms:

from torchao.quantization import quantize_, int8_weight_only

# quantize model for intel cpu (int8 works on cpu)
quantize_(model, int8_weight_only())

# note: int4_weight_only() requires cuda backend
# intel cpu optimizations focus on int8 quantization
# performance on 6th gen xeon (32 cores):
# - llama-3.1-8b: 20% latency reduction with quantization
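
an end-to-end sketch with a hugging face causal lm, assuming transformers is installed and using a placeholder model id (substitute any checkpoint you have access to):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, int8_weight_only

model_id = "your-org/your-llm"  # placeholder, not a real checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# int8 weight-only quantization runs on cpu
quantize_(model, int8_weight_only())

inputs = tokenizer("pytorch 2.8 adds", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))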

supported quantization modes

# cpu-compatible quantization

# 8-bit weights only (works on cpu)
from torchao.quantization import int8_weight_only
quantize_(model, int8_weight_only())

# dynamic 8-bit activations, 8-bit weights (works on cpu)
from torchao.quantization import int8_dynamic_activation_int8_weight
quantize_(model, int8_dynamic_activation_int8_weight())

# cuda-only quantization

# 4-bit weights (requires cuda - not cpu compatible)
from torchao.quantization import int4_weight_only
model = model.cuda()
quantize_(model, int4_weight_only())  # cuda only

intel amx optimization

amx kernels now activate for smaller batch sizes:

# previously: only for batch_size >= 16
# now: benefits when batch_size > 4

# enable max-autotune for best performance
torch._inductor.config.max_autotune = True

# note: requires intel 4th gen xeon or newer with amx support
# check cpu features: grep amx /proc/cpuinfo
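
putting these pieces together, a sketch of int8 quantization plus autotuned compilation on an amx-capable cpu (layer sizes and batch size are illustrative):

import torch
from torchao.quantization import quantize_, int8_weight_only

torch._inductor.config.max_autotune = True

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
)
quantize_(model, int8_weight_only())
compiled = torch.compile(model)

out = compiled(torch.randn(8, 512))  # batch sizes above 4 can now hit the amx kernels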

apple silicon support

mps backend now supports torch.compile:

device = torch.device("mps")
model = model.to(device)
compiled_model = torch.compile(model)
# first compilation support for m-series chips

breaking changes

cuda architecture removal

maxwell (5.x), pascal (6.x), and volta (7.0) gpus no longer supported with cuda 12.8+. use cuda 12.6:

# for older gpus
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126

cuda 13.0 update: pytorch plans to drop cuda 12.9 builds on august 29, 2025 to make room for cuda 13.0 nightly builds. cuda 13.0 will bring:

  • ~71% smaller cuda math api binaries
  • 33.5% smaller wheel sizes (3.28gb → 2.18gb)
  • support for blackwell architecture (sm_120)
  • unified arm platform support

quantization api migration

torch.ao.quantization is deprecated. migrate to torchao:

# old (deprecated)
from torch.ao.quantization import quantize_dynamic

# new
from torchao.quantization import quantize_

removal timeline:

  • 2.8: deprecated with warnings
  • 2.10: complete removal

feature classification

the prototype/beta/stable system is replaced:

  • stable (api-stable): backwards compatibility guaranteed
  • unstable (api-unstable): api may change

all new features require an rfc document.

performance benchmarks

inference (measured)

| model | quantization | hardware | improvement |
|---|---|---|---|
| llama-3.1-8b | int8/amx | xeon 6th gen | 20% latency reduction |
| llama-3-8b | int4 | cuda gpu | 1.89x faster, 58% less memory |
| llama-3.2-3b | qat | generic cpu | 77% perplexity recovery |

training (measured)

| model | technique | improvement |
|---|---|---|
| llama-3.1-70b | float8 | 1.5x faster |

migration guide

quantization code update

# before (2.7)
import torch.ao.quantization as quant
model = quant.quantize_dynamic(model, {torch.nn.Linear})

# after (2.8) - tested and working
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight
quantize_(model, int8_dynamic_activation_int8_weight())

working cpu example

import torch
from torchao.quantization import quantize_, int8_weight_only

# create model
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32)
)

# quantize for cpu inference
quantize_(model, int8_weight_only())

# run inference
x = torch.randn(32, 128)
output = model(x)  # works on cpu

hardware compatibility check

import torch

# check cuda support
if torch.cuda.is_available():
    capability = torch.cuda.get_device_capability()
    device_name = torch.cuda.get_device_name(0)
    print(f"gpu: {device_name}, compute capability: {capability}")

    if capability < (7, 5):  # maxwell (5.x), pascal (6.x), or volta (7.0)
        print("warning: use cuda 12.6 for maxwell/pascal/volta gpus")
        print("cuda 12.8+ drops support for compute capability below 7.5")
else:
    print("cuda not available - cpu quantization only (int8)")

# check cpu features (linux)
import subprocess
try:
    result = subprocess.run(['grep', 'flags', '/proc/cpuinfo'],
                          capture_output=True, text=True)
    has_amx = 'amx' in result.stdout
    has_avx512 = 'avx512' in result.stdout
    print(f"cpu has amx: {has_amx}, avx512: {has_avx512}")
except (OSError, subprocess.SubprocessError):
    pass  # /proc/cpuinfo or grep not available (windows/mac)

compilation flags

# new compilation options
torch._inductor.config.max_autotune = True  # enable amx optimizations
torch._inductor.config.use_mixed_mm = True  # mixed precision matmul

key takeaways

  • intel cpu inference improved with int8 quantization and amx optimizations
  • torchao replaces torch.ao.quantization (removal in 2.10)
  • int4 quantization requires cuda, int8 works on cpu
  • float8 training delivers a 1.5x speedup (measured on llama-3.1-70b)
  • older nvidia gpus (maxwell, pascal, volta) need cuda 12.6
  • apple silicon gets torch.compile support via mpsinductor

references

pytorch 2.8 documentation

pytorch cuda 13.0 tracking
