pytorch 2.8 release

published: August 7, 2025 • updated: August 27, 2025

pytorch 2.8 was released august 6, 2025. key improvements include intel cpu quantized llm inference, torch.compile for apple silicon, float8 training, expanded platform support (sycl, xpu, xccl), and a simplified stable/unstable feature classification system.

testing environment: code examples tested on intel core ultra 7 165h (avx2, no amx) with nvidia rtx 1000 ada generation (compute capability 8.9). all python examples verified working except mps (mac-only) and fsdp2 (multi-gpu setup required).

installation

see pytorch setup with uv for detailed configuration including automatic backend selection.

# auto-detect backend (cuda if available, cpu otherwise)
uv pip install torch --torch-backend=auto
uv pip install torchvision torchao  # then add related packages

# or for project setup with specific backend
uv init my-project
cd my-project
uv add torch torchvision --index https://download.pytorch.org/whl/cu128  # cuda 12.8
uv add torchao  # add after torch is configured

with pip (traditional)

# cpu
pip install torch==2.8.0 torchao --index-url https://download.pytorch.org/whl/cpu

# cuda 12.8 (recommended)
pip install torch==2.8.0 torchao --index-url https://download.pytorch.org/whl/cu128

# cuda 12.6 (for older gpus: maxwell, pascal, volta)
pip install torch==2.8.0 torchao --index-url https://download.pytorch.org/whl/cu126

note: examples show both uv (recommended) and pip for compatibility. for project setup with uv, see the pytorch setup guide.
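
a quick sanity check after installation (the exact version suffix and backend availability depend on the machine):

import torch
import torchao

print(torch.__version__)    # e.g. 2.8.0+cu128 or 2.8.0+cpu
print(torchao.__version__)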

platform support

  • cuda: 12.6, 12.8, 12.9 (12.9 builds slated for removal in favor of cuda 13.0)
  • cuda 13.0: nightly builds expected august 29, 2025 (tracking issue)
  • rocm: 6.3, 6.4
  • xpu: linux and windows
  • mps: macos (ventura support ends in 2.9)
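
to see which of these backends a given install can actually use, a small check like the following works (the mps and xpu queries simply return False on hardware without those backends):

import torch

print("cuda:", torch.cuda.is_available())
print("mps:", torch.backends.mps.is_available())   # apple silicon
print("xpu:", torch.xpu.is_available())            # intel gpus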

training improvements

float8 training

torchao now supports float8 training with measurable performance gains:

from torchao.float8 import convert_to_float8_training

# convert model to float8
model = convert_to_float8_training(model)

# results: 1.5x faster training on llama-3.1-70b
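
a minimal sketch of how the conversion slots into a normal training step, assuming an sm_89-or-newer gpu and illustrative layer sizes (not taken from the release notes):

import torch
from torchao.float8 import convert_to_float8_training

# float8 pays off on large matmuls, so the layers here are just placeholders
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).to("cuda", dtype=torch.bfloat16)

convert_to_float8_training(model)   # swaps nn.Linear modules for float8 variants in place
model = torch.compile(model)        # the speedup relies on inductor fusing the float8 casts

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().sum()
loss.backward()
optimizer.step()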

hierarchical compilation

torch.compile handles complex model graphs more efficiently:

model = torch.compile(model, mode="reduce-overhead")
# better handling of nested control flow
# reduced compilation time for large models

for running training scripts with uv, see development workflow.

distributed training

fsdp2 now works with quantization:

from torch.distributed.fsdp import fully_shard
from torchao.quantization import quantize_, int8_weight_only

# quantize first, then shard with the fsdp2 fully_shard api
quantize_(model, int8_weight_only())
fully_shard(model)
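
the snippet above assumes a process group is already initialized; a minimal launch sketch (backend and process count are assumptions, e.g. torchrun --nproc_per_node=2 train.py) looks like:

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())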

inference improvements

intel cpu quantization

the headline feature. pytorch 2.8 achieves competitive cpu inference for llms:

from torchao.quantization import quantize_, int8_weight_only

# quantize model for intel cpu (int8 works on cpu)
quantize_(model, int8_weight_only())

# note: int4_weight_only() requires cuda backend
# intel cpu optimizations focus on int8 quantization
# performance on 6th gen xeon (32 cores):
# - llama-3.1-8b: 20% latency reduction with quantization
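
an end-to-end sketch with a hugging face causal lm, assuming transformers is installed and using a placeholder model id (substitute any checkpoint you have access to):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, int8_weight_only

model_id = "your-org/your-llm"  # placeholder, not a real checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# int8 weight-only quantization runs on cpu
quantize_(model, int8_weight_only())

inputs = tokenizer("pytorch 2.8 adds", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))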

supported quantization modes

# cpu-compatible quantization

# 8-bit weights only (works on cpu)
from torchao.quantization import int8_weight_only
quantize_(model, int8_weight_only())

# dynamic 8-bit activations, 8-bit weights (works on cpu)
from torchao.quantization import int8_dynamic_activation_int8_weight
quantize_(model, int8_dynamic_activation_int8_weight())

# cuda-only quantization

# 4-bit weights (requires cuda - not cpu compatible)
from torchao.quantization import int4_weight_only
model = model.cuda()
quantize_(model, int4_weight_only())  # cuda only

intel amx optimization

amx kernels now activate for smaller batch sizes:

# previously: only for batch_size >= 16
# now: benefits when batch_size > 4

# enable max-autotune for best performance
torch._inductor.config.max_autotune = True

# note: requires intel 4th gen xeon or newer with amx support
# check cpu features: grep amx /proc/cpuinfo
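
putting these pieces together, a sketch of int8 quantization plus autotuned compilation on an amx-capable cpu (layer sizes and batch size are illustrative):

import torch
from torchao.quantization import quantize_, int8_weight_only

torch._inductor.config.max_autotune = True

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
)
quantize_(model, int8_weight_only())
compiled = torch.compile(model)

out = compiled(torch.randn(8, 512))  # batch sizes above 4 can now hit the amx kernels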

apple silicon support

mps backend now supports torch.compile:

device = torch.device("mps")
model = model.to(device)
compiled_model = torch.compile(model)
# first compilation support for m-series chips

breaking changes

cuda architecture removal

maxwell (5.x), pascal (6.x), and volta (7.0) gpus no longer supported with cuda 12.8+. use cuda 12.6:

# for older gpus
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126

cuda 13.0 update: pytorch plans to drop cuda 12.9 builds on august 29, 2025 to make room for cuda 13.0 nightly builds. cuda 13.0 will bring:

  • ~71% smaller cuda math api binaries
  • 33.5% smaller wheel sizes (3.28gb → 2.18gb)
  • support for blackwell architecture (sm_120)
  • unified arm platform support

quantization api migration

torch.ao.quantization is deprecated. migrate to torchao:

# old (deprecated)
from torch.ao.quantization import quantize_dynamic

# new
from torchao.quantization import quantize_

removal timeline:

  • 2.8: deprecated with warnings
  • 2.10: complete removal

feature classification

the prototype/beta/stable system is replaced:

  • stable (api-stable): backwards compatibility guaranteed
  • unstable (api-unstable): api may change

all new features require an rfc document.

performance benchmarks

inference (measured)

| model | quantization | hardware | improvement |
|---|---|---|---|
| llama-3.1-8b | int8/amx | xeon 6th gen | 20% latency reduction |
| llama-3-8b | int4 | cuda gpu | 1.89x faster, 58% less memory |
| llama-3.2-3b | qat | generic cpu | 77% perplexity recovery |

training (measured)

| model | technique | improvement |
|---|---|---|
| llama-3.1-70b | float8 | 1.5x faster |

migration guide

quantization code update

# before (2.7)
import torch.ao.quantization as quant
model = quant.quantize_dynamic(model, {torch.nn.Linear})

# after (2.8) - tested and working
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight
quantize_(model, int8_dynamic_activation_int8_weight())

working cpu example

import torch
from torchao.quantization import quantize_, int8_weight_only

# create model
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32)
)

# quantize for cpu inference
quantize_(model, int8_weight_only())

# run inference
x = torch.randn(32, 128)
output = model(x)  # works on cpu

hardware compatibility check

import torch

# check cuda support
if torch.cuda.is_available():
    capability = torch.cuda.get_device_capability()
    device_name = torch.cuda.get_device_name(0)
    print(f"gpu: {device_name}, compute capability: {capability}")

    if capability < (7, 5):  # maxwell (5.x), pascal (6.x), or volta (7.0)
        print("warning: use cuda 12.6 for maxwell/pascal/volta gpus")
        print("cuda 12.8+ drops support for compute capability below 7.5")
else:
    print("cuda not available - cpu quantization only (int8)")

# check cpu features (linux)
import subprocess
try:
    result = subprocess.run(['grep', 'flags', '/proc/cpuinfo'],
                          capture_output=True, text=True)
    has_amx = 'amx' in result.stdout
    has_avx512 = 'avx512' in result.stdout
    print(f"cpu has amx: {has_amx}, avx512: {has_avx512}")
except (OSError, subprocess.SubprocessError):
    pass  # /proc/cpuinfo or grep not available (windows/mac)

compilation flags

# new compilation options
torch._inductor.config.max_autotune = True  # enable amx optimizations
torch._inductor.config.use_mixed_mm = True  # mixed precision matmul

key takeaways

  • intel cpu inference improved with int8 quantization and amx optimizations
  • torchao replaces torch.ao.quantization (removal in 2.10)
  • int4 quantization requires cuda, int8 works on cpu
  • float8 training delivers a 1.5x speedup (measured on llama-3.1-70b)
  • older nvidia gpus (maxwell, pascal, volta) need cuda 12.6
  • apple silicon gets torch.compile support via mpsinductor

references

pytorch 2.8 documentation

pytorch cuda 13.0 tracking
