pytorch 2.8 release

pytorch 2.8 was released august 6, 2025. key improvements include intel cpu quantized llm inference, torch.compile for apple silicon, float8 training, expanded platform support (sycl, xpu, xccl), and a simplified stable/unstable feature classification system.

testing environment: code examples tested on intel core ultra 7 165h (avx2, no amx) with nvidia rtx 1000 ada generation (compute capability 8.9). all python examples verified working except mps (mac-only) and fsdp2 (multi-gpu setup required).

installation

see pytorch setup with uv for detailed configuration including automatic backend selection.

# auto-detect backend (cuda if available, cpu otherwise)
uv pip install torch --torch-backend=auto
uv pip install torchvision torchao  # then add related packages

# or for project setup with specific backend
uv init my-project
cd my-project
uv add torch torchvision --index https://download.pytorch.org/whl/cu128  # cuda 12.8
uv add torchao  # add after torch is configured

with pip (traditional)

# cpu
pip install torch==2.8.0 torchao --index-url https://download.pytorch.org/whl/cpu

# cuda 12.8 (recommended)
pip install torch==2.8.0 torchao --index-url https://download.pytorch.org/whl/cu128

# cuda 12.6 (for older gpus: maxwell, pascal, volta)
pip install torch==2.8.0 torchao --index-url https://download.pytorch.org/whl/cu126

note: examples show both uv (recommended) and pip for compatibility. for project setup with uv, see the pytorch setup guide.
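
a quick post-install check confirms which backend was picked up (nothing here is specific to this release beyond the expected version number):

import torch

print(torch.__version__)           # 2.8.0, typically with a +cpu/+cu128 style suffix depending on the index
print(torch.cuda.is_available())   # true on the cuda wheels with a working driver
if torch.cuda.is_available():
    print(torch.version.cuda)      # e.g. "12.8" for the cu128 build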

platform support

  • cuda: 12.6, 12.8, 12.9
  • rocm: 6.3, 6.4
  • xpu: linux and windows
  • mps: macos (ventura support ends in 2.9)

training improvements

float8 training

torchao now supports float8 training with measurable performance gains:

from torchao.float8 import convert_to_float8_training

# convert model to float8
model = convert_to_float8_training(model)

# results: 1.5x faster training on llama-3.1-70b
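
in practice the float8 speedups are reported together with torch.compile; the sketch below shows one training step on a toy model rather than llama, and assumes a gpu with float8 support (compute capability 8.9 or newer) and bfloat16 weights:

import torch
from torchao.float8 import convert_to_float8_training

# toy stand-in for a transformer mlp; float8 targets nn.Linear layers,
# and the dimensions should be multiples of 16 for the scaled matmuls
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).to(device="cuda", dtype=torch.bfloat16)

convert_to_float8_training(model)   # swaps linears to float8 training variants in place
model = torch.compile(model)        # fused float8 kernels come from inductor
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(32, 1024, device="cuda", dtype=torch.bfloat16)
loss = model(x).float().pow(2).mean()
loss.backward()
optimizer.step()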

hierarchical compilation

torch.compile handles complex model graphs more efficiently:

model = torch.compile(model, mode="reduce-overhead")
# better handling of nested control flow
# reduced compilation time for large models

for running training scripts with uv, see development workflow.

distributed training

fsdp2 now works with quantization:

from torch.distributed.fsdp import fully_shard
from torchao.quantization import quantize_, int8_weight_only

# quantize first, then shard with the fsdp2 api
quantize_(model, int8_weight_only())
fully_shard(model)

inference improvements

intel cpu quantization

the headline feature. pytorch 2.8 achieves competitive cpu inference for llms:

from torchao.quantization import quantize_, int8_weight_only

# quantize model for intel cpu (int8 works on cpu)
quantize_(model, int8_weight_only())

# note: int4_weight_only() requires cuda backend
# intel cpu optimizations focus on int8 quantization
# performance on 6th gen xeon (32 cores):
# - llama-3.1-8b: 20% latency reduction with quantization
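
the same call composes with a hugging face checkpoint for actual llm inference; a sketch assuming the transformers package, with an illustrative (gated) model id that can be swapped for any causal lm:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import quantize_, int8_weight_only

model_id = "meta-llama/Llama-3.1-8B"   # illustrative; requires access approval
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

quantize_(model, int8_weight_only())   # int8 weight-only, cpu-friendly
model.eval()

prompt = tokenizer("the key feature of pytorch 2.8 is", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**prompt, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))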

supported quantization modes

# cpu-compatible quantization

# 8-bit weights only (works on cpu)
from torchao.quantization import quantize_, int8_weight_only
quantize_(model, int8_weight_only())

# dynamic 8-bit activations, 8-bit weights (works on cpu)
from torchao.quantization import int8_dynamic_activation_int8_weight
quantize_(model, int8_dynamic_activation_int8_weight())

# cuda-only quantization

# 4-bit weights (requires cuda - not cpu compatible)
from torchao.quantization import int4_weight_only
model = model.cuda()
quantize_(model, int4_weight_only())  # cuda only

intel amx optimization

amx kernels now activate for smaller batch sizes:

# previously: only for batch_size >= 16
# now: benefits when batch_size > 4

# enable max-autotune for best performance
torch._inductor.config.max_autotune = True

# note: requires intel 4th gen xeon or newer with amx support
# check cpu features: grep amx /proc/cpuinfo
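
the same autotuning can also be requested per compile call through the mode argument, without touching the global config; a small sketch with an arbitrary model:

import torch

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())

# mode="max-autotune" turns on inductor autotuning, which is what selects
# the amx matmul kernels on xeons that have them
compiled = torch.compile(model, mode="max-autotune")
out = compiled(torch.randn(8, 512))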

apple silicon support

mps backend now supports torch.compile:

device = torch.device("mps")
model = model.to(device)
compiled_model = torch.compile(model)
# first torch.compile support for the mps backend on m-series chips
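
since the mps device only exists on apple-silicon builds, a guarded version of the same snippet stays runnable on other machines (the cpu fallback is an assumption of this sketch, not something the release requires):

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = torch.nn.Linear(64, 64).to(device)
compiled_model = torch.compile(model)

x = torch.randn(4, 64, device=device)
print(compiled_model(x).device)   # mps:0 on m-series macs, cpu elsewhere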

breaking changes

cuda architecture removal

maxwell (5.x), pascal (6.x), and volta (7.0) gpus no longer supported with cuda 12.8+. use cuda 12.6:

# for older gpus
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126

quantization api migration

torch.ao.quantization is deprecated. migrate to torchao:

# old (deprecated)
from torch.ao.quantization import quantize_dynamic

# new
from torchao.quantization import quantize_

removal timeline:

  • 2.8: deprecated with warnings
  • 2.10: complete removal

feature classification

the prototype/beta/stable system is replaced:

  • stable (api-stable): backwards compatibility guaranteed
  • unstable (api-unstable): api may change

all new features require an rfc document.

performance benchmarks

inference (measured)

model           quantization   hardware       improvement
llama-3.1-8b    int8/amx       xeon 6th gen   20% latency reduction
llama-3-8b      int4           cuda gpu       1.89x faster, 58% less memory
llama-3.2-3b    qat            generic cpu    77% perplexity recovery

training (measured)

model           technique   improvement
llama-3.1-70b   float8      1.5x faster

migration guide

quantization code update

# before (2.7)
import torch.ao.quantization as quant
model = quant.quantize_dynamic(model, {torch.nn.Linear})

# after (2.8) - tested and working
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight
quantize_(model, int8_dynamic_activation_int8_weight())

working cpu example

import torch
from torchao.quantization import quantize_, int8_weight_only

# create model
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32)
)

# quantize for cpu inference
quantize_(model, int8_weight_only())

# run inference
x = torch.randn(32, 128)
output = model(x)  # works on cpu
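
a quick sanity check is to compare the quantized model against a float32 copy on the same input; the exact error depends on the weights, so the printed value is illustrative only:

import copy

import torch
from torchao.quantization import quantize_, int8_weight_only

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
)
reference = copy.deepcopy(model)        # keep an unquantized float32 baseline

quantize_(model, int8_weight_only())    # quantize the working copy in place

x = torch.randn(32, 128)
max_abs_err = (model(x) - reference(x)).abs().max().item()
print(f"max abs error vs float32: {max_abs_err:.4f}")   # small but nonzero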

hardware compatibility check

import torch

# check cuda support
if torch.cuda.is_available():
    capability = torch.cuda.get_device_capability()
    device_name = torch.cuda.get_device_name(0)
    print(f"gpu: {device_name}, compute capability: {capability}")

    if capability < (7, 5):  # maxwell, pascal, or volta
        print("warning: use cuda 12.6 for maxwell/pascal/volta gpus")
        print("cuda 12.8+ builds drop support for compute capability below 7.5")
else:
    print("cuda not available - cpu quantization only (int8)")

# check cpu features (linux only; /proc/cpuinfo does not exist elsewhere)
try:
    with open('/proc/cpuinfo') as f:
        cpu_flags = f.read()
    has_amx = 'amx' in cpu_flags
    has_avx512 = 'avx512' in cpu_flags
    print(f"cpu has amx: {has_amx}, avx512: {has_avx512}")
except FileNotFoundError:
    pass  # windows/macos

compilation flags

# new compilation options
torch._inductor.config.max_autotune = True  # enable amx optimizations
torch._inductor.config.use_mixed_mm = True  # mixed precision matmul

key takeaways

  • intel cpu inference improved with int8 quantization and amx optimizations
  • torchao replaces torch.ao.quantization (removal in 2.10)
  • int4 quantization requires cuda, int8 works on cpu
  • float8 training is up to 1.5x faster (measured on llama-3.1-70b)
  • older nvidia gpus (maxwell, pascal, volta) need cuda 12.6
  • apple silicon gets torch.compile support via mpsinductor

references

pytorch 2.8 documentation
