pytorch 2.8 release
pytorch 2.8 was released august 6, 2025. key improvements include intel cpu quantized llm inference, torch.compile for apple silicon, float8 training, expanded platform support (sycl, xpu, xccl), and a simplified stable/unstable feature classification system.
testing environment: code examples tested on intel core ultra 7 165h (avx2, no amx) with nvidia rtx 1000 ada generation (compute capability 8.9). all python examples verified working except mps (mac-only) and fsdp2 (multi-gpu setup required).
installation
with uv (recommended)
see pytorch setup with uv for detailed configuration including automatic backend selection.
# auto-detect backend (cuda if available, cpu otherwise)
uv pip install torch --torch-backend=auto
uv pip install torchvision torchao # then add related packages
# or for project setup with specific backend
uv init my-project
cd my-project
uv add torch torchvision --index https://download.pytorch.org/whl/cu128 # cuda 12.8
uv add torchao # add after torch is configured
with pip (traditional)
# cpu
pip install torch==2.8.0 torchao --index-url https://download.pytorch.org/whl/cpu
# cuda 12.8 (recommended)
pip install torch==2.8.0 torchao --index-url https://download.pytorch.org/whl/cu128
# cuda 12.6 (for older gpus: maxwell, pascal, volta)
pip install torch==2.8.0 torchao --index-url https://download.pytorch.org/whl/cu126
note: examples show both uv (recommended) and pip for compatibility. for project setup with uv, see the pytorch setup guide.
platform support
- cuda: 12.6, 12.8, 12.9
- rocm: 6.3, 6.4
- xpu: linux and windows
- mps: macos (ventura support ends in 2.9)
training improvements
float8 training
torchao now supports float8 training with measurable performance gains:
from torchao.float8 import convert_to_float8_training
# convert model to float8
model = convert_to_float8_training(model)
# results: 1.5x faster training on llama-3.1-70b
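for reference, a minimal end-to-end sketch of one training step after conversion. it assumes a float8-capable cuda gpu (hopper or ada class) and uses an illustrative toy model rather than an llm; the layer sizes and hyperparameters are placeholders:
import torch
from torchao.float8 import convert_to_float8_training

# toy model (illustrative sizes); float8 linear kernels want bfloat16 inputs
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).to("cuda", dtype=torch.bfloat16)

# swap eligible nn.Linear layers for float8 training variants
model = convert_to_float8_training(model)

# torch.compile is typically combined with float8 so the scaling ops get fused
model = torch.compile(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)

# one training step
optimizer.zero_grad()
loss = model(x).float().sum()
loss.backward()
optimizer.step()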
hierarchical compilation
torch.compile handles complex model graphs more efficiently:
model = torch.compile(model, mode="reduce-overhead")
# better handling of nested control flow
# reduced compilation time for large models
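a quick way to see the compilation cost versus the steady state is to time the first and second calls. a minimal sketch with an illustrative toy model (timings vary widely by hardware):
import time
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 256),
)
compiled = torch.compile(model, mode="reduce-overhead")
x = torch.randn(8, 256)

start = time.perf_counter()
compiled(x)  # first call triggers compilation
print(f"first call (includes compile): {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
compiled(x)  # later calls reuse the cached compiled graph
print(f"warm call: {time.perf_counter() - start:.5f}s")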
for running training scripts with uv, see development workflow.
distributed training
fsdp2 now works with quantization:
# fsdp2 is the fully_shard api (the older FullyShardedDataParallel wrapper is fsdp1)
from torch.distributed.fsdp import fully_shard
from torchao.quantization import quantize_, int8_weight_only
# quantize first, then shard across ranks
quantize_(model, int8_weight_only())
fully_shard(model)
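a fuller sketch of that flow, including process-group setup. as noted above, this was not verified on the single-gpu test machine; it assumes a multi-gpu host and a torchrun launch, and the model is an illustrative toy:
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard
from torchao.quantization import quantize_, int8_weight_only

# launch with: torchrun --nproc-per-node=<gpus> script.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).cuda()

quantize_(model, int8_weight_only())  # quantize before sharding
fully_shard(model)                    # fsdp2: shards parameters across ranks

with torch.inference_mode():
    out = model(torch.randn(8, 1024, device="cuda"))

dist.destroy_process_group()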
inference improvements
intel cpu quantization
the headline feature. pytorch 2.8 achieves competitive cpu inference for llms:
from torchao.quantization import quantize_, int8_weight_only
# quantize model for intel cpu (int8 works on cpu)
quantize_(model, int8_weight_only())
# note: int4_weight_only() requires cuda backend
# intel cpu optimizations focus on int8 quantization
# performance on 6th gen xeon (32 cores):
# - llama-3.1-8b: 20% latency reduction with quantization
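a rough way to measure the effect locally is to benchmark an fp32 copy against a quantized copy. a sketch on a toy mlp (not an llm, so the xeon numbers above should not be expected; small models may even favor fp32):
import copy
import time
import torch
from torchao.quantization import quantize_, int8_weight_only

layers = []
for _ in range(8):
    layers += [torch.nn.Linear(1024, 1024), torch.nn.ReLU()]
model = torch.nn.Sequential(*layers).eval()

quantized = copy.deepcopy(model)
quantize_(quantized, int8_weight_only())

x = torch.randn(4, 1024)

def bench(m, iters=50):
    with torch.inference_mode():
        m(x)  # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            m(x)
        return (time.perf_counter() - start) / iters

print(f"fp32: {bench(model) * 1e3:.2f} ms/iter")
print(f"int8: {bench(quantized) * 1e3:.2f} ms/iter")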
supported quantization modes
# cpu-compatible quantization
# 8-bit weights only (works on cpu)
from torchao.quantization import int8_weight_only
quantize_(model, int8_weight_only())
# dynamic 8-bit activations, 8-bit weights (works on cpu)
from torchao.quantization import int8_dynamic_activation_int8_weight
quantize_(model, int8_dynamic_activation_int8_weight())
# cuda-only quantization
# 4-bit weights (requires cuda - not cpu compatible)
from torchao.quantization import int4_weight_only
model = model.cuda()
quantize_(model, int4_weight_only()) # cuda only
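to pick between these modes at runtime, one option is a small device check. a sketch (the fallback policy and sizes here are just a reasonable default, not an official recommendation):
import torch
from torchao.quantization import (
    quantize_,
    int4_weight_only,
    int8_dynamic_activation_int8_weight,
)

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 64),
)

if torch.cuda.is_available():
    # int4 weight-only kernels generally expect bfloat16 weights on cuda
    model = model.to(torch.bfloat16).cuda()
    quantize_(model, int4_weight_only())
else:
    # cpu path: int8 dynamic activations + int8 weights
    quantize_(model, int8_dynamic_activation_int8_weight())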
intel amx optimization
amx kernels now activate for smaller batch sizes:
# previously: only for batch_size >= 16
# now: benefits when batch_size > 4
# enable max-autotune for best performance
torch._inductor.config.max_autotune = True
# note: requires intel 4th gen xeon or newer with amx support
# check cpu features: grep amx /proc/cpuinfo
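a small sketch of enabling that path on cpu. get_cpu_capability only reports the vectorization level (avx2/avx512), not amx itself, so keep the /proc/cpuinfo check above; whether amx kernels actually kick in depends on the cpu and the batch-size heuristics:
import torch

print("cpu capability:", torch.backends.cpu.get_cpu_capability())  # e.g. "AVX2", "AVX512"

torch._inductor.config.max_autotune = True

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU()).eval()
compiled = torch.compile(model)

with torch.inference_mode():
    out = compiled(torch.randn(8, 512))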
apple silicon support
mps backend now supports torch.compile:
device = torch.device("mps")
model = model.to(device)
compiled_model = torch.compile(model)
# first compilation support for m-series chips
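since the test machine above has no mps device, here is a guarded variant (illustrative toy model) that falls back to cpu when mps is unavailable:
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU()).to(device)
compiled = torch.compile(model)

out = compiled(torch.randn(8, 64, device=device))
print(out.device)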
breaking changes
cuda architecture removal
maxwell (5.x), pascal (6.x), and volta (7.0) gpus are no longer supported by the cuda 12.8+ builds. use cuda 12.6 instead:
# for older gpus
pip install torch==2.8.0 --index-url https://download.pytorch.org/whl/cu126
quantization api migration
torch.ao.quantization is deprecated. migrate to torchao:
# old (deprecated)
from torch.ao.quantization import quantize_dynamic
# new
from torchao.quantization import quantize_
removal timeline:
- 2.8: deprecated with warnings
- 2.10: complete removal
feature classification
the previous prototype/beta/stable system is replaced by two categories:
- stable (api-stable): backwards compatibility guaranteed
- unstable (api-unstable): api may change
all new features require an rfc document.
performance benchmarks
inference (measured)
| model | quantization | hardware | improvement |
|---|---|---|---|
| llama-3.1-8b | int8/amx | xeon 6th gen | 20% latency reduction |
| llama-3-8b | int4 | cuda gpu | 1.89x faster, 58% less memory |
| llama-3.2-3b | qat | generic cpu | 77% perplexity recovery |
training (measured)
| model | technique | improvement |
|---|---|---|
| llama-3.1-70b | float8 | 1.5x faster |
migration guide
quantization code update
# before (2.7)
import torch.ao.quantization as quant
model = quant.quantize_dynamic(model, {torch.nn.Linear})
# after (2.8) - tested and working
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight
quantize_(model, int8_dynamic_activation_int8_weight())
working cpu example
import torch
from torchao.quantization import quantize_, int8_weight_only
# create model
model = torch.nn.Sequential(
torch.nn.Linear(128, 64),
torch.nn.ReLU(),
torch.nn.Linear(64, 32)
)
# quantize for cpu inference
quantize_(model, int8_weight_only())
# run inference
x = torch.randn(32, 128)
output = model(x) # works on cpu
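a quick follow-up check that quantization did not change the outputs beyond the expected quantization error (same toy model as above, rebuilt so an fp32 copy is kept; the tolerance you accept is application-specific):
import copy
import torch
from torchao.quantization import quantize_, int8_weight_only

baseline = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 32),
)
quantized = copy.deepcopy(baseline)
quantize_(quantized, int8_weight_only())

x = torch.randn(32, 128)
max_err = (baseline(x) - quantized(x)).abs().max().item()
print(f"max abs difference vs fp32: {max_err:.4f}")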
hardware compatibility check
import torch
# check cuda support
if torch.cuda.is_available():
capability = torch.cuda.get_device_capability()
device_name = torch.cuda.get_device_name(0)
print(f"gpu: {device_name}, compute capability: {capability}")
    if capability < (7, 5):  # maxwell (5.x), pascal (6.x), or volta (7.0)
        print("warning: use the cuda 12.6 build for maxwell/pascal/volta gpus")
        print("cuda 12.8+ builds drop support for these architectures")
else:
print("cuda not available - cpu quantization only (int8)")
# check cpu features (linux)
import subprocess
try:
result = subprocess.run(['grep', 'flags', '/proc/cpuinfo'],
capture_output=True, text=True)
has_amx = 'amx' in result.stdout
has_avx512 = 'avx512' in result.stdout
print(f"cpu has amx: {has_amx}, avx512: {has_avx512}")
except OSError:
    pass  # grep or /proc/cpuinfo not available (windows/mac)
compilation flags
# new compilation options
torch._inductor.config.max_autotune = True # enable amx optimizations
torch._inductor.config.use_mixed_mm = True # mixed precision matmul
key takeaways
- intel cpu inference improved with int8 quantization and amx optimizations
- torchao replaces torch.ao.quantization (removal in 2.10)
- int4 quantization requires cuda, int8 works on cpu
- float8 training is up to 1.5x faster (measured on llama-3.1-70b)
- older nvidia gpus (maxwell, pascal, volta) need cuda 12.6
- apple silicon gets torch.compile support via mpsinductor
references
pytorch 2.8 documentation
related guides
- pytorch setup with uv - complete uv configuration for pytorch
- uv package manager - comprehensive uv reference
- cuda setup - gpu configuration guide
══════════════════════════════════════════════════════════════════