mxfp4 quantization and gpu compute requirements
mxfp4 quantization requires gpu hardware with compute capability >= 9.0 (hopper or newer). this page explains the requirement and provides workarounds for the common error raised when loading mxfp4 quantized models on older hardware.
the error
when trying to load gpt-oss models or other mxfp4 quantized models with transformers, you may encounter:
ValueError: MXFP4 quantized models is only supported on GPUs with compute capability >= 9.0 (e.g H100, or B100)
this error occurs because mxfp4 (microscaling fp4) quantization depends on tensor core and kernel support that is only available on the hopper architecture and newer.
what is mxfp4?
mxfp4 is a 4-bit microscaling floating-point format, defined in the ocp microscaling formats specification, used for efficient inference of large language models. it provides:
- 4-bit weights: reducing memory footprint by ~4x versus fp16
- e2m1 elements: each value is a tiny float with 1 sign bit, 2 exponent bits, and 1 mantissa bit
- tensor core acceleration: leveraging specialized hardware support for fp4 operations
- block scaling: every 32-element block shares a power-of-two (e8m0) scale factor that preserves dynamic range (sketched in code below)
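the following is a rough numpy sketch of how one block could be quantized; it is illustrative only, and real kernels pack bits and handle rounding and saturation differently:
# illustrative mxfp4-style block quantization, not the production kernel
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable fp4 magnitudes

def quantize_mxfp4_block(block):
    # shared power-of-two (e8m0-style) scale so the largest magnitude lands near 6.0
    amax = np.abs(block).max()
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2.0) if amax > 0 else 1.0
    scaled = block / scale
    # snap each element to the nearest representable e2m1 magnitude, keeping the sign
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx], scale

x = np.random.randn(32).astype(np.float32)  # one 32-element block
q, s = quantize_mxfp4_block(x)
print("max abs error:", np.abs(x - q * s).max())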
gpu compute capability requirements
gpus supporting mxfp4 (compute >= 9.0)
from the official nvidia cuda gpus list, these gpus support mxfp4:
compute capability 9.0 (hopper)
- nvidia gh200 - grace hopper superchip
- nvidia h200 - enhanced h100 with 141gb hbm3e
- nvidia h100 - flagship datacenter gpu (80gb sxm/pcie and 94gb nvl variants)
compute capability 10.0 (blackwell)
- nvidia gb200 - grace blackwell superchip
- nvidia b200 - next-gen datacenter gpu
- nvidia b100 - datacenter accelerator
compute capability 12.0 (blackwell rtx)
- professional:
  - nvidia rtx pro 6000 blackwell (server/workstation editions)
  - nvidia rtx pro 5000 blackwell
  - nvidia rtx pro 4500 blackwell
  - nvidia rtx pro 4000 blackwell
- consumer (rtx 50 series):
  - geforce rtx 5090
  - geforce rtx 5080
  - geforce rtx 5070 ti
  - geforce rtx 5070
  - geforce rtx 5060 ti
  - geforce rtx 5060
gpus that cannot run mxfp4
these popular gpus have compute capability < 9.0 and cannot run mxfp4 models natively:
compute capability 8.9 (ada lovelace)
- rtx 4090, rtx 4080, rtx 4070 ti, rtx 4070, rtx 4060
- rtx 6000 ada, rtx 5000 ada, rtx 4000 ada
- l40, l40s, l4
compute capability 8.6 (ampere)
- rtx 3090, rtx 3080, rtx 3070, rtx 3060
- rtx a6000, rtx a5000, rtx a4000
- a40, a10, a16
compute capability 8.0 (ampere datacenter)
- a100 (40gb/80gb)
- a30
compute capability 7.5 (turing)
- rtx 2080 ti, rtx 2080, rtx 2070, rtx 2060
- quadro rtx 8000, rtx 6000, rtx 5000
solutions and workarounds
1. use non-quantized models
load the model without mxfp4 quantization:
import torch
from transformers import AutoModelForCausalLM

# this fails on compute < 9.0 because the checkpoint is mxfp4 quantized
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype="auto",  # picks up the checkpoint's mxfp4 config
)

# load in fp16 instead (weights are dequantized, so vram usage is higher)
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
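recent transformers releases with gpt-oss support also expose an Mxfp4Config whose dequantize option loads the mxfp4 checkpoint as higher-precision weights on unsupported gpus; this is version-dependent, so verify it against your installed transformers before relying on it:
# version-dependent: Mxfp4Config(dequantize=True) ships with recent
# transformers releases that support gpt-oss; check your installed version
from transformers import AutoModelForCausalLM, Mxfp4Config

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    quantization_config=Mxfp4Config(dequantize=True),  # dequantize instead of erroring
    torch_dtype="auto",
)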
2. use alternative quantization
for gpus with compute < 9.0, use these quantization methods:
- gptq: 4-bit quantization, widely supported
- awq: activation-aware quantization
- bnb (bitsandbytes): 4-bit/8-bit quantization (see the example after this list)
- gguf: llama.cpp format for cpu/gpu inference
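as an illustration, 4-bit bitsandbytes loading looks like this; whether a given architecture (including gpt-oss) is bnb-compatible depends on the model and library versions:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# standard transformers + bitsandbytes 4-bit config; model support varies
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normalized float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",  # swap in any bnb-compatible checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)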
3. use ollama with gguf
ollama provides pre-quantized versions:
# use gguf quantized version
ollama pull gpt-oss:20b
ollama run gpt-oss:20b
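once the model is pulled, you can also call ollama's local rest api; a minimal sketch using python's requests, assuming ollama is serving on its default port 11434:
# minimal sketch against ollama's local rest api (default port 11434)
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gpt-oss:20b", "prompt": "hello", "stream": False},
)
print(resp.json()["response"])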
4. cloud solutions
if you need mxfp4 performance, use cloud providers with h100/h200:
- aws: p5 instances (h100)
- azure: nc h100 v5 series
- gcp: a3 instances (h100)
- lambda labs: h100 instances
- runpod: h100 pods
checking your gpu
find compute capability
# method 1: nvidia-smi
nvidia-smi --query-gpu=name,compute_cap --format=csv
# method 2: python
uv run --with=torch --index https://download.pytorch.org/whl/cu128 python -c "import torch; print(torch.cuda.get_device_capability())"
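a small script can branch on the reported capability before attempting a load (illustrative):
# illustrative: choose a loading strategy from the compute capability
import torch

major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (9, 0):
    print(f"compute {major}.{minor}: mxfp4 checkpoints load natively")
else:
    print(f"compute {major}.{minor} < 9.0: use fp16/bf16, gptq/awq/bnb, or gguf")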
compute capability reference table
| architecture | compute | example gpus | mxfp4 | fp8 |
|---|---|---|---|---|
| blackwell | 10.0 / 12.0 | b100, b200, gb200, rtx 5090 | ✅ | ✅ |
| hopper | 9.0 | h100, h200, gh200 | ✅ | ✅ |
| ada lovelace | 8.9 | rtx 4090, l40, rtx 6000 ada | ❌ | ✅ |
| ampere | 8.0-8.6 | a100, rtx 3090, a40 | ❌ | ❌ |
| turing | 7.5 | rtx 2080 ti, t4 | ❌ | ❌ |
| volta | 7.0 | v100, titan v | ❌ | ❌ |
hardware recommendations
for mxfp4 models
- minimum: nvidia h100 (80gb)
- cloud: aws p5, azure nc h100 v5
for alternative quantization
- budget: rtx 3060 12gb with ollama
- high-end: rtx 5090
cuda version requirements
mxfp4 also requires:
- cuda 12.1+ for initial support
- cuda 13.0 recommended for full optimization
- triton with mxfp4 kernels
see cuda 13 setup guide for installation.
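a quick check of the cuda and triton versions your python stack sees (illustrative):
# print the cuda and triton versions visible to python
import torch
print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
try:
    import triton
    print("triton:", triton.__version__)
except ImportError:
    print("triton not installed")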