mxfp4 quantization and gpu compute requirements
mxfp4 quantization requires gpu hardware with compute capability >= 9.0 (hopper or newer). this page explains the requirement and provides workarounds for the common error raised when loading mxfp4 quantized models on older hardware.
the error
when trying to load gpt-oss models or other mxfp4 quantized models with transformers, you may encounter:
ValueError: MXFP4 quantized models is only supported on GPUs with compute capability >= 9.0 (e.g H100, or B100)
this error occurs because mxfp4 (microscaling fp4) quantization depends on tensor core and kernel support that is only available on the hopper architecture and newer.
what is mxfp4?
mxfp4 is a 4-bit microscaling floating-point format, defined in the ocp microscaling formats specification, used for efficient inference of large language models. it provides:
- 4-bit weights: reducing memory footprint by ~4x versus fp16
- e2m1 elements: each value is a tiny float with 1 sign bit, 2 exponent bits, and 1 mantissa bit
- tensor core acceleration: leveraging specialized hardware support for fp4 operations
- block scaling: every 32-element block shares a power-of-two (e8m0) scale factor that preserves dynamic range (sketched in code below)
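the following is a rough numpy sketch of how one block could be quantized; it is illustrative only, and real kernels pack bits and handle rounding and saturation differently:
# illustrative mxfp4-style block quantization, not the production kernel
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable fp4 magnitudes

def quantize_mxfp4_block(block):
    # shared power-of-two (e8m0-style) scale so the largest magnitude lands near 6.0
    amax = np.abs(block).max()
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2.0) if amax > 0 else 1.0
    scaled = block / scale
    # snap each element to the nearest representable e2m1 magnitude, keeping the sign
    idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx], scale

x = np.random.randn(32).astype(np.float32)  # one 32-element block
q, s = quantize_mxfp4_block(x)
print("max abs error:", np.abs(x - q * s).max())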
gpu compute capability requirements
gpus supporting mxfp4 (compute >= 9.0)
from the official nvidia cuda gpus list, these gpus support mxfp4:
compute capability 9.0 (hopper)
- nvidia gh200 - grace hopper superchip
- nvidia h200 - enhanced h100 with 141gb hbm3e
- nvidia h100 - flagship datacenter gpu (80gb sxm/pcie and 94gb nvl variants)
compute capability 10.0 (blackwell)
- nvidia gb200 - grace blackwell superchip
- nvidia b200 - next-gen datacenter gpu
- nvidia b100 - datacenter accelerator
compute capability 12.0 (blackwell rtx)
- professional:
  - nvidia rtx pro 6000 blackwell (server/workstation editions)
  - nvidia rtx pro 5000 blackwell
  - nvidia rtx pro 4500 blackwell
  - nvidia rtx pro 4000 blackwell
- consumer (rtx 50 series):
  - geforce rtx 5090
  - geforce rtx 5080
  - geforce rtx 5070 ti
  - geforce rtx 5070
  - geforce rtx 5060 ti
  - geforce rtx 5060
gpus that cannot run mxfp4
these popular gpus have compute capability < 9.0 and cannot run mxfp4 models natively:
compute capability 8.9 (ada lovelace)
- rtx 4090, rtx 4080, rtx 4070 ti, rtx 4070, rtx 4060
- rtx 6000 ada, rtx 5000 ada, rtx 4000 ada
- l40, l40s, l4
compute capability 8.6 (ampere)
- rtx 3090, rtx 3080, rtx 3070, rtx 3060
- rtx a6000, rtx a5000, rtx a4000
- a40, a10, a16
compute capability 8.0 (ampere datacenter)
- a100 (40gb/80gb)
- a30
compute capability 7.5 (turing)
- rtx 2080 ti, rtx 2080, rtx 2070, rtx 2060
- quadro rtx 8000, rtx 6000, rtx 5000
solutions and workarounds
1. use non-quantized models
load the model without mxfp4 quantization:
import torch
from transformers import AutoModelForCausalLM

# this fails on compute < 9.0 because the checkpoint is mxfp4 quantized
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype="auto",  # picks up the checkpoint's mxfp4 config
)

# load in fp16 instead (weights are dequantized, so vram usage is higher)
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
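recent transformers releases with gpt-oss support also expose an Mxfp4Config whose dequantize option loads the mxfp4 checkpoint as higher-precision weights on unsupported gpus; this is version-dependent, so verify it against your installed transformers before relying on it:
# version-dependent: Mxfp4Config(dequantize=True) ships with recent
# transformers releases that support gpt-oss; check your installed version
from transformers import AutoModelForCausalLM, Mxfp4Config

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    quantization_config=Mxfp4Config(dequantize=True),  # dequantize instead of erroring
    torch_dtype="auto",
)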
2. use alternative quantization
for gpus with compute < 9.0, use these quantization methods:
- gptq: 4-bit quantization, widely supported
- awq: activation-aware quantization
- bnb (bitsandbytes): 4-bit/8-bit quantization (see the example after this list)
- gguf: llama.cpp format for cpu/gpu inference
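as an illustration, 4-bit bitsandbytes loading looks like this; whether a given architecture (including gpt-oss) is bnb-compatible depends on the model and library versions:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# standard transformers + bitsandbytes 4-bit config; model support varies
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normalized float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",  # swap in any bnb-compatible checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)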
3. use ollama with gguf
ollama provides pre-quantized versions:
# use gguf quantized version
ollama pull gpt-oss:20b
ollama run gpt-oss:20b
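once the model is pulled, you can also call ollama's local rest api; a minimal sketch using python's requests, assuming ollama is serving on its default port 11434:
# minimal sketch against ollama's local rest api (default port 11434)
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "gpt-oss:20b", "prompt": "hello", "stream": False},
)
print(resp.json()["response"])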
4. cloud solutions
if you need mxfp4 performance, use cloud providers with h100/h200:
- aws: p5 instances (h100)
- azure: nc h100 v5 series
- gcp: a3 instances (h100)
- lambda labs: h100 instances
- runpod: h100 pods
checking your gpu
find compute capability
# method 1: nvidia-smi
nvidia-smi --query-gpu=name,compute_cap --format=csv
# method 2: python
uv run --with=torch --index https://download.pytorch.org/whl/cu128 python -c "import torch; print(torch.cuda.get_device_capability())"
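a small script can branch on the reported capability before attempting a load (illustrative):
# illustrative: choose a loading strategy from the compute capability
import torch

major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (9, 0):
    print(f"compute {major}.{minor}: mxfp4 checkpoints load natively")
else:
    print(f"compute {major}.{minor} < 9.0: use fp16/bf16, gptq/awq/bnb, or gguf")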
compute capability reference table
| architecture | compute | example gpus | mxfp4 | fp8 |
|---|---|---|---|---|
| blackwell | 10.0 / 12.0 | b100, b200, gb200, rtx 5090 | ✅ | ✅ |
| hopper | 9.0 | h100, h200, gh200 | ✅ | ✅ |
| ada lovelace | 8.9 | rtx 4090, l40, rtx 6000 ada | ❌ | ✅ |
| ampere | 8.0-8.6 | a100, rtx 3090, a40 | ❌ | ❌ |
| turing | 7.5 | rtx 2080 ti, t4 | ❌ | ❌ |
| volta | 7.0 | v100, titan v | ❌ | ❌ |
hardware recommendations
for mxfp4 models
- minimum: nvidia h100 (80gb)
- cloud: aws p5, azure nc h100 v5
for alternative quantization
- budget: rtx 3060 12gb with ollama
- high-end: rtx 5090
cuda version requirements
mxfp4 also requires:
- cuda 12.1+ for initial support
- cuda 13.0 recommended for full optimization
- triton with mxfp4 kernels
see cuda 13 setup guide for installation.
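a quick check of the cuda and triton versions your python stack sees (illustrative):
# print the cuda and triton versions visible to python
import torch
print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
try:
    import triton
    print("triton:", triton.__version__)
except ImportError:
    print("triton not installed")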