mxfp4 quantization and gpu compute requirements

mxfp4 quantization requires specific gpu hardware with compute capability >= 9.0. this page explains the requirements and provides solutions for the common error when attempting to run mxfp4 quantized models on incompatible hardware.

the error

when trying to load gpt-oss models or other mxfp4 quantized models with transformers, you may encounter:


ValueError: MXFP4 quantized models is only supported on GPUs with compute capability >= 9.0 (e.g H100, or B100)

this error occurs because mxfp4 (microscaling fp4) quantization relies on tensor core features that are only available on hopper-architecture gpus and newer.

what is mxfp4?

mxfp4 is a 4-bit microscaling floating-point quantization format, defined in the ocp microscaling formats specification, designed for efficient inference of large language models. it enables the following (a code sketch of the scaling scheme appears after the list):

  • 4-bit weights: reducing memory footprint by ~4x
  • mixed precision: weights are stored in fp4 while activations and non-quantized layers stay in higher precision
  • tensor core acceleration: leveraging specialized hardware for fp4 operations
  • dynamic scaling: a shared scale factor per 32-element block preserves accuracy
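
to make the block-scaling idea concrete, here is a minimal sketch of quantizing a single block. the fp4 (e2m1) value grid and the 32-element block size come from the ocp mx spec; the scale-selection and rounding rules here are simplified for illustration and are not what production kernels do.

import torch

# magnitudes representable in fp4 (e2m1)
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x):
    # one shared power-of-two scale for the whole 32-element block,
    # chosen so the largest magnitude fits under the fp4 maximum (6.0)
    amax = x.abs().max()
    scale = 2.0 ** torch.ceil(torch.log2(amax / 6.0)) if amax > 0 else torch.tensor(1.0)
    scaled = x / scale
    # snap each element to the nearest fp4 magnitude, keeping its sign
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return FP4_GRID[idx] * torch.sign(scaled), scale

x = torch.randn(32)
q, scale = quantize_block(x)
print("max abs error:", (x - q * scale).abs().max().item())  # dequantize: q * scale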

gpu compute capability requirements

gpus supporting mxfp4 (compute >= 9.0)

from the official nvidia cuda gpus list, these gpus support mxfp4:

compute capability 9.0 (hopper)

  • nvidia gh200 - grace hopper superchip
  • nvidia h200 - enhanced h100 with 141gb hbm3e
  • nvidia h100 - flagship datacenter gpu (80gb/94gb variants)

compute capability 10.0 (blackwell)

  • nvidia gb200 - grace blackwell superchip
  • nvidia b200 - next-gen datacenter gpu
  • nvidia b100 - datacenter accelerator

compute capability 12.0 (blackwell rtx)

  • professional:
    • nvidia rtx pro 6000 blackwell (server/workstation editions)
    • nvidia rtx pro 5000 blackwell
    • nvidia rtx pro 4500 blackwell
    • nvidia rtx pro 4000 blackwell
  • consumer (rtx 50 series):
    • geforce rtx 5090
    • geforce rtx 5080
    • geforce rtx 5070 ti
    • geforce rtx 5070
    • geforce rtx 5060 ti
    • geforce rtx 5060

gpus that cannot run mxfp4

these popular gpus have compute capability < 9.0 and cannot run mxfp4 models:

compute capability 8.9 (ada lovelace)

  • rtx 4090, rtx 4080, rtx 4070 ti, rtx 4070, rtx 4060
  • rtx 6000 ada, rtx 5000 ada, rtx 4000 ada
  • l40, l40s, l4

compute capability 8.6 (ampere)

  • rtx 3090, rtx 3080, rtx 3070, rtx 3060
  • rtx a6000, rtx a5000, rtx a4000
  • a40, a10, a16

compute capability 8.0 (ampere datacenter)

  • a100 (40gb/80gb)
  • a30

compute capability 7.5 (turing)

  • rtx 2080 ti, rtx 2080, rtx 2070, rtx 2060
  • quadro rtx 8000, rtx 6000, rtx 5000

solutions and workarounds

1. use non-quantized models

load the model without mxfp4 quantization:

import torch
from transformers import AutoModelForCausalLM

# this will fail on compute < 9.0 because the checkpoint ships mxfp4 weights
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype="auto"  # attempts mxfp4
)

# use fp16/bf16 instead
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True
)
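
depending on your transformers version, there may also be a dedicated escape hatch: recent releases document an Mxfp4Config with a dequantize option that loads the mxfp4 checkpoint and expands the weights to higher precision on hardware without fp4 support. treat this as an assumption to verify against your installed version:

from transformers import AutoModelForCausalLM, Mxfp4Config

# assumes your transformers version exposes Mxfp4Config -- check first
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    quantization_config=Mxfp4Config(dequantize=True),  # expand fp4 weights
)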

2. use alternative quantization

for gpus with compute < 9.0, use these quantization methods:

  • gptq: 4-bit quantization, widely supported
  • awq: activation-aware quantization
  • bnb (bitsandbytes): 4-bit/8-bit quantization (see the sketch after this list)
  • gguf: llama.cpp format for cpu/gpu inference
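
as a concrete example, here is a bitsandbytes 4-bit (nf4) load. BitsAndBytesConfig is a standard transformers api; whether a given checkpoint (gpt-oss included) loads cleanly this way depends on the architecture and your installed versions, so treat the model id as a placeholder:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# nf4 4-bit quantization -- runs on ampere/ada gpus (compute 8.0-8.9)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",  # placeholder -- any causal lm checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)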

3. use ollama with gguf

ollama provides pre-quantized gguf versions, which run through llama.cpp's own kernels and therefore don't need compute 9.0:

# use gguf quantized version
ollama pull gpt-oss:20b
ollama run gpt-oss:20b

4. cloud solutions

if you need mxfp4 performance, use cloud providers with h100/h200:

  • aws: p5 instances (h100)
  • azure: nc h100 v5 series
  • gcp: a3 instances (h100)
  • lambda labs: h100 instances
  • runpod: h100 pods

checking your gpu

find compute capability

# method 1: nvidia-smi
nvidia-smi --query-gpu=name,compute_cap --format=csv

# method 2: python
uv run --with=torch --index https://download.pytorch.org/whl/cu128 python -c "import torch; print(torch.cuda.get_device_capability())"
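
to turn the capability tuple into a direct yes/no answer, a throwaway check (not part of any library) can compare against the 9.0 threshold:

import torch

# compare this gpu's compute capability against the mxfp4 threshold
major, minor = torch.cuda.get_device_capability()
print(f"compute capability: {major}.{minor}")
if (major, minor) >= (9, 0):
    print("mxfp4 supported")
else:
    print("mxfp4 not supported -- use fp16/bf16 or another quantization method")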

compute capability reference table

architecture   compute   example gpus                  mxfp4   fp8
blackwell      10.0+     b100, b200, gb200             yes     yes
hopper         9.0       h100, h200, gh200             yes     yes
ada lovelace   8.9       rtx 4090, l40, rtx 6000 ada   no      yes
ampere         8.0-8.6   a100, rtx 3090, a40           no      no
turing         7.5       rtx 2080 ti, t4               no      no
volta          7.0       v100, titan v                 no      no

hardware recommendations

for mxfp4 models

  • minimum: nvidia h100 (80gb)
  • cloud: aws p5, azure nc h100 v5

for alternative quantization

  • budget: rtx 3060 12gb with ollama
  • high-end: rtx 5090

cuda version requirements

mxfp4 also requires the following (a quick version check appears after the list):

  • cuda 12.1+ for initial support
  • cuda 13.0 recommended for full optimization
  • triton with mxfp4 kernels
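
before debugging hardware, confirm what the software side actually has (torch.version.cuda and triton.__version__ are standard attributes):

import torch
import triton

# cuda version torch was built against, and the installed triton version
print("cuda:", torch.version.cuda)
print("triton:", triton.__version__)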

see cuda 13 setup guide for installation.
