nvidia gpu compute capability reference
comprehensive reference of nvidia gpu compute capabilities. compute capability determines which cuda features and optimizations are available for each gpu.
source: nvidia cuda gpus official documentation
what is compute capability?
compute capability represents a gpu’s feature set and computational abilities:
- major version: indicates core architecture (e.g., 9.x = hopper)
- minor version: indicates incremental improvements within architecture
- feature support: determines available cuda features, tensor cores, fp formats
- optimization targets: compiler uses this for optimal code generation
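because capabilities compare as (major, minor) pairs, tuple comparison is the natural way to gate features in code. a minimal sketch; the parse_cap helper is hypothetical, just for illustration:
# hypothetical helper: split "8.6" (as reported by nvidia-smi) into (8, 6)
def parse_cap(cap: str) -> tuple[int, int]:
    major, minor = cap.split(".")
    return int(major), int(minor)

# tuple comparison gates features correctly across major versions
assert parse_cap("9.0") >= (8, 0)   # hopper clears an ampere requirement
assert parse_cap("8.6") < (8, 9)    # consumer ampere lacks ada features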
quick reference
checking your gpu
# nvidia-smi query
nvidia-smi --query-gpu=name,compute_cap --format=csv
# python (pytorch)
uv run --with=torch --index https://download.pytorch.org/whl/cu128 python -c "import torch; print(torch.cuda.get_device_capability())"
# cuda devicequery
# deviceQuery | grep "CUDA Capability"
# on ubuntu 25.04 with cuda installed via apt, i found this equivalent tool, which prints the capability digits (e.g. 86):
/usr/local/cuda/bin/__nvcc_device_query
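for scripting, the nvidia-smi query above can be parsed with no cuda python dependencies; a sketch:
import subprocess

# query name and compute capability as csv, one line per gpu
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
for line in out.strip().splitlines():
    name, cap = (field.strip() for field in line.rsplit(",", 1))
    print(f"{name}: compute {cap}")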
key capability thresholds
capability | milestone features |
---|---|
>= 12.0 | blackwell (rtx), 5th gen tensor cores, fp4 |
>= 10.0 | blackwell datacenter, enhanced fp8 |
>= 9.0 | mxfp4 support, fp8 tensor engine, hopper |
>= 8.9 | ada lovelace, av1 encode, 3rd gen rt cores |
>= 8.6 | ampere consumer, 2x fp32 throughput |
>= 8.0 | ampere datacenter, tf32, structured sparsity |
>= 7.5 | turing, rt cores, tensor cores int8/int4 |
>= 7.0 | volta, tensor cores fp16, independent thread scheduling |
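the same table as a lookup, handy for log messages; architecture names follow the table above:
# map exact (major, minor) capabilities from the table to architecture names
ARCHES = {
    (12, 0): "blackwell (rtx)",
    (10, 0): "blackwell (datacenter)",
    (9, 0): "hopper",
    (8, 9): "ada lovelace",
    (8, 6): "ampere (consumer)",
    (8, 0): "ampere (datacenter)",
    (7, 5): "turing",
    (7, 0): "volta",
}

def arch_name(cap: tuple[int, int]) -> str:
    return ARCHES.get(cap, "unknown/older")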
complete gpu list by compute capability
compute capability 12.0 (blackwell rtx)
professional workstation
- nvidia rtx pro 6000 blackwell server edition
- nvidia rtx pro 6000 blackwell workstation edition
- nvidia rtx pro 6000 blackwell max-q workstation edition
- nvidia rtx pro 5000 blackwell
- nvidia rtx pro 4500 blackwell
- nvidia rtx pro 4000 blackwell
consumer (rtx 50 series)
- geforce rtx 5090
- geforce rtx 5080
- geforce rtx 5070 ti
- geforce rtx 5070
- geforce rtx 5060 ti
- geforce rtx 5060
compute capability 10.0 (blackwell datacenter)
- nvidia gb200 (grace blackwell superchip)
- nvidia b200
- nvidia b100
compute capability 9.0 (hopper)
- nvidia gh200 (grace hopper superchip)
- nvidia h200 (141gb hbm3e)
- nvidia h100 (80gb hbm3; 94gb hbm3 on h100 nvl)
compute capability 8.9 (ada lovelace)
datacenter
- l40 (48gb)
- l40s (48gb)
- l4 (24gb)
professional
- rtx 6000 ada generation (48gb)
- rtx 5000 ada generation (32gb)
- rtx 4500 ada generation (24gb)
- rtx 4000 ada generation (20gb)
- rtx 4000 sff ada generation (20gb)
consumer (rtx 40 series)
- geforce rtx 4090 (24gb)
- geforce rtx 4080 super (16gb)
- geforce rtx 4080 (16gb)
- geforce rtx 4070 ti super (16gb)
- geforce rtx 4070 ti (12gb)
- geforce rtx 4070 super (12gb)
- geforce rtx 4070 (12gb)
- geforce rtx 4060 ti (8gb/16gb)
- geforce rtx 4060 (8gb)
mobile
- geforce rtx 4090 laptop
- geforce rtx 4080 laptop
- geforce rtx 4070 laptop
- geforce rtx 4060 laptop
- geforce rtx 4050 laptop
compute capability 8.6 (ampere consumer)
datacenter
- a40 (48gb)
- a10 (24gb)
- a10g (24gb)
- a16 (4x16gb)
- a2 (16gb)
professional
- rtx a6000 (48gb)
- rtx a5500 (24gb)
- rtx a5000 (24gb)
- rtx a4500 (20gb)
- rtx a4000 (16gb)
- rtx a2000 (6gb/12gb)
consumer (rtx 30 series)
- geforce rtx 3090 ti (24gb)
- geforce rtx 3090 (24gb)
- geforce rtx 3080 ti (12gb)
- geforce rtx 3080 (10gb/12gb)
- geforce rtx 3070 ti (8gb)
- geforce rtx 3070 (8gb)
- geforce rtx 3060 ti (8gb)
- geforce rtx 3060 (8gb/12gb)
- geforce rtx 3050 (6gb/8gb)
compute capability 8.0 (ampere datacenter)
- nvidia a100 (40gb/80gb hbm2e)
- nvidia a30 (24gb hbm2)
compute capability 7.5 (turing)
datacenter
- t4 (16gb)
professional
- quadro rtx 8000 (48gb)
- quadro rtx 6000 (24gb)
- quadro rtx 5000 (16gb)
- quadro rtx 4000 (8gb)
consumer (rtx 20 series)
- titan rtx (24gb)
- geforce rtx 2080 ti (11gb)
- geforce rtx 2080 super (8gb)
- geforce rtx 2080 (8gb)
- geforce rtx 2070 super (8gb)
- geforce rtx 2070 (8gb)
- geforce rtx 2060 super (8gb)
- geforce rtx 2060 (6gb/12gb)
consumer (gtx 16 series)
- geforce gtx 1660 ti (6gb)
- geforce gtx 1660 super (6gb)
- geforce gtx 1660 (6gb)
- geforce gtx 1650 super (4gb)
- geforce gtx 1650 (4gb)
compute capability 7.0 (volta)
- nvidia v100 (16gb/32gb hbm2)
- titan v (12gb)
- quadro gv100 (32gb)
compute capability 6.x (pascal)
6.1
- titan xp, titan x (pascal)
- geforce gtx 1080 ti, 1080, 1070 ti, 1070, 1060, 1050 ti, 1050
- quadro p6000, p5000, p4000, p2000, p1000, p600, p400
6.0
- tesla p100 (12gb/16gb)
- quadro gp100
compute capability 5.x (maxwell)
5.3
- jetson tx1, nano, tegra x1
5.2
- geforce gtx titan x, 980 ti, 980, 970, 960, 950
- quadro m6000, m5000, m4000
5.0
- geforce gtx 750 ti, 750
- quadro k2200, k1200, k620
feature support by capability
feature | min capability | notes |
---|---|---|
mxfp4 quantization | 9.0 | hopper and newer only |
fp8 tensor cores | 9.0 | transformer engine |
av1 encode | 8.9 | ada lovelace |
rt cores (2nd gen) | 8.6 | improved ray tracing |
3rd gen tensor cores | 8.0 | ampere; tf32, bf16, fp64 |
structured sparsity | 8.0 | 2:4 sparsity patterns |
tf32 tensor cores | 8.0 | automatic mixed precision |
bfloat16 | 8.0 | brain floating point |
multi-instance gpu (mig) | 8.0 | a100/a30 and hopper/blackwell datacenter gpus |
int8/int4 tensor cores | 7.5 | turing 2nd gen tensor cores |
rt cores (1st gen) | 7.5 | real-time ray tracing |
tensor cores (1st gen) | 7.0 | fp16 mixed precision |
independent thread scheduling | 7.0 | volta cooperative groups |
unified memory | 6.0 | pascal page migration engine |
nvlink | 6.0 | high-speed interconnect |
dynamic parallelism | 3.5 | kernel launch from device |
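in pytorch these thresholds translate directly into runtime gates; a sketch assuming a cuda build of torch:
import torch

major, minor = torch.cuda.get_device_capability()
if (major, minor) >= (8, 0):
    # tf32 tensor cores exist from compute 8.0; opt in for matmuls/convolutions
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
print("bf16 supported:", torch.cuda.is_bf16_supported())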
cuda version compatibility
compute capability | minimum cuda | recommended cuda | dropped in cuda |
---|---|---|---|
12.0 | 12.8 | 13.0 | - |
10.0 | 12.8 | 13.0 | - |
9.0 | 11.8 | 13.0 | - |
8.9 | 11.8 | 12.x/13.0 | - |
8.6 | 11.1 | 12.x | - |
8.0 | 11.0 | 12.x | - |
7.5 | 10.0 | 12.x | - |
7.0 | 9.0 | 11.x | 13.0 |
6.x | 8.0 | 11.x | 13.0 |
5.x | 6.5 | 11.x | 13.0 |
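a torch build only ships kernels for the architectures it was compiled against, which you can inspect directly:
import torch

print("cuda runtime version:", torch.version.cuda)
print("compiled for:", torch.cuda.get_arch_list())  # e.g. ['sm_80', 'sm_86', 'sm_90']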
common use cases
deep learning frameworks
pytorch requirements
- minimum: compute 3.7 (kepler k80) on older releases; recent prebuilt wheels target newer architectures
- recommended: compute 7.0+ (tensor cores)
- optimal: compute 8.0+ (ampere features)
tensorflow requirements
- minimum: compute 3.5
- recommended: compute 7.0+ (mixed precision)
- optimal: compute 8.0+ (tf32 automatic)
specific model requirements
llm inference
- flash attention: compute >= 7.5
- flash attention 2: compute >= 8.0
- mxfp4 models (gpt-oss): compute >= 9.0
- fp8 models: compute >= 9.0
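the flash attention thresholds above can drive model loading; a sketch using transformers, assuming the flash-attn package is installed and with "model-name" as a placeholder:
import torch
from transformers import AutoModelForCausalLM

cap = torch.cuda.get_device_capability()
# flash attention 2 needs compute >= 8.0; fall back to pytorch sdpa below that
model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2" if cap >= (8, 0) else "sdpa",
    torch_dtype=torch.bfloat16 if cap >= (8, 0) else torch.float16,
)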
vision models
- stable diffusion: compute >= 6.1 recommended
- sam (segment anything): compute >= 7.0
- flux: compute >= 7.5
troubleshooting
common errors
“gpu not supported”
# check if gpu is too old for framework
nvidia-smi --query-gpu=name,compute_cap --format=csv
# verify against framework requirements
“mxfp4 requires compute >= 9.0”
# use alternative quantization
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "model-name",
    load_in_4bit=True,  # bitsandbytes nf4 instead of mxfp4
    bnb_4bit_compute_dtype=torch.float16,
)
“cuda arch not found”
# specify arch during compilation
export TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;8.9;9.0"
pip install --no-cache-dir package-name
references
- nvidia cuda gpus (official compute capability list): https://developer.nvidia.com/cuda-gpus