quantization
transformers v5.0 makes quantization a first-class citizen, with a redesigned weight-loading architecture and support for 18+ quantization methods spanning 1-8 bits.
supported methods summary
| method | bits | needs calibration | peft support | best for |
|---|---|---|---|---|
| bitsandbytes | 4, 8 | no | yes | qlora fine-tuning |
| torchao | 1-8, fp8 | varies | yes | pytorch-native |
| gptqmodel | 2-8 | yes | yes | production inference |
| awq | 4 | yes | yes | fastest inference |
| hqq | 1-8 | no | yes | fast on-the-fly |
| aqlm | 1-2 | yes | yes | extreme compression |
| fp8 | fp8 | no | yes | h100+ gpus |
| gguf | 1.5-8 | no | no | cpu/llama.cpp |
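most of these methods plug into `from_pretrained`, either through a `quantization_config` you pass in or through settings baked into the checkpoint. as a quick orientation, this sketch (using the gptq repo that appears later on this page) checks whether a hub checkpoint already ships a quantization config:
from transformers import AutoConfig
# pre-quantized repos (gptq, awq, ...) record their settings in config.json
config = AutoConfig.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
print(getattr(config, "quantization_config", None))  # None for plain fp16/bf16 checkpoints
# on-the-fly methods (bitsandbytes, hqq, torchao) show nothing here:
# you supply the config yourself at load time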
bitsandbytesconfig (qlora)
optimal configuration
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # normalfloat4, designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantized matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants (~0.4 bits/param saved)
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
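a quick smoke test that the 4-bit model actually generates; a minimal sketch, assuming the matching tokenizer is loaded alongside (the prompt is arbitrary):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Quantization reduces memory by", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))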
memory savings
| model | fp16 | 8-bit | 4-bit | 4-bit + dq |
|---|---|---|---|---|
| 7b | 14 gb | 7 gb | 3.5 gb | 3.3 gb |
| 13b | 26 gb | 13 gb | 6.5 gb | 6.2 gb |
| 70b | 140 gb | 70 gb | 35 gb | 33.6 gb |
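these figures are approximate weight-only footprints; activations and the kv cache come on top. to check what your own load actually takes:
# size of the model's parameters and buffers, in bytes
print(f"model weights: {model.get_memory_footprint() / 1e9:.1f} GB")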
torchao (pytorch native)
new api
from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig
config = Int4WeightOnlyConfig(group_size=128)
torchao_config = TorchAoConfig(quant_type=config)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=torchao_config,
    device_map="auto",
)
quantization types
from torchao.quantization import (
    Int4WeightOnlyConfig,
    Int8WeightOnlyConfig,
    Float8DynamicActivationFloat8WeightConfig,
)
# int4
config = Int4WeightOnlyConfig(group_size=128)
# int8
config = Int8WeightOnlyConfig()
# fp8 dynamic (h100)
config = Float8DynamicActivationFloat8WeightConfig()
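int4 weight-only kernels generally need `torch.compile` to show their speed advantage; a rough timing sketch for the int4 model loaded above (prompt, token counts, and compile mode are illustrative, not benchmark settings):
import time
import torch
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
model.forward = torch.compile(model.forward, mode="reduce-overhead")
model.generate(**inputs, max_new_tokens=16, cache_implementation="static")  # warm-up / compile run
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128, cache_implementation="static")
new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / (time.perf_counter() - start):.1f} tok/s")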
gptq (production)
quantizing a model
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",         # calibration data
    tokenizer=tokenizer,
    group_size=128,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=gptq_config,  # quantization runs during loading
    device_map="auto",
)
model.save_pretrained("llama-2-7b-gptq")
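it's worth saving the tokenizer next to the quantized weights so the output directory is self-contained (a small addition to the snippet above):
tokenizer.save_pretrained("llama-2-7b-gptq")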
loading pre-quantized
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)
awq (fastest inference)
from transformers import AwqConfig
awq_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=2048,
    do_fuse=True,  # 2-3x decode speedup
)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization_config=awq_config,
    device_map="auto",
)
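mistral-instruct checkpoints expect their chat template, so a usage sketch for the fused model above looks like this (the prompt is arbitrary):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.2-AWQ")
messages = [{"role": "user", "content": "Summarize why 4-bit quantization helps."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))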
hqq (fast on-the-fly)
from transformers import HqqConfig
quant_config = HqqConfig(
    nbits=4,
    group_size=64,
    axis=1,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto",
)
# enable the optimized torchao int4 backend (~200 tok/s on an rtx 4090)
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model, backend="torchao_int4")
qlora fine-tuning
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
# load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,  # the 4-bit nf4 config from above
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # casts norms to fp32, enables input grads / gradient checkpointing
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# with r=64 over all seven projections this reports roughly 160M trainable params out of ~6.9B total (~2.3%)
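from here the model trains like any other peft model; a minimal, illustrative Trainer setup (the tokenized `train_dataset`, the `tokenizer`, and all hyperparameters are placeholders, not values from this page):
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
args = TrainingArguments(
    output_dir="llama-2-7b-qlora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    optim="paged_adamw_8bit",  # paged optimizer smooths memory spikes alongside 4-bit base weights
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # placeholder: your tokenized dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()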
kv cache quantization
outputs = model.generate(
    **inputs,
    cache_implementation="quantized",
    cache_config={
        "backend": "quanto",  # or "hqq"
        "nbits": 4,
    },
    max_new_tokens=256,
)
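the savings only show up at long context, where the cache dominates memory; a rough way to see the difference, assuming `inputs` holds a long prompt:
import torch
torch.cuda.reset_peak_memory_stats()
model.generate(**inputs, max_new_tokens=256)  # default bf16/fp16 kv cache
baseline = torch.cuda.max_memory_allocated()
torch.cuda.reset_peak_memory_stats()
model.generate(**inputs, cache_implementation="quantized",
               cache_config={"backend": "quanto", "nbits": 4}, max_new_tokens=256)
quantized = torch.cuda.max_memory_allocated()
print(f"peak memory: {baseline / 1e9:.2f} GB -> {quantized / 1e9:.2f} GB with a 4-bit cache")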
serving quantized models
vllm
from vllm import LLM, SamplingParams
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
)
# or fp8
llm = LLM(
    model="nm-testing/Meta-Llama-3.1-8B-Instruct-FP8",
    quantization="fp8",
    kv_cache_dtype="fp8",
)
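`SamplingParams` is imported above but never used; a minimal generation call rounds out the example (prompt and sampling values are arbitrary):
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
outputs = llm.generate(["Explain AWQ quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)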
tgi
docker run --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Llama-2-7B-AWQ \
    --quantize awq
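once the container is up, a quick smoke test against tgi's `/generate` endpoint (port 8080 as mapped above; payload values are illustrative):
import requests
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "What is AWQ?", "parameters": {"max_new_tokens": 64}},
)
print(resp.json()["generated_text"])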
selection guide
need to train?
├─ yes → bitsandbytes (qlora)
│
└─ no (inference only)
   ├─ fastest inference? → awq
   ├─ no calibration needed? → hqq or bitsandbytes
   ├─ have h100? → fp8
   ├─ extreme compression? → aqlm
   ├─ vllm deployment? → compressed-tensors
   ├─ cpu inference? → gguf
   └─ general production? → gptq or awq
performance benchmarks (mistral-7b, a100)
| method | memory | latency | throughput | accuracy drop |
|---|---|---|---|---|
| fp16 | 14.0 gb | 25ms | 100 tok/s | 0% |
| 8-bit bnb | 7.0 gb | 52ms | 140 tok/s | <0.5% |
| 4-bit gptq | 3.5 gb | 33ms | 168 tok/s | <1% |
| 4-bit awq | 3.5 gb | 28ms | 195 tok/s | <1% |
| fp8 (h100) | 7.0 gb | 22ms | 185 tok/s | <0.3% |