quantization

transformers v5.0 makes quantization a first-class citizen, with a redesigned weight-loading architecture and support for 18+ quantization methods spanning 1 to 8 bits.

supported methods summary

| method       | bits     | calibration | peft | best for             |
|--------------|----------|-------------|------|----------------------|
| bitsandbytes | 4, 8     | no          | yes  | qlora fine-tuning    |
| torchao      | 1-8, fp8 | varies      | yes  | pytorch-native       |
| gptqmodel    | 2-8      | yes         | yes  | production inference |
| awq          | 4        | yes         | yes  | fastest inference    |
| hqq          | 1-8      | no          | yes  | fast on-the-fly      |
| aqlm         | 1-2      | yes         | yes  | extreme compression  |
| fp8          | fp8      | no          | yes  | h100+ gpus           |
| gguf         | 1.5-8    | no          | no   | cpu/llama.cpp        |

bitsandbytesconfig (qlora)

optimal configuration

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
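
the same config class also covers plain 8-bit loading, which trades a larger footprint for an even smaller accuracy drop; a minimal variant of the setup above:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit variant: roughly half of fp16 memory, typically <0.5% accuracy drop
bnb_config_8bit = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config_8bit,
    device_map="auto",
)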

memory savings

| model | fp16   | 8-bit | 4-bit  | 4-bit + double quant |
|-------|--------|-------|--------|----------------------|
| 7b    | 14 gb  | 7 gb  | 3.5 gb | 3.3 gb               |
| 13b   | 26 gb  | 13 gb | 6.5 gb | 6.2 gb               |
| 70b   | 140 gb | 70 gb | 35 gb  | 33.6 gb              |
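
these figures follow almost directly from bits per parameter (weights only; activations and the kv cache come on top), and double quantization additionally compresses the quantization constants themselves, which is where the last few hundred mb are saved. a rough sketch:

def weight_memory_gb(n_params_billion, bits_per_param):
    # 1e9 parameters * bits / 8 bits per byte, reported in gb
    return n_params_billion * bits_per_param / 8

print(weight_memory_gb(7, 16))  # 14.0 (fp16)
print(weight_memory_gb(7, 8))   # 7.0  (8-bit)
print(weight_memory_gb(7, 4))   # 3.5  (4-bit)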

torchao (pytorch native)

new api

from transformers import AutoModelForCausalLM, TorchAoConfig
from torchao.quantization import Int4WeightOnlyConfig

config = Int4WeightOnlyConfig(group_size=128)
torchao_config = TorchAoConfig(quant_type=config)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=torchao_config,
    device_map="auto",
)
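
torchao's weight-only kernels are generally meant to be paired with torch.compile, which is where most of the decode speedup comes from; a hedged sketch (actual gains depend on gpu and torch version):

import torch

# compile once; the first forward pass triggers compilation,
# later generate() calls reuse the compiled kernels
model = torch.compile(model, mode="max-autotune")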

quantization types

from torchao.quantization import (
    Int4WeightOnlyConfig,
    Int8WeightOnlyConfig,
    Float8DynamicActivationFloat8WeightConfig,
)

# int4
config = Int4WeightOnlyConfig(group_size=128)

# int8
config = Int8WeightOnlyConfig()

# fp8 dynamic (h100)
config = Float8DynamicActivationFloat8WeightConfig()

gptq (production)

quantizing a model

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,
    group_size=128,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=gptq_config,
    device_map="auto",
)

model.save_pretrained("llama-2-7b-gptq")
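# save the tokenizer alongside the quantized weights so the folder loads end-to-end
tokenizer.save_pretrained("llama-2-7b-gptq")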

loading pre-quantized

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)

awq (fastest inference)

from transformers import AutoModelForCausalLM, AwqConfig

awq_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=2048,
    do_fuse=True,  # 2-3x decode speedup
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization_config=awq_config,
    device_map="auto",
)
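
with fused modules enabled, generation works as usual as long as the prompt plus new tokens stays within fuse_max_seq_len; a minimal check (tokenizer assumed to match the checkpoint):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Mistral-7B-Instruct-v0.2-AWQ")
inputs = tokenizer("[INST] summarize awq in one sentence [/INST]", return_tensors="pt").to(model.device)

# total sequence length must stay under fuse_max_seq_len (2048 above)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))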

hqq (fast on-the-fly)

from transformers import AutoModelForCausalLM, HqqConfig

quant_config = HqqConfig(
    nbits=4,
    group_size=64,
    axis=1,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quant_config,
    device_map="auto",
)

# enable an optimized inference backend via the hqq package's patching utilities
from hqq.utils.patching import prepare_for_inference
prepare_for_inference(model, backend="torchao_int4")  # ~200 tok/s on a 4090

qlora fine-tuning

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# load the 4-bit model (reuses bnb_config from the bitsandbytes section above)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                   "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# reports trainable vs. total parameter counts; with r=64 on all seven
# projections, only a few percent of the weights are trained
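
from here training runs through the standard Trainer; a minimal sketch, assuming train_dataset is a pre-tokenized causal-lm dataset (hyperparameters are illustrative, not tuned):

from transformers import (
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

training_args = TrainingArguments(
    output_dir="llama-2-7b-qlora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumed: a pre-tokenized causal-lm dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()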

kv cache quantization

outputs = model.generate(
    **inputs,
    cache_implementation="quantized",
    cache_config={
        "backend": "quanto",  # or "hqq"
        "nbits": 4,
    },
    max_new_tokens=256
)
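
quantizing the cache pays off when context length, not the weights, dominates memory. a quick estimate for a llama-2-7b-sized model (32 layers, 32 kv heads, head dim 128, no gqa):

def kv_cache_gb(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # 2x for keys and values, per sequence in the batch
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

print(kv_cache_gb(4096))                      # ~2.1 gb per sequence in fp16
print(kv_cache_gb(4096, bytes_per_elem=0.5))  # ~0.5 gb with a 4-bit cache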

serving quantized models

vllm

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
)

# or fp8
llm = LLM(
    model="nm-testing/Meta-Llama-3.1-8B-Instruct-FP8",
    quantization="fp8",
    kv_cache_dtype="fp8",
)
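
the SamplingParams import above comes into play at generation time; a minimal usage sketch:

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

outputs = llm.generate(["explain awq quantization in one sentence"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)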

tgi

docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/Llama-2-7B-AWQ \
  --quantize awq
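
once the container is up, the server exposes a plain http generation endpoint; a quick smoke test from python against the port mapped above:

import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "what is awq quantization?", "parameters": {"max_new_tokens": 64}},
)
print(response.json()["generated_text"])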

selection guide

need to train?
├─ yes → bitsandbytes (qlora)
│
└─ no (inference only)
    ├─ fastest inference? → awq
    ├─ no calibration needed? → hqq or bitsandbytes
    ├─ have h100? → fp8
    ├─ extreme compression? → aqlm
    ├─ vllm deployment? → compressed-tensors
    ├─ cpu inference? → gguf
    └─ general production? → gptq or awq

performance benchmarks (mistral-7b, a100)

| method     | memory  | latency | throughput | accuracy drop |
|------------|---------|---------|------------|---------------|
| fp16       | 14.0 gb | 25 ms   | 100 tok/s  | 0%            |
| 8-bit bnb  | 7.0 gb  | 52 ms   | 140 tok/s  | <0.5%         |
| 4-bit gptq | 3.5 gb  | 33 ms   | 168 tok/s  | <1%           |
| 4-bit awq  | 3.5 gb  | 28 ms   | 195 tok/s  | <1%           |
| fp8 (h100) | 7.0 gb  | 22 ms   | 185 tok/s  | <0.3%         |
