# Serving and continuous batching
Transformers v5.0 introduces two major inference features: `transformers serve`, an OpenAI API-compatible server, and continuous batching with paged attention for advanced request scheduling.
## Installation
```bash
pip install "transformers[serving]"         # core serving dependencies
pip install "transformers[open-telemetry]"  # optional: monitoring for the transformers serve command
```
## Basic usage
```bash
# start a simple server
transformers serve

# with continuous batching
transformers serve --continuous-batching --attn_implementation "sdpa"

# GPU with FlashAttention-2
transformers serve \
  --device cuda \
  --attn_implementation "flash_attention_2" \
  --dtype bfloat16 \
  --continuous-batching

# 4-bit quantized
transformers serve \
  --device cuda \
  --quantization bnb-4bit \
  --continuous-batching
```

## Command line options
| flag | default | description |
|---|---|---|
| `--device` | `"cpu"` | target device (`cpu`, `cuda`, `cuda:0`) |
| `--dtype` | `"auto"` | model dtype |
| `--host` | `"localhost"` | server host |
| `--port` | `8000` | server port |
| `--continuous-batching` | `false` | enable continuous batching |
| `--attn_implementation` | auto | attention backend |
| `--quantization` | none | `bnb-4bit`, `bnb-8bit` |
## Continuous batching
### How it works
- dynamic scheduling: incoming requests are grouped into batches dynamically
- interleaved processing: prefill and decode phases are interleaved across requests
- immediate resource release: completed sequences free their resources instantly
- chunked prefill: long prompts are split into chunks for memory efficiency (sketched below)
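
The toy loop below illustrates these four behaviors without any model. It is a sketch of the idea, not the actual transformers scheduler; `Request`, `MAX_BATCH_TOKENS`, `step_cost`, and the prefill chunk size of 2 are made-up illustration values.

```python
# toy continuous-batching scheduler (illustration only)
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    prompt_tokens: int   # prompt tokens still to prefill
    new_tokens: int      # completion tokens still to decode

MAX_BATCH_TOKENS = 8     # per-step token budget (cf. max_batch_tokens)

def step_cost(req: Request) -> int:
    """Tokens this request contributes to the next forward pass."""
    return min(req.prompt_tokens, 2) if req.prompt_tokens > 0 else 1

waiting = deque(Request(f"req_{i}", prompt_tokens=5, new_tokens=3) for i in range(4))
active: list[Request] = []

step = 0
while waiting or active:
    # dynamic scheduling: admit requests while the batch fits the token budget
    while waiting and sum(map(step_cost, active)) + step_cost(waiting[0]) <= MAX_BATCH_TOKENS:
        active.append(waiting.popleft())

    # one forward pass interleaves prefill chunks and single decode tokens
    for req in active:
        if req.prompt_tokens > 0:
            req.prompt_tokens -= min(req.prompt_tokens, 2)  # chunked prefill
        else:
            req.new_tokens -= 1                             # decode one token

    # immediate resource release: finished sequences leave the batch right away
    finished = [r.rid for r in active if r.prompt_tokens == 0 and r.new_tokens == 0]
    active = [r for r in active if r.rid not in finished]
    step += 1
    print(f"step {step}: active={len(active)} finished={finished}")
```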
### Python API: `generate_batch`
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct",
    attn_implementation="sdpa_paged",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct")

# prepare batch
texts = ["What is machine learning?", "Explain paged attention in one sentence."]
batch_inputs = [tokenizer(text)["input_ids"] for text in texts]

generation_config = GenerationConfig(
    max_new_tokens=32,
    max_batch_tokens=512,  # memory constraint
    eos_token_id=tokenizer.eos_token_id,
)

# generate with continuous batching
results = model.generate_batch(
    inputs=batch_inputs,
    generation_config=generation_config,
)
```

### Python API: `ContinuousBatchingManager`
```python
manager = model.init_continuous_batching(generation_config=generation_config)
manager.start()

# add requests asynchronously
for i, input_ids in enumerate(batch_inputs):
    manager.add_request(input_ids=input_ids, request_id=f"req_{i}")

# retrieve results
for request_id, request in manager.get_result():
    text = tokenizer.decode(request.generated_tokens)
    print(f"{request_id}: {text}")

# streaming (reuses the last prompt from the loop above)
manager.add_request(input_ids=input_ids, request_id="stream", stream=True)
for chunk in manager.request_id_iter(request_id="stream"):
    print(tokenizer.decode(chunk.generated_tokens), end="", flush=True)

manager.stop()
```

## Paged attention
### Concept
PagedAttention applies virtual memory concepts to the KV cache:
- non-contiguous memory allocation via fixed-size "pages"
- reduces memory waste from roughly 60-80% to under 4%
- pages are allocated dynamically as generation proceeds (sketched below)
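
The sketch below shows the core bookkeeping idea with made-up sizes (`PAGE_SIZE`, the 64-page pool, and the helper names are illustration values, not the library's allocator): each sequence owns a small page table, physical pages are taken from a shared free list only when needed, and a finished sequence returns its pages immediately, so at most one partially filled page per sequence is wasted.

```python
# toy paged KV-cache allocator (illustration only)
PAGE_SIZE = 16                              # tokens per page
free_pages = list(range(64))                # physical page ids in the cache pool
page_tables: dict[str, list[int]] = {}      # sequence id -> its pages, in order

def append_token(seq_id: str, num_tokens_so_far: int) -> int:
    """Return the physical page holding the new token, allocating on demand."""
    table = page_tables.setdefault(seq_id, [])
    if num_tokens_so_far % PAGE_SIZE == 0:  # current page is full (or first token)
        table.append(free_pages.pop())      # grab a page only when it is needed
    return table[-1]

def release(seq_id: str) -> None:
    """Sequence finished: return all of its pages to the pool immediately."""
    free_pages.extend(page_tables.pop(seq_id, []))

# a sequence grows page by page; no contiguous region is reserved up front
for t in range(40):
    append_token("seq_a", t)
print(f"seq_a uses {len(page_tables['seq_a'])} pages for 40 tokens")  # 3 pages
release("seq_a")
```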
### Enabling paged attention
```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="sdpa_paged",  # enables paged attention
    device_map="cuda",
)
```

or via the CLI:

```bash
transformers serve --attn_implementation "sdpa_paged" --continuous-batching
```

## API endpoints
### Chat completions (OpenAI-compatible)
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "stream": true
  }'
```

### Python client
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="",
)

# non-streaming
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=500,
)

# streaming
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:  # the final chunk may carry no content
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### Vision request
```python
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "http://example.com/image.jpg"}},
        ],
    }],
    max_tokens=300,
)
```

## OpenTelemetry monitoring
```bash
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
export OTEL_SERVICE_NAME="transformers-serve"

transformers serve --continuous-batching
```

Monitored metrics:
- request queue length
- active batch size
- token throughput
- time to first token (ttft)
- request completion time
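
These metrics are exported by the server over OTLP. As a quick client-side cross-check, you can measure time to first token and streaming throughput yourself with the OpenAI client from the previous section; this is a local measurement sketch (chunks are used as a rough proxy for tokens), not a reading of the server's telemetry.

```python
# client-side TTFT and throughput measurement (sketch)
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="")

start = time.perf_counter()
first_content_at = None
chunks = 0

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        if first_content_at is None:
            first_content_at = time.perf_counter()  # first content chunk received
        chunks += 1

elapsed = time.perf_counter() - start
if first_content_at is not None:
    print(f"time to first token: {first_content_at - start:.3f}s")
print(f"throughput: {chunks / elapsed:.1f} chunks/s over {elapsed:.2f}s")
```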
## Performance optimization
### Attention selection
| implementation | use case |
|---|---|
| `sdpa` | default, PyTorch 2.1.1+ |
| `sdpa_paged` | best memory efficiency |
| `flash_attention_2` | fastest compute |
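
A common pattern is to pick the fastest backend that is actually installed. The sketch below shows one way to do that when loading a model in Python; the availability check and the fallback order are assumptions for illustration, not part of the serve CLI.

```python
# choose an attention backend based on what is installed (sketch)
import importlib.util
from transformers import AutoModelForCausalLM

attn_impl = (
    "flash_attention_2"   # fastest compute, requires the flash-attn package
    if importlib.util.find_spec("flash_attn") is not None
    else "sdpa_paged"     # best memory efficiency, no extra dependency
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",
    attn_implementation=attn_impl,
    device_map="cuda",
)
```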
### Data type
```bash
transformers serve --dtype bfloat16  # up to 2x faster than float32, half the memory
```

### Batch tuning
```python
generation_config = GenerationConfig(
    max_batch_tokens=512,  # lower for less memory
)
```

## Comparison with vLLM/SGLang
| feature | transformers | vLLM | SGLang |
|---|---|---|---|
| continuous batching | yes | yes | yes |
| paged attention | yes | yes | yes |
| tensor parallelism | no | yes | yes |
| performance | moderate | high | highest |
| ease of use | highest | high | high |
| best for | dev/eval | production | production |
- use `transformers serve` for evaluation, development, and moderate load
- use vLLM or SGLang for production workloads above 100 requests per second