serving and continuous batching

transformers v5.0 introduces two major inference additions: transformers serve, an OpenAI API-compatible server, and continuous batching with paged attention for efficient request scheduling.

installation

pip install transformers[serving]  # core serving
pip install transformers[open-telemetry]  # monitoring

the transformers serve command

basic usage

# start simple server
transformers serve

# with continuous batching
transformers serve --continuous-batching --attn_implementation "sdpa"

# gpu with flash attention 2
transformers serve \
  --device cuda \
  --attn_implementation "flash_attention_2" \
  --dtype bfloat16 \
  --continuous-batching

# 4-bit quantized
transformers serve \
  --device cuda \
  --quantization bnb-4bit \
  --continuous-batching

command line options

flag                     default        description
--device                 "cpu"          device (cpu, cuda, cuda:0)
--dtype                  "auto"         model dtype
--host                   "localhost"    server host
--port                   8000           server port
--continuous-batching    false          enable continuous batching
--attn_implementation    auto           attention backend
--quantization           none           bnb-4bit or bnb-8bit

continuous batching

how it works

  1. dynamic scheduling: requests grouped into batches dynamically
  2. interleaved processing: prefill and decode phases interleaved
  3. immediate resource release: completed sequences free resources instantly
  4. chunked prefill: long prompts split for memory efficiency
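
a toy scheduler (not the actual transformers implementation, and ignoring chunked prefill) makes points 1-3 concrete: a queue of pending requests, a running batch that is topped up every step, and sequences that leave the batch the moment they finish:

from collections import deque

# illustrative only: each "request" just counts down the tokens it still needs
pending = deque({"id": f"req_{i}", "remaining": n} for i, n in enumerate([3, 1, 5, 2]))
running = []
MAX_BATCH = 2  # toy batch-size limit

step = 0
while pending or running:
    # 1. dynamic scheduling: top up the batch from the queue
    while pending and len(running) < MAX_BATCH:
        running.append(pending.popleft())
    # 2. one interleaved forward pass decodes a token for every running sequence
    for req in running:
        req["remaining"] -= 1
    step += 1
    # 3. immediate resource release: finished sequences leave the batch right away
    finished = [r for r in running if r["remaining"] == 0]
    running = [r for r in running if r["remaining"] > 0]
    for r in finished:
        print(f"step {step}: {r['id']} finished, slot freed")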

python api: generate_batch

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct",
    attn_implementation="sdpa_paged",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct")

# prepare batch: tokenize each prompt separately (no padding needed)
texts = ["What is the capital of France?", "Write a haiku about GPUs."]
batch_inputs = [tokenizer(text)["input_ids"] for text in texts]

generation_config = GenerationConfig(
    max_new_tokens=32,
    max_batch_tokens=512,  # memory constraint
    eos_token_id=tokenizer.eos_token_id,
)

# generate with continuous batching
results = model.generate_batch(
    inputs=batch_inputs,
    generation_config=generation_config,
)
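
the exact shape of the return value can differ between versions; assuming results maps request ids to outputs that expose the generated token ids (as the manager example below does), decoding looks roughly like this:

# hedged sketch: assumes `results` is a mapping from request id to an object
# with a `generated_tokens` attribute; check the return type in your version
for request_id, output in results.items():
    print(request_id, tokenizer.decode(output.generated_tokens, skip_special_tokens=True))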

python api: ContinuousBatchingManager

manager = model.init_continuous_batching(generation_config=generation_config)
manager.start()

# add requests asynchronously
for i, input_ids in enumerate(batch_inputs):
    manager.add_request(input_ids=input_ids, request_id=f"req_{i}")

# retrieve results
for request_id, request in manager.get_result():
    text = tokenizer.decode(request.generated_tokens)
    print(f"{request_id}: {text}")

# streaming
manager.add_request(input_ids=input_ids, request_id="stream", stream=True)
for chunk in manager.request_id_iter(request_id="stream"):
    print(tokenizer.decode(chunk.generated_tokens), end='', flush=True)

manager.stop()
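
since the manager drives a background generation loop, it is good practice to guarantee stop() runs even if adding requests or consuming results raises; a minimal pattern using only the calls shown above:

manager = model.init_continuous_batching(generation_config=generation_config)
manager.start()
try:
    ...  # add requests and consume results exactly as above
finally:
    manager.stop()  # always shut the background loop down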

paged attention

concept

PagedAttention applies virtual-memory concepts to the kv cache:

  • non-contiguous memory allocation via fixed-size “pages”
  • reduces kv-cache memory waste from roughly 60-80% to under 4%
  • dynamic allocation as generation proceeds
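
a toy picture of the bookkeeping, assuming a page size of 4 token slots (the real page size and data structures are implementation details of the backend):

PAGE_SIZE = 4              # token slots per page (illustrative value)
free_pages = list(range(8))  # pool of physical kv-cache pages
block_table = {}             # sequence id -> list of physical pages, in logical order

def append_token(seq_id, position):
    # a new physical page is allocated only when the previous one is full
    if position % PAGE_SIZE == 0:
        block_table.setdefault(seq_id, []).append(free_pages.pop())

def release(seq_id):
    # a finished sequence returns its pages to the pool immediately
    free_pages.extend(block_table.pop(seq_id, []))

for position in range(6):
    append_token("req_0", position)
print(block_table)  # {'req_0': [7, 6]}: 6 tokens occupy 2 non-contiguous pages
release("req_0")    # at most one partially filled page is ever wasted per sequence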

enabling paged attention

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="sdpa_paged",  # enables paged attention
    device_map="cuda",
)

or via cli:

transformers serve --attn_implementation "sdpa_paged" --continuous-batching

api endpoints

chat completions (OpenAI-compatible)

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "stream": true
  }'

python client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key=""
)

# non-streaming
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=500,
)
print(completion.choices[0].message.content)

# streaming
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True
)
for chunk in stream:
    # delta.content is None on the final chunk of a stream
    print(chunk.choices[0].delta.content or "", end='', flush=True)

vision request

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "http://example.com/image.jpg"}}
        ]
    }],
    max_tokens=300
)

opentelemetry monitoring

export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4317"
export OTEL_SERVICE_NAME="transformers-serve"

transformers serve --continuous-batching

monitored metrics:

  • request queue length
  • active batch size
  • token throughput
  • time to first token (ttft)
  • request completion time

performance optimization

attention selection

implementation       use case
sdpa                 default, pytorch 2.1.1+
sdpa_paged           best memory efficiency
flash_attention_2    fastest compute

data type

transformers serve --dtype bfloat16  # half the memory of float32, typically much faster

batch tuning

generation_config = GenerationConfig(
    max_batch_tokens=512,  # lower values reduce peak memory at the cost of throughput
)
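
max_batch_tokens caps how many tokens are processed per scheduling step, so it trades peak memory against throughput; a rough, hardware-dependent way to compare budgets is to time the generate_batch call from the earlier example (this sketch assumes model, tokenizer, batch_inputs and GenerationConfig from above, and that generate_batch returns one entry per request):

import time

for budget in (256, 512, 1024):  # illustrative budgets; stay within your gpu memory
    cfg = GenerationConfig(
        max_new_tokens=32,
        max_batch_tokens=budget,
        eos_token_id=tokenizer.eos_token_id,
    )
    start = time.perf_counter()
    outputs = model.generate_batch(inputs=batch_inputs, generation_config=cfg)
    print(f"max_batch_tokens={budget}: {len(outputs)} requests in {time.perf_counter() - start:.2f}s")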

comparison with vllm/sglang

feature                 transformers    vllm          sglang
continuous batching     yes             yes           yes
paged attention         yes             yes           yes
tensor parallelism      no              yes           yes
performance             moderate        high          highest
ease of use             highest         high          high
best for                dev/eval        production    production

use transformers serve for: evaluation, development, moderate load

use vllm/sglang for: production workloads above roughly 100 requests per second
