vllm and sglang integration
transformers v5.0’s theme is interoperability. models defined once in transformers can be deployed across multiple inference engines without reimplementation.
vllm integration
basic usage
```python
from vllm import LLM, SamplingParams

# explicitly use transformers backend
llm = LLM(model="new-model", model_impl="transformers")

# with custom models
llm = LLM(model="custom-model", model_impl="transformers", trust_remote_code=True)

# generate
sampling_params = SamplingParams(max_tokens=100)
outputs = llm.generate(["Hello"], sampling_params)
```

server deployment
```bash
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --model_impl transformers
```
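the server exposes an openai-compatible http api (default port 8000); a minimal request sketch, with max_tokens chosen arbitrarily:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
```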
automatic fallback
vllm automatically falls back to the transformers backend when a model is not natively supported:
```python
# model_impl options:
# "auto"         - try native first, fall back to transformers
# "vllm"         - native only
# "transformers" - transformers backend only
```
performance
the transformers backend runs within roughly 5% of native vllm implementations. for reference, vllm versus plain huggingface transformers generation:
| batch size | vllm | huggingface | speedup |
|---|---|---|---|
| 4 | 2.32s | 5.32s | 2.3x |
| 8 | 2.51s | 7.24s | 2.9x |
| 16 | 3.01s | 9.50s | 3.2x |
| 32 | 3.38s | 12.9s | 3.8x |
vision-language models
```python
from vllm import LLM, SamplingParams
from PIL import Image
from transformers import AutoProcessor

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
processor = AutoProcessor.from_pretrained(model_id)

# load the image to pass alongside the prompt
image = Image.open("image.jpg")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "image.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

vlm = LLM(model=model_id, model_impl="transformers")
outputs = vlm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(max_tokens=100)
)
```
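generate returns vllm RequestOutput objects; reading the generated description:

```python
# first request, first candidate completion
print(outputs[0].outputs[0].text)
```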
bert/encoder support

```python
# the transformers backend is the official path for encoder models in vllm
llm = LLM(model="bert-base-uncased", model_impl="transformers")
llm = LLM(model="answerdotai/ModernBERT-base", model_impl="transformers")
```
sglang integration
basic usage
```python
import sglang as sgl

# use transformers backend
llm = sgl.Engine("meta-llama/Llama-3.2-1B-Instruct", impl="transformers")

# generate
output = llm.generate(["Hello"], {"max_new_tokens": 20})
```
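generate returns one dict per prompt; a minimal sketch of reading the result, assuming sglang's non-streaming output fields ("text" and "meta_info"):

```python
# each element corresponds to one input prompt
for item in output:
    print(item["text"])       # generated continuation
    print(item["meta_info"])  # token counts, finish reason, etc.
```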
server deployment

```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --impl transformers \
  --host 0.0.0.0 \
  --port 30000
```

radixattention integration
key advantage: automatic kv cache reuse at runtime:
- retains the kv cache after generation
- radix tree structure for efficient prefix search
- lru eviction with cache-aware scheduling
performance: up to 5x higher throughput in multi-turn dialogue; the sketch below illustrates the reuse pattern
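a minimal sketch of that multi-turn pattern, reusing the engine from the basic usage example (prompts are illustrative): because the second request shares a prefix with the first, radixattention serves the shared prefix from cached kv entries instead of recomputing it.

```python
import sglang as sgl

llm = sgl.Engine("meta-llama/Llama-3.2-1B-Instruct", impl="transformers")

# turn 1: the conversation prefix is prefilled and its kv cache is retained
history = "User: Summarize the plot of Hamlet.\nAssistant:"
turn1 = llm.generate([history], {"max_new_tokens": 64})[0]["text"]

# turn 2: the shared prefix (history + turn1) is found in the radix tree,
# so only the new suffix needs to be prefilled
followup = history + turn1 + "\nUser: Who is Ophelia?\nAssistant:"
turn2 = llm.generate([followup], {"max_new_tokens": 64})[0]["text"]
```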
tensor parallelism
```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --impl transformers \
  --tp 2
```

quantization
```bash
# torchao quantization
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B \
  --impl transformers \
  --torchao-config int8dq
```

openai-compatible api
both vllm and sglang servers expose an openai-compatible api, so the openai sdk works with either:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",    # vllm
    # base_url="http://localhost:30000/v1", # sglang
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
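both servers also support streaming through the same sdk; a minimal sketch with stream=True:

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")  # or :30000 for sglang

# stream tokens as they are generated
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```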
model compatibility requirements
attention configuration
custom models must route attention through transformers' pluggable attention interface so the serving engines can substitute their own backends (e.g. paged attention):
```python
import torch.nn as nn
from transformers import PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

class MyAttention(nn.Module):
    def forward(self, query, key, value, **kwargs):
        # dispatch to the backend selected by config._attn_implementation
        attention_fn = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        return attention_fn(self, query, key, value, **kwargs)

class MyModel(PreTrainedModel):
    _supports_attention_backend = True  # set on the PreTrainedModel class
```

tensor parallelism plan
```python
from transformers import PretrainedConfig

class MyModelConfig(PretrainedConfig):
    base_model_tp_plan = {
        "model.layers.*.self_attn.q_proj": "colwise",
        "model.layers.*.self_attn.k_proj": "colwise",
        "model.layers.*.self_attn.v_proj": "colwise",
        "model.layers.*.self_attn.o_proj": "rowwise",
        "model.layers.*.mlp.gate_proj": "colwise",
        "model.layers.*.mlp.up_proj": "colwise",
        "model.layers.*.mlp.down_proj": "rowwise",
    }
```
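with a tp plan declared on the config, the engines can shard those layers at serve time; a hedged sketch using vllm's tensor-parallel flag on the custom-model placeholder from earlier (2 gpus assumed):

```bash
# shard the model across 2 gpus; the tp plan tells the backend how to split each layer
vllm serve custom-model --model_impl transformers --trust-remote-code --tensor-parallel-size 2
```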
when to use each
use transformers backend when
- model not natively supported in vllm/sglang
- custom architectures from hub
- rapid prototyping
- encoder models (bert)
- vlm support needed
use native implementations when
- production deployment
- maximum performance critical
- high-volume serving
- cost optimization
vllm vs sglang
| use case | best choice |
|---|---|
| high-concurrency single-turn | vllm |
| multi-turn dialogue | sglang |
| structured output | sglang |
| real-time q&a | vllm |
| cache reuse scenarios | sglang |
known limitations
vllm
- video input not yet supported for vlms
- some models (e.g. internvl) are incompatible
sglang
- performance gap vs native (optimization ongoing)
- vlm integration in development
future roadmap
vllm q4 2025:
- complete removal of reimplementations
- centralized maintenance in transformers
- deeper integration
sglang h1 2025:
- performance gap optimization
- vlm support
- lora improvements