vllm and sglang integration
transformers v5.0’s theme is interoperability. models defined once in transformers can be deployed across multiple inference engines without reimplementation.
vllm integration
basic usage
```python
from vllm import LLM, SamplingParams

# explicitly use transformers backend
llm = LLM(model="new-model", model_impl="transformers")

# with custom models
llm = LLM(model="custom-model", model_impl="transformers", trust_remote_code=True)

# generate
sampling_params = SamplingParams(max_tokens=100)
outputs = llm.generate(["Hello"], sampling_params)
```

server deployment
```bash
vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --model_impl transformers
```
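the server exposes an openai-compatible http api (default port 8000); a minimal request sketch, with max_tokens chosen arbitrarily:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
```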
automatic fallback
vllm automatically falls back to the transformers backend when a model is not natively supported:
```python
# model_impl options:
# "auto"         - try native first, fall back to transformers
# "vllm"         - native only
# "transformers" - transformers backend only
```
performance
the transformers backend runs within roughly 5% of native vllm implementations. for reference, vllm versus plain huggingface transformers generation:
| batch size | vllm | huggingface | speedup |
|---|---|---|---|
| 4 | 2.32s | 5.32s | 2.3x |
| 8 | 2.51s | 7.24s | 2.9x |
| 16 | 3.01s | 9.50s | 3.2x |
| 32 | 3.38s | 12.9s | 3.8x |
vision-language models
```python
from vllm import LLM, SamplingParams
from PIL import Image
from transformers import AutoProcessor

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
processor = AutoProcessor.from_pretrained(model_id)

# load the image to pass alongside the prompt
image = Image.open("image.jpg")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "image.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

vlm = LLM(model=model_id, model_impl="transformers")
outputs = vlm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(max_tokens=100)
)
```
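generate returns vllm RequestOutput objects; reading the generated description:

```python
# first request, first candidate completion
print(outputs[0].outputs[0].text)
```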
bert/encoder support

```python
# the transformers backend is the official path for encoder models in vllm
llm = LLM(model="bert-base-uncased", model_impl="transformers")
llm = LLM(model="answerdotai/ModernBERT-base", model_impl="transformers")
```
sglang integration
basic usage
```python
import sglang as sgl

# use transformers backend
llm = sgl.Engine("meta-llama/Llama-3.2-1B-Instruct", impl="transformers")

# generate
output = llm.generate(["Hello"], {"max_new_tokens": 20})
```
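generate returns one dict per prompt; a minimal sketch of reading the result, assuming sglang's non-streaming output fields ("text" and "meta_info"):

```python
# each element corresponds to one input prompt
for item in output:
    print(item["text"])       # generated continuation
    print(item["meta_info"])  # token counts, finish reason, etc.
```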
server deployment

```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --impl transformers \
  --host 0.0.0.0 \
  --port 30000
```

radixattention integration
key advantage: automatic kv cache reuse at runtime:
- retains the kv cache after generation
- radix tree structure for efficient prefix search
- lru eviction with cache-aware scheduling
performance: up to 5x higher throughput in multi-turn dialogue; the sketch below illustrates the reuse pattern
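a minimal sketch of that multi-turn pattern, reusing the engine from the basic usage example (prompts are illustrative): because the second request shares a prefix with the first, radixattention serves the shared prefix from cached kv entries instead of recomputing it.

```python
import sglang as sgl

llm = sgl.Engine("meta-llama/Llama-3.2-1B-Instruct", impl="transformers")

# turn 1: the conversation prefix is prefilled and its kv cache is retained
history = "User: Summarize the plot of Hamlet.\nAssistant:"
turn1 = llm.generate([history], {"max_new_tokens": 64})[0]["text"]

# turn 2: the shared prefix (history + turn1) is found in the radix tree,
# so only the new suffix needs to be prefilled
followup = history + turn1 + "\nUser: Who is Ophelia?\nAssistant:"
turn2 = llm.generate([followup], {"max_new_tokens": 64})[0]["text"]
```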
tensor parallelism
```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --impl transformers \
  --tp 2
```

quantization
```bash
# torchao quantization
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B \
  --impl transformers \
  --torchao-config int8dq
```

openai-compatible api
both vllm and sglang servers expose an openai-compatible api, so the openai sdk works with either:
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",    # vllm
    # base_url="http://localhost:30000/v1", # sglang
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
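both servers also support streaming through the same sdk; a minimal sketch with stream=True:

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")  # or :30000 for sglang

# stream tokens as they are generated
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```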
model compatibility requirements
attention configuration
custom models must route attention through transformers' pluggable attention interface so the serving engines can substitute their own backends (e.g. paged attention):
```python
import torch.nn as nn
from transformers import PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

class MyAttention(nn.Module):
    def forward(self, query, key, value, **kwargs):
        # dispatch to the backend selected by config._attn_implementation
        attention_fn = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        return attention_fn(self, query, key, value, **kwargs)

class MyModel(PreTrainedModel):
    _supports_attention_backend = True  # set on the PreTrainedModel class
```

tensor parallelism plan
```python
from transformers import PretrainedConfig

class MyModelConfig(PretrainedConfig):
    base_model_tp_plan = {
        "model.layers.*.self_attn.q_proj": "colwise",
        "model.layers.*.self_attn.k_proj": "colwise",
        "model.layers.*.self_attn.v_proj": "colwise",
        "model.layers.*.self_attn.o_proj": "rowwise",
        "model.layers.*.mlp.gate_proj": "colwise",
        "model.layers.*.mlp.up_proj": "colwise",
        "model.layers.*.mlp.down_proj": "rowwise",
    }
```
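with a tp plan declared on the config, the engines can shard those layers at serve time; a hedged sketch using vllm's tensor-parallel flag on the custom-model placeholder from earlier (2 gpus assumed):

```bash
# shard the model across 2 gpus; the tp plan tells the backend how to split each layer
vllm serve custom-model --model_impl transformers --trust-remote-code --tensor-parallel-size 2
```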
when to use each
use transformers backend when
- model not natively supported in vllm/sglang
- custom architectures from hub
- rapid prototyping
- encoder models (bert)
- vlm support needed
use native implementations when
- production deployment
- maximum performance critical
- high-volume serving
- cost optimization
vllm vs sglang
| use case | best choice |
|---|---|
| high-concurrency single-turn | vllm |
| multi-turn dialogue | sglang |
| structured output | sglang |
| real-time q&a | vllm |
| cache reuse scenarios | sglang |
known limitations
vllm
- video input not yet supported for vlms
- some models (e.g. internvl) are incompatible
sglang
- performance gap vs native (optimization ongoing)
- vlm integration in development
future roadmap
vllm q4 2025:
- complete removal of reimplementations
- centralized maintenance in transformers
- deeper integration
sglang h1 2025:
- performance gap optimization
- vlm support
- lora improvements