vllm and sglang integration

published: December 3, 2025

transformers v5.0's theme is interoperability. models defined once in transformers can be deployed across multiple inference engines without reimplementation.

vllm integration

basic usage

from vllm import LLM, SamplingParams

# explicitly use transformers backend
llm = LLM(model="new-model", model_impl="transformers")

# with custom models
llm = LLM(model="custom-model", model_impl="transformers", trust_remote_code=True)

# generate
sampling_params = SamplingParams(max_tokens=100)
outputs = llm.generate(["Hello"], sampling_params)

server deployment

vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --model-impl transformers

automatic fallback

vllm automatically falls back to the transformers backend when a model is not natively supported:

# model_impl options:
# "auto" - try native first, fallback to transformers
# "vllm" - native only
# "transformers" - transformers backend only

performance

the transformers backend performs within 5% of the native vllm implementation:

batch size | vllm  | huggingface | speedup
4          | 2.32s | 5.32s       | 2.3x
8          | 2.51s | 7.24s       | 2.9x
16         | 3.01s | 9.50s       | 3.2x
32         | 3.38s | 12.9s       | 3.8x

vision-language models

from vllm import LLM, SamplingParams
from PIL import Image
from transformers import AutoProcessor

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
processor = AutoProcessor.from_pretrained(model_id)

# load the image referenced in the chat messages below
image = Image.open("image.jpg")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "image.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]

prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

vlm = LLM(model=model_id, model_impl="transformers")
outputs = vlm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(max_tokens=100)
)

bert/encoder support

# the transformers backend is the official way to run encoder-only models in vllm
llm = LLM(model="bert-base-uncased", model_impl="transformers")
llm = LLM(model="answerdotai/ModernBERT-base", model_impl="transformers")
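
as a minimal sketch of what this enables, the snippet below assumes vllm's pooling api (task="embed" and LLM.embed) and uses an example embedding checkpoint; exact output attributes can vary between vllm versions:

from vllm import LLM

# bert-style embedding model served through the transformers backend,
# with vllm's pooling runner returning sentence embeddings
llm = LLM(model="BAAI/bge-base-en-v1.5", model_impl="transformers", task="embed")

outputs = llm.embed(["transformers models run inside vllm."])
embedding = outputs[0].outputs.embedding  # list of floats, one per hidden dim
print(len(embedding))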

sglang integration

basic usage

import sglang as sgl

# use transformers backend
llm = sgl.Engine("meta-llama/Llama-3.2-1B-Instruct", impl="transformers")

# generate
output = llm.generate(["Hello"], {"max_new_tokens": 20})

server deployment

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.2-1B-Instruct \
    --impl transformers \
    --host 0.0.0.0 \
    --port 30000

radixattention integration

key advantage: automatic kv cache reuse during runtime:

  • retains kv cache after generation
  • radix tree structure for efficient prefix search
  • lru eviction with cache-aware scheduling

performance: up to 5x higher throughput in multi-turn dialogue
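
a minimal sketch of the cache-reuse pattern, assuming the same engine setup as the basic-usage example; the prompts are placeholders, and radixattention reuses the shared prefix's kv cache automatically:

import sglang as sgl

llm = sgl.Engine("meta-llama/Llama-3.2-1B-Instruct", impl="transformers")

# both prompts share the same prefix; the first call populates the radix tree,
# so the second call only needs to prefill the unseen suffix
system = "You are a concise assistant.\n\n"
first = llm.generate([system + "User: What is vLLM?\nAssistant:"], {"max_new_tokens": 64})
second = llm.generate([system + "User: What is SGLang?\nAssistant:"], {"max_new_tokens": 64})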

tensor parallelism

python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --impl transformers \
    --tp 2

quantization

# torchao quantization
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B \
    --impl transformers \
    --torchao-config int8dq

openai-compatible api

both vllm and sglang expose openai-compatible endpoints, so the openai python sdk works with either server:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",  # vllm
    # base_url="http://localhost:30000/v1",  # sglang
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)

model compatibility requirements

attention configuration

import torch.nn as nn
from transformers import PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

class MyAttention(nn.Module):
    def forward(self, query, key, value, **kwargs):
        # dispatch to the attention implementation selected in the config
        # (sdpa, flash attention, or the paged backend injected by vllm/sglang)
        attention_fn = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        return attention_fn(self, query, key, value, **kwargs)

class MyModel(PreTrainedModel):
    # setting this on the model class opts in to pluggable attention backends
    _supports_attention_backend = True

tensor parallelism plan

from transformers import PretrainedConfig

class MyModelConfig(PretrainedConfig):
    # shard the attention/mlp input projections column-wise and the
    # output projections row-wise across tensor-parallel ranks
    base_model_tp_plan = {
        "model.layers.*.self_attn.q_proj": "colwise",
        "model.layers.*.self_attn.k_proj": "colwise",
        "model.layers.*.self_attn.v_proj": "colwise",
        "model.layers.*.self_attn.o_proj": "rowwise",
        "model.layers.*.mlp.gate_proj": "colwise",
        "model.layers.*.mlp.up_proj": "colwise",
        "model.layers.*.mlp.down_proj": "rowwise",
    }
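
as a sketch of how such a plan is exercised, transformers' own tensor-parallel loading (tp_plan="auto") applies the colwise/rowwise entries when sharding weights; the script below assumes a torchrun launch across two gpus and an example model id:

from transformers import AutoModelForCausalLM, AutoTokenizer

# run with: torchrun --nproc-per-node 2 tp_demo.py
model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example checkpoint shipping a base_model_tp_plan

# tp_plan="auto" shards weights across ranks according to the config's tensor-parallel plan
model = AutoModelForCausalLM.from_pretrained(model_id, tp_plan="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))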

when to use each

use transformers backend when

  • model not natively supported in vllm/sglang
  • custom architectures from hub
  • rapid prototyping
  • encoder models (bert)
  • vlm support needed

use native implementations when

  • production deployment
  • maximum performance critical
  • high-volume serving
  • cost optimization

vllm vs sglang

use case                     | best choice
high-concurrency single-turn | vllm
multi-turn dialogue          | sglang
structured output            | sglang
real-time q&a                | vllm
cache reuse scenarios        | sglang

known limitations

vllm

  • video input is not yet supported for vlms
  • some models, such as internvl, are currently incompatible

sglang

  • performance gap vs native (optimization ongoing)
  • vlm integration in development

future roadmap

vllm q4 2025:

  • complete removal of reimplementations
  • centralized maintenance in transformers
  • deeper integration

sglang 2025 h1:

  • performance gap optimization
  • vlm support
  • lora improvements
