vllm and sglang integration

transformers v5.0’s theme is interoperability. models defined once in transformers can be deployed across multiple inference engines without reimplementation.

vllm integration

basic usage

from vllm import LLM, SamplingParams

# explicitly use transformers backend
llm = LLM(model="new-model", model_impl="transformers")

# with custom models
llm = LLM(model="custom-model", model_impl="transformers", trust_remote_code=True)

# generate
sampling_params = SamplingParams(max_tokens=100)
outputs = llm.generate(["Hello"], sampling_params)
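
each element of outputs is a vllm RequestOutput, so the generated text can be read back directly:

# print the completion produced for each prompt
for out in outputs:
    print(out.outputs[0].text)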

server deployment

vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --model_impl transformers

automatic fallback

vllm automatically falls back to the transformers backend if a model is not natively supported:

# model_impl options:
# "auto" - try native first, fallback to transformers
# "vllm" - native only
# "transformers" - transformers backend only

performance

the transformers backend runs within roughly 5% of native vllm implementations. for reference, vllm serving versus plain huggingface transformers generation:

batch size | vllm  | huggingface | speedup
4          | 2.32s | 5.32s       | 2.3x
8          | 2.51s | 7.24s       | 2.9x
16         | 3.01s | 9.50s       | 3.2x
32         | 3.38s | 12.9s       | 3.8x

vision-language models

from vllm import LLM, SamplingParams
from PIL import Image
from transformers import AutoProcessor

model_id = "llava-hf/llava-onevision-qwen2-0.5b-ov-hf"
processor = AutoProcessor.from_pretrained(model_id)

# load the image that will be passed as multi-modal data
image = Image.open("image.jpg")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "image.jpg"},
        {"type": "text", "text": "Describe this image."},
    ],
}]

prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

vlm = LLM(model=model_id, model_impl="transformers")
outputs = vlm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=SamplingParams(max_tokens=100)
)

bert/encoder support

# the transformers backend is the official path for encoder-only models (bert, modernbert, ...) in vllm
llm = LLM(model="bert-base-uncased", model_impl="transformers")
llm = LLM(model="answerdotai/ModernBERT-base", model_impl="transformers")

sglang integration

basic usage

import sglang as sgl

# use transformers backend
llm = sgl.Engine("meta-llama/Llama-3.2-1B-Instruct", impl="transformers")

# generate
output = llm.generate(["Hello"], {"max_new_tokens": 20})
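
the engine returns one result dict per prompt; in recent sglang releases the completion is stored under the "text" key (the exact schema may vary by version):

# inspect the completion for the first prompt
print(output[0]["text"])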

server deployment

python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.2-1B-Instruct \
    --impl transformers \
    --host 0.0.0.0 \
    --port 30000

radixattention integration

key advantage: automatic kv cache reuse at runtime (see the sketch below):

  • retains kv cache after generation
  • radix tree structure for efficient prefix search
  • lru eviction with cache-aware scheduling

performance: up to 5x higher throughput in multi-turn dialogue workloads
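
a minimal multi-turn sketch of the cache reuse idea, assuming the engine api shown above and that results expose a "text" field:

import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", impl="transformers")

# turn 1: the full prompt is prefilled and its kv cache is kept in the radix tree
history = "System: You are a helpful assistant.\nUser: Hi, who are you?\nAssistant:"
first = llm.generate([history], {"max_new_tokens": 64})[0]["text"]

# turn 2: everything up to the new user message matches a cached prefix,
# so only the newly appended tokens need to be prefilled
history += first + "\nUser: Summarize that in one sentence.\nAssistant:"
second = llm.generate([history], {"max_new_tokens": 64})[0]["text"]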

tensor parallelism

python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --impl transformers \
    --tp 2

quantization

# torchao quantization
python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B \
    --impl transformers \
    --torchao-config int8dq

openai-compatible api

both vllm and sglang expose an openai-compatible api, so the openai python sdk works against either server:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",  # vllm
    # base_url="http://localhost:30000/v1",  # sglang
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}]
)
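
the response follows the standard openai schema, so reading the reply is identical regardless of which server sits behind the endpoint:

# works the same against the vllm and sglang servers
print(response.choices[0].message.content)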

model compatibility requirements

attention configuration

from torch import nn

from transformers import PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

class MyAttention(nn.Module):
    def forward(self, query, key, value, **kwargs):
        # dispatch to whichever attention backend is selected via
        # config._attn_implementation (sdpa, flash attention, paged attention, ...)
        attention_fn = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        return attention_fn(self, query, key, value, **kwargs)

class MyModel(PreTrainedModel):
    # declares that the model works with pluggable attention backends
    _supports_attention_backend = True
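
since the attention module dispatches through config._attn_implementation, the same modeling code runs under whichever backend is requested at load time (vllm and sglang plug in their own attention, e.g. paged attention, through this same hook); a small illustration with a model used earlier in this post:

from transformers import AutoModelForCausalLM

# request a specific attention backend at load time
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", attn_implementation="sdpa"
)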

tensor parallelism plan

from transformers import PretrainedConfig

class MyModelConfig(PretrainedConfig):
    base_model_tp_plan = {
        "model.layers.*.self_attn.q_proj": "colwise",
        "model.layers.*.self_attn.k_proj": "colwise",
        "model.layers.*.self_attn.v_proj": "colwise",
        "model.layers.*.self_attn.o_proj": "rowwise",
        "model.layers.*.mlp.gate_proj": "colwise",
        "model.layers.*.mlp.up_proj": "colwise",
        "model.layers.*.mlp.down_proj": "rowwise",
    }
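
for reference, recent transformers releases can consume this plan directly when the script is launched with torchrun; a hedged sketch, assuming built-in tensor-parallel loading is available in your version:

# launch with: torchrun --nproc-per-node 2 load_tp.py
from transformers import AutoModelForCausalLM

# tp_plan="auto" picks up base_model_tp_plan from the config and shards the
# listed linear layers column-/row-wise across the visible gpus
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", tp_plan="auto"
)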

when to use each

use transformers backend when

  • model not natively supported in vllm/sglang
  • custom architectures from hub
  • rapid prototyping
  • encoder models (bert)
  • vlm support needed

use native implementations when

  • production deployment
  • maximum performance critical
  • high-volume serving
  • cost optimization

vllm vs sglang

use case                     | best choice
high-concurrency single-turn | vllm
multi-turn dialogue          | sglang
structured output            | sglang
real-time q&a                | vllm
cache reuse scenarios        | sglang

known limitations

vllm

  • video input not yet supported for vlms
  • some models (e.g. internvl) are currently incompatible

sglang

  • performance gap vs native (optimization ongoing)
  • vlm integration in development

future roadmap

vllm (q4 2025):

  • complete removal of reimplementations
  • centralized maintenance in transformers
  • deeper integration

sglang (h1 2025):

  • performance gap optimization
  • vlm support
  • lora improvements
