Triton with vLLM Backend on Kubernetes
Deploy NVIDIA Triton Inference Server with vLLM backend on Kubernetes for flexible LLM serving with PagedAttention and continuous batching.
π‘ Quick Answer: Deploy Triton with the vLLM backend using
nvcr.io/nvidia/tritonserver:24.12-vllm-python-py3. Pointmodel.jsonat your HuggingFace model β no engine compilation needed. vLLM handles PagedAttention, continuous batching, and quantization at runtime.
The Problem
TensorRT-LLM provides maximum performance but requires:
- Engine compilation β hours of build time per model/GPU combination
- Rebuilds on new hardware β engines are GPU-architecture specific
- Complex pipeline β convert checkpoint β build engine β deploy
vLLM offers a simpler alternative with excellent performance:
- No compilation step β load HuggingFace models directly
- PagedAttention β efficient KV cache management (inspired by OS virtual memory)
- AWQ/GPTQ quantization β load pre-quantized models without engine builds
- Fast iteration β swap models by changing a config, not rebuilding engines
The Solution
Step 1: Model Repository Structure
model_repository/
βββ mistral-7b/
βββ config.pbtxt
βββ 1/
βββ model.jsonStep 2: Create Model Configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: triton-vllm-config
namespace: ai-inference
data:
config.pbtxt: |
backend: "vllm"
max_batch_size: 0
model_transaction_policy {
decoupled: True
}
input [
{
name: "text_input"
data_type: TYPE_STRING
dims: [ 1 ]
},
{
name: "stream"
data_type: TYPE_BOOL
dims: [ 1 ]
},
{
name: "sampling_parameters"
data_type: TYPE_STRING
dims: [ 1 ]
optional: true
}
]
output [
{
name: "text_output"
data_type: TYPE_STRING
dims: [ -1 ]
}
]
model.json: |
{
"model": "mistralai/Mistral-7B-Instruct-v0.3",
"disable_log_requests": true,
"gpu_memory_utilization": 0.85,
"max_model_len": 8192,
"tensor_parallel_size": 1,
"dtype": "float16",
"enable_chunked_prefill": true,
"max_num_seqs": 128,
"enforce_eager": false
}Step 3: Deploy Triton with vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
name: triton-vllm
namespace: ai-inference
spec:
replicas: 1
selector:
matchLabels:
app: triton-vllm
template:
metadata:
labels:
app: triton-vllm
spec:
containers:
- name: triton
image: nvcr.io/nvidia/tritonserver:24.12-vllm-python-py3
args:
- tritonserver
- --model-repository=/model-repository
- --log-verbose=1
ports:
- containerPort: 8000
name: http
- containerPort: 8001
name: grpc
- containerPort: 8002
name: metrics
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
- name: TRANSFORMERS_CACHE
value: /cache/huggingface
resources:
limits:
nvidia.com/gpu: 1
memory: 48Gi
cpu: "8"
requests:
memory: 24Gi
cpu: "4"
volumeMounts:
- name: config
mountPath: /model-repository/mistral-7b/config.pbtxt
subPath: config.pbtxt
- name: config
mountPath: /model-repository/mistral-7b/1/model.json
subPath: model.json
- name: cache
mountPath: /cache
- name: shm
mountPath: /dev/shm
readinessProbe:
httpGet:
path: /v2/health/ready
port: 8000
initialDelaySeconds: 180
periodSeconds: 10
livenessProbe:
httpGet:
path: /v2/health/live
port: 8000
initialDelaySeconds: 180
periodSeconds: 30
volumes:
- name: config
configMap:
name: triton-vllm-config
- name: cache
persistentVolumeClaim:
claimName: model-cache
- name: shm
emptyDir:
medium: Memory
sizeLimit: 8Gi
---
apiVersion: v1
kind: Service
metadata:
name: triton-vllm
namespace: ai-inference
spec:
selector:
app: triton-vllm
ports:
- name: http
port: 8000
- name: grpc
port: 8001
- name: metrics
port: 8002Step 4: HuggingFace Token Secret
apiVersion: v1
kind: Secret
metadata:
name: hf-token
namespace: ai-inference
type: Opaque
stringData:
token: "hf_your_token_here"Step 5: AWQ Quantized Model (Fit Larger Models)
{
"model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
"quantization": "awq",
"gpu_memory_utilization": 0.90,
"max_model_len": 16384,
"tensor_parallel_size": 2,
"dtype": "float16",
"max_num_seqs": 64
}For tensor parallelism across 2 GPUs, update the Deployment:
resources:
limits:
nvidia.com/gpu: 2Step 6: Test Inference
# Generate text
curl -X POST http://triton-vllm.ai-inference:8000/v2/models/mistral-7b/generate \
-H "Content-Type: application/json" \
-d '{
"text_input": "<s>[INST] Explain RDMA in simple terms [/INST]",
"stream": false,
"sampling_parameters": "{\"temperature\": 0.7, \"max_tokens\": 256}"
}'
# Streaming
curl -X POST http://triton-vllm.ai-inference:8000/v2/models/mistral-7b/generate_stream \
-H "Content-Type: application/json" \
-d '{
"text_input": "<s>[INST] Write a Kubernetes YAML for nginx [/INST]",
"stream": true,
"sampling_parameters": "{\"temperature\": 0.3, \"max_tokens\": 512}"
}'flowchart TD
A[HuggingFace Model Hub] --> B[vLLM Backend]
B --> C[PagedAttention]
B --> D[Continuous Batching]
B --> E[AWQ or GPTQ Quantization]
C --> F[Efficient KV Cache]
D --> G[High Throughput]
F --> H[Triton Inference Server]
G --> H
H --> I[HTTP and gRPC APIs]
H --> J[Prometheus Metrics]Common Issues
Model download timeout
# Pre-download to PVC cache instead of downloading at startup
# Run a one-time Job:
apiVersion: batch/v1
kind: Job
metadata:
name: download-model
spec:
template:
spec:
containers:
- name: download
image: python:3.11-slim
command: ["python3", "-c", "from huggingface_hub import snapshot_download; snapshot_download('mistralai/Mistral-7B-Instruct-v0.3', cache_dir='/cache/huggingface')"]
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
volumeMounts:
- name: cache
mountPath: /cache
volumes:
- name: cache
persistentVolumeClaim:
claimName: model-cache
restartPolicy: NeverOOM on GPU
{
"gpu_memory_utilization": 0.80,
"max_model_len": 4096,
"max_num_seqs": 32,
"enforce_eager": true
}/dev/shm too small
# vLLM uses shared memory for tensor parallel
# Mount emptyDir with Memory medium
- name: shm
emptyDir:
medium: Memory
sizeLimit: 16Gi # Increase if using tensor parallelismBest Practices
- Mount
/dev/shmas Memory emptyDir β vLLM needs shared memory for NCCL communication - Pre-download models to a PVC β avoids download timeouts and repeated downloads across replicas
- Use
enable_chunked_prefill: trueβ improves TTFT (time to first token) for long prompts - Set
gpu_memory_utilization: 0.85β leave headroom for CUDA context and runtime allocations - Use AWQ quantization for larger models β 4-bit AWQ fits 70B models on 2x A100
- Cache HuggingFace models on a shared PVC β
TRANSFORMERS_CACHEshould point to persistent storage
Key Takeaways
- vLLM on Triton provides zero-compilation LLM serving β load HuggingFace models directly
- PagedAttention enables 2-4x more concurrent sequences than naive KV cache
- Configure everything in
model.jsonβ model name, quantization, parallelism, memory limits - AWQ/GPTQ quantized models run directly without engine builds β great for fitting large models
- vLLM trades ~10-20% peak throughput vs TensorRT-LLM for dramatically simpler deployment

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
