πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
AI • Advanced • ⏱ 30 minutes • K8s 1.28+

Triton with vLLM Backend on Kubernetes

Deploy NVIDIA Triton Inference Server with vLLM backend on Kubernetes for flexible LLM serving with PagedAttention and continuous batching.

By Luca Berton • 📖 5 min read

πŸ’‘ Quick Answer: Deploy Triton with the vLLM backend using nvcr.io/nvidia/tritonserver:24.12-vllm-python-py3. Point model.json at your HuggingFace model β€” no engine compilation needed. vLLM handles PagedAttention, continuous batching, and quantization at runtime.

The Problem

TensorRT-LLM provides maximum performance but requires:

  • Engine compilation β€” hours of build time per model/GPU combination
  • Rebuilds on new hardware β€” engines are GPU-architecture specific
  • Complex pipeline β€” convert checkpoint β†’ build engine β†’ deploy

vLLM offers a simpler alternative with excellent performance:

  • No compilation step β€” load HuggingFace models directly
  • PagedAttention β€” efficient KV cache management (inspired by OS virtual memory)
  • AWQ/GPTQ quantization β€” load pre-quantized models without engine builds
  • Fast iteration β€” swap models by changing a config, not rebuilding engines
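The block-table idea behind PagedAttention can be sketched in a few lines of Python (a toy illustration only, not vLLM's actual allocator):

```python
# Toy illustration of PagedAttention-style block allocation.
# Simplified sketch -- NOT vLLM's actual implementation.

BLOCK_SIZE = 16  # tokens per KV cache block (vLLM's default block size)

class BlockAllocator:
    """Pool of fixed-size physical KV cache blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            raise MemoryError("KV cache exhausted")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    """Each sequence maps logical token positions to physical blocks."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new block is claimed only when the current one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):
    seq.append_token()
print(len(seq.block_table))  # 3 blocks for 40 tokens (ceil(40/16))
```

Because blocks are claimed on demand, a 40-token sequence holds 3 blocks of 16 tokens rather than a reservation sized for max_model_len, which is what lets vLLM pack many more concurrent sequences into the same KV cache.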

The Solution

Step 1: Model Repository Structure

model_repository/
└── mistral-7b/
    β”œβ”€β”€ config.pbtxt
    └── 1/
        └── model.json
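For local testing outside Kubernetes (e.g. with docker run), this layout can be scaffolded with a short script. The file names follow Triton's repository conventions; the placeholder files get their real contents in the next step:

```python
# Scaffold the Triton model repository layout for local testing.
# The config.pbtxt and model.json contents are shown in Step 2;
# here we only create empty placeholders in the right locations.
from pathlib import Path

def scaffold(root="model_repository", model="mistral-7b", version="1"):
    version_dir = Path(root) / model / version
    version_dir.mkdir(parents=True, exist_ok=True)
    (Path(root) / model / "config.pbtxt").touch()   # model config
    (version_dir / "model.json").touch()            # vLLM engine args
    # Return the created layout, relative to the repository root.
    return sorted(str(p.relative_to(root)) for p in Path(root).rglob("*"))

print(scaffold())
```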

Step 2: Create Model Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: triton-vllm-config
  namespace: ai-inference
data:
  config.pbtxt: |
    backend: "vllm"
    max_batch_size: 0

    model_transaction_policy {
      decoupled: True
    }

    input [
      {
        name: "text_input"
        data_type: TYPE_STRING
        dims: [ 1 ]
      },
      {
        name: "stream"
        data_type: TYPE_BOOL
        dims: [ 1 ]
      },
      {
        name: "sampling_parameters"
        data_type: TYPE_STRING
        dims: [ 1 ]
        optional: true
      }
    ]

    output [
      {
        name: "text_output"
        data_type: TYPE_STRING
        dims: [ -1 ]
      }
    ]

  model.json: |
    {
      "model": "mistralai/Mistral-7B-Instruct-v0.3",
      "disable_log_requests": true,
      "gpu_memory_utilization": 0.85,
      "max_model_len": 8192,
      "tensor_parallel_size": 1,
      "dtype": "float16",
      "enable_chunked_prefill": true,
      "max_num_seqs": 128,
      "enforce_eager": false
    }
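A back-of-envelope check shows how these settings divide GPU memory. This is a sketch with assumed hardware: the 24 GB GPU size and the ~7.24B parameter count are illustrative, not taken from the manifests, and vLLM's own profiling decides the real numbers:

```python
# Rough GPU memory budget for the model.json above (decimal GB,
# close enough for an estimate). The 24 GB GPU is an assumption.

GPU_MEM_GB = 24                  # e.g. an L4 / A10G class GPU (assumed)
GPU_MEM_UTILIZATION = 0.85       # from model.json
PARAMS_B = 7.24                  # Mistral-7B parameter count (approx)
BYTES_PER_PARAM = 2              # dtype: float16

# Mistral-7B architecture (from its HuggingFace config):
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

weights_gb = PARAMS_B * BYTES_PER_PARAM        # ~14.5 GB of weights
budget_gb = GPU_MEM_GB * GPU_MEM_UTILIZATION   # 20.4 GB claimed by vLLM
kv_cache_gb = budget_gb - weights_gb           # ~5.9 GB left for KV cache

# KV bytes per token: layers * kv_heads * head_dim * 2 (K and V) * 2 bytes
kv_bytes_per_token = LAYERS * KV_HEADS * HEAD_DIM * 2 * 2  # 131072 B
total_tokens = kv_cache_gb * 1e9 / kv_bytes_per_token
print(f"KV cache fits ~{total_tokens:,.0f} tokens "
      f"(~{total_tokens / 8192:.1f} full 8192-token sequences)")
```

The remaining 15% of GPU memory absorbs the CUDA context and runtime allocations, which is why gpu_memory_utilization is not set to 1.0.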

Step 3: Deploy Triton with vLLM

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-vllm
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-vllm
  template:
    metadata:
      labels:
        app: triton-vllm
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.12-vllm-python-py3
          args:
            - tritonserver
            - --model-repository=/model-repository
            - --log-verbose=1
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
            - name: HF_HOME  # huggingface_hub (used by vLLM for downloads) reads HF_HOME
              value: /cache/huggingface
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 48Gi
              cpu: "8"
            requests:
              memory: 24Gi
              cpu: "4"
          volumeMounts:
            - name: config
              mountPath: /model-repository/mistral-7b/config.pbtxt
              subPath: config.pbtxt
            - name: config
              mountPath: /model-repository/mistral-7b/1/model.json
              subPath: model.json
            - name: cache
              mountPath: /cache
            - name: shm
              mountPath: /dev/shm
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
      volumes:
        - name: config
          configMap:
            name: triton-vllm-config
        - name: cache
          persistentVolumeClaim:
            claimName: model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 8Gi
---
apiVersion: v1
kind: Service
metadata:
  name: triton-vllm
  namespace: ai-inference
spec:
  selector:
    app: triton-vllm
  ports:
    - name: http
      port: 8000
    - name: grpc
      port: 8001
    - name: metrics
      port: 8002

Step 4: HuggingFace Token Secret

apiVersion: v1
kind: Secret
metadata:
  name: hf-token
  namespace: ai-inference
type: Opaque
stringData:
  token: "hf_your_token_here"

Step 5: AWQ Quantized Model (Fit Larger Models)

{
  "model": "TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
  "quantization": "awq",
  "gpu_memory_utilization": 0.90,
  "max_model_len": 16384,
  "tensor_parallel_size": 2,
  "dtype": "float16",
  "max_num_seqs": 64
}

For tensor parallelism across 2 GPUs, update the Deployment:

resources:
  limits:
    nvidia.com/gpu: 2

Step 6: Test Inference

# Generate text
curl -X POST http://triton-vllm.ai-inference:8000/v2/models/mistral-7b/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "<s>[INST] Explain RDMA in simple terms [/INST]",
    "stream": false,
    "sampling_parameters": "{\"temperature\": 0.7, \"max_tokens\": 256}"
  }'

# Streaming
curl -X POST http://triton-vllm.ai-inference:8000/v2/models/mistral-7b/generate_stream \
  -H "Content-Type: application/json" \
  -d '{
    "text_input": "<s>[INST] Write a Kubernetes YAML for nginx [/INST]",
    "stream": true,
    "sampling_parameters": "{\"temperature\": 0.3, \"max_tokens\": 512}"
  }'
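The same request can be issued from Python with only the standard library. A minimal sketch; the localhost URL and the port-forward command are assumptions for testing from outside the cluster:

```python
# Minimal Python client for Triton's /generate endpoint, stdlib only.
# For local testing, port-forward the service first:
#   kubectl -n ai-inference port-forward svc/triton-vllm 8000:8000
import json
import urllib.request

def build_payload(prompt, temperature=0.7, max_tokens=256, stream=False):
    # sampling_parameters is a JSON string *inside* the JSON body,
    # matching the TYPE_STRING input declared in config.pbtxt above.
    return {
        "text_input": prompt,
        "stream": stream,
        "sampling_parameters": json.dumps(
            {"temperature": temperature, "max_tokens": max_tokens}),
    }

def generate(base_url, model, prompt, **kwargs):
    req = urllib.request.Request(
        f"{base_url}/v2/models/{model}/generate",
        data=json.dumps(build_payload(prompt, **kwargs)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text_output"]

# Example (requires a running server):
# print(generate("http://localhost:8000", "mistral-7b",
#                "<s>[INST] Explain RDMA in simple terms [/INST]"))
```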
How It Fits Together

flowchart TD
    A[HuggingFace Model Hub] --> B[vLLM Backend]
    B --> C[PagedAttention]
    B --> D[Continuous Batching]
    B --> E[AWQ or GPTQ Quantization]
    C --> F[Efficient KV Cache]
    D --> G[High Throughput]
    F --> H[Triton Inference Server]
    G --> H
    H --> I[HTTP and gRPC APIs]
    H --> J[Prometheus Metrics]

Common Issues

Model download timeout

# Pre-download to PVC cache instead of downloading at startup
# Run a one-time Job:
apiVersion: batch/v1
kind: Job
metadata:
  name: download-model
spec:
  template:
    spec:
      containers:
        - name: download
          image: python:3.11-slim
          # huggingface_hub is not in the slim image, so install it first;
          # HF_HOME puts the download in the same hub cache vLLM reads.
          command:
            - sh
            - -c
            - >
              pip install --quiet huggingface_hub &&
              python3 -c "from huggingface_hub import snapshot_download;
              snapshot_download('mistralai/Mistral-7B-Instruct-v0.3')"
          env:
            - name: HF_HOME
              value: /cache/huggingface
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: cache
              mountPath: /cache
      volumes:
        - name: cache
          persistentVolumeClaim:
            claimName: model-cache
      restartPolicy: Never
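To confirm the Job actually populated the PVC, a short check can list the hub-style model directories in the cache. A convenience sketch; run it in any pod that mounts the PVC at /cache:

```python
# Sanity check that the pre-download Job populated the cache: the
# HuggingFace hub cache stores each repo as a models--{org}--{name}
# directory, so listing those names shows what is available offline.
from pathlib import Path

def find_cached_models(cache_root="/cache/huggingface"):
    """Recursively find hub-style model directories under the cache."""
    root = Path(cache_root)
    if not root.exists():
        return []
    return sorted({p.name for p in root.rglob("models--*") if p.is_dir()})

print(find_cached_models())
# Expect something like: ['models--mistralai--Mistral-7B-Instruct-v0.3']
```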

OOM on GPU

{
  "gpu_memory_utilization": 0.80,
  "max_model_len": 4096,
  "max_num_seqs": 32,
  "enforce_eager": true
}
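A quick calculation shows how much these reduced settings shrink worst-case KV cache demand (illustrative arithmetic for Mistral-7B; with PagedAttention the worst case is rarely reached, but vLLM still schedules against the available blocks):

```python
# How the reduced settings shrink worst-case KV cache demand.
# Illustrative arithmetic for Mistral-7B; see Step 2 for the defaults.

LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128  # Mistral-7B architecture
kv_bytes_per_token = LAYERS * KV_HEADS * HEAD_DIM * 2 * 2  # K+V, fp16

def worst_case_kv_gb(max_model_len, max_num_seqs):
    # Every sequence at its full context length, simultaneously.
    return max_model_len * max_num_seqs * kv_bytes_per_token / 1e9

before = worst_case_kv_gb(8192, 128)  # original model.json settings
after = worst_case_kv_gb(4096, 32)    # reduced settings above
print(f"worst-case KV: {before:.0f} GB -> {after:.0f} GB")
```

Separately, enforce_eager: true skips CUDA graph capture, trading some latency for additional free GPU memory.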

/dev/shm too small

# vLLM uses shared memory for tensor parallel
# Mount emptyDir with Memory medium
- name: shm
  emptyDir:
    medium: Memory
    sizeLimit: 16Gi  # Increase if using tensor parallelism

Best Practices

  • Mount /dev/shm as Memory emptyDir β€” vLLM needs shared memory for NCCL communication
  • Pre-download models to a PVC β€” avoids download timeouts and repeated downloads across replicas
  • Use enable_chunked_prefill: true β€” improves TTFT (time to first token) for long prompts
  • Set gpu_memory_utilization: 0.85 β€” leave headroom for CUDA context and runtime allocations
  • Use AWQ quantization for larger models β€” 4-bit AWQ fits 70B models on 2x A100
  • Cache HuggingFace models on a shared PVC — set HF_HOME so the hub cache lands on persistent storage

Key Takeaways

  • vLLM on Triton provides zero-compilation LLM serving β€” load HuggingFace models directly
  • PagedAttention enables 2-4x more concurrent sequences than naive KV cache
  • Configure everything in model.json β€” model name, quantization, parallelism, memory limits
  • AWQ/GPTQ quantized models run directly without engine builds β€” great for fitting large models
  • vLLM trades ~10-20% peak throughput vs TensorRT-LLM for dramatically simpler deployment

#triton #vllm #nvidia #inference #llm #gpu #ai
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
