ai · intermediate · ⏱ 15 minutes · K8s 1.28+

Kubernetes LLM Serving Frameworks Compared

Compare vLLM, NVIDIA NIM, Triton, Ollama, and llama.cpp for serving LLMs on Kubernetes — features, performance, and when to use each.

By Luca Berton 📖 5 min read

💡 Quick Answer: Use vLLM for best throughput with simple setup. Use NVIDIA NIM for maximum performance with TensorRT-LLM (but stricter version requirements). Use Ollama for quick local testing. Use Triton for multi-model serving. Use llama.cpp for CPU-only inference.

Choosing the right inference server depends on your model size, hardware, throughput needs, and operational complexity tolerance.

Feature Comparison

| Feature | vLLM | NVIDIA NIM | Triton | Ollama | llama.cpp |
|---|---|---|---|---|---|
| Backend | PyTorch | TensorRT-LLM | Multiple | llama.cpp | llama.cpp |
| API | OpenAI-compatible | OpenAI-compatible | Custom + OpenAI | OpenAI-compatible | OpenAI-compatible |
| GPU Required | Yes | Yes | Depends | Optional | No |
| Quantized Models | AWQ, GPTQ | FP8, INT8 | All | GGUF | GGUF |
| Multi-GPU (TP) | Yes | Yes | Yes | No | Limited |
| Continuous Batching | Yes | Yes | Yes | No | No |
| CUDA Graphs | Optional | Built-in | Optional | No | No |
| Production Ready | Yes | Yes | Yes | Dev/Test | Edge/CPU |
| Ease of Setup | Easy | Medium | Complex | Very Easy | Easy |
| Kubernetes Support | Excellent | Excellent | Excellent | Good | Good |
| License | Apache 2.0 | Proprietary | BSD | MIT | MIT |

Performance Comparison (Mistral-7B, A100-80GB)

| Metric | vLLM | NIM (TRT-LLM) | Ollama |
|---|---|---|---|
| Throughput (tokens/s) | ~2,500 | ~3,500 | ~150 |
| Latency (first token) | ~50 ms | ~30 ms | ~200 ms |
| Startup time | ~15 s | ~60–120 s | ~5 s |
| Memory usage | ~14 GB | ~30 GB (engine) | ~5 GB (Q4) |

Values are approximate and depend on batch size, sequence length, and hardware.

When to Use Each

vLLM — Best General-Purpose Choice

✅ Use when:
  - You want simple, reliable production serving
  - OpenAI-compatible API is essential
  - You need AWQ/GPTQ quantized models
  - You want active open-source community support
  - Fast iteration and deployment cycles

❌ Avoid when:
  - You need absolute maximum throughput (NIM is faster)
  - You're serving on CPU only

Deploy: See Deploy Mistral with vLLM
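
For orientation, a minimal vLLM Deployment looks much like the Ollama and llama.cpp manifests later in this article. This is a sketch, not a tuned production config: it assumes the official vllm/vllm-openai image, a GPU node, and an illustrative model name.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:                        # image entrypoint starts the OpenAI-compatible server
            - --model
            - mistralai/Mistral-7B-v0.1
          ports:
            - containerPort: 8000      # OpenAI-compatible API
          resources:
            limits:
              nvidia.com/gpu: "1"
```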

NVIDIA NIM — Maximum GPU Performance

✅ Use when:
  - Maximum throughput is critical
  - You have A100/H100 GPUs
  - NVIDIA enterprise support is valued
  - You need FP8 quantization (H100)
  - TensorRT-LLM optimization is worth the complexity

❌ Avoid when:
  - Rapid prototyping (slower startup)
  - Version mismatch tolerance is low
  - Non-NVIDIA hardware
  - Open-source licensing is required

Deploy: See Deploy Mistral with NVIDIA NIM
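
A NIM Deployment follows the same shape, with two extras: an image pull secret for nvcr.io and an NGC API key. The image path and tag below are assumptions for illustration — NIM containers live under nvcr.io/nim/, so verify the exact model image in NVIDIA's NGC catalog.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nim-mistral
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nim-mistral
  template:
    metadata:
      labels:
        app: nim-mistral
    spec:
      imagePullSecrets:
        - name: ngc-registry           # docker-registry secret for nvcr.io
      containers:
        - name: nim
          image: nvcr.io/nim/mistralai/mistral-7b-instruct:latest  # illustrative — check NGC
          env:
            - name: NGC_API_KEY        # NIM containers authenticate to NGC at startup
              valueFrom:
                secretKeyRef:
                  name: ngc-api
                  key: NGC_API_KEY
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
```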

Triton Inference Server — Multi-Model Serving

✅ Use when:
  - Serving multiple models simultaneously
  - Mixing LLMs with other model types (CV, NLP, etc.)
  - You need model versioning and A/B testing
  - Dynamic batching across different model types
  - Enterprise multi-model platform

❌ Avoid when:
  - Serving a single LLM (overkill)
  - Simple setup is preferred

Deploy with Helm:

# Install from NVIDIA's Helm chart for Triton (not the container image path);
# repo URL and chart name below are illustrative — check NVIDIA's Triton docs
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install triton-server nvidia/tritonserver \
  --set image.repository=nvcr.io/nvidia/tritonserver \
  --set image.tag=24.01-trtllm-python-py3
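
Triton serves whatever it finds in a model repository. A minimal layout and config.pbtxt, sketched for illustration (the backend and file names depend on how the model was built — "tensorrtllm" here assumes a TensorRT-LLM engine):

```
model_repository/
└── mistral/
    ├── config.pbtxt        # model configuration (below)
    └── 1/                  # version directory
        └── ...             # engine/weight files

# config.pbtxt (minimal)
name: "mistral"
backend: "tensorrtllm"
max_batch_size: 8
```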

Ollama — Quick Testing and Development

✅ Use when:
  - Local development and testing
  - Quick model evaluation
  - No GPU available (CPU inference)
  - Simple chat-style interface needed
  - Running GGUF quantized models

❌ Avoid when:
  - Production serving with SLAs
  - High throughput required
  - Multi-GPU inference needed

Deploy on Kubernetes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: "1"    # Optional — works without GPU too
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc

Pull and query a model once the pod is running:

# Pull and run a model
kubectl exec -it <ollama-pod> -n ai-inference -- ollama pull mistral
kubectl exec -it <ollama-pod> -n ai-inference -- ollama run mistral "Hello!"

# API (OpenAI-compatible)
curl http://ollama:11434/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "prompt": "Hello!", "max_tokens": 32}'
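
The curl call above assumes a Service named ollama resolves inside the cluster; the Deployment alone doesn't create one. A minimal Service matching the Deployment's labels:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai-inference
spec:
  selector:
    app: ollama                # matches the Deployment's pod labels
  ports:
    - port: 11434
      targetPort: 11434
```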

llama.cpp — CPU and Edge Inference

✅ Use when:
  - No GPU available
  - Edge deployment or IoT
  - Minimal dependencies required
  - GGUF quantized models (2-bit to 8-bit)
  - Resource-constrained environments

❌ Avoid when:
  - Throughput matters (GPU options are 10–20× faster)
  - Serving many concurrent users

Deploy on Kubernetes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpp
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-cpp
  template:
    metadata:
      labels:
        app: llama-cpp
    spec:
      containers:
        - name: llama-cpp
          image: ghcr.io/ggerganov/llama.cpp:server
          args:
            - --model
            - /data/mistral-7b-v0.1.Q4_K_M.gguf
            - --host
            - "0.0.0.0"
            - --port
            - "8080"
            - --ctx-size
            - "4096"
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "8Gi"
              cpu: "4"
          volumeMounts:
            - name: model-data
              mountPath: /data
              readOnly: true
      volumes:
        - name: model-data
          persistentVolumeClaim:
            claimName: model-storage-pvc
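
To smoke-test the server, port-forward and hit the OpenAI-compatible chat endpoint. Recent llama.cpp server builds expose /v1/chat/completions; endpoints and flags can differ by version, so treat this as a sketch.

```shell
# Forward the server port locally (assumes the Deployment above)
kubectl port-forward deploy/llama-cpp 8080:8080 -n ai-inference

# In another terminal: OpenAI-compatible chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'
```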

Decision Matrix

Need maximum throughput?
  └─ Yes → NVIDIA NIM or vLLM
  └─ No
      └─ Need GPU?
          └─ Yes → vLLM (simple) or NIM (optimized)
          └─ No → llama.cpp (production) or Ollama (testing)

Serving multiple model types?
  └─ Yes → Triton Inference Server
  └─ No → vLLM or NIM

Air-gapped / disconnected cluster?
  └─ All frameworks work — just pre-load images and model weights

Multi-GPU models (70B+)?
  └─ vLLM or NIM (both support tensor parallelism)
  └─ NOT Ollama or llama.cpp

Model ID Gotchas

Each framework uses different model identification:

| Framework | Model ID Format | Example |
|---|---|---|
| vLLM | Full path | /data/Mistral-7B-v0.1 |
| NIM | Configured name | Mistral-7B-v0.1 |
| Ollama | Short name | mistral |
| llama.cpp | Filename | mistral-7b-v0.1.Q4_K_M.gguf |

Always check /v1/models first to get the exact ID.
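
For example, against any of the OpenAI-compatible servers (host and port here are illustrative):

```shell
# List the model IDs the server actually registered
curl -s http://localhost:8000/v1/models

# The "id" fields in the JSON response are the exact strings
# to pass as "model" in completion requests
```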

#vllm #nvidia-nim #triton #ollama #llama-cpp #llm #comparison #inference #ai-workloads
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
