ai · intermediate · ⏱ 15 minutes · K8s 1.28+

Kubernetes LLM Serving Frameworks Compared

Compare vLLM, NVIDIA NIM, Triton, Ollama, and llama.cpp for serving LLMs on Kubernetes — features, performance, and when to use each.

By Luca Berton 📖 5 min read

💡 Quick Answer: Use vLLM for best throughput with simple setup. Use NVIDIA NIM for maximum performance with TensorRT-LLM (but stricter version requirements). Use Ollama for quick local testing. Use Triton for multi-model serving. Use llama.cpp for CPU-only inference.

Choosing the right inference server depends on your model size, hardware, throughput needs, and operational complexity tolerance.

Feature Comparison

| Feature | vLLM | NVIDIA NIM | Triton | Ollama | llama.cpp |
|---|---|---|---|---|---|
| Backend | PyTorch | TensorRT-LLM | Multiple | llama.cpp | llama.cpp |
| API | OpenAI-compatible | OpenAI-compatible | Custom + OpenAI | OpenAI-compatible | OpenAI-compatible |
| GPU Required | Yes | Yes | Depends | Optional | No |
| Quantized Models | AWQ, GPTQ | FP8, INT8 | All | GGUF | GGUF |
| Multi-GPU (TP) | Yes | Yes | Yes | No | Limited |
| Continuous Batching | Yes | Yes | Yes | No | No |
| CUDA Graphs | Optional | Built-in | Optional | No | No |
| Production Ready | Yes | Yes | Yes | Dev/Test | Edge/CPU |
| Ease of Setup | Easy | Medium | Complex | Very Easy | Easy |
| Kubernetes Support | Excellent | Excellent | Excellent | Good | Good |
| License | Apache 2.0 | Proprietary | BSD | MIT | MIT |

Performance Comparison (Mistral-7B, A100-80GB)

| Metric | vLLM | NIM (TRT-LLM) | Ollama |
|---|---|---|---|
| Throughput (tokens/s) | ~2,500 | ~3,500 | ~150 |
| Latency (first token) | ~50 ms | ~30 ms | ~200 ms |
| Startup time | ~15 s | ~60–120 s | ~5 s |
| Memory usage | ~14 GB | ~30 GB (engine) | ~5 GB (Q4) |

Values are approximate and depend on batch size, sequence length, and hardware.

When to Use Each

vLLM — Best General-Purpose Choice

✅ Use when:
  - You want simple, reliable production serving
  - OpenAI-compatible API is essential
  - You need AWQ/GPTQ quantized models
  - You want active open-source community support
  - Fast iteration and deployment cycles

❌ Avoid when:
  - You need absolute maximum throughput (NIM is faster)
  - You're serving on CPU only

Deploy: See Deploy Mistral with vLLM
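
For orientation, a minimal vLLM Deployment looks much like the Ollama and llama.cpp manifests later in this article. This is a sketch, not a tuned production config: it assumes the official vllm/vllm-openai image, a GPU node, and an illustrative model name.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:                        # image entrypoint starts the OpenAI-compatible server
            - --model
            - mistralai/Mistral-7B-v0.1
          ports:
            - containerPort: 8000      # OpenAI-compatible API
          resources:
            limits:
              nvidia.com/gpu: "1"
```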

NVIDIA NIM — Maximum GPU Performance

✅ Use when:
  - Maximum throughput is critical
  - You have A100/H100 GPUs
  - NVIDIA enterprise support is valued
  - You need FP8 quantization (H100)
  - TensorRT-LLM optimization is worth the complexity

❌ Avoid when:
  - Rapid prototyping (slower startup)
  - Version mismatch tolerance is low
  - Non-NVIDIA hardware
  - Open-source licensing is required

Deploy: See Deploy Mistral with NVIDIA NIM
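
A NIM Deployment follows the same shape, with two extras: an image pull secret for nvcr.io and an NGC API key. The image path and tag below are assumptions for illustration — NIM containers live under nvcr.io/nim/, so verify the exact model image in NVIDIA's NGC catalog.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nim-mistral
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nim-mistral
  template:
    metadata:
      labels:
        app: nim-mistral
    spec:
      imagePullSecrets:
        - name: ngc-registry           # docker-registry secret for nvcr.io
      containers:
        - name: nim
          image: nvcr.io/nim/mistralai/mistral-7b-instruct:latest  # illustrative — check NGC
          env:
            - name: NGC_API_KEY        # NIM containers authenticate to NGC at startup
              valueFrom:
                secretKeyRef:
                  name: ngc-api
                  key: NGC_API_KEY
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
```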

Triton Inference Server — Multi-Model Serving

✅ Use when:
  - Serving multiple models simultaneously
  - Mixing LLMs with other model types (CV, NLP, etc.)
  - You need model versioning and A/B testing
  - Dynamic batching across different model types
  - Enterprise multi-model platform

❌ Avoid when:
  - Serving a single LLM (overkill)
  - Simple setup is preferred

Deploy with Helm:

# Install from NVIDIA's Helm chart for Triton (not the container image path);
# repo URL and chart name below are illustrative — check NVIDIA's Triton docs
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install triton-server nvidia/tritonserver \
  --set image.repository=nvcr.io/nvidia/tritonserver \
  --set image.tag=24.01-trtllm-python-py3
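
Triton serves whatever it finds in a model repository. A minimal layout and config.pbtxt, sketched for illustration (the backend and file names depend on how the model was built — "tensorrtllm" here assumes a TensorRT-LLM engine):

```
model_repository/
└── mistral/
    ├── config.pbtxt        # model configuration (below)
    └── 1/                  # version directory
        └── ...             # engine/weight files

# config.pbtxt (minimal)
name: "mistral"
backend: "tensorrtllm"
max_batch_size: 8
```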

Ollama — Quick Testing and Development

✅ Use when:
  - Local development and testing
  - Quick model evaluation
  - No GPU available (CPU inference)
  - Simple chat-style interface needed
  - Running GGUF quantized models

❌ Avoid when:
  - Production serving with SLAs
  - High throughput required
  - Multi-GPU inference needed

Deploy on Kubernetes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: "1"    # Optional — works without GPU too
          volumeMounts:
            - name: ollama-data
              mountPath: /root/.ollama
      volumes:
        - name: ollama-data
          persistentVolumeClaim:
            claimName: ollama-pvc

Pull and query a model once the pod is running:

# Pull and run a model
kubectl exec -it <ollama-pod> -n ai-inference -- ollama pull mistral
kubectl exec -it <ollama-pod> -n ai-inference -- ollama run mistral "Hello!"

# API (OpenAI-compatible)
curl http://ollama:11434/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "prompt": "Hello!", "max_tokens": 32}'
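
The curl call above assumes a Service named ollama resolves inside the cluster; the Deployment alone doesn't create one. A minimal Service matching the Deployment's labels:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: ai-inference
spec:
  selector:
    app: ollama                # matches the Deployment's pod labels
  ports:
    - port: 11434
      targetPort: 11434
```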

llama.cpp — CPU and Edge Inference

✅ Use when:
  - No GPU available
  - Edge deployment or IoT
  - Minimal dependencies required
  - GGUF quantized models (2-bit to 8-bit)
  - Resource-constrained environments

❌ Avoid when:
  - Throughput matters (GPU options are 10–20× faster)
  - Serving many concurrent users

Deploy on Kubernetes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-cpp
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama-cpp
  template:
    metadata:
      labels:
        app: llama-cpp
    spec:
      containers:
        - name: llama-cpp
          image: ghcr.io/ggerganov/llama.cpp:server
          args:
            - --model
            - /data/mistral-7b-v0.1.Q4_K_M.gguf
            - --host
            - "0.0.0.0"
            - --port
            - "8080"
            - --ctx-size
            - "4096"
          ports:
            - containerPort: 8080
          resources:
            requests:
              memory: "8Gi"
              cpu: "4"
          volumeMounts:
            - name: model-data
              mountPath: /data
              readOnly: true
      volumes:
        - name: model-data
          persistentVolumeClaim:
            claimName: model-storage-pvc
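
To smoke-test the server, port-forward and hit the OpenAI-compatible chat endpoint. Recent llama.cpp server builds expose /v1/chat/completions; endpoints and flags can differ by version, so treat this as a sketch.

```shell
# Forward the server port locally (assumes the Deployment above)
kubectl port-forward deploy/llama-cpp 8080:8080 -n ai-inference

# In another terminal: OpenAI-compatible chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'
```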

Decision Matrix

Need maximum throughput?
  └─ Yes → NVIDIA NIM or vLLM
  └─ No
      └─ Need GPU?
          └─ Yes → vLLM (simple) or NIM (optimized)
          └─ No → llama.cpp (production) or Ollama (testing)

Serving multiple model types?
  └─ Yes → Triton Inference Server
  └─ No → vLLM or NIM

Air-gapped / disconnected cluster?
  └─ All frameworks work — just pre-load images and model weights

Multi-GPU models (70B+)?
  └─ vLLM or NIM (both support tensor parallelism)
  └─ NOT Ollama or llama.cpp

Model ID Gotchas

Each framework uses different model identification:

| Framework | Model ID Format | Example |
|---|---|---|
| vLLM | Full path | /data/Mistral-7B-v0.1 |
| NIM | Configured name | Mistral-7B-v0.1 |
| Ollama | Short name | mistral |
| llama.cpp | Filename | mistral-7b-v0.1.Q4_K_M.gguf |

Always check /v1/models first to get the exact ID.
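
For example, against any of the OpenAI-compatible servers (host and port here are illustrative):

```shell
# List the model IDs the server actually registered
curl -s http://localhost:8000/v1/models

# The "id" fields in the JSON response are the exact strings
# to pass as "model" in completion requests
```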

#vllm #nvidia-nim #triton #ollama #llama-cpp #llm #comparison #inference #ai-workloads
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
