Kubernetes LLM Serving Frameworks Compared
Compare vLLM, NVIDIA NIM, Triton, Ollama, and llama.cpp for serving LLMs on Kubernetes — features, performance, and when to use each.
💡 Quick Answer: Use vLLM for best throughput with simple setup. Use NVIDIA NIM for maximum performance with TensorRT-LLM (but stricter version requirements). Use Ollama for quick local testing. Use Triton for multi-model serving. Use llama.cpp for CPU-only inference.
Choosing the right inference server depends on your model size, hardware, throughput needs, and operational complexity tolerance.
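The quick answer above can be sketched as a small decision helper. This is an illustration only: the `Requirements` fields and the order in which the rules are checked are assumptions made for the example, not part of any framework.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical inputs; the field names are illustrative only."""
    has_gpu: bool
    production: bool = True
    multi_model: bool = False
    max_throughput: bool = False


def recommend_server(req: Requirements) -> str:
    """Map requirements to one of the five servers, following the quick answer."""
    if req.multi_model:
        return "Triton"      # multi-model serving
    if not req.has_gpu:
        # CPU-only: llama.cpp for serving, Ollama for quick local testing
        return "llama.cpp" if req.production else "Ollama"
    if not req.production:
        return "Ollama"      # quick local testing
    if req.max_throughput:
        return "NVIDIA NIM"  # maximum performance via TensorRT-LLM
    return "vLLM"            # best throughput with simple setup


print(recommend_server(Requirements(has_gpu=True)))
print(recommend_server(Requirements(has_gpu=False)))
print(recommend_server(Requirements(has_gpu=True, max_throughput=True)))
```

Real decisions usually weigh more factors (model size, licensing, team experience), so treat the function as a starting checklist rather than a rule.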
Feature Comparison
| Feature | vLLM | NVIDIA NIM | Triton | Ollama | llama.cpp |
|---|---|---|---|---|---|
| Backend | PyTorch | TensorRT-LLM | Multiple | llama.cpp | llama.cpp |
| API | OpenAI-compatible | OpenAI-compatible | KServe v2 (HTTP/gRPC) + OpenAI frontend | OpenAI-compatible | OpenAI-compatible |
| GPU Required | Yes | Yes | Depends | Optional | No |
| Quantized Models | AWQ, GPTQ | FP8, INT8 | All | GGUF | GGUF |
| Multi-GPU (TP) | Yes | Yes | Yes | No | Limited |
| Continuous Batching | Yes | Yes | Yes | Limited (parallel request slots) | Yes (server mode) |
| CUDA Graphs | Optional | Built-in | Optional | No | No |
| Production Ready | Yes | Yes | Yes | Dev/Test | Edge/CPU |
| Ease of Setup | Easy | Medium | Complex | Very Easy | Easy |
| Kubernetes Support | Excellent | Excellent | Excellent | Good | Good |
| License | Apache 2.0 | Proprietary | BSD | MIT | MIT |
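Because most servers in the table expose an OpenAI-compatible API, the same client code can target any of them by changing only the base URL and model ID. The sketch below builds (but does not send) a `/v1/completions` request using only the standard library. The Service hostnames and ports in `targets` are hypothetical and assume matching Kubernetes Services exist in your cluster.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 32) -> urllib.request.Request:
    """Build a POST to the OpenAI-style /v1/completions endpoint (not sent here)."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical in-cluster Service URLs; only the base URL and model ID differ.
targets = {
    "ollama": ("http://ollama:11434", "mistral"),
    "vllm": ("http://vllm:8000", "/data/Mistral-7B-v0.1"),
}
for name, (base_url, model) in targets.items():
    req = build_completion_request(base_url, model, "Hello!")
    print(name, req.get_method(), req.full_url)
    # To send it: urllib.request.urlopen(req, timeout=30)
```

Keeping request construction separate from sending makes it easy to point the same client at a different backend when you switch frameworks.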
Performance Comparison (Mistral-7B, A100-80GB)
| Metric | vLLM | NIM (TRT-LLM) | Ollama |
|---|---|---|---|
| Throughput (tokens/s) | ~2,500 | ~3,500 | ~150 |
| Latency (first token) | ~50 ms | ~30 ms | ~200 ms |
| Startup time | ~15s | ~60–120s | ~5s |
| Memory usage | ~14 GB (FP16 weights) | ~30 GB (engine) | ~5 GB (Q4) |
Values are approximate and depend on batch size, sequence length, and hardware.
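As a rough sanity check on the memory row, weight memory can be estimated as parameter count times bytes per weight. The sketch below assumes the model is rounded to 7 billion parameters and counts weights only; KV cache, activations, and runtime overhead come on top, which is one reason real usage is higher than these figures.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in decimal GB.

    Excludes KV cache, activations, and runtime overhead.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Rounding the model to 7B parameters (an approximation).
print(f"FP16:  {weight_memory_gb(7, 16):.1f} GB")   # 14.0 GB
print(f"INT8:  {weight_memory_gb(7, 8):.1f} GB")    # 7.0 GB
print(f"4-bit: {weight_memory_gb(7, 4):.1f} GB")    # 3.5 GB
```

The FP16 estimate lines up with the ~14 GB shown for vLLM. The 4-bit estimate is below the ~5 GB shown for Ollama, which is consistent with quantized formats storing a little more than 4 bits per weight plus cache and runtime overhead.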
When to Use Each
vLLM — Best General-Purpose Choice
✅ Use when:
- You want simple, reliable production serving
- OpenAI-compatible API is essential
- You need AWQ/GPTQ quantized models
- You want active open-source community support
- Fast iteration and deployment cycles
❌ Avoid when:
- You need absolute maximum throughput (NIM is faster)
- You're serving on CPU only
Deploy: See Deploy Mistral with vLLM
NVIDIA NIM — Maximum GPU Performance
✅ Use when:
- Maximum throughput is critical
- You have A100/H100 GPUs
- NVIDIA enterprise support is valued
- You need FP8 quantization (H100)
- TensorRT-LLM optimization is worth the complexity
❌ Avoid when:
- Rapid prototyping (slower startup)
- You cannot accommodate its stricter version requirements
- Non-NVIDIA hardware
- Open-source licensing is required
Deploy: See Deploy Mistral with NVIDIA NIM
Triton Inference Server — Multi-Model Serving
✅ Use when:
- Serving multiple models simultaneously
- Mixing LLMs with other model types (CV, NLP, etc.)
- You need model versioning and A/B testing
- Dynamic batching across different model types
- Enterprise multi-model platform
❌ Avoid when:
- Serving a single LLM (overkill)
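- Simple setup is preferred
Deploy with Helm:

    # <triton-chart> is a placeholder: helm install expects a chart (local path,
    # repo/chart, or oci:// URL), not the container image reference itself.
    # The --set keys below must match the chart's values.yaml.
    helm install triton-server <triton-chart> \
      --set image.repository=nvcr.io/nvidia/tritonserver \
      --set image.tag=24.01-trtllm-python-py3

Ollama — Quick Testing and Development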
✅ Use when:
- Local development and testing
- Quick model evaluation
- No GPU available (CPU inference)
- Simple chat-style interface needed
- Running GGUF quantized models
❌ Avoid when:
- Production serving with SLAs
- High throughput required
- Multi-GPU inference needed
Deploy on Kubernetes:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ollama
      namespace: ai-inference
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: ollama
      template:
        metadata:
          labels:
            app: ollama
        spec:
          containers:
            - name: ollama
              image: ollama/ollama:latest
              ports:
                - containerPort: 11434
              resources:
                limits:
                  nvidia.com/gpu: "1"  # Optional: remove this limit to run on CPU
              volumeMounts:
                - name: ollama-data
                  mountPath: /root/.ollama
          volumes:
            - name: ollama-data
              persistentVolumeClaim:
                claimName: ollama-pvc

    # Pull and run a model
    kubectl exec -it -n ai-inference <ollama-pod> -- ollama pull mistral
    kubectl exec -it -n ai-inference <ollama-pod> -- ollama run mistral "Hello!"
    # API (OpenAI-compatible); assumes a Service named "ollama" exposes port 11434
    curl http://ollama:11434/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "mistral", "prompt": "Hello!", "max_tokens": 32}'

llama.cpp — CPU and Edge Inference
✅ Use when:
- No GPU available
- Edge deployment or IoT
- Minimal dependencies required
- GGUF quantized models (2-bit to 8-bit)
- Resource-constrained environments
❌ Avoid when:
- Throughput matters (GPU options are 10–20× faster)
- Serving many concurrent users
Deploy on Kubernetes:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: llama-cpp
      namespace: ai-inference
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: llama-cpp
      template:
        metadata:
          labels:
            app: llama-cpp
        spec:
          containers:
            - name: llama-cpp
              # The project has since moved to the ggml-org organization; if this
              # pull fails, try ghcr.io/ggml-org/llama.cpp:server instead.
              image: ghcr.io/ggerganov/llama.cpp:server
              args:
                - --model
                - /data/mistral-7b-v0.1.Q4_K_M.gguf
                - --host
                - "0.0.0.0"
                - --port
                - "8080"
                - --ctx-size
                - "4096"
              ports:
                - containerPort: 8080
              resources:
                requests:
                  memory: "8Gi"
                  cpu: "4"
              volumeMounts:
                - name: model-data
                  mountPath: /data
                  readOnly: true
          volumes:
            - name: model-data
              persistentVolumeClaim:
                claimName: model-storage-pvc

Decision Matrix

    Need maximum throughput?
    ├─ Yes → NVIDIA NIM or vLLM
    └─ No
       └─ Need GPU?
          ├─ Yes → vLLM (simple) or NIM (optimized)
          └─ No → llama.cpp (production) or Ollama (testing)

    Serving multiple model types?
    ├─ Yes → Triton Inference Server
    └─ No → vLLM or NIM

    Air-gapped / disconnected cluster?
    └─ All frameworks work; just pre-load images and model weights

    Multi-GPU models (70B+)?
    ├─ vLLM or NIM (both support tensor parallelism)
    └─ NOT Ollama or llama.cpp

Model ID Gotchas
Each framework uses different model identification:
| Framework | Model ID Format | Example |
|---|---|---|
| vLLM | Value of --model (path or Hugging Face ID), or --served-model-name if set | /data/Mistral-7B-v0.1 |
| NIM | Configured name | Mistral-7B-v0.1 |
| Ollama | Short name | mistral |
| llama.cpp | Filename | mistral-7b-v0.1.Q4_K_M.gguf |
Always check /v1/models first to get the exact ID.
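A minimal sketch of that check, assuming the server returns the OpenAI-style model list format (an object with a `data` array whose entries carry an `id`). The Service URL in the commented-out call is hypothetical.

```python
import json
import urllib.request


def extract_model_ids(payload: dict) -> list[str]:
    """Return the "id" of each entry in an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str, timeout: float = 10.0) -> list[str]:
    """GET <base_url>/v1/models and return the model IDs the server reports."""
    url = base_url.rstrip("/") + "/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_model_ids(json.load(resp))


# Offline example with a response shaped like the OpenAI model list format.
sample = {"object": "list", "data": [{"id": "mistral", "object": "model"}]}
print(extract_model_ids(sample))  # ['mistral']

# Against a live server (hypothetical Service URL):
# print(fetch_model_ids("http://ollama:11434"))
```

Use the returned string verbatim as the `model` field in later requests; a mismatch typically results in a "model not found" style error.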