NVIDIA Dynamo Distributed Inference
Deploy NVIDIA Dynamo on Kubernetes for disaggregated LLM inference. KV-aware routing, prefill/decode splitting, Grove operator, and zero-config deployment.
💡 Quick Answer: NVIDIA Dynamo is the open-source successor to Triton Inference Server. It orchestrates multi-node LLM inference with disaggregated prefill/decode, KV-aware routing, and SLA-driven autoscaling. Deploy on Kubernetes using the Grove operator and DynamoGraphDeploymentRequest CRD for zero-config model serving.
The Problem
Serving large language models at datacenter scale requires more than a single inference engine on a single GPU. You need to coordinate prefill and decode phases across GPU pools, route requests intelligently to avoid redundant KV cache computation, autoscale to meet latency SLAs, and handle failures without dropping requests. Individual engines (vLLM, SGLang, TensorRT-LLM) optimize single-node execution but lack the orchestration layer for multi-node coordination.
flowchart TB
    CLIENT["Client Requests"] --> FE["Dynamo Frontend<br/>OpenAI-compatible API"]
    FE --> ROUTER["KV-Aware Router<br/>Routes by cache overlap"]
    ROUTER --> PF1["Prefill Worker 1<br/>(GPU Pool A)"]
    ROUTER --> PF2["Prefill Worker 2<br/>(GPU Pool A)"]
    PF1 -->|"KV cache via NIXL"| DC1["Decode Worker 1<br/>(GPU Pool B)"]
    PF2 -->|"KV cache via NIXL"| DC2["Decode Worker 2<br/>(GPU Pool B)"]
    DC1 --> CLIENT
    DC2 --> CLIENT
    PLANNER["SLO Planner"] -.->|"Autoscale"| PF1
    PLANNER -.->|"Autoscale"| DC1
    KVBM["KV Block Manager"] -.->|"GPU→CPU→SSD offload"| DC1
The Solution
What NVIDIA Dynamo Does
Dynamo sits above inference engines. It doesn't replace vLLM, SGLang, or TensorRT-LLM; it coordinates them into a multi-node inference system.
| Component | Function |
|---|---|
| Frontend | OpenAI-compatible API gateway |
| KV-Aware Router | Routes requests based on worker load + KV cache overlap, eliminating redundant prefill |
| Disaggregated Serving | Splits prefill and decode into independently scalable GPU pools |
| NIXL | Low-latency point-to-point KV cache transfer (GPU-to-GPU via NVLink, RDMA) |
| KV Block Manager (KVBM) | Offloads KV cache across GPU → CPU → SSD → remote storage |
| ModelExpress | Streams model weights GPU-to-GPU for 7× faster cold-start |
| Planner | SLA-driven autoscaler: profiles workloads, right-sizes GPU pools |
| Grove | K8s operator for topology-aware gang scheduling (NVL72, multi-rack) |
| AIConfigurator | Simulates 10K+ deployment configs in seconds to find optimal setup |
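The KV Block Manager row above describes offloading across progressively slower tiers. As a rough illustration of that idea only, here is a toy LRU cascade in Python. The class name `TieredKVCache`, its methods, and the block-count capacities are all hypothetical; this is not KVBM's API or eviction policy.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy LRU cascade: a block evicted from one tier is demoted to the next."""

    def __init__(self, capacities):
        # capacities maps tier name -> max number of blocks, fastest tier first,
        # e.g. {"gpu": 2, "cpu": 4, "ssd": 8} (hypothetical sizes).
        self.tiers = [(name, cap, OrderedDict()) for name, cap in capacities.items()]

    def put(self, block_id, data):
        """Insert a block into the fastest tier, demoting LRU blocks as needed."""
        self._remove(block_id)
        self._insert(0, block_id, data)

    def get(self, block_id):
        """Return the block's data and promote it back to the fastest tier."""
        for _, _, store in self.tiers:
            if block_id in store:
                data = store.pop(block_id)
                self._insert(0, block_id, data)
                return data
        return None  # Not in any tier: the caller would have to recompute it.

    def location(self, block_id):
        """Return the name of the tier holding the block, or None."""
        for name, _, store in self.tiers:
            if block_id in store:
                return name
        return None

    def _remove(self, block_id):
        for _, _, store in self.tiers:
            store.pop(block_id, None)

    def _insert(self, level, block_id, data):
        if level >= len(self.tiers):
            return  # Fell off the slowest tier; the block is dropped.
        _, cap, store = self.tiers[level]
        store[block_id] = data
        if len(store) > cap:
            # Demote the least recently used block to the next tier down.
            victim, victim_data = store.popitem(last=False)
            self._insert(level + 1, victim, victim_data)
```

With one slot per tier, inserting four blocks in a row leaves the newest on the "gpu" tier, the two before it on "cpu" and "ssd", and drops the oldest entirely.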
Backend Support Matrix
| Feature | SGLang | TensorRT-LLM | vLLM |
|---|---|---|---|
| Disaggregated Serving | ✅ | ✅ | ✅ |
| KV-Aware Routing | ✅ | ✅ | ✅ |
| SLA-Based Planner | ✅ | ✅ | ✅ |
| KV Block Manager | 🚧 | ✅ | ✅ |
| Multimodal | ✅ | ✅ | ✅ |
| Tool Calling | ✅ | ✅ | ✅ |
Quick Start: Docker (Single Node)
# Pull pre-built container (SGLang backend)
docker run --gpus all --network host --rm -it \
  nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.1
# Inside the container: start frontend and worker
python3 -m dynamo.frontend --http-port 8000 --discovery-backend file &
python3 -m dynamo.sglang --model-path Qwen/Qwen3-0.6B --discovery-backend file &
# Test
curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }' | jq .
Available runtime containers:
- nvcr.io/nvidia/ai-dynamo/sglang-runtime:1.0.1
- nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:1.0.1
- nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.0.1
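The curl test in the quick start can also be issued from Python. The sketch below uses only the standard library and assumes the frontend is reachable at `http://localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` shape. The helper names `build_chat_request` and `chat` are illustrative, not part of Dynamo.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=100):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, max_tokens=100, timeout=60):
    """Send the request and return the text of the first choice."""
    request = build_chat_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]
```

Calling `chat("http://localhost:8000", "Qwen/Qwen3-0.6B", "Hello!")` against a running frontend should return the model's reply as a string.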
Zero-Config Kubernetes Deployment
The simplest way to deploy on K8s is to specify the model, backend, and SLA targets:
# dynamo-deploy.yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
metadata:
  name: llama-70b-service
spec:
  model: meta-llama/Llama-3.1-70B-Instruct
  backend: vllm
  sla:
    ttft: 200.0  # Time to first token (ms)
    itl: 20.0    # Inter-token latency (ms)
  autoApply: true  # AIConfigurator auto-profiles and deploys
kubectl apply -f dynamo-deploy.yaml
Dynamo automatically:
- Profiles the workload with AIConfigurator
- Selects optimal topology (aggregated vs disaggregated, TP, PP)
- Deploys frontend, router, prefill workers, and decode workers
- Monitors SLAs through the Planner and autoscales GPU pools
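The `ttft` and `itl` targets in the manifest are per-request latency metrics. As a minimal sketch of what they measure, the functions below derive both from token arrival timestamps for one streamed response. The function names and the default thresholds (copied from the manifest values) are illustrative assumptions, not Dynamo code.

```python
def latency_metrics(request_sent_s, token_times_s):
    """Return (ttft_ms, mean_itl_ms) for one streamed response.

    request_sent_s: time the request was sent, in seconds.
    token_times_s: arrival time of each generated token, in seconds.
    """
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    # Time to first token: delay between sending and the first token.
    ttft_ms = (token_times_s[0] - request_sent_s) * 1000.0
    # Inter-token latency: mean gap between consecutive tokens.
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    mean_itl_ms = (sum(gaps) / len(gaps)) * 1000.0 if gaps else 0.0
    return ttft_ms, mean_itl_ms


def meets_sla(ttft_ms, itl_ms, ttft_target_ms=200.0, itl_target_ms=20.0):
    """Check one request's metrics against the TTFT and ITL targets."""
    return ttft_ms <= ttft_target_ms and itl_ms <= itl_target_ms
```

A production SLA check would look at a percentile such as P99 across many requests rather than a single response.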
Manual Kubernetes Deployment with Grove
For full control over the deployment topology:
# Install Grove operator (prerequisite)
# Grove handles topology-aware gang scheduling
helm repo add grove https://ai-dynamo.github.io/grove
helm install grove grove/grove-operator -n dynamo-system --create-namespace
Disaggregated Prefill/Decode Deployment
# dynamo-disaggregated.yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraph
metadata:
  name: llama-70b-disagg
  namespace: inference
spec:
  model: meta-llama/Llama-3.1-70B-Instruct
  frontend:
    replicas: 2
    port: 8000
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
  router:
    type: kv-aware  # Routes based on KV cache overlap
    replicas: 2
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
  prefill:
    backend: sglang
    replicas: 4
    tensorParallelSize: 4
    resources:
      limits:
        nvidia.com/gpu: 4
    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token
            key: HF_TOKEN
  decode:
    backend: sglang
    replicas: 8
    tensorParallelSize: 1
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token
            key: HF_TOKEN
  planner:
    enabled: true
    sla:
      ttft: 200.0
      itl: 20.0
    minPrefillReplicas: 2
    maxPrefillReplicas: 8
    minDecodeReplicas: 4
    maxDecodeReplicas: 16
  kvCache:
    nixl:
      enabled: true  # Low-latency KV transfer between prefill → decode
    kvbm:
      enabled: true
      tiers:
        - type: gpu  # Hot tier
        - type: cpu  # Warm tier
          maxSizeGi: 64
        - type: ssd  # Cold tier
          maxSizeGi: 500
Aggregated Deployment (Simpler)
When you don't need disaggregated serving:
apiVersion: nvidia.com/v1beta1
kind: DynamoGraph
metadata:
  name: llama-8b-agg
  namespace: inference
spec:
  model: meta-llama/Llama-3.1-8B-Instruct
  frontend:
    replicas: 1
    port: 8000
  router:
    type: kv-aware
    replicas: 1
  workers:
    backend: vllm
    replicas: 4
    tensorParallelSize: 1
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token
            key: HF_TOKEN
  planner:
    enabled: true
    sla:
      ttft: 100.0
      itl: 10.0
Pre-Built Recipes
Dynamo ships tested recipes for common models:
| Model | Backend | Mode | GPUs |
|---|---|---|---|
| Llama 3 70B | vLLM | Aggregated | 4× H100 |
| DeepSeek-R1 | SGLang | Disaggregated | 8× H100 (multinode) |
| Qwen3-32B-FP8 | TensorRT-LLM | Aggregated | 1× H100 |
# Clone and deploy a recipe
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo/recipes/llama-3-70b/vllm
kubectl apply -f .
KV-Aware Routing
The KV-aware router eliminates redundant prefill computation by routing requests to workers that already have relevant KV cache:
Request: "Summarize the following document: <long context>"
├─ Worker A: Has 80% of this context in KV cache → Route here (skip 80% prefill)
├─ Worker B: Has 20% of this context → Don't route here
└─ Worker C: Has 0% → Only if A is overloaded
This reportedly delivers 2× faster time to first token in production workloads.
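The routing decision in the Worker A/B/C example can be sketched as a cost comparison: prefer the worker with the least prefill left to do, unless it is already heavily loaded. The code below is a toy model under stated assumptions, not Dynamo's router. The block-hash representation, the `active_blocks` load signal, and the `load_weight` parameter are all hypothetical.

```python
def overlap_blocks(request_blocks, cached_blocks):
    """Count leading request blocks that the worker already has cached."""
    matched = 0
    for block in request_blocks:
        if block not in cached_blocks:
            break  # Prefix reuse stops at the first missing block.
        matched += 1
    return matched


def pick_worker(request_blocks, workers, load_weight=1.0):
    """Pick the worker with the lowest estimated cost.

    workers: {name: {"cached": set of block hashes, "active_blocks": int}}
    cost = blocks still to prefill + load_weight * blocks already being served
    """
    def cost(name):
        state = workers[name]
        reused = overlap_blocks(request_blocks, state["cached"])
        missing = len(request_blocks) - reused
        return missing + load_weight * state["active_blocks"]

    return min(workers, key=cost)
```

With ten request blocks, a worker that caches the first eight wins when all workers are idle. Once its `active_blocks` grows large enough, an idle worker with no cache overlap becomes the cheaper choice, which matches the "only if A is overloaded" case.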
SLA-Driven Autoscaling (Planner)
The Planner monitors real-time latency metrics and adjusts GPU allocation:
SLA: TTFT < 200ms, ITL < 20ms
Current state:
  Prefill workers: 4 (P99 TTFT = 180ms) → OK
  Decode workers: 6 (P99 ITL = 25ms) → BREACH
Planner action:
  Scale decode workers: 6 → 8
  Result: P99 ITL drops to 15ms → Within SLA
The Planner reportedly achieves 80% fewer SLA breaches at 5% lower TCO compared to static provisioning.
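One simple rule that reproduces the scaling step in the example is to grow a pool in proportion to how far the observed latency overshoots its target. The sketch below assumes that rule. The real Planner profiles workloads and may use a different algorithm entirely, and the function name and parameters here are hypothetical.

```python
import math


def target_replicas(current, observed_ms, sla_ms, min_replicas, max_replicas):
    """Scale up proportionally to the SLA overshoot; hold steady otherwise."""
    if observed_ms <= sla_ms:
        desired = current  # Within SLA: this sketch never scales down.
    else:
        desired = math.ceil(current * observed_ms / sla_ms)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))


# Decode pool from the example: 6 workers, P99 ITL 25 ms against a 20 ms target.
decode = target_replicas(6, 25.0, 20.0, min_replicas=4, max_replicas=16)
# Prefill pool from the example: 4 workers, P99 TTFT 180 ms against 200 ms.
prefill = target_replicas(4, 180.0, 200.0, min_replicas=2, max_replicas=8)
```

Here `decode` comes out to 8 and `prefill` stays at 4. The bounds play the role of the min/max replica settings in the planner configuration.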
NIXL: Low-Latency KV Transfer
NIXL (NVIDIA Inference Xfer Library) handles KV cache transfer between prefill and decode workers:
Prefill Worker → [NIXL via NVLink/RDMA] → Decode Worker
Transfer methods (fastest to slowest; approximate peak rates that vary by hardware generation):
1. NVLink (intra-node): ~900 GB/s
2. InfiniBand RDMA (inter-node): ~400 Gb/s per port (~50 GB/s)
3. RoCE (inter-node): ~200 Gb/s per port (~25 GB/s)
4. TCP (fallback): ~25 Gb/s (~3 GB/s)
ModelExpress: Fast Cold Start
ModelExpress streams model weights from running instances to new replicas via NIXL:
Traditional cold start: Download from storage → Load to GPU (120s)
ModelExpress cold start: Stream from neighbor GPU → Load (17s) = 7× faster
Service Discovery on Kubernetes
On Kubernetes, Dynamo uses native service discovery, so neither etcd nor NATS is required for discovery:
| Deployment | etcd | NATS | Notes |
|---|---|---|---|
| Local dev | ❌ | ❌ | Use --discovery-backend file |
| Kubernetes | ❌ | ❌ | K8s CRDs + EndpointSlices |
| KV-Aware Routing | ❌ | ✅ | NATS needed for prefix caching coordination |
| Slurm | ✅ | ✅ | Both required |
Cloud-Specific Guides
- AWS EKS: dynamo/examples/deployments/EKS/
- Google GKE: dynamo/examples/deployments/GKE/
Benchmarking with AIPerf
# Install
pip install "ai-dynamo[sglang]"
# Benchmark your deployment
python3 -m dynamo.aiperf \
  --endpoint http://dynamo-frontend:8000 \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --concurrency 32 \
  --duration 300 \
  --output results.json
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| KV cache transfer slow | TCP fallback instead of RDMA | Configure NIXL with InfiniBand/RoCE; check NCCL_IB_DISABLE |
| Decode workers starved | Prefill consuming all GPUs | Enable disaggregated serving with separate GPU pools for prefill/decode |
| Cold start too slow | Downloading model from storage | Enable ModelExpress for GPU-to-GPU weight streaming |
| SLA breaches under load | Static GPU allocation | Enable Planner with TTFT/ITL targets for automatic scaling |
| Router not distributing evenly | KV-aware routing without NATS | Deploy NATS (nats-server -js) for prefix caching coordination |
| Grove scheduling suboptimal | Missing topology labels | Ensure nodes have NVLink/NUMA topology labels for Grove |
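For the "KV cache transfer slow" row, it helps to estimate how much KV data a single request moves from prefill to decode. The back-of-the-envelope sketch below assumes a Llama-3.1-70B shape of 80 layers, 8 KV heads (grouped-query attention), head dimension 128, and FP16 values. The function names are illustrative, and the timing ignores latency, protocol overhead, and tensor-parallel sharding.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache per token: one K and one V vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def transfer_seconds(num_tokens, bytes_per_token, bandwidth_gb_per_s):
    """Idealized transfer time at a given bandwidth in GB/s (10**9 bytes/s)."""
    return num_tokens * bytes_per_token / (bandwidth_gb_per_s * 1e9)


# Assumed model shape: 80 layers, 8 KV heads, head_dim 128, FP16 (2 bytes).
per_token = kv_bytes_per_token(80, 8, 128)
# KV cache for an 8192-token prompt.
prompt_bytes = 8192 * per_token

fast_s = transfer_seconds(8192, per_token, 900)  # NVLink-class bandwidth
slow_s = transfer_seconds(8192, per_token, 25)   # a much slower 25 GB/s link
```

Under these assumptions the prompt's KV cache is roughly 2.7 GB, which takes about 3 ms at 900 GB/s but over 100 ms at 25 GB/s. That gap is why falling back to a slow transport can dominate time to first token in disaggregated mode.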
Best Practices
- Start aggregated, move to disaggregated: disaggregation adds complexity; only split when prefill is the bottleneck
- Always use KV-aware routing: free performance gain even in aggregated mode
- Set realistic SLA targets: the Planner optimizes for your targets, and overly aggressive targets lead to over-provisioning
- Enable KVBM tiering: GPU → CPU → SSD offloading extends effective context length
- Use ModelExpress for autoscaling: 7× faster cold-start means faster scale-out
- Benchmark before production: use AIPerf to validate topology choices
- Pin to Dynamo 1.0.1+: production-ready release with all core features
Key Takeaways
- NVIDIA Dynamo is the open-source successor to Triton, built for datacenter-scale LLM inference
- Disaggregated serving splits prefill and decode into independently scalable GPU pools
- KV-aware routing eliminates redundant prefill computation for 2× faster TTFT
- The SLA Planner autoscales GPU pools to meet latency targets at minimum cost
- Grove operator enables topology-aware gang scheduling on Kubernetes (NVL72, multi-rack)
- Zero-config deployment via DynamoGraphDeploymentRequest CRD: specify model + SLA, Dynamo does the rest
- Works with all major backends: SGLang, TensorRT-LLM, and vLLM

