πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
ai advanced ⏱ 40 minutes K8s 1.28+

Run:ai + Dynamo Multi-Node Scheduling on K8s

Deploy NVIDIA Dynamo with Run:ai v2.23 for gang scheduling and topology-aware placement. Atomic pod launches, zone co-location, and disaggregated inference.

By Luca Berton β€’ β€’ πŸ“– 7 min read

πŸ’‘ Quick Answer: NVIDIA Run:ai v2.23 integrates with Dynamo for gang scheduling (all-or-nothing pod launches) and topology-aware placement (co-locate prefill/decode workers in the same zone/rack). Annotate workloads with kai.scheduler/topology-preferred-placement and kai.scheduler/topology β€” Run:ai handles the rest automatically.

The Problem

Dynamo’s disaggregated inference creates tightly coupled components β€” frontends, routers, prefill workers, and decode workers β€” that must run together. Standard Kubernetes scheduling treats each pod independently, leading to:

  • Partial deployments: Decode pods running while prefill pods are pending β†’ idle GPUs
  • Poor placement: Leaders and workers spread across distant racks β†’ high latency, NCCL bottlenecks
  • Resource fragmentation: Half-deployed workloads consuming resources that could serve complete workloads

You need an orchestration layer that understands multi-component inference workloads and places them together on nodes with the fastest interconnect.

flowchart TB
    subgraph RunAI["Run:ai v2.23 Scheduler"]
        GANG["Gang Scheduling<br/>All-or-nothing launch"]
        TOPO["Topology-Aware Placement<br/>Same zone β†’ same rack β†’ same node"]
    end
    subgraph Dynamo["Dynamo Workload"]
        FE["Frontend"] 
        RT["Router"]
        PF["Prefill Workers<br/>(GPU Pool)"]
        DC["Decode Workers<br/>(GPU Pool)"]
    end
    RunAI -->|"Atomic deploy"| Dynamo
    GANG -->|"All pods or none"| FE
    GANG --> RT
    GANG --> PF
    GANG --> DC
    TOPO -->|"Same zone/rack"| PF
    TOPO -->|"Same zone/rack"| DC

The Solution

Prerequisites

  • Kubernetes cluster with Run:ai v2.23+ installed
  • A Run:ai project initialized (e.g., runai-project-a)
  • kubeconfig access and Helm installed
  • Nodes labeled with topology indicators

Step 1: Create HuggingFace Token Secret

kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN='<your-huggingface-token>' \
  -n runai-project-a

Step 2: Label Nodes with Topology

Ensure nodes have standard Kubernetes topology labels:

# Check existing labels
kubectl get nodes --show-labels | grep topology

# If needed, label nodes manually
kubectl label node gpu-node-01 topology.kubernetes.io/zone=us-west-1a
kubectl label node gpu-node-02 topology.kubernetes.io/zone=us-west-1a
kubectl label node gpu-node-03 topology.kubernetes.io/zone=us-west-1b
kubectl label node gpu-node-04 topology.kubernetes.io/zone=us-west-1b

# Region labels for broader placement
kubectl label node gpu-node-01 topology.kubernetes.io/region=us-west

Step 3: Configure Network Topology in Run:ai

In the Run:ai UI:

  1. Open Cluster Settings
  2. Add topology label keys:
    • topology.kubernetes.io/zone (closest)
    • topology.kubernetes.io/region (farthest)
  3. Create a topology (e.g., topology-1) ordering keys from closest β†’ farthest
  4. Attach the topology to the relevant node pool(s)

Run:ai applies a β€œpreferred” soft constraint at the closest tier first. If the cluster can’t place the entire gang at that level, it relaxes to broader tiers. Combined with gang scheduling, pods land together at the best-available proximity β€” or wait until they can.

Step 4: Install Dynamo Platform Components

# Set environment variables
export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
export NAMESPACE=dynamo-cloud
export RELEASE_VERSION=0.5.1

# Create namespace
kubectl create namespace $NAMESPACE

# Install CRDs
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-$RELEASE_VERSION.tgz
helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz \
  --namespace $NAMESPACE

# Install platform components
helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$RELEASE_VERSION.tgz
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
  --namespace $NAMESPACE \
  --set dynamo-operator.namespaceRestriction.enabled=false

# Verify
kubectl -n $NAMESPACE get pods

Step 5: Deploy Disaggregated Dynamo Workload

Download the Dynamo disaggregated example YAML and add Run:ai annotations:

# disagg.yaml β€” Dynamo disaggregated inference with Run:ai scheduling
apiVersion: nvidia.com/v1beta1
kind: DynamoGraph
metadata:
  name: vllm-disagg
  namespace: runai-project-a
  annotations:
    # Topology-aware placement: prefer pods in the same zone
    kai.scheduler/topology-preferred-placement: "topology.kubernetes.io/zone"
    kai.scheduler/topology: "topology-1"
    # For strict placement (pods MUST be in the same zone):
    # kai.scheduler/topology-required-placement: "topology.kubernetes.io/zone"
spec:
  model: Qwen/Qwen3-0.6B

  frontend:
    replicas: 1
    port: 8000
    image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1

  router:
    type: kv-aware
    replicas: 1

  prefill:
    backend: vllm
    replicas: 1
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token-secret
            key: HF_TOKEN

  decode:
    backend: vllm
    replicas: 1
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token-secret
            key: HF_TOKEN
kubectl apply -f disagg.yaml

Step 6: Verify Deployment

# Check pod status β€” all should start atomically (gang scheduling)
kubectl -n runai-project-a get pods
# NAME                                          READY   STATUS    AGE
# vllm-disagg-frontend-79f459c95-57fm6          1/1     Running   30m
# vllm-disagg-vllmdecodeworker-6c8d64f569-56phf 1/1     Running   30m
# vllm-disagg-vllmprefillworker-755cb88fcf-pflb5 1/1    Running   30m

# Verify topology placement β€” pods should be in the same zone
kubectl -n runai-project-a get pods -o wide
# Check NODE column β€” all pods should be on nodes in the same zone

Step 7: Test Inference

# Port-forward the frontend
kubectl -n runai-project-a port-forward \
  deployment/vllm-disagg-frontend 8000:8000

# Send a request
curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is Kubernetes?"}],
    "max_tokens": 100
  }' | jq .

Large-Scale Production Example

For a production Llama 70B deployment across multiple nodes:

apiVersion: nvidia.com/v1beta1
kind: DynamoGraph
metadata:
  name: llama-70b-prod
  namespace: runai-project-a
  annotations:
    kai.scheduler/topology-preferred-placement: "topology.kubernetes.io/zone"
    kai.scheduler/topology: "topology-1"
spec:
  model: meta-llama/Llama-3.1-70B-Instruct

  frontend:
    replicas: 2
    port: 8000

  router:
    type: kv-aware
    replicas: 2

  prefill:
    backend: vllm
    replicas: 4
    tensorParallelSize: 4
    resources:
      limits:
        nvidia.com/gpu: 4
    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token-secret
            key: HF_TOKEN

  decode:
    backend: vllm
    replicas: 8
    tensorParallelSize: 1
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token-secret
            key: HF_TOKEN

With Run:ai gang scheduling, all 16 pods (2 frontend + 2 router + 4 prefill + 8 decode) either start together or wait. No more idle GPUs from partial deployments.

Scheduling Annotations Reference

AnnotationValueEffect
kai.scheduler/topology-preferred-placementLabel key (e.g., topology.kubernetes.io/zone)Soft constraint: try to place in same zone, relax if needed
kai.scheduler/topology-required-placementLabel keyHard constraint: pods MUST be in same zone, or wait
kai.scheduler/topologyTopology name (e.g., topology-1)Which topology definition to use

How Gang Scheduling Works

Without gang scheduling:
  t=0:  Prefill pod 1 starts β†’ 1 GPU idle
  t=5:  Prefill pod 2 starts β†’ 2 GPUs idle
  t=30: Decode pod 1 starts  β†’ still waiting for frontend
  t=60: Frontend starts       β†’ NOW serving (60s wasted)

With Run:ai gang scheduling:
  t=0:  All resources available? No β†’ WAIT
  t=45: All resources available? Yes β†’ ALL pods start atomically
  Result: 0 idle GPUs, 0 partial deployments

How Topology-Aware Placement Works

Cluster topology:
  Zone A (rack 1): node-01, node-02 (NVLink between GPUs)
  Zone A (rack 2): node-03, node-04
  Zone B (rack 1): node-05, node-06

Preferred placement: topology.kubernetes.io/zone

Run:ai scheduler:
  1. Try Zone A, rack 1: Can fit all pods? Yes β†’ Place here βœ…
     Prefill + Decode on node-01 and node-02 (fastest NVLink/IB)
  2. Fallback: Zone A (any rack) if rack 1 too small
  3. Fallback: Any zone if Zone A is full

Common Issues

IssueCauseFix
All pods stuck in PendingGang scheduling waiting for full resource availabilityCheck kubectl describe pod β€” need enough GPUs for ALL components simultaneously
Pods in different zonesUsing preferred placement, not requiredSwitch to kai.scheduler/topology-required-placement for strict co-location
Topology not appliedTopology not attached to node poolIn Run:ai UI, attach the topology to the correct node pool
Node labels missingNodes not labeled with topology keysLabel nodes: kubectl label node <node> topology.kubernetes.io/zone=<zone>
Dynamo operator not foundCRDs not installedInstall dynamo-crds Helm chart first
Partial deployment (old behavior)Run:ai version < 2.23Upgrade to Run:ai v2.23+ for Dynamo integration

Best Practices

  • Use preferred placement by default β€” strict required placement may cause unnecessary waiting
  • Label nodes consistently β€” use standard topology.kubernetes.io/zone and topology.kubernetes.io/region labels
  • Match topology to physical layout β€” zones should represent racks or switch domains with fastest interconnect
  • Size GPU pools for complete workloads β€” gang scheduling needs enough resources for all components at once
  • Monitor with Run:ai dashboard β€” track GPU utilization, queue times, and topology placement decisions
  • Start with small models β€” validate gang scheduling + topology with Qwen 0.6B before scaling to 70B+

Key Takeaways

  • Run:ai v2.23 integrates with Dynamo for gang scheduling (atomic all-or-nothing pod launches) and topology-aware placement (zone/rack co-location)
  • Gang scheduling eliminates partial deployments and idle GPUs from incomplete workloads
  • Topology-aware placement minimizes cross-node latency by co-locating prefill and decode workers
  • Two annotations control everything: topology-preferred-placement and topology name
  • Combined with Dynamo’s disaggregated serving and KV-aware routing, this delivers predictable, low-latency multi-node inference at scale
#nvidia-dynamo #run-ai #gang-scheduling #topology-aware #distributed-inference
Luca Berton
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.

Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens