Run:ai Topology-Aware Scheduling Deep Dive
Configure Run:ai topology-aware scheduling for distributed AI workloads. Multi-level hierarchies, required vs preferred placement, LeaderWorkerSet.
π‘ Quick Answer: Run:ai topology-aware scheduling uses Kubernetes node labels to represent your clusterβs physical network hierarchy (region β zone β block β rack β hostname). The scheduler evaluates the full workload as a unit and places all pods at the closest available topology level β eliminating cross-rack fragmentation that pod affinity causes.
The Problem
Distributed AI workloads (training and inference) involve tightly coupled pods that communicate constantly via NCCL, RDMA, or NVLink. Standard Kubernetes scheduling (including pod affinity) places pods one by one, checking βclosenessβ to already-placed pods without seeing the full picture. This leads to:
- Workload fragmentation: Pods split across racks, introducing unnecessary cross-rack latency
- Bandwidth bottlenecks: Cross-switch communication at 25-100 Gb/s instead of NVLink at 900 GB/s
- Wasted GPU hours: Training jobs take 2-3Γ longer due to communication overhead
flowchart TB
subgraph BAD["β Pod Affinity (per-pod decisions)"]
direction LR
RA1["Rack A<br/>4 pods placed"]
RB1["Rack B<br/>2 pods placed"]
RA1 -.->|"Cross-rack<br/>latency"| RB1
end
subgraph GOOD["β
Topology-Aware (full-workload decision)"]
direction LR
RA2["Rack A<br/>All 6 pods placed together"]
endThe Solution
How Topology-Aware Scheduling Works
Run:ai evaluates the entire workload against available nodes before placing any pod. It uses the configured topology hierarchy to find the tightest grouping possible:
Topology hierarchy (farthest β closest):
region β zone β block β rack β hostname
Workload needs 6 nodes:
1. Can all 6 fit in one rack? Yes β Place in rack β
2. If not: Can all 6 fit in one block? Try that
3. If not: Can all 6 fit in one zone? Try that
4. Last resort: Spread across zonesStep 1: Label Your Nodes
Topology labels are standard Kubernetes node labels. Apply them to reflect physical network layout:
# Cloud environments often have these automatically
# On-prem: label manually or use NVIDIA Topograph
# Example: 2 racks, 4 nodes each
kubectl label node gpu-node-01 \
topology.kubernetes.io/region=us-west \
topology.kubernetes.io/zone=us-west-1a \
cluster.local/rack=rack-01
kubectl label node gpu-node-02 \
topology.kubernetes.io/region=us-west \
topology.kubernetes.io/zone=us-west-1a \
cluster.local/rack=rack-01
kubectl label node gpu-node-03 \
topology.kubernetes.io/region=us-west \
topology.kubernetes.io/zone=us-west-1a \
cluster.local/rack=rack-02
kubectl label node gpu-node-04 \
topology.kubernetes.io/region=us-west \
topology.kubernetes.io/zone=us-west-1a \
cluster.local/rack=rack-02
# Verify labels
kubectl get nodes -L topology.kubernetes.io/zone,cluster.local/rackTip: For on-prem clusters, NVIDIA Topograph can auto-discover topology from NVIDIA NetQ and apply labels automatically.
Step 2: Create a Network Topology in Run:ai
Define the hierarchy via Run:ai API or UI. Order matters β first label is farthest, last is closest:
{
"levels": [
"topology.kubernetes.io/region",
"topology.kubernetes.io/zone",
"cluster.local/rack",
"kubernetes.io/hostname"
],
"name": "datacenter-topology",
"clusterId": "<CLUSTER_ID>"
}Via the Run:ai UI:
- Go to Cluster Settings β Network Topologies
- Add label keys in order: region (farthest) β hostname (closest)
- Name the topology (e.g.,
datacenter-topology)
Step 3: Attach Topology to Node Pools
| Scenario | Action |
|---|---|
| Single node pool (default) | Attach topology to default pool |
| Multiple pools, same topology | Attach same topology to all pools |
| Multiple pools, different topologies | Attach matching topology to each pool |
β οΈ Submitting a workload to multiple node pools with different topologies is not supported.
Step 4: Deploy Workloads
Automatic (Distributed Workloads via Run:ai)
For distributed workloads submitted through Run:ai UI, API, or CLI β topology-aware scheduling is automatic. No annotations needed. Run:ai applies Preferred at the lowest topology level.
Manual (kubectl YAML)
When submitting via kubectl, add annotations:
apiVersion: batch/v1
kind: Job
metadata:
name: distributed-training
annotations:
kai.scheduler/topology: "datacenter-topology"
kai.scheduler/topology-preferred-placement: "cluster.local/rack"
labels:
runai/queue: team-a
spec:
parallelism: 8
completions: 8
template:
spec:
containers:
- name: trainer
image: nvcr.io/nvidia/pytorch:24.04-py3
resources:
limits:
nvidia.com/gpu: 8
restartPolicy: NeverAnnotation Reference
| Annotation | Type | Effect |
|---|---|---|
kai.scheduler/topology | Required | Topology name (must match a defined topology) |
kai.scheduler/topology-preferred-placement | Soft | Try to place all pods at this level; relax upward if impossible |
kai.scheduler/topology-required-placement | Hard | All pods MUST be at this level; pend if impossible |
Combining Required + Preferred
You can use both β but the hierarchy matters:
annotations:
kai.scheduler/topology: "datacenter-topology"
kai.scheduler/topology-required-placement: "topology.kubernetes.io/zone"
kai.scheduler/topology-preferred-placement: "cluster.local/rack"Rules:
- Preferred at the same level or higher than Required β no effect (redundant)
- Preferred at a lower (more specific) level than Required β scheduler enforces Required, then tries to group at Preferred level
Example: required=zone, preferred=rack
β
All pods guaranteed in same zone
β
Scheduler TRIES to place in same rack within that zone
β
If rack is full, pods spread across racks but stay in the zoneLeaderWorkerSet (LWS) Topology
For Dynamo and other LWS-based workloads, topology is applied per replica:
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
name: dynamo-disagg
namespace: runai-project-a
annotations:
kai.scheduler/topology: "datacenter-topology"
kai.scheduler/topology-required-placement: "topology.kubernetes.io/zone"
kai.scheduler/topology-preferred-placement: "cluster.local/rack"
labels:
runai/queue: inference-team
spec:
replicas: 2 # 2 independent replicas
leaderWorkerTemplate:
size: 4 # Each replica: 1 leader + 3 workers
restartPolicy: RecreateGroupOnPodRestart
leaderTemplate:
metadata:
labels:
role: leader
spec:
containers:
- name: prefill
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
resources:
limits:
nvidia.com/gpu: 4
workerTemplate:
metadata:
labels:
role: worker
spec:
containers:
- name: decode
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
resources:
limits:
nvidia.com/gpu: 1Each replica (leader + 3 workers) is gang-scheduled together and placed per topology. Replica 1 and Replica 2 may land in different racks β but each replicaβs internal pods are co-located.
GB200 and Multi-Node NVLink Domains
For GB200 NVL72 systems, topology-aware scheduling ensures workloads stay within NVLink domains:
{
"levels": [
"topology.kubernetes.io/region",
"topology.kubernetes.io/zone",
"nvidia.com/nvlink-domain",
"kubernetes.io/hostname"
],
"name": "gb200-topology",
"clusterId": "<CLUSTER_ID>"
}# Label GB200 nodes with NVLink domain
kubectl label node gb200-node-01 nvidia.com/nvlink-domain=nvl72-rack-a
kubectl label node gb200-node-02 nvidia.com/nvlink-domain=nvl72-rack-a
# ... all 72 GPUs in the NVLink domain share the same labelPod Affinity vs Topology-Aware Scheduling
| Aspect | Pod Affinity | Topology-Aware |
|---|---|---|
| Evaluation scope | Per-pod | Full workload |
| Fragmentation risk | High β splits across racks | Low β evaluates total capacity first |
| Hierarchy awareness | Single level | Multi-level (rack β block β zone) |
| Automatic fallback | No | Yes β escalates up hierarchy |
| Configuration | Per-pod spec | Per-topology (cluster-level) |
Real example from the docs:
Two workloads already running (12 nodes + 15 nodes). New workload needs 6 nodes:
- Pod affinity: Splits 4 in Rack A + 2 in Rack B (cross-rack overhead)
- Topology-aware: Places all 6 in Rack A (sees full capacity before placing)
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| Workload pending indefinitely | required-placement canβt be satisfied | Switch to preferred-placement or add nodes at the required topology level |
| Topology not applied via kubectl | Missing annotations | Add all three annotations: topology, topology-*-placement |
| Preferred has no effect | Preferred at same or higher level than Required | Set Preferred at a lower (more specific) level than Required |
| Workloads unschedulable after topology deleted | Suspended workloads referencing deleted topology | Recreate the topology with the same name, or remove annotations |
| Multiple topologies on multi-pool submit | Not supported | Target a single node pool per workload |
| Labels not matching | Typo in label key between node and topology definition | Verify with kubectl get nodes -L <label-key> |
Best Practices
- Map labels to physical reality β racks, switches, NVLink domains, not arbitrary groupings
- Use
preferredby default βrequiredcauses unnecessary pending when cluster is busy - Combine
required+preferredfor critical workloads β guarantee zone, prefer rack - Use Topograph for auto-discovery β avoids manual labeling errors on large clusters
- Test with small workloads first β verify placement before committing large GPU counts
- One topology per node pool β donβt mix topologies in a multi-pool submit
- Monitor placement in Run:ai UI β verify pods are co-located as expected
Key Takeaways
- Topology-aware scheduling evaluates the full workload before placing any pod β fundamentally different from pod affinity
- Multi-level hierarchy (region β zone β block β rack β hostname) with automatic fallback up the tree
preferred= soft constraint (relax if needed),required= hard constraint (pend if impossible)- Distributed workloads submitted via Run:ai get automatic topology-aware scheduling β zero config
- LeaderWorkerSet workloads get per-replica topology placement β each replica co-located independently
- For GB200 NVL72, use
nvidia.com/nvlink-domainas a topology level to keep workloads within NVLink domains

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
