Multi-Node NVLink with ComputeDomains
Configure ComputeDomains with the NVIDIA DRA driver to run robust, secure Multi-Node NVLink (MNNVL) workloads on NVIDIA GB200 and similar systems
The Problem
Running distributed GPU workloads across multiple nodes requires high-bandwidth, low-latency interconnects. Multi-Node NVLink (MNNVL) provides this, but orchestrating MNNVL workloads securely, with each workload isolated from the others, is complex.
The Solution
Use ComputeDomains with the NVIDIA DRA Driver to create isolated, ephemeral domains that guarantee NVLink reachability between pods while providing secure isolation from other workloads.
Understanding ComputeDomains
flowchart TB
    subgraph cluster["KUBERNETES CLUSTER"]
        subgraph cd1["ComputeDomain A"]
            direction LR
            P1["Pod 1<br/>GPU 0-3"]
            P2["Pod 2<br/>GPU 4-7"]
            P1 <-->|"NVLink"| P2
        end
        subgraph cd2["ComputeDomain B"]
            direction LR
            P3["Pod 3<br/>GPU 0-3"]
            P4["Pod 4<br/>GPU 4-7"]
            P3 <-->|"NVLink"| P4
        end
        subgraph isolation["ISOLATION"]
            I1["CD-A pods cannot<br/>reach CD-B GPUs"]
            I2["Secure IMEX<br/>channels"]
        end
    end
    cd1 -.->|"Isolated"| cd2
Key Concepts
| Concept | Description |
|---|---|
| ComputeDomain | Abstraction guaranteeing MNNVL reachability between pods |
| MNNVL | Multi-Node NVLink for high-bandwidth GPU-to-GPU communication |
| IMEX | Internode Memory Exchange - the NVIDIA service that underpins ComputeDomains |
| Ephemeral Lifetime | A ComputeDomain's lifetime is bound to the consuming workload |
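Once the DRA driver from Step 1 is installed, you can inspect the exact ComputeDomain schema served by your cluster instead of relying on the field names used in this recipe:
# Show the ComputeDomain API schema registered by the driver
kubectl explain computedomains
kubectl explain computedomains.spec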
Step 1: Install NVIDIA DRA Driver with ComputeDomain Support
# Install with compute-domain plugin enabled
helm install nvidia-dra-driver nvidia/nvidia-dra-driver \
--namespace nvidia-dra-driver \
--create-namespace \
--set computeDomain.enabled=true \
  --set gpu.enabled=true
Verify the compute-domain kubelet plugin is running:
kubectl get pods -n nvidia-dra-driver -l app=compute-domain-kubelet-plugin
# Check ComputeDomain CRD
kubectl get crd computedomains.nvidia.com
Step 2: Create a ComputeDomain ResourceClaimTemplate
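Before writing the template, list the DeviceClasses the driver registered; the deviceClassName used below (nvidia.com/compute-domain) follows this recipe and may be named differently in your driver version:
# DeviceClasses available for DRA claims
kubectl get deviceclasses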
# compute-domain-template.yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
name: compute-domain-template
namespace: distributed-training
spec:
spec:
devices:
requests:
- name: compute-domain
deviceClassName: nvidia.com/compute-domain
count: 1
config:
- requests: ["compute-domain"]
opaque:
driver: nvidia.com/compute-domain
parameters:
# Number of nodes in the domain
nodeCount: "4"
# GPUs per node
gpusPerNode: "8"kubectl create namespace distributed-training
kubectl apply -f compute-domain-template.yaml
Step 3: Deploy Multi-Node Training Job
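Before deploying, confirm the namespace and the claim template from Step 2 exist; pods will stay Pending if the template cannot be resolved:
kubectl get resourceclaimtemplates -n distributed-training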
# multi-node-training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: llm-distributed-training
namespace: distributed-training
spec:
parallelism: 4 # 4 pods across nodes
completions: 4
completionMode: Indexed
template:
spec:
restartPolicy: Never
containers:
- name: trainer
image: nvcr.io/nvidia/pytorch:24.01-py3
command:
- torchrun
- --nnodes=4
- --nproc_per_node=8
- --node_rank=$(JOB_COMPLETION_INDEX)
        # MASTER_ADDR must be provided via env (e.g. a headless Service resolving to pod index 0)
        - --master_addr=$(MASTER_ADDR)
- --master_port=29500
- train_llm.py
env:
- name: NCCL_DEBUG
value: "INFO"
- name: NCCL_IB_DISABLE
value: "0"
resources:
claims:
- name: gpu-claim
- name: cd-claim
resourceClaims:
- name: gpu-claim
resourceClaimTemplateName: multi-gpu-template
- name: cd-claim
resourceClaimTemplateName: compute-domain-template
      # shareProcessNamespace shares the PID namespace between containers in each pod;
      # ComputeDomain membership comes from the cd-claim above
      shareProcessNamespace: true
Step 4: Configure GPU Claim for Multi-Node
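The constraint in the template below matches on a device attribute. The attribute name nvidia.com/nvlink-capable is illustrative; list the ResourceSlices published by the driver to see which attributes your GPUs actually expose:
# Inspect device attributes advertised for each node
kubectl get resourceslices -o yaml | less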
# multi-node-gpu-template.yaml
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
name: multi-gpu-template
namespace: distributed-training
spec:
spec:
devices:
requests:
- name: gpus
deviceClassName: nvidia.com/gpu
count: 8 # All GPUs on the node
allocationMode: ExactCount
constraints:
- requests: ["gpus"]
matchAttribute: "nvidia.com/nvlink-capable"Step 5: Shared ComputeDomain Across Pods
Pods that need to communicate over NVLink must share the same ComputeDomain claim:
# shared-compute-domain.yaml
apiVersion: v1
kind: Pod
metadata:
name: training-coordinator
namespace: distributed-training
spec:
containers:
- name: coordinator
image: nvcr.io/nvidia/pytorch:24.01-py3
resources:
claims:
- name: shared-cd
resourceClaims:
- name: shared-cd
resourceClaimName: my-compute-domain # Named claim, not template
---
apiVersion: v1
kind: Pod
metadata:
name: training-worker-1
namespace: distributed-training
spec:
containers:
- name: worker
image: nvcr.io/nvidia/pytorch:24.01-py3
resources:
claims:
- name: shared-cd
resourceClaims:
- name: shared-cd
resourceClaimName: my-compute-domain # Same named claim
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
name: my-compute-domain
namespace: distributed-training
spec:
devices:
requests:
- name: cd
      deviceClassName: nvidia.com/compute-domain
Step 6: NCCL Configuration for MNNVL
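Before tuning NCCL, confirm that both pods from Step 5 were reserved against the same claim; the claim's status records every consuming pod:
kubectl describe resourceclaim my-compute-domain -n distributed-training
# The status (reservedFor) should list both training-coordinator and training-worker-1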
Optimize NCCL for Multi-Node NVLink:
# nccl-optimized-pod.yaml
apiVersion: v1
kind: Pod
metadata:
name: nccl-optimized-training
namespace: distributed-training
spec:
containers:
- name: trainer
image: nvcr.io/nvidia/pytorch:24.01-py3
env:
# Enable NVLink for inter-GPU communication
- name: NCCL_P2P_LEVEL
value: "NVL" # NVLink level
- name: NCCL_NET_GDR_LEVEL
value: "PIX" # GPU Direct RDMA
- name: NCCL_CROSS_NIC
value: "1"
- name: NCCL_SOCKET_IFNAME
value: "eth0"
# Debug settings
- name: NCCL_DEBUG
value: "INFO"
- name: NCCL_DEBUG_SUBSYS
value: "INIT,GRAPH,ENV"
resources:
claims:
- name: gpus
- name: compute-domain
  # ... rest of pod spec
Step 7: Monitor ComputeDomain Status
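If the NCCL-tuned pod from Step 6 is running, first confirm the environment variables actually landed in the container before inspecting the domain itself:
kubectl exec nccl-optimized-training -n distributed-training -- env | grep '^NCCL_'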
# List all ComputeDomains
kubectl get computedomains -n distributed-training
# Detailed ComputeDomain info
kubectl describe computedomain <cd-name> -n distributed-training
# Check IMEX daemon status
kubectl get pods -n nvidia-dra-driver -l app=imex-daemon
# View ComputeDomain events
kubectl get events -n distributed-training --field-selector involvedObject.kind=ComputeDomain
Step 8: Verify NVLink Connectivity
# Inside a pod, verify NVLink topology
kubectl exec -it training-coordinator -n distributed-training -- nvidia-smi topo -m
# Check NVLink status
kubectl exec -it training-coordinator -n distributed-training -- nvidia-smi nvlink -s
# Test NCCL all-reduce (requires MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
# to be set in the pod environment for the default env:// rendezvous)
kubectl exec -it training-coordinator -n distributed-training -- \
python -c "
import torch
import torch.distributed as dist
dist.init_process_group('nccl')
tensor = torch.ones(1000000).cuda()
dist.all_reduce(tensor)
print(f'All-reduce successful, sum: {tensor.sum()}')
"
Step 9: ComputeDomain with MPI Jobs
# mpi-compute-domain-job.yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
name: nccl-test-mpi
namespace: distributed-training
spec:
slotsPerWorker: 8
runPolicy:
cleanPodPolicy: Running
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
spec:
containers:
- name: launcher
image: nvcr.io/nvidia/pytorch:24.01-py3
command:
- mpirun
- --allow-run-as-root
- -np
- "32"
- -bind-to
- none
- python
- nccl_test.py
Worker:
replicas: 4
template:
spec:
containers:
- name: worker
image: nvcr.io/nvidia/pytorch:24.01-py3
resources:
claims:
- name: gpus
- name: cd
resourceClaims:
- name: gpus
resourceClaimTemplateName: multi-gpu-template
- name: cd
            resourceClaimTemplateName: compute-domain-template
Troubleshooting
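A quick triage pass before drilling into specific symptoms (names assume the namespaces used earlier in this recipe):
# Are the claims being allocated at all?
kubectl get resourceclaims -n distributed-training
# Recent events from the DRA driver namespace
kubectl get events -n nvidia-dra-driver --sort-by='.lastTimestamp' | tail -20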
ComputeDomain Not Creating
# Check DRA driver logs
kubectl logs -n nvidia-dra-driver -l app=compute-domain-kubelet-plugin --tail=100
# Verify IMEX is available on nodes
kubectl get nodes -o json | jq '.items[].status.allocatable'
# Check for node label requirements
kubectl get nodes --show-labels | grep nvlink
NVLink Not Detected
# Verify the hardware supports MNNVL (per-GPU NVLink capabilities)
kubectl exec -it <pod> -n distributed-training -- nvidia-smi nvlink -c
# List visible GPUs
kubectl exec -it <pod> -n distributed-training -- nvidia-smi -L
# Confirm CUDA initializes with NCCL debug enabled (full NCCL transport output
# appears once a communicator is created, e.g. during training)
kubectl exec -it <pod> -n distributed-training -- bash -c "NCCL_DEBUG=INFO python -c 'import torch; torch.cuda.init()'"
Best Practices
- Size ComputeDomains appropriately - Match domain size to workload requirements
- Use shared claims for pods that need NVLink connectivity
- Configure NCCL optimally for your network topology
- Monitor IMEX daemon health for ComputeDomain reliability (see the check after this list)
- Plan for ephemeral lifetime - ComputeDomains are tied to workload lifecycle
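For the IMEX health check mentioned in the list above, the daemon pods deployed by the DRA driver can be inspected directly (using the same label as in Step 7):
kubectl get pods -n nvidia-dra-driver -l app=imex-daemon -o wide
kubectl logs -n nvidia-dra-driver -l app=imex-daemon --tail=50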
Summary
ComputeDomains provide a Kubernetes-native abstraction for Multi-Node NVLink workloads. They guarantee MNNVL reachability between pods while maintaining secure isolation, making distributed GPU training on systems like NVIDIA GB200 both powerful and manageable.
Go Further with Kubernetes Recipes
Love this recipe? There's so much more! This is just one of 100+ hands-on recipes in our comprehensive Kubernetes Recipes book.
Inside the book, you'll master:
- Production-ready deployment strategies
- Advanced networking and security patterns
- Observability, monitoring, and troubleshooting
- Real-world best practices from industry experts
"The practical, recipe-based approach made complex Kubernetes concepts finally click for me."
Get Your Copy Now - start building production-grade Kubernetes skills today!
Get All 100+ Recipes in One Book
Stop searching: get every production-ready pattern with detailed explanations, best practices, and copy-paste YAML.
Want More Kubernetes Recipes?
This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.