NVIDIA IPAM for GPU Fabric IP Address Allocation
Configure nv-ipam (NVIDIA IPAM) to assign IP addresses on GPU fabric SR-IOV networks in Kubernetes. Covers IPPool CRDs, per-node allocation, InfiniBand IPoIB
💡 Quick Answer: NVIDIA IPAM (nv-ipam) is a Kubernetes-native IPAM plugin that assigns IP addresses to GPU fabric SR-IOV interfaces. It uses IPPool and CIDRPool CRDs for deterministic, per-node IP allocation, so each GPU worker gets a consistent, predictable address range on the InfiniBand/RoCE fabric.
The Problem
GPU fabric networking needs IP assignment for RDMA interfaces:
- SR-IOV VFs on InfiniBand need IPoIB addresses for NCCL bootstrap
- Standard IPAM plugins (host-local, whereabouts) don't understand GPU topology
- Need deterministic IPs per node (same IP after Pod restart for NCCL rank mapping)
- Multi-subnet support for separate GPU fabric and storage fabric
- Per-node IP ranges prevent conflicts in large clusters (100+ GPU nodes)
The Solution
Install nv-ipam
# Deploy NVIDIA IPAM CNI plugin
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install nv-ipam nvidia/nvidia-ipam \
  --namespace kube-system \
  --set image.repository=nvcr.io/nvidia/cloud-native/nvidia-k8s-ipam \
  --set image.tag=v0.2.0
# Or via manifest:
kubectl apply -f https://raw.githubusercontent.com/Mellanox/nvidia-k8s-ipam/main/deploy/nv-ipam.yaml
IPPool for GPU Fabric
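An IPPool divides one subnet into fixed-size blocks and gives each matching node one block. The arithmetic can be sketched in a few lines of Python. This is an illustration only: `per_node_blocks` is a hypothetical helper, and it lists every possible block without modelling which blocks nv-ipam reserves or the order in which it assigns them.

```python
import ipaddress


def per_node_blocks(subnet: str, block_size: int) -> list[tuple[str, str]]:
    """Split `subnet` into consecutive (start, end) ranges of `block_size` addresses.

    Hypothetical helper for illustration; not nv-ipam's actual allocator.
    """
    net = ipaddress.ip_network(subnet)
    first = int(net.network_address)
    last = int(net.broadcast_address)
    blocks = []
    for start in range(first, last + 1, block_size):
        end = start + block_size - 1
        if end > last:
            break  # ignore a trailing partial block
        blocks.append((str(ipaddress.ip_address(start)), str(ipaddress.ip_address(end))))
    return blocks


# Values taken from the gpu-fabric-pool manifest in this section.
blocks = per_node_blocks("10.0.100.0/24", 16)
for start, end in blocks[:4]:
    print(f"{start} - {end}")
```

A /24 with 16 addresses per block yields 16 blocks, so a pool sized this way can serve at most 16 nodes.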
# Define IP pool for InfiniBand GPU fabric
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: gpu-fabric-pool
  namespace: kube-system
spec:
  subnet: "10.0.100.0/24"
  perNodeBlockSize: 16   # 16 IPs per node (matches max VFs)
  gateway: "10.0.100.1"
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/gpu-worker: ""
# Result: Each GPU worker gets a /28 block:
#   gpu-worker-01: 10.0.100.16 - 10.0.100.31
#   gpu-worker-02: 10.0.100.32 - 10.0.100.47
#   gpu-worker-03: 10.0.100.48 - 10.0.100.63
#   ...
CIDRPool for Larger Deployments
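A CIDRPool hands each matching node a whole subnet rather than a range of addresses. The sketch below shows the arithmetic, assuming the pool is cut into equal subnets of `perNodeNetworkPrefix` length and that `gatewayIndex` is an offset from each subnet's network address, which is how the comments in this section's manifest read. `per_node_subnets` is a hypothetical helper, not part of nv-ipam.

```python
import ipaddress


def per_node_subnets(cidr: str, node_prefix: int, gateway_index: int) -> list[tuple[str, str]]:
    """Return (subnet, gateway) pairs for every per-node subnet in `cidr`.

    Assumption: the gateway is the network address plus `gateway_index`.
    """
    pool = ipaddress.ip_network(cidr)
    result = []
    for subnet in pool.subnets(new_prefix=node_prefix):
        gateway = subnet.network_address + gateway_index
        result.append((str(subnet), str(gateway)))
    return result


# Values taken from the gpu-fabric-cidr manifest in this section.
subnets = per_node_subnets("10.0.0.0/16", 24, 1)
for subnet, gateway in subnets[1:4]:
    print(f"{subnet} (gateway {gateway})")
```

A /16 split into /24s gives 256 per-node subnets, which is the headroom that makes CIDRPool attractive for larger clusters.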
# For large clusters needing bigger ranges
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: CIDRPool
metadata:
  name: gpu-fabric-cidr
  namespace: kube-system
spec:
  cidr: "10.0.0.0/16"
  perNodeNetworkPrefix: 24   # Each node gets a /24 (256 IPs)
  gatewayIndex: 1            # .1 is gateway on each /24
  nodeSelector:
    matchLabels:
      nvidia.com/gpu.present: "true"
# Result:
#   gpu-worker-01: 10.0.1.0/24 (gateway 10.0.1.1)
#   gpu-worker-02: 10.0.2.0/24 (gateway 10.0.2.1)
#   gpu-worker-03: 10.0.3.0/24 (gateway 10.0.3.1)
Separate Pools Per Fabric
# GPU Fabric (InfiniBand): high-bandwidth NCCL traffic
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: gpu-ib-fabric
  namespace: kube-system
spec:
  subnet: "10.100.0.0/16"
  perNodeBlockSize: 8
  gateway: ""   # No gateway needed for L2 IB fabric
  nodeSelector:
    matchLabels:
      nvidia.com/gpu.present: "true"
---
# Storage Fabric (Ethernet/RoCE): NFS, Ceph, Lustre
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: storage-fabric
  namespace: kube-system
spec:
  subnet: "10.200.0.0/16"
  perNodeBlockSize: 4
  gateway: "10.200.0.1"
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/gpu-worker: ""
SriovNetwork with nv-ipam
# GPU RDMA network using nv-ipam for addressing
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: gpu-rdma-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: ai-training
  resourceName: gpu-rdma
  capabilities: '{"rdma": true}'
  ipam: |
    {
      "type": "nv-ipam",
      "poolName": "gpu-ib-fabric"
    }
---
# Storage network using nv-ipam
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: storage-network
  namespace: openshift-sriov-network-operator
spec:
  networkNamespace: ai-training
  resourceName: storage-net
  ipam: |
    {
      "type": "nv-ipam",
      "poolName": "storage-fabric"
    }
NetworkAttachmentDefinition (Non-OpenShift)
# For vanilla Kubernetes with Multus
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: gpu-rdma-net
  namespace: ai-training
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "gpu-rdma-net",
      "type": "sriov",
      "vlan": 0,
      "spoofchk": "off",
      "trust": "on",
      "rdma": true,
      "ipam": {
        "type": "nv-ipam",
        "poolName": "gpu-ib-fabric"
      }
    }
Pod with nv-ipam Assigned Addresses
apiVersion: v1
kind: Pod
metadata:
  name: nccl-training
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: |
      [
        {"name": "gpu-rdma-network", "interface": "rdma0"},
        {"name": "gpu-rdma-network", "interface": "rdma1"},
        {"name": "storage-network", "interface": "stor0"}
      ]
spec:
  containers:
  - name: training
    image: nvcr.io/nvidia/pytorch:24.07-py3
    env:
    - name: NCCL_IB_HCA
      value: "mlx5_0,mlx5_1"
    - name: NCCL_NET_GDR_LEVEL
      value: "5"
    # Rendezvous address: rank 0's GPU fabric IP, assigned by nv-ipam
    - name: MASTER_ADDR
      value: "10.100.0.8"
    # NCCL bootstrap sockets use the default Pod interface
    - name: NCCL_SOCKET_IFNAME
      value: "eth0"
    resources:
      limits:   # extended resources (GPUs, SR-IOV VFs) must be set under limits
        nvidia.com/gpu: "8"
        openshift.io/gpu-rdma: "2"
        openshift.io/storage-net: "1"
Verify IP Allocation
# Check nv-ipam node allocation status
kubectl get ippools -n kube-system -o wide
kubectl describe ippool gpu-ib-fabric -n kube-system
# Check per-node allocations
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.nv-ipam\.nvidia\.com/ip-blocks}{"\n"}{end}'
# Inside a Pod, verify assigned IPs (the Pod runs in the ai-training namespace)
kubectl exec -it nccl-training -n ai-training -- ip addr show rdma0
# Should show: inet 10.100.0.X/16 (from gpu-ib-fabric pool)
kubectl exec -it nccl-training -n ai-training -- ip addr show stor0
# Should show: inet 10.200.0.X/16 (from storage-fabric pool)
# Check nv-ipam daemon logs
kubectl logs -n kube-system -l app=nv-ipam-node
# Verify IP allocation state (stored as node annotations)
kubectl get node gpu-worker-01 -o yaml | grep -A5 "nv-ipam"
IPPool Status and Troubleshooting
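When a Pod address looks wrong, a quick way to narrow things down is to check which node's allocated range it falls into. The sketch below assumes the per-node ranges have already been copied out of the pool status shown in this section into a plain dictionary; `owning_node` is a hypothetical helper, and fetching the status from the cluster is out of scope here.

```python
import ipaddress
from typing import Optional


def owning_node(pod_ip: str, allocations: dict[str, tuple[str, str]]) -> Optional[str]:
    """Return the node whose [startIP, endIP] range contains `pod_ip`, or None."""
    ip = ipaddress.ip_address(pod_ip)
    for node, (start_ip, end_ip) in allocations.items():
        if ipaddress.ip_address(start_ip) <= ip <= ipaddress.ip_address(end_ip):
            return node
    return None


# Ranges copied from the example status output in this section.
allocations = {
    "gpu-worker-01": ("10.100.0.16", "10.100.0.23"),
    "gpu-worker-02": ("10.100.0.24", "10.100.0.31"),
}

print(owning_node("10.100.0.25", allocations))
print(owning_node("10.100.0.8", allocations))
```

An address that maps to no node, or to a node other than the one the Pod is scheduled on, points to stale allocations or a Pod attached to the wrong pool.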
# Check pool utilization
kubectl get ippool gpu-ib-fabric -n kube-system -o yaml
# status:
#   allocations:
#     gpu-worker-01:
#       startIP: "10.100.0.16"
#       endIP: "10.100.0.23"
#     gpu-worker-02:
#       startIP: "10.100.0.24"
#       endIP: "10.100.0.31"
# If no IP is assigned, check the nv-ipam-node Pod on that node
kubectl logs -n kube-system $(kubectl get pods -n kube-system \
  -l app=nv-ipam-node --field-selector spec.nodeName=gpu-worker-01 \
  -o name)
# Common log messages:
#   "allocated IP 10.100.0.16 for pod ai-training/nccl-training" -> success
#   "no free IPs in pool" -> perNodeBlockSize exhausted
#   "node not matching selector" -> missing label
Static IP Assignment (Predictable Rank Mapping)
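Static prefixes are written by hand, so it is worth checking them before applying the manifest: each prefix should sit inside the pool subnet, and no two nodes should share addresses. The following is a minimal sketch of such a check; `check_static_prefixes` is a hypothetical helper and does not reflect any validation nv-ipam itself performs.

```python
import ipaddress


def check_static_prefixes(pool_subnet: str, prefixes: dict[str, str]) -> list[str]:
    """Return human-readable problems found in hand-written per-node prefixes."""
    pool = ipaddress.ip_network(pool_subnet)
    nets = {node: ipaddress.ip_network(prefix) for node, prefix in prefixes.items()}
    problems = []
    # Every prefix must be contained in the pool subnet.
    for node, net in nets.items():
        if not net.subnet_of(pool):
            problems.append(f"{node}: {net} is outside {pool}")
    # No two nodes may share addresses.
    nodes = sorted(nets)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if nets[a].overlaps(nets[b]):
                problems.append(f"{a} and {b} overlap")
    return problems


# Prefixes taken from the static allocation manifest in this section.
static = {
    "gpu-worker-01": "10.100.0.0/28",
    "gpu-worker-02": "10.100.0.16/28",
    "gpu-worker-03": "10.100.0.32/28",
}
print(check_static_prefixes("10.100.0.0/24", static))
```

An empty list means the prefixes are contained in the pool and disjoint; anything else should be fixed before the pool is created.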
# For frameworks needing deterministic IPs per rank:
apiVersion: nv-ipam.nvidia.com/v1alpha1
kind: IPPool
metadata:
  name: gpu-fabric-static
  namespace: kube-system
spec:
  subnet: "10.100.0.0/24"
  perNodeBlockSize: 16   # matches the /28 static prefixes below
  gateway: ""
  nodeSelector:
    matchLabels:
      nvidia.com/gpu.present: "true"
  # Static allocations override dynamic:
  staticAllocations:
  - nodeName: "gpu-worker-01"
    prefix: "10.100.0.0/28"    # .0-.15 for node 1
  - nodeName: "gpu-worker-02"
    prefix: "10.100.0.16/28"   # .16-.31 for node 2
  - nodeName: "gpu-worker-03"
    prefix: "10.100.0.32/28"   # .32-.47 for node 3
nv-ipam vs Other IPAM Plugins
Plugin        Deterministic  Per-Node Blocks        GPU-Aware  CRD-Based
---------------------------------------------------------------------------------
nv-ipam       Yes            Yes                    Yes        Yes (IPPool/CIDRPool)
whereabouts   No             No                     No         No (config-based)
host-local    No             No                     No         No (config-based)
Calico IPAM   Partial        Yes (IPPool per node)  No         Yes (IPPool)
Cilium IPAM   Partial        Yes (per-node CIDR)    No         Yes (CiliumNode)
nv-ipam advantages for GPU clusters:
- Per-node block allocation (predictable, no conflicts)
- Multiple pools (GPU fabric vs storage fabric)
- Node selector (only allocate to GPU nodes)
- Lightweight (no etcd dependency like whereabouts)
- Works with SR-IOV + Multus seamlessly
Common Issues
"no free IPs in pool" for new Pods
- Cause: perNodeBlockSize too small; all IPs in the node's block are used
- Fix: Increase perNodeBlockSize or delete unused Pods/allocations
IPs not released after Pod deletion
- Cause: nv-ipam GC not running or Multus CNI not calling DEL
- Fix: Restart nv-ipam-node DaemonSet; check CNI DEL in Multus logs
Node gets no IP block (unallocated)
- Cause: Node doesn't match nodeSelector on IPPool
- Fix: Add required label (nvidia.com/gpu.present: "true")
IP conflict between two Pods
- Cause: Multiple IPPools with overlapping subnets
- Fix: Use non-overlapping ranges; one pool per fabric
nv-ipam not found as CNI plugin
- Cause: Binary not installed on node at /opt/cni/bin/nv-ipam
- Fix: Verify nv-ipam DaemonSet is running; check init container copied binary
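The overlapping-subnet case is easy to catch before any pool is applied. As a minimal sketch, the check below compares every pair of pool subnets with Python's `ipaddress` module; `overlapping_pools` is a hypothetical helper, and the pool names and subnets are the ones used in this article.

```python
import ipaddress
from itertools import combinations


def overlapping_pools(pools: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of pool names whose subnets share addresses."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in pools.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


pools = {
    "gpu-ib-fabric": "10.100.0.0/16",
    "storage-fabric": "10.200.0.0/16",
    "gpu-fabric-static": "10.100.0.0/24",
}
conflicts = overlapping_pools(pools)
print(conflicts)
```

With these inputs the check reports that gpu-fabric-static sits inside gpu-ib-fabric, so those two example pools should be treated as alternatives rather than deployed side by side.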
Best Practices
- One IPPool per fabric: separate GPU, storage, management pools
- Size perNodeBlockSize to max VFs: 8 or 16 typically
- Use nodeSelector: only allocate GPU fabric IPs to GPU nodes
- No gateway for L2 IB fabric: InfiniBand is flat L2, no routing needed
- CIDRPool for 50+ nodes: automatic /24 per node from a /16
- Monitor allocations: alert when pool utilization > 80%
- Label nodes before creating pool: nv-ipam allocates blocks on first match
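The 80% utilization threshold can be made concrete with a small calculation. The sketch below assumes utilization means the share of per-node blocks already handed out, and that the number of allocated nodes has been read from the pool status separately. `pool_utilization` is a hypothetical helper that ignores any addresses nv-ipam may reserve.

```python
import ipaddress


def pool_utilization(subnet: str, per_node_block_size: int, allocated_nodes: int) -> float:
    """Fraction of per-node blocks in `subnet` that are already allocated."""
    total_blocks = ipaddress.ip_network(subnet).num_addresses // per_node_block_size
    return allocated_nodes / total_blocks


ALERT_THRESHOLD = 0.80

# A /24 with 16-address blocks holds 16 nodes; 13 of them are in use here.
util = pool_utilization("10.0.100.0/24", 16, 13)
needs_alert = util > ALERT_THRESHOLD
print(f"utilization={util:.2%} alert={needs_alert}")
```

The same function shows why a /16 pool with 8-address blocks has plenty of headroom: 100 nodes consume only a small fraction of its 8192 blocks.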
Key Takeaways
- nv-ipam assigns IPs to GPU fabric SR-IOV interfaces via IPPool/CIDRPool CRDs
- perNodeBlockSize gives each node a deterministic IP range (no conflicts)
- Separate pools for GPU fabric (IB) and storage fabric (Ethernet)
- Integrates with SR-IOV Network Operator via "type": "nv-ipam" in IPAM config
- Lightweight: no etcd/database; state stored as node annotations
- Designed for large GPU clusters (100+ nodes) with predictable addressing
- Static allocations available for frameworks needing fixed rank-to-IP mapping
- Works with both InfiniBand (IPoIB) and Ethernet (RoCE) interfaces
