Validate GPUDirect RDMA Performance with DMA-BUF
Run ib_write_bw with CUDA DMA-BUF to verify GPUDirect RDMA data transfer rates between GPU pods and validate network operator configuration.
💡 Quick Answer: Deploy two pods with `mellanox/cuda-perftest`, run `ib_write_bw --use_cuda=0 --use_cuda_dmabuf` between them, and verify throughput reaches near line rate (80–95+ Gbps for 100G NICs).
After configuring GPUDirect RDMA, validate that GPU-to-GPU transfers over the network achieve expected throughput using the ib_write_bw benchmark with DMA-BUF.
Step 1 – Get the Network Interface Name

```
kubectl exec -it -n network-operator mofed-ubuntu22.04-ds-xxxxx -- ibdev2netdev
```

Example output:

```
mlx5_0 port 1 ==> ens64np1 (Up)
```

Step 2 – Deploy Test Pods
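If you script this step, the mapping line from Step 1 can be parsed directly. A minimal sketch, assuming output lines in the `mlx5_0 port 1 ==> ens64np1 (Up)` format shown above (the `parse_ibdev` helper name is made up for illustration):

```shell
# Illustrative helper: print the first RDMA device / netdev pair whose port is Up.
# Assumed field layout: <ibdev> port <n> ==> <netdev> (Up)
parse_ibdev() {
  awk '/\(Up\)/ { print $1, $5; exit }'
}

# Example with the sample output from Step 1:
echo "mlx5_0 port 1 ==> ens64np1 (Up)" | parse_ibdev
# prints: mlx5_0 ens64np1
```

In a pod you would pipe `ibdev2netdev` into the helper instead of the echoed sample line.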
Create pods requesting GPU and RDMA resources on two different nodes:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-gpu-pod-1
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-test-network
spec:
  nodeSelector:
    kubernetes.io/hostname: gpu-node-1
  restartPolicy: OnFailure
  containers:
  - image: mellanox/cuda-perftest
    name: rdma-gpu-test
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_a: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_a: 1
```

Create a matching pod for the second node (`rdma-gpu-pod-2` on `gpu-node-2`).
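Rather than hand-editing a copy, the second manifest can be derived from the first. A sketch, assuming the only differences between the two manifests are the pod name and the nodeSelector hostname:

```shell
# Derive rdma-gpu-pod-2.yaml from the first manifest by swapping the
# pod name and the nodeSelector hostname. Assumes no other differences.
sed -e 's/rdma-gpu-pod-1/rdma-gpu-pod-2/g' \
    -e 's/gpu-node-1/gpu-node-2/g' \
    rdma-gpu-pod-1.yaml > rdma-gpu-pod-2.yaml
```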
```
kubectl apply -f rdma-gpu-pod-1.yaml -f rdma-gpu-pod-2.yaml
kubectl get pods -o wide
```

Step 3 – Run the Benchmark
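The client command below needs pod 1's IP. One way to pull it from the `kubectl get pods -o wide` listing, assuming the default column layout where IP is the sixth field (the `pod_ip` helper name is illustrative):

```shell
# Illustrative helper: extract a pod's IP from `kubectl get pods -o wide` output.
# Assumed columns: NAME READY STATUS RESTARTS AGE IP NODE ...
pod_ip() {
  awk -v pod="$1" '$1 == pod { print $6 }'
}

# Example with a sample listing line:
printf 'rdma-gpu-pod-1  1/1  Running  0  2m  10.244.1.15  gpu-node-1\n' | pod_ip rdma-gpu-pod-1
# prints: 10.244.1.15
```

In practice, `kubectl get pod rdma-gpu-pod-1 -o jsonpath='{.status.podIP}'` avoids column parsing entirely.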
Start the server on pod 1:
```
kubectl exec -it rdma-gpu-pod-1 -- ib_write_bw --use_cuda=0 --use_cuda_dmabuf \
  -d mlx5_0 -a -F --report_gbits -q 1
```

Run the client on pod 2 (replace `<pod-1-ip>` with pod 1's IP address):
```
kubectl exec -it rdma-gpu-pod-2 -- ib_write_bw -n 5000 --use_cuda=0 --use_cuda_dmabuf \
  -d mlx5_0 -a -F --report_gbits -q 1 <pod-1-ip>
```

Step 4 – Interpret Results
Expected output (100G NIC):
```
#bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]    MsgRate[Mpps]
65536      5000           92.39              92.38                 0.176196
131072     5000           92.42              92.41                 0.088131
1048576    5000           92.40              92.40                 0.011015
8388608    5000           92.39              92.39                 0.001377
```

Performance targets:
- 100G NIC – expect 90–95 Gbps for large messages
- 200G NIC – expect 180–195 Gbps for large messages
- Small messages – lower throughput is normal due to latency overhead
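If this validation runs in CI, the report can be gated automatically. A sketch, assuming the report layout shown above (fourth column is average bandwidth, rows keyed by message size); the `check_bw` helper name is illustrative:

```shell
# Illustrative gate: succeed only if the 8 MiB-message average bandwidth
# (4th column of the assumed report layout) meets the threshold in Gb/s.
check_bw() {
  awk -v min="$1" '$1 == 8388608 { ok = ($4 >= min) } END { exit ok ? 0 : 1 }'
}

# Example with the sample report row from Step 4:
printf '8388608 5000 92.39 92.39 0.001377\n' | check_bw 90 && echo PASS
# prints: PASS
```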
Step 5 – Validate the DMA-BUF Path
The `--use_cuda_dmabuf` flag selects the DMA-BUF registration path. If the benchmark falls back to the legacy nvidia-peermem module instead, you will see errors or warnings in the output.
Also verify with NCCL:

```
NCCL_DEBUG=INFO NCCL_NET_GDR_LEVEL=5 all_reduce_test
```

Look for `GPUDirect RDMA DMA-BUF enabled` in the debug log and confirm there is no peer-memory fallback.
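To check the NCCL debug output non-interactively, grep the captured log. A sketch, assuming the output was redirected to a file named `nccl.log`; the exact message wording varies across NCCL versions, so the `DMA-BUF` pattern here is an assumption:

```shell
# Assumption: NCCL_DEBUG=INFO output was redirected to nccl.log, and the
# DMA-BUF success message contains the string "DMA-BUF" (case-insensitive).
if grep -qi 'dma-buf' nccl.log; then
  echo "DMA-BUF path active"
else
  echo "WARNING: no DMA-BUF message found" >&2
fi
```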
Cleanup
```
kubectl delete pod rdma-gpu-pod-1 rdma-gpu-pod-2
```

Why This Matters
Benchmarking confirms that GPUDirect RDMA is functioning at the hardware level. Without validation, misconfigurations can silently degrade multi-node training throughput by falling back to CPU-staged transfers.

