NVIDIA GPUDirect Storage Benchmark on K8s
Benchmark NVIDIA GPUDirect Storage (GDS) on Kubernetes for direct NVMe-to-GPU data transfers. Covers gdsio, gds_stats, performance validation, and comparison
Quick Answer: GPUDirect Storage (GDS) bypasses the CPU and system memory for storage I/O, transferring data directly from NVMe/NFS to GPU memory. Use gdsio to benchmark raw GDS throughput and compare it against the traditional CPU bounce-buffer path to validate that your storage stack delivers full bandwidth.
The Problem
Traditional storage I/O for GPU workloads:
NVMe → PCIe → CPU RAM → PCIe → GPU VRAM (2 copies, CPU involved)
GPUDirect Storage:
NVMe → PCIe → GPU VRAM (1 copy, zero CPU involvement)
Without GDS, the CPU becomes a bottleneck for large dataset loading, and PCIe bandwidth is wasted on double copies.
The Solution
Verify GDS is Available
# Check GDS driver in GPU Operator
kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset
kubectl exec -n gpu-operator <driver-pod> -- nvidia-smi -q | grep "GPUDirect"
# Check GDS support on node
kubectl debug node/<node-name> -it --image=nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04 -- \
  sh -c '/usr/local/cuda/gds/tools/gds_stats'
Run GDS Benchmark with gdsio
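Before launching the Job, it can help to confirm on the node that the nvidia-fs kernel module is loaded, since GDS depends on it. The following is a minimal sketch, assuming the module appears as nvidia_fs in /proc/modules. The function names gds_module_loaded and gds_preflight are hypothetical helpers, not part of any NVIDIA tool.

```shell
#!/usr/bin/env bash
# Pre-flight sketch: is the nvidia-fs kernel module loaded on this node?
# The module list path is a parameter so the check can be pointed at a
# saved copy of /proc/modules; it defaults to the live file.
gds_module_loaded() {
  local modules_file="${1:-/proc/modules}"
  # /proc/modules lists one module per line, with the name in column 1.
  awk '$1 == "nvidia_fs" { found = 1 } END { exit !found }' "$modules_file"
}

# Print a short status line (hypothetical helper for scripts and logs).
gds_preflight() {
  if gds_module_loaded "$@"; then
    echo "nvidia_fs loaded"
  else
    echo "nvidia_fs missing"
  fi
}
```

Calling gds_preflight with no arguments, for example from the kubectl debug session, checks the live module list. If it reports the module as missing, the benchmark is likely to run in a fallback mode rather than over the direct path.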
apiVersion: batch/v1
kind: Job
metadata:
  name: gds-benchmark
  namespace: ai-bench
spec:
  template:
    spec:
      containers:
        - name: gdsio
          image: nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04
          command:
            - /bin/bash
            - -c
            - |
              echo "=== GDS Benchmark ==="
              nvidia-smi
              # Verify GDS is active
              /usr/local/cuda/gds/tools/gds_stats
              # Sequential read benchmark (-x 0 = GDS enabled, -I 0 = read).
              # If the test file does not exist yet, create it first with a write run (-I 1).
              /usr/local/cuda/gds/tools/gdsio \
                -f /mnt/nvme/testfile \
                -d 0 \
                -s 10G \
                -i 1M \
                -x 0 \
                -I 0 \
                -T 120
              echo "--- GDS disabled (bounce buffer) ---"
              # Same test through a CPU bounce buffer (-x 2) for comparison
              /usr/local/cuda/gds/tools/gdsio \
                -f /mnt/nvme/testfile \
                -d 0 \
                -s 10G \
                -i 1M \
                -x 2 \
                -I 0 \
                -T 120
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: nvme-storage
              mountPath: /mnt/nvme
            - name: shm
              mountPath: /dev/shm
          securityContext:
            privileged: true # Required for GDS direct access
      volumes:
        - name: nvme-storage
          hostPath:
            path: /mnt/local-nvme
        - name: shm
          emptyDir:
            medium: Memory
      restartPolicy: Never
      nodeSelector:
        nvidia.com/gds.present: "true"
gdsio Parameters Explained
/usr/local/cuda/gds/tools/gdsio \
  -f /mnt/nvme/testfile \
  -d 0 \
  -s 10G \
  -i 1M \
  -x 0 \
  -I 0 \
  -T 120 \
  -w 4
# -f  File to benchmark
# -d  GPU device index
# -s  File size
# -i  I/O block size
# -x  Transfer type: 0=GDS (GPU direct), 1=CPU only, 2=CPU bounce buffer to GPU
# -I  I/O type: 0=read, 1=write, 2=randread, 3=randwrite
# -T  Runtime in seconds
# -w  Number of worker threads
GDS with NFS over RDMA
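GDS over NFS depends on the mount actually using RDMA, so it is worth checking the mount options from inside the benchmark Pod before trusting the numbers. The sketch below assumes the standard /proc/mounts layout, where an NFS over RDMA mount lists proto=rdma among its options. The function name nfs_mount_uses_rdma is hypothetical. Note that the inline nfs volume in a Pod spec has no field for mount options, so whether RDMA is used depends on how the export gets mounted; a PersistentVolume with mountOptions is one common way to set it.

```shell
#!/usr/bin/env bash
# Return 0 if the mount at $1 is an NFS mount with proto=rdma.
# $2 is the mounts table to read (defaults to /proc/mounts).
nfs_mount_uses_rdma() {
  local mountpoint="$1" mounts_file="${2:-/proc/mounts}"
  awk -v mp="$mountpoint" '
    # Fields: device, mountpoint, fstype, options, dump, pass
    $2 == mp && $3 ~ /^nfs/ {
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++) if (opts[i] == "proto=rdma") ok = 1
    }
    END { exit !ok }
  ' "$mounts_file"
}
```

A benchmark script could call `nfs_mount_uses_rdma /mnt/nfs-rdma || echo "mount is not using RDMA"` ahead of gdsio so that a TCP mount is flagged before any results are recorded.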
apiVersion: batch/v1
kind: Job
metadata:
  name: gds-nfsrdma-bench
  namespace: ai-bench
spec:
  template:
    spec:
      containers:
        - name: gdsio
          image: nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04
          command:
            - /bin/bash
            - -c
            - |
              # GDS over NFSoRDMA: direct NFS-to-GPU path
              # (-I 0 = sequential read, -w 8 = worker threads)
              /usr/local/cuda/gds/tools/gdsio \
                -f /mnt/nfs-rdma/dataset/testfile \
                -d 0 \
                -s 10G \
                -i 4M \
                -x 0 \
                -I 0 \
                -T 120 \
                -w 8
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: nfs-rdma
              mountPath: /mnt/nfs-rdma
          securityContext:
            privileged: true
      volumes:
        - name: nfs-rdma
          nfs:
            server: 10.0.1.10
            path: /export/gds-enabled
      restartPolicy: Never
Expected Performance Comparison
I/O Path                  Seq Read 1M   Seq Write 1M
----------------------------------------------------
Local NVMe (no GDS)       3.2 GB/s      2.8 GB/s
Local NVMe (GDS)          6.8 GB/s      5.5 GB/s
NFS over TCP (no GDS)     1.5 GB/s      1.2 GB/s
NFS over RDMA (no GDS)    3.0 GB/s      2.5 GB/s
NFS over RDMA (GDS)       5.5 GB/s      4.5 GB/s
Improvement with GDS: ~2× throughput (CPU bottleneck removed)
Monitor GDS Statistics
# Real-time GDS stats during benchmark
watch -n 1 '/usr/local/cuda/gds/tools/gds_stats'
# Key metrics:
# bytes_read_gds: bytes transferred via GDS path
# bytes_read_posix: bytes via fallback (bounce buffer)
# gds_read_bandwidth: current GDS throughput
Common Issues
GDS falls back to bounce buffer silently
- Cause: Filesystem not GDS-compatible or alignment wrong
- Fix: Check gds_stats for posix counters increasing; verify ext4/xfs with 4K alignment
Permission denied for GDS
- Cause: Pod not running with sufficient privileges
- Fix: Set securityContext.privileged: true, or grant CAP_SYS_ADMIN plus device access
GDS not available on node
- Cause: GPU Operator not configured for GDS, or kernel module missing
- Fix: Enable GDS in GPU Operator Helm values: gds.enabled=true
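For the alignment cause listed under the silent-fallback issue, a quick arithmetic check on the I/O size and offsets can rule out the obvious mistakes before digging into counters. This is a minimal sketch, assuming sizes use binary multiples (1K = 1024 bytes) and that 4096 bytes is the alignment of interest. The helpers to_bytes and is_aligned are hypothetical names for illustration.

```shell
#!/usr/bin/env bash
# Convert a size such as 4K, 1M or 10G to bytes (binary multiples assumed).
to_bytes() {
  local value="$1" number unit
  number="${value%[KMGkmg]}"     # strip one trailing unit letter, if any
  unit="${value#"$number"}"      # whatever was stripped is the unit
  case "$unit" in
    K|k) echo $(( number * 1024 )) ;;
    M|m) echo $(( number * 1024 * 1024 )) ;;
    G|g) echo $(( number * 1024 * 1024 * 1024 )) ;;
    "")  echo "$number" ;;
  esac
}

# Succeed when the size is a multiple of the alignment (default 4096 bytes).
is_aligned() {
  local size_bytes align="${2:-4096}"
  size_bytes="$(to_bytes "$1")"
  (( size_bytes % align == 0 ))
}
```

For example, `is_aligned 1M` succeeds while `is_aligned 1000` fails, which is the kind of size worth fixing before re-running the benchmark.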
Best Practices
- Always compare GDS vs non-GDS: use -x 0 (GPU direct) vs -x 2 (CPU bounce buffer) to quantify improvement
- Use large I/O sizes: GDS benefits most with 1M+ block sizes
- Verify with gds_stats: confirm data flows through GDS path, not bounce buffer
- NFS requires RDMA: GDS over NFS only works with NFSoRDMA (not TCP)
- Node selector: schedule GDS benchmarks only on nodes with GDS support
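The first two practices can be combined into a small sweep that compares the direct and bounce-buffer paths across several I/O sizes. The sketch below only prints the commands rather than running them. It assumes the flag values from gdsio's usage text (-x 0 for GPU direct, -x 2 for CPU to GPU, -I 0 for sequential read) and reuses the file size and runtime from the Job in this recipe. The function name gdsio_sweep and the GDSIO override variable are hypothetical.

```shell
#!/usr/bin/env bash
# Print one gdsio command per (transfer mode, I/O size) pair.
# Nothing is executed here; review the output, then pipe it to bash.
gdsio_sweep() {
  local file="$1" gdsio="${GDSIO:-/usr/local/cuda/gds/tools/gdsio}"
  shift
  local xfer io_size
  for xfer in 0 2; do              # 0 = GPU direct, 2 = CPU bounce buffer to GPU
    for io_size in "$@"; do
      echo "$gdsio -f $file -d 0 -s 10G -i $io_size -x $xfer -I 0 -T 120"
    done
  done
}
```

For instance, `gdsio_sweep /mnt/nvme/testfile 1M 4M 16M` prints six commands, three per transfer mode. Printing first keeps the sweep easy to inspect and to paste into a Job's command block.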
Key Takeaways
- GDS provides ~2× throughput improvement by eliminating CPU bounce buffer
- Use gdsio for benchmarking, gds_stats for verification
- Works with local NVMe and NFS over RDMA
- Requires privileged Pods and GPU Operator GDS driver
- Critical for large-scale training where dataset loading is the bottleneck
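The ~2× figure can be checked against the numbers in the performance comparison by dividing each GDS result by its non-GDS counterpart. A minimal sketch follows; speedup is a hypothetical helper name, and the inputs are the GB/s values from the table.

```shell
#!/usr/bin/env bash
# Ratio of GDS throughput to baseline throughput, rounded to one decimal.
speedup() {
  awk -v gds="$1" -v base="$2" 'BEGIN { printf "%.1f\n", gds / base }'
}

speedup 6.8 3.2   # local NVMe, sequential read
speedup 5.5 2.8   # local NVMe, sequential write
speedup 5.5 3.0   # NFS over RDMA, sequential read
speedup 4.5 2.5   # NFS over RDMA, sequential write
```

This prints 2.1, 2.0, 1.8 and 1.8, so the table's own figures range from roughly 1.8× to 2.1×. Running the same calculation on your measured results shows how close your storage stack comes to that range.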

