Storage β€’ Advanced β€’ ⏱ 25 minutes β€’ K8s 1.27+

GPU Operator: GPUDirect Storage (GDS) Module

Enable the GPUDirect Storage (GDS) module in the NVIDIA GPU Operator ClusterPolicy for direct GPU-to-storage data transfers that bypass the CPU and system memory.

By Luca Berton β€’ Updated February 26, 2026 β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Enable GDS in the ClusterPolicy with gds.enabled: true and configure the nvidia-fs kernel module to allow GPUs to read/write directly from NVMe or RDMA-capable storage, bypassing CPU bounce buffers.

The Problem

AI training and inference workloads process massive datasets β€” loading terabytes of training data through the traditional storage path creates a bottleneck:

Traditional: Storage β†’ PCIe β†’ CPU β†’ System Memory β†’ PCIe β†’ GPU Memory
GPUDirect:   Storage β†’ PCIe β†’ GPU Memory (direct!)

Without GPUDirect Storage (GDS), every byte of data passes through the CPU and system memory, adding latency and consuming CPU resources. GDS eliminates this bottleneck with direct DMA transfers between storage and GPU memory.
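To put the bottleneck in rough numbers, here is a back-of-envelope sketch of epoch data-loading time with and without GDS. The throughput figures are illustrative assumptions, not measurements:

```python
# Back-of-envelope data-loading time for one pass over a large dataset.
# The throughput numbers below are illustrative assumptions, not benchmarks.

def load_time_seconds(dataset_gb: float, throughput_gbps: float) -> float:
    """Seconds to stream the dataset once at a given throughput (GB/s)."""
    return dataset_gb / throughput_gbps

dataset_gb = 2048          # 2 TB of training data
cpu_path_gbps = 6.0        # storage -> CPU -> bounce buffer -> GPU (assumed)
gds_path_gbps = 20.0       # storage -> GPU via direct DMA (assumed)

cpu_s = load_time_seconds(dataset_gb, cpu_path_gbps)
gds_s = load_time_seconds(dataset_gb, gds_path_gbps)
print(f"CPU path: {cpu_s:.0f}s, GDS path: {gds_s:.0f}s, "
      f"speedup: {cpu_s / gds_s:.1f}x")
# CPU path: 341s, GDS path: 102s, speedup: 3.3x
```

The win compounds across epochs: data that is re-read every epoch pays the bounce-buffer tax every time.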

The Solution

Step 1: Verify GDS Prerequisites

# Check GPU compatibility (A100, H100, H200, L40S)
nvidia-smi --query-gpu=name,pci.bus_id --format=csv

# Verify MOFED is running (required for RDMA-based storage)
kubectl get pods -n gpu-operator -l app=mofed-ubuntu

# Check NVMe devices (for local NVMe GDS)
lsblk | grep nvme
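The GPU check above can be automated in a node-inventory script. This is a minimal sketch that filters the `nvidia-smi` CSV output against the GPU models listed in the comment above; the model list mirrors that comment and is not exhaustive:

```python
# Sketch: filter `nvidia-smi --query-gpu=name,pci.bus_id --format=csv` output
# for GDS-capable GPUs. The model list mirrors the comment in the recipe
# above and is not an exhaustive compatibility list.

GDS_CAPABLE = ("A100", "H100", "H200", "L40S")

def gds_capable_gpus(smi_csv: str) -> list[str]:
    """Return names of GPUs whose model appears in the GDS-capable list."""
    rows = smi_csv.strip().splitlines()[1:]  # skip the CSV header row
    names = [row.split(",")[0].strip() for row in rows]
    return [n for n in names if any(model in n for model in GDS_CAPABLE)]

sample = """name, pci.bus_id
NVIDIA A100-SXM4-80GB, 00000000:0F:00.0
NVIDIA T4, 00000000:15:00.0"""
print(gds_capable_gpus(sample))  # ['NVIDIA A100-SXM4-80GB']
```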

Step 2: Enable GDS in ClusterPolicy

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true
    version: "550.127.08"
    rdma:
      enabled: true
  gds:
    enabled: true
    image: nvidia-fs
    repository: nvcr.io/nvidia/cloud-native
    version: "2.20.5"
    imagePullPolicy: IfNotPresent
    args: []
    env:
      - name: GDS_LOG_LEVEL
        value: "3"
  mofed:
    enabled: true
    image: mofed
    repository: nvcr.io/nvidia/mellanox
    version: "24.07-0.6.1.0"
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true

kubectl apply -f cluster-policy.yaml
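The GDS settings above have cross-field dependencies (GDS needs the driver; RDMA-based storage needs MOFED). A small pre-apply sanity check can catch mismatches before the operator does. This is a sketch encoding the dependencies described in this recipe, not an official validation schema:

```python
# Sketch: sanity-check ClusterPolicy settings that GDS depends on. The spec
# dict mirrors the YAML above; the rules encode the dependencies described
# in this recipe, not an official NVIDIA validation schema.

def check_gds_spec(spec: dict) -> list[str]:
    """Return a list of human-readable problems (empty list = looks fine)."""
    problems = []
    if not spec.get("gds", {}).get("enabled"):
        problems.append("gds.enabled must be true to deploy nvidia-fs")
    if not spec.get("driver", {}).get("enabled"):
        problems.append("driver.enabled must be true (GDS needs the GPU driver)")
    rdma = spec.get("driver", {}).get("rdma", {}).get("enabled", False)
    if rdma and not spec.get("mofed", {}).get("enabled", False):
        problems.append("driver.rdma.enabled without mofed.enabled: "
                        "RDMA storage paths need the MOFED driver")
    return problems

spec = {
    "driver": {"enabled": True, "rdma": {"enabled": True}},
    "gds": {"enabled": True},
    "mofed": {"enabled": True},
}
print(check_gds_spec(spec))  # [] -> no problems found
```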

Step 3: Verify GDS Module is Loaded

# Check nvidia-fs driver pods
kubectl get pods -n gpu-operator -l app=nvidia-fs-ctr

# Verify the nvidia-fs kernel module is loaded on nodes
kubectl exec -n gpu-operator -it $(kubectl get pod -n gpu-operator \
  -l app=nvidia-fs-ctr -o jsonpath='{.items[0].metadata.name}') \
  -- lsmod | grep nvidia_fs
# Expected: nvidia_fs  <size>  0

# Check GDS configuration
kubectl exec -n gpu-operator -it $(kubectl get pod -n gpu-operator \
  -l app=nvidia-fs-ctr -o jsonpath='{.items[0].metadata.name}') \
  -- cat /proc/driver/nvidia-fs/stats
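A readiness probe or monitoring script can scrape the read/write counters out of that stats file to confirm GDS traffic is actually flowing. The sample text below is a simplified illustration of the stats layout, not a verbatim dump of the file:

```python
# Sketch: extract read/write counters from /proc/driver/nvidia-fs/stats so
# a probe can confirm GDS traffic is flowing. The sample text below is a
# simplified illustration of the stats layout, not a verbatim dump.
import re

def parse_nvidia_fs_counters(stats_text: str) -> dict[str, int]:
    """Extract 'n=<count>' counters from the Reads/Writes lines."""
    counters = {}
    for line in stats_text.splitlines():
        m = re.match(r"\s*(Reads|Writes)\s*:\s*n=(\d+)", line)
        if m:
            counters[m.group(1).lower()] = int(m.group(2))
    return counters

sample = """GDS Version: 1.10
Reads  : n=1532 ok=1532 err=0
Writes : n=87 ok=87 err=0"""
print(parse_nvidia_fs_counters(sample))  # {'reads': 1532, 'writes': 87}
```

Sampling these counters before and after a workload run tells you whether the cuFile path was used at all.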

Step 4: Test GDS Performance

Use the gdsio benchmark tool:

apiVersion: v1
kind: Pod
metadata:
  name: gds-benchmark
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"
  containers:
    - name: benchmark
      image: nvcr.io/nvidia/pytorch:24.07-py3
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: "1"
      securityContext:
        privileged: true
      volumeMounts:
        - name: nvme-data
          mountPath: /data
  volumes:
    - name: nvme-data
      hostPath:
        path: /mnt/nvme0
        type: Directory

kubectl apply -f gds-benchmark.yaml

# Run GDS benchmark (read test; -x 0 = GPU Direct transfer, -I 0 = sequential read)
kubectl exec gds-benchmark -- gdsio -f /data/testfile \
  -d 0 -w 4 -s 1G -i 1M -x 0 -I 0

# Run without GDS for comparison (-x 1 = CPU-only path, same read workload)
kubectl exec gds-benchmark -- gdsio -f /data/testfile \
  -d 0 -w 4 -s 1G -i 1M -x 1 -I 0

# Expected: GDS path shows 2-5x higher throughput
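To compare the two runs programmatically, a small parser can pull the throughput figure out of each gdsio result line. The output format below is an illustrative approximation of a gdsio summary line, not guaranteed verbatim:

```python
# Sketch: compute the GDS speedup from two gdsio runs. The summary-line
# format below is an illustrative approximation of gdsio output, not a
# guaranteed verbatim format.
import re

def throughput_gib_s(gdsio_output: str) -> float:
    """Pull the 'Throughput: X GiB/sec' figure out of a gdsio result line."""
    m = re.search(r"Throughput:\s*([\d.]+)\s*GiB/sec", gdsio_output)
    if m is None:
        raise ValueError("no throughput line found")
    return float(m.group(1))

gds_run = "IoType: READ XferType: GPUD Threads: 4 Throughput: 18.2 GiB/sec"
cpu_run = "IoType: READ XferType: CPUONLY Threads: 4 Throughput: 6.1 GiB/sec"
speedup = throughput_gib_s(gds_run) / throughput_gib_s(cpu_run)
print(f"GDS speedup: {speedup:.1f}x")  # GDS speedup: 3.0x
```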

Step 5: Configure GDS with cuFile API

Applications must use the cuFile API to leverage GDS. Example in Python with kvikio:

import kvikio
import cupy as cp
import numpy as np

# Open file with GDS
f = kvikio.CuFile("/data/model-weights.bin", "r")

# Read directly into GPU memory β€” bypasses CPU!
gpu_buffer = cp.empty(1024 * 1024 * 1024, dtype=cp.uint8)  # 1GB
bytes_read = f.read(gpu_buffer)
f.close()

print(f"Read {bytes_read / 1e9:.2f} GB directly to GPU memory via GDS")
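For large files, kvikio can also split a single read into parallel tasks (its `pread` API takes a task size). The chunking pattern it relies on can be sketched as a pure planning helper; the function name and shape here are illustrative, not kvikio internals:

```python
# Sketch: plan fixed-size chunks for a parallel file read, the pattern a
# task-based reader like kvikio's pread uses. Pure planning helper; the
# names here are illustrative, not kvikio internals.

def chunk_plan(file_size: int, chunk_size: int) -> list[tuple[int, int]]:
    """Return (offset, length) pairs covering file_size in chunk_size pieces."""
    plan = []
    offset = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        plan.append((offset, length))
        offset += length
    return plan

# A 10 MiB file in 4 MiB chunks -> two full chunks plus a 2 MiB tail
print(chunk_plan(10 * 1024**2, 4 * 1024**2))
```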

GDS Architecture

flowchart TD
    subgraph Without GDS
        A1[NVMe Storage] -->|PCIe| B1[CPU]
        B1 -->|System RAM| C1[Bounce Buffer]
        C1 -->|PCIe| D1[GPU Memory]
    end

    subgraph With GDS
        A2[NVMe Storage] -->|"PCIe Direct DMA"| D2[GPU Memory]
    end

    style A2 fill:#76b900
    style D2 fill:#76b900

GDS Compatibility Matrix

Storage Type           GDS Support   Notes
Local NVMe             βœ… Full       Best performance, direct PCIe path
NFS over RDMA          βœ… Full       Requires MOFED + NFS RDMA server
Lustre                 βœ… Full       Parallel filesystem, common in HPC
GPFS/Spectrum Scale    βœ… Full       IBM parallel filesystem
WekaFS                 βœ… Full       High-performance distributed FS
Ceph RBD               ⚠️ Partial    Requires RDMA-capable Ceph OSD
Standard NFS           ❌ No         TCP-based NFS doesn’t support GDS
iSCSI                  ❌ No         Not supported
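For automated pre-flight checks, the matrix above can be encoded as a lookup helper. The support levels mirror this recipe's table and are a summary, not an official NVIDIA support statement:

```python
# Sketch: encode the compatibility matrix above for pre-flight checks.
# Support levels mirror this recipe's table (a summary, not an official
# NVIDIA support statement); the key names are illustrative.

GDS_SUPPORT = {
    "local-nvme": "full",
    "nfs-rdma": "full",
    "lustre": "full",
    "gpfs": "full",
    "wekafs": "full",
    "ceph-rbd": "partial",
    "nfs-tcp": "none",
    "iscsi": "none",
}

def gds_supported(storage_type: str) -> bool:
    """True if the storage backend has at least partial GDS support."""
    return GDS_SUPPORT.get(storage_type, "none") != "none"

print(gds_supported("local-nvme"), gds_supported("nfs-tcp"))  # True False
```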

Common Issues

nvidia-fs Module Fails to Load

# Check pod logs
kubectl logs -n gpu-operator -l app=nvidia-fs-ctr --tail=50

# Common cause: kernel module version mismatch
# Fix: update GDS version to match GPU driver

GDS Not Used Despite Being Enabled

Applications must explicitly use the cuFile API. Standard read()/write() syscalls bypass GDS:

# Verify GDS is being used by checking stats
kubectl exec gds-benchmark -- cat /proc/driver/nvidia-fs/stats
# "nr_reads" and "nr_writes" should increase during GDS operations

Permission Issues

GDS requires elevated privileges for direct DMA:

securityContext:
  privileged: true
  # Or more targeted:
  capabilities:
    add:
      - SYS_ADMIN
      - IPC_LOCK

Best Practices

  • Use NVMe for highest throughput β€” local NVMe provides the shortest PCIe path
  • Pin GDS version to GPU driver version β€” mismatches cause module load failures
  • Monitor /proc/driver/nvidia-fs/stats β€” verify GDS is actually being used
  • Use kvikio for Python workloads β€” provides a Pythonic cuFile API wrapper
  • Enable MOFED first β€” GDS over network storage requires RDMA drivers
  • Benchmark before and after β€” use gdsio to quantify the improvement
  • Pre-stage data on NVMe β€” GDS works best with data already on fast storage

Key Takeaways

  • GDS eliminates the CPU bounce buffer for storage I/O, enabling direct GPU-to-storage DMA
  • Enable via gds.enabled: true in the GPU Operator ClusterPolicy
  • Applications must use the cuFile API (or kvikio for Python) β€” standard I/O calls don’t use GDS
  • GDS provides 2-5x throughput improvement for data loading in AI training pipelines
  • Works with NVMe, NFS over RDMA, Lustre, GPFS, and WekaFS
#nvidia #gpu-operator #gds #gpudirect-storage #storage #performance
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
