ai • intermediate • ⏱ 15 minutes • K8s 1.28+

NVIDIA GPU Feature Discovery for Kubernetes

Deploy GPU Feature Discovery (GFD) to auto-label Kubernetes nodes with GPU model, MIG capability, CUDA version, and driver info for intelligent scheduling.

By Luca Berton • 📖 5 min read

💡 Quick Answer: GPU Feature Discovery (GFD) is an NVIDIA component, deployed as a DaemonSet, that reads GPU hardware info via NVML and auto-labels Kubernetes nodes with GPU model, count, MIG capability, CUDA version, and driver info. These labels are essential for GPU-aware scheduling and MIG management.

The Problem

Without GFD, Kubernetes nodes with GPUs only show basic info:

nvidia.com/gpu.present=true

That’s it. You can’t:

  • Schedule workloads to specific GPU models (A100 vs H100 vs T4)
  • Know if a node supports MIG partitioning
  • Check CUDA/driver versions for compatibility
  • Differentiate GPU memory sizes across nodes

The Solution

GFD queries NVML (NVIDIA Management Library) and populates node labels automatically.

Labels Generated by GFD

nvidia.com/gpu.present=true
nvidia.com/gpu.count=1
nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
nvidia.com/gpu.memory=81920
nvidia.com/gpu.family=ampere
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=0
nvidia.com/gpu.replicas=1
nvidia.com/mig.capable=true
nvidia.com/mig.strategy=mixed
nvidia.com/cuda.driver.major=570
nvidia.com/cuda.driver.minor=133
nvidia.com/cuda.driver.rev=20
nvidia.com/cuda.runtime.major=12
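nvidia.com/cuda.runtime.minor=9

Deploy GFD via GPU Operator

# Add the NVIDIA Helm repository first (skip if it is already configured)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \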
  --namespace nvidia-gpu-operator \
  --create-namespace \
  --set gfd.enabled=true \
  --set devicePlugin.enabled=true
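
Once GFD has labeled the nodes, a cluster-wide GPU inventory is one `kubectl` call away using the `-L` (label columns) flag. The sketch below wraps that call in a hypothetical helper, `gpu_inventory`, and adds `mib_to_gib` to convert the `nvidia.com/gpu.memory` value, which GFD reports in MiB. Both function names are illustrative and not part of any NVIDIA tooling.

```shell
# Hypothetical helpers; the function names are illustrative only.

# Print one row per node with GPU model, count, and memory (MiB) as extra columns.
gpu_inventory() {
  kubectl get nodes \
    -L nvidia.com/gpu.product,nvidia.com/gpu.count,nvidia.com/gpu.memory
}

# Convert the MiB value of nvidia.com/gpu.memory to whole GiB.
mib_to_gib() {
  echo $(( $1 / 1024 ))
}
```

For the A100 label shown above, `mib_to_gib 81920` prints 80.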

Deploy GFD Standalone (Without GPU Operator)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-feature-discovery
  namespace: kube-system
  labels:
    app: gpu-feature-discovery
spec:
  selector:
    matchLabels:
      app: gpu-feature-discovery
  template:
    metadata:
      labels:
        app: gpu-feature-discovery
    spec:
      nodeSelector:
        feature.node.kubernetes.io/pci-10de.present: "true"
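      containers:
        - name: gpu-feature-discovery
          # Verify that this tag exists in your registry; recent GFD releases
          # may ship inside the nvcr.io/nvidia/k8s-device-plugin image instead.
          image: nvcr.io/nvidia/gpu-feature-discovery:v0.16.0
          # GFD needs access to the GPU devices on the host.
          securityContext:
            privileged: true
          env:
            - name: GFD_SLEEP_INTERVAL
              value: "60s"  # a duration, so it needs a unit
            - name: GFD_MIG_STRATEGY
              value: "mixed"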
          volumeMounts:
            - name: output-dir
              mountPath: /etc/kubernetes/node-feature-discovery/features.d
            - name: host-sys
              mountPath: /sys
      volumes:
        - name: output-dir
          hostPath:
            path: /etc/kubernetes/node-feature-discovery/features.d
        - name: host-sys
          hostPath:
            path: /sys
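
Applying the manifest and waiting for the rollout might look like the sketch below. The file name `gfd-daemonset.yaml` and the helper name `deploy_gfd` are assumptions (save the manifest above under any name); the DaemonSet name and namespace come from the manifest itself.

```shell
# Hypothetical helper: apply the standalone GFD manifest, then wait for the rollout.
# $1 is the manifest path; it defaults to an assumed file name.
deploy_gfd() {
  kubectl apply -f "${1:-gfd-daemonset.yaml}" &&
    kubectl -n kube-system rollout status daemonset/nvidia-gpu-feature-discovery --timeout=120s
}
```

If no pods get scheduled, check that Node Feature Discovery has set `feature.node.kubernetes.io/pci-10de.present=true` on the GPU nodes, because the manifest's nodeSelector depends on that label.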

Use GFD Labels for Scheduling

# Schedule only on A100 nodes
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
  containers:
    - name: trainer
      image: registry.example.com/training:v2.0
      resources:
        limits:
          nvidia.com/gpu: 1
---
# Schedule on any MIG-capable GPU
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  nodeSelector:
    nvidia.com/mig.capable: "true"
  containers:
    - name: model
      image: nvcr.io/nim/nim-llm:2.0.2
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
---
# Require a minimum CUDA driver major version (570 or newer)
apiVersion: v1
kind: Pod
metadata:
  name: cuda-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/cuda.driver.major
                # Gt compares the label value as an integer: > 569 means 570+
                operator: Gt
                values: ["569"]
  containers:
    - name: app
      image: registry.example.com/cuda-app:v1.0
      resources:
        limits:
          nvidia.com/gpu: 1
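
kubectl label selectors have no numeric comparison, so previewing which nodes meet a minimum driver version needs a client-side filter. A minimal sketch, assuming two hypothetical helpers: `list_driver_majors` prints `NODE MAJOR` pairs from the GFD label, and `filter_min_driver` keeps the nodes at or above a threshold.

```shell
# Hypothetical helper: print "NODE MAJOR" for every node, using the GFD driver label.
# Nodes without the label show "<none>" in the second column.
list_driver_majors() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,MAJOR:.metadata.labels.nvidia\.com/cuda\.driver\.major'
}

# Hypothetical helper: keep nodes whose driver major version is >= $1.
# Reads "NODE MAJOR" lines on stdin; non-numeric values are skipped.
filter_min_driver() {
  awk -v min="$1" '$2 ~ /^[0-9]+$/ && ($2 + 0) >= (min + 0) { print $1 }'
}
```

A typical call would be `list_driver_majors | filter_min_driver 570`, which prints only the node names a pod with that requirement could land on.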

Verify GFD is Running

# Check DaemonSet
kubectl get ds -A | grep -E 'gfd|gpu-feature'

# Check pods
kubectl get pods -A | grep -i gpu-feature-discovery

# View all NVIDIA labels on a node
kubectl get node worker-gpu-gwc-0 --show-labels | tr ',' '\n' | grep nvidia.com

# Check GFD logs
kubectl logs -n nvidia-gpu-operator -l app=gpu-feature-discovery --tail=50

Troubleshoot Missing Labels

# Only nvidia.com/gpu.present=true? Check if GFD is running:
kubectl get pods -A | grep gpu-feature
# Empty → GFD not deployed

# GFD running but no labels? Check logs:
kubectl logs -n nvidia-gpu-operator -l app=gpu-feature-discovery

# Common errors:
# "Failed to initialize NVML" → driver not loaded
# "No GPU devices found" → driver loaded but no GPU visible to container
# Check driver (works only if the node's container runtime exposes the NVIDIA driver to the debug pod):
kubectl debug node/worker-gpu-gwc-0 -it --image=nvidia/cuda:12.9.0-base-ubuntu24.04 -- nvidia-smi
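
The first check above (a node that carries only `nvidia.com/gpu.present=true`) can be scripted. The sketch below assumes a hypothetical helper, `missing_gfd_labels`, that reads one `key=value` label per line and prints the expected GFD keys that are absent. The list of expected keys is an illustrative choice taken from the labels shown earlier.

```shell
# Hypothetical helper: report expected GFD label keys missing from stdin.
# Input: one "key=value" label per line. Output: one missing key per line.
# The body runs in a subshell so its variables do not leak into the caller.
missing_gfd_labels() (
  labels="$(cat)"
  for key in nvidia.com/gpu.product nvidia.com/gpu.count \
             nvidia.com/gpu.memory nvidia.com/mig.capable; do
    # awk exits 0 only when a line's key matches exactly.
    printf '%s\n' "$labels" |
      awk -F= -v k="$key" '$1 == k { found = 1 } END { exit !found }' ||
      printf '%s\n' "$key"
  done
)
```

It can be fed from the same pipeline used in the verification step, for example `kubectl get node worker-gpu-gwc-0 --show-labels --no-headers | awk '{print $NF}' | tr ',' '\n' | missing_gfd_labels`. Empty output means every expected key is present.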

Common Issues

GFD pod in CrashLoopBackOff

  • Cause: NVIDIA driver not loaded or NVML library not accessible
  • Fix: Verify driver is installed; on Talos check extensions are loaded; ensure /dev/nvidia* devices exist

Labels not updating after GPU change

  • Cause: GFD only polls at GFD_SLEEP_INTERVAL (default 60s)
  • Fix: Wait 60s or restart the GFD pod on that node
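
Restarting only the GFD pod on the affected node can be done with a label selector plus a field selector on `spec.nodeName`. A sketch, assuming the GPU Operator namespace and pod label used in the commands above; `restart_gfd_on_node` is a hypothetical helper name.

```shell
# Hypothetical helper: delete the GFD pod on one node so the DaemonSet recreates it.
# $1 = node name, $2 = namespace (defaults to the GPU Operator namespace used above).
restart_gfd_on_node() {
  kubectl delete pod \
    -n "${2:-nvidia-gpu-operator}" \
    -l app=gpu-feature-discovery \
    --field-selector "spec.nodeName=$1"
}
```

For example, `restart_gfd_on_node worker-gpu-gwc-0`; for the standalone deployment, pass `kube-system` as the second argument.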

MIG labels missing despite MIG-capable GPU

  • Cause: GFD_MIG_STRATEGY not set or GPU Operator mig.strategy not configured
  • Fix: Set GFD_MIG_STRATEGY=mixed or single in GFD config
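
When GFD is managed by the GPU Operator, the strategy is set through the chart's `mig.strategy` value rather than on the pod directly. A sketch, assuming the release name and namespace from the install command earlier; `set_mig_strategy` is a hypothetical helper name.

```shell
# Hypothetical helper: switch the GPU Operator's MIG strategy on an existing release.
# Accepts only the two strategies named in the fix above.
set_mig_strategy() {
  case "$1" in
    single|mixed) ;;
    *) echo "usage: set_mig_strategy single|mixed" >&2; return 2 ;;
  esac
  helm upgrade gpu-operator nvidia/gpu-operator \
    --namespace nvidia-gpu-operator \
    --reuse-values \
    --set "mig.strategy=$1"
}
```

`--reuse-values` keeps the settings from the original install, so only the strategy changes.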

Best Practices

  1. Deploy via GPU Operator: it manages GFD, the device plugin, and mig-manager together
  2. Use the mixed MIG strategy: it provides the most flexibility for heterogeneous workloads
  3. Build scheduling rules on GFD labels: gpu.product, mig.capable, cuda.driver.major
  4. Don’t hardcode node names: use label selectors for GPU-aware scheduling
  5. Monitor GFD health: if GFD stops, new nodes won’t get labeled

Key Takeaways

  • GFD auto-labels nodes with GPU model, count, memory, MIG capability, CUDA/driver versions
  • Without GFD, you only see nvidia.com/gpu.present=true, which is not enough for smart scheduling
  • Deployed as DaemonSet via GPU Operator or standalone
  • Labels enable GPU-model-specific scheduling, MIG management, and driver compatibility checks
  • On Talos: GPU Operator with driver.enabled=false still deploys GFD correctly
