NVIDIA GPU Feature Discovery for Kubernetes

💡 Quick Answer: GPU Feature Discovery (GFD) is an NVIDIA DaemonSet that reads GPU hardware info via NVML and auto-labels Kubernetes nodes with GPU model, count, MIG capability, CUDA version, and driver info. Essential for GPU-aware scheduling and MIG management.

The Problem

Without GFD, Kubernetes nodes with GPUs only show basic info:

nvidia.com/gpu.present=true

That’s it. You can’t:

Schedule workloads to specific GPU models (A100 vs H100 vs T4)
Know if a node supports MIG partitioning
Check CUDA/driver versions for compatibility
Differentiate GPU memory sizes across nodes

The Solution

GFD queries NVML (NVIDIA Management Library) on each GPU node and writes the results to a feature file; Node Feature Discovery (NFD) then publishes them as node labels automatically.

Labels Generated by GFD

nvidia.com/gpu.present=true
nvidia.com/gpu.count=1
nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
nvidia.com/gpu.memory=81920
nvidia.com/gpu.family=ampere
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=0
nvidia.com/gpu.replicas=1
nvidia.com/mig.capable=true
nvidia.com/mig.strategy=mixed
nvidia.com/cuda.driver.major=570
nvidia.com/cuda.driver.minor=133
nvidia.com/cuda.driver.rev=20
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=8
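
Because the driver version is split across three labels, it can help to reassemble it when checking a node. The following is a minimal sketch, not part of GFD itself: `gfd_driver_version` is a hypothetical helper name, and it assumes the labels arrive one `key=value` pair per line, which is the format `kubectl label node <name> --list` prints.

```shell
# Print the full driver version (major.minor.rev) from GFD labels.
# Input: one key=value label per line on stdin.
# Exits non-zero if the driver labels are missing (node not labeled by GFD).
gfd_driver_version() {
  awk -F= '
    $1 == "nvidia.com/cuda.driver.major" { major = $2 }
    $1 == "nvidia.com/cuda.driver.minor" { minor = $2 }
    $1 == "nvidia.com/cuda.driver.rev"   { rev = $2 }
    END {
      if (major == "") exit 1
      printf "%s.%s.%s\n", major, minor, rev
    }'
}

# Example against a live cluster (node name taken from this article):
#   kubectl label node worker-gpu-gwc-0 --list | gfd_driver_version
```

A non-zero exit status is a quick hint that GFD has not labeled the node yet.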

Deploy GFD via GPU Operator (Recommended)

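# Add NVIDIA's Helm repository first (skip if it is already configured)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace nvidia-gpu-operator \
  --create-namespace \
  --set gfd.enabled=true \
  --set devicePlugin.enabled=true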

Deploy GFD Standalone (Without GPU Operator)

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-feature-discovery
  namespace: kube-system
  labels:
    app: gpu-feature-discovery
spec:
  selector:
    matchLabels:
      app: gpu-feature-discovery
  template:
    metadata:
      labels:
        app: gpu-feature-discovery
    spec:
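      # This label is set by Node Feature Discovery (NFD), which must already be
      # running in the cluster; 10de is NVIDIA's PCI vendor ID.
      nodeSelector:
        feature.node.kubernetes.io/pci-10de.present: "true"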
      containers:
        - name: gpu-feature-discovery
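          # Verify that this tag exists before deploying: recent GFD releases are
          # believed to ship inside the nvcr.io/nvidia/k8s-device-plugin image instead.
          image: nvcr.io/nvidia/gpu-feature-discovery:v0.16.0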
          env:
            - name: GFD_SLEEP_INTERVAL
              value: "60s"   # a duration, so the unit suffix is required
            - name: GFD_MIG_STRATEGY
              value: "mixed"
          volumeMounts:
            - name: output-dir
              mountPath: /etc/kubernetes/node-feature-discovery/features.d
            - name: host-sys
              mountPath: /sys
      volumes:
        - name: output-dir
          hostPath:
            path: /etc/kubernetes/node-feature-discovery/features.d
        - name: host-sys
          hostPath:
            path: /sys
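
The `output-dir` mount is where GFD hands its results to NFD. To confirm that GFD is writing there before debugging the labeling side, the file can be inspected directly on the node. This is a sketch under two assumptions: that GFD uses its usual output file name `gfd` inside `features.d`, and that the file holds one `key=value` pair per line. `gfd_feature` is a hypothetical helper name.

```shell
# Look up one label in a GFD feature file (one key=value pair per line).
# $1: label key
# $2: path to the feature file (assumed default: GFD's usual output path)
# Exits non-zero if the key is not present.
gfd_feature() {
  key=$1
  file=${2:-/etc/kubernetes/node-feature-discovery/features.d/gfd}
  awk -F= -v k="$key" '$1 == k { print $2; found = 1 } END { exit !found }' "$file"
}

# Example, run on the GPU node itself:
#   gfd_feature nvidia.com/gpu.product
```

If the file exists and has the expected keys but the node still lacks labels, the problem is more likely on the NFD side than in GFD.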

Use GFD Labels for Scheduling

# Schedule only on A100 nodes
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
  containers:
    - name: trainer
      image: registry.example.com/training:v2.0
      resources:
        limits:
          nvidia.com/gpu: 1
---
# Schedule on a MIG-capable node and request one 1g.10gb slice
# (the node must already expose that MIG profile)
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  nodeSelector:
    nvidia.com/mig.capable: "true"
  containers:
    - name: model
      image: nvcr.io/nim/nim-llm:2.0.2
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
---
# Require minimum CUDA driver version
apiVersion: v1
kind: Pod
metadata:
  name: cuda-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/cuda.driver.major
                operator: Gt      # integer comparison: any major version above 569
                values: ["569"]
  containers:
    - name: app
      image: registry.example.com/cuda-app:v1.0
      resources:
        limits:
          nvidia.com/gpu: 1
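
Before relying on a driver-version rule, it can be useful to check which nodes would actually satisfy it. The sketch below mirrors the "minimum driver major version" check outside the scheduler. `meets_min_driver` is a hypothetical helper name, and it again assumes `key=value` lines on stdin as printed by `kubectl label node <name> --list`.

```shell
# Succeed if the node's driver major version is at least the required one.
# $1: minimum major version (integer)
# stdin: node labels, one key=value pair per line
meets_min_driver() {
  awk -F= -v min="$1" '
    $1 == "nvidia.com/cuda.driver.major" { major = $2 }
    END { exit !(major != "" && major + 0 >= min + 0) }'
}

# Example: list GPU nodes that would match a "570 or newer" rule.
#   for n in $(kubectl get nodes -l nvidia.com/gpu.present=true -o name); do
#     kubectl label "$n" --list | meets_min_driver 570 && echo "$n"
#   done
```

An empty result means a Pod with that requirement would stay Pending.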

Verify GFD is Running

# Check DaemonSet
kubectl get ds -A | grep -E 'gfd|gpu-feature'

# Check pods
kubectl get pods -A | grep -i gpu-feature-discovery

# View all NVIDIA labels on a node
kubectl get node worker-gpu-gwc-0 --show-labels | tr ',' '\n' | grep nvidia.com

# Check GFD logs
kubectl logs -n nvidia-gpu-operator -l app=gpu-feature-discovery --tail=50
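
On larger clusters, checking nodes one by one gets tedious. A rough way to spot GPU nodes that GFD has not labeled is to list the product label for every GPU node and filter for the missing ones. `unlabeled_gpu_nodes` is a hypothetical helper name; the sketch assumes a two-column table with a header row, where `kubectl` prints `<none>` for an absent label.

```shell
# Print the names of nodes whose product column is <none>.
# stdin: a "NAME PRODUCT" table with one header row.
unlabeled_gpu_nodes() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}

# Example against a live cluster:
#   kubectl get nodes -l nvidia.com/gpu.present=true \
#     -o custom-columns='NAME:.metadata.name,PRODUCT:.metadata.labels.nvidia\.com/gpu\.product' \
#     | unlabeled_gpu_nodes
```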

Troubleshoot Missing Labels

# Only nvidia.com/gpu.present=true? Check if GFD is running:
kubectl get pods -A | grep gpu-feature
# Empty → GFD not deployed

# GFD running but no labels? Check logs:
kubectl logs -n nvidia-gpu-operator -l app=gpu-feature-discovery

# Common errors:
# "Failed to initialize NVML" → driver not loaded
# "No GPU devices found" → driver loaded but no GPU visible to container
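# Check the driver from the node. This assumes the NVIDIA container runtime is the
# node's default, so that nvidia-smi is injected into the CUDA base image:
kubectl debug node/worker-gpu-gwc-0 -it --image=nvidia/cuda:12.9.0-base-ubuntu24.04 -- nvidia-smi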
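
The two error messages listed above can be matched mechanically when scanning longer logs. This is only a sketch: `gfd_log_hint` is a hypothetical helper name, and it matches just the two message fragments quoted in this article, so other failure modes pass through silently.

```shell
# Read GFD log lines on stdin and print a hint for each known error.
gfd_log_hint() {
  while IFS= read -r line; do
    case $line in
      *"Failed to initialize NVML"*)
        echo "driver not loaded" ;;
      *"No GPU devices found"*)
        echo "driver loaded but no GPU visible to container" ;;
    esac
  done
}

# Example against a live cluster:
#   kubectl logs -n nvidia-gpu-operator -l app=gpu-feature-discovery | gfd_log_hint
```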

Common Issues

GFD pod in CrashLoopBackOff

Cause: NVIDIA driver not loaded or NVML library not accessible
Fix: Verify driver is installed; on Talos check extensions are loaded; ensure /dev/nvidia* devices exist

Labels not updating after GPU change

Cause: GFD only polls at GFD_SLEEP_INTERVAL (default 60s)
Fix: Wait 60s or restart the GFD pod on that node
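
Restarting the GFD pod on a single node can be done by deleting it and letting the DaemonSet recreate it. The helper below is a sketch, with `restart_gfd_on_node` as a hypothetical name. It assumes the `app=gpu-feature-discovery` label and the `nvidia-gpu-operator` namespace used elsewhere in this article; pass `kube-system` as the second argument for the standalone DaemonSet.

```shell
# Delete the GFD pod on one node so the DaemonSet recreates it.
# $1: node name
# $2: namespace (assumed default: nvidia-gpu-operator)
# Set KUBECTL=echo to print the command instead of running it.
restart_gfd_on_node() {
  ${KUBECTL:-kubectl} delete pod \
    -n "${2:-nvidia-gpu-operator}" \
    -l app=gpu-feature-discovery \
    --field-selector "spec.nodeName=$1"
}

# Example:
#   restart_gfd_on_node worker-gpu-gwc-0
```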

MIG labels missing despite MIG-capable GPU

Cause: GFD_MIG_STRATEGY not set or GPU Operator mig.strategy not configured
Fix: Set GFD_MIG_STRATEGY=mixed or single in GFD config

Best Practices

Deploy via GPU Operator — manages GFD, device plugin, and mig-manager together
Use mixed MIG strategy — provides most flexibility for heterogeneous workloads
Build scheduling rules on GFD labels — gpu.product, mig.capable, cuda.driver.major
Don’t hardcode node names — use label selectors for GPU-aware scheduling
Monitor GFD health — if GFD stops, new nodes won’t get labeled

Key Takeaways

GFD auto-labels nodes with GPU model, count, memory, MIG capability, CUDA/driver versions
Without GFD, you only see nvidia.com/gpu.present=true — not enough for smart scheduling
Deployed as DaemonSet via GPU Operator or standalone
Labels enable GPU-model-specific scheduling, MIG management, and driver compatibility checks
On Talos: GPU Operator with driver.enabled=false still deploys GFD correctly
