NVIDIA GPU Feature Discovery for Kubernetes
Deploy GPU Feature Discovery (GFD) to auto-label Kubernetes nodes with GPU model, MIG capability, CUDA version, and driver info for intelligent scheduling.
Quick Answer: GPU Feature Discovery (GFD) is an NVIDIA DaemonSet that reads GPU hardware info via NVML and auto-labels Kubernetes nodes with GPU model, count, MIG capability, CUDA version, and driver info. Essential for GPU-aware scheduling and MIG management.
The Problem
Without GFD, Kubernetes nodes with GPUs only show basic info:

nvidia.com/gpu.present=true

That's it. You can't:
- Schedule workloads to specific GPU models (A100 vs H100 vs T4)
- Know if a node supports MIG partitioning
- Check CUDA/driver versions for compatibility
- Differentiate GPU memory sizes across nodes
The Solution
GFD queries NVML (NVIDIA Management Library) and populates node labels automatically.
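In the file-based setup used by the standalone manifest later in this article, GFD does not label the node through the API server itself. It writes `key=value` lines into the `features.d` directory, and Node Feature Discovery (NFD) publishes them as node labels. The sketch below previews what such a file would contribute. The function name and the file name `gfd` in the usage comment are assumptions for illustration; check your own deployment for the actual path.

```shell
#!/bin/sh
# Print the nvidia.com/* entries found in an NFD local feature file.
# Assumption: the file holds one key=value pair per line.
preview_gfd_labels() {
  features_file="$1"
  # Keep only nvidia.com/ keys (this also drops blank lines and comments),
  # then sort for stable output.
  grep -E '^nvidia\.com/[^=]+=' "$features_file" | sort
}

# Example, run on the GPU node itself (the path is an assumption):
# preview_gfd_labels /etc/kubernetes/node-feature-discovery/features.d/gfd
```

If the file is empty or missing while the GFD pod is running, the problem is more likely in GFD than in NFD.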
Labels Generated by GFD
nvidia.com/gpu.present=true
nvidia.com/gpu.count=1
nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
nvidia.com/gpu.memory=81920
nvidia.com/gpu.family=ampere
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=0
nvidia.com/gpu.replicas=1
nvidia.com/mig.capable=true
nvidia.com/mig.strategy=mixed
nvidia.com/cuda.driver.major=570
nvidia.com/cuda.driver.minor=133
nvidia.com/cuda.driver.rev=20
nvidia.com/cuda.runtime.major=12
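nvidia.com/cuda.runtime.minor=9

Deploy GFD via GPU Operator (Recommended)
# Add the NVIDIA Helm repository so the nvidia/gpu-operator chart resolves
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace nvidia-gpu-operator \
  --create-namespace \
  --set gfd.enabled=true \
  --set devicePlugin.enabled=true

Deploy GFD Standalone (Without GPU Operator)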
# Standalone GFD writes its labels to the features.d directory mounted below;
# Node Feature Discovery (NFD) must be running on the cluster to publish them.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-gpu-feature-discovery
  namespace: kube-system
  labels:
    app: gpu-feature-discovery
spec:
  selector:
    matchLabels:
      app: gpu-feature-discovery
  template:
    metadata:
      labels:
        app: gpu-feature-discovery
    spec:
      nodeSelector:
        # NFD label for nodes with an NVIDIA PCI device (vendor ID 10de)
        feature.node.kubernetes.io/pci-10de.present: "true"
      containers:
        - name: gpu-feature-discovery
          # Confirm this tag exists in the registry before deploying
          image: nvcr.io/nvidia/gpu-feature-discovery:v0.16.0
          env:
            - name: GFD_SLEEP_INTERVAL
              value: "60s"  # a duration, so include the unit
            - name: GFD_MIG_STRATEGY
              value: "mixed"
          volumeMounts:
            - name: output-dir
              mountPath: /etc/kubernetes/node-feature-discovery/features.d
            - name: host-sys
              mountPath: /sys
      volumes:
        - name: output-dir
          hostPath:
            path: /etc/kubernetes/node-feature-discovery/features.d
        - name: host-sys
          hostPath:
            path: /sys

Use GFD Labels for Scheduling
# Schedule only on A100 nodes
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  nodeSelector:
    nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
  containers:
    - name: trainer
      image: registry.example.com/training:v2.0
      resources:
        limits:
          nvidia.com/gpu: 1
---
# Schedule on any MIG-capable GPU
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  nodeSelector:
    nvidia.com/mig.capable: "true"
  containers:
    - name: model
      image: nvcr.io/nim/nim-llm:2.0.2
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
---
# Require a minimum CUDA driver version (major 570 or newer)
apiVersion: v1
kind: Pod
metadata:
  name: cuda-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/cuda.driver.major
                # Gt is strict and takes one integer, so "569" admits 570 and newer.
                # Use operator In with a list to pin specific driver branches instead.
                operator: Gt
                values: ["569"]
  containers:
    - name: app
      image: registry.example.com/cuda-app:v1.0
      resources:
        limits:
          nvidia.com/gpu: 1

Verify GFD is Running
# Check DaemonSet
kubectl get ds -A | grep -E 'gfd|gpu-feature'
# Check pods
kubectl get pods -A | grep -i gpu-feature-discovery
# View all NVIDIA labels on a node
kubectl get node worker-gpu-gwc-0 --show-labels | tr ',' '\n' | grep nvidia.com
# Check GFD logs (use -n kube-system for the standalone DaemonSet)
kubectl logs -n nvidia-gpu-operator -l app=gpu-feature-discovery --tail=50

Troubleshoot Missing Labels
# Only nvidia.com/gpu.present=true? Check if GFD is running:
kubectl get pods -A | grep gpu-feature
# Empty output: GFD is not deployed
# GFD running but no labels? Check logs:
kubectl logs -n nvidia-gpu-operator -l app=gpu-feature-discovery
# Common errors:
# "Failed to initialize NVML": driver not loaded
# "No GPU devices found": driver loaded but no GPU visible to container
# Check driver:
kubectl debug node/worker-gpu-gwc-0 -it --image=nvidia/cuda:12.9.0-base-ubuntu24.04 -- nvidia-smi

Common Issues
GFD pod in CrashLoopBackOff
- Cause: NVIDIA driver not loaded or NVML library not accessible
- Fix: Verify the driver is installed; on Talos, check that the extensions are loaded; ensure /dev/nvidia* devices exist
Labels not updating after GPU change
- Cause: GFD only polls every GFD_SLEEP_INTERVAL (default 60s)
- Fix: Wait 60s or restart the GFD pod on that node
MIG labels missing despite MIG-capable GPU
- Cause: GFD_MIG_STRATEGY not set, or GPU Operator mig.strategy not configured
- Fix: Set GFD_MIG_STRATEGY=mixed or single in the GFD config
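When triaging several nodes, it can help to map the log messages from the troubleshooting section to their likely cause automatically. The following is a minimal sketch: the two message strings come from this article, while the function name and the wording of the hints are illustrative assumptions. Real GFD logs may phrase these errors differently across versions.

```shell
#!/bin/sh
# Read GFD log text on stdin and print a one-line diagnosis.
diagnose_gfd_log() {
  log=$(cat)
  if printf '%s\n' "$log" | grep -q 'Failed to initialize NVML'; then
    echo 'driver-not-loaded: NVIDIA driver missing or NVML not accessible'
  elif printf '%s\n' "$log" | grep -q 'No GPU devices found'; then
    echo 'no-gpu-visible: driver loaded but no GPU visible to the container'
  else
    echo 'unknown: no known error pattern found'
  fi
}

# Example (namespace and label selector as used earlier in this article):
# kubectl logs -n nvidia-gpu-operator -l app=gpu-feature-discovery --tail=200 | diagnose_gfd_log
```

An `unknown` result only means neither pattern matched; it does not prove that GFD is healthy.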
Best Practices
- Deploy via GPU Operator: it manages GFD, the device plugin, and mig-manager together
- Use the mixed MIG strategy: it provides the most flexibility for heterogeneous workloads
- Build scheduling rules on GFD labels such as gpu.product, mig.capable, and cuda.driver.major
- Don't hardcode node names: use label selectors for GPU-aware scheduling
- Monitor GFD health: if GFD stops, new nodes won't get labeled
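Before committing a scheduling rule to a manifest, it may be worth checking that the label values it depends on actually exist on a node. The helpers below are a sketch that parses the comma-separated label list printed by `kubectl get node <name> --show-labels`; the function names are hypothetical, and the label keys are the ones listed earlier in this article.

```shell
#!/bin/sh
# Read `kubectl get node ... --show-labels` output (or a bare comma-separated
# label list) on stdin and print the value of the requested label key.
label_value() {
  key="$1"
  # The labels are the last whitespace-separated column; split them on commas.
  awk '{ print $NF }' | tr ',' '\n' | awk -F= -v k="$key" '$1 == k { print $2; exit }'
}

# Exit 0 when nvidia.com/cuda.driver.major is present and >= the given minimum.
driver_major_at_least() {
  min="$1"
  major=$(label_value nvidia.com/cuda.driver.major)
  [ -n "$major" ] && [ "$major" -ge "$min" ]
}

# Example:
# kubectl get node worker-gpu-gwc-0 --show-labels --no-headers | driver_major_at_least 570 && echo "driver ok"
```

This only inspects one node at a time; the scheduler still does the real matching through nodeSelector or nodeAffinity.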
Key Takeaways
- GFD auto-labels nodes with GPU model, count, memory, MIG capability, and CUDA/driver versions
- Without GFD, you only see nvidia.com/gpu.present=true, which is not enough for smart scheduling
- GFD is deployed as a DaemonSet, via GPU Operator or standalone
- Labels enable GPU-model-specific scheduling, MIG management, and driver compatibility checks
- On Talos: GPU Operator with driver.enabled=false still deploys GFD correctly

