GPU Feature Discovery Node Labels
Configure NVIDIA GPU Feature Discovery to label Kubernetes nodes automatically with GPU model, driver version, CUDA version, and MIG details that the scheduler can use for placement.
The Problem
Kubernetes does not label nodes with GPU model, driver version, CUDA, or MIG details on its own. Without that setup, GPU workloads on Kubernetes suffer from wasted resources, failed scheduling, or degraded inference performance.
The Solution
Prerequisites
# Verify GPU nodes are available
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep -A5 "Allocatable"
# Check NVIDIA driver and CUDA
kubectl exec -it <gpu-pod> -- nvidia-smi
Configuration
# GPU Feature Discovery Node Labels: production configuration
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
  namespace: gpu-inference
spec:
  containers:
    - name: inference
      image: nvcr.io/nvidia/pytorch:24.07-py3
      resources:
        limits:
          nvidia.com/gpu: 1
        requests:
          nvidia.com/gpu: 1
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
  nodeSelector:
    nvidia.com/gpu.present: "true"
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
Deployment
# Apply GPU workload (the manifest sets the gpu-inference namespace)
kubectl apply -f gpu-workload.yaml
# Verify GPU allocation
kubectl describe pod gpu-workload -n gpu-inference | grep -A3 "Limits"
# Monitor GPU utilization
kubectl exec -it gpu-workload -n gpu-inference -- nvidia-smi dmon -s pucvmet -d 5
Verification
# Check GPU is accessible inside the pod
kubectl exec -it gpu-workload -n gpu-inference -- python3 -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'GPU count: {torch.cuda.device_count()}')
print(f'GPU name: {torch.cuda.get_device_name(0)}')
print(f'Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB')
"
graph TD
A[GPU Node] --> B[NVIDIA Driver]
B --> C[Container Toolkit]
C --> D[Device Plugin]
D --> E[Pod GPU Access]
E --> F{Inference / Training}
F --> G[Monitor with nvidia-smi]
G --> H[Scale with HPA/KEDA]
Common Issues
GPU not visible inside pod
Check that the NVIDIA device plugin DaemonSet is running on the node. Verify with kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset. If missing, the GPU Operator may need reinstalling.
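When diagnosing a missing GPU, it can also help to confirm which nvidia.com/* labels the node actually carries. The following is a minimal Python sketch, assuming the node object has been exported as JSON (for example with kubectl get node <gpu-node> -o json). The helper name nvidia_labels and the sample node are hypothetical, and the label values are illustrative only.

```python
import json


def nvidia_labels(node_json: str) -> dict:
    """Return only the nvidia.com/* labels from a Node object serialized as JSON."""
    node = json.loads(node_json)
    labels = node.get("metadata", {}).get("labels", {})
    return {k: v for k, v in labels.items() if k.startswith("nvidia.com/")}


# Hypothetical, trimmed-down stand-in for `kubectl get node <gpu-node> -o json`.
SAMPLE_NODE = json.dumps({
    "metadata": {
        "name": "gpu-node-1",
        "labels": {
            "kubernetes.io/hostname": "gpu-node-1",
            "nvidia.com/gpu.present": "true",
            "nvidia.com/gpu.product": "NVIDIA-A100-SXM4-40GB",
            "nvidia.com/gpu.count": "8",
        },
    }
})

if __name__ == "__main__":
    # Print the GPU-related labels in a stable order.
    for key, value in sorted(nvidia_labels(SAMPLE_NODE).items()):
        print(f"{key}={value}")
```

An empty result suggests that labeling has not run on that node, which narrows the problem to the labeling components before looking at the workload itself.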
CUDA version mismatch
The container CUDA version must be compatible with the host driver. Use nvidia-smi on the node to check the driver version, then select a compatible container image from the NVIDIA NGC catalog.
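The same check can be sketched against node labels instead of running nvidia-smi by hand. This is a minimal Python sketch, assuming the node exposes the driver version through the labels nvidia.com/cuda.driver.major and nvidia.com/cuda.driver.minor; label keys can differ between GPU Feature Discovery releases, so verify them on your own nodes. The minimum version used below is a hypothetical placeholder, not an NVIDIA requirement.

```python
from typing import Optional, Tuple

# Assumed label keys; confirm them against the labels on your nodes.
DRIVER_MAJOR = "nvidia.com/cuda.driver.major"
DRIVER_MINOR = "nvidia.com/cuda.driver.minor"


def driver_version(labels: dict) -> Optional[Tuple[int, int]]:
    """Read (major, minor) from the driver labels, or None if missing or malformed."""
    try:
        return int(labels[DRIVER_MAJOR]), int(labels[DRIVER_MINOR])
    except (KeyError, ValueError):
        return None


def driver_at_least(labels: dict, required: Tuple[int, int]) -> bool:
    """True when the labeled driver version is at or above the required one."""
    found = driver_version(labels)
    return found is not None and found >= required


if __name__ == "__main__":
    # Hypothetical node labels and a hypothetical minimum driver version.
    node = {DRIVER_MAJOR: "550", DRIVER_MINOR: "54"}
    print(driver_at_least(node, (535, 0)))
```

Treating a missing or malformed label as "not compatible" is a deliberately conservative choice: it keeps workloads off nodes whose driver version cannot be confirmed.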
Out of memory on GPU
Reduce batch size, enable gradient checkpointing for training, or use model quantization (AWQ/GPTQ) for inference. Monitor with nvidia-smi to track peak memory usage.
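To see why quantization helps, a rough weights-only estimate is often enough. The sketch below is illustrative arithmetic in Python, not a sizing tool: it ignores activations, KV cache, and quantization metadata, and it assumes the GPU memory figure is given in MiB. The function names, the 80% headroom factor, and the 4-bit width (a common AWQ/GPTQ setting) are assumptions made for this example.

```python
def weights_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


def weights_fit(num_params: int, bits_per_param: int,
                gpu_memory_mib: int, headroom: float = 0.8) -> bool:
    """True if the weights fit within a fraction (headroom) of GPU memory."""
    budget_gib = gpu_memory_mib / 1024 * headroom
    return weights_gib(num_params, bits_per_param) <= budget_gib


if __name__ == "__main__":
    params = 7_000_000_000   # hypothetical 7B-parameter model
    gpu_mib = 16384          # hypothetical 16 GiB GPU
    for bits in (16, 4):
        size = weights_gib(params, bits)
        print(f"{bits}-bit: {size:.2f} GiB, fits: {weights_fit(params, bits, gpu_mib)}")
```

Real usage will be higher than this estimate, so treat a "fits" result as a lower bound and confirm peak memory with nvidia-smi.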
Best Practices
- Always set resources.limits for nvidia.com/gpu; without it, pods won't get GPU access
- Use node selectors or affinity to target specific GPU types (A100, H100, etc.)
- Monitor GPU utilization with DCGM Exporter + Prometheus; idle GPUs waste expensive resources
- Pin CUDA container versions; don't use latest tags in production
- Enable GPU health checks with liveness probes that verify CUDA functionality
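Targeting specific GPU types with affinity can be sketched as follows. This minimal Python example builds the affinity stanza of a pod spec as a plain dictionary, assuming nodes carry a nvidia.com/gpu.product label. The helper name gpu_node_affinity is hypothetical, and the product strings are example values; the exact values depend on your hardware and should be read from your own nodes.

```python
import json


def gpu_node_affinity(products: list) -> dict:
    """Build a required node affinity that matches any of the given GPU products."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                # Assumed label key; check it on your nodes.
                                "key": "nvidia.com/gpu.product",
                                "operator": "In",
                                "values": list(products),
                            }
                        ]
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    # Example product values only; they vary by cluster.
    affinity = gpu_node_affinity(["NVIDIA-A100-SXM4-80GB", "NVIDIA-H100-80GB-HBM3"])
    print(json.dumps(affinity, indent=2))
```

The resulting dictionary would sit under spec.affinity in the pod manifest. Using the In operator with several values lets one workload accept a small set of GPU models, which a single nodeSelector entry cannot express.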
Key Takeaways
- GPU Feature Discovery node labels are critical for production GPU workloads on Kubernetes
- Proper resource configuration prevents scheduling failures and resource waste
- Monitor GPU utilization to right-size allocations and reduce cloud costs
- Use NVIDIA GPU Operator for automated driver and toolkit lifecycle management
- Combine with KEDA or custom metrics HPA for GPU-aware autoscaling
