Talos Linux MIG Configuration with GPU Operator
Configure NVIDIA MIG on Talos Linux Kubernetes clusters. Install GPU Operator, set MIG strategy, and dynamically partition A100 GPUs without node reboot.
Quick Answer: On Talos Linux with NVIDIA GPU extensions, MIG reconfiguration requires the GPU Operator with mig-manager. Without it, setting the nvidia.com/mig.config label does nothing. Install the GPU Operator with mig.strategy: mixed, then use kubectl label to switch MIG profiles dynamically.
The Problem
Talos Linux provides NVIDIA drivers via system extensions, and nodes show nvidia.com/gpu.present=true. But basic driver presence doesn't mean MIG management is available:
- No nvidia.com/mig.capable label: GPU Feature Discovery (GFD) is not running
- No nvidia.com/mig.config.state label: mig-manager is not deployed
- No nvidia.com/gpu.product or nvidia.com/gpu.count label: GPU properties are not being published, and the device plugin is not advertising GPUs
- Setting the nvidia.com/mig.config label has no effect without mig-manager watching it
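The checklist above can be turned into a small label check. This is a minimal sketch, not part of the GPU Operator: the function name diagnose_gpu_labels is hypothetical, and the hints are the likely causes listed above rather than a definitive diagnosis.

```shell
# Reads node labels on stdin (one "key=value" per line) and prints a hint
# for each expected NVIDIA label that is absent. Prints nothing if all exist.
diagnose_gpu_labels() {
  labels=$(cat)
  for check in \
    'nvidia.com/mig.capable|GPU Feature Discovery (GFD) is probably not running' \
    'nvidia.com/mig.config.state|mig-manager is probably not deployed' \
    'nvidia.com/gpu.product|GPU properties are not being published to the node'
  do
    key=${check%%|*}    # text before the first "|"
    hint=${check#*|}    # text after the first "|"
    # -F: match the key literally, so the dots are not regex wildcards
    printf '%s\n' "$labels" | grep -qF "${key}=" || echo "missing ${key}: ${hint}"
  done
}

# Usage against a real cluster (same pipeline as Step 1):
#   kubectl get node worker-gpu-gwc-0 --show-labels | tr ',' '\n' | diagnose_gpu_labels
```

On a node that only carries nvidia.com/gpu.present=true, this prints all three hints, which matches the "GPU Operator is missing" case.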
The Solution
Architecture
On Talos, the stack splits cleanly between OS-level and Kubernetes-level:
Talos Extensions (immutable, OS-level)
  └─ NVIDIA driver/kernel modules + nvidia-container-toolkit
NVIDIA GPU Operator (Kubernetes-level)
  ├─ driver: disabled (Talos provides it)
  ├─ toolkit: disabled (Talos provides it)
  ├─ device-plugin: enabled
  ├─ gpu-feature-discovery: enabled
  ├─ mig-manager: enabled
  ├─ dcgm-exporter: enabled
  └─ validator: optional
Talos manages the immutable driver, GPU Operator manages the Kubernetes GPU runtime, and MIG is declarative via labels, with no shelling into nodes.
Step 1: Verify Current NVIDIA State
# Check what NVIDIA labels exist
kubectl get node worker-gpu-gwc-0 --show-labels | tr ',' '\n' | grep nvidia
# If you only see nvidia.com/gpu.present=true, GPU Operator is missing
# Check for GPU Operator components
kubectl get pods -A | grep -E 'gpu-operator|mig|nvidia'
# Empty = GPU Operator not installed
# Verify GPU model (critical for MIG profile selection)
kubectl debug node/worker-gpu-gwc-0 -it --image=nvidia/cuda:12.9.0-base-ubuntu24.04 -- nvidia-smi -L
# GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-...)
Step 2: Install NVIDIA GPU Operator with MIG Support
# Add NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install GPU Operator with MIG strategy
helm upgrade --install gpu-operator nvidia/gpu-operator \
-n gpu-operator --create-namespace \
--set driver.enabled=false \
--set toolkit.enabled=false \
--set mig.strategy=mixed \
--set migManager.enabled=true \
--set gfd.enabled=true \
--set devicePlugin.enabled=true \
--set dcgmExporter.enabled=true
Key flags for Talos:
- driver.enabled=false: Talos provides the driver via extensions
- toolkit.enabled=false: Talos bundles the container toolkit in extensions
- mig.strategy=mixed: allows different MIG profiles per GPU and exposes each profile as its own nvidia.com/mig-<profile> resource (use single for a uniform layout)
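The same settings can be kept in a values file instead of a chain of --set flags, which is easier to review and commit. The sketch below writes one; the file name gpu-operator-values.yaml is an arbitrary choice, and the keys are simply the flags from the command above in nested form.

```shell
# Write a Helm values file equivalent to the --set flags used in Step 2.
values_file=gpu-operator-values.yaml   # hypothetical file name

cat > "$values_file" <<'EOF'
driver:
  enabled: false      # Talos provides the driver via extensions
toolkit:
  enabled: false      # Talos bundles the container toolkit in extensions
mig:
  strategy: mixed     # or "single" for a uniform layout
migManager:
  enabled: true
gfd:
  enabled: true
devicePlugin:
  enabled: true
dcgmExporter:
  enabled: true
EOF

echo "wrote $values_file"

# Then, against a real cluster:
#   helm upgrade --install gpu-operator nvidia/gpu-operator \
#     -n gpu-operator --create-namespace -f gpu-operator-values.yaml
```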
Step 3: Verify GPU Operator Deployment
# Check all GPU Operator pods are running
kubectl get pods -n gpu-operator
# NAME READY STATUS
# gpu-operator-... 1/1 Running
# nvidia-device-plugin-daemonset-... 1/1 Running
# nvidia-gpu-feature-discovery-... 1/1 Running
# nvidia-mig-manager-... 1/1 Running
# Verify NVIDIA labels are now populated
kubectl get node worker-gpu-gwc-0 --show-labels | tr ',' '\n' | grep nvidia
# nvidia.com/gpu.present=true
# nvidia.com/gpu.count=1
# nvidia.com/gpu.product=NVIDIA-A100-80GB-PCIe
# nvidia.com/mig.capable=true
# nvidia.com/cuda.driver.major=570
Step 4: Configure MIG Layout
# Cordon and drain
kubectl cordon worker-gpu-gwc-0
kubectl drain worker-gpu-gwc-0 --ignore-daemonsets --delete-emptydir-data
# Apply MIG configuration (A100 80GB profiles)
kubectl label node worker-gpu-gwc-0 nvidia.com/mig.config=all-1g.10gb --overwrite
# Watch mig-manager apply the configuration
kubectl logs -n gpu-operator -l app=nvidia-mig-manager -c nvidia-mig-manager -f
# Verify success
kubectl get node worker-gpu-gwc-0 \
-o jsonpath='{.metadata.labels.nvidia\.com/mig\.config}{"\n"}{.metadata.labels.nvidia\.com/mig\.config\.state}{"\n"}'
# all-1g.10gb
# success
# Uncordon
kubectl uncordon worker-gpu-gwc-0
MIG Profiles: A100 80GB vs 40GB
A100 80GB (e.g., Azure Standard_NC24ads_A100_v4 if 80GB SKU):
| Profile | Instances | Memory Each |
|---|---|---|
| all-1g.10gb | 7 | 10 GB |
| all-2g.20gb | 3 | 20 GB |
| all-3g.40gb | 2 | 40 GB |
| all-7g.80gb | 1 | 80 GB |
A100 40GB:
| Profile | Instances | Memory Each |
|---|---|---|
| all-1g.5gb | 7 | 5 GB |
| all-2g.10gb | 3 | 10 GB |
| all-3g.20gb | 2 | 20 GB |
| all-4g.20gb | 1 | 20 GB |
| all-7g.40gb | 1 | 40 GB |
Warning: Using all-1g.10gb on an A100 40GB will fail because the profile doesn't exist on that GPU. Always verify your GPU model first.
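A pre-flight check can catch a mismatched profile before the node is labeled. This is a sketch under two assumptions: it only knows the profiles from the two tables above, and it assumes the nvidia.com/gpu.product value contains both "A100" and the memory size (as in NVIDIA-A100-80GB-PCIe). The function names are hypothetical.

```shell
# Prints the space-separated profiles from the tables above for a given
# nvidia.com/gpu.product value. Returns 1 for a product it does not know.
mig_profiles_for_product() {
  case "$1" in
    *A100*80GB*) echo "all-1g.10gb all-2g.20gb all-3g.40gb all-7g.80gb" ;;
    *A100*40GB*) echo "all-1g.5gb all-2g.10gb all-3g.20gb all-4g.20gb all-7g.40gb" ;;
    *) return 1 ;;
  esac
}

# profile_is_valid PRODUCT PROFILE
# Returns 0 if PROFILE is listed for PRODUCT, 1 if not, 2 if PRODUCT is unknown.
profile_is_valid() {
  product=$1
  profile=$2
  profiles=$(mig_profiles_for_product "$product") || return 2
  case " $profiles " in
    *" $profile "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage against a real cluster:
#   product=$(kubectl get node worker-gpu-gwc-0 \
#     -o jsonpath='{.metadata.labels.nvidia\.com/gpu\.product}')
#   profile_is_valid "$product" all-1g.10gb || echo "profile does not match $product"
```

Treating an unknown product as a separate return code keeps the check from silently passing or failing on GPUs that the tables do not cover.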
MIG Strategy: Single vs Mixed
# Single strategy: all GPUs on a node get the same MIG profile
mig:
  strategy: single
# Mixed strategy: different GPUs can have different profiles
mig:
  strategy: mixed
With mixed strategy, you can use custom ConfigMaps:
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      inference-optimized:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "3g.40gb": 1
            "1g.10gb": 4
Talos-Specific: GPU Extensions Configuration
# Talos machine config for NVIDIA extensions
machine:
  install:
    extensions:
      - image: ghcr.io/siderolabs/nvidia-open-gpu-kernel-modules:570.133.20-v1.10.0
      - image: ghcr.io/siderolabs/nvidia-container-toolkit:570.133.20-v1.17.7
  kernel:
    modules:
      - name: nvidia
      - name: nvidia_uvm
      - name: nvidia_drm
      - name: nvidia_modeset
Debug Checklist
# 1. Is the GPU visible at all?
kubectl describe node worker-gpu-gwc-0 | grep -i 'gpu.product'
# 2. Is GFD running and labeling?
kubectl logs -n gpu-operator -l app=gpu-feature-discovery --tail=50
# 3. Is mig-manager deployed?
kubectl get ds -n gpu-operator | grep mig
# 4. What's the mig-manager doing?
kubectl logs -n gpu-operator -l app=nvidia-mig-manager -c nvidia-mig-manager --tail=200
# 5. Are MIG devices advertised?
kubectl get node worker-gpu-gwc-0 -o json | \
jq '.status.allocatable | to_entries[] | select(.key | startswith("nvidia.com/mig-"))'
# 6. Device plugin healthy?
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset --tail=50
Common Issues
Label set but mig.config.state never appears
- Cause: mig-manager not deployed (GPU Operator missing or migManager.enabled=false)
- Fix: Install GPU Operator with --set migManager.enabled=true
mig-manager fails with "driver not loaded"
- Cause: Talos extension not properly configured
- Fix: Verify kernel modules are loaded: kubectl debug node/... -- ls /dev/nvidia*
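Besides listing device files, the loaded modules can be compared against the four named in the Talos machine config. The sketch below is a hypothetical helper that reads /proc/modules-style text on stdin; feeding it from talosctl read is an assumption about how you reach the node, and <node-ip> is a placeholder.

```shell
# Reads /proc/modules-style text on stdin and prints each expected NVIDIA
# kernel module that is NOT loaded. Prints nothing when all four are present.
missing_nvidia_modules() {
  loaded=$(awk '{print $1}')   # first column of /proc/modules is the module name
  for mod in nvidia nvidia_uvm nvidia_drm nvidia_modeset; do
    # -x: match the whole line, so "nvidia" does not match "nvidia_uvm"
    printf '%s\n' "$loaded" | grep -qx "$mod" || echo "$mod"
  done
}

# Usage against a real node (placeholder address):
#   talosctl -n <node-ip> read /proc/modules | missing_nvidia_modules
```

Any module name it prints is one to look for under machine.kernel.modules in the node's machine config.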
Wrong MIG profile for GPU model
- Cause: Using A100-80GB profiles on A100-40GB (or vice versa)
- Fix: Check the nvidia.com/gpu.product label and use matching profiles
Device plugin shows 0 MIG resources
- Cause: Device plugin hasn't re-enumerated after MIG change
- Fix: Wait 1-2 minutes; check device plugin logs for errors
Best Practices
- Verify GPU model before choosing profiles: 80GB and 40GB have different MIG geometries
- Disable driver/toolkit in GPU Operator on Talos: Talos provides these via extensions
- Use mixed MIG strategy for flexibility across GPU workloads
- Always drain before MIG changes: in-flight GPU workloads will fail
- Monitor mig-manager logs during reconfiguration: it shows each step
- Label nodes with intended MIG profile: enables GitOps-driven GPU fleet management
Key Takeaways
- Talos provides NVIDIA drivers via extensions, but GPU Operator is still needed for MIG management
- Without mig-manager, nvidia.com/mig.config labels are ignored
- Install GPU Operator with driver.enabled=false and toolkit.enabled=false on Talos
- A100 80GB and 40GB have different MIG profile names: verify your SKU
- The workflow is: drain → label → wait for success → uncordon
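The "wait for success" step is the one most easily skipped when the workflow is scripted. A minimal polling sketch follows; wait_for_mig_state and get_mig_state are hypothetical names, and treating a state of failed as terminal is an assumption about mig-manager's reporting, since only success appears in the steps above.

```shell
# wait_for_mig_state TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it prints "success" (returns 0), prints
# "failed" (returns 1), or TIMEOUT_SECONDS elapse (returns 2).
# INTERVAL_SECONDS must be at least 1, otherwise the loop never times out.
wait_for_mig_state() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    state=$("$@" 2>/dev/null) || state=""
    case "$state" in
      success) return 0 ;;
      failed)  echo "mig-manager reported failed" >&2; return 1 ;;
    esac
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "timed out after ${timeout}s (last state: ${state:-none})" >&2
  return 2
}

# Hypothetical helper: prints the mig.config.state label of one node.
get_mig_state() {
  kubectl get node "$1" \
    -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}'
}

# Usage against a real cluster, after cordon, drain, and label:
#   wait_for_mig_state 600 10 get_mig_state worker-gpu-gwc-0 \
#     && kubectl uncordon worker-gpu-gwc-0
```

Passing the state command as arguments keeps the polling loop independent of kubectl, so it can be exercised with a stub before it is pointed at a cluster.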

