ai intermediate ⏱ 15 minutes K8s 1.28+

NVIDIA PyTorch Container on Kubernetes

Deploy nvcr.io/nvidia/pytorch containers on Kubernetes for GPU training. Version selection, CUDA compatibility, multi-node DDP, and NCCL configuration.

By Luca Berton • 5 min read

💡 Quick Answer: Use nvcr.io/nvidia/pytorch:24.07-py3 (or a newer monthly tag) for GPU training on Kubernetes. The NVIDIA PyTorch containers ship pre-built CUDA, cuDNN, NCCL, and PyTorch, which covers everything needed for single-node and multi-node GPU training. Request GPU resources (nvidia.com/gpu: 1), mount shared storage for datasets, and set NCCL environment variables for multi-node communication.

The Problem

Building PyTorch containers for GPU training is complex:

  • CUDA, cuDNN, NCCL version compatibility matrix
  • GPU driver compatibility with container CUDA version
  • Multi-node training requires specific NCCL and network config
  • Building from source takes hours and often fails

The Solution

NVIDIA PyTorch Container Versions

# Container naming: nvcr.io/nvidia/pytorch:YY.MM-py3
# YY.MM = year.month release cycle

# Popular versions:
# 24.07-py3  → CUDA 12.5, PyTorch 2.4, NCCL 2.22
# 24.10-py3  → CUDA 12.6, PyTorch 2.5, NCCL 2.23
# 25.01-py3  → CUDA 12.7, PyTorch 2.6, NCCL 2.25
# 25.04-py3  → CUDA 12.8, PyTorch 2.7, NCCL 2.26
# 25.11-py3  → CUDA 13.0, PyTorch 2.8, NCCL 2.28 (latest reference)

# Check available tags
curl -s "https://nvcr.io/v2/nvidia/pytorch/tags/list" | jq '.tags[-10:]'
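
The YY.MM naming scheme sorts cleanly once it is parsed. The following is a minimal Python sketch, assuming tags follow the YY.MM-py3 pattern described above. The names parse_ngc_tag and newest_tag and the sample tag list are illustrative and not part of any NVIDIA tooling.

```python
import re

# Matches tags such as "24.07-py3"; other tag variants are ignored by this sketch.
TAG_PATTERN = re.compile(r"^(\d{2})\.(\d{2})-py3$")


def parse_ngc_tag(tag):
    """Return (year, month) for a YY.MM-py3 tag, or None if it does not match."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        return None
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        return None
    return 2000 + year, month


def newest_tag(tags):
    """Pick the most recent monthly tag from an iterable of tag strings."""
    dated = [(parse_ngc_tag(t), t) for t in tags]
    dated = [(d, t) for d, t in dated if d is not None]
    if not dated:
        raise ValueError("no YY.MM-py3 tags found")
    return max(dated)[1]


# Hypothetical sample; a real list would come from the registry query above.
sample = ["24.07-py3", "25.01-py3", "24.10-py3", "latest"]
print(newest_tag(sample))  # 25.01-py3
```

Resolving the newest tag once and then writing that explicit tag into the manifest keeps the deployment pinned while still making upgrades a deliberate step.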

Single GPU Training Job

apiVersion: batch/v1
kind: Job
metadata:
  name: pytorch-training
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.07-py3
        command:
        - python
        - /workspace/train.py
        - --epochs=10
        - --batch-size=64
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            cpu: "4"
            memory: 16Gi
        volumeMounts:
        - name: dataset
          mountPath: /data
        - name: scripts
          mountPath: /workspace
      volumes:
      - name: dataset
        persistentVolumeClaim:
          claimName: training-data
      - name: scripts
        configMap:
          name: training-scripts
      restartPolicy: Never
  backoffLimit: 2
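
When the same Job is launched with different image tags or GPU counts, generating the manifest can be less error-prone than editing YAML by hand. The sketch below rebuilds the Job above as a plain dict and prints it as JSON, which kubectl also accepts. The helper name build_training_job is hypothetical, and the tag check is a simplification that assumes the image reference has no registry port.

```python
import json


def build_training_job(name, image, gpus=1, epochs=10, batch_size=64):
    """Return a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    # Simplified check: reject untagged images and the floating "latest" tag.
    if ":" not in image or image.endswith(":latest"):
        raise ValueError("pin an explicit image tag, e.g. 24.07-py3")
    container = {
        "name": "trainer",
        "image": image,
        "command": ["python", "/workspace/train.py",
                    f"--epochs={epochs}", f"--batch-size={batch_size}"],
        "resources": {
            "limits": {"nvidia.com/gpu": gpus},
            "requests": {"cpu": "4", "memory": "16Gi"},
        },
        "volumeMounts": [
            {"name": "dataset", "mountPath": "/data"},
            {"name": "scripts", "mountPath": "/workspace"},
        ],
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 2,
            "template": {"spec": {
                "containers": [container],
                "volumes": [
                    {"name": "dataset",
                     "persistentVolumeClaim": {"claimName": "training-data"}},
                    {"name": "scripts",
                     "configMap": {"name": "training-scripts"}},
                ],
                "restartPolicy": "Never",
            }},
        },
    }


job = build_training_job("pytorch-training", "nvcr.io/nvidia/pytorch:24.07-py3")
print(json.dumps(job, indent=2))
```

Saved as, say, build_job.py, the output can be piped straight into the cluster with python build_job.py | kubectl apply -f -.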

Multi-GPU Single Node (DDP via torchrun)

apiVersion: batch/v1
kind: Job
metadata:
  name: multi-gpu-training
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.07-py3
        command:
        - torchrun
        - --nproc_per_node=4
        - /workspace/train_ddp.py
        resources:
          limits:
            nvidia.com/gpu: 4       # All 4 GPUs on one node
          requests:
            cpu: "16"
            memory: 64Gi
        env:
        - name: NCCL_DEBUG
          value: "INFO"
        volumeMounts:
        - name: dataset
          mountPath: /data
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: dataset
        persistentVolumeClaim:
          claimName: training-data
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 16Gi          # Shared memory for NCCL
      restartPolicy: Never
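
The manifest above launches /workspace/train_ddp.py but does not show it. As a rough sketch of what such a script might contain, the code below reads the RANK, LOCAL_RANK, and WORLD_SIZE variables that torchrun exports and wraps a toy model in DistributedDataParallel. The model, the synthetic data, and the function names are placeholders, not the author's training code.

```python
import os


def read_torchrun_env(environ=os.environ):
    """Read the rank variables that torchrun exports to each worker process."""
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }


def run_ddp(steps=10):
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist
    from torch import nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    env = read_torchrun_env()
    local_rank = env["local_rank"]
    torch.cuda.set_device(local_rank)          # one GPU per process
    dist.init_process_group(backend="nccl")    # rendezvous info comes from env vars

    model = nn.Linear(16, 1).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(steps):
        x = torch.randn(64, 16, device=f"cuda:{local_rank}")  # synthetic batch
        loss = model(x).pow(2).mean()          # synthetic objective
        optimizer.zero_grad()
        loss.backward()                        # gradients are all-reduced across ranks
        optimizer.step()
        if env["rank"] == 0:
            print(f"step={step} loss={loss.item():.4f}")

    dist.destroy_process_group()
```

Calling run_ddp() at the bottom of the script and starting it with torchrun --nproc_per_node=4 gives one process per GPU, which matches the nvidia.com/gpu: 4 limit in the Job.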

Multi-Node DDP Training

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-training
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
          - name: pytorch
            image: nvcr.io/nvidia/pytorch:24.07-py3
            command:
            - torchrun
            - --nnodes=4
            - --nproc_per_node=8
            - --rdzv_backend=c10d
            - --rdzv_endpoint=$(MASTER_ADDR):29500
            - /workspace/train_ddp.py
            resources:
              limits:
                nvidia.com/gpu: 8
            env:
            - name: NCCL_IB_DISABLE
              value: "0"           # Enable InfiniBand
            - name: NCCL_SOCKET_IFNAME
              value: "eth0"
            - name: NCCL_DEBUG
              value: "WARN"
            volumeMounts:
            - name: shm
              mountPath: /dev/shm
          volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 32Gi
    Worker:
      replicas: 3
      template:
        spec:
          containers:
          - name: pytorch
            image: nvcr.io/nvidia/pytorch:24.07-py3
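            # Same command, resources, env, and /dev/shm mount as Master
            command:
            - torchrun
            - --nnodes=4
            - --nproc_per_node=8
            - --rdzv_backend=c10d
            - --rdzv_endpoint=$(MASTER_ADDR):29500
            - /workspace/train_ddp.py
            resources:
              limits:
                nvidia.com/gpu: 8
            env:
            - name: NCCL_IB_DISABLE
              value: "0"
            - name: NCCL_SOCKET_IFNAME
              value: "eth0"
            - name: NCCL_DEBUG
              value: "WARN"
            volumeMounts:
            - name: shm
              mountPath: /dev/shm
          volumes:
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: 32Gi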
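
The replica counts and torchrun flags have to agree: 1 Master plus 3 Workers gives the 4 nodes passed as --nnodes, and each node starts 8 processes. A small sketch of the arithmetic, assuming every node runs the same number of processes and ranks are assigned contiguously per node (the function names are illustrative):

```python
def world_size(nnodes, nproc_per_node):
    """Total number of training processes across the job."""
    return nnodes * nproc_per_node


def global_rank(node_rank, local_rank, nproc_per_node):
    """Global rank of a process, given its node index and per-node index."""
    return node_rank * nproc_per_node + local_rank


master_replicas, worker_replicas = 1, 3
nnodes = master_replicas + worker_replicas

print(nnodes)                  # 4
print(world_size(nnodes, 8))   # 32
print(global_rank(3, 7, 8))    # 31
```

If the sum of replicas and --nnodes disagree, the rendezvous typically waits for nodes that never arrive, which shows up as a hang or timeout at startup.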

Container Contents Reference

Component   24.07-py3      25.01-py3      25.11-py3
CUDA        12.5           12.7           13.0
cuDNN       9.2            9.5            9.8
NCCL        2.22.3         2.25.1         2.28.8
PyTorch     2.4.0          2.6.0          2.8.0
Python      3.10           3.10           3.12
MOFED       5.4            5.4            5.4
GDRCopy     2.4            2.4.1          2.5.1
OS          Ubuntu 22.04   Ubuntu 22.04   Ubuntu 24.04

GPU Driver Compatibility

# Check minimum driver version for container CUDA version
# CUDA 12.5 → Driver ≥ 555.42
# CUDA 12.6 → Driver ≥ 560.28
# CUDA 12.7 → Driver ≥ 565.57
# CUDA 13.0 → Driver ≥ 580.65

# Check node driver version (assumes NVIDIA GPU Feature Discovery labels are present on the nodes)
kubectl get nodes -L nvidia.com/cuda.driver.major,nvidia.com/cuda.driver.minor,nvidia.com/cuda.driver.rev

# Inside pod
nvidia-smi --query-gpu=driver_version --format=csv,noheader
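
The driver string printed by nvidia-smi can be compared against the minimums listed above before a job is scheduled. A minimal sketch, using only the CUDA 12.5 and 12.6 rows from that list; the MIN_DRIVER table and function names are illustrative, and the sample driver strings are made-up inputs:

```python
def parse_version(text):
    """Turn a dotted version such as '555.42.06' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Subset of the minimum driver versions listed above (CUDA version -> driver).
MIN_DRIVER = {"12.5": "555.42", "12.6": "560.28"}


def driver_ok(cuda_version, driver_version, minimums=MIN_DRIVER):
    """True if the installed driver meets the minimum for the container's CUDA."""
    required = parse_version(minimums[cuda_version])
    return parse_version(driver_version) >= required


print(driver_ok("12.5", "555.42.06"))  # True
print(driver_ok("12.6", "550.90.07"))  # False
```

Comparing integer tuples rather than raw strings avoids mistakes such as "560.3" sorting after "560.28".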

Common Issues

"CUDA error: no kernel image is available for execution on the device"

GPU architecture mismatch. The container's PyTorch build targets a fixed set of GPU architectures, and this error means the node's GPU is not among them. H100 needs recent containers (24.01+); older GPUs like V100 are widely supported.
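
One way to confirm a mismatch is to compare the GPU's compute capability with the architecture list the PyTorch build reports. The sketch below uses torch.cuda.get_device_capability and torch.cuda.get_arch_list for the lookup. The matching rules in arch_supported are a simplified reading of CUDA compatibility (same-major binary compatibility plus PTX forward compatibility) and should be treated as an approximation.

```python
import re

ARCH_PATTERN = re.compile(r"(sm|compute)_(\d+)(\d)([a-z]?)")


def arch_supported(capability, arch_list):
    """Approximate check: can a build with arch_list run on this capability?"""
    major, minor = capability
    for arch in arch_list:
        match = ARCH_PATTERN.fullmatch(arch)
        if match is None:
            continue
        kind, suffix = match.group(1), match.group(4)
        built = (int(match.group(2)), int(match.group(3)))
        if suffix:
            # Arch-specific builds such as sm_90a: treated here as exact-match only.
            if kind == "sm" and built == (major, minor):
                return True
        elif kind == "sm":
            # Binary code runs on the same major version with an equal or higher minor.
            if built[0] == major and built[1] <= minor:
                return True
        elif built <= (major, minor):
            # PTX (compute_XY) can be JIT-compiled for equal or newer GPUs.
            return True
    return False


def check_current_gpu():
    """Run the check against GPU 0; requires PyTorch and a visible GPU."""
    import torch  # imported lazily so arch_supported works without PyTorch
    capability = torch.cuda.get_device_capability(0)
    return arch_supported(capability, torch.cuda.get_arch_list())


# Illustrative inputs: a capability 7.0 GPU against a build for sm_80 and sm_90.
print(arch_supported((7, 0), ["sm_80", "sm_90"]))  # False
```

Running check_current_gpu() inside the pod gives a quick yes or no before a long training job is started.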

Shared memory errors (RuntimeError: DataLoader worker... killed)

Default /dev/shm is 64MB. Mount an emptyDir with medium: Memory and sizeLimit: 16Gi+.
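
Whether the mount took effect can be checked from inside the pod. A minimal sketch using only the Python standard library; the 16 GiB threshold mirrors the sizeLimit suggested above, and the function names are illustrative:

```python
import os
import shutil

GIB = 1024 ** 3


def shm_total_bytes(path="/dev/shm"):
    """Total size in bytes of the filesystem mounted at path."""
    return shutil.disk_usage(path).total


def shm_is_large_enough(min_gib=16, path="/dev/shm"):
    """True if the shared-memory mount is at least min_gib GiB."""
    return shm_total_bytes(path) >= min_gib * GIB


if os.path.isdir("/dev/shm"):
    print(f"/dev/shm: {shm_total_bytes() / GIB:.2f} GiB, "
          f"meets 16 GiB target: {shm_is_large_enough()}")
```

A reported size of about 0.06 GiB indicates the pod is still on the 64MB default.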

NCCL timeout in multi-node

The wrong network interface is selected. Set NCCL_SOCKET_IFNAME to your pod network interface (usually eth0) and check that firewall rules allow traffic on the NCCL ports.
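
Before digging into firewall rules, it is worth confirming that the configured interface exists in the pod at all. The sketch below lists the pod's interfaces with socket.if_nameindex and checks the value of NCCL_SOCKET_IFNAME against them. It handles only plain comma-separated prefixes; the ^ (exclude) and = (exact match) forms that NCCL also accepts are deliberately left out.

```python
import os
import socket


def list_interfaces():
    """Names of the network interfaces visible in this pod."""
    return [name for _, name in socket.if_nameindex()]


def ifname_matches(value, interfaces):
    """True if any interface name starts with one of the comma-separated prefixes."""
    prefixes = [p for p in value.split(",") if p]
    return any(name.startswith(p) for name in interfaces for p in prefixes)


wanted = os.environ.get("NCCL_SOCKET_IFNAME", "eth0")
found = list_interfaces()
print(f"interfaces={found} NCCL_SOCKET_IFNAME={wanted} "
      f"match={ifname_matches(wanted, found)}")
```

If match is False, NCCL has no usable interface under that name, and the timeout is a configuration problem rather than a network one.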

Best Practices

  • Pin container version: use 24.07-py3, not latest
  • Always mount /dev/shm: PyTorch DataLoader needs large shared memory
  • Match driver version: check the CUDA-to-driver compatibility matrix
  • Set NCCL_DEBUG=INFO for initial setup, WARN for production
  • Use Kubeflow PyTorchJob for multi-node: it handles MASTER_ADDR and coordination

Key Takeaways

  • nvcr.io/nvidia/pytorch:YY.MM-py3 containers include CUDA, cuDNN, NCCL, and PyTorch pre-built
  • Pin versions (e.g., 24.07-py3) for reproducible training
  • Always mount /dev/shm as emptyDir Memory for DataLoader workers
  • Multi-node requires NCCL env vars (NCCL_SOCKET_IFNAME, NCCL_IB_DISABLE)
  • Check GPU driver version compatibility before deploying
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
