πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
ai intermediate ⏱ 15 minutes K8s 1.28+

Kubeflow Training Operator on Kubernetes

Install Kubeflow Training Operator for distributed ML training with PyTorchJob, TFJob, and MPIJob on GPU-enabled Kubernetes clusters.

By Luca Berton β€’ β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Install the standalone Kubeflow Training Operator via kubectl apply manifest or Helm, then create PyTorchJob/TFJob CRs to run distributed training across GPU nodes with automatic worker coordination.

The Problem

Distributed ML training requires coordinating multiple workers across nodes β€” setting up rendezvous endpoints, managing rank assignments, handling worker failures, and cleaning up resources. Manually managing this with bare Deployments or StatefulSets is error-prone and doesn’t handle gang scheduling or elastic scaling.

The Solution

The Kubeflow Training Operator provides CRDs (PyTorchJob, TFJob, MPIJob, XGBoostJob, PaddleJob) that abstract distributed training orchestration. It handles worker startup ordering, environment variable injection, failure recovery, and cleanup.

Note: Kubeflow Training Operator is a standalone component. It was removed from OpenShift AI (RHOAI) in 2025 β€” on OpenShift, install it directly from upstream.

Install Training Operator (Standalone)

# Install latest stable release
kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1"

# Or via Helm
helm repo add kubeflow https://kubeflow.github.io/training-operator
helm install training-operator kubeflow/training-operator \
  --namespace kubeflow \
  --create-namespace

# Verify installation
kubectl get pods -n kubeflow
kubectl get crds | grep kubeflow
# pytorchjobs.kubeflow.org
# tfjobs.kubeflow.org
# mpijobs.kubeflow.org
# xgboostjobs.kubeflow.org
# paddlejobs.kubeflow.org

Install on OpenShift

# Kubeflow Training Operator was removed from OpenShift AI (RHOAI)
# Install standalone operator directly

# Option 1: Manifest-based
oc apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1"

# Option 2: Create namespace with proper SCC
oc new-project kubeflow
oc adm policy add-scc-to-user anyuid -z training-operator -n kubeflow

kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1"

# Verify CRDs
oc get crds | grep kubeflow

Basic PyTorchJob

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-mnist
  namespace: ai-workloads
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflowkatib/pytorch-mnist-gpu:latest
              command:
                - python
                - /opt/pytorch-mnist/mnist.py
                - --epochs=5
                - --batch-size=64
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflowkatib/pytorch-mnist-gpu:latest
              command:
                - python
                - /opt/pytorch-mnist/mnist.py
                - --epochs=5
                - --batch-size=64
              resources:
                limits:
                  nvidia.com/gpu: 1

Monitor Training Jobs

# List training jobs
kubectl get pytorchjobs -n ai-workloads
kubectl get tfjobs -n ai-workloads
kubectl get mpijobs -n ai-workloads

# Get job status
kubectl describe pytorchjob pytorch-mnist -n ai-workloads

# View master logs
kubectl logs pytorch-mnist-master-0 -n ai-workloads

# View worker logs
kubectl logs pytorch-mnist-worker-0 -n ai-workloads

# Watch job progress
kubectl get pytorchjobs -n ai-workloads -w

# Clean up completed jobs
kubectl delete pytorchjob pytorch-mnist -n ai-workloads

Environment Variables (Auto-Injected)

# The Training Operator automatically sets these env vars:
# MASTER_ADDR=pytorch-mnist-master-0
# MASTER_PORT=23456
# WORLD_SIZE=4 (1 master + 3 workers)
# RANK=0|1|2|3
# PYTHONUNBUFFERED=1

# Your training script uses them via torch.distributed:
# torch.distributed.init_process_group(backend="nccl")
graph TD
    A[Training Operator] -->|Watches| B[PyTorchJob CR]
    B --> C[Master Pod rank 0]
    B --> D[Worker Pod rank 1]
    B --> E[Worker Pod rank 2]
    B --> F[Worker Pod rank 3]
    
    C -->|NCCL| D
    C -->|NCCL| E
    C -->|NCCL| F
    
    G[MASTER_ADDR and WORLD_SIZE] -->|Auto-injected| C
    G -->|Auto-injected| D
    G -->|Auto-injected| E
    G -->|Auto-injected| F
    
    H[GPU Scheduler] -->|nvidia.com/gpu| C
    H -->|nvidia.com/gpu| D

Common Issues

  • Workers stuck Pending β€” insufficient GPU resources; check kubectl describe node | grep gpu and reduce worker count
  • NCCL timeout β€” workers can’t communicate; verify NetworkPolicy allows pod-to-pod traffic; check NCCL_SOCKET_IFNAME
  • Master pod CrashLoopBackOff β€” training script error; check kubectl logs for Python traceback
  • OpenShift SCC issues β€” Training Operator may need anyuid SCC; add to service account
  • Job stuck in Running after completion β€” some training scripts don’t exit cleanly; add proper sys.exit(0) after training

Best Practices

  • Use restartPolicy: OnFailure for automatic retry on transient failures
  • Set appropriate GPU resource limits β€” one GPU per worker is typical
  • Use NCCL backend for multi-GPU communication (backend="nccl")
  • Store checkpoints to shared storage (PVC) for crash recovery
  • Use ttlSecondsAfterFinished or manual cleanup to avoid accumulating completed jobs
  • Pin Training Operator version to avoid unexpected behavior changes

Key Takeaways

  • Kubeflow Training Operator is standalone β€” install independently (removed from OpenShift AI)
  • CRDs: PyTorchJob, TFJob, MPIJob, XGBoostJob, PaddleJob
  • Auto-injects MASTER_ADDR, WORLD_SIZE, RANK β€” training scripts use standard distributed APIs
  • Master-Worker topology with automatic pod coordination and failure handling
  • Works with any GPU-enabled Kubernetes cluster including OpenShift
#kubeflow #training-operator #distributed-training #pytorch #gpu
Luca Berton
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.

Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens