Kubeflow Training Operator on Kubernetes
Install Kubeflow Training Operator for distributed ML training with PyTorchJob, TFJob, and MPIJob on GPU-enabled Kubernetes clusters.
💡 Quick Answer: Install the standalone Kubeflow Training Operator via kubectl apply (kustomize manifests) or Helm, then create PyTorchJob/TFJob CRs to run distributed training across GPU nodes with automatic worker coordination.
The Problem
Distributed ML training requires coordinating multiple workers across nodes: setting up rendezvous endpoints, managing rank assignments, handling worker failures, and cleaning up resources. Manually managing this with bare Deployments or StatefulSets is error-prone and doesn't handle gang scheduling or elastic scaling.
The Solution
The Kubeflow Training Operator provides CRDs (PyTorchJob, TFJob, MPIJob, XGBoostJob, PaddleJob) that abstract distributed training orchestration. It handles worker startup ordering, environment variable injection, failure recovery, and cleanup.
Note: Kubeflow Training Operator is a standalone component. It was removed from OpenShift AI (RHOAI) in 2025; on OpenShift, install it directly from upstream.
Install Training Operator (Standalone)
# Install latest stable release
kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1"
# Or via Helm
helm repo add kubeflow https://kubeflow.github.io/training-operator
helm install training-operator kubeflow/training-operator \
--namespace kubeflow \
--create-namespace
# Verify installation
kubectl get pods -n kubeflow
kubectl get crds | grep kubeflow
# pytorchjobs.kubeflow.org
# tfjobs.kubeflow.org
# mpijobs.kubeflow.org
# xgboostjobs.kubeflow.org
# paddlejobs.kubeflow.org
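Optionally, wait for the controller to become ready before submitting jobs; a sketch assuming the Deployment name training-operator used by the standalone manifests (verify with kubectl get deploy -n kubeflow):

# Wait for the controller Deployment to report Available
kubectl wait deployment/training-operator -n kubeflow \
  --for=condition=Available --timeout=180s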
Install on OpenShift
# Kubeflow Training Operator was removed from OpenShift AI (RHOAI)
# Install standalone operator directly
# Option 1: Manifest-based
oc apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1"
# Option 2: Create namespace with proper SCC
oc new-project kubeflow
oc adm policy add-scc-to-user anyuid -z training-operator -n kubeflow
kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1"
# Verify CRDs
oc get crds | grep kubeflow

Basic PyTorchJob
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-mnist
  namespace: ai-workloads
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflowkatib/pytorch-mnist-gpu:latest
              command:
                - python
                - /opt/pytorch-mnist/mnist.py
                - --epochs=5
                - --batch-size=64
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflowkatib/pytorch-mnist-gpu:latest
              command:
                - python
                - /opt/pytorch-mnist/mnist.py
                - --epochs=5
                - --batch-size=64
              resources:
                limits:
                  nvidia.com/gpu: 1
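Assuming the manifest above is saved as pytorch-mnist.yaml (file name chosen here for illustration), create the namespace if needed and submit the job; the label selector below assumes the training.kubeflow.org/job-name label that current operator versions apply to job pods:

# Create the target namespace if it does not already exist
kubectl create namespace ai-workloads --dry-run=client -o yaml | kubectl apply -f -
# Submit the PyTorchJob and watch its pods come up (1 master + 3 workers)
kubectl apply -f pytorch-mnist.yaml
kubectl get pods -n ai-workloads -l training.kubeflow.org/job-name=pytorch-mnist -w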
Monitor Training Jobs
# List training jobs
kubectl get pytorchjobs -n ai-workloads
kubectl get tfjobs -n ai-workloads
kubectl get mpijobs -n ai-workloads
# Get job status
kubectl describe pytorchjob pytorch-mnist -n ai-workloads
# View master logs
kubectl logs pytorch-mnist-master-0 -n ai-workloads
# View worker logs
kubectl logs pytorch-mnist-worker-0 -n ai-workloads
# Watch job progress
kubectl get pytorchjobs -n ai-workloads -w
# Clean up completed jobs
kubectl delete pytorchjob pytorch-mnist -n ai-workloads
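To script around completion instead of watching interactively, the job's status.conditions can be queried directly; a sketch assuming the condition types (Created, Running, Succeeded, Failed) reported by the v1 training operator:

# Print the most recent condition type
kubectl get pytorchjob pytorch-mnist -n ai-workloads \
  -o jsonpath='{.status.conditions[-1:].type}'
# Or block until the job reports Succeeded
kubectl wait pytorchjob/pytorch-mnist -n ai-workloads \
  --for=condition=Succeeded --timeout=30m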
Environment Variables (Auto-Injected)
# The Training Operator automatically sets these env vars:
# MASTER_ADDR=pytorch-mnist-master-0
# MASTER_PORT=23456
# WORLD_SIZE=4 (1 master + 3 workers)
# RANK=0|1|2|3
# PYTHONUNBUFFERED=1
# Your training script uses them via torch.distributed:
# torch.distributed.init_process_group(backend="nccl")
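A minimal sketch of the script side (a toy stand-in, not the actual mnist.py baked into the image above): with the default env:// init method, torch.distributed reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE from the injected environment, so no addresses need to be hard-coded.

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The default env:// init method reads MASTER_ADDR, MASTER_PORT,
    # RANK, and WORLD_SIZE injected by the Training Operator.
    dist.init_process_group(backend="nccl")

    device = torch.device("cuda:0")         # one GPU per pod in this example, so local index 0
    model = nn.Linear(784, 10).to(device)   # toy model standing in for the real network
    model = DDP(model, device_ids=[0])      # gradients are synchronized over NCCL

    print(f"rank {dist.get_rank()}/{dist.get_world_size()} "
          f"connected to {os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}")

    # ... training loop with a DistributedSampler-backed DataLoader goes here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()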
graph TD
A[Training Operator] -->|Watches| B[PyTorchJob CR]
B --> C[Master Pod rank 0]
B --> D[Worker Pod rank 1]
B --> E[Worker Pod rank 2]
B --> F[Worker Pod rank 3]
C -->|NCCL| D
C -->|NCCL| E
C -->|NCCL| F
G[MASTER_ADDR and WORLD_SIZE] -->|Auto-injected| C
G -->|Auto-injected| D
G -->|Auto-injected| E
G -->|Auto-injected| F
H[GPU Scheduler] -->|nvidia.com/gpu| C
H -->|nvidia.com/gpu| D
Common Issues
- Workers stuck Pending: insufficient GPU resources; check kubectl describe node | grep gpu and reduce the worker count
- NCCL timeout: workers can't communicate; verify NetworkPolicy allows pod-to-pod traffic (see the sketch after this list); check NCCL_SOCKET_IFNAME
- Master pod CrashLoopBackOff: training script error; check kubectl logs for the Python traceback
- OpenShift SCC issues: the Training Operator may need the anyuid SCC; add it to the service account
- Job stuck in Running after completion: some training scripts don't exit cleanly; add a proper sys.exit(0) after training
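For the NCCL timeout case, a minimal NetworkPolicy sketch that allows traffic between the pods of one job; the training.kubeflow.org/job-name label is what recent operator versions set on job pods, so verify it with kubectl get pods --show-labels before relying on it:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-pytorch-mnist-peers
  namespace: ai-workloads
spec:
  # Applies to the pods of this job only (label name assumed; check with --show-labels)
  podSelector:
    matchLabels:
      training.kubeflow.org/job-name: pytorch-mnist
  ingress:
    - from:
        - podSelector:
            matchLabels:
              training.kubeflow.org/job-name: pytorch-mnist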
Best Practices
- Use restartPolicy: OnFailure for automatic retry on transient failures
- Set appropriate GPU resource limits; one GPU per worker is typical
- Use the NCCL backend for multi-GPU communication (backend="nccl")
- Store checkpoints on shared storage (PVC) for crash recovery
- Use ttlSecondsAfterFinished or manual cleanup to avoid accumulating completed jobs (see the sketch after this list)
- Pin the Training Operator version to avoid unexpected behavior changes
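A minimal sketch of automatic cleanup via the job's runPolicy; 3600 is an arbitrary example value, and cleanPodPolicy is shown explicitly for illustration:

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-mnist
  namespace: ai-workloads
spec:
  runPolicy:
    # Remove the finished job's resources one hour after completion (example value)
    ttlSecondsAfterFinished: 3600
    # Optionally also clean up pods that are still running when the job finishes
    cleanPodPolicy: Running
  pytorchReplicaSpecs:
    # ... Master and Worker specs as in the basic example above ...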
Key Takeaways
- Kubeflow Training Operator is standalone: install it independently (removed from OpenShift AI)
- CRDs: PyTorchJob, TFJob, MPIJob, XGBoostJob, PaddleJob
- Auto-injects MASTER_ADDR, WORLD_SIZE, and RANK, so training scripts use standard distributed APIs
- Master-Worker topology with automatic pod coordination and failure handling
- Works with any GPU-enabled Kubernetes cluster, including OpenShift

