πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
ai advanced ⏱ 20 minutes K8s 1.28+

Distributed Training TensorFlow PyTorch

Run distributed training jobs on Kubernetes with TensorFlow and PyTorch. Training Operator, multi-worker strategies, NCCL configuration.

By Luca Berton β€’ β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Use Kubeflow Training Operator to run distributed training as TFJob (TensorFlow) or PyTorchJob (PyTorch). Configure NCCL for GPU-to-GPU communication, set NCCL_IB_DISABLE=0 for RDMA-enabled clusters, and use elastic training for fault tolerance.

The Problem

Training large models takes days or weeks on a single GPU. Distributed training across multiple GPUs and nodes reduces training time linearly β€” but requires proper NCCL configuration, data parallelism strategy, and fault handling. Kubernetes Training Operator manages the multi-worker lifecycle.

The Solution

PyTorch Distributed Training

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-finetune
spec:
  elasticPolicy:
    minReplicas: 2
    maxReplicas: 8
    rdzvBackend: c10d
  pytorchReplicaSpecs:
    Worker:
      replicas: 4
      template:
        spec:
          containers:
            - name: pytorch
              image: registry.example.com/training:1.0
              env:
                - name: NCCL_DEBUG
                  value: "INFO"
                - name: NCCL_IB_DISABLE
                  value: "0"
                - name: NCCL_NET_GDR_LEVEL
                  value: "SYS"
              command:
                - torchrun
                - --nproc_per_node=8
                - --nnodes=$(PET_NNODES)
                - --rdzv_backend=c10d
                - --rdzv_endpoint=$(PET_RDZV_ENDPOINT)
                - train.py
                - --model=bert-large
                - --batch-size-per-gpu=32
              resources:
                limits:
                  nvidia.com/gpu: 8
                  rdma/rdma_shared_device_a: 1
                  memory: 256Gi

TensorFlow Distributed Training

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: resnet-train
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: registry.example.com/tf-train:1.0
              command: ["python", "train.py"]
              env:
                - name: TF_FORCE_GPU_ALLOW_GROWTH
                  value: "true"
              resources:
                limits:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 3
      template:
        spec:
          containers:
            - name: tensorflow
              image: registry.example.com/tf-train:1.0
              command: ["python", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8

NCCL Environment Variables

VariableValuePurpose
NCCL_DEBUGINFOVerify transport (IB vs Socket)
NCCL_IB_DISABLE0Enable InfiniBand/RoCE
NCCL_NET_GDR_LEVELSYSGPU Direct RDMA level
NCCL_IB_QPS_PER_CONNECTION4Queue pairs per connection
NCCL_SOCKET_IFNAMEeth0TCP fallback interface
graph TD
    MASTER[Master/Chief<br/>Coordinates training] --> W1[Worker 1<br/>8Γ— GPU]
    MASTER --> W2[Worker 2<br/>8Γ— GPU]
    MASTER --> W3[Worker 3<br/>8Γ— GPU]
    
    W1 <-->|NCCL AllReduce<br/>RDMA/InfiniBand| W2
    W2 <-->|NCCL AllReduce| W3
    W3 <-->|NCCL AllReduce| W1
    
    subgraph Data Parallel
        DATA[Training Data] -->|Shard 1| W1
        DATA -->|Shard 2| W2
        DATA -->|Shard 3| W3
    end

Common Issues

Training hangs at NCCL initialization

Workers can’t reach each other. Check: NCCL is using IB (NET/IB in logs), not falling back to TCP. Verify RDMA device availability with ibstat.

Gradient NaN after multi-node training

Learning rate needs scaling with world size. Use linear scaling: lr = base_lr * world_size. Enable gradient clipping: torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0).

Best Practices

  • Elastic training for fault tolerance β€” jobs continue if a worker fails
  • NCCL_DEBUG=INFO during initial setup β€” verify RDMA transport, then disable for performance
  • Data parallel for most workloads β€” simpler than model parallel
  • Checkpoint every N steps β€” recover from failures without restarting
  • Linear learning rate scaling with world size β€” prevents divergence

Key Takeaways

  • Training Operator manages multi-worker lifecycle on Kubernetes
  • PyTorchJob with elastic policy enables auto-scaling and fault tolerance
  • NCCL handles GPU-to-GPU communication β€” RDMA/IB is 10-100x faster than TCP
  • Data parallelism: each worker trains on a shard, gradients are synchronized via AllReduce
  • Checkpointing is mandatory β€” multi-node jobs are prone to failures
#distributed-training #tensorflow #pytorch #training-operator #nccl
Luca Berton
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.

Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens