ai advanced ⏱ 20 minutes K8s 1.28+

Katib Hyperparameter Tuning on Kubernetes

Automate hyperparameter tuning with Katib on Kubernetes using Bayesian optimization, random search, grid search, and early stopping.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Create a Katib Experiment defining the search space (learning rate, batch size, layers), objective metric, and search algorithm. Katib runs parallel trials as Kubernetes Jobs, tracks metrics, and identifies optimal hyperparameters automatically.

The Problem

Hyperparameter tuning is tedious and GPU-expensive. Data scientists manually try combinations of learning rate, batch size, model depth, and regularization, often running hundreds of training jobs. Katib automates this search and offers algorithms such as Bayesian optimization that typically need fewer trials than random or grid search.

The Solution

Katib Experiment

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: resnet-tuning
  namespace: kubeflow
spec:
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy
    additionalMetricNames:
      - loss
  algorithm:
    algorithmName: bayesianoptimization
  parallelTrialCount: 4
  maxTrialCount: 30
  maxFailedTrialCount: 3
  parameters:
    - name: learning-rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.01"
    - name: batch-size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "128"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list: ["adam", "sgd", "adamw"]
    - name: dropout
      parameterType: double
      feasibleSpace:
        min: "0.1"
        max: "0.5"
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: learningRate
        reference: learning-rate
      - name: batchSize
        reference: batch-size
      - name: optimizer
        reference: optimizer
      - name: dropout
        reference: dropout
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training
                image: registry.example.com/resnet-train:1.0
                command:
                  - python
                  - train.py
                  - --lr=${trialParameters.learningRate}
                  - --batch-size=${trialParameters.batchSize}
                  - --optimizer=${trialParameters.optimizer}
                  - --dropout=${trialParameters.dropout}
                resources:
                  limits:
                    nvidia.com/gpu: 1
            restartPolicy: Never
  earlyStopping:
    algorithmName: medianstop
    algorithmSettings:
      - name: min_trials_required
        value: "5"
      - name: start_step
        value: "3"
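
To make the link between parameters, trialParameters, and the ${trialParameters.*} placeholders concrete, here is a minimal Python sketch of the substitution. It is an illustration only, not Katib's actual code: render_command is a hypothetical helper, and the assignment values are made up.

```python
import re

# Matches placeholders such as ${trialParameters.learningRate}.
PLACEHOLDER = re.compile(r"\$\{trialParameters\.([A-Za-z0-9_-]+)\}")


def render_command(command, trial_parameters, assignments):
    """Fill placeholders in a container command (hypothetical helper).

    command:          list of argument strings from the trialSpec
    trial_parameters: list of {"name": ..., "reference": ...} entries
    assignments:      search-space parameter name -> suggested value
    """
    # trialParameters name -> name of the search-space parameter it references
    name_to_reference = {p["name"]: p["reference"] for p in trial_parameters}

    def substitute(match):
        reference = name_to_reference[match.group(1)]
        return str(assignments[reference])

    return [PLACEHOLDER.sub(substitute, arg) for arg in command]


trial_parameters = [
    {"name": "learningRate", "reference": "learning-rate"},
    {"name": "batchSize", "reference": "batch-size"},
    {"name": "optimizer", "reference": "optimizer"},
    {"name": "dropout", "reference": "dropout"},
]
command = [
    "python",
    "train.py",
    "--lr=${trialParameters.learningRate}",
    "--batch-size=${trialParameters.batchSize}",
    "--optimizer=${trialParameters.optimizer}",
    "--dropout=${trialParameters.dropout}",
]
# Made-up values standing in for one suggestion from the search algorithm.
assignments = {
    "learning-rate": "0.003",
    "batch-size": "16",
    "optimizer": "adam",
    "dropout": "0.2",
}

rendered = render_command(command, trial_parameters, assignments)
print(" ".join(rendered))
```

The point of the indirection is that the name used inside the template (learningRate) can differ from the search-space name (learning-rate), as long as reference ties the two together.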

Search Algorithms

Algorithm            | Best For                  | Trials Needed
random               | Broad exploration         | Many (50+)
grid                 | Small discrete space      | Exhaustive
bayesianoptimization | Continuous parameters     | Few (15-30)
tpe                  | Mixed parameter types     | Moderate (20-40)
cmaes                | Continuous optimization   | Few (15-30)
hyperband            | Early stopping + resource | Moderate
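
To see why grid search is "exhaustive" while random search is bounded only by the trial budget, the sketch below samples the same search space both ways in plain Python. It does not call Katib. The function names and the three grid points per parameter are illustrative assumptions.

```python
import itertools
import random


def random_candidates(count, seed=0):
    """Draw `count` independent samples from the experiment's search space."""
    rng = random.Random(seed)
    return [
        {
            "learning-rate": rng.uniform(0.0001, 0.01),
            "batch-size": rng.randint(16, 128),  # inclusive on both ends
            "optimizer": rng.choice(["adam", "sgd", "adamw"]),
            "dropout": rng.uniform(0.1, 0.5),
        }
        for _ in range(count)
    ]


def grid_candidates():
    """Enumerate every combination of a discretized search space.

    Continuous ranges must be discretized for a grid; the three points
    per parameter here are an arbitrary illustrative choice.
    """
    grid = {
        "learning-rate": [0.0001, 0.001, 0.01],
        "batch-size": [16, 64, 128],
        "optimizer": ["adam", "sgd", "adamw"],
        "dropout": [0.1, 0.3, 0.5],
    }
    names = list(grid)
    return [dict(zip(names, values)) for values in itertools.product(*grid.values())]


random_trials = random_candidates(30)
grid_trials = grid_candidates()
print(len(random_trials), "random candidates")
print(len(grid_trials), "grid candidates")
```

Even this coarse grid produces 3 x 3 x 3 x 3 = 81 combinations, which already exceeds the maxTrialCount of 30 in the experiment above. That is why grid search suits small discrete spaces.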

Monitor Progress

# Watch experiment status
kubectl get experiment resnet-tuning -n kubeflow -w

# Get best trial
kubectl get experiment resnet-tuning -n kubeflow \
  -o jsonpath='{.status.currentOptimalTrial}'

# List all trials
kubectl get trial -n kubeflow -l katib.kubeflow.org/experiment=resnet-tuning

graph TD
    EXP[Experiment<br/>Define search space] --> ALG[Algorithm<br/>Bayesian / TPE / Random]
    ALG -->|Suggest params| T1[Trial 1<br/>lr=0.001, bs=32]
    ALG -->|Suggest params| T2[Trial 2<br/>lr=0.005, bs=64]
    ALG -->|Suggest params| T3[Trial 3<br/>lr=0.0001, bs=128]
    ALG -->|Suggest params| T4[Trial 4<br/>lr=0.003, bs=16]
    
    T1 -->|accuracy=0.89| COLLECT[Metrics Collector]
    T2 -->|accuracy=0.92| COLLECT
    T3 -->|accuracy=0.85| EARLY[Early Stopping<br/>Below median]
    T4 -->|accuracy=0.94| COLLECT
    
    COLLECT -->|Feed back| ALG
    ALG -->|Best trial| BEST[✅ lr=0.003, bs=16<br/>accuracy=0.94]
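
The jsonpath query for .status.currentOptimalTrial returns a JSON object. The Python sketch below flattens it into plain dictionaries. It assumes the object has bestTrialName, parameterAssignments, and observation.metrics fields; summarize_optimal_trial is a hypothetical helper, and the sample payload is made up to mirror the diagram.

```python
import json


def summarize_optimal_trial(raw_json):
    """Return (trial name, parameters, latest metrics) from the status JSON."""
    status = json.loads(raw_json)
    parameters = {
        item["name"]: item["value"]
        for item in status.get("parameterAssignments", [])
    }
    metrics = {}
    for metric in status.get("observation", {}).get("metrics", []):
        try:
            metrics[metric["name"]] = float(metric["latest"])
        except (KeyError, TypeError, ValueError):
            # Keep the metric visible even if no numeric value was reported.
            metrics[metric.get("name", "unknown")] = None
    return status.get("bestTrialName"), parameters, metrics


# Made-up sample payload; a real one comes from the kubectl command.
sample = """
{
  "bestTrialName": "resnet-tuning-example",
  "parameterAssignments": [
    {"name": "learning-rate", "value": "0.003"},
    {"name": "batch-size", "value": "16"}
  ],
  "observation": {
    "metrics": [
      {"name": "accuracy", "latest": "0.94", "max": "0.94", "min": "0.61"}
    ]
  }
}
"""

best_name, best_params, best_metrics = summarize_optimal_trial(sample)
print(best_name, best_params, best_metrics)
```

In practice the input would be the captured output of the kubectl command rather than a hard-coded string.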

Common Issues

Trials stuck in Pending: no GPU available

Reduce parallelTrialCount or add more GPU nodes. Each trial runs as a separate Job that requests its own GPU, so with parallelTrialCount: 4 and one GPU per trial the cluster needs four free GPUs at once.

Metrics not collected from trials

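By default, Katib parses metrics from the trial pod's logs. Make sure the training script prints each metric on its own line in name=value form, for example accuracy=0.94, with names that match objectiveMetricName and additionalMetricNames. Otherwise, configure a custom metrics collector.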
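
As an illustration of the expected log format, here is a minimal sketch of what a train.py could look like. It accepts the flags used in the experiment above and prints accuracy and loss as name=value lines. The --epochs flag, the helper names, and the fake training numbers are assumptions for this sketch; real training code would replace train_one_epoch.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the trial template passes to the container."""
    parser = argparse.ArgumentParser(description="Toy training entry point")
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--batch-size", type=int, default=32)
    parser.add_argument("--optimizer", choices=["adam", "sgd", "adamw"], default="adam")
    parser.add_argument("--dropout", type=float, default=0.1)
    parser.add_argument("--epochs", type=int, default=3)  # hypothetical extra flag
    # Ignore flags this sketch does not know about instead of exiting.
    args, _unknown = parser.parse_known_args(argv)
    return args


def format_metrics(accuracy, loss):
    """Build one name=value line per metric."""
    return [f"accuracy={accuracy:.4f}", f"loss={loss:.4f}"]


def train_one_epoch(epoch, args):
    """Placeholder for real training; returns fake numbers."""
    accuracy = min(0.99, 0.5 + 0.1 * (epoch + 1))
    loss = 1.0 / (epoch + 1)
    return accuracy, loss


def main(argv=None):
    args = parse_args(argv)
    for epoch in range(args.epochs):
        accuracy, loss = train_one_epoch(epoch, args)
        for line in format_metrics(accuracy, loss):
            # flush so the lines reach the pod log promptly
            print(line, flush=True)


if __name__ == "__main__":
    main()
```

Printing the metrics after every epoch, rather than only at the end, also gives early stopping intermediate values to work with.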

Best Practices

  • Use Bayesian optimization for most use cases: it usually converges in fewer trials than random or grid search
  • Start with a wide search space, then narrow it after initial exploration
  • Enable early stopping: it kills unpromising trials early and saves GPU hours
  • Set parallelTrialCount (4 in this example) to balance speed against GPU availability
  • Log metrics to stdout: Katib's default collector parses pod logs
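
To show what "kills unpromising trials" means, here is a simplified Python sketch of a median stopping rule for a maximize objective. It follows the general idea: stop a trial whose best value so far is below the median of the completed trials' running averages at the same step. It is not Katib's exact implementation. The should_stop name is an assumption, and the two thresholds reuse the min_trials_required and start_step values from the experiment above.

```python
from statistics import median


def should_stop(current_history, completed_histories,
                min_trials_required=5, start_step=3):
    """Decide whether a running trial should be stopped early.

    current_history:     objective values reported so far by the running trial
    completed_histories: one list of reported values per completed trial
    """
    step = len(current_history)
    # Not enough evidence yet: too few reports or too few finished trials.
    if step < start_step or len(completed_histories) < min_trials_required:
        return False

    # Running average of each completed trial over its first `step` reports.
    running_averages = []
    for history in completed_histories:
        window = history[:step]
        if window:
            running_averages.append(sum(window) / len(window))
    if not running_averages:
        return False

    # Stop if even the trial's best value is below the median average.
    return max(current_history) < median(running_averages)


# Made-up accuracy curves for five completed trials.
completed = [[0.6, 0.7, 0.8] for _ in range(5)]
print(should_stop([0.50, 0.55, 0.60], completed))  # weak trial
print(should_stop([0.50, 0.60, 0.75], completed))  # competitive trial
```

With these made-up curves, the first trial falls below the median and would be stopped, while the second would keep running.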

Key Takeaways

  • Katib automates hyperparameter tuning with a choice of search algorithms
  • Bayesian optimization typically needs 15-30 trials, versus 50 or more for random search
  • Each trial runs as a Kubernetes Job, so trials execute in parallel on GPU nodes
  • Early stopping (medianstop) kills underperforming trials and can save roughly 30-50% of GPU time, depending on the workload
  • Framework-agnostic: works with PyTorch, TensorFlow, or any training script
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
