ai | advanced | 15 minutes | K8s 1.28+

Kubeflow Operator: Full ML Platform

Deploy the complete Kubeflow platform on Kubernetes with the Kubeflow Operator. Covers Pipelines, Notebooks, KServe, Katib, and multi-tenant ML workflows.

By Luca Berton • 7 min read

The Problem

Building an end-to-end ML platform requires stitching together dozens of tools: notebooks, experiment tracking, distributed training, hyperparameter tuning, model serving, and pipeline orchestration. Installing each component separately leads to version conflicts, broken integrations, and maintenance nightmares. The Kubeflow Operator deploys and manages the entire stack as a single, cohesive platform.

The Solution

Kubeflow Components Overview

| Component | Purpose | CRD |
| --- | --- | --- |
| Pipelines | ML workflow orchestration | PipelineRun |
| Notebooks | Jupyter notebook servers | Notebook |
| Training Operator | Distributed training | PyTorchJob, TFJob, MPIJob |
| Katib | Hyperparameter tuning | Experiment |
| KServe | Model serving & inference | InferenceService |
| Volumes | Data/model management | PVCs |
| Profiles | Multi-tenant namespaces | Profile |

Prerequisites

# Kubernetes 1.28+ with:
# - Default StorageClass (dynamic provisioning)
# - GPU support (NVIDIA GPU Operator) for training/inference
# - At least 16GB RAM, 8 CPUs for control plane components

# Required: cert-manager and Istio
# Install cert-manager first; the example manifests applied afterwards bundle Istio
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml

# Wait for cert-manager
kubectl wait --for=condition=Available deployment -n cert-manager --all --timeout=300s
# Clone Kubeflow manifests (check out a release tag rather than the default branch for reproducible installs)
git clone https://github.com/kubeflow/manifests.git
cd manifests

# Install everything (takes 10-15 minutes)
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."
  sleep 10
done

# Verify all pods are running
kubectl get pods -n kubeflow --watch

# Access the dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
# Open http://localhost:8080
# Default credentials: user@example.com / 12341234
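
Watching the pod list by eye gets tedious during a 10-15 minute install. The following is a minimal sketch, assuming you capture the output of `kubectl get pods -n kubeflow -o json` and want the names of pods that are not yet ready. The helper name `not_ready_pods` and the sample data are illustrative and not part of any Kubeflow tooling.

```python
def not_ready_pods(pod_list: dict) -> list:
    """Return names of pods that are neither Ready nor Succeeded."""
    pending = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # completed Job pods never become Ready; skip them
        ready = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in status.get("conditions", [])
        )
        if not ready:
            pending.append(pod["metadata"]["name"])
    return pending


# Hypothetical sample in the shape of `kubectl get pods -n kubeflow -o json`.
SAMPLE = {
    "items": [
        {"metadata": {"name": "ml-pipeline"},
         "status": {"phase": "Running",
                    "conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "katib-controller"},
         "status": {"phase": "Pending", "conditions": []}},
        {"metadata": {"name": "setup-job"},
         "status": {"phase": "Succeeded",
                    "conditions": [{"type": "Ready", "status": "False"}]}},
    ]
}

print(not_ready_pods(SAMPLE))  # ['katib-controller']
```

Against a live cluster, the same function could read the JSON from standard input with `json.load(sys.stdin)` and exit non-zero while the returned list is non-empty.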

Install with Kubeflow Operator (Declarative)

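# The Kubeflow Operator manages the platform lifecycle through a KfDef resource.
# Each path must exist in the manifests archive referenced under repos
# (directory names such as the Istio version change between releases).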
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: kubeflow
  namespace: kubeflow
spec:
  applications:
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: common/cert-manager/cert-manager/base
      name: cert-manager
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: common/istio-1-17/istio-install/base
      name: istio
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: apps/pipeline/upstream/env/cert-manager/platform-agnostic-multi-user
      name: pipelines
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: apps/jupyter/notebook-controller/upstream/overlays/kubeflow
      name: notebooks
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: apps/training-operator/upstream/overlays/kubeflow
      name: training-operator
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: apps/katib/upstream/installs/katib-with-kubeflow
      name: katib
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: contrib/kserve/kserve
      name: kserve
  repos:
    - name: manifests
      uri: https://github.com/kubeflow/manifests/archive/v1.9.tar.gz

Multi-Tenant Setup with Profiles

# Create a team workspace (namespace + RBAC + quotas)
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: ml-team-alpha
spec:
  owner:
    kind: User
    name: alice@example.com
  resourceQuotaSpec:
    hard:
      requests.cpu: "32"
      requests.memory: 128Gi
      requests.nvidia.com/gpu: "4"
      limits.cpu: "64"
      limits.memory: 256Gi
      persistentvolumeclaims: "20"
# Kubeflow creates namespace "ml-team-alpha" with:
# - Istio sidecar injection
# - RBAC for the owner
# - ResourceQuota
# - Network isolation

kubectl get profile
kubectl get ns ml-team-alpha
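
When several teams need workspaces, writing one Profile by hand per team becomes repetitive. Below is a minimal sketch that generates Profile manifests with the same fields as the example above and wraps them in a `v1` `List`, which `kubectl apply -f` accepts as JSON. The function name `build_profile`, the second team, and its owner are hypothetical.

```python
import json


def build_profile(team: str, owner: str, cpu: int, memory_gi: int, gpus: int) -> dict:
    """Build a Kubeflow Profile manifest with a request-based ResourceQuota."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        "metadata": {"name": team},
        "spec": {
            "owner": {"kind": "User", "name": owner},
            "resourceQuotaSpec": {
                "hard": {
                    "requests.cpu": str(cpu),
                    "requests.memory": f"{memory_gi}Gi",
                    "requests.nvidia.com/gpu": str(gpus),
                }
            },
        },
    }


# Hypothetical team list; replace with your own names and quotas.
TEAMS = [
    ("ml-team-alpha", "alice@example.com", 32, 128, 4),
    ("ml-team-beta", "bob@example.com", 16, 64, 2),
]

profile_list = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [build_profile(*team) for team in TEAMS],
}

# Redirect this output to a file and apply it with kubectl.
print(json.dumps(profile_list, indent=2))
```

Quota values are emitted as strings because Kubernetes quantities such as `"32"` and `128Gi` are parsed from their string form.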

Kubeflow Pipelines

# Define a pipeline with the KFP SDK (v2)
from kfp import dsl, compiler
from kfp.dsl import Dataset, Input, Model, Output

# pandas is not in the base image; gcsfs lets pandas read gs:// paths
@dsl.component(base_image="python:3.11", packages_to_install=["pandas", "gcsfs"])
def preprocess(data_path: str, processed: Output[Dataset]):
    import pandas as pd
    df = pd.read_csv(data_path)
    # Clean, transform, feature engineer...
    # Write to the artifact path: each step runs in its own container,
    # so a file under /tmp would not be visible to the next step.
    df.to_csv(processed.path, index=False)

@dsl.component(base_image="pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime")
def train(data: Input[Dataset], model_out: Output[Model], epochs: int = 10):
    import torch
    # Placeholder model; replace with your architecture
    model = torch.nn.Linear(4, 1)
    # Training logic using data.path and epochs...
    torch.save(model.state_dict(), model_out.path)

@dsl.component(base_image="python:3.11")
def deploy(model: Input[Model], endpoint: str):
    # Deploy model.uri to KServe
    pass

@dsl.pipeline(name="ML Training Pipeline")
def ml_pipeline(data_path: str = "gs://bucket/data.csv"):
    preprocess_task = preprocess(data_path=data_path)
    train_task = train(
        data=preprocess_task.outputs["processed"],
        epochs=20
    )
    # Request one GPU for the training step
    train_task.set_accelerator_type("nvidia.com/gpu").set_accelerator_limit(1)
    deploy(
        model=train_task.outputs["model_out"],
        endpoint="my-model"
    )

# Compile and upload
compiler.Compiler().compile(ml_pipeline, "pipeline.yaml")

# Submit via CLI (the KFP v2 CLI uses `run create`; v1 used `run submit`)
# kfp run create -e my-experiment -r run-001 -f pipeline.yaml

Jupyter Notebooks

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: gpu-notebook
  namespace: ml-team-alpha
spec:
  template:
    spec:
      containers:
        - name: notebook
          image: kubeflownotebookswg/jupyter-pytorch-cuda-full:v1.9.0
          resources:
            requests:
              cpu: "2"
              memory: 8Gi
              nvidia.com/gpu: "1"
            limits:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: workspace
              mountPath: /home/jovyan
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: gpu-notebook-workspace

Katib Hyperparameter Tuning

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: tune-learning-rate
  namespace: ml-team-alpha
spec:
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: bayesianoptimization
  maxTrialCount: 30
  maxFailedTrialCount: 3
  parallelTrialCount: 5
  parameters:
    - name: learning-rate
      parameterType: double
      feasibleSpace:
        min: "0.0001"
        max: "0.1"
    - name: batch-size
      parameterType: int
      feasibleSpace:
        min: "16"
        max: "256"
    - name: optimizer
      parameterType: categorical
      feasibleSpace:
        list: ["adam", "sgd", "adamw"]
  trialTemplate:
    primaryContainerName: training
    trialParameters:
      - name: learningRate
        reference: learning-rate
      - name: batchSize
        reference: batch-size
      - name: optimizer
        reference: optimizer
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: training
                image: my-training:v1
                command:
                  - python
                  - train.py
                  - --lr=${trialParameters.learningRate}
                  - --batch-size=${trialParameters.batchSize}
                  - --optimizer=${trialParameters.optimizer}
                resources:
                  limits:
                    nvidia.com/gpu: 1
            restartPolicy: Never
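
The trial command above passes `--lr`, `--batch-size`, and `--optimizer` to `train.py`, and the Experiment's objective expects a metric named `accuracy`. As a rough sketch of what that script's entry point might look like, assuming the Experiment relies on Katib's default stdout metrics collector (which reads `name=value` lines from the container log), the following parses the flags and prints the metric. The training step is a placeholder and the helper names are illustrative.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the Katib trial template passes to train.py."""
    parser = argparse.ArgumentParser(description="Katib trial entry point (sketch)")
    parser.add_argument("--lr", type=float, required=True)
    parser.add_argument("--batch-size", type=int, required=True)
    parser.add_argument("--optimizer", choices=["adam", "sgd", "adamw"], required=True)
    return parser.parse_args(argv)


def metric_line(name: str, value: float) -> str:
    """Format a metric as a name=value line for the stdout collector."""
    return f"{name}={value:.4f}"


def main(argv=None):
    args = parse_args(argv)
    # Placeholder: run real training here with args.lr, args.batch_size,
    # and args.optimizer, then report the measured validation accuracy.
    accuracy = 0.0
    print(metric_line("accuracy", accuracy))
    return args


# Demo with explicit arguments. Inside the trial container, call main()
# with no arguments so argparse reads the command line instead.
main(["--lr=0.01", "--batch-size=64", "--optimizer=adam"])
```

The metric name printed here has to match `objectiveMetricName` in the Experiment, otherwise Katib has nothing to optimize against.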

KServe Model Serving

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
  namespace: ml-team-alpha
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "gs://models/my-model/v1"
      resources:
        requests:
          cpu: "2"
          memory: 4Gi
        limits:
          nvidia.com/gpu: "1"   # GPUs must be set under limits; the request defaults to the same value
    minReplicas: 1
    maxReplicas: 5
    scaleTarget: 10        # Scale at 10 concurrent requests
  transformer:
    containers:
      - name: preprocessor
        image: my-preprocessor:v1
# Test inference
curl -X POST http://my-model.ml-team-alpha.example.com/v1/models/my-model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'
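
The same request can be issued from application code. Here is a minimal sketch using only the Python standard library, assuming the service answers on the hostname from the curl example and returns a JSON body with a `predictions` field, as the KServe V1 inference protocol specifies. The function names are illustrative.

```python
import json
import urllib.request


def build_predict_request(base_url: str, model_name: str, instances: list) -> urllib.request.Request:
    """Build a V1-protocol predict request without sending it."""
    url = f"{base_url}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url: str, model_name: str, instances: list, timeout: float = 30.0) -> list:
    """Send the request and return the 'predictions' field of the response."""
    request = build_predict_request(base_url, model_name, instances)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["predictions"]


# Building the request needs no network access, so it is easy to inspect.
req = build_predict_request(
    "http://my-model.ml-team-alpha.example.com",
    "my-model",
    [[1.0, 2.0, 3.0, 4.0]],
)
print(req.get_method(), req.full_url)
```

Keeping request construction separate from the network call makes the payload easy to unit test; `predict(...)` is only useful once the InferenceService reports ready.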

Full ML Workflow

graph TD
    A[Jupyter Notebook: Explore data] --> B[Kubeflow Pipeline]
    B --> C[Preprocess data]
    C --> D[Katib: Tune hyperparameters]
    D --> E[Training Operator: Distributed training]
    E --> F[Model Registry]
    F --> G[KServe: Deploy model]
    G --> H[Production inference]
    H --> I[Monitor & retrain]
    I --> B

Resource Requirements

| Component | Min CPU | Min Memory | Notes |
| --- | --- | --- | --- |
| Istio | 2 cores | 4Gi | Service mesh |
| Pipelines | 2 cores | 4Gi | Argo + MySQL + MinIO |
| Notebooks | 1 core | 2Gi | Per notebook server |
| Training Operator | 500m | 512Mi | Controller only |
| Katib | 500m | 512Mi | Controller + DB |
| KServe | 1 core | 2Gi | Controller + Istio |
| Total platform | ~8 cores | ~16Gi | Without workloads |
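
As a quick arithmetic check, the per-component minimums add up to 7 cores and 13 GiB, so the rounded platform total of about 8 cores and 16Gi leaves some headroom. The sketch below reproduces that sum and adds a hypothetical `fits` helper for comparing it against a cluster's allocatable capacity. It counts one notebook server, since that row is per server.

```python
# (CPU cores, memory in GiB) per component; 500m = 0.5 core, 512Mi = 0.5 GiB.
MINIMUMS = {
    "Istio": (2.0, 4.0),
    "Pipelines": (2.0, 4.0),
    "Notebooks": (1.0, 2.0),  # per notebook server; one server assumed
    "Training Operator": (0.5, 0.5),
    "Katib": (0.5, 0.5),
    "KServe": (1.0, 2.0),
}

total_cpu = sum(cpu for cpu, _ in MINIMUMS.values())
total_mem = sum(mem for _, mem in MINIMUMS.values())


def fits(allocatable_cpu: float, allocatable_mem_gi: float) -> bool:
    """Return True if the cluster can hold the platform minimums."""
    return allocatable_cpu >= total_cpu and allocatable_mem_gi >= total_mem


print(f"{total_cpu} cores, {total_mem} GiB")  # 7.0 cores, 13.0 GiB
print(fits(8, 16))  # True
```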

Production Hardening

# 1. Change default credentials
kubectl edit configmap dex -n auth
# Update staticPasswords

# 2. Configure external auth (OIDC)
# Point Dex to your identity provider (Okta, Azure AD, Google)

# 3. Enable HTTPS
# Configure Istio gateway with TLS certificates

# 4. External database for Pipelines
# Replace built-in MySQL with managed RDS/CloudSQL

# 5. External object storage
# Replace MinIO with S3/GCS for pipeline artifacts

# 6. GPU node pools
# Dedicate GPU nodes with taints for training workloads
kubectl taint nodes gpu-pool nvidia.com/gpu=present:NoSchedule
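
The taint in step 6 keeps every pod off the GPU nodes unless it carries a matching toleration. A minimal sketch of adding that toleration to a pod spec, represented here as a plain dictionary, might look like this. The helper name `with_gpu_toleration` and the sample spec are illustrative.

```python
import copy

# Matches the taint nvidia.com/gpu=present:NoSchedule applied above.
GPU_TOLERATION = {
    "key": "nvidia.com/gpu",
    "operator": "Equal",
    "value": "present",
    "effect": "NoSchedule",
}


def with_gpu_toleration(pod_spec: dict) -> dict:
    """Return a copy of pod_spec that tolerates the GPU node taint."""
    spec = copy.deepcopy(pod_spec)
    tolerations = spec.setdefault("tolerations", [])
    if GPU_TOLERATION not in tolerations:
        tolerations.append(dict(GPU_TOLERATION))
    return spec


# Hypothetical training pod spec without tolerations.
training_spec = {
    "containers": [
        {"name": "training", "image": "my-training:v1",
         "resources": {"limits": {"nvidia.com/gpu": 1}}}
    ],
    "restartPolicy": "Never",
}

patched = with_gpu_toleration(training_spec)
print(patched["tolerations"])
```

The same toleration block belongs in the pod templates of training jobs, Katib trials, and GPU notebooks that should land on the dedicated pool.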

Common Issues

| Issue | Cause | Fix |
| --- | --- | --- |
| Pods stuck in Init | Istio sidecar injection delay | Wait, check istio-proxy logs |
| Pipeline step OOM | Insufficient memory limit | Increase resource requests |
| Notebook won't start | PVC not bound | Check StorageClass |
| Katib trials fail | Training image error | Test image standalone first |
| KServe 503 | Model loading timeout | Increase timeout, check storage access |
| Auth redirect loop | Dex misconfiguration | Check OIDC settings, cookie domain |

Best Practices

  • Start small: Install only the components you need (Pipelines + Notebooks is a good start)
  • Use Profiles for team isolation: each team gets their own namespace with quotas
  • External databases for production: built-in MySQL/MinIO are for development only
  • GPU quotas: Set requests.nvidia.com/gpu in Profile ResourceQuotas
  • Version pin your Kubeflow manifests: don't track main in production
  • Back up the Pipelines DB: it contains all experiment history and metadata

Key Takeaways

  • Kubeflow provides a complete MLOps platform on Kubernetes
  • The operator approach manages the entire lifecycle declaratively
  • Multi-tenancy via Profiles provides team isolation with RBAC and quotas
  • Katib automates hyperparameter tuning with Bayesian optimization
  • KServe handles model serving with autoscaling and canary rollouts
  • Production deployments need external databases, OIDC auth, and TLS