ML Pipeline Automation Kubernetes
Automate ML pipelines on Kubernetes with Kubeflow Pipelines, Argo Workflows, and Tekton. Data preprocessing, training, evaluation, model registration.
π‘ Quick Answer: Use Kubeflow Pipelines for ML-specific workflows (data prep β train β evaluate β deploy) or Argo Workflows for general-purpose DAG execution. Define pipelines as Python code with the KFP SDK, compile to YAML, and schedule recurring runs for model retraining.
The Problem
Manual ML workflows donβt scale: data scientists run notebooks, manually copy model artifacts, test in staging, and deploy to production. Each step is error-prone and unreproducible. ML pipelines automate the full workflow β from raw data to deployed model β with versioning, caching, and monitoring.
The Solution
Kubeflow Pipeline (Python SDK)
from kfp import dsl, compiler
@dsl.component(base_image="python:3.11")
def preprocess_data(input_path: str, output_path: dsl.Output[dsl.Dataset]):
import pandas as pd
df = pd.read_csv(input_path)
df_clean = df.dropna()
df_clean.to_csv(output_path.path)
@dsl.component(base_image="registry.example.com/pytorch:2.5")
def train_model(
dataset: dsl.Input[dsl.Dataset],
model: dsl.Output[dsl.Model],
epochs: int = 10,
lr: float = 0.001
):
# Training logic here
pass
@dsl.component
def evaluate_model(model: dsl.Input[dsl.Model]) -> float:
# Evaluation logic
return 0.94
@dsl.component
def deploy_model(model: dsl.Input[dsl.Model], accuracy: float):
if accuracy > 0.90:
# Deploy to KServe
pass
@dsl.pipeline(name="ML Training Pipeline")
def training_pipeline(input_data: str = "s3://data/train.csv"):
preprocess = preprocess_data(input_path=input_data)
train = train_model(dataset=preprocess.outputs["output_path"], epochs=20)
evaluate = evaluate_model(model=train.outputs["model"])
deploy = deploy_model(
model=train.outputs["model"],
accuracy=evaluate.output
)
compiler.Compiler().compile(training_pipeline, "pipeline.yaml")Argo Workflow Alternative
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
name: ml-training
spec:
entrypoint: ml-pipeline
templates:
- name: ml-pipeline
dag:
tasks:
- name: preprocess
template: preprocess
- name: train
template: train
dependencies: [preprocess]
arguments:
artifacts:
- name: data
from: "{{tasks.preprocess.outputs.artifacts.processed-data}}"
- name: evaluate
template: evaluate
dependencies: [train]
- name: deploy
template: deploy
dependencies: [evaluate]
when: "{{tasks.evaluate.outputs.parameters.accuracy}} > 0.90"
- name: train
container:
image: registry.example.com/training:1.0
resources:
limits:
nvidia.com/gpu: 4Scheduled Retraining
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
name: weekly-retrain
spec:
schedule: "0 2 * * 1"
timezone: "UTC"
workflowSpec:
entrypoint: ml-pipeline
# ... same pipeline specgraph LR
DATA[Raw Data<br/>S3/GCS] --> PREP[Preprocess<br/>Clean + Feature Eng]
PREP --> TRAIN[Train<br/>Distributed GPU]
TRAIN --> EVAL[Evaluate<br/>Accuracy + Metrics]
EVAL -->|accuracy > 0.90| DEPLOY[Deploy<br/>KServe InferenceService]
EVAL -->|accuracy < 0.90| ALERT[Alert<br/>Model degraded]
CRON[CronWorkflow<br/>Weekly Monday 2AM] -->|Trigger| DATACommon Issues
Pipeline step fails but cached result is used on retry
KFP caches step outputs by default. Force re-execution: dsl.pipeline(pipeline_root='...').set_caching_options(False).
GPU step stuck in Pending
Pipeline steps run as pods. Check GPU availability: kubectl describe node | grep nvidia.com/gpu. Set activeDeadlineSeconds to prevent indefinite waits.
Best Practices
- Version everything β data, code, model artifacts, pipeline definition
- Conditional deployment β only deploy if evaluation metrics meet threshold
- Cache intermediate steps β donβt reprocess data if only hyperparameters changed
- Schedule weekly retraining β models degrade as data distribution shifts
- Alert on metric degradation β automated monitoring of model accuracy
Key Takeaways
- ML pipelines automate the full workflow: data β train β evaluate β deploy
- Kubeflow Pipelines for ML-specific workflows; Argo Workflows for general DAGs
- Python SDK compiles pipelines to Kubernetes-native YAML
- Caching avoids redundant computation β only re-run changed steps
- Scheduled retraining prevents model drift β critical for production ML

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Master ML lifecycle management with MLflow on Kubernetes β tracking, registry, and deployment.
Start Learning βAutomate Kubernetes node configuration and cluster bootstrapping with Ansible.
Start Learning βCourses by CopyPasteLearn.com β Learn IT by Doing
