ai • beginner • ⏱ 20 minutes • K8s 1.28+

CNCF AI Projects Landscape for Kubernetes

Navigate the CNCF AI project landscape for Kubernetes. This recipe covers Kubeflow, KServe, KAITO, Volcano, and emerging projects for training, serving, and scheduling.

By Luca Berton • 📖 5 min read

💡 Quick Answer: The CNCF Cloud Native AI (CNAI) landscape covers: Kubeflow (ML platform), KServe (model serving), Volcano (batch scheduling), KAITO (automated GPU provisioning), and Kueue (job queuing). Choose based on your stage: Kubeflow for full MLOps, KServe for serving-only, KAITO for a quick LLM start.

The Problem

The AI/ML ecosystem on Kubernetes has grown from a few projects to a sprawling landscape. Choosing the right combination of CNCF projects for training, serving, scheduling, and monitoring AI workloads is confusing, especially when projects overlap in functionality.

The Solution

CNCF AI Project Map

Category | Project | Stage | Purpose
ML Platform | Kubeflow | CNCF Incubating | Full MLOps lifecycle
Model Serving | KServe | CNCF Incubating | Serverless inference
Batch Scheduling | Volcano | CNCF Incubating | Gang scheduling, queues
Job Queuing | Kueue | Kubernetes SIG subproject (not a separate CNCF project) | Fair sharing, quotas
GPU Provisioning | KAITO | CNCF Sandbox | Automated LLM deploy
Distributed Training | Training Operator | Part of Kubeflow | PyTorch/TF/MPI jobs
HP Tuning | Katib | Part of Kubeflow | AutoML experiments
Feature Store | Feast | Not a CNCF project (hosted by LF AI & Data) | Feature management

Decision Tree

Need full ML platform (notebooks + training + serving)?
  → Kubeflow

Just need model serving with autoscaling?
  → KServe (standalone, without full Kubeflow)

Need to deploy a preset LLM quickly?
  → KAITO (provisions GPU nodes automatically)

Need batch scheduling for training jobs?
  → Volcano (gang scheduling, fair-share queues)
  → Kueue (lighter weight, K8s-native queuing)

Need multi-model serving on shared GPUs?
  → ModelMesh (part of KServe)

Need hyperparameter tuning?
  → Katib (part of Kubeflow, or standalone)

Complementary Projects

# Common production stack:
# 1. Kueue for job admission control
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-queue
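spec:
  namespaceSelector: {}  # empty selector admits all namespaces; if omitted, no namespace is eligible
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: a100  # must match the name of an existing ResourceFlavor
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 32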

---
# 2. Volcano for gang scheduling
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training
spec:
  weight: 2
  capability:
    nvidia.com/gpu: 16

---
# 3. KServe for serving
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: production-model
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/production/v5"
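
The `gpu-queue` ClusterQueue above only admits work once a matching ResourceFlavor and a namespaced LocalQueue exist and a Job opts in through a label. A minimal sketch of those pieces follows. The `ml-team` namespace, the `team-queue` LocalQueue, the Job name, and the container image are hypothetical placeholders, and the `v1beta1` API version is assumed to match the ClusterQueue.

```yaml
# ResourceFlavor referenced by the ClusterQueue's "a100" flavor
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100
---
# LocalQueue: namespaced entry point that routes to the ClusterQueue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-queue          # hypothetical name
  namespace: ml-team        # hypothetical namespace
spec:
  clusterQueue: gpu-queue
---
# Job that opts in to Kueue through the queue-name label
apiVersion: batch/v1
kind: Job
metadata:
  name: train-example       # hypothetical name
  namespace: ml-team
  labels:
    kueue.x-k8s.io/queue-name: team-queue
spec:
  suspend: true             # Kueue unsuspends the Job once quota is available
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1
```

With this layout, teams submit to their own LocalQueue while the 32-GPU quota is enforced once, at the ClusterQueue level.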

LF AI & Data Landscape

Beyond CNCF, the wider Linux Foundation ecosystem, including LF AI & Data, hosts related projects:

  • ONNX: Open model format, framework interoperability
  • Horovod: Distributed deep learning (originally from Uber)
  • MLflow: Experiment tracking and model registry
  • Ray: Distributed computing framework
graph TD
    subgraph CNCF AI Stack
        KF[Kubeflow<br/>ML Platform] --> KS[KServe<br/>Serving]
        KF --> TO[Training Operator<br/>Distributed Training]
        KF --> KAT[Katib<br/>HP Tuning]
        KF --> PIP[Pipelines<br/>Workflow]
    end
    
    subgraph Scheduling
        VOL[Volcano<br/>Gang Scheduling]
        KUE[Kueue<br/>Job Queuing]
    end
    
    subgraph Provisioning
        KAITO_N[KAITO<br/>GPU Auto-provision]
    end
    
    TO --> VOL
    KS --> MM[ModelMesh<br/>Multi-model]
    KAITO_N --> KS

Common Issues

Kubeflow vs KAITO: which to install?

They serve different purposes. Kubeflow is a full ML platform (notebooks, training, pipelines, serving). KAITO is an operator that deploys preset LLMs and provisions the GPU nodes they need. Use KAITO for quick inference and Kubeflow for the full MLOps lifecycle.
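
To make the difference concrete, a KAITO deployment is typically a single Workspace resource. The sketch below follows the shape of the project's published examples, but treat the details as assumptions: the API version may differ in your KAITO release, the `instanceType` is an Azure VM size chosen for illustration, and the resource name and label are hypothetical.

```yaml
apiVersion: kaito.sh/v1alpha1      # assumed; newer releases may serve a later version
kind: Workspace
metadata:
  name: workspace-falcon-7b        # hypothetical name
resource:
  instanceType: "Standard_NC12s_v3"  # illustrative Azure GPU VM size
  labelSelector:
    matchLabels:
      apps: falcon-7b              # label KAITO uses to select or create GPU nodes
inference:
  preset:
    name: "falcon-7b"              # one of KAITO's preset models
```

There is no pipeline, notebook, or training component here, which is why KAITO suits quick inference rather than a full MLOps workflow.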

Volcano vs Kueue: which scheduler?

Use Volcano for complex gang scheduling with custom plugins; it runs as its own scheduler. Use Kueue for simpler fair-share queuing that works with the default scheduler. Many teams start with Kueue for its simplicity.
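
As a sketch of what gang scheduling looks like in practice, the Volcano Job below asks for four workers and sets `minAvailable` to the same number, so no pod starts unless all four can be placed. It targets the `training` queue from the production stack example; the Job name and container image are hypothetical placeholders.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-train          # hypothetical name
spec:
  schedulerName: volcano           # hand the pods to the Volcano scheduler
  queue: training
  minAvailable: 4                  # gang size: all 4 workers or none
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: trainer
              image: registry.example.com/trainer:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1
```

Setting `minAvailable` below `replicas` would allow a partial start, which is usually not what multi-node training wants.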

Best Practices

  • Start with KServe if you only need model serving: don't install all of Kubeflow
  • Add Kueue early: job queuing prevents GPU resource conflicts
  • KAITO for LLM quickstart: minutes to deploy vs hours with manual setup
  • Volcano for multi-node training: gang scheduling prevents partial starts
  • Check CNCF maturity level: Graduated > Incubating > Sandbox for production use
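
For the first practice, a standalone KServe deployment with autoscaling can be sketched as a single InferenceService. The replica bounds, concurrency target, and GPU limit below are illustrative values, not recommendations; the model name and storage URI reuse the ones from the production stack example.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: production-model
spec:
  predictor:
    minReplicas: 1               # illustrative lower bound
    maxReplicas: 4               # illustrative upper bound
    scaleMetric: concurrency     # scale on in-flight requests per replica
    scaleTarget: 10              # illustrative target per replica
    model:
      modelFormat:
        name: pytorch
      storageUri: "s3://models/production/v5"
      resources:
        limits:
          nvidia.com/gpu: 1
```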

Key Takeaways

  • CNCF AI ecosystem covers the full ML lifecycle: training, tuning, serving, scheduling, monitoring
  • Kubeflow is the comprehensive platform; KServe, Katib, and Training Operator can be used standalone
  • KAITO automates LLM deployment end-to-end, from GPU provisioning through serving
  • Volcano and Kueue solve GPU scheduling through gang scheduling and fair-share queuing
  • Choose based on need: full platform (Kubeflow), serving only (KServe), quick LLM deploy (KAITO)
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
