πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
ai intermediate ⏱ 20 minutes K8s 1.28+

AI Cost Management on Kubernetes

Control AI infrastructure costs on Kubernetes with GPU utilization tracking, chargeback per team, spot instance strategies, right-sizing recommendations.

By Luca Berton β€’ β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Implement GPU cost management with: (1) DCGM + Prometheus for per-pod GPU utilization, (2) Kubecost or OpenCost for team-level chargeback, (3) Kueue quotas for fair sharing, (4) idle GPU alerts at <20% utilization for >2 hours, (5) spot instances for training with checkpointing.

The Problem

AI infrastructure costs spiral quickly: a single H100 node costs $75,000/year on-demand. Without visibility into who’s using what, teams over-provision GPUs, notebooks run 24/7 with idle GPUs, and training jobs use on-demand instances when spot would work. Most organizations waste 40-60% of their GPU budget.

The Solution

GPU Cost Tracking with OpenCost

# Install OpenCost
helm install opencost opencost/opencost \
  --namespace opencost \
  --set opencost.customPricing.enabled=true \
  --set opencost.customPricing.configPath=/tmp/pricing.json

# Custom GPU pricing
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-pricing
  namespace: opencost
data:
  pricing.json: |
    {
      "nvidia.com/gpu": {
        "A100-SXM4-80GB": 2.50,
        "H100-SXM5-80GB": 4.00,
        "L4": 0.35,
        "T4": 0.15
      }
    }

Idle GPU Detection

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-cost-alerts
spec:
  groups:
    - name: gpu-cost
      rules:
        - alert: IdleGPU
          expr: |
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[2h]) < 20
            and on(pod) kube_pod_labels{label_workload_type="notebook"}
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "GPU idle >2h in notebook {{ $labels.pod }}"
            action: "Consider scaling down or using CPU-only notebook"
            estimated_waste: "{{ $value | humanize }}% utilization = ~${{ mul 2.5 0.8 | humanize }}/hr wasted"
        
        - alert: GPUOverProvisionedJob
          expr: |
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h]) < 30
            and on(pod) kube_pod_labels{label_workload_type="training"}
          for: 2h
          annotations:
            summary: "Training job {{ $labels.pod }} using <30% GPU β€” consider smaller instance"

Team Chargeback Dashboard (PromQL)

# GPU-hours per team per day
sum by (namespace) (
  count_over_time(DCGM_FI_DEV_GPU_UTIL[24h]) / 240
) * on(namespace) group_left(team)
  kube_namespace_labels{label_team!=""}

# Cost per team (GPU-hours Γ— price)
sum by (namespace) (
  count_over_time(DCGM_FI_DEV_GPU_UTIL[24h]) / 240
  * on(node) group_left(gpu_model)
    label_replace(DCGM_FI_DEV_GPU_UTIL, "gpu_model", "$1", "modelName", "(.*)")
) * 2.50  # $/GPU-hour for A100

Auto-Shutdown for Idle Notebooks

apiVersion: batch/v1
kind: CronJob
metadata:
  name: idle-notebook-cleanup
  namespace: kubeflow
spec:
  schedule: "0 */2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  for nb in $(kubectl get notebook -n kubeflow -o name); do
                    LAST_ACTIVITY=$(kubectl get $nb -o jsonpath='{.metadata.annotations.notebooks\.kubeflow\.org/last-activity}')
                    if [ $(date -d "$LAST_ACTIVITY" +%s) -lt $(date -d "4 hours ago" +%s) ]; then
                      kubectl delete $nb
                      echo "Deleted idle notebook: $nb"
                    fi
                  done

Cost Optimization Matrix

StrategySavingsRiskWorkload
Spot instances60-80%InterruptionTraining
MIG sharing3-7xReduced per-model GPUSmall inference
Time-slicing2-4xContentionNotebooks
Idle shutdown30-50%User frictionNotebooks
Right-sizing20-40%NoneAll
Reserved instances30-60%CommitmentSteady-state
graph TD
    DCGM[DCGM Metrics] --> PROM[Prometheus]
    PROM --> OPENCOST[OpenCost<br/>Per-pod GPU cost]
    PROM --> ALERTS[Idle GPU Alerts<br/><20% for 2h]
    PROM --> GRAFANA[Grafana<br/>Team chargeback]
    
    ALERTS --> CLEANUP[Auto-shutdown<br/>Idle notebooks]
    ALERTS --> RIGHTSIZE[Right-size<br/>Recommendations]
    
    OPENCOST --> REPORT[Monthly Report<br/>Per-team cost]

Common Issues

OpenCost not tracking GPU costs

Custom pricing ConfigMap must match your GPU model names exactly. Check: kubectl get nodes -o json | jq '.items[].metadata.labels["nvidia.com/gpu.product"]'.

Teams pushing back on chargeback

Start with visibility (dashboards) before enforcement (quotas). Show teams their utilization before charging. Most teams self-optimize when they see the numbers.

Best Practices

  • Visibility first β€” dashboards before enforcement, show teams their GPU utilization
  • Alert on idle GPUs β€” <20% for 2h is the standard threshold
  • Auto-shutdown idle notebooks β€” biggest single source of GPU waste
  • Spot for training β€” checkpointing makes interruption tolerable
  • Chargeback per team β€” accountability drives optimization behavior

Key Takeaways

  • 40-60% of GPU spending is wasted on idle or underutilized GPUs
  • DCGM + Prometheus + OpenCost provides per-pod, per-team GPU cost tracking
  • Idle GPU alerts catch notebooks and dev pods wasting expensive GPUs
  • Auto-shutdown idle notebooks β€” the single biggest cost optimization
  • Team-level chargeback dashboards drive self-optimization behavior
  • Spot instances for training + MIG/time-slicing for inference = 3-5x cost reduction
#cost-management #gpu-cost #chargeback #right-sizing #finops
Luca Berton
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.

Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens