πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
Observability intermediate ⏱ 15 minutes K8s 1.28+

Per-Tenant GPU Monitoring and Chargeback

Build per-tenant GPU monitoring dashboards with queue time, utilization, thermal metrics, and GPU-hour chargeback on Kubernetes.

By Luca Berton β€’ β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Use DCGM Exporter for GPU metrics, kube_pod_labels for tenant association, and PromQL queries like DCGM_FI_DEV_GPU_UTIL * on(pod) group_left(namespace) kube_pod_info{namespace=~"tenant-.*"} for per-tenant GPU utilization. Calculate GPU-hours for chargeback with sum(rate(DCGM_FI_DEV_GPU_UTIL[1h])) by (namespace) / 100 * count by (namespace).

The Problem

When teams can’t see their GPU usage, behavior doesn’t change. Over-provisioning, idle GPUs, and queue starvation remain invisible. Without chargeback, there’s no incentive to optimize. Teams need self-service dashboards showing their queue time, GPU utilization, and costs.

The Solution

DCGM Exporter Metrics

# Key DCGM metrics for tenant monitoring:
# DCGM_FI_DEV_GPU_UTIL          β€” GPU utilization %
# DCGM_FI_DEV_MEM_COPY_UTIL     β€” Memory utilization %
# DCGM_FI_DEV_GPU_TEMP          β€” GPU temperature
# DCGM_FI_DEV_POWER_USAGE       β€” Power draw (W)
# DCGM_FI_DEV_FB_USED           β€” Framebuffer memory used (MB)
# DCGM_FI_DEV_FB_FREE           β€” Framebuffer memory free (MB)
# DCGM_FI_PROF_GR_ENGINE_ACTIVE β€” GPU engine active ratio

Per-Tenant GPU Utilization

# PrometheusRule for per-tenant GPU alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-tenant-alerts
  namespace: openshift-monitoring
spec:
  groups:
    - name: gpu-tenant
      interval: 30s
      rules:
        # Per-tenant GPU utilization
        - record: tenant:gpu_utilization:avg
          expr: |
            avg(DCGM_FI_DEV_GPU_UTIL
              * on(pod, namespace) group_left()
              kube_pod_info{namespace=~"tenant-.*"}
            ) by (namespace)

        # Per-tenant GPU memory usage
        - record: tenant:gpu_memory_used_bytes:sum
          expr: |
            sum(DCGM_FI_DEV_FB_USED
              * on(pod, namespace) group_left()
              kube_pod_info{namespace=~"tenant-.*"}
            ) by (namespace) * 1024 * 1024

        # GPU-hours per tenant (for chargeback)
        - record: tenant:gpu_hours:rate1h
          expr: |
            sum(
              count by (namespace, pod) (
                DCGM_FI_DEV_GPU_UTIL{namespace=~"tenant-.*"} > 0
              )
            ) by (namespace) / 3600

        # Alert: GPU idle >30min
        - alert: GPUIdleTenant
          expr: |
            tenant:gpu_utilization:avg < 5
            and on(namespace)
            count(DCGM_FI_DEV_GPU_UTIL{namespace=~"tenant-.*"} >= 0) by (namespace) > 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "Tenant {{ $labels.namespace }} has idle GPUs for >30m"

Chargeback Dashboard Queries

# Grafana dashboard panels (PromQL):

# Panel 1: GPU-hours per tenant (daily)
query: |
  sum(
    increase(
      tenant:gpu_hours:rate1h[24h]
    )
  ) by (namespace)

# Panel 2: Queue time by priority class
query: |
  histogram_quantile(0.95,
    sum(rate(scheduler_queue_incoming_pods_total[5m])) by (le, priority_level)
  )

# Panel 3: GPU utilization heatmap per tenant
query: |
  avg(DCGM_FI_DEV_GPU_UTIL
    * on(pod, namespace) group_left()
    kube_pod_info{namespace=~"tenant-.*"}
  ) by (namespace)

# Panel 4: Cost estimation ($3.50/GPU-hour for H200)
query: |
  sum(increase(tenant:gpu_hours:rate1h[720h])) by (namespace) * 3.50

Queue Time Monitoring

# Scheduler metrics for queue wait time
- record: tenant:scheduler_queue_time_p95:5m
  expr: |
    histogram_quantile(0.95,
      sum(rate(scheduler_scheduling_attempt_duration_seconds_bucket[5m]))
      by (le, result)
    )

Self-Service Dashboard ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: tenant-gpu-dashboard
  namespace: openshift-monitoring
  labels:
    grafana_dashboard: "true"
data:
  tenant-gpu.json: |
    {
      "dashboard": {
        "title": "GPU Tenant Dashboard",
        "templating": {
          "list": [{
            "name": "namespace",
            "type": "query",
            "query": "label_values(DCGM_FI_DEV_GPU_UTIL, namespace)",
            "regex": "tenant-.*"
          }]
        },
        "panels": [
          {"title": "GPU Utilization", "type": "gauge"},
          {"title": "GPU Memory", "type": "timeseries"},
          {"title": "Queue Time p95", "type": "stat"},
          {"title": "GPU-Hours (MTD)", "type": "stat"},
          {"title": "Estimated Cost", "type": "stat"}
        ]
      }
    }
graph TD
    A[DCGM Exporter] --> B[Prometheus]
    C[kube-state-metrics] --> B
    D[Scheduler Metrics] --> B
    E[HAProxy Logs] --> F[Loki]
    
    B --> G[Recording Rules per tenant]
    G --> H[Grafana Dashboard]
    
    H --> I[GPU Utilization]
    H --> J[Queue Time p95]
    H --> K[GPU Memory and Thermal]
    H --> L[Chargeback: GPU-hours x rate]
    
    L --> M[Tenant self-service view]

Common Issues

  • DCGM metrics missing namespace label β€” DCGM exports per-GPU; join with kube_pod_info to get namespace association
  • Chargeback counts idle GPUs β€” use DCGM_FI_DEV_GPU_UTIL > 5 threshold to only count active GPUs
  • Dashboard too slow β€” use recording rules for expensive per-tenant aggregations; pre-compute hourly

Best Practices

  • Use recording rules for per-tenant aggregations β€” avoid expensive real-time PromQL
  • Provide self-service dashboards per tenant (Grafana with namespace variable)
  • Alert on idle GPUs (>30min at <5% utilization) to improve utilization
  • GPU-hour chargeback changes team behavior β€” visibility drives optimization
  • Track queue time p95 as a key SLO β€” if teams wait >10min for GPUs, add capacity or adjust priorities

Key Takeaways

  • DCGM Exporter + kube_pod_info join enables per-tenant GPU metrics
  • GPU-hours = primary chargeback metric (count of GPUs Γ— hours active)
  • Queue time, utilization, memory, thermal β€” four key metrics per tenant
  • Self-service dashboards change behavior β€” teams optimize when they see costs
  • Recording rules prevent expensive real-time aggregations in Prometheus
#monitoring #gpu #chargeback #prometheus #grafana #dcgm
Luca Berton
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.

Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens