Run:ai GPU Metrics Pipeline with DCGM and Thanos
End-to-end GPU metrics pipeline on Run:ai: DCGM exporter collects GPU utilization, Prometheus scrapes, remote-writes to Thanos Receive, and Grafana dashboards
π‘ Quick Answer: Run:ai uses DCGM Exporter β Prometheus β Thanos Receive β Thanos Query β Grafana to provide per-workload GPU utilization, memory usage, NVLink bandwidth, and GPU compute allocation metrics with long-term retention.
The Problem
You need to:
- Track GPU utilization per training job (not just per node)
- Retain GPU metrics beyond Prometheusβs local retention (15d default)
- Visualize GPU compute, memory, and NVLink usage in dashboards
- Correlate GPU metrics with workload lifecycle events
- Alert on underutilized GPUs (wasted expensive resources)
The Solution
Metrics Pipeline Architecture
GPU Node Infra Node
βββββββββββββββββββββββ βββββββββββββββββββββββββββββββββββ
β NVIDIA GPU β β Thanos Receive (StatefulSet) β
β β β β β remote-write β
β DCGM Exporter :9400 β β β β
β β scrape β β Prometheus (cluster-monitoring) β
β Prometheus Agent ββΌββββββββΌββ β β
βββββββββββββββββββββββ β β query β
β Thanos Query β
β β β
β Grafana (Run:ai dashboards) β
βββββββββββββββββββββββββββββββββββKey Metrics Collected
GPU Metrics (from DCGM Exporter):
βββ GPU_UTILIZATION β % compute cores active (0-100)
βββ GPU_MEMORY_USAGE_BYTES β VRAM used (bytes)
βββ GPU_MEMORY_TOTAL_BYTES β Total VRAM available
βββ CPU_USAGE_CORES β Container CPU usage
βββ CPU_MEMORY_USAGE_BYTES β Container RAM usage
βββ NVLINK_BANDWIDTH_TOTAL β Inter-GPU bandwidth (bytes/sec)
βββ GPU_TEMPERATURE β Die temperature (Β°C)
Run:ai enrichment labels:
βββ clusterId β Cluster UUID
βββ workload β Job name (e.g., mistral4small-fsdp)
βββ project β Run:ai project/department
βββ user β Submitting user
βββ gpu_index β GPU device index (0,1,2...)DCGM Exporter DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: dcgm-exporter
namespace: gpu-operator
spec:
selector:
matchLabels:
app: dcgm-exporter
template:
spec:
nodeSelector:
nvidia.com/gpu.present: "true"
containers:
- name: dcgm-exporter
image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
ports:
- containerPort: 9400
name: metrics
env:
- name: DCGM_EXPORTER_LISTEN
value: ":9400"
- name: DCGM_EXPORTER_KUBERNETES
value: "true"
- name: DCGM_EXPORTER_COLLECTORS
value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
volumeMounts:
- name: dcgm-metrics
mountPath: /etc/dcgm-exporter
volumes:
- name: dcgm-metrics
configMap:
name: dcgm-metrics-configPrometheus Remote Write to Thanos
# Prometheus config for remote-write to Thanos Receive
remoteWrite:
- url: "http://runai-backend-thanos-receive.runai-backend.svc:19291/api/v1/receive"
writeRelabelConfigs:
- sourceLabels: [__name__]
regex: "DCGM_.*|runai_.*|nvidia_.*"
action: keep
queueConfig:
maxSamplesPerSend: 5000
batchSendDeadline: 10s
maxRetries: 3Grafana Dashboard Queries (PromQL)
# GPU Compute Utilization per workload
avg(DCGM_FI_DEV_GPU_UTIL{workload="mistral4small-fsdp"}) by (gpu_index)
# GPU Memory Usage per workload (GiB)
sum(DCGM_FI_DEV_FB_USED{workload="mistral4small-fsdp"}) by (gpu_index) / 1024
# Total GPU allocation across cluster
count(DCGM_FI_DEV_GPU_UTIL > 0) / count(DCGM_FI_DEV_GPU_UTIL)
# NVLink bandwidth (GB/s)
rate(DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL[5m]) / 1e9
# Idle GPU detection (< 5% utilization for 30 min)
DCGM_FI_DEV_GPU_UTIL < 5 and ON() (time() - runai_job_start_time > 1800)Run:ai API Metrics Endpoints
Run:ai UI fetches metrics via REST API:
GET /api/v1/metrics?metricType=GPU_UTILIZATION
&start=2026-05-05&d=2026-05-05T14%3A...
&clusterId=d94fdaa3-e91e-4368-b5b9-a71751bf3985
GET /api/v1/metrics?metricType=GPU_MEMORY_USAGE_BYTES
&start=20&d=2026-05-05T14%3A...
GET /api/v1/metrics?metricType=CPU_USAGE_CORES
&start=20&d=2026-05-05T14%3A...
GET /api/v1/metrics?metricType=NVLINK_BANDWIDTH_TOTAL
&start=20&d=2026-05-05T14%3A...
Response format:
{
"value": "12582912",
"timestamp": "2026-05-05T14:07:12.3452"
}Typical GPU Training Metrics Pattern
Phase GPU Util GPU Mem CPU Cores CPU Mem
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Initializing 0% 0 GB 0.5 2 GB
Model Loading 5-10% +20 GB 2-3 +40 GB
FSDP Setup 10-20% +5 GB 3-4 +10 GB
Training Loop 85-98% stable 2-4 stable
Checkpointing 20-30% stable 1-2 +5 GB
Evaluation 60-80% stable 1-2 stable
Completion 0% drops 0 dropsAlert Rules for GPU Waste
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: gpu-waste-alerts
namespace: runai-backend
spec:
groups:
- name: gpu-utilization
rules:
- alert: GPUIdleWorkload
expr: |
avg_over_time(DCGM_FI_DEV_GPU_UTIL[30m]) < 5
and on(pod) kube_pod_status_phase{phase="Running"} == 1
for: 30m
labels:
severity: warning
annotations:
summary: "GPU idle for 30+ minutes"
description: "Workload {{ $labels.workload }} using < 5% GPU"
- alert: GPUMemoryUnderutilized
expr: |
DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE < 0.2
for: 1h
labels:
severity: info
annotations:
summary: "GPU memory < 20% utilized"Common Issues
GPU metrics missing in Grafana
- Cause: Thanos Receive crashed (OOMKilled) β metrics gap
- Fix: Fix Thanos Receive memory; historical gaps are permanent
DCGM Exporter CrashLoopBackOff
- Cause: GPU driver mismatch or DCGM version incompatibility
- Fix: Match DCGM exporter version to GPU driver version
NVLink metrics all zeros
- Cause: Single-GPU workload or NVLink not configured
- Fix: NVLink metrics only appear for multi-GPU workloads using NCCL
Metrics delayed by 5+ minutes
- Cause: Prometheus remote-write queue backlog
- Fix: Increase
maxSamplesPerSend; check Thanos Receive health
Best Practices
- Size Thanos Receive for retention β 4Gi+ memory for 15d of GPU metrics
- Filter remote-write β only send DCGM/Run:ai metrics, not all cluster metrics
- Alert on idle GPUs β $10+/hour per GPU wasted is expensive
- Use per-workload labels β enables chargeback by team/project
- Monitor NVLink β bandwidth drops indicate NCCL communication issues
- Set dashboard time ranges β training jobs are short; use 1h-4h windows
Key Takeaways
- DCGM Exporter runs as DaemonSet on all GPU nodes, exposes :9400
- Prometheus scrapes and remote-writes to Thanos Receive in runai-backend
- Run:ai UI queries metrics via REST API (wraps Thanos Query PromQL)
- Key metrics: GPU_UTILIZATION, GPU_MEMORY_USAGE_BYTES, NVLINK_BANDWIDTH
- Typical FSDP training: 85-98% GPU util, ~32% GPU memory, spiky CPU
- Thanos Receive OOM causes permanent metrics gaps β size it properly
- Alert on idle GPUs to avoid wasting $10+/hour per unused device

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
