Observability intermediate ⏱ 30 minutes K8s 1.28+

Monitor NCCL Benchmark Runs Prometheus & Gr...

Track NCCL benchmark outcomes and GPU telemetry over time with Prometheus and Grafana dashboards to detect communication regressions early.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Combine NCCL benchmark logs with GPU metrics (utilization, memory, interconnect indicators) in Grafana dashboards to detect performance drift across cluster changes.

Benchmark snapshots are useful, but trend-based monitoring catches regressions sooner.

Data Sources

  • NCCL benchmark output logs
  • DCGM exporter metrics
  • Node and pod metadata labels
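
To turn the first data source into something Prometheus can scrape, the benchmark log has to be reduced to a number. The sketch below assumes the logs come from the nccl-tests binaries (for example all_reduce_perf) and that they end with an "Avg bus bandwidth" summary line; the metric name nccl_benchmark_avg_busbw_gbps and the label names are hypothetical, not something the recipe defines.

```python
import re

# Assumption: the log contains the summary line printed by nccl-tests binaries,
# e.g. "# Avg bus bandwidth    : 187.432". Adjust the pattern if your logs differ.
AVG_BUSBW = re.compile(
    r"^#\s*Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)", re.MULTILINE
)


def parse_avg_busbw(log_text):
    """Return the average bus bandwidth in GB/s, or None if the line is missing."""
    match = AVG_BUSBW.search(log_text)
    return float(match.group(1)) if match else None


def to_prometheus_line(busbw_gbps, labels):
    """Render one sample in the Prometheus text exposition format.

    The metric name is hypothetical; labels are sorted for stable output.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"nccl_benchmark_avg_busbw_gbps{{{label_str}}} {busbw_gbps}"


if __name__ == "__main__":
    sample_log = "# Out of bounds values : 0 OK\n# Avg bus bandwidth    : 187.432\n"
    busbw = parse_avg_busbw(sample_log)
    print(to_prometheus_line(busbw, {"test": "all_reduce_perf", "node": "gpu-node-1"}))
```

The rendered line could be written to a file for the node_exporter textfile collector or pushed to a Pushgateway; which route fits depends on how your benchmark jobs run.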

Dashboard Suggestions

  • Benchmark run duration by node pair
  • Effective bandwidth trend by test profile
  • GPU utilization and memory during tests
  • Failure count per benchmark type

Operational Practice

Schedule recurring benchmark jobs and alert when bandwidth drops below baseline thresholds.
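
One way to decide whether a run "drops below baseline" is to compare it with the median of earlier runs. This is a minimal sketch under that assumption; the function name and the 10% tolerance are illustrative choices, not values from the recipe.

```python
from statistics import median


def bandwidth_regressed(history_gbps, latest_gbps, tolerance=0.10):
    """Return True when latest_gbps is more than `tolerance` below the baseline.

    The baseline is the median of earlier runs. The 10% default tolerance is
    an illustrative assumption; pick a value that matches your run-to-run noise.
    """
    if not history_gbps:
        return False  # no baseline yet, so there is nothing to compare against
    baseline = median(history_gbps)
    return latest_gbps < baseline * (1.0 - tolerance)


if __name__ == "__main__":
    previous_runs = [180.0, 185.0, 190.0]  # GB/s from earlier benchmark runs
    print(bandwidth_regressed(previous_runs, 160.0))  # well below the median
    print(bandwidth_regressed(previous_runs, 170.0))  # within tolerance
```

Using the median rather than the mean keeps a single bad historical run from dragging the baseline down. The same comparison can also be expressed as a Prometheus alerting rule once the bandwidth is stored as a metric.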

Prometheus Metrics for NCCL

Export NCCL communication metrics to Prometheus for GPU cluster monitoring.
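
The dashboard query later in this recipe reads a histogram named nccl_allreduce_duration_bucket, which DCGM Exporter does not provide, so the benchmark job has to publish it. The sketch below shows how measured AllReduce durations could be rendered as a Prometheus histogram in the text exposition format, using only the standard library. The bucket bounds are hypothetical, and how the durations are measured is left to your benchmark harness.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds; tune them to your message sizes.
BUCKETS = (0.001, 0.005, 0.01, 0.05, 0.1)


def render_histogram(name, durations, buckets=BUCKETS):
    """Render durations as Prometheus histogram lines (text exposition format)."""
    # One slot per bucket, plus a final slot for observations above every bound.
    counts = [0] * (len(buckets) + 1)
    for d in durations:
        # bisect_left finds the first bound >= d, matching the inclusive "le" label.
        counts[bisect_left(buckets, d)] += 1

    lines = [f"# TYPE {name} histogram"]
    cumulative = 0
    for bound, count in zip(buckets, counts):
        cumulative += count  # Prometheus buckets are cumulative
        lines.append(f'{name}_bucket{{le="{bound}"}} {cumulative}')
    lines.append(f'{name}_bucket{{le="+Inf"}} {len(durations)}')
    lines.append(f"{name}_sum {sum(durations)}")
    lines.append(f"{name}_count {len(durations)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_histogram("nccl_allreduce_duration", [0.0005, 0.004, 0.004, 0.2]))
```

In a long-running exporter the prometheus_client library handles this bookkeeping for you; the hand-rolled version here is mainly meant to show what the _bucket, _sum, and _count series contain.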

DCGM Integration

DCGM Exporter exposes GPU metrics that correlate with NCCL performance:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
        ports:
        - containerPort: 9400
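        env:
        # Collector list that includes the profiling (DCGM_FI_PROF_*) fields.
        # Check that the NVLink fields are enabled in this CSV for your exporter version.
        - name: DCGM_EXPORTER_COLLECTORS
          value: "/etc/dcgm-exporter/dcp-metrics-included.csv"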

Key Metrics to Monitor

Metric                          Description                       Alert Threshold
DCGM_FI_PROF_NVLINK_TX_BYTES    NVLink transmit bandwidth         <50% of peak
DCGM_FI_PROF_NVLINK_RX_BYTES    NVLink receive bandwidth          <50% of peak
DCGM_FI_DEV_GPU_UTIL            GPU utilization during NCCL ops   <70%
DCGM_FI_DEV_MEM_COPY_UTIL       Memory copy utilization           <40%

Grafana Dashboard

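# Import NCCL monitoring dashboard.
# The grafana_dashboard label is picked up by the Grafana dashboard sidecar, if it is enabled.
# nccl_allreduce_duration_bucket is a custom histogram that your benchmark job must export;
# DCGM Exporter does not provide it.
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: nccl-dashboard
  labels:
    grafana_dashboard: "1"
data:
  nccl-performance.json: |
    {
      "title": "NCCL Performance",
      "panels": [
        {"title": "NVLink Bandwidth", "targets": [{"expr": "rate(DCGM_FI_PROF_NVLINK_TX_BYTES[5m])"}]},
        {"title": "AllReduce Latency", "targets": [{"expr": "histogram_quantile(0.99, sum by (le) (rate(nccl_allreduce_duration_bucket[5m])))"}]}
      ]
    }
EOF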
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
