Monitor NCCL Benchmark Runs Prometheus & Gr...
Track NCCL benchmark outcomes and GPU telemetry over time with Prometheus and Grafana dashboards to detect communication regressions early.
💡 Quick Answer: Combine NCCL benchmark logs with GPU metrics (utilization, memory, interconnect indicators) in Grafana dashboards to detect performance drift across cluster changes.
Benchmark snapshots are useful, but trend-based monitoring catches regressions sooner.
Data Sources
- NCCL benchmark output logs
- DCGM exporter metrics
- Node and pod metadata labels
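As a minimal sketch of turning the first data source into a number you can chart, the snippet below pulls the summary bandwidth out of a benchmark log. It assumes the logs come from nccl-tests and end with an `# Avg bus bandwidth : <value>` line; the `node` and `test_profile` label names are hypothetical and stand in for whatever node and pod metadata you attach.

```python
import re

# Matches the summary line assumed to close an nccl-tests run, e.g.
# "# Avg bus bandwidth    : 185.432"
AVG_BUSBW_RE = re.compile(
    r"^#\s*Avg bus bandwidth\s*:\s*([0-9]*\.?[0-9]+)", re.MULTILINE
)


def parse_avg_busbw(log_text):
    """Return the average bus bandwidth in GB/s, or None if the line is absent."""
    match = AVG_BUSBW_RE.search(log_text)
    return float(match.group(1)) if match else None


def to_sample(log_text, node, test_profile):
    """Attach (hypothetical) node and test-profile labels to one parsed result."""
    return {
        "node": node,
        "test_profile": test_profile,
        "busbw_gbps": parse_avg_busbw(log_text),
    }
```

A run whose log has no summary line yields `None`, which is a convenient signal to count as a failed benchmark rather than a zero-bandwidth result.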
Dashboard Suggestions
- Benchmark run duration by node pair
- Effective bandwidth trend by test profile
- GPU utilization and memory during tests
- Failure count per benchmark type
Operational Practice
Schedule recurring benchmark jobs and alert when bandwidth drops below baseline thresholds.
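One way to make "below baseline" concrete is to compare each new run against the median of earlier runs. The sketch below is an illustration, not a prescribed method: the function name and the 10% tolerance are assumptions to tune for your cluster.

```python
from statistics import median


def is_regression(history_gbps, latest_gbps, tolerance=0.10):
    """Return True when the latest run falls more than `tolerance` below the
    median of previous runs. With no history yet, nothing is flagged."""
    if not history_gbps:
        return False
    baseline = median(history_gbps)
    return latest_gbps < baseline * (1.0 - tolerance)
```

The median is used here because a single bad run in the history shifts it less than it would shift a mean.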
Prometheus Metrics for NCCL
Export NCCL communication metrics to Prometheus for GPU cluster monitoring.
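NCCL benchmarks do not publish Prometheus metrics themselves, so results have to be rendered into the Prometheus text exposition format by your own tooling. The sketch below shows one possible shape; the metric name `nccl_benchmark_busbw_gbps` and the label names are hypothetical, and how the text reaches Prometheus (for example a Pushgateway or a node_exporter textfile directory) is left to your setup.

```python
METRIC = "nccl_benchmark_busbw_gbps"  # hypothetical metric name


def escape_label(value):
    """Escape backslash, double quote, and newline for a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_metrics(samples):
    """Render samples as Prometheus text exposition format.

    Each sample is a dict: {"labels": {name: value, ...}, "value": float}.
    """
    lines = [
        f"# HELP {METRIC} Average bus bandwidth reported by a benchmark run.",
        f"# TYPE {METRIC} gauge",
    ]
    for sample in samples:
        labels = ",".join(
            f'{name}="{escape_label(value)}"'
            for name, value in sorted(sample["labels"].items())
        )
        lines.append(f"{METRIC}{{{labels}}} {sample['value']}")
    return "\n".join(lines) + "\n"
```

A gauge fits here because each benchmark run reports a point-in-time measurement rather than a running total.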
DCGM Integration
DCGM Exporter exposes GPU metrics that correlate with NCCL performance:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
          ports:
            - containerPort: 9400
          env:
            - name: DCGM_EXPORTER_COLLECTORS
              value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
Key Metrics to Monitor
| Metric | Description | Alert Threshold |
|---|---|---|
| DCGM_FI_PROF_NVLINK_TX_BYTES | NVLink transmit bandwidth | <50% of peak |
| DCGM_FI_PROF_NVLINK_RX_BYTES | NVLink receive bandwidth | <50% of peak |
| DCGM_FI_DEV_GPU_UTIL | GPU utilization during NCCL ops | <70% |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory copy utilization | <40% |
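The thresholds in the table above can be checked mechanically once samples are collected. The following is a rough sketch under stated assumptions: the NVLink values are taken to be already expressed in bytes per second, and `nvlink_peak_bytes_per_s` is a hypothetical parameter you supply from your hardware's specification.

```python
# Percentage floors copied from the table above.
PERCENT_FLOORS = {
    "DCGM_FI_DEV_GPU_UTIL": 70.0,
    "DCGM_FI_DEV_MEM_COPY_UTIL": 40.0,
}
NVLINK_METRICS = (
    "DCGM_FI_PROF_NVLINK_TX_BYTES",
    "DCGM_FI_PROF_NVLINK_RX_BYTES",
)
NVLINK_FLOOR_FRACTION = 0.5  # "<50% of peak"


def breached_thresholds(sample, nvlink_peak_bytes_per_s):
    """Return the metric names in `sample` whose values fall below their floor.

    `sample` maps metric name to its observed value during a benchmark run.
    Metrics missing from the sample are skipped rather than flagged.
    """
    alerts = []
    for name, floor in PERCENT_FLOORS.items():
        if name in sample and sample[name] < floor:
            alerts.append(name)
    nvlink_floor = NVLINK_FLOOR_FRACTION * nvlink_peak_bytes_per_s
    for name in NVLINK_METRICS:
        if name in sample and sample[name] < nvlink_floor:
            alerts.append(name)
    return alerts
```

These floors only make sense while a benchmark is running; an idle GPU would breach all of them, so restrict the check to the benchmark window.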
Grafana Dashboard
# Import NCCL monitoring dashboard
# Note: nccl_allreduce_duration_bucket is not a DCGM metric; the latency panel
# assumes your benchmark job exports a histogram with that name.
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: nccl-dashboard
  labels:
    grafana_dashboard: "1"
data:
  nccl-performance.json: |
    {
      "title": "NCCL Performance",
      "panels": [
        {"title": "NVLink Bandwidth", "targets": [{"expr": "rate(DCGM_FI_PROF_NVLINK_TX_BYTES[5m])"}]},
        {"title": "AllReduce Latency", "targets": [{"expr": "histogram_quantile(0.99, sum by (le) (rate(nccl_allreduce_duration_bucket[5m])))"}]}
      ]
    }
EOF
