NVIDIA CNS with Insight Operator for Network Diagnostics
Deploy NVIDIA Cloud-Native Stack (CNS) with the Insight Operator and NVIDIA Insight tools for deep GPU fabric diagnostics. Collect NIC firmware health, link
π‘ Quick Answer: NVIDIA Cloud-Native Stack (CNS) bundles the GPU Operator, Network Operator, and Insight tools into a validated deployment. The Insight Operator adds deep NIC/switch diagnostics β firmware health checks, link quality monitoring, cable testing, and topology discovery β beyond what standard telemetry provides.
The Problem
- Standard monitoring shows counters but not root causes (why is the link degraded?)
- Firmware bugs, cable degradation, and transceiver aging cause intermittent NCCL failures
- No visibility into switch-to-NIC negotiation issues or speed downgrades
- Need proactive diagnostics β detect problems before training jobs fail
- Large GPU clusters have hundreds of links; manual
mlxlinkchecks donβt scale
The Solution
NVIDIA Cloud-Native Stack (CNS) Components
NVIDIA Cloud-Native Stack (CNS)
βββ GPU Operator β GPU drivers, device plugin, DCGM, MIG
βββ Network Operator β RDMA, SR-IOV, Multus, NIC drivers
βββ Insight Tools β Diagnostics, health, topology
βββ NIC Health Agent β Firmware checks, self-test
βββ Link Monitor β Signal quality, BER, FEC rates
βββ Cable Diagnostics β TDR testing, transceiver health
βββ Topology Discovery β Fabric map, path validationDeploy CNS with Insight Operator
# NicClusterPolicy with Insight tools enabled
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
name: nic-cluster-policy
spec:
# Standard Network Operator components
ofedDriver:
image: doca-driver
repository: nvcr.io/nvidia/mellanox
version: "24.10-0.7.0.0"
rdmaSharedDevicePlugin:
image: k8s-rdma-shared-dev-plugin
repository: nvcr.io/nvidia/mellanox
version: "1.5.1"
nvIpam:
image: nvidia-k8s-ipam
repository: ghcr.io/mellanox
version: "0.3.0"
# Insight tools β network diagnostics
nicFeatureDiscovery:
image: nic-feature-discovery
repository: nvcr.io/nvidia/mellanox
version: "0.1.0"
# Enable Insight Agent for deep diagnostics
insightAgent:
image: insight-agent
repository: nvcr.io/nvidia/mellanox
version: "1.2.0"
config:
# Health check interval (seconds)
healthCheckInterval: 300
# Enable all diagnostic modules
enableLinkMonitor: true
enableCableDiagnostics: true
enableFirmwareHealth: true
enableTopologyDiscovery: true
# Prometheus export
metricsPort: 9091Insight Agent DaemonSet
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: nvidia-insight-agent
namespace: nvidia-network-operator
spec:
selector:
matchLabels:
app: nvidia-insight-agent
template:
metadata:
labels:
app: nvidia-insight-agent
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9091"
spec:
nodeSelector:
feature.node.kubernetes.io/network-sriov.capable: "true"
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
hostNetwork: true
containers:
- name: insight-agent
image: nvcr.io/nvidia/mellanox/insight-agent:1.2.0
securityContext:
privileged: true
ports:
- containerPort: 9091
name: metrics
env:
- name: INSIGHT_HEALTH_INTERVAL
value: "300"
- name: INSIGHT_LINK_MONITOR_INTERVAL
value: "60"
- name: INSIGHT_CABLE_DIAG_INTERVAL
value: "3600"
- name: INSIGHT_LOG_LEVEL
value: "info"
volumeMounts:
- name: sys
mountPath: /sys
- name: dev
mountPath: /dev
- name: mst
mountPath: /dev/mst
resources:
requests:
cpu: "100m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
volumes:
- name: sys
hostPath:
path: /sys
- name: dev
hostPath:
path: /dev
- name: mst
hostPath:
path: /dev/mstNIC Health Diagnostics
# Run NIC self-test (mlxlink-based)
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
insight-cli nic-health --device mlx5_0
# Output:
# Device: mlx5_0 (ConnectX-7)
# ββββββββββββββββββββββββββββββββββββββββββββ
# Firmware Version: 28.42.1000 β
# Firmware Status: Valid
# PCI Link Speed: Gen5 x16 (64 GT/s) β
# PCI Width: x16 (no degradation) β
# Temperature: 52Β°C (max 105Β°C) β
# Link Speed: 400 Gb/s (NDR) β
# Physical Link: Up
# Logical Link: Up
# FEC Mode: RS-FEC (544,514) β
# Eye Opening: Good (margin: 42%)
# Self-Test: PASSED β
# Batch health check across all nodes
kubectl get pods -n nvidia-network-operator -l app=nvidia-insight-agent \
-o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | \
while read node; do
echo "=== $node ==="
kubectl exec -n nvidia-network-operator \
$(kubectl get pod -n nvidia-network-operator -l app=nvidia-insight-agent \
--field-selector spec.nodeName=$node -o name) -- \
insight-cli nic-health --all --format=short
doneLink Quality Monitoring
# Real-time link quality (BER, FEC errors, eye diagram)
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
insight-cli link-monitor --device mlx5_0 --port 1
# Output:
# Link Quality Report β mlx5_0/1
# ββββββββββββββββββββββββββββββββββββββββββββ
# Speed: 400 Gb/s (4x 106.25 GBaud)
# FEC Mode: RS-FEC (544,514)
# FEC Corrected CW: 1,234 (rate: 2.3e-8) β
Normal
# FEC Uncorrected CW: 0 β
# Raw BER: 1.2e-12 β
(threshold: 1e-6)
# Effective BER: 0 (post-FEC) β
# Symbol Errors: 0 β
# Link Flaps (24h): 0 β
# Eye Height (mV): 38 β
(min: 15)
# Eye Width (ps): 12 β
(min: 5)
#
# Status: HEALTHY β link operating within spec
# Check for degraded links (pre-failure detection)
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
insight-cli link-monitor --all --degraded-only
# Shows only links with:
# - FEC corrected rate > 1e-6 (high correction = cable/connector issue)
# - Eye opening < 50% margin
# - BER approaching threshold
# - Recent link flapsCable Diagnostics (TDR)
# Time-Domain Reflectometry β finds cable faults
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
insight-cli cable-diag --device mlx5_0 --port 1
# Output:
# Cable Diagnostics β mlx5_0/1
# ββββββββββββββββββββββββββββββββββββββββββββ
# Cable Type: AOC (Active Optical Cable)
# Vendor: Mellanox Technologies
# Part Number: MFS1S00-H030V
# Serial: MT2318FT01234
# Length: 30m
# Temperature: 42Β°C (max: 70Β°C) β
# TX Power (Lane 1-4): -1.2, -1.1, -1.3, -1.2 dBm β
# RX Power (Lane 1-4): -3.4, -3.2, -3.5, -3.3 dBm β
# Voltage: 3.28V β
# TDR Test: PASSED β no faults detected
# Cable Health Score: 98/100 β
#
# Warnings: None
# Identify failing cables before they cause job failures
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
insight-cli cable-diag --all --warnings-only
# Example output for degraded cable:
# β οΈ gpu-node-07 mlx5_2/1:
# RX Power Lane 3: -8.2 dBm (threshold: -7.0 dBm)
# Cable Health Score: 62/100
# Recommendation: Schedule cable replacement within 2 weeksTopology Discovery
# Discover GPU fabric topology (NIC β switch β NIC paths)
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
insight-cli topology discover
# Output:
# GPU Fabric Topology
# ββββββββββββββββββββββββββββββββββββββββββββ
#
# Leaf Switch: sw-leaf-01 (Quantum-2 QM9700)
# βββ Port 1 β gpu-node-01/mlx5_0 (400G NDR) β
# βββ Port 2 β gpu-node-01/mlx5_1 (400G NDR) β
# βββ Port 3 β gpu-node-02/mlx5_0 (400G NDR) β
# βββ Port 4 β gpu-node-02/mlx5_1 (400G NDR) β
# βββ Uplink β sw-spine-01 Port 33 (800G NDR)
#
# Leaf Switch: sw-leaf-02 (Quantum-2 QM9700)
# βββ Port 1 β gpu-node-03/mlx5_0 (400G NDR) β
# βββ Port 2 β gpu-node-03/mlx5_1 (400G NDR) β
# βββ Port 3 β gpu-node-04/mlx5_0 (400G NDR) β
# βββ Port 4 β gpu-node-04/mlx5_1 (200G HDR) β οΈ Speed mismatch!
# βββ Uplink β sw-spine-01 Port 34 (800G NDR)
# Validate NCCL path between two nodes
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
insight-cli topology path --src gpu-node-01 --dst gpu-node-03
# Path: gpu-node-01/mlx5_0 β sw-leaf-01:P1 β sw-spine-01:P33βP34 β sw-leaf-02:P1 β gpu-node-03/mlx5_0
# Hops: 3 (leaf-spine-leaf)
# Max bandwidth: 400 Gb/s (bottleneck: NIC speed)
# Latency estimate: ~2.1 ΞΌsPrometheus Metrics from Insight
# Insight-specific metrics beyond standard DTS counters
# TYPE nvidia_insight_nic_health_score gauge
nvidia_insight_nic_health_score{device="mlx5_0",node="gpu-node-01"} 100
# TYPE nvidia_insight_cable_health_score gauge
nvidia_insight_cable_health_score{device="mlx5_0",port="1",node="gpu-node-01"} 98
# TYPE nvidia_insight_fec_corrected_rate gauge
nvidia_insight_fec_corrected_rate{device="mlx5_0",port="1"} 2.3e-8
# TYPE nvidia_insight_eye_height_mv gauge
nvidia_insight_eye_height_mv{device="mlx5_0",port="1"} 38
# TYPE nvidia_insight_link_flaps_total counter
nvidia_insight_link_flaps_total{device="mlx5_0",port="1"} 0
# TYPE nvidia_insight_cable_temperature_celsius gauge
nvidia_insight_cable_temperature_celsius{device="mlx5_0",port="1"} 42Alert on Degradation (Before Failure)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: nvidia-insight-alerts
namespace: nvidia-network-operator
spec:
groups:
- name: gpu-fabric-health
rules:
- alert: NICHealthDegraded
expr: nvidia_insight_nic_health_score < 80
for: 5m
labels:
severity: warning
annotations:
summary: "NIC health degraded on {{ $labels.node }} {{ $labels.device }}"
description: "Health score {{ $value }}/100 β run insight-cli nic-health for details."
- alert: CableHealthCritical
expr: nvidia_insight_cable_health_score < 70
for: 5m
labels:
severity: warning
annotations:
summary: "Cable degrading on {{ $labels.node }} {{ $labels.device }}/{{ $labels.port }}"
description: "Score {{ $value }}/100 β schedule replacement."
- alert: HighFECCorrectionRate
expr: nvidia_insight_fec_corrected_rate > 1e-6
for: 10m
labels:
severity: warning
annotations:
summary: "High FEC correction rate on {{ $labels.device }}"
description: "Rate {{ $value }} β cable/connector degradation likely."
- alert: LinkFlap
expr: increase(nvidia_insight_link_flaps_total[1h]) > 0
labels:
severity: critical
annotations:
summary: "Link flap detected on {{ $labels.node }} {{ $labels.device }}"
description: "Unstable link β NCCL jobs will experience failures."
- alert: EyeOpeningLow
expr: nvidia_insight_eye_height_mv < 20
for: 5m
labels:
severity: warning
annotations:
summary: "Low eye opening on {{ $labels.device }}/{{ $labels.port }}"
description: "Signal integrity marginal ({{ $value }}mV) β approaching failure threshold."Manual Diagnostics via mlxlink
# If Insight Operator isn't deployed, use mlxlink directly
# (available in DOCA driver container or host)
# Link status and speed
mlxlink -d mlx5_0 -p 1
# Eye opening measurement
mlxlink -d mlx5_0 -p 1 --show_eye
# FEC counters and BER
mlxlink -d mlx5_0 -p 1 --show_fec
# Cable/transceiver info
mlxlink -d mlx5_0 -p 1 --show_module
# Full diagnostic dump
mlxlink -d mlx5_0 -p 1 -m --json
# Cable TDR test (takes ~30 seconds)
mlxcables -d mlx5_0 --port 1 --read_diagCommon Issues
Insight agent canβt access /dev/mst
- Cause: Mellanox Software Tools (MST) not started on host
- Fix: Run
mst starton host or ensure DOCA driver container starts MST
Cable diagnostics show βunsupportedβ
- Cause: Passive copper cables donβt support TDR or power monitoring
- Fix: Only AOC/transceiver-based cables support full diagnostics
Topology discovery incomplete
- Cause: No LLDP or subnet manager running; canβt discover switch hops
- Fix: Enable LLDP on switches; or configure OpenSM for IB fabrics
Health score fluctuates
- Cause: FEC corrections are normal at high speeds (400G+); transient spikes
- Fix: Alert on sustained degradation (>10min) not momentary spikes
Best Practices
- Run cable diagnostics weekly β catch degradation before failures
- Baseline eye opening after installation β know your βgoodβ values
- Alert on FEC uncorrected > 0 β this means data corruption is possible
- Topology discovery after any cabling change β verify paths are optimal
- Track cable temperature β overheating cables degrade faster
- Schedule replacements proactively β health score <70 = replace within 2 weeks
Key Takeaways
- NVIDIA CNS bundles GPU Operator + Network Operator + Insight tools
- Insight Operator provides NIC health, link quality, cable diagnostics, topology discovery
- Goes beyond counters β detects why links are degraded (eye opening, BER, cable power)
- Proactive: catches cable/connector degradation before NCCL jobs fail
mlxlinkis the manual equivalent for one-off diagnostics- Alert on health scores, FEC rates, eye opening, link flaps β not just error counters
- Topology discovery validates that NCCL traffic takes optimal switch paths

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
