Observability advanced ⏱ 15 minutes K8s 1.28+

NVIDIA CNS with Insight Operator for Network Diagnostics

Deploy NVIDIA Cloud-Native Stack (CNS) with the Insight Operator and NVIDIA Insight tools for deep GPU fabric diagnostics. Collect NIC firmware health, link

By Luca Berton • 📖 9 min read

💡 Quick Answer: NVIDIA Cloud-Native Stack (CNS) bundles the GPU Operator, Network Operator, and Insight tools into a validated deployment. The Insight Operator adds deep NIC/switch diagnostics (firmware health checks, link quality monitoring, cable testing, and topology discovery) beyond what standard telemetry provides.

The Problem

  • Standard monitoring shows counters but not root causes (why is the link degraded?)
  • Firmware bugs, cable degradation, and transceiver aging cause intermittent NCCL failures
  • No visibility into switch-to-NIC negotiation issues or speed downgrades
  • Need proactive diagnostics: detect problems before training jobs fail
  • Large GPU clusters have hundreds of links; manual mlxlink checks don’t scale

The Solution

NVIDIA Cloud-Native Stack (CNS) Components

NVIDIA Cloud-Native Stack (CNS)
├── GPU Operator           ← GPU drivers, device plugin, DCGM, MIG
├── Network Operator       ← RDMA, SR-IOV, Multus, NIC drivers
└── Insight Tools          ← Diagnostics, health, topology
    ├── NIC Health Agent   ← Firmware checks, self-test
    ├── Link Monitor       ← Signal quality, BER, FEC rates
    ├── Cable Diagnostics  ← TDR testing, transceiver health
    └── Topology Discovery ← Fabric map, path validation

Deploy CNS with Insight Operator

# NicClusterPolicy with Insight tools enabled
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  # Standard Network Operator components
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: "24.10-0.7.0.0"

  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/mellanox
    version: "1.5.1"

  nvIpam:
    image: nvidia-k8s-ipam
    repository: ghcr.io/mellanox
    version: "0.3.0"

  # Insight tools: network diagnostics
  nicFeatureDiscovery:
    image: nic-feature-discovery
    repository: nvcr.io/nvidia/mellanox
    version: "0.1.0"

  # Enable Insight Agent for deep diagnostics
  insightAgent:
    image: insight-agent
    repository: nvcr.io/nvidia/mellanox
    version: "1.2.0"
    config:
      # Health check interval (seconds)
      healthCheckInterval: 300
      # Enable all diagnostic modules
      enableLinkMonitor: true
      enableCableDiagnostics: true
      enableFirmwareHealth: true
      enableTopologyDiscovery: true
      # Prometheus export
      metricsPort: 9091

Insight Agent DaemonSet

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-insight-agent
  namespace: nvidia-network-operator
spec:
  selector:
    matchLabels:
      app: nvidia-insight-agent
  template:
    metadata:
      labels:
        app: nvidia-insight-agent
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9091"
    spec:
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      hostNetwork: true
      containers:
        - name: insight-agent
          image: nvcr.io/nvidia/mellanox/insight-agent:1.2.0
          securityContext:
            privileged: true
          ports:
            - containerPort: 9091
              name: metrics
          env:
            - name: INSIGHT_HEALTH_INTERVAL
              value: "300"
            - name: INSIGHT_LINK_MONITOR_INTERVAL
              value: "60"
            - name: INSIGHT_CABLE_DIAG_INTERVAL
              value: "3600"
            - name: INSIGHT_LOG_LEVEL
              value: "info"
          volumeMounts:
            - name: sys
              mountPath: /sys
            - name: dev
              mountPath: /dev
            - name: mst
              mountPath: /dev/mst
          resources:
            requests:
              cpu: "100m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
      volumes:
        - name: sys
          hostPath:
            path: /sys
        - name: dev
          hostPath:
            path: /dev
        - name: mst
          hostPath:
            path: /dev/mst
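
One way to confirm that the DaemonSet above has rolled out to every matching node is to compare its desired and ready pod counts. The sketch below assumes the DaemonSet name and namespace from the manifest; `ds_ready` is a hypothetical helper written for this recipe, not part of any NVIDIA tooling.

```shell
# ds_ready: hypothetical helper. Reads "<desired> <ready>" on stdin and
# succeeds only when at least one pod is desired and all of them are ready.
ds_ready() {
  read -r desired ready || true          # tolerate input without a trailing newline
  case "$desired$ready" in
    ''|*[!0-9]*) return 2 ;;             # reject empty or non-numeric input
  esac
  [ -n "$desired" ] && [ -n "$ready" ] || return 2
  [ "$desired" -gt 0 ] && [ "$desired" -eq "$ready" ]
}

# Usage against a live cluster (standard DaemonSet status fields):
#   kubectl get ds nvidia-insight-agent -n nvidia-network-operator \
#     -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}' \
#     | ds_ready && echo "insight agent ready on all nodes"
```

A zero desired count is treated as a failure on purpose: it usually means the nodeSelector matched no nodes.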

NIC Health Diagnostics

# Run NIC self-test (mlxlink-based)
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
  insight-cli nic-health --device mlx5_0

# Output:
# Device: mlx5_0 (ConnectX-7)
# ────────────────────────────────────────────
# Firmware Version:    28.42.1000 ✅
# Firmware Status:     Valid
# PCI Link Speed:      Gen5 x16 (32 GT/s per lane) ✅
# PCI Width:           x16 (no degradation) ✅
# Temperature:         52°C (max 105°C) ✅
# Link Speed:          400 Gb/s (NDR) ✅
# Physical Link:       Up
# Logical Link:        Up
# FEC Mode:            RS-FEC (544,514) ✅
# Eye Opening:         Good (margin: 42%)
# Self-Test:           PASSED ✅

# Batch health check across all nodes
kubectl get pods -n nvidia-network-operator -l app=nvidia-insight-agent \
  -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | \
  while read -r node; do
    echo "=== $node ==="
    # Resolve the agent pod on this node, then run the health check in it
    kubectl exec -n nvidia-network-operator \
      "$(kubectl get pod -n nvidia-network-operator -l app=nvidia-insight-agent \
        --field-selector "spec.nodeName=$node" -o name)" -- \
      insight-cli nic-health --all --format=short
  done

Link Quality Monitoring

# Real-time link quality (BER, FEC errors, eye diagram)
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
  insight-cli link-monitor --device mlx5_0 --port 1

# Output:
# Link Quality Report: mlx5_0/1
# ────────────────────────────────────────────
# Speed:                400 Gb/s (4x 106.25 Gb/s PAM4 lanes)
# FEC Mode:             RS-FEC (544,514)
# FEC Corrected CW:     1,234 (rate: 2.3e-8) ✅ Normal
# FEC Uncorrected CW:   0 ✅
# Raw BER:              1.2e-12 ✅ (threshold: 1e-6)
# Effective BER:        0 (post-FEC) ✅
# Symbol Errors:        0 ✅
# Link Flaps (24h):     0 ✅
# Eye Height (mV):      38 ✅ (min: 15)
# Eye Width (ps):       12 ✅ (min: 5)
#
# Status: HEALTHY, link operating within spec

# Check for degraded links (pre-failure detection)
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
  insight-cli link-monitor --all --degraded-only

# Shows only links with:
# - FEC corrected rate > 1e-6 (high correction = cable/connector issue)
# - Eye opening < 50% margin
# - BER approaching threshold
# - Recent link flaps
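
The degraded-link criteria listed above can be sketched as a small classifier. This is a minimal illustration, assuming the three numeric thresholds from the list (FEC corrected rate above 1e-6, eye margin below 50%, any recent flap); `classify_link` is a hypothetical helper and does not reproduce how `insight-cli` decides internally.

```shell
# classify_link FEC_RATE EYE_MARGIN_PCT FLAPS_24H
# Hypothetical helper: prints "healthy" or "degraded:<reasons>".
classify_link() {
  awk -v fec="$1" -v eye="$2" -v flaps="$3" 'BEGIN {
    n = 0
    if (fec + 0 > 1e-6) reasons[++n] = "fec_rate"     # heavy correction load
    if (eye + 0 < 50)   reasons[++n] = "eye_margin"   # shrinking signal margin
    if (flaps + 0 > 0)  reasons[++n] = "link_flaps"   # link went down recently
    if (n == 0) { print "healthy"; exit }
    out = "degraded:"
    for (i = 1; i <= n; i++) out = out (i > 1 ? "," : "") reasons[i]
    print out
  }'
}

# Example: classify_link 5e-6 42 1
```

Keeping the reasons in the output makes it easier to tell a cable problem (FEC, eye) from an instability problem (flaps).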

Cable Diagnostics (TDR)

# Time-Domain Reflectometry: finds cable faults
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
  insight-cli cable-diag --device mlx5_0 --port 1

# Output:
# Cable Diagnostics: mlx5_0/1
# ────────────────────────────────────────────
# Cable Type:           AOC (Active Optical Cable)
# Vendor:               Mellanox Technologies
# Part Number:          MFS1S00-H030V
# Serial:               MT2318FT01234
# Length:               30m
# Temperature:          42°C (max: 70°C) ✅
# TX Power (Lane 1-4):  -1.2, -1.1, -1.3, -1.2 dBm ✅
# RX Power (Lane 1-4):  -3.4, -3.2, -3.5, -3.3 dBm ✅
# Voltage:              3.28V ✅
# TDR Test:             PASSED, no faults detected
# Cable Health Score:   98/100 ✅
#
# Warnings: None

# Identify failing cables before they cause job failures
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
  insight-cli cable-diag --all --warnings-only

# Example output for degraded cable:
# ⚠️  gpu-node-07 mlx5_2/1:
#   RX Power Lane 3: -8.2 dBm (threshold: -7.0 dBm)
#   Cable Health Score: 62/100
#   Recommendation: Schedule cable replacement within 2 weeks
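
The per-lane RX power check behind a warning like the one above can be approximated with a short filter. This sketch assumes the full report format shown earlier (a single `RX Power (Lane 1-4):` line with comma-separated dBm values) and the -7.0 dBm threshold from the example; `rx_lanes_below` is a hypothetical helper.

```shell
# rx_lanes_below THRESHOLD_DBM   (reads cable-diag report text on stdin)
# Hypothetical helper: prints each lane whose RX power is below the threshold.
rx_lanes_below() {
  awk -v thr="$1" '
    /RX Power \(Lane/ {
      sub(/^.*:[ \t]*/, "")        # drop the label up to the colon
      sub(/[ \t]*dBm.*$/, "")      # drop the unit and any trailing status mark
      n = split($0, lane, /,[ \t]*/)
      for (i = 1; i <= n; i++)
        if (lane[i] + 0 < thr + 0) printf "lane %d: %s dBm\n", i, lane[i]
    }'
}

# Usage (command taken from the recipe above):
#   kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
#     insight-cli cable-diag --device mlx5_0 --port 1 | rx_lanes_below -7.0
```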

Topology Discovery

# Discover GPU fabric topology (NIC → switch → NIC paths)
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
  insight-cli topology discover

# Output:
# GPU Fabric Topology
# ════════════════════════════════════════════
# 
# Leaf Switch: sw-leaf-01 (Quantum-2 QM9700)
# ├── Port 1  ← gpu-node-01/mlx5_0 (400G NDR) ✅
# ├── Port 2  ← gpu-node-01/mlx5_1 (400G NDR) ✅
# ├── Port 3  ← gpu-node-02/mlx5_0 (400G NDR) ✅
# ├── Port 4  ← gpu-node-02/mlx5_1 (400G NDR) ✅
# └── Uplink  → sw-spine-01 Port 33 (800G, 2x NDR)
#
# Leaf Switch: sw-leaf-02 (Quantum-2 QM9700)
# ├── Port 1  ← gpu-node-03/mlx5_0 (400G NDR) ✅
# ├── Port 2  ← gpu-node-03/mlx5_1 (400G NDR) ✅
# ├── Port 3  ← gpu-node-04/mlx5_0 (400G NDR) ✅
# ├── Port 4  ← gpu-node-04/mlx5_1 (200G HDR) ⚠️ Speed mismatch!
# └── Uplink  → sw-spine-01 Port 34 (800G, 2x NDR)

# Validate NCCL path between two nodes
kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
  insight-cli topology path --src gpu-node-01 --dst gpu-node-03

# Path: gpu-node-01/mlx5_0 → sw-leaf-01:P1 → sw-spine-01:P33→P34 → sw-leaf-02:P1 → gpu-node-03/mlx5_0
# Hops: 3 (leaf-spine-leaf)
# Max bandwidth: 400 Gb/s (bottleneck: NIC speed)
# Latency estimate: ~2.1 μs
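
On a large fabric, a speed mismatch like the one on gpu-node-04/mlx5_1 is easy to miss by eye. The sketch below scans topology text for NIC ports that did not negotiate the expected speed. It assumes the `node/mlx5_N (NNNG ...)` layout from the sample output; `find_speed_mismatch` is a hypothetical helper.

```shell
# find_speed_mismatch EXPECTED_SPEED   (reads topology text on stdin)
# Hypothetical helper: prints "<node>/<device> <speed>" for every NIC port
# whose negotiated speed differs from EXPECTED_SPEED (for example 400G).
find_speed_mismatch() {
  awk -v want="$1" '
    /Port [0-9]+/ && match($0, /[A-Za-z0-9_-]+\/mlx5_[0-9]+ \([0-9]+G/) {
      entry = substr($0, RSTART, RLENGTH)   # e.g. "gpu-node-04/mlx5_1 (200G"
      split(entry, part, " \\(")            # part[1] = NIC, part[2] = speed
      if (part[2] != want) print part[1], part[2]
    }'
}

# Usage (command taken from the recipe above):
#   kubectl exec -n nvidia-network-operator ds/nvidia-insight-agent -- \
#     insight-cli topology discover | find_speed_mismatch 400G
```

Uplink lines are skipped because they carry no `node/mlx5_N` token, so only host-facing ports are compared.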

Prometheus Metrics from Insight

# Insight-specific metrics beyond standard DTS counters
# TYPE nvidia_insight_nic_health_score gauge
nvidia_insight_nic_health_score{device="mlx5_0",node="gpu-node-01"} 100

# TYPE nvidia_insight_cable_health_score gauge
nvidia_insight_cable_health_score{device="mlx5_0",port="1",node="gpu-node-01"} 98

# TYPE nvidia_insight_fec_corrected_rate gauge
nvidia_insight_fec_corrected_rate{device="mlx5_0",port="1"} 2.3e-8

# TYPE nvidia_insight_eye_height_mv gauge
nvidia_insight_eye_height_mv{device="mlx5_0",port="1"} 38

# TYPE nvidia_insight_link_flaps_total counter
nvidia_insight_link_flaps_total{device="mlx5_0",port="1"} 0

# TYPE nvidia_insight_cable_temperature_celsius gauge
nvidia_insight_cable_temperature_celsius{device="mlx5_0",port="1"} 42
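
For a quick spot check without Prometheus, the exported text can be filtered directly. This is a minimal sketch that assumes the exposition lines look like the samples above (no timestamps, no spaces inside label values) and that the agent serves them at `/metrics` on port 9091, which the original does not state; `metrics_below` is a hypothetical helper.

```shell
# metrics_below METRIC THRESHOLD   (reads Prometheus text format on stdin)
# Hypothetical helper: prints every series of METRIC whose value is below
# THRESHOLD. Assumes the value is the last whitespace-separated field.
metrics_below() {
  awk -v name="$1" -v thr="$2" '
    /^#/ || NF < 2 { next }                    # skip comments and blank lines
    index($0, name "{") == 1 || $1 == name {   # match the exact metric name
      if ($NF + 0 < thr + 0) print $0
    }'
}

# Usage (the /metrics path is an assumption):
#   curl -s http://<node-ip>:9091/metrics \
#     | metrics_below nvidia_insight_cable_health_score 70
```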

Alert on Degradation (Before Failure)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nvidia-insight-alerts
  namespace: nvidia-network-operator
spec:
  groups:
    - name: gpu-fabric-health
      rules:
        - alert: NICHealthDegraded
          expr: nvidia_insight_nic_health_score < 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "NIC health degraded on {{ $labels.node }} {{ $labels.device }}"
            description: "Health score {{ $value }}/100; run insight-cli nic-health for details."

        - alert: CableHealthCritical
          expr: nvidia_insight_cable_health_score < 70
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Cable degrading on {{ $labels.node }} {{ $labels.device }}/{{ $labels.port }}"
            description: "Score {{ $value }}/100; schedule replacement."

        - alert: HighFECCorrectionRate
          expr: nvidia_insight_fec_corrected_rate > 1e-6
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High FEC correction rate on {{ $labels.device }}"
            description: "Rate {{ $value }}; cable/connector degradation likely."

        - alert: LinkFlap
          expr: increase(nvidia_insight_link_flaps_total[1h]) > 0
          labels:
            severity: critical
          annotations:
            summary: "Link flap detected on {{ $labels.node }} {{ $labels.device }}"
            description: "Unstable link; NCCL jobs will experience failures."

        - alert: EyeOpeningLow
          expr: nvidia_insight_eye_height_mv < 20
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Low eye opening on {{ $labels.device }}/{{ $labels.port }}"
            description: "Signal integrity marginal ({{ $value }}mV), approaching failure threshold."

Manual Diagnostics with mlxlink

# If Insight Operator isn't deployed, use mlxlink directly
# (available in DOCA driver container or host)

# Link status and speed
mlxlink -d mlx5_0 -p 1

# Eye opening measurement
mlxlink -d mlx5_0 -p 1 --show_eye

# Physical counters and BER (-c), plus FEC capabilities
mlxlink -d mlx5_0 -p 1 -c --show_fec

# Cable/transceiver info
mlxlink -d mlx5_0 -p 1 --show_module

# Module info as JSON, convenient for scripting
mlxlink -d mlx5_0 -p 1 -m --json

# Cable digital diagnostics (DDM: temperature, voltage, TX/RX power)
mlxlink -d mlx5_0 -p 1 --cable --ddm
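
The manual commands above check one device at a time. A small loop over the RDMA devices the kernel exposes covers every NIC on a host. This sketch assumes devices appear as `mlx5_*` entries under `/sys/class/infiniband` and that port 1 is the port of interest; `list_mlx_devices` is a hypothetical helper.

```shell
# list_mlx_devices [SYSFS_DIR]
# Hypothetical helper: prints one mlx5_* device name per line.
# SYSFS_DIR defaults to /sys/class/infiniband.
list_mlx_devices() {
  dir="${1:-/sys/class/infiniband}"
  for path in "$dir"/mlx5_*; do
    [ -e "$path" ] && basename "$path"   # skip the literal pattern if nothing matched
  done
  return 0
}

# Usage on a host with mlxlink installed:
#   for dev in $(list_mlx_devices); do
#     echo "=== $dev ==="
#     mlxlink -d "$dev" -p 1
#   done
```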

Common Issues

Insight agent can’t access /dev/mst

  • Cause: Mellanox Software Tools (MST) not started on host
  • Fix: Run mst start on host or ensure DOCA driver container starts MST

Cable diagnostics show “unsupported”

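  • Cause: Passive copper cables don’t support TDR or power monitoring
  • Fix: None needed; this is expected for passive copper. Only AOC/transceiver-based cables support full diagnostics, so rely on link-monitor metrics (BER, FEC) for copper links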

Topology discovery incomplete

  • Cause: No LLDP or subnet manager running; can’t discover switch hops
  • Fix: Enable LLDP on switches; or configure OpenSM for IB fabrics

Health score fluctuates

  • Cause: FEC corrections are normal at high speeds (400G+); transient spikes
  • Fix: Alert on sustained degradation (>10min) not momentary spikes
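
The difference between a transient spike and sustained degradation can be made concrete with a run-length check over recent samples. This is a sketch under stated assumptions: one numeric sample per line, oldest first, taken at a fixed interval (with the 60-second link monitor interval from the DaemonSet, 10 samples cover about 10 minutes); `sustained_above` is a hypothetical helper.

```shell
# sustained_above THRESHOLD COUNT   (reads one numeric sample per line on stdin)
# Hypothetical helper: exits 0 only if the most recent COUNT samples are all
# above THRESHOLD. A single in-spec sample resets the run.
sustained_above() {
  awk -v thr="$1" -v need="$2" '
    { run = ($1 + 0 > thr + 0) ? run + 1 : 0 }
    END { exit !(run >= need + 0) }'
}

# Example: three FEC corrected-rate samples, all above 1e-6
#   printf '%s\n' 2e-6 3e-6 2e-6 | sustained_above 1e-6 3 && echo "sustained"
```

This mirrors what the `for: 10m` clause does in the PrometheusRule: the condition must hold continuously before it counts.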

Best Practices

  1. Run cable diagnostics weekly: catch degradation before failures
  2. Baseline eye opening after installation: know your “good” values
  3. Alert on FEC uncorrected > 0: this means data corruption is possible
  4. Topology discovery after any cabling change: verify paths are optimal
  5. Track cable temperature: overheating cables degrade faster
  6. Schedule replacements proactively: health score <70 = replace within 2 weeks

Key Takeaways

  • NVIDIA CNS bundles GPU Operator + Network Operator + Insight tools
  • Insight Operator provides NIC health, link quality, cable diagnostics, topology discovery
  • Goes beyond counters: detects why links are degraded (eye opening, BER, cable power)
  • Proactive: catches cable/connector degradation before NCCL jobs fail
  • mlxlink is the manual equivalent for one-off diagnostics
  • Alert on health scores, FEC rates, eye opening, link flaps, not just error counters
  • Topology discovery validates that NCCL traffic takes optimal switch paths
#nvidia #cns #insight #networking #diagnostics #observability
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.