NVIDIA Nsight Operator for GPU Profiling on Kubernetes
Deploy NVIDIA Nsight Systems and Nsight Compute on Kubernetes for GPU workload profiling. Capture kernel traces, memory bandwidth, SM occupancy, and NCCL
💡 Quick Answer: NVIDIA Nsight Systems profiles GPU timelines (kernels, memory, NCCL comms) and Nsight Compute provides per-kernel metrics (occupancy, memory throughput, warp stalls). On Kubernetes, run profiling as sidecar containers or one-shot Jobs with
CAP_SYS_ADMIN and access to /dev/nvidia*. Export .nsys-rep and .ncu-rep files for analysis.
The Problem
- GPU utilization shows 80% but you don't know if kernels are efficient or just queued
- NCCL all-reduce takes 40% of step time but you can't see which collective is slow
- Memory-bound vs compute-bound: you need per-kernel analysis to optimize
- Distributed training has idle bubbles: where exactly is the pipeline stall?
- Standard monitoring (DCGM) shows utilization, not why it's low
The Solution
Nsight Systems: Timeline Profiling
# Profile a training job with Nsight Systems
apiVersion: batch/v1
kind: Job
metadata:
  name: nsight-systems-profile
  namespace: ai-workloads
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: profiler
          image: nvcr.io/nvidia/pytorch:24.05-py3
          command: ["/bin/bash", "-c"]
          args:
            - |
              # Profile the training script with Nsight Systems
              nsys profile \
                --trace=cuda,nvtx,osrt,cudnn,cublas,nccl \
                --sample=cpu \
                --output=/profiles/training_profile \
                --export=sqlite \
                --force-overwrite=true \
                --duration=120 \
                --delay=30 \
                python /workspace/train.py \
                  --model llama-3-8b \
                  --batch-size 4 \
                  --gradient-accumulation-steps 8
              echo "Profile saved to /profiles/training_profile.nsys-rep"
              ls -lh /profiles/
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]  # Required for GPU profiling
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0,1,2,3"
          volumeMounts:
            - name: profiles
              mountPath: /profiles
            - name: shm
              mountPath: /dev/shm
          resources:
            limits:
              nvidia.com/gpu: "4"
              memory: "64Gi"
            requests:
              cpu: "8"
              memory: "64Gi"
      volumes:
        - name: profiles
          persistentVolumeClaim:
            claimName: nsight-profiles-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 32Gi
Nsight Systems CLI Options
# Key nsys profile flags:
#   --trace              APIs to trace:
#       cuda   - CUDA API calls and kernel launches
#       nvtx   - Developer annotations (PyTorch/DeepSpeed markers)
#       osrt   - OS runtime (pthread, file I/O)
#       cudnn  - cuDNN operations (conv, attention)
#       cublas - cuBLAS GEMM operations
#       nccl   - NCCL collective communications
#   --cuda-memory-usage  Track allocations
#   --gpuctxsw           GPU context switches
#   --sample=cpu         CPU call stack sampling
#   --backtrace=dwarf    Full backtraces
#   --output             Output path
#   --export             Also export for scripting
#   --duration=120       Profile for 120 seconds
#   --delay=30           Skip warmup (first 30s)
#   --kill=sigterm       Clean shutdown after duration
# (Comments are kept above the command: a comment after a trailing
# backslash would break the shell line continuation.)
nsys profile \
  --trace=cuda,nvtx,osrt,cudnn,cublas,nccl \
  --cuda-memory-usage=true \
  --gpuctxsw=true \
  --sample=cpu \
  --backtrace=dwarf \
  --output=/profiles/report \
  --export=sqlite,json \
  --duration=120 \
  --delay=30 \
  --kill=sigterm \
  python train.py
# Analyze from CLI (headless)
nsys stats /profiles/report.nsys-rep
# Shows: top GPU kernels, CUDA API summary, NCCL ops, memory ops
# Export specific stats
nsys stats --report cuda_gpu_kern_sum /profiles/report.nsys-rep
nsys stats --report nccl_gpu_proj /profiles/report.nsys-rep
Nsight Compute: Per-Kernel Analysis
# Deep-dive into specific GPU kernels
apiVersion: batch/v1
kind: Job
metadata:
  name: nsight-compute-profile
  namespace: ai-workloads
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: profiler
          image: nvcr.io/nvidia/pytorch:24.05-py3
          command: ["/bin/bash", "-c"]
          args:
            - |
              # Profile specific kernels with Nsight Compute
              ncu \
                --set full \
                --target-processes all \
                --kernel-name "regex:.*gemm.*|.*attention.*" \
                --launch-count 10 \
                --output /profiles/kernel_analysis \
                python /workspace/inference.py \
                  --model llama-3-70b \
                  --prompt "Explain Kubernetes"
              echo "Kernel analysis saved"
              ncu --import /profiles/kernel_analysis.ncu-rep --summary
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]
          volumeMounts:
            - name: profiles
              mountPath: /profiles
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "32Gi"
      volumes:
        - name: profiles
          persistentVolumeClaim:
            claimName: nsight-profiles-pvc

# Nsight Compute CLI options
# Flag reference (kept above the command so the line continuations stay valid):
#   --set full                         All metrics (or: basic, roofline)
#   --section SpeedOfLight             SM and memory throughput %
#   --section MemoryWorkloadAnalysis   Memory access patterns
#   --section Occupancy                Warp occupancy analysis
#   --section WarpStateStatistics      Why warps are stalled
#   --kernel-name "regex:.*gemm.*"     Only profile GEMM kernels
#   --launch-skip 100                  Skip warmup launches
#   --launch-count 10                  Profile 10 kernel launches
#   --target-processes all             Profile child processes too
ncu \
  --set full \
  --section SpeedOfLight \
  --section MemoryWorkloadAnalysis \
  --section Occupancy \
  --section WarpStateStatistics \
  --kernel-name "regex:.*gemm.*" \
  --launch-skip 100 \
  --launch-count 10 \
  --target-processes all \
  --output /profiles/report \
  python inference.py
# Key metrics from Nsight Compute:
# - SM Throughput %: are SMs fully utilized?
# - Memory Throughput %: saturating HBM bandwidth?
# - Achieved Occupancy: warps active vs theoretical max
# - Warp Stall Reasons: memory dependency? instruction fetch?
Profile Distributed Training (Multi-Node)
# Profile NCCL communications across nodes
apiVersion: batch/v1
kind: Job
metadata:
  name: nsight-distributed-profile
  namespace: ai-workloads
spec:
  completions: 2
  parallelism: 2
  completionMode: Indexed  # Needed so Pods get the job-completion-index annotation
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: profiler
          image: nvcr.io/nvidia/pytorch:24.05-py3
          command: ["/bin/bash", "-c"]
          args:
            - |
              # Enable NCCL debug for correlation
              export NCCL_DEBUG=INFO
              export NCCL_DEBUG_SUBSYS=ALL
              # Profile with NCCL tracing
              nsys profile \
                --trace=cuda,nccl,nvtx,osrt \
                --cuda-memory-usage=true \
                --output=/profiles/rank_${RANK}_profile \
                --delay=60 \
                --duration=180 \
                torchrun \
                  --nproc_per_node=4 \
                  --nnodes=2 \
                  --node_rank=${RANK} \
                  --master_addr=${MASTER_ADDR} \
                  --master_port=29500 \
                  train_distributed.py
          securityContext:
            capabilities:
              add: ["SYS_ADMIN"]
          env:
            - name: RANK
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
            - name: MASTER_ADDR
              value: "nsight-distributed-profile-0.ai-workloads.svc"
          volumeMounts:
            - name: profiles
              mountPath: /profiles
            - name: shm
              mountPath: /dev/shm
          resources:
            limits:
              nvidia.com/gpu: "4"
      volumes:
        - name: profiles
          persistentVolumeClaim:
            claimName: nsight-profiles-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 32Gi

# Analyze NCCL timelines across ranks
nsys stats --report nccl_gpu_proj rank_0_profile.nsys-rep
nsys stats --report nccl_gpu_proj rank_1_profile.nsys-rep
# Example output:
# NCCL Operation Summary (Rank 0):
# ─────────────────────────────────────────────────────────
# Operation       Count   Avg (ms)   Total (ms)   % of Step
# AllReduce         240       12.3         2952       38.2%
# ReduceScatter     240        4.1          984       12.7%
# AllGather         240        3.8          912       11.8%
# ─────────────────────────────────────────────────────────
# Total NCCL:                              4848       62.7%
#
# ⚠️ NCCL dominates step time: overlap compute with communication
Sidecar Pattern for Production Profiling
# Add Nsight as a sidecar to an existing training Pod
apiVersion: v1
kind: Pod
metadata:
  name: training-with-profiler
spec:
  containers:
    # Main training container
    - name: training
      image: registry.example.com/my-training:v2.1
      command: ["python", "train.py"]
      resources:
        limits:
          nvidia.com/gpu: "8"
    # Nsight Systems sidecar: attaches to running process
    - name: nsight-sidecar
      image: nvcr.io/nvidia/nsight-systems:2024.5
      command: ["/bin/bash", "-c"]
      args:
        - |
          # Wait for training to start
          sleep 60
          # Find the training PID
          TRAIN_PID=$(pgrep -f "python train.py")
          # Attach and profile for 120 seconds
          nsys profile \
            --trace=cuda,nccl,nvtx \
            --pid=$TRAIN_PID \
            --duration=120 \
            --output=/profiles/attached_profile
          echo "Profile captured"
          sleep infinity
      securityContext:
        capabilities:
          add: ["SYS_ADMIN", "SYS_PTRACE"]
      volumeMounts:
        - name: profiles
          mountPath: /profiles
  shareProcessNamespace: true  # Required for PID attachment
  volumes:
    - name: profiles
      persistentVolumeClaim:
        claimName: nsight-profiles-pvc
Retrieve and Analyze Profiles
# Copy profiles from PVC to local machine
kubectl cp ai-workloads/nsight-systems-profile:/profiles/training_profile.nsys-rep \
  ./training_profile.nsys-rep
# Open in Nsight Systems GUI (local workstation)
nsys-ui ./training_profile.nsys-rep
# Or generate text summary
nsys stats ./training_profile.nsys-rep --report cuda_gpu_kern_sum
# Top GPU Kernels:
# ───────────────────────────────────────────────────────────────────────
# Kernel Name                               Time (ms)   Count   Avg (μs)
# volta_fp16_s884gemm_256x128_ldg8             1823.4    4800      379.9
# fmha_v2_flash_attention_fp16_512              892.1    2400      371.7
# nccl_allreduce_ring_ll128                     672.3     240     2801.3
# void at::native::vectorized_elementwise       234.5    9600       24.4
# Roofline analysis (requires ncu with --set roofline)
ncu --import kernel_analysis.ncu-rep --page roofline
Automated Profiling with CronJob
# Periodic profiling to track performance regression
apiVersion: batch/v1
kind: CronJob
metadata:
  name: periodic-gpu-profile
  namespace: ai-workloads
spec:
  schedule: "0 2 * * 1"  # Monday 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: benchmark-profile
              image: nvcr.io/nvidia/pytorch:24.05-py3
              command: ["/bin/bash", "-c"]
              args:
                - |
                  DATE=$(date +%Y%m%d)
                  nsys profile \
                    --trace=cuda,nccl \
                    --output=/profiles/weekly_${DATE} \
                    --duration=300 \
                    python /workspace/benchmark.py
                  # Compare with baseline
                  nsys stats --report cuda_gpu_kern_sum \
                    /profiles/weekly_${DATE}.nsys-rep > /profiles/weekly_${DATE}_stats.txt
              securityContext:
                capabilities:
                  add: ["SYS_ADMIN"]
              volumeMounts:
                - name: profiles
                  mountPath: /profiles
              resources:
                limits:
                  nvidia.com/gpu: "1"
          volumes:
            - name: profiles
              persistentVolumeClaim:
                claimName: nsight-profiles-pvc
Common Issues
"ERR_NVGPUCTRPERM: permission issue"
- Cause: Missing CAP_SYS_ADMIN capability
- Fix: Add securityContext.capabilities.add: ["SYS_ADMIN"]; or set privileged: true
Profile file is empty (0 bytes)
- Cause: Application crashed before profile flush; or --duration too short for --delay
- Fix: Ensure duration > delay; add --kill=none to let app finish naturally
"Cannot attach to process" in sidecar
- Cause: shareProcessNamespace not enabled; or missing SYS_PTRACE
- Fix: Set shareProcessNamespace: true on Pod spec; add CAP_SYS_PTRACE
Nsight Compute causes 10-100x slowdown
- Cause: Normal; ncu replays kernels multiple times to collect all metrics
- Fix: Use --launch-count to limit profiled launches; use --set basic for less overhead
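For the empty-profile issue above, a quick check can catch zero-byte reports before anyone tries to open them. The following is a minimal sketch, not part of Nsight itself; the helper name find_empty_profiles is hypothetical, and it only assumes the reports use the .nsys-rep and .ncu-rep extensions mentioned in this recipe.

```python
from pathlib import Path

# Report extensions used in this recipe (Nsight Systems and Nsight Compute).
REPORT_SUFFIXES = {".nsys-rep", ".ncu-rep"}


def find_empty_profiles(profile_dir):
    """Return the sorted names of zero-byte report files in profile_dir."""
    empty = []
    for path in Path(profile_dir).iterdir():
        is_report = path.is_file() and path.suffix in REPORT_SUFFIXES
        if is_report and path.stat().st_size == 0:
            empty.append(path.name)
    return sorted(empty)
```

Calling find_empty_profiles("/profiles") at the end of a profiling Job (the mount path used in the manifests above) would list reports that were never flushed, which is a hint to revisit --delay, --duration, or --kill.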
Best Practices
- Nsight Systems first, Compute second: Systems shows the timeline (where time goes); Compute shows why a specific kernel is slow
- Profile after warmup: use --delay to skip initialization
- Limit duration: 60-180s is usually enough for steady-state
- Use NVTX markers in code: torch.cuda.nvtx.range_push("forward") for clear timeline labels
- Store profiles in PVC: .nsys-rep files can be 1-10 GB for long traces
- Weekly regression profiles: catch performance degradation early
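The NVTX practice above can be wrapped in a small context manager so that every range_push gets a matching range_pop. This is a sketch under assumptions: nvtx_range and train_step are hypothetical names, the arithmetic stands in for a real model, and the markers fall back to no-ops when a CUDA-enabled PyTorch is not available.

```python
from contextlib import contextmanager

try:
    import torch
    # NVTX calls only work on a CUDA build with a visible GPU.
    _NVTX_ENABLED = torch.cuda.is_available()
except Exception:  # PyTorch missing or unusable: markers become no-ops
    torch = None
    _NVTX_ENABLED = False


@contextmanager
def nvtx_range(name):
    """Mark a code region with an NVTX range so it shows up on the nsys timeline."""
    if _NVTX_ENABLED:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        if _NVTX_ENABLED:
            torch.cuda.nvtx.range_pop()


def train_step(batch):
    """Hypothetical training step: placeholder math instead of a real model."""
    with nvtx_range("forward"):
        loss = sum(batch) / len(batch)
    with nvtx_range("backward"):
        grad = 1.0 / len(batch)
    return loss, grad
```

With --trace=nvtx enabled, the "forward" and "backward" ranges should appear as labeled spans in the Nsight Systems timeline, which makes it easier to line up kernels with training phases.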
Key Takeaways
- Nsight Systems = timeline profiler (CUDA + NCCL + CPU, shows where time goes)
- Nsight Compute = kernel profiler (occupancy, memory, roofline, shows why a kernel is slow)
- Both require CAP_SYS_ADMIN on Kubernetes (GPU hardware counter access)
- Profile distributed training with per-rank .nsys-rep files and correlate NCCL timelines
- Sidecar pattern + shareProcessNamespace for profiling existing workloads non-intrusively
- Key findings: NCCL % of step time, compute vs memory bound kernels, pipeline bubbles
- Export to SQLite/JSON for automated regression detection
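As one way to act on the SQLite export, the sketch below sums GPU time per kernel, estimates the NCCL share, and flags kernels that slowed down against a baseline. It is a sketch under stated assumptions rather than a definitive implementation: it assumes the export contains a CUPTI_ACTIVITY_KIND_KERNEL table (start, end, shortName) joined to StringIds (id, value), which may differ between Nsight Systems versions; the function names and the 10% threshold are hypothetical; and the NCCL figure is a share of summed kernel time, not of step time.

```python
import sqlite3

# Assumed schema of the nsys SQLite export; verify against your nsys version.
KERNEL_QUERY = """
SELECT s.value AS name, SUM(k."end" - k.start) AS total_ns
FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
JOIN StringIds AS s ON s.id = k.shortName
GROUP BY s.value
"""


def kernel_totals(conn):
    """Return {kernel name: total GPU time in ns} from an open SQLite connection."""
    return {name: total for name, total in conn.execute(KERNEL_QUERY)}


def nccl_share(totals):
    """Fraction of summed kernel time spent in kernels whose name contains 'nccl'."""
    all_ns = sum(totals.values())
    nccl_ns = sum(t for name, t in totals.items() if "nccl" in name.lower())
    return nccl_ns / all_ns if all_ns else 0.0


def regressions(baseline, current, threshold=0.10):
    """Return {kernel: relative growth} for kernels slower than baseline by > threshold."""
    flagged = {}
    for name, base_ns in baseline.items():
        cur_ns = current.get(name)
        if cur_ns is not None and base_ns > 0:
            growth = (cur_ns - base_ns) / base_ns
            if growth > threshold:
                flagged[name] = growth
    return flagged
```

A weekly job could open last week's and this week's exports with sqlite3.connect, pass both through kernel_totals, and fail when regressions returns anything. Comparing totals only makes sense when both profiles cover the same benchmark and the same --duration.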

