NCCL PXN Cross-NIC Communication via NVLink
Configure NCCL PXN (PCI × NVLink) for multi-node GPU training where not every GPU has a direct RDMA NIC. Covers topology
💡 Quick Answer: NCCL PXN (PCI × NVLink) allows GPUs without a directly attached RDMA NIC to reach the network through another GPU's NIC via NVLink. This is critical in systems with fewer NICs than GPUs (e.g., 4 NICs for 8 GPUs): NCCL routes traffic over NVLink to a peer GPU that has NIC access.
The Problem
In multi-GPU servers, the GPU-to-NIC topology is often not 1:1:
- 8 GPUs but only 4 InfiniBand NICs
- NICs connected to specific PCIe switches, not all GPUs
- GPUs without direct NIC access fall back to CPU-staged copies (slow)
- Need inter-node communication for all 8 GPUs, not just the 4 with NICs
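To make the mismatch concrete, here is a minimal Python sketch that lists which GPUs lack a local NIC and the NVLink detour that the rest of this article configures. The layout is an assumption that mirrors the 8-GPU, 4-NIC example used in this article; `LOCAL_NIC`, `NVLINK_PEER`, `gpus_without_nic`, and `network_path` are illustrative names, not NCCL APIs, and real path selection is done by NCCL's own topology search.

```python
from typing import Dict, List

# Assumed layout (illustrative only): even-numbered GPUs sit next to a NIC.
LOCAL_NIC: Dict[int, str] = {0: "NIC0", 2: "NIC1", 4: "NIC2", 6: "NIC3"}
# Assumed NVLink neighbour that could forward traffic for each NIC-less GPU.
NVLINK_PEER: Dict[int, int] = {1: 0, 3: 2, 5: 4, 7: 6}
NUM_GPUS = 8


def gpus_without_nic() -> List[int]:
    """Return the GPUs that have no NIC of their own."""
    return [gpu for gpu in range(NUM_GPUS) if gpu not in LOCAL_NIC]


def network_path(gpu: int) -> str:
    """Describe how a GPU would reach the network in this toy model."""
    if gpu in LOCAL_NIC:
        return f"GPU{gpu} -> {LOCAL_NIC[gpu]} -> Network"
    peer = NVLINK_PEER.get(gpu)
    if peer is None or peer not in LOCAL_NIC:
        # No NVLink peer with a NIC: the slow fallback described above.
        return f"GPU{gpu} -> CPU-staged copy -> Network"
    return f"GPU{gpu} -> NVLink -> GPU{peer} -> {LOCAL_NIC[peer]} -> Network"


if __name__ == "__main__":
    print("GPUs without a local NIC:", gpus_without_nic())
    for gpu in range(NUM_GPUS):
        print(network_path(gpu))
```

In this model half of the GPUs depend on a peer, which is why the NVLink path matters for inter-node traffic.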
The Solution
Understanding PXN Topology
Typical 8-GPU Server with 4 NICs:
─────────────────────────────────────────────────────────
    CPU0 Socket                          CPU1 Socket
┌──────────────────┐                 ┌──────────────────┐
│ PCIe Switch 0    │                 │ PCIe Switch 2    │
│ ├── GPU0 ─ NIC0  │   ←─NVLink──→   │ ├── GPU4 ─ NIC2  │
│ └── GPU1         │                 │ └── GPU5         │
│                  │                 │                  │
│ PCIe Switch 1    │                 │ PCIe Switch 3    │
│ ├── GPU2 ─ NIC1  │   ←─NVLink──→   │ ├── GPU6 ─ NIC3  │
│ └── GPU3         │                 │ └── GPU7         │
└──────────────────┘                 └──────────────────┘
Without PXN:
  GPU1, GPU3, GPU5, GPU7 → NO direct NIC → CPU copy fallback (slow)
With PXN:
  GPU1 → NVLink → GPU0 → NIC0 → Network (GPU0 proxies for GPU1)
  GPU3 → NVLink → GPU2 → NIC1 → Network (GPU2 proxies for GPU3)
  GPU5 → NVLink → GPU4 → NIC2 → Network (GPU4 proxies for GPU5)
  GPU7 → NVLink → GPU6 → NIC3 → Network (GPU6 proxies for GPU7)

NCCL PXN Configuration
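When a job is launched outside Kubernetes (for example on bare metal), the same NCCL variables used in the manifest in this section can be set from Python before the process group is created. The sketch below is one way to do that under stated assumptions: `NCCL_SETTINGS` and `apply_nccl_settings` are hypothetical names, the values are copied from this article's manifest rather than recommended defaults, and variables already exported by the operator are left untouched.

```python
import os
from typing import Dict, Mapping, MutableMapping

# Values copied from this article's Pod manifest; tune them for your fabric.
NCCL_SETTINGS: Dict[str, str] = {
    "NCCL_CROSS_NIC": "1",
    "NCCL_NET_GDR_LEVEL": "5",
    "NCCL_P2P_LEVEL": "NVL",
    "NCCL_IB_HCA": "mlx5_0,mlx5_1,mlx5_2,mlx5_3",
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": "INIT,NET,GRAPH",
}


def apply_nccl_settings(
    settings: Mapping[str, str],
    env: MutableMapping[str, str] = os.environ,
) -> Dict[str, str]:
    """Set each variable unless it is already present; return what was set."""
    applied: Dict[str, str] = {}
    for name, value in settings.items():
        if name not in env:
            env[name] = value
            applied[name] = value
    return applied


if __name__ == "__main__":
    # Call this before the NCCL communicator is created (in PyTorch, before
    # torch.distributed.init_process_group), because NCCL reads its
    # environment at initialization time.
    for name, value in apply_nccl_settings(NCCL_SETTINGS).items():
        print(f"set {name}={value}")
```

Leaving existing values alone keeps a cluster-wide override (for example a different `NCCL_IB_HCA` list) in charge.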
apiVersion: v1
kind: Pod
metadata:
  name: distributed-training
  namespace: ai-training
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net-0,rdma-net-1,rdma-net-2,rdma-net-3
spec:
  containers:
    - name: training
      image: nvcr.io/nvidia/pytorch:24.07-py3
      env:
        # PXN and Cross-NIC settings
        - name: NCCL_CROSS_NIC
          value: "1"      # Allow traffic across different NICs
        - name: NCCL_NET_GDR_LEVEL
          value: "5"      # Full GPU-Direct RDMA
        - name: NCCL_P2P_LEVEL
          value: "NVL"    # Use P2P between NVLink-connected GPUs (the path PXN relies on)
        # NIC selection
        - name: NCCL_IB_HCA
          value: "mlx5_0,mlx5_1,mlx5_2,mlx5_3"   # All 4 NICs
        # Topology detection
        - name: NCCL_TOPO_FILE
          value: "/var/run/nvidia/topo.xml"      # GPU topology file
        - name: NCCL_TOPO_DUMP_FILE
          value: "/tmp/nccl-topo.xml"            # Debug: dump detected topo
        # Performance tuning
        - name: NCCL_IB_QPS_PER_CONNECTION
          value: "4"
        - name: NCCL_IB_TIMEOUT
          value: "22"
        - name: NCCL_IB_RETRY_CNT
          value: "7"
        - name: NCCL_ALGO
          value: "Ring,Tree"                     # Algorithm selection
        - name: NCCL_PROTO
          value: "Simple,LL,LL128"
        # Debug
        - name: NCCL_DEBUG
          value: "INFO"
        - name: NCCL_DEBUG_SUBSYS
          value: "INIT,NET,GRAPH"
      resources:
        requests:
          nvidia.com/gpu: "8"
          openshift.io/mellanoxnics: "4"
        limits:   # extended resources must also be set in limits, equal to requests
          nvidia.com/gpu: "8"
          openshift.io/mellanoxnics: "4"

Topology File for NCCL
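A topology file of the shape used in this section can be sanity-checked before it is handed to NCCL. The following sketch uses Python's standard `xml.etree.ElementTree` to list the GPUs and NICs found under each `<cpu>` element. The embedded `SAMPLE_TOPO` is a trimmed, illustrative copy of this article's example, and `devices_per_numa` is a hypothetical helper name; it only reads the `<cpu>`, `<pci>`, `<gpu>`, and `<nic>` elements shown here and does not validate the full NCCL schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Trimmed sample in the same shape as this article's topo.xml (illustrative).
SAMPLE_TOPO = """
<system version="1">
  <cpu numaid="0" affinity="0-31" arch="x86_64" vendor="GenuineIntel">
    <pci busid="0000:17:00.0" class="0x030200">
      <gpu dev="0" sm="90" mem="81920" bar1="131072"/>
    </pci>
    <pci busid="0000:18:00.0" class="0x020700">
      <nic dev="mlx5_0"/>
    </pci>
    <pci busid="0000:65:00.0" class="0x030200">
      <gpu dev="1" sm="90" mem="81920" bar1="131072"/>
    </pci>
  </cpu>
</system>
"""


def devices_per_numa(xml_text: str) -> Dict[str, Dict[str, List[str]]]:
    """Map each NUMA node id to the GPUs and NICs declared beneath it."""
    root = ET.fromstring(xml_text.strip())
    result: Dict[str, Dict[str, List[str]]] = {}
    for cpu in root.iter("cpu"):
        entry: Dict[str, List[str]] = {"gpus": [], "nics": []}
        # iter() also walks nested <pci> elements (PCIe switches).
        for pci in cpu.iter("pci"):
            busid = pci.get("busid", "?")
            for gpu in pci.findall("gpu"):
                entry["gpus"].append(f"GPU{gpu.get('dev')}@{busid}")
            for nic in pci.findall("nic"):
                entry["nics"].append(f"{nic.get('dev')}@{busid}")
        result[cpu.get("numaid", "?")] = entry
    return result


if __name__ == "__main__":
    for numa, devs in devices_per_numa(SAMPLE_TOPO).items():
        print(f"NUMA {numa}: gpus={devs['gpus']} nics={devs['nics']}")
```

A NUMA node that lists more GPUs than NICs is a node where some GPUs will depend on a peer's NIC.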
<!-- /var/run/nvidia/topo.xml - helps NCCL understand GPU-NIC affinity -->
<system version="1">
  <cpu numaid="0" affinity="0-31" arch="x86_64" vendor="GenuineIntel">
    <pci busid="0000:17:00.0" class="0x030200" vendor="0x10de" device="0x2330"
         subsystem_vendor="0x10de" subsystem_device="0x1626" link_speed="16 GT/s"
         link_width="16">
      <!-- GPU0 -->
      <gpu dev="0" sm="90" mem="81920" bar1="131072"/>
    </pci>
    <pci busid="0000:18:00.0" class="0x020700" vendor="0x15b3" device="0x101e">
      <!-- NIC0 - same PCIe switch as GPU0 -->
      <nic dev="mlx5_0"/>
    </pci>
    <pci busid="0000:65:00.0" class="0x030200" vendor="0x10de" device="0x2330">
      <!-- GPU1 - no direct NIC, will use PXN via GPU0 -->
      <gpu dev="1" sm="90" mem="81920" bar1="131072"/>
    </pci>
  </cpu>
</system>

NCCL_CROSS_NIC Explained
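NCCL_CROSS_NIC values:
────────────────────────────────────────────────────────────────
Value   Behavior
────────────────────────────────────────────────────────────────
0       Always use the same NIC for a given ring/tree (stay on one rail)
        → Suited to rail-optimized fabrics with slow inter-rail links
1       Do not try to keep a ring/tree on the same NIC
        → Suited to fabrics where all NICs of a node share one switch
2       Default: prefer the same NIC, but cross NICs when that performs better
        → Locality when possible, cross-NIC when it helps

Note: the PXN path itself (GPU → NVLink → peer GPU → NIC) is enabled by default since NCCL 2.12 and is switched off with NCCL_PXN_DISABLE=1. NCCL_CROSS_NIC only decides whether rings and trees may span different NICs.

Verify PXN is Active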
# Run with NCCL_DEBUG=INFO and look for PXN indicators
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH
# In NCCL output, look for:
# "Channel [X] ... GPU/Y -> NIC/Z via GPU/W" → PXN path (GPU W proxies)
# "PXN" in graph info
# Topology dump
export NCCL_TOPO_DUMP_FILE=/tmp/topo.xml
# Run training, then inspect /tmp/topo.xml for GPU-NIC paths
# Check which algo/path NCCL selected
# "Ring" with cross-NIC paths = PXN active
# "Tree" = hierarchical (also uses PXN for leaf GPUs without NICs)

Multi-NIC Bandwidth Optimization
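The bandwidth budget for a node of this kind is simple arithmetic, sketched below so it can be rerun with other NIC counts or speeds. The 5% NVLink-hop overhead is this article's estimate, not a measured or documented figure, and the function names are illustrative.

```python
def node_bandwidth_gbps(num_nics: int, nic_gbps: int) -> int:
    """Aggregate line rate of all NICs in one node."""
    return num_nics * nic_gbps


def effective_gbps(raw_gbps: int, overhead_percent: int) -> float:
    """Apply an assumed percentage overhead for the extra NVLink hop."""
    return raw_gbps * (100 - overhead_percent) / 100


def per_gpu_share_gbps(num_gpus: int, num_nics: int, nic_gbps: int) -> float:
    """Average NIC bandwidth per GPU when GPUs share NICs evenly."""
    return num_nics * nic_gbps / num_gpus


if __name__ == "__main__":
    raw = node_bandwidth_gbps(num_nics=4, nic_gbps=400)
    print("aggregate:", raw, "Gb/s")
    print("with assumed 5% overhead:", effective_gbps(raw, 5), "Gb/s")
    print("per GPU (8 GPUs):", per_gpu_share_gbps(8, 4, 400), "Gb/s")
```

With 4 NICs at 400 Gb/s and 8 GPUs, each GPU averages 200 Gb/s of network bandwidth, which is the sharing effect to plan for when placing workloads.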
# With 4× ConnectX-7 400 Gb/s NICs:
# Theoretical: 4 × 400 = 1600 Gb/s aggregate per node
# With PXN overhead (~5% NVLink hop): ~1520 Gb/s effective
# Optimize NIC-to-GPU mapping:
export NCCL_IB_HCA="mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1"
# :1 = port 1 (InfiniBand port number)
# Helper threads and sockets for NCCL's socket (TCP) transport; not used for IB verbs traffic
export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=4
# For DGX-style systems (NVSwitch):
export NCCL_NVLS_ENABLE=1 # NVLink SHARP (H100+)
export NCCL_P2P_NET_CHUNKSIZE=524288 # 512 KiB chunks for point-to-point network transfers

Common Issues
GPU without NIC falls back to SHM/CPU copy
- Cause: NCCL_CROSS_NIC=0 or NVLink not detected between GPUs
- Fix: Set NCCL_CROSS_NIC=1; verify NVLink with nvidia-smi topo -m
PXN not used despite NVLink present
- Cause: Topology file missing or incorrect; NCCL can't determine affinity
- Fix: Provide NCCL_TOPO_FILE, or let GPU Operator generate it via GFD
Uneven bandwidth across GPUs
- Cause: PXN GPUs share NIC bandwidth with the GPU that has direct access
- Fix: Expected, since 2 GPUs share 1 NIC. Design for it in placement strategy.
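The configuration-related causes above can be caught before a job starts. The sketch below is a hypothetical preflight helper (`preflight` is not part of NCCL) that inspects an environment mapping for the settings discussed in this article. It also checks `NCCL_PXN_DISABLE`, which is NCCL's documented switch for turning PXN off. It cannot confirm NVLink connectivity; that still requires `nvidia-smi topo -m` on the node.

```python
import os
from typing import List, Mapping


def preflight(env: Mapping[str, str]) -> List[str]:
    """Return human-readable warnings for settings likely to block PXN."""
    warnings: List[str] = []
    if env.get("NCCL_PXN_DISABLE") == "1":
        warnings.append("NCCL_PXN_DISABLE=1: PXN is switched off")
    if env.get("NCCL_CROSS_NIC") == "0":
        warnings.append("NCCL_CROSS_NIC=0: this article recommends 1")
    topo = env.get("NCCL_TOPO_FILE")
    if topo and not os.path.isfile(topo):
        warnings.append(f"NCCL_TOPO_FILE={topo} does not exist")
    if env.get("NCCL_DEBUG", "").upper() not in ("INFO", "TRACE"):
        warnings.append("NCCL_DEBUG is not INFO: path selection is not logged")
    return warnings


if __name__ == "__main__":
    for message in preflight(os.environ):
        print("WARNING:", message)
```

Running it in the container entrypoint gives an early, readable hint instead of a silent fallback to a slower path.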
Best Practices
- Set NCCL_CROSS_NIC=1 for systems with fewer NICs than GPUs
- Provide a topology file; it helps NCCL make optimal path decisions
- Match VF count to NIC count (not GPU count) in SR-IOV policy
- Use NCCL_DEBUG=INFO to verify PXN paths are selected
- Pin workloads to full nodes; partial allocation breaks PXN topology
- NVSwitch systems (DGX): all GPUs can reach all NICs efficiently
- PCIe-only systems: PXN limited to GPUs connected via NVLink bridges
Key Takeaways
- PXN = GPU uses another GPU's NIC via NVLink for network access
- Critical when NIC count < GPU count (common: 4 NICs for 8 GPUs)
- NCCL_CROSS_NIC=1 lets rings and trees use different NICs; PXN supplies the NVLink hop to a non-local NIC
- ~5% overhead per NVLink hop compared to direct NIC access
- Topology file helps NCCL find optimal GPU→NIC paths
- Works with both InfiniBand and RoCE (Ethernet RDMA)
- DGX/NVSwitch systems: all GPUs have equal NIC access (no PXN penalty)
- PCIe systems: PXN only works between NVLink-connected GPU pairs
