RoCEv2 End-to-End Lossless Stack
Complete RoCEv2 lossless fabric configuration from GPU node to switch and back. Dell OS10 switches, Mellanox NICs, OpenShift MachineConfig, PFC, ECN.
π‘ Quick Answer: A lossless RoCEv2 stack requires three aligned layers: node (ECN sysctl + Mellanox PFC/DSCP trust), switch (DSCP trust + PFC TC3 + ECN/WRED), and Kubernetes (SR-IOV + NCCL env vars). All must agree on DSCP 26 β Priority 3 β Traffic Class 3.
The Problem
RDMA performance collapses when any single layer of the network stack is misconfigured. Common failures:
- Switch doesnβt trust DSCP β RDMA traffic treated as best-effort β drops
- Node PFC disabled β no pause frames β buffer overflow β packet loss β NCCL retransmissions
- ECN disabled β PFC activates too frequently β head-of-line blocking
- DSCP mapping mismatch β RDMA traffic lands in wrong queue β mixed with bulk data
A working lossless fabric requires every component to agree on the same priority mapping.
The Solution
Architecture Overview
graph TD
subgraph Layer 1: Application
VLLM[vLLM / PyTorch] --> NCCL[NCCL 2.27+]
NCCL -->|RDMA verbs| IBVERBS[libibverbs]
end
subgraph Layer 2: Node OS
IBVERBS --> MLX[Mellanox mlx5 Driver]
MLX --> PFC_N[PFC Priority 3]
MLX --> ECN_N[ECN Enabled]
MLX --> DSCP_N[DSCP Trust Mode]
end
subgraph Layer 3: Network Fabric
NIC1[ConnectX NIC<br/>400 Gb/s] -->|DSCP 26 AF31| SW[Dell OS10 Switch]
SW -->|Trust DSCP<br/>PFC TC3<br/>ECN/WRED| NIC2[ConnectX NIC<br/>400 Gb/s]
end
subgraph Layer 4: Kubernetes
SRIOV[SR-IOV Device Plugin] --> VF[Virtual Function]
MULTUS[Multus CNI] --> NET[NetworkAttachmentDefinition]
end
MLX --> NIC1
VF --> MLXLayer-by-Layer Configuration
Layer 1: Kubernetes (SR-IOV + NCCL)
# SriovNetworkNodePolicy for RDMA
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
name: rdma-policy
namespace: openshift-sriov-network-operator
spec:
nodeSelector:
node-role.kubernetes.io/gpu-worker: ""
resourceName: mellanoxnics
numVfs: 8
nicSelector:
vendor: "15b3"
deviceID: "101d"
deviceType: netdevice # Required for RDMA verbs
isRdma: true
---
# NetworkAttachmentDefinition
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
name: rdma-net
namespace: ai-inference
annotations:
k8s.v1.cni.cncf.io/resourceName: openshift.io/mellanoxnics
spec:
config: |
{
"cniVersion": "0.3.1",
"type": "host-device",
"ipam": {
"type": "whereabouts",
"range": "10.0.100.0/24"
}
}NCCL environment variables for pods:
env:
- name: NCCL_DEBUG
value: "INFO"
- name: NCCL_SOCKET_IFNAME
value: "net1"
- name: NCCL_NET
value: "IB"Layer 2: Node Configuration (MachineConfig)
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
labels:
machineconfiguration.openshift.io/role: gpu-worker
name: 99-roce-lossless-stack
spec:
config:
ignition:
version: 3.4.0
storage:
files:
# ECN sysctl
- path: /etc/sysctl.d/99-ecn.conf
mode: 0644
contents:
source: data:text/plain;charset=utf-8,net.ipv4.tcp_ecn=1%0Anet.ipv4.tcp_ecn_fallback=1%0A
# NIC configuration script
- path: /usr/local/bin/configure-roce-lossless.sh
mode: 0755
contents:
source: data:text/plain;charset=utf-8,#!/bin/bash%0Aset -euo pipefail%0Afor dev in /sys/class/infiniband/mlx5_*; do%0A for nd in %24dev/device/net/*; do%0A NETDEV=%24(basename %24nd)%0A echo "Configuring %24NETDEV for lossless RoCEv2"%0A mlnx_qos -i %24NETDEV --trust dscp 2>/dev/null || true%0A mlnx_qos -i %24NETDEV --pfc 0,0,0,1,0,0,0,0 2>/dev/null || true%0A mlnx_qos -i %24NETDEV --ecn 0,0,0,1,0,0,0,0 2>/dev/null || true%0A echo "Done: %24NETDEV"%0A done%0Adone%0A
systemd:
units:
- name: configure-roce-lossless.service
enabled: true
contents: |
[Unit]
Description=Configure lossless RoCEv2 on Mellanox NICs
After=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/local/bin/configure-roce-lossless.sh
RemainAfterExit=true
[Install]
WantedBy=multi-user.targetLayer 3: Switch Configuration (Dell OS10)
! Per-port configuration for GPU node connections
interface ethernet1/1/1-1/1/16
description "GPU Nodes - RoCEv2 Lossless"
mtu 9216
trust dscp
priority-flow-control mode on
priority-flow-control priority 3 no-drop
ets mode on
ets traffic-class 0 bandwidth 30
ets traffic-class 3 bandwidth 70
wred-profile rdma-ecn queue 3
no shutdown
! ECN/WRED profile
wred ecn-profile rdma-ecn
ecn enable
threshold minimum 150 maximum 1500 probability 100Validation Checklist
Run this from each GPU node to verify the full stack:
#!/bin/bash
echo "=== RoCEv2 Lossless Stack Validation ==="
echo -e "\n--- 1. ECN Sysctl ---"
sysctl net.ipv4.tcp_ecn
# Expected: net.ipv4.tcp_ecn = 1
echo -e "\n--- 2. Mellanox NIC Trust Mode ---"
for dev in /sys/class/infiniband/mlx5_*; do
NETDEV=$(ls $dev/device/net/ | head -1)
echo "$NETDEV:"
mlnx_qos -i $NETDEV 2>/dev/null | grep -E "trust|pfc|ecn" || echo " mlnx_qos not available"
done
# Expected: trust state: dscp, pfc: 0,0,0,1,0,0,0,0
echo -e "\n--- 3. RDMA Devices ---"
ibv_devinfo | grep -E "hca_id|state|active_speed"
# Expected: PORT_ACTIVE, speed matching NIC spec
echo -e "\n--- 4. SR-IOV VFs ---"
ip link show | grep "vf " | head -5
# Expected: VFs listed with MAC addresses
echo -e "\n--- 5. PFC Counters ---"
for dev in /sys/class/infiniband/mlx5_*; do
NETDEV=$(ls $dev/device/net/ | head -1)
echo "$NETDEV PFC:"
ethtool -S $NETDEV 2>/dev/null | grep -E "pfc|pause" | grep -v ": 0$"
done
# Expected: counters may be 0 (good) or small numbers (normal)
echo -e "\n--- 6. NCCL Test ---"
echo "Deploy a test pod with NCCL_DEBUG=INFO and check for:"
echo " NET/IB : Using [0]mlx5_X:1/RoCE"
echo " GPU Direct RDMA Enabled"
echo " Channel XX/0 : 0[0] -> 1[1] via P2P/CUMEM"Troubleshooting Decision Tree
graph TD
START[NCCL slow or<br/>falling back to TCP] --> Q1{NCCL logs show<br/>NET/IB or NET/Socket?}
Q1 -->|NET/Socket| Q2{ibv_devinfo shows<br/>RDMA devices?}
Q2 -->|No| FIX1[Install MOFED driver<br/>Check SR-IOV VFs]
Q2 -->|Yes| Q3{NCCL_NET=IB and<br/>NCCL_IB_HCA set?}
Q3 -->|No| FIX2[Set NCCL_NET=IB<br/>NCCL_IB_HCA=mlx5_0:1]
Q3 -->|Yes| FIX3[Check NetworkAttachmentDefinition<br/>and pod annotations]
Q1 -->|NET/IB| Q4{Bandwidth lower<br/>than expected?}
Q4 -->|Yes| Q5{PFC counters<br/>incrementing fast?}
Q5 -->|Yes| FIX4[ECN thresholds too high<br/>Lower WRED minimum]
Q5 -->|No| Q6{Switch trust dscp<br/>enabled?}
Q6 -->|No| FIX5[Enable trust dscp<br/>on switch ports]
Q6 -->|Yes| FIX6[Check MTU mismatch<br/>Switch=9216, Host=9000]
Q4 -->|No| OK[Stack is healthy β
]Common Issues
DSCP mapping mismatch between layers
The most common issue. Verify all three agree:
| Layer | Config | Expected |
|---|---|---|
| Host NIC | mlnx_qos --trust dscp | Trust mode: dscp |
| Switch port | trust dscp | DSCP 26 β TC3 |
| NCCL | Default (DSCP 26/AF31) | No override needed |
PFC storms after enabling ECN
ECN on the switch AND the host must both be enabled. If only one side has ECN, the other canβt react to congestion signals:
- Switch:
wred-profile rdma-ecnwithecn enable - Host:
mlnx_qos --ecn 0,0,0,1,0,0,0,0
GPU Direct RDMA not available
Requires open kernel modules (useOpenKernelModules: true in GPU Operator) and DMA-BUF support. Check:
cat /sys/module/nvidia/parameters/NVreg_OpenRmEnableUnsupportedGpusBest Practices
- Configure bottom-up: Switch β Node β Kubernetes β Application
- Validate each layer before proceeding to the next
- DSCP 26 (AF31) β Priority 3 is the universal RoCEv2 convention β donβt invent custom mappings
- ECN minimum threshold should be ~10% of switch buffer per port β too low causes over-marking, too high defeats the purpose
- Test with
ib_write_bwanddoca_perftestbefore running workloads - Monitor in production: PFC pause counters, ECN mark counters, NCCL throughput
- Document your topology: which switch ports connect to which nodes, VLAN assignments, MTU settings
Key Takeaways
- Lossless RoCEv2 requires alignment across all layers: app β OS β NIC β switch β NIC β OS β app
- Three mandatory configs per layer: DSCP trust, PFC on priority 3, ECN enabled
- Dell OS10 defaults DSCP 24/26 to TC3 β verify with
show qos-map dscp-tc - MachineConfig + systemd oneshot is the OpenShift-native way to configure NIC QoS
mlnx_qosis the Swiss Army knife for Mellanox NIC QoS: trust, PFC, ECN, ETS- GPU Direct RDMA requires open kernel modules β check NCCL logs for
DMA-BUF is available - PFC prevents drops (emergency brake), ECN prevents congestion (proactive) β always use both
- Validate with the checklist script before deploying workloads

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
