Networking intermediate ⏱ 10 minutes K8s 1.28+

ETS Queue, PFC, DSCP Trust on Mellanox Quic...

Quick reference for enabling ETS queues, PFC, DSCP trust, and DSCP-to-priority mapping on Mellanox ConnectX NICs. Three commands for lossless RoCE Ethernet.

By Luca Berton • 📖 8 min read

💡 Quick Answer: Three steps for lossless RoCE on Mellanox ConnectX: 1) mlnx_qos -i <iface> --trust dscp (classify by DSCP field), 2) mlnx_qos -i <iface> --pfc 0,0,0,1,0,0,0,0 (make priority 3 lossless), 3) mlnx_qos -i <iface> --tc_bw 70,0,0,30,0,0,0,0 (ETS queues: 30% for RoCE). DSCP 26 (AF31) maps to priority 3 by default.

The Problem

RoCE needs three things configured on every Mellanox NIC: DSCP trust (so the NIC classifies traffic by DSCP value), PFC (so priority 3 is lossless), and ETS queues (so RDMA traffic gets guaranteed bandwidth). Without all three, RoCE either drops packets or starves.

Step 1: Set DSCP Trust

DSCP trust tells the NIC to classify packets by their IP DSCP field instead of VLAN PCP bits:

# Check current trust mode
mlnx_qos -i ens8f0np0 | grep -i trust
# Trust state: pcp    ← Default (wrong for RoCE)

# Set DSCP trust
mlnx_qos -i ens8f0np0 --trust dscp

# Verify
mlnx_qos -i ens8f0np0 | grep -i trust
# Trust state: dscp   ← Correct ✅

Why DSCP? RoCE v2 runs over UDP/IP, so each packet can carry a DSCP value in its IP header; this recipe uses DSCP 26 (AF31). With DSCP trust, the NIC reads this value and places the traffic into priority 3 automatically, with no VLAN tagging required.
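A small guard along these lines could let a script stop early when the trust mode is not what it expects. This is only a sketch: `trust_mode` is a hypothetical helper, and it assumes nothing more than that the line containing "trust state" ends with the mode name, as in the output shown above.

```shell
# trust_mode: read mlnx_qos output on stdin and print the trust mode.
# Assumption: the line containing "trust state" ends with the mode (pcp or dscp).
trust_mode() {
  awk 'tolower($0) ~ /trust state/ { print tolower($NF); exit }'
}

# Canned input, so the example runs without a NIC:
printf 'Trust state: dscp\n' | trust_mode
```

On a real host the same helper would be fed from the tool itself, for example `mlnx_qos -i ens8f0np0 | trust_mode`.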

Step 2: Map DSCP to Priority

RoCE v2 default: DSCP 26 → priority 3. Verify the mapping exists:

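# Check DSCP-to-priority mapping (--dscp2prio expects an argument, so read
# the mapping from the full output, where it is listed once trust is dscp)
mlnx_qos -i ens8f0np0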

# Look for DSCP 26 → priority 3
# DSCP  Priority
# ...
# 24    3     ← CS3
# 26    3     ← AF31 (RoCE default) ✅
# 34    4     ← AF41
# ...

# If DSCP 26 doesn't map to priority 3, set it:
mlnx_qos -i ens8f0np0 --dscp2prio set,26,3

DSCP Values Reference

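| DSCP Value | Name | Decimal | Binary | Default Priority |
|---|---|---|---|---|
| 0 | Best Effort | 0 | 000000 | 0 |
| 8 | CS1 | 8 | 001000 | 1 |
| 24 | CS3 | 24 | 011000 | 3 |
| 26 | AF31 | 26 | 011010 | 3 ← RoCE |
| 34 | AF41 | 34 | 100010 | 4 |
| 46 | EF | 46 | 101110 | 5 |
| 48 | CS6 | 48 | 110000 | 6 |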
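Every row in the table above is consistent with the default priority being the top three bits of the 6-bit DSCP value (DSCP >> 3). The sketch below encodes that observation. `dscp_default_prio` is a hypothetical helper name, and firmware defaults can differ, so treat its result as an expectation to check against `mlnx_qos` output rather than a guarantee.

```shell
# dscp_default_prio DSCP: print the priority implied by the top three DSCP bits.
# Assumption: the default mapping is DSCP >> 3, as the table rows suggest.
dscp_default_prio() {
  case $1 in
    ''|*[!0-9]*) echo "DSCP must be an integer in 0..63" >&2; return 1 ;;
  esac
  if [ "$1" -gt 63 ]; then
    echo "DSCP must be an integer in 0..63" >&2
    return 1
  fi
  echo $(( $1 >> 3 ))
}

# Reproduce the "Default Priority" column for the values listed above.
for d in 0 8 24 26 34 46 48; do
  printf 'DSCP %2d -> priority %d\n' "$d" "$(dscp_default_prio "$d")"
done
```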

Setting DSCP from NCCL

NCCL uses the NCCL_IB_TC environment variable. This sets the traffic class byte (TOS field), not DSCP directly:

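# TOS = (DSCP << 2) | ECN bits
# DSCP 26 = 011010 → TOS = 01101000 = 104 decimal (ECN bits 00)
# NCCL_IB_TC takes the full 8-bit traffic class, so the ECN bits are included

# For DSCP 26 (AF31) with ECN bits 10 (ECT(0)): 104 + 2 = 106
export NCCL_IB_TC=106
# Select the RoCE v2 GID entry (index 3 is common; confirm with show_gids):
export NCCL_IB_GID_INDEX=3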
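The arithmetic behind 104 and 106 can be checked directly in the shell. This sketch assumes the standard IP header layout (DSCP in the upper six bits of the byte, ECN in the lower two); `dscp_to_tos` is a hypothetical helper, not part of NCCL or mlnx_qos.

```shell
# dscp_to_tos DSCP [ECN]: print the 8-bit traffic class / TOS byte.
# ECN is the 2-bit ECN field (0..3) and defaults to 0.
dscp_to_tos() {
  echo $(( ($1 << 2) | ${2:-0} ))
}

dscp_to_tos 26      # prints 104 (DSCP 26, ECN bits 00)
dscp_to_tos 26 2    # prints 106 (DSCP 26, ECN bits 10)
```

Seen this way, 106 is DSCP 26 with the ECN-capable bits set, which is why both numbers show up in RoCE tuning notes.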

Step 3: Enable PFC on Priority 3

PFC pauses priority 3 when switch or NIC buffers fill, preventing packet drops:

# Check current PFC state
mlnx_qos -i ens8f0np0
# PFC configuration:
#   priority:  0  1  2  3  4  5  6  7
#   enabled:   0  0  0  0  0  0  0  0   ← All disabled (default)

# Enable PFC on priority 3 ONLY
mlnx_qos -i ens8f0np0 --pfc 0,0,0,1,0,0,0,0

# Verify
mlnx_qos -i ens8f0np0
# PFC configuration:
#   priority:  0  1  2  3  4  5  6  7
#   enabled:   0  0  0  1  0  0  0  0   ← Priority 3 lossless ✅

⚠️ Never enable PFC on all priorities: it wastes switch buffer and can cause head-of-line blocking. Only priority 3 (RoCE) needs to be lossless.
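The list passed to --pfc carries one flag per priority, 0 through 7, so it is easy to put the 1 in the wrong slot when typing it by hand. A minimal sketch of generating the list from a single priority number follows; `pfc_mask` is a hypothetical helper, and it deliberately accepts only one lossless priority, in line with the warning above.

```shell
# pfc_mask PRIO: print an 8-entry --pfc list with only PRIO enabled.
pfc_mask() {
  case $1 in
    [0-7]) ;;
    *) echo "priority must be a single digit in 0..7" >&2; return 1 ;;
  esac
  local mask="" bit p
  for p in 0 1 2 3 4 5 6 7; do
    if [ "$p" -eq "$1" ]; then bit=1; else bit=0; fi
    mask="${mask:+$mask,}$bit"   # add a comma only between entries
  done
  echo "$mask"
}

pfc_mask 3    # prints 0,0,0,1,0,0,0,0
```

On a host with the NIC present, the result could be passed straight through, for example `mlnx_qos -i ens8f0np0 --pfc "$(pfc_mask 3)"`.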

Step 4: Configure ETS Queues

ETS (Enhanced Transmission Selection) maps priorities to traffic classes and allocates bandwidth:

# Default state: all bandwidth on TC0
mlnx_qos -i ens8f0np0
# tc:  0    1    2    3    4    5    6    7
# bw:  100  0    0    0    0    0    0    0
# tsa: ets  str  str  str  str  str  str  str

# Configure: 70% best-effort (TC0), 30% RoCE (TC3)
mlnx_qos -i ens8f0np0 \
  --tc_bw 70,0,0,30,0,0,0,0 \
  --tsa ets,strict,strict,ets,strict,strict,strict,strict

# Verify
mlnx_qos -i ens8f0np0
# ETS/BW:
#   tc:  0   1   2   3   4   5   6   7
#   bw:  70  0   0   30  0   0   0   0   ✅
#   tsa: ets str str ets str str str str

ETS Scheduling Modes

| Mode | Behavior | Use For |
|---|---|---|
| ets | Guaranteed minimum bandwidth (can burst higher) | TC0 (best-effort), TC3 (RoCE) |
| strict | Always served first when queued (starves lower TCs) | Control plane traffic |
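Before applying a --tc_bw and --tsa pair, it can help to sanity-check the two lists offline. The sketch below assumes the rules implied by the examples in this recipe: eight entries per list, ets shares totalling 100, and strict classes carrying 0. `check_ets` is a hypothetical helper, and mlnx_qos itself may enforce different or additional constraints.

```shell
# check_ets BW_LIST TSA_LIST: return 0 if the pair looks consistent.
# Assumed rules: 8 entries each, ets shares sum to 100, strict entries are 0.
check_ets() {
  awk -v bw="$1" -v tsa="$2" 'BEGIN {
    nb = split(bw, b, ","); nt = split(tsa, t, ",")
    if (nb != 8 || nt != 8) {
      print "expected 8 comma-separated values in each list" > "/dev/stderr"
      exit 1
    }
    total = 0
    for (i = 1; i <= 8; i++) {
      if (t[i] == "ets") {
        total += b[i]
      } else if (t[i] == "strict") {
        if (b[i] != 0) {
          print "TC" i - 1 " is strict but has bandwidth " b[i] > "/dev/stderr"
          exit 1
        }
      } else {
        print "TC" i - 1 ": unknown mode " t[i] > "/dev/stderr"
        exit 1
      }
    }
    if (total != 100) {
      print "ets bandwidth sums to " total ", expected 100" > "/dev/stderr"
      exit 1
    }
  }'
}

check_ets "70,0,0,30,0,0,0,0" "ets,strict,strict,ets,strict,strict,strict,strict" \
  && echo "ETS lists look consistent"
```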

ETS Bandwidth Recommendations

| Workload | TC0 (Best Effort) | TC3 (RoCE) |
|---|---|---|
| AI Training (RDMA-heavy) | 20% | 80% |
| Mixed (RDMA + regular) | 50% | 50% |
| Mostly regular traffic | 70% | 30% |
| Storage (NFS-over-RDMA) | 60% | 40% |

# AI training cluster: 80% to RDMA
mlnx_qos -i ens8f0np0 \
  --tc_bw 20,0,0,80,0,0,0,0 \
  --tsa ets,strict,strict,ets,strict,strict,strict,strict
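Since every row in the recommendations table splits 100% between TC0 and TC3, the --tc_bw list can be derived from the RoCE share alone. A minimal sketch under that assumption (only TC0 and TC3 carry bandwidth); `tc_bw_for` is a hypothetical helper name.

```shell
# tc_bw_for ROCE_PERCENT: print a --tc_bw list giving TC3 the requested
# share and TC0 the remainder. Assumes only TC0 and TC3 use ets.
tc_bw_for() {
  case $1 in
    ''|*[!0-9]*) echo "RoCE share must be an integer in 1..99" >&2; return 1 ;;
  esac
  if [ "$1" -lt 1 ] || [ "$1" -gt 99 ]; then
    echo "RoCE share must be an integer in 1..99" >&2
    return 1
  fi
  echo "$(( 100 - $1 )),0,0,$1,0,0,0,0"
}

tc_bw_for 80    # prints 20,0,0,80,0,0,0,0
```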

Complete Configuration (All-in-One)

#!/bin/bash
# configure-roce-qos.sh: complete RoCE QoS setup for a Mellanox NIC
IFACE=${1:-ens8f0np0}

echo "=== Configuring RoCE QoS on $IFACE ==="

# 1. DSCP trust
echo "[1/5] Setting DSCP trust..."
mlnx_qos -i "$IFACE" --trust dscp

# 2. DSCP-to-priority mapping
echo "[2/5] Mapping DSCP 26 → priority 3..."
mlnx_qos -i "$IFACE" --dscp2prio set,26,3

# 3. PFC on priority 3
echo "[3/5] Enabling PFC on priority 3..."
mlnx_qos -i "$IFACE" --pfc 0,0,0,1,0,0,0,0

# 4. ETS bandwidth allocation
echo "[4/5] Configuring ETS: TC0=70%, TC3=30%..."
mlnx_qos -i "$IFACE" \
  --tc_bw 70,0,0,30,0,0,0,0 \
  --tsa ets,strict,strict,ets,strict,strict,strict,strict

# 5. ECN on TC3 (errors are ignored if these sysfs paths are absent)
echo "[5/5] Enabling ECN on TC3..."
{ echo 1 > /sys/class/net/"$IFACE"/ecn/roce_np/enable/3; } 2>/dev/null
{ echo 1 > /sys/class/net/"$IFACE"/ecn/roce_rp/enable/3; } 2>/dev/null

echo ""
echo "=== Verification ==="
mlnx_qos -i "$IFACE"

echo ""
echo "=== PFC Counters ==="
ethtool -S "$IFACE" | grep prio3

echo ""
echo "✅ RoCE QoS configured on $IFACE"

Apply to All Mellanox NICs

#!/bin/bash
# configure-all-nics.sh: apply to every Mellanox NIC on the node
for path in /sys/class/net/*; do
  iface=${path##*/}   # strip the directory to get the interface name
  if ethtool -i "$iface" 2>/dev/null | grep -q mlx5_core; then
    echo ">>> Configuring $iface"
    ./configure-roce-qos.sh "$iface"
    echo ""
  fi
done

Verification Cheat Sheet

# Full QoS state
mlnx_qos -i ens8f0np0

# Just PFC
mlnx_qos -i ens8f0np0 | grep -A2 "PFC"

# Just ETS
mlnx_qos -i ens8f0np0 | grep -A3 "tc:"

# DSCP mapping (part of the full output when trust is dscp)
mlnx_qos -i ens8f0np0 | grep -i -A8 "dscp2prio"

# PFC pause counters (non-zero = working)
ethtool -S ens8f0np0 | grep prio3_pause

# Drops (should be ZERO on priority 3)
ethtool -S ens8f0np0 | grep prio3_discard

# RDMA bandwidth test
# Server: ib_write_bw -d mlx5_0 --report_gbits
# Client: ib_write_bw -d mlx5_0 <server-ip> --report_gbits
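The "should be ZERO" check on priority 3 drops can be turned into something a script can act on. The sketch below sums every counter whose name contains prio3_discard from `ethtool -S` style "name: value" lines. `prio3_discards` is a hypothetical helper, and the sample input is invented for illustration, not captured from a real NIC.

```shell
# prio3_discards: read "name: value" counter lines on stdin and print the
# sum of all counters whose name contains prio3_discard.
prio3_discards() {
  awk -F: '/prio3_discard/ { sum += $2 } END { print sum + 0 }'
}

# Invented sample lines, so the example runs without a NIC:
printf '     rx_prio3_discards: 0\n     rx_prio3_pause: 1287\n' | prio3_discards
```

On a live node this could gate a workload rollout, for example `[ "$(ethtool -S ens8f0np0 | prio3_discards)" -eq 0 ] || echo "drops on priority 3" >&2`.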

The Flow: Packet Path Through ETS + PFC

flowchart TD
    APP["Application<br/>(NCCL all-reduce)"] -->|"RDMA write"| RDMA["RDMA Verbs<br/>(ibv_post_send)"]
    RDMA -->|"DSCP 26 (AF31)"| NIC["Mellanox NIC<br/>(mlx5_core)"]
    
    NIC -->|"DSCP trust"| CLASSIFY["Classify:<br/>DSCP 26 → Priority 3"]
    CLASSIFY -->|"ETS queue"| TC3["Traffic Class 3<br/>(30% bandwidth)"]
    TC3 -->|"Transmit"| WIRE["Wire<br/>(200Gb/s)"]
    
    WIRE --> SWITCH["Switch"]
    SWITCH -->|"Buffer filling"| PFC_CHECK{"Buffer > threshold?"}
    PFC_CHECK -->|"No"| FORWARD["Forward to<br/>destination NIC"]
    PFC_CHECK -->|"Yes"| PAUSE["Send PFC PAUSE<br/>for priority 3 only"]
    PAUSE -->|"NIC pauses TC3"| TC3

Common Issues

| Issue | Cause | Fix |
|---|---|---|
| Trust resets to PCP after reboot | mlnx_qos not persistent | Use systemd service or MachineConfig |
| DSCP 26 maps to wrong priority | Firmware default differs | mlnx_qos -i <iface> --dscp2prio set,26,3 explicitly |
| ETS shows all bandwidth on TC0 | tc_bw not set | mlnx_qos -i <iface> --tc_bw 70,0,0,30,0,0,0,0 |
| PFC counters all zero | Switch PFC not enabled | Configure PFC on switch port too |
| High pause counts + low throughput | PFC storm / congestion | Enable ECN, check switch buffers |
| mlnx_qos: No such device | Wrong interface name | Check ip link, use physical interface not bond |

Best Practices

  • Always configure all three: DSCP trust + PFC + ETS. Missing any one breaks lossless operation
  • PFC on priority 3 only: never enable it on all priorities
  • ETS bandwidth varies by workload: 80% RDMA for AI training, 50% for mixed traffic
  • Switch must match: PFC is hop-by-hop, so every device on the path must participate
  • Persist configuration: mlnx_qos settings are volatile, so use systemd or MachineConfig
  • ECN alongside PFC: reduces pause storm frequency
  • Verify with ib_write_bw: confirm line rate before deploying workloads
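For the persistence point in the list above, one option is a oneshot systemd unit that reruns the configuration script at boot. The sketch below only prints such a unit to stdout. The script path /usr/local/sbin/configure-roce-qos.sh, the unit name roce-qos.service, and the helper `render_roce_qos_unit` are all assumptions made for illustration, so adjust them to wherever the script actually lives.

```shell
# render_roce_qos_unit [SCRIPT] [IFACE]: print a oneshot unit that runs the
# QoS script at boot. Path and interface defaults are illustrative only.
render_roce_qos_unit() {
  local script=${1:-/usr/local/sbin/configure-roce-qos.sh}
  local iface=${2:-ens8f0np0}
  cat <<EOF
[Unit]
Description=RoCE QoS settings for ${iface}
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=${script} ${iface}

[Install]
WantedBy=multi-user.target
EOF
}

render_roce_qos_unit
```

On a host, the output could be redirected to /etc/systemd/system/roce-qos.service, followed by `systemctl daemon-reload` and `systemctl enable --now roce-qos.service`. On clusters managed through MachineConfig, the same unit text would be delivered that way instead.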

Key Takeaways

  • DSCP trust → NIC classifies by IP DSCP field (not VLAN PCP)
  • DSCP 26 (AF31) → priority 3 → the default RoCE mapping
  • PFC on priority 3 → lossless for RDMA, lossy for everything else
  • ETS queues → guaranteed bandwidth per traffic class (TC0 + TC3)
  • Three commands: --trust dscp, --pfc 0,0,0,1,0,0,0,0, --tc_bw 70,0,0,30,0,0,0,0
  • All three are required: skip one and RoCE performance degrades
#ets #pfc #dscp #mellanox #connectx #roce #rdma #qos
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
