Networking advanced ⏱ 40 minutes K8s 1.28+

Configure PFC with NMState on Kubernetes

Enable Priority Flow Control (PFC) for lossless RDMA using NMState and NodeNetworkConfigurationPolicy. Configure DSCP-to-priority mapping, ECN, and RoCEv2 QoS.

By Luca Berton • 📖 9 min read

💡 Quick Answer: Priority Flow Control (PFC) makes RoCEv2 lossless by pausing specific traffic classes instead of dropping packets. On OpenShift/Kubernetes, configure PFC via NodeNetworkConfigurationPolicy (NNCP) using the NMState operator. Enable PFC on priority 3 (the conventional RoCE priority), set jumbo frames (MTU 9000), and configure the switch to match. Without PFC, RoCE performance degrades sharply under congestion: RoCEv2 runs over UDP, so every dropped packet forces the NIC to retransmit at the RDMA transport layer, which destroys throughput.

The Problem

RoCEv2 (RDMA over Converged Ethernet) runs over standard Ethernet but requires lossless behavior for the RDMA traffic class. Regular Ethernet drops packets under congestion, which is fine for TCP but fatal for RDMA. PFC provides per-priority flow control: when a switch buffer fills for a specific priority, the switch sends a PAUSE frame for that priority to the sender, which stops transmitting that priority until the buffer drains. Other priorities keep flowing.
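
To give a sense of scale for the pause mechanism, here is a minimal Python sketch. It assumes the IEEE 802.1Qbb frame layout (an 8-bit class-enable vector plus one 16-bit timer per priority, counted in quanta of 512 bit times). The function names are illustrative and not part of any tool used in this recipe.

```python
# Sketch: how a PFC frame selects priorities and how long one pause can last.
# Assumes the IEEE 802.1Qbb layout: an 8-bit class-enable vector and a
# 16-bit timer per priority, measured in quanta of 512 bit times.

PAUSE_QUANTUM_BITS = 512


def class_enable_vector(priorities):
    """Return the 8-bit vector with one bit set per paused priority."""
    vector = 0
    for prio in priorities:
        if not 0 <= prio <= 7:
            raise ValueError(f"priority must be 0-7, got {prio}")
        vector |= 1 << prio
    return vector


def pause_duration_us(quanta, link_gbps):
    """Convert a pause timer value (in quanta) to microseconds."""
    bits = quanta * PAUSE_QUANTUM_BITS
    return bits / (link_gbps * 1e9) * 1e6


if __name__ == "__main__":
    print(f"class-enable vector for priority 3: {class_enable_vector([3]):#04x}")
    print(f"max pause at 100 Gb/s: {pause_duration_us(0xFFFF, 100):.1f} us")
```

Under these assumptions, even the largest timer value pauses a 100 Gb/s sender for only a few hundred microseconds, so a switch whose buffer stays full has to keep refreshing the pause.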

Without PFC, a busy Ethernet fabric drops RDMA packets → the NICs retransmit at the RDMA transport layer and NCCL collectives stall → distributed training throughput can drop 10-100× → GPUs sit idle waiting for the network.

flowchart TB
    subgraph WITHOUT["❌ Without PFC"]
        GPU1_A["GPU Node 1"] -->|"RDMA packets"| SW_A["Switch<br/>(buffer full)"]
        SW_A -->|"❌ DROP"| GPU2_A["GPU Node 2"]
        SW_A -.->|"Packet lost<br/>NCCL timeout"| GPU1_A
    end
    
    subgraph WITH["✅ With PFC"]
        GPU1_B["GPU Node 1"] -->|"RDMA packets<br/>(priority 3)"| SW_B["Switch<br/>(buffer full)"]
        SW_B -->|"PAUSE priority 3"| GPU1_B
        GPU1_B -.->|"Pauses sending<br/>(no drops)"| GPU1_B
        SW_B -->|"Buffer drains"| GPU2_B["GPU Node 2"]
        SW_B -->|"RESUME"| GPU1_B
    end

The Solution

Prerequisites

# NMState operator must be installed
# On OpenShift:
oc get pods -n openshift-nmstate
# NAME                                   READY
# nmstate-handler-xxxxx                  1/1
# nmstate-operator-xxxxx                 1/1

# Or install on vanilla Kubernetes:
kubectl apply -f https://github.com/nmstate/kubernetes-nmstate/releases/latest/download/nmstate.io_nmstates.yaml
kubectl apply -f https://github.com/nmstate/kubernetes-nmstate/releases/latest/download/namespace.yaml
kubectl apply -f https://github.com/nmstate/kubernetes-nmstate/releases/latest/download/service_account.yaml
kubectl apply -f https://github.com/nmstate/kubernetes-nmstate/releases/latest/download/role.yaml
kubectl apply -f https://github.com/nmstate/kubernetes-nmstate/releases/latest/download/role_binding.yaml
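kubectl apply -f https://github.com/nmstate/kubernetes-nmstate/releases/latest/download/operator.yaml

# Then create the NMState instance so the operator deploys the handler pods:
cat <<EOF | kubectl create -f -
apiVersion: nmstate.io/v1
kind: NMState
metadata:
  name: nmstate
EOF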

Enable PFC on RDMA Interfaces

# NodeNetworkConfigurationPolicy: enable PFC priority 3 for RoCE
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: pfc-roce-ens8f0
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
    feature.node.kubernetes.io/network-sriov.capable: "true"
  desiredState:
    interfaces:
      - name: ens8f0                   # RDMA-capable NIC
        type: ethernet
        state: up
        mtu: 9000                      # Jumbo frames (recommended for RDMA throughput)
        ethtool:
          pause:
            # Disable global pause (use PFC per-priority instead)
            rx: false
            tx: false
        # PFC configuration via NMState ieee-8021Qaz
        # (confirm that your NMState version exposes this key; see Common Issues)
        ieee-8021Qaz:
          pfc:
            enabled:
              - 3                      # Enable PFC on priority 3 (RoCE default)
            # Priorities 0-2, 4-7: no PFC (lossy, best-effort)

Full RoCE QoS Configuration

# Complete PFC + DSCP + Trust mode configuration
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: roce-qos-full
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
      - name: ens8f0
        type: ethernet
        state: up
        mtu: 9000
        ipv4:
          enabled: true
          dhcp: false
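          address:
            # Node-specific value: with a role-wide nodeSelector every worker
            # would receive this same address. Use one policy per node
            # (for example, select on kubernetes.io/hostname).
            - ip: 10.0.100.10
              prefix-length: 24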
        ethtool:
          pause:
            rx: false
            tx: false
          # Offload features (GRO/GSO). These do not enable ECN, which is
          # configured separately (see the ECN section).
          feature:
            rx-gro: true
            tx-generic-segmentation: true
        ieee-8021Qaz:
          pfc:
            enabled:
              - 3                      # Lossless priority for RoCE
          ets:
            # Enhanced Transmission Selection: bandwidth allocation
            traffic-classes:
              - priority: 0
                bandwidth: 10          # Best-effort: 10%
              - priority: 3
                bandwidth: 80          # RoCE RDMA: 80%
              - priority: 6
                bandwidth: 10          # Management: 10%
---
# Second NIC (dual-rail RDMA)
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: roce-qos-full-nic2
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
      - name: ens8f1
        type: ethernet
        state: up
        mtu: 9000
        ipv4:
          enabled: true
          dhcp: false
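          address:
            # Node-specific value: use one policy per node so that workers
            # do not all receive the same address.
            - ip: 10.0.101.10
              prefix-length: 24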
        ethtool:
          pause:
            rx: false
            tx: false
        ieee-8021Qaz:
          pfc:
            enabled:
              - 3
          ets:
            traffic-classes:
              - priority: 0
                bandwidth: 10
              - priority: 3
                bandwidth: 80
              - priority: 6
                bandwidth: 10
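
ETS bandwidth shares are percentages, so the shares in each policy should add up to 100 (10 + 80 + 10 above). As a rough illustration, the Python sketch below checks a plan before it is written into a policy. The dictionary layout mirrors the YAML in this recipe, and `check_qos_plan` is a hypothetical helper, not an NMState API.

```python
# Sketch: sanity-check an ETS/PFC plan before writing it into a policy.
# The data layout mirrors the YAML above; check_qos_plan is a hypothetical
# helper for illustration only.

def check_qos_plan(pfc_enabled, traffic_classes):
    """Return a list of problems found in the plan (empty if none)."""
    problems = []
    priorities = [tc["priority"] for tc in traffic_classes]
    for prio in list(pfc_enabled) + priorities:
        if not 0 <= prio <= 7:
            problems.append(f"priority {prio} is outside 0-7")
    if len(set(priorities)) != len(priorities):
        problems.append("a priority appears in more than one traffic class")
    total = sum(tc["bandwidth"] for tc in traffic_classes)
    if total != 100:
        problems.append(f"bandwidth shares sum to {total}, expected 100")
    for prio in pfc_enabled:
        if prio not in priorities:
            problems.append(f"lossless priority {prio} has no bandwidth share")
    return problems


if __name__ == "__main__":
    plan = [
        {"priority": 0, "bandwidth": 10},
        {"priority": 3, "bandwidth": 80},
        {"priority": 6, "bandwidth": 10},
    ]
    print(check_qos_plan([3], plan) or "plan looks consistent")
```

Running a check like this in CI is one way to catch a policy whose shares no longer sum to 100 after someone adds a traffic class.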

DSCP-to-Priority Mapping

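RoCEv2 deployments conventionally mark RDMA traffic with DSCP 26 (AF31). With the usual default mapping (priority = upper three bits of the 6-bit DSCP value, so 26 >> 3 = 3), that traffic lands on priority 3. The NIC must trust DSCP markings: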

# Configure DSCP trust mode via NMState
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: dscp-trust-roce
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
      - name: ens8f0
        type: ethernet
        state: up
        # For Mellanox ConnectX NICs, trust mode is set via sysfs
        # NMState doesn't directly expose trust mode yet;
        # use a MachineConfig or DaemonSet for this part

# DaemonSet to set trust mode + DSCP mapping on Mellanox NICs
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: roce-dscp-config
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: roce-dscp-config
  template:
    metadata:
      labels:
        app: roce-dscp-config
    spec:
      hostNetwork: true
      hostPID: true
      nodeSelector:
        feature.node.kubernetes.io/network-sriov.capable: "true"
      containers:
        - name: config
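          # Note: the image must provide mlnx_qos (NVIDIA/Mellanox OFED tools);
          # the stock ubi-minimal image does not include it.
          image: registry.access.redhat.com/ubi9/ubi-minimal:latest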
          command: ["/bin/bash", "-c"]
          args:
            - |
              for dev in ens8f0 ens8f1; do
                # Set trust mode to DSCP (not PCP)
                mlnx_qos -i "$dev" --trust dscp

                # Map DSCP 26 (AF31) to priority 3
                mlnx_qos -i "$dev" --dscp2prio set,26,3

                # Enable PFC on priority 3
                mlnx_qos -i "$dev" --pfc 0,0,0,1,0,0,0,0

                # Verify (errors are left visible so failures show up in the pod logs)
                echo "=== $dev ==="
                mlnx_qos -i "$dev"
              done
              
              echo "PFC + DSCP configured. Sleeping..."
              sleep infinity
          securityContext:
            privileged: true
          volumeMounts:
            - name: sys
              mountPath: /sys
      volumes:
        - name: sys
          hostPath:
            path: /sys
      tolerations:
        - operator: Exists

ECN (Explicit Congestion Notification)

PFC prevents packet drops but can cause head-of-line blocking. ECN marks packets instead of dropping them, allowing endpoints to react before PFC kicks in:

# Enable ECN on RoCE interfaces
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: ecn-roce
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
      - name: ens8f0
        type: ethernet
        state: up
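        # Note: RoCEv2 congestion control reacts to ECN marks on the NIC and
        # the switch (vendor-specific settings). The sysctl below enables
        # ECN for TCP traffic on the host only.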
---
# MachineConfig for ECN sysctl (OpenShift)
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-ecn-roce
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/sysctl.d/99-ecn-roce.conf
          mode: 0644
          contents:
            source: data:text/plain;charset=utf-8;base64,bmV0LmlwdjQudGNwX2Vjbj0xCg==
          # Decoded: net.ipv4.tcp_ecn=1
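
MachineConfig file contents are carried as a base64 data URL, which is easy to get wrong by hand. As a small illustration, the Python sketch below decodes the payload used above and rebuilds it from the plain text. The helper names are illustrative.

```python
import base64

# The data URL from the MachineConfig above.
DATA_URL = "data:text/plain;charset=utf-8;base64,bmV0LmlwdjQudGNwX2Vjbj0xCg=="


def decode_data_url(url):
    """Return the text carried by a base64 data: URL."""
    header, payload = url.split(",", 1)
    if not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    return base64.b64decode(payload).decode("utf-8")


def encode_data_url(text):
    """Build the data: URL for a text file such as a sysctl drop-in."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"data:text/plain;charset=utf-8;base64,{payload}"


if __name__ == "__main__":
    print(repr(decode_data_url(DATA_URL)))
```

The decoded text ends with a newline, which is what a sysctl drop-in file normally expects.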

Bonded Interface with PFC

# PFC on a bonded RDMA interface (dual-port for redundancy)
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: pfc-bond-roce
spec:
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  desiredState:
    interfaces:
      - name: bond-rdma
        type: bond
        state: up
        mtu: 9000
        ipv4:
          enabled: true
          dhcp: false
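          address:
            # Node-specific value: use one policy per node so that workers
            # do not all receive the same address.
            - ip: 10.0.100.10
              prefix-length: 24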
        link-aggregation:
          mode: 802.3ad              # LACP
          options:
            miimon: "100"
            xmit_hash_policy: layer3+4
          port:
            - ens8f0
            - ens8f1
        ieee-8021Qaz:
          pfc:
            enabled:
              - 3
          ets:
            traffic-classes:
              - priority: 3
                bandwidth: 80
              - priority: 0
                bandwidth: 20
      # Member interfaces also need PFC
      - name: ens8f0
        type: ethernet
        state: up
        mtu: 9000
        ieee-8021Qaz:
          pfc:
            enabled:
              - 3
      - name: ens8f1
        type: ethernet
        state: up
        mtu: 9000
        ieee-8021Qaz:
          pfc:
            enabled:
              - 3

Switch-Side Configuration (Must Match)

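PFC must be configured end-to-end: NIC ↔ Switch ↔ NIC. The switch must enable PFC on the same priority. The snippets below are illustrative; exact syntax varies by platform and OS release, so check them against the vendor documentation: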

! Cisco Nexus example
interface Ethernet1/1-48
  mtu 9216
  priority-flow-control mode on
  priority-flow-control priority 3 no-drop
  
! Mellanox/NVIDIA Spectrum
interface ethernet 1/1-48
  dcb priority-flow-control enable force
  dcb priority-flow-control priority 3 enable
  dcb ets traffic-class 3 bandwidth 80
  
! Arista
interface Ethernet1-48
  priority-flow-control on
  priority-flow-control priority 3 no-drop

Verify PFC is Working

# 1. Check PFC status on NIC
mlnx_qos -i ens8f0
# Expected output:
# PFC configuration:
#   priority:  0  1  2  3  4  5  6  7
#   enabled:   0  0  0  1  0  0  0  0
#
# tc: 0 ratelimit: unlimited, tsa: vendor
# tc: 3 ratelimit: unlimited, tsa: ets, bw: 80%

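# 2. Check PFC counters (should see pause frames)
#    Counter names vary by driver, so match both "pfc" and "pause"
#    (for example, mlx5 typically reports rx_prio3_pause / tx_prio3_pause)
ethtool -S ens8f0 | grep -iE "pfc|pause"
# rx_pfc_pri3_packets: 1247
# tx_pfc_pri3_packets: 983
# If counters increase during training → PFC is actively preventing drops ✅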

# 3. Check for PFC storms (too many pause frames = problem)
watch -n 1 'ethtool -S ens8f0 | grep -i "pfc\|pause"'

# 4. Verify via NMState
kubectl get nnce <node-name>.pfc-roce-ens8f0 -o yaml | grep -A10 ieee-8021Qaz

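# 5. Verify that NCCL uses the RDMA transport
#    Launch on every node with torchrun so rank and rendezvous variables are set
cat > allreduce_check.py <<'EOF'
import os, torch, torch.distributed as dist
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
dist.init_process_group(backend="nccl")   # rank/world size come from torchrun
x = torch.ones(1 << 24, device="cuda")
dist.all_reduce(x)                        # sums x across all ranks
torch.cuda.synchronize()
dist.destroy_process_group()
EOF
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=NET torchrun <launch-args> allreduce_check.py 2>&1 | grep -E "NET/IB|GDRDMA"
# NET/IB = RDMA active ✅
# NCCL does not log RDMA retransmissions; check the NIC counters (ethtool -S) for drops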
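
To tell occasional pauses from a PFC storm, what matters is how fast the counters grow rather than their absolute values. The Python sketch below parses two `ethtool -S` samples and reports the growth per counter. It reuses the counter names from the example output above, which are an assumption (names differ between drivers), and the helper names are illustrative.

```python
import re

# Matches lines such as "rx_pfc_pri3_packets: 1247" or "tx_prio3_pause: 983".
COUNTER_LINE = re.compile(r"^\s*(\S*(?:pfc|pause)\S*):\s*(\d+)\s*$", re.IGNORECASE)


def parse_pause_counters(ethtool_output):
    """Extract PFC/pause counters from `ethtool -S` text into a dict."""
    counters = {}
    for line in ethtool_output.splitlines():
        match = COUNTER_LINE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counter_deltas(before, after):
    """Return how much each counter grew between two samples."""
    return {name: after[name] - before.get(name, 0) for name in after}


if __name__ == "__main__":
    # Hypothetical samples taken a few seconds apart.
    first = parse_pause_counters(
        "rx_pfc_pri3_packets: 1247\ntx_pfc_pri3_packets: 983\nrx_bytes: 10"
    )
    second = parse_pause_counters(
        "rx_pfc_pri3_packets: 1300\ntx_pfc_pri3_packets: 983\nrx_bytes: 99"
    )
    print(counter_deltas(first, second))
```

In practice the two samples would come from running `ethtool -S` on the node at a fixed interval; a delta that keeps climbing on every sample is the pattern to investigate.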

NMState NodeNetworkState: Read Current Config

# Check current PFC state on a node
kubectl get nns <node-name> -o yaml | grep -A20 ieee-8021Qaz

# Or for all nodes
for node in $(kubectl get nodes -l node-role.kubernetes.io/worker -o name); do
  echo "=== ${node} ==="
  kubectl get nns ${node#node/} -o jsonpath='{.status.currentState.interfaces[*].ieee-8021Qaz}' 2>/dev/null
  echo
done

PFC Priority Mapping Reference

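| Priority | Typical Use | PFC Enabled? |
|----------|-------------|--------------|
| 0 | Best-effort (default) | No (lossy) |
| 1 | Background | No |
| 2 | Excellent effort | No |
| 3 | RoCEv2 / RDMA | Yes (lossless) |
| 4 | Video streaming | No |
| 5 | Voice | Optional |
| 6 | Internetwork control | No |
| 7 | Network control (highest priority) | No |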

Standard mapping: RoCEv2 default DSCP 26 (AF31) → Priority 3 → PFC enabled on priority 3.
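
The DSCP 26 → priority 3 step is plain bit arithmetic. The Python sketch below assumes the common default rule, where the priority is the upper three bits of the 6-bit DSCP value; a NIC or switch with a custom DSCP-to-priority table can differ. The function names are illustrative.

```python
# Sketch: default DSCP-to-priority arithmetic (assumes priority = top 3 bits
# of the 6-bit DSCP value; custom mapping tables can override this).

def dscp_to_priority(dscp):
    """Return the priority (0-7) a DSCP value maps to under the default rule."""
    if not 0 <= dscp <= 63:
        raise ValueError(f"DSCP must be 0-63, got {dscp}")
    return dscp >> 3


def dscp_to_tos(dscp, ecn=0):
    """Return the IP ToS / traffic-class byte: DSCP in the upper 6 bits, ECN in the lower 2."""
    if not 0 <= dscp <= 63:
        raise ValueError(f"DSCP must be 0-63, got {dscp}")
    if not 0 <= ecn <= 3:
        raise ValueError(f"ECN must be 0-3, got {ecn}")
    return (dscp << 2) | ecn


if __name__ == "__main__":
    print("DSCP 26 -> priority", dscp_to_priority(26))
    print("DSCP 26 -> ToS byte", dscp_to_tos(26))
```

The ToS form is included because some tools take the full traffic-class byte rather than the DSCP value.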

Common Issues

| Issue | Cause | Fix |
|-------|-------|-----|
| PFC counters all zero | PFC not negotiated with switch | Verify switch-side PFC config matches priority 3 |
| PFC storm (continuous pauses) | Slow receiver or buffer misconfigured | Check ECN, increase switch buffer, verify no asymmetric links |
| RDMA works but slow | PFC disabled, packets dropping silently | Enable PFC, check ethtool -S for drops |
| ieee-8021Qaz not in NMState | NMState version too old | Upgrade NMState operator to ≥ 2.2 |
| NNCP stuck in Progressing | NIC doesn't support DCB/PFC | Verify ethtool -i ens8f0 shows mlx5_core driver |
| Trust mode wrong | NIC trusting PCP instead of DSCP | Run mlnx_qos -i ens8f0 --trust dscp |

Best Practices

  • Enable PFC on priority 3 only: making all priorities lossless wastes buffer and risks PFC storms
  • Always configure the switch to match: PFC is end-to-end, both sides must agree
  • Use ECN alongside PFC: ECN reacts before buffers fill, reducing PFC frequency
  • Set jumbo frames on both NIC and switch (MTU 9000 on the NIC, at least as large on the switch): they matter for RDMA throughput
  • Monitor PFC counters: some pauses are normal, continuous pauses indicate a problem
  • Trust DSCP, not PCP: DSCP is preserved end-to-end across routed networks
  • Test with ib_write_bw under congestion: PFC should maintain lossless behavior even under load
  • One lossless priority per network: don't enable PFC on multiple priorities unless you need FCoE + RDMA

Key Takeaways

  • PFC makes RoCEv2 lossless by pausing specific traffic priorities instead of dropping
  • Configure via NNCP (NMState) on Kubernetes/OpenShift with ieee-8021Qaz.pfc.enabled: [3]
  • Default RoCE mapping: DSCP 26 → Priority 3 → PFC lossless
  • Switch must match: enable no-drop on priority 3 and allocate 80% bandwidth
  • Use mlnx_qos -i <dev> and ethtool -S <dev> | grep -iE "pfc|pause" to verify
  • Without PFC, RoCE drops packets under congestion → NCCL throughput collapses
  • ECN + PFC together give the best performance: ECN reacts early, PFC is the safety net
#pfc #nmstate #nncp #rdma #roce #lossless-networking
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
