Shared Memory Transport for NCCL Intra-Node GPU
Configure NCCL shared memory (SHM) transport for intra-node GPU communication on Kubernetes. Covers /dev/shm sizing with emptyDir and NVLink/PCIe P2P paths.
💡 Quick Answer: Mount a 16Gi Memory-backed emptyDir at /dev/shm for NCCL shared memory transport. Keep NCCL_SHM_DISABLE=0 (enabled) for intra-node GPU communication when NVLink is unavailable. NCCL uses SHM as a CPU-mediated fallback between GPUs on the same node that lack direct P2P paths.
The Problem
- Default /dev/shm in containers is 64MB, which is too small for NCCL buffers
- NCCL shared memory transport needs a large tmpfs for inter-GPU staging
- Kubernetes default emptyDir uses disk, not memory, which is too slow for NCCL
- Need to size SHM correctly for GPU count and message sizes
- Must understand when SHM is used vs NVLink vs PCIe P2P
The Solution
Pod Volume Configuration
volumes:
  - name: dshm
    emptyDir:
      medium: Memory    # tmpfs (RAM-backed, not disk)
      sizeLimit: 16Gi   # Size for NCCL buffers
containers:
  - name: worker
    volumeMounts:
      - name: dshm
        mountPath: /dev/shm
When NCCL Uses Each Transport
Transport Path      │ When Used                           │ Bandwidth
────────────────────┼─────────────────────────────────────┼──────────────
NVLink (P2P/CUMEM)  │ GPUs connected via NVLink/NVSwitch  │ 600-900 GB/s
PCIe P2P            │ GPUs on same PCIe switch, no NVLink │ 20-30 GB/s
SHM (shared memory) │ Same node, no direct P2P path       │ 10-20 GB/s
NET/IB/GDRDMA       │ Cross-node with GPUDirect RDMA      │ 35-45 GB/s
NET/IB (no GDR)     │ Cross-node, CPU bounce buffer       │ 12-15 GB/s
NET/Socket          │ Cross-node, TCP fallback            │ 2-5 GB/s
────────────────────┴─────────────────────────────────────┴──────────────
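Read top to bottom, the intra-node rows of the table amount to a preference order. The sketch below encodes that order as a shell function. `pick_intranode_transport` and its arguments are hypothetical names used only for illustration; NCCL makes this choice itself from the topology it detects, not through any such helper.

```shell
# Illustrative only: mirrors the preference order in the table above.
# pick_intranode_transport is a hypothetical helper, not part of NCCL.
pick_intranode_transport() {
  nvlink="$1"             # "yes" if the GPU pair is linked by NVLink/NVSwitch
  pcie_p2p="$2"           # "yes" if PCIe peer-to-peer works for the pair
  shm_disable="${3:-0}"   # value of NCCL_SHM_DISABLE (0 = SHM enabled)

  if [ "$nvlink" = "yes" ]; then
    echo "NVLink"
  elif [ "$pcie_p2p" = "yes" ]; then
    echo "PCIe P2P"
  elif [ "$shm_disable" = "0" ]; then
    echo "SHM"
  else
    echo "NET"            # SHM disabled: the pair falls back to network transport
  fi
}
```

For example, `pick_intranode_transport no no` prints `SHM`, while `pick_intranode_transport no no 1` prints `NET`.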
SHM is used for intra-node communication when:
- NVLink is not available between the GPU pair
- PCIe peer-to-peer is disabled or unsupported
- Both ranks are on the same physical node
NCCL SHM Environment Variables
env:
  # Keep SHM enabled (default)
  - name: NCCL_SHM_DISABLE
    value: "0"   # 0=enabled, 1=disabled
  # Disable collective network (SHARP); not available on most clusters
  - name: NCCL_COLLNET_ENABLE
    value: "0"   # 0=disabled (no SHARP hardware)
Sizing Guide
GPUs per Node  │ Recommended /dev/shm │ Rationale
───────────────┼──────────────────────┼───────────────────────────────
2              │ 8Gi                  │ Single SHM buffer pair
4              │ 16Gi                 │ Multiple concurrent transfers
8              │ 32Gi                 │ Ring/tree allreduce staging
16 (multi-NIC) │ 64Gi                 │ Full NVSwitch + fallback paths
───────────────┴──────────────────────┴───────────────────────────────
Formula: ~2-4 GB per GPU for staging buffers
Safety margin: 2× formula for concurrent collectives
Verifying SHM Usage
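To check a mount against the sizing guide, it can help to compute the target first and then compare it with what `df` reports. The sketch below assumes the lower end of the formula (2 GB per GPU) and the 2× safety margin, which reproduces the table values. Both function names are hypothetical.

```shell
# Target /dev/shm size in GiB: GPUs x per-GPU staging x safety margin.
# Defaults (2 GiB per GPU, 2x margin) are assumptions taken from the sizing guide.
recommended_shm_gib() {
  gpus="$1"
  per_gpu_gib="${2:-2}"
  safety="${3:-2}"
  echo $(( gpus * per_gpu_gib * safety ))
}

# Succeeds (exit 0) if the filesystem holding PATH is at least MIN_GIB in size.
shm_at_least_gib() {
  path="$1"
  min_gib="$2"
  # df -P prints one data line per filesystem; column 2 is the size in KiB with -k.
  size_kib=$(df -Pk "$path" | awk 'NR==2 {print $2}')
  [ "$size_kib" -ge $(( min_gib * 1024 * 1024 )) ]
}
```

Inside a 4-GPU pod, a check might look like `shm_at_least_gib /dev/shm "$(recommended_shm_gib 4)" || echo "/dev/shm is smaller than recommended"`.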
# Inside the pod, check /dev/shm size:
df -h /dev/shm
# Expected: tmpfs 16G 0 16G 0% /dev/shm
# During NCCL test, monitor SHM usage:
watch -n1 'du -sh /dev/shm/'
# Active test: may show 1-4 GB used
# In NCCL debug logs, SHM transport appears as:
# NCCL INFO Channel 0/0 : 0[0] -> 1[1] [send] via SHM/direct
# "SHM/direct" = shared memory transport between co-located GPUs
When to Disable SHM
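Before changing `NCCL_SHM_DISABLE`, it is worth confirming whether NCCL is using SHM at all. A minimal sketch, assuming the job was run with `NCCL_DEBUG=INFO` and that SHM channels appear in the log with the `via SHM` marker shown in the sample line above; `count_shm_channels` is a hypothetical helper name.

```shell
# Counts NCCL debug log lines that report a channel set up over SHM.
# Reads the log on stdin; prints 0 when no SHM channel is found.
count_shm_channels() {
  # grep -c exits non-zero on zero matches, so "|| true" keeps the exit status clean.
  grep -c 'via SHM' || true
}
```

Usage: `count_shm_channels < nccl-debug.log`, where `nccl-debug.log` is a placeholder for wherever the job's debug output was saved. A count of 0 suggests every GPU pair already has a P2P path, in which case disabling SHM should change nothing.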
# Disable SHM if you see shared-memory errors:
export NCCL_SHM_DISABLE=1
# Scenarios for disabling:
# - Pod crashed with "Bus error" (SHM too small)
# - "mmap failed" errors in NCCL logs
# - All GPUs have NVLink (SHM unnecessary, NVLink is faster)
# - Debugging to isolate network vs. local issues
# Note: With SHM disabled, GPU pairs that lack a P2P path fall back to network transport
# even for intra-node communication (wasteful, but sometimes needed for debugging)
Common Issues
"Bus error" or SIGBUS during NCCL test
- Cause: /dev/shm too small; NCCL exceeded the tmpfs limit
- Fix: Increase sizeLimit in the emptyDir volume (for example, from 16Gi to 32Gi)
/dev/shm shows only 64MB
- Cause: Default Docker/containerd SHM size; volume not mounted
- Fix: Add an explicit emptyDir volume mount at /dev/shm with medium: Memory
SHM not used despite same-node GPUs
- Cause: NVLink available (preferred) or P2P active
- Fix: Not an issue. NVLink/P2P is faster; SHM is the fallback.
Pod evicted for memory pressure
- Cause: Memory-backed emptyDir counts against pod memory limit
- Fix: Ensure pod memory limit includes SHM size (e.g., 32Gi limit if 16Gi SHM + 16Gi app)
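The arithmetic behind the last fix is simply application memory plus the /dev/shm sizeLimit. A small sketch, with `pod_memory_limit_gib` as a hypothetical helper name and whole-GiB inputs assumed:

```shell
# Pod memory limit = application memory + Memory-backed /dev/shm sizeLimit.
# Both inputs are whole GiB; the result is printed as a Kubernetes quantity.
pod_memory_limit_gib() {
  app_gib="$1"
  shm_gib="$2"
  echo "$(( app_gib + shm_gib ))Gi"
}
```

`pod_memory_limit_gib 16 16` prints `32Gi`, matching the example above. The result is a floor; leave extra headroom if the application's own memory use is not well known.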
Best Practices
- Always mount /dev/shm with Memory medium; the 64MB default is too small for NCCL buffers
- Size at 2-4 GB per GPU minimum; 16Gi covers most 2-4 GPU configurations
- Keep NCCL_SHM_DISABLE=0 unless debugging specific SHM errors
- Account for SHM in memory limits; tmpfs counts against cgroup memory
- Set NCCL_COLLNET_ENABLE=0 unless you have Mellanox SHARP hardware
- Monitor with du -sh /dev/shm during tests to right-size allocation
Key Takeaways
- /dev/shm must be explicitly mounted as a Memory-backed emptyDir in Kubernetes
- Default 64MB container SHM is insufficient; use 16Gi+ for GPU workloads
- NCCL uses SHM for intra-node communication when NVLink/P2P is unavailable; it's the CPU fallback
- Memory-backed emptyDir counts against pod memory limit (plan accordingly)
- NCCL_SHM_DISABLE=0 keeps SHM enabled; disable only for debugging
- NCCL_COLLNET_ENABLE=0 disables SHARP (not available on most clusters)

