Deploy NVIDIA Nemotron 120B MoE on K8s
Deploy NVIDIA Nemotron-3-Super-120B-A12B MoE model on Kubernetes. 120B total parameters with 12B active for enterprise-grade inference.
π‘ Quick Answer: Deploy NVIDIA Nemotron-3-Super-120B-A12B with vLLM using
--tensor-parallel-size 2on 2x A100 80GB. A MoE model with 120B total parameters and 12B active per token β enterprise-grade quality at efficient inference cost. Built by NVIDIA for production AI workloads.
The Problem
Enterprise AI deployments need:
- Frontier-quality reasoning β matching or exceeding GPT-4 on enterprise benchmarks
- Efficient inference β canβt justify 8-GPU clusters for every deployment
- NVIDIA-optimized β best performance on NVIDIA hardware with native TensorRT-LLM support
- Open weights β no API dependency, run on your own infrastructure
Nemotron-3-Super-120B-A12B delivers GPT-4-class quality with only 12B parameters active per token.
The Solution
Step 1: Deploy with vLLM
apiVersion: apps/v1
kind: Deployment
metadata:
name: nemotron-120b
namespace: ai-inference
labels:
app: nemotron-120b
spec:
replicas: 1
selector:
matchLabels:
app: nemotron-120b
template:
metadata:
labels:
app: nemotron-120b
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"
- "--tensor-parallel-size"
- "2"
- "--max-model-len"
- "32768"
- "--gpu-memory-utilization"
- "0.92"
- "--max-num-seqs"
- "32"
- "--enable-chunked-prefill"
- "--trust-remote-code"
- "--dtype"
- "bfloat16"
- "--port"
- "8000"
ports:
- containerPort: 8000
name: http
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: huggingface-token
key: token
- name: NCCL_DEBUG
value: "WARN"
resources:
limits:
nvidia.com/gpu: "2"
memory: 128Gi
cpu: "16"
requests:
memory: 96Gi
cpu: "8"
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
startupProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 300
periodSeconds: 30
failureThreshold: 30
readinessProbe:
httpGet:
path: /health
port: 8000
periodSeconds: 15
livenessProbe:
httpGet:
path: /health
port: 8000
periodSeconds: 30
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: nemotron-model-cache
- name: shm
emptyDir:
medium: Memory
sizeLimit: 16Gi
terminationGracePeriodSeconds: 120
---
apiVersion: v1
kind: Service
metadata:
name: nemotron-120b
namespace: ai-inference
spec:
selector:
app: nemotron-120b
ports:
- port: 8000
targetPort: 8000
name: httpStep 2: FP8 on H100 (Single GPU Possible)
apiVersion: apps/v1
kind: Deployment
metadata:
name: nemotron-120b-fp8
namespace: ai-inference
spec:
replicas: 1
selector:
matchLabels:
app: nemotron-120b-fp8
template:
metadata:
labels:
app: nemotron-120b-fp8
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16"
- "--quantization"
- "fp8"
- "--tensor-parallel-size"
- "1"
- "--max-model-len"
- "16384"
- "--gpu-memory-utilization"
- "0.95"
- "--max-num-seqs"
- "16"
- "--trust-remote-code"
resources:
limits:
nvidia.com/gpu: "1"
memory: 96Gi
cpu: "8"
nodeSelector:
nvidia.com/gpu.product: "H100-SXM"Step 3: Test Enterprise Workloads
# Complex reasoning task
kubectl run test-nemotron --rm -it --image=curlimages/curl -- \
curl -s http://nemotron-120b:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16",
"messages": [
{"role": "system", "content": "You are an enterprise Kubernetes architect."},
{"role": "user", "content": "Design a multi-region, active-active Kubernetes platform for a financial services company with requirements: <100ms API latency, 99.99% uptime, SOC 2 compliance, and support for 10,000 microservices. Include networking, security, observability, and disaster recovery architecture."}
],
"max_tokens": 4096,
"temperature": 0.3
}'NVIDIA MoE Model Landscape
| Model | Total | Active | GPUs (BF16) | Use Case |
|-----------------------------|--------|--------|-------------|---------------------|
| Nemotron-3-Super-120B-A12B | 124B | 12B | 2x A100 80GB| Enterprise, reasoning|
| Qwen3-235B-A22B | 235B | 22B | 4x A100 80GB| Frontier, thinking |
| Qwen3.5-35B-A3B | 36B | 3B | 1x A100 40GB| Efficient multimodal|
| Mixtral-8x7B | 47B | 13B | 1x A100 80GB| General purpose |flowchart TD
A[Enterprise Request] --> B[Nemotron 120B MoE]
B --> C{Router Layer}
C --> D[Expert Group 1]
C --> E[Expert Group 2]
C --> F[Expert Group N]
D --> G[12B active compute]
G --> H[Response]
subgraph 2x A100 80GB
I[60B weights per GPU] --> C
end
J[120B total knowledge] -.-> D
J -.-> E
J -.-> F
K[Only 12B active per token] -.-> GCommon Issues
Model loading takes 15+ minutes
# 120B model in BF16 is ~240GB β loading from network storage is slow
# Use NVMe-backed PVC
startupProbe:
failureThreshold: 30 # 30 Γ 30s = 15 minutes
periodSeconds: 30NCCL errors with tensor parallelism
env:
- name: NCCL_SOCKET_IFNAME
value: "eth0"
- name: NCCL_DEBUG
value: "INFO" # Change to INFO for debugging
# Ensure /dev/shm is large enough (16Gi for 2 GPUs)BF16 vs FP16
# Nemotron is trained in BF16 β use --dtype bfloat16
# FP16 may cause numerical issues with large MoE models
# BF16 has wider dynamic range, better for MoE routingBest Practices
- BF16 precision β Nemotron is trained in BF16, use
--dtype bfloat16 - 2x A100 80GB minimum β all 120B parameters must fit in VRAM
- FP8 on H100 β potentially fits on single H100 80GB with reduced context
- NVIDIA ecosystem β pairs well with TensorRT-LLM, NIM, and Triton
- Enterprise use cases β designed for complex reasoning, code gen, and analysis
- Large
/dev/shmβ 16Gi for NCCL multi-GPU communication
Key Takeaways
- Nemotron-3-Super-120B-A12B: 120B total, 12B active β enterprise MoE by NVIDIA
- Requires 2x A100 80GB (BF16) or potentially 1x H100 (FP8)
- Built for enterprise workloads β complex reasoning, architecture design, code generation
- NVIDIA-optimized β best performance on NVIDIA GPUs with native TensorRT-LLM support
- MoE architecture: 10x the knowledge of a 12B dense model at similar inference cost
- Use
--dtype bfloat16β trained in BF16, donβt convert to FP16

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
