NIM Model Profiles and Selection on Kubernetes
Configure NIM_MODEL_PROFILE for NVIDIA NIM deployments on Kubernetes. List profiles, select by ID or name, tune VRAM, and override with vLLM CLI args.
💡 Quick Answer: Set NIM_MODEL_PROFILE in your Pod spec to select a specific inference profile (precision, tensor parallelism, LoRA). Use list-model-profiles to discover available profiles and their VRAM requirements. If the variable is unset, NIM auto-selects the best profile for your GPU hardware.
The Problem
NVIDIA NIM containers ship with multiple model profiles: different combinations of precision (bf16, fp8, nvfp4), tensor parallelism (TP), pipeline parallelism (PP), and LoRA support. Choosing the wrong profile wastes GPU memory, fails to start, or leaves performance on the table. You need to understand how profile selection works and how to pin the right profile for your Kubernetes deployment.
flowchart TB
START["NIM Container Starts"] --> CHECK{"NIM_MODEL_PROFILE<br/>set?"}
CHECK -->|"Not set"| AUTO["Memory-Aware Selector<br/>Estimates VRAM per profile"]
CHECK -->|"= 'default'"| DEFAULT["Intelligent Default<br/>Backend priority + HW compat"]
CHECK -->|"= profile ID"| EXACT["Exact Match<br/>by 64-char checksum"]
CHECK -->|"= friendly name"| NAME["Match by Description<br/>e.g. vllm-bf16-tp2-pp1"]
AUTO --> CLASSIFY["Classify Profiles"]
DEFAULT --> CLASSIFY
EXACT --> DOWNLOAD["Download Model Files"]
NAME --> DOWNLOAD
CLASSIFY --> COMPAT["✅ Compatible<br/>VRAM fits"]
CLASSIFY --> LOW["⚠️ Low Memory<br/>Reduce max-model-len"]
CLASSIFY --> INCOMPAT["❌ Incompatible<br/>Weights exceed VRAM"]
COMPAT --> DOWNLOAD
LOW -->|"--max-model-len"| DOWNLOAD
DOWNLOAD --> LAUNCH["Launch Inference Backend"]
The Solution
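The flow in the diagram comes down to two small decisions: how the NIM_MODEL_PROFILE value is interpreted, and how a candidate profile is classified against the available VRAM. The Python sketch below mirrors the diagram only and is not NIM's actual selector. The function names, the treatment of an empty string as "not set", and the three memory inputs are hypothetical and chosen for illustration.

```python
import re
from typing import Optional

# A profile ID is described in the diagram as a 64-character checksum;
# this sketch assumes it is hexadecimal.
PROFILE_ID = re.compile(r"[0-9a-f]{64}")


def selection_mode(value: Optional[str]) -> str:
    """Return which branch of the diagram a NIM_MODEL_PROFILE value takes."""
    if not value:  # unset (None) or empty: assumed to mean "not set"
        return "auto"
    if value == "default":
        return "default"
    if PROFILE_ID.fullmatch(value.lower()):
        return "profile-id"
    return "friendly-name"


def classify(weights_gb: float, full_context_gb: float, vram_gb: float) -> str:
    """Classify one profile the way the diagram's three buckets describe.

    weights_gb:      memory needed for the weights alone (hypothetical input)
    full_context_gb: memory needed at the default max-model-len (hypothetical)
    vram_gb:         VRAM available per GPU
    """
    if weights_gb > vram_gb:
        return "incompatible"  # weights exceed VRAM
    if full_context_gb > vram_gb:
        return "low-memory"    # fits only with a reduced --max-model-len
    return "compatible"        # VRAM fits


if __name__ == "__main__":
    print(selection_mode(None))
    print(selection_mode("vllm-bf16-tp2-pp1"))
    print(classify(weights_gb=16, full_context_gb=30, vram_gb=24))
```

Only the "auto" and "default" branches pass through classification in the diagram; an explicit ID or name goes straight to download, which is why a pinned profile that does not fit tends to fail at startup rather than being silently swapped.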
Profile Naming Convention
NIM profiles follow a consistent naming pattern:
vllm-<precision>-tp<N>-pp<M>[-lora]
| Component | Values | Meaning |
|---|---|---|
| Backend | vllm | vLLM inference engine |
| Precision | bf16, fp8, mxfp4, nvfp4 | Numeric precision / quantization format |
| tp<N> | tp1, tp2, tp4, tp8 | Tensor parallelism (GPU count) |
| pp<M> | pp1, pp2 | Pipeline parallelism stages |
| -lora | optional | LoRA adapter support enabled (shown as -feat_lora in the sample list-model-profiles output) |
Examples:
- vllm-bf16-tp1-pp1: BF16 on 1 GPU, no LoRA
- vllm-fp8-tp4-pp1-lora: FP8 quantized across 4 GPUs with LoRA
- vllm-bf16-tp8-pp2: BF16 with 8-way tensor parallelism and 2 pipeline stages (16 GPUs, multinode)
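The naming pattern is regular enough to parse mechanically, which can help when validating a manifest before it reaches the cluster. The sketch below is a minimal parser, assuming names follow exactly the pattern described above. The ProfileName class and parse_profile_name function are hypothetical helpers, not part of NIM, and the regex accepts both the -lora and -feat_lora suffixes because both spellings appear in this article.

```python
import re
from dataclasses import dataclass

# <backend>-<precision>-tp<N>-pp<M>[-lora | -feat_lora]
_PATTERN = re.compile(
    r"^(?P<backend>[a-z_]+)-(?P<precision>[a-z0-9]+)"
    r"-tp(?P<tp>\d+)-pp(?P<pp>\d+)(?P<lora>-(?:feat_)?lora)?$"
)


@dataclass(frozen=True)
class ProfileName:
    backend: str
    precision: str
    tp: int
    pp: int
    lora: bool

    @property
    def gpus(self) -> int:
        # Total GPUs = tensor-parallel size x pipeline stages
        return self.tp * self.pp


def parse_profile_name(name: str) -> ProfileName:
    """Split a friendly profile name into its components."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised profile name: {name!r}")
    return ProfileName(
        backend=match["backend"],
        precision=match["precision"],
        tp=int(match["tp"]),
        pp=int(match["pp"]),
        lora=match["lora"] is not None,
    )


if __name__ == "__main__":
    print(parse_profile_name("vllm-fp8-tp4-pp1-lora"))
    print(parse_profile_name("vllm-bf16-tp8-pp2").gpus)  # 16
```

A 64-character profile ID will not match this pattern, so a caller would need to check for IDs separately before attempting to parse a name.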
List Available Profiles
Before deploying, discover which profiles your NIM container supports:
# Job to list profiles on your cluster's GPUs
apiVersion: batch/v1
kind: Job
metadata:
  name: nim-list-profiles
  namespace: nim
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: list-profiles
          image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.7.3
          command: ["list-model-profiles"]
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: nim-cache
              mountPath: /opt/nim/.cache
      volumes:
        - name: nim-cache
          persistentVolumeClaim:
            claimName: nim-cache-pvc
# Check the output
kubectl logs job/nim-list-profiles -n nim
Example output:
MODEL PROFILES
- Compatible with system and runnable:
  - dcec66a5... (vllm-bf16-tp1-pp1) [requires >=18 GB/gpu]
  - With LoRA support:
    - d66193b8... (vllm-bf16-tp1-pp1-feat_lora) [requires >=22 GB/gpu]
- Compatible with system but low memory:
  - a1b2c3d4... (vllm-bf16-tp1-pp1) [requires >=45 GB/gpu, try --max-model-len=4096 to reduce to >=30 GB/gpu]
- Incompatible with system:
  - 27af459c... (vllm-bf16-tp2-pp1)
Deploy with Automatic Profile Selection
When you don't set NIM_MODEL_PROFILE, NIM picks the best compatible profile automatically:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nim-llama-auto
  namespace: nim
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nim-llama
  template:
    metadata:
      labels:
        app: nim-llama
    spec:
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.7.3
          ports:
            - containerPort: 8000
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ngc-secret
                  key: NGC_API_KEY
            # No NIM_MODEL_PROFILE: auto-selects best fit
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: nim-cache
              mountPath: /opt/nim/.cache
      volumes:
        - name: nim-cache
          persistentVolumeClaim:
            claimName: nim-cache-pvc
Deploy with Explicit Profile Selection
Pin a specific profile for deterministic, reproducible deployments:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nim-llama-fp8
  namespace: nim
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nim-llama-fp8
  template:
    metadata:
      labels:
        app: nim-llama-fp8
    spec:
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.7.3
          ports:
            - containerPort: 8000
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ngc-secret
                  key: NGC_API_KEY
            # Option A: Select by friendly name
            - name: NIM_MODEL_PROFILE
              value: "vllm-fp8-tp1-pp1"
            # Option B: Select by profile ID (most deterministic)
            # - name: NIM_MODEL_PROFILE
            #   value: "70edb8bb9f8511ce2ea195e3caebcc3c7191dc27fea0c8d4acf9c0d9a69e43cd"
          resources:
            limits:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: nim-cache
              mountPath: /opt/nim/.cache
      volumes:
        - name: nim-cache
          persistentVolumeClaim:
            claimName: nim-cache-pvc
Deploy with LoRA Support
env:
  - name: NIM_MODEL_PROFILE
    value: "vllm-bf16-tp1-pp1-feat_lora"
  - name: NIM_PEFT_SOURCE
    value: "/lora-adapters"
  - name: NIM_PEFT_REFRESH_INTERVAL
    value: "600"  # Check for new adapters every 10 min
Multi-GPU Tensor Parallelism
For large models that need multiple GPUs:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nim-llama-70b
  namespace: nim
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nim-llama-70b
  template:
    metadata:
      labels:
        app: nim-llama-70b
    spec:
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama-3.1-70b-instruct:1.7.3
          ports:
            - containerPort: 8000
          env:
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ngc-secret
                  key: NGC_API_KEY
            - name: NIM_MODEL_PROFILE
              value: "vllm-fp8-tp4-pp1"
          resources:
            limits:
              nvidia.com/gpu: "4"
          volumeMounts:
            - name: nim-cache
              mountPath: /opt/nim/.cache
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: nim-cache
          persistentVolumeClaim:
            claimName: nim-cache-pvc
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
Low Memory: Reduce Context Length
When a profile is flagged as "low memory", reduce --max-model-len:
env:
  - name: NIM_MODEL_PROFILE
    value: "vllm-bf16-tp1-pp1"
args:
  - "--max-model-len"
  - "4096"  # Reduce from default (e.g., 128K) to fit VRAM
Override Profile Settings with vLLM CLI Args
Backend-native CLI args take precedence over profile defaults:
containers:
  - name: nim
    image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.7.3
    env:
      - name: NIM_MODEL_PROFILE
        value: "vllm-bf16-tp2-pp1"
    args:
      # These override the profile's TP setting
      - "--tensor-parallel-size"
      - "4"
      - "--max-model-len"
      - "8192"
      - "--enable-lora"
Precedence hierarchy:
- vLLM CLI args (highest): --tensor-parallel-size, --max-model-len, etc.
- NIM_MODEL_PROFILE: profile defaults applied if not overridden
- Auto-selection (lowest): hardware-based automatic pick
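The precedence hierarchy behaves like a layered lookup: the highest layer that defines a setting wins. The sketch below models that with Python's collections.ChainMap, using the args from the example above. It illustrates the ordering only. How NIM maps a profile onto vLLM flags internally is an assumption here, and the profile_defaults and auto_selected dictionaries are hypothetical values.

```python
from collections import ChainMap
from typing import Dict, List, Union

Value = Union[str, bool]


def parse_cli_args(args: List[str]) -> Dict[str, Value]:
    """Turn a container args list into {flag: value}; bare flags become True."""
    parsed: Dict[str, Value] = {}
    i = 0
    while i < len(args):
        flag = args[i]
        if not flag.startswith("--"):
            raise ValueError(f"expected a --flag, got {flag!r}")
        has_value = i + 1 < len(args) and not args[i + 1].startswith("--")
        parsed[flag] = args[i + 1] if has_value else True
        i += 2 if has_value else 1
    return parsed


# Highest precedence: the args from the container spec above.
cli = parse_cli_args(
    ["--tensor-parallel-size", "4", "--max-model-len", "8192", "--enable-lora"]
)

# Hypothetical defaults implied by the profile vllm-bf16-tp2-pp1.
profile_defaults = {
    "--tensor-parallel-size": "2",
    "--pipeline-parallel-size": "1",
    "--dtype": "bfloat16",
}

# Hypothetical result of hardware-based auto-selection (lowest precedence).
auto_selected = {"--tensor-parallel-size": "1", "--max-model-len": "131072"}

# ChainMap searches its maps left to right, so earlier maps win.
effective = ChainMap(cli, profile_defaults, auto_selected)

if __name__ == "__main__":
    for flag in sorted(effective):
        print(flag, effective[flag])
```

With these inputs the tensor-parallel size resolves to 4 from the CLI args while the pipeline-parallel size still comes from the profile. The GPU request in the Pod spec has to match the overridden value, not the one encoded in the profile name.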
Profile Selection Decision Matrix
| Scenario | NIM_MODEL_PROFILE | GPU Request | Notes |
|---|---|---|---|
| Quick dev/test | (not set) | 1 | Auto-selects best single-GPU profile |
| Production (pinned) | vllm-fp8-tp1-pp1 | 1 | FP8 saves ~50% VRAM vs bf16 |
| Production with LoRA | vllm-bf16-tp1-pp1-feat_lora | 1 | Needs ~4GB extra VRAM for adapters |
| Large model (70B) | vllm-fp8-tp4-pp1 | 4 | FP8 on 4× H100/A100 |
| Very large (405B) | vllm-bf16-tp8-pp2 | 16 (multinode) | 8 GPUs × 2 pipeline stages |
| Constrained VRAM | vllm-fp8-tp1-pp1 + --max-model-len 4096 | 1 | Trade context length for fit |
| Deterministic CI/CD | <64-char profile ID> | varies | Immune to tag/name changes |
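The "~50% VRAM" figure in the matrix follows from bytes per parameter: BF16 stores 2 bytes per weight and FP8 stores 1. The sketch below is a back-of-the-envelope estimate of weight memory per GPU, assuming weights are split evenly across tp × pp GPUs and ignoring KV cache, activations, and quantization scale factors. The function name and the rounded parameter counts are illustrative, not values reported by NIM.

```python
# Bytes per parameter by precision. The 4-bit formats are approximated at
# 0.5 bytes, ignoring the per-block scale factors they also store.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "mxfp4": 0.5, "nvfp4": 0.5}


def weight_gb_per_gpu(
    params_billions: float, precision: str, tp: int = 1, pp: int = 1
) -> float:
    """Approximate weight memory per GPU in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / (tp * pp) / 1e9


if __name__ == "__main__":
    # Roughly 8B parameters on a single GPU
    print(f"{weight_gb_per_gpu(8, 'bf16'):.1f}")        # 16.0
    print(f"{weight_gb_per_gpu(8, 'fp8'):.1f}")         # 8.0
    # Roughly 70B parameters, FP8, split across 4 GPUs
    print(f"{weight_gb_per_gpu(70, 'fp8', tp=4):.1f}")  # 17.5
```

The 16 GB estimate for an 8B model in BF16 is in line with the ">=18 GB/gpu" requirement in the sample list-model-profiles output once runtime overhead is added. The saving applies to weights only, so total VRAM use falls by less than half when the KV cache is large.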
Common Issues
| Issue | Cause | Fix |
|---|---|---|
| No compatible profiles found | GPU VRAM too small for any profile | Use a quantized profile (fp8/nvfp4) or increase TP |
| Container OOMKilled | Profile fits weights but not KV cache at full context | Add --max-model-len 4096 (or lower) |
| Wrong profile selected | Auto-selection picked unexpected profile | Pin with NIM_MODEL_PROFILE explicitly |
| Profile not found | Typo in profile name or ID | Run list-model-profiles to verify exact names |
| LoRA not working | Non-LoRA profile selected | Use profile with -feat_lora suffix |
| Slow startup | Downloading model files each time | Use a PVC for /opt/nim/.cache to persist downloads |
| vLLM arg ignored | Arg syntax wrong | Args go in args: field, not env: (e.g., ["--tensor-parallel-size", "4"]) |
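The KV cache row in the table is easier to reason about with numbers. For a standard transformer with grouped-query attention, each token stores one key and one value vector per layer per KV head. The sketch below estimates the cache size for a single sequence at a given context length. The layer and head counts are the published Llama 3.1 8B configuration and should be treated as assumptions for any other model. vLLM's real allocation also depends on its memory-utilization settings and batching, which this ignores.

```python
def kv_cache_gib(
    tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # 2 bytes for a BF16/FP16 cache
) -> float:
    """Approximate KV cache size in GiB for one sequence of `tokens` tokens."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return tokens * bytes_per_token / 2**30


# Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128 (assumed config).
LLAMA_31_8B = dict(layers=32, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    print(kv_cache_gib(131072, **LLAMA_31_8B))  # 16.0 (128K context)
    print(kv_cache_gib(4096, **LLAMA_31_8B))    # 0.5
```

Under these assumptions a single 128K-token sequence needs about as much memory for its cache as the BF16 weights themselves, which is why lowering --max-model-len to 4096 can turn a "low memory" profile into one that fits.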
Best Practices
- Always run list-model-profiles first: discover what's available before deploying
- Pin profiles in production: use explicit NIM_MODEL_PROFILE for reproducibility
- Use profile IDs for CI/CD: 64-char IDs are immutable; friendly names may change across versions
- Prefer FP8 over BF16 when available: ~50% VRAM savings with minimal quality loss on H100/L40S
- Mount /dev/shm for multi-GPU: tensor parallelism needs shared memory for NCCL
- Cache models on PVC: avoid re-downloading on every pod restart
- Set NIM_MODEL_PROFILE=default: better than no setting; triggers intelligent selection with backend priority
Key Takeaways
- NIM model profiles define precision, tensor/pipeline parallelism, and LoRA support per deployment
- list-model-profiles shows compatible, low-memory, and incompatible profiles for your hardware
- Pin NIM_MODEL_PROFILE by ID or friendly name for deterministic production deployments
- vLLM CLI args override profile defaults: useful for tuning max-model-len and TP
- FP8 quantization is the sweet spot for GPUs with native FP8 support such as H100/L40S: it roughly halves weight VRAM with negligible quality impact

