Troubleshoot NVIDIA NIM TensorRT-LLM Initialization Failures
Diagnose and fix common NIM TensorRT-LLM executor failures including DecoderState mismatch, version incompatibilities, and engine build errors.
💡 Quick Answer: If NIM logs show `Failed to initialize executor on rank 0: setup(): incompatible function arguments` with `max_attention_window` passed as a list instead of an int, your TensorRT-LLM bindings are older than the NIM runtime expects. Upgrade the NIM container image, or remove the `NIM_NUM_KV_CACHE_SEQ_LENS` override. If `/v1/completions` returns `activator request timeout`, the backend never finished initializing.
This recipe covers the most common NIM + TensorRT-LLM startup failures and their resolutions.
Symptom: "activator request timeout"
When calling the inference endpoint:

```shell
curl -k -X POST https://<endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Mistral-7B-v0.1", "prompt": "Hello", "max_tokens": 32}'
```

You get:

```
activator request timeout
```

This means: the model server never became ready. The TensorRT-LLM engine failed to initialize, so no inference requests can be served.
Root Cause 1: DecoderState.setup() ABI Mismatch
The Error
```
ERROR [TRT-LLM] Failed to initialize executor on rank 0:
setup(): incompatible function arguments.
Expected:
    max_attention_window: int
Received:
    max_attention_window: [4096, 4096, ..., 4096]  # list of 32 ints
```

What Happened
The NIM runtime passes `max_attention_window` as a per-layer list (one value per transformer layer), but the installed TensorRT-LLM C++ bindings expect a single integer.
This is a binary ABI mismatch between the NIM runtime and the TensorRT-LLM version bundled in the container.
How to Confirm
Check the TRT-LLM version inside the container:

```shell
kubectl exec -it <nim-pod> -n ai-inference -- \
  python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```

Then verify the `DecoderState.setup()` signature:
```shell
kubectl exec -it <nim-pod> -n ai-inference -- \
  python3 -c "
import tensorrt_llm.bindings.internal.runtime as rt
import inspect
print(inspect.signature(rt.DecoderState.setup))
"
```

If you see `max_attention_window: int` (not `List[int]`), the bindings are too old.
Fix
Option A — Upgrade the NIM container image (recommended)

Use a newer NIM LLM image that bundles TensorRT-LLM ≥ 1.0.4:

```yaml
image: registry.example.com/org/nvidia/llm-nim:v0.16.0  # or newer
```

Option B — Remove the sliding window override
If your deployment sets `NIM_NUM_KV_CACHE_SEQ_LENS`, remove it. This environment variable overrides the attention-window logic and can trigger the mismatch.
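If the variable is set on the Deployment, it can be dropped without editing YAML — a sketch assuming a Deployment named `mistral-nim` (adjust to your setup); the trailing `-` after the variable name unsets it:

```shell
# Remove the override and watch the rollout replace the pod
kubectl -n ai-inference set env deployment/mistral-nim NIM_NUM_KV_CACHE_SEQ_LENS-
kubectl -n ai-inference rollout status deployment/mistral-nim
```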
Root Cause 2: Infinite Restart Loop
NIM retries engine initialization every ~5 seconds. The pattern in the logs:

```
Loading weights concurrently: 100%|██████████| 617/617
Model init total -- 4.40s
ERROR [TRT-LLM] Failed to initialize executor on rank 0: ...
INFO Using JIT Config to create LLM args    ← retry starts
Loading weights concurrently: 100%|██████████| 617/617
Model init total -- 4.42s
ERROR [TRT-LLM] Failed to initialize executor on rank 0: ...
INFO Using JIT Config to create LLM args    ← retry again
```

This loop continues indefinitely. The endpoint never becomes healthy.
Diagnosis
```shell
# Count how many times initialization has been attempted
kubectl logs <nim-pod> -n ai-inference | grep -c "Failed to initialize executor"

# Check whether the pod is in CrashLoopBackOff
kubectl get pods -n ai-inference -l app=mistral-nim
```

Fix
Stop redeploying with the same image. The error is deterministic: it will always fail with the same TRT-LLM version. Upgrade the container image.
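Assuming the Deployment and container names used in this recipe (`mistral-nim` and `nim` — adjust to yours), the upgrade is a rolling image swap:

```shell
# Point the container at the newer NIM image and wait for the rollout
kubectl -n ai-inference set image deployment/mistral-nim \
  nim=registry.example.com/org/nvidia/llm-nim:v0.16.0
kubectl -n ai-inference rollout status deployment/mistral-nim
```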
Root Cause 3: GPU Memory Issues
The Error
```
CUDA out of memory
Failed to allocate tensor
RuntimeError: CUDA error: out of memory
```

Common Causes
| Scenario | Explanation |
|---|---|
| Small GPU fraction | 50% of 40 GB = 20 GB, but the engine needs ~30 GB |
| Other pods sharing the GPU | MIG or time-slicing leaves insufficient VRAM |
| Large `max_batch_size` | The default of 512 may require too much KV cache memory |
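To see why the default batch size matters, here is a back-of-the-envelope KV-cache estimate — a rough worst-case formula under simplifying assumptions (no paging, fp16, a Mistral-7B-like shape), not NIM's actual allocator:

```python
# Worst-case KV-cache size: 2 tensors (K and V) per layer, per KV head,
# per cached token, per sequence in the batch.
def kv_cache_gib(layers, kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes / 1024**3

# Mistral-7B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16
print(kv_cache_gib(32, 8, 128, seq_len=4096, batch=512))  # → 256.0 GiB
print(kv_cache_gib(32, 8, 128, seq_len=4096, batch=64))   # → 32.0 GiB
```

Even with paging, a batch of 512 full-length sequences cannot fit on a 40 GB card, which is why lowering `NIM_MAX_BATCH_SIZE` is an effective lever.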
Fix
```shell
# Check GPU memory inside the pod
kubectl exec -it <nim-pod> -n ai-inference -- nvidia-smi
```

Increase the GPU allocation or reduce the batch size:
```yaml
env:
  - name: NIM_MAX_BATCH_SIZE
    value: "64"  # reduce from the default 512
```

Root Cause 4: Transformers Version Warning
```
UserWarning: transformers version 4.56.1 is incompatible with nvidia-modelopt
```

This warning is usually harmless but can cause subtle issues with tokenizer loading. If inference fails after the model loads successfully, pin a compatible transformers version in your custom image.
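In a custom image build, the pin is a single layer. The version bound below is an assumption for illustration — check the declared requirements of the `nvidia-modelopt` release you ship for the exact range:

```shell
# Pin below the flagged release; "<4.56" is an assumed bound — verify it
# against nvidia-modelopt's requirements before baking the image.
pip install --no-cache-dir "transformers<4.56"
pip check  # surfaces any remaining dependency conflicts
```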
Root Cause 5: Chat Template Missing
```json
{
  "error": {
    "message": "Model Mistral-7B-v0.1 does not have a default chat template defined in the tokenizer.",
    "code": 500
  }
}
```

This is not a bug — Mistral-7B-v0.1 is a base model without a chat template.
Fix: Use `/v1/completions` instead of `/v1/chat/completions`, or deploy an instruct-tuned model.
Diagnostic Commands Summary
```shell
# Pod status
kubectl get pods -n ai-inference -l app=mistral-nim

# Full logs
kubectl logs -n ai-inference <nim-pod> --tail=200

# Search for errors
kubectl logs -n ai-inference <nim-pod> | grep -i "error\|failed\|exception"

# Check the TRT-LLM version
kubectl exec -it <nim-pod> -- python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Check GPU status inside the pod
kubectl exec -it <nim-pod> -- nvidia-smi

# Check engine initialization
kubectl logs -n ai-inference <nim-pod> | grep -i "engine\|executor\|trt"

# Test the health endpoint
curl -k https://<endpoint>/v1/models
```

Decision Tree
```
curl returns "activator request timeout"
└─ Check pod logs
   ├─ "Failed to initialize executor" + "setup(): incompatible"
   │   └─ TRT-LLM version mismatch → upgrade the NIM image
   ├─ "CUDA out of memory"
   │   └─ Increase GPU allocation or reduce batch size
   ├─ "Failed to load model" / "plan file missing"
   │   └─ Model weights corrupted or incomplete → re-upload
   ├─ Logs end at "Creating TorchRT LLM API model"
   │   └─ Engine build hanging → check GPU driver and MOFED
   └─ No error but pod keeps restarting
       └─ Liveness probe failing → increase initialDelaySeconds
```

When to Fall Back to vLLM
If NIM issues persist and you need inference running now:
- Deploy vLLM instead (see Deploy Mistral with vLLM)
- vLLM is more forgiving with driver/library versions
- Lower throughput but much faster time-to-working-endpoint
- Same OpenAI-compatible API, just a different backend
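A minimal sketch of that fallback, assuming Docker with GPU access and the public `mistralai/Mistral-7B-v0.1` weights (adjust the image tag and model to your environment):

```shell
# Serve the same model via vLLM's OpenAI-compatible server on port 8000
docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-v0.1

# Same client call as before, now against the vLLM endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-v0.1", "prompt": "Hello", "max_tokens": 32}'
```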