ai Β· advanced Β· ⏱ 30 minutes Β· K8s 1.28+

Deploy Mistral 7B with NVIDIA NIM on Kubernetes

Step-by-step guide to deploy Mistral-7B using NVIDIA NIM with TensorRT-LLM backend on Kubernetes for optimized GPU inference.

By Luca Berton β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Deploy the NVIDIA NIM LLM container with environment variables NIM_MODEL_NAME=/data/Mistral-7B-v0.1/ and NIM_SERVED_MODEL_NAME=Mistral-7B-v0.1. Mount model weights to /data. NIM auto-starts TensorRT-LLM on port 8000. No custom command needed β€” the container entrypoint handles everything.

Key difference from vLLM: NIM uses TensorRT-LLM for optimized inference with CUDA graphs, chunked prefill, and automatic engine building. Higher throughput, but stricter version requirements.

NVIDIA NIM (NVIDIA Inference Microservice) wraps TensorRT-LLM to serve LLMs with high throughput and low latency. This recipe covers deploying Mistral-7B-v0.1 using NIM on Kubernetes.

NIM vs vLLM Comparison

| Feature | NIM (TensorRT-LLM) | vLLM |
|---|---|---|
| Backend | TensorRT-LLM C++ engine | PyTorch-based |
| Throughput | Higher (optimized kernels) | Good |
| Startup time | Slower (engine build) | Faster |
| Compatibility | Strict version coupling | More forgiving |
| CUDA graphs | Built-in | Optional |
| Chat template | Required for /chat/completions | Same |
| Custom command | Not needed | Required |

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Kubernetes / OpenShift                       β”‚
β”‚                                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚  NIM Pod                                β”‚  β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚  β”‚
β”‚  β”‚  β”‚ TensorRT-LLM Engine              β”‚   β”‚  β”‚
β”‚  β”‚  β”‚ - JIT engine build on first run  β”‚   β”‚  β”‚
β”‚  β”‚  β”‚ - CUDA graphs for batching       β”‚   β”‚  β”‚
β”‚  β”‚  β”‚ - Chunked prefill enabled        β”‚   β”‚  β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚  β”‚
β”‚  β”‚  Port 8000 (OpenAI-compatible API)      β”‚  β”‚
β”‚  β”‚  Volume: /data (PVC with model files)   β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Deployment Manifest

# mistral-nim-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-nim
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-nim
  template:
    metadata:
      labels:
        app: mistral-nim
    spec:
      containers:
        - name: nim
          image: registry.example.com/org/nvidia/llm-nim:latest
          # No command/args β€” NIM entrypoint handles startup
          ports:
            - containerPort: 8000
              name: http
              protocol: TCP
          env:
            - name: NIM_MODEL_NAME
              value: "/data/Mistral-7B-v0.1/"
            - name: NIM_SERVED_MODEL_NAME
              value: "Mistral-7B-v0.1"
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-data
              mountPath: /data
              readOnly: true
          readinessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
            timeoutSeconds: 5
          livenessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
            timeoutSeconds: 5
      volumes:
        - name: model-data
          persistentVolumeClaim:
            claimName: model-storage-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-nim
  namespace: ai-inference
spec:
  selector:
    app: mistral-nim
  ports:
    - port: 8000
      targetPort: 8000
      protocol: TCP
      name: http
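The Deployment mounts a claim named model-storage-pvc that is not defined above. A minimal sketch of that PVC, assuming you need room for the Mistral-7B safetensors plus the ~29.5 GB built engine; the storage class is a placeholder for whatever your cluster provides:

```yaml
# model-storage-pvc.yaml β€” hypothetical PVC backing /data
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-storage-pvc
  namespace: ai-inference
spec:
  accessModes:
    - ReadWriteOnce          # switch to ReadWriteMany if replicas > 1
  resources:
    requests:
      storage: 100Gi         # weights + TensorRT-LLM engine cache
  # storageClassName: fast-ssd   # placeholder β€” use your cluster's class
```

Populate the volume with the Mistral-7B-v0.1 files under /data/Mistral-7B-v0.1/ before scaling up the Deployment.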

Environment Variables

| Variable | Value | Purpose |
|---|---|---|
| NIM_MODEL_NAME | /data/Mistral-7B-v0.1/ | Path to model weights inside the container |
| NIM_SERVED_MODEL_NAME | Mistral-7B-v0.1 | Model name exposed via the API |

Important: Do NOT set a custom command or entrypoint. NIM handles startup internally, including TensorRT-LLM engine building and API server initialization.
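The manifest above assumes the NIM image is mirrored to a private registry (registry.example.com). If you pull directly from NGC (nvcr.io) instead, the pod needs registry credentials; a sketch, where the secret name is a placeholder and the NGC API key is yours:

```yaml
# Pod-spec addition for pulling from nvcr.io directly.
# Create the secret beforehand, e.g.:
#   kubectl create secret docker-registry ngc-pull-secret \
#     --docker-server=nvcr.io \
#     --docker-username='$oauthtoken' \
#     --docker-password=<NGC_API_KEY> \
#     -n ai-inference
spec:
  template:
    spec:
      imagePullSecrets:
        - name: ngc-pull-secret   # hypothetical secret name
```

NGC uses the literal username $oauthtoken with your NGC API key as the password.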

GPU Requirements

Mistral-7B with NIM TensorRT-LLM:

| Metric | Value |
|---|---|
| Engine size | ~29.5 GB |
| Minimum VRAM | 40 GB (A100 recommended) |
| Supported GPUs | A100, H100, A30 (limited) |
| dtype | bfloat16 (default) |
| Tensor parallelism | 1 (single GPU for 7B) |
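To keep the pod off unsupported GPUs, you can pin it to a specific product. A sketch assuming NVIDIA GPU Feature Discovery labels (nvidia.com/gpu.product) exist on your nodes; the label value varies per cluster, so check `kubectl get nodes --show-labels` first:

```yaml
# Pod-spec additions β€” merge into the Deployment template
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB   # example value
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
```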

Run:ai Deployment (UI)

| Field | Value |
|---|---|
| Inference type | Custom |
| Image URL | registry.example.com/org/nvidia/llm-nim:latest |
| Image pull | Only if not present (recommended) |
| Container port | 8000 (HTTP) |
| Command | (leave empty) |
| Arguments | (leave empty) |
| Env: NIM_MODEL_NAME | /data/Mistral-7B-v0.1/ |
| Env: NIM_SERVED_MODEL_NAME | Mistral-7B-v0.1 |
| GPU devices | 1 |
| GPU fraction | 50% (if fractioning available) |
| Data origin (PVC) | your-model-storage-pvc |
| Container path | /data |
| Priority | high or very-high |

Startup Process

NIM goes through these stages on first start:

  1. Detect GPU β€” identifies available CUDA devices
  2. Load model config β€” reads HuggingFace config from /data/Mistral-7B-v0.1/
  3. Build TensorRT-LLM engine β€” JIT compilation (can take 60–120 seconds)
  4. Load weights β€” loads safetensors into GPU memory (~4 seconds)
  5. Initialize KV cache β€” allocates GPU memory for inference batching
  6. Start API server β€” listens on port 8000
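Because stage 3 delays readiness, a startupProbe is often a better fit than the large initialDelaySeconds values in the manifest above: liveness checks only begin once the startup probe succeeds, so slow engine builds don't trigger restarts. A sketch:

```yaml
# Alternative to long initialDelaySeconds on the NIM container
startupProbe:
  httpGet:
    path: /v1/models
    port: 8000
  periodSeconds: 10
  failureThreshold: 60   # 60 Γ— 10s = up to 600s budget for the JIT engine build
```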

Watch for this in logs:

Loading weights concurrently: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 617/617
Model init total -- 4.40s

Verify Deployment

# Check pod status
kubectl get pods -n ai-inference -l app=mistral-nim

# Watch startup logs
kubectl logs -n ai-inference deployment/mistral-nim -f

# List models
curl -k https://<inference-endpoint>/v1/models

# Run a completion
curl -k -X POST https://<inference-endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Mistral-7B-v0.1",
    "prompt": "Hello from NIM!",
    "max_tokens": 32
  }'
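The same completion call can be scripted. A minimal Python sketch against the OpenAI-compatible endpoint; the base URL is a placeholder, and build_completion_payload is a helper introduced here, not part of NIM:

```python
import json
import urllib.request


def build_completion_payload(prompt: str, max_tokens: int = 32) -> dict:
    """Assemble the JSON body for NIM's OpenAI-compatible /v1/completions."""
    return {
        "model": "Mistral-7B-v0.1",  # must match NIM_SERVED_MODEL_NAME
        "prompt": prompt,
        "max_tokens": max_tokens,
    }


def complete(base_url: str, prompt: str) -> str:
    """POST a completion request and return the first choice's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_completion_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Placeholder endpoint β€” substitute your Service or Route URL
    print(complete("http://mistral-nim.ai-inference.svc:8000", "Hello from NIM!"))
```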

Chat vs Completions (Mistral Base Model)

Mistral-7B-v0.1 is a base model β€” it has no chat template:

| Endpoint | Works? | Notes |
|---|---|---|
| /v1/completions | Yes | Use this |
| /v1/chat/completions | No | Returns error: β€œdoes not have a default chat template” |

If you need chat, use Mistral-7B-Instruct-v0.2 or define a custom chat template.
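If you switch to the instruct model but still call /v1/completions, you can apply the instruct formatting yourself. A sketch in Python; the [INST] … [/INST] wrapping follows Mistral's published instruct format, but verify it against the chat template in the model's tokenizer_config.json:

```python
def mistral_instruct_prompt(user_message: str) -> str:
    """Wrap a single user turn in Mistral's instruct delimiters.

    The serving tokenizer normally prepends the BOS token itself,
    so it is not included in the text here.
    """
    return f"[INST] {user_message} [/INST]"


# Send the wrapped prompt to /v1/completions instead of /v1/chat/completions
print(mistral_instruct_prompt("Explain KV cache in one sentence."))
```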

TensorRT-LLM Runtime Configuration

NIM auto-configures these parameters. Key defaults for Mistral-7B:

dtype: bfloat16
tensor_parallel_size: 1
max_batch_size: 512
max_seq_len: 32768
max_num_tokens: 8192
enable_chunked_context: true
cuda_graph_mode: true
kvcache_free_memory_fraction: 0.9
scheduler_policy: guarantee_no_evict
sliding_window: 4096  # per layer

Override with caution. See Troubleshoot NIM TensorRT-LLM for known issues.
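The kvcache_free_memory_fraction default determines how many tokens the KV cache can hold. A back-of-envelope sketch in Python, assuming Mistral-7B's published shape (32 layers, 8 KV heads via grouped-query attention, head dim 128) and a bfloat16 cache; the weight footprint is a rough estimate, not NIM's exact accounting:

```python
# Rough KV-cache sizing for Mistral-7B (GQA: 8 KV heads, head_dim 128, 32 layers)
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BYTES_BF16 = 2

# Per token: one K and one V vector per layer
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_BF16
print(kv_bytes_per_token)  # 131072 bytes = 128 KiB per token

# On a 40 GB GPU with ~15 GB of weights resident and 90% of the
# remaining memory handed to the KV cache:
free_gib = (40 - 15) * 0.9
max_cached_tokens = int(free_gib * 1024**3 / kv_bytes_per_token)
print(max_cached_tokens)  # 184320
```

At roughly 184k cached tokens, the defaults (max_batch_size 512, max_seq_len 32768) cannot all be saturated at once; the guarantee_no_evict scheduler admits requests only when their full KV budget fits.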

#nvidia-nim #tensorrt-llm #mistral #llm #inference #gpu #ai-workloads
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
