πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
ai advanced ⏱ 30 minutes K8s 1.28+

Deploy Qwen3.5 397B MoE on Kubernetes

Deploy Alibaba Qwen3.5-397B-A17B MoE multimodal model on Kubernetes. 397B total parameters with only 17B active per token for frontier VLM inference.

By Luca Berton β€’ β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Deploy Qwen3.5-397B-A17B with vLLM using --tensor-parallel-size 8 on 8x A100 80GB. A MoE vision-language model with 397B total parameters but only 17B active per token. Frontier multimodal quality (1.66M downloads, 1.3K likes) at efficient inference cost.

The Problem

You need the best open multimodal model available β€” one that can:

  • Analyze complex images β€” medical scans, satellite imagery, detailed diagrams
  • Reason across modalities β€” combine visual and textual understanding for deep analysis
  • Handle frontier tasks β€” tasks where 9B and 35B models fall short
  • Stay cost-efficient β€” MoE architecture keeps inference fast despite 397B total parameters

Qwen3.5-397B-A17B is Alibaba’s flagship multimodal MoE model β€” the largest in the Qwen3.5 family.

The Solution

Deploy Qwen3.5-397B-A17B

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen35-397b
  namespace: ai-inference
  labels:
    app: qwen35-397b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen35-397b
  template:
    metadata:
      labels:
        app: qwen35-397b
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen3.5-397B-A17B"
            - "--tensor-parallel-size"
            - "8"
            - "--max-model-len"
            - "16384"
            - "--gpu-memory-utilization"
            - "0.92"
            - "--max-num-seqs"
            - "16"
            - "--enable-chunked-prefill"
            - "--trust-remote-code"
            - "--limit-mm-per-prompt"
            - "image=2"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
            - name: NCCL_DEBUG
              value: "WARN"
            - name: VLLM_WORKER_MULTIPROC_METHOD
              value: "spawn"
          resources:
            limits:
              nvidia.com/gpu: "8"
              memory: 256Gi
              cpu: "64"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 600
            periodSeconds: 60
            failureThreshold: 20
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 30
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: qwen35-397b-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 64Gi
      terminationGracePeriodSeconds: 300
---
apiVersion: v1
kind: Service
metadata:
  name: qwen35-397b
  namespace: ai-inference
spec:
  selector:
    app: qwen35-397b
  ports:
    - port: 8000
      targetPort: 8000

FP8 on H100 (Fewer GPUs)

# FP8 cuts VRAM in half β€” fits on 4x H100
args:
  - "--model"
  - "Qwen/Qwen3.5-397B-A17B"
  - "--tensor-parallel-size"
  - "4"
  - "--quantization"
  - "fp8"
  - "--max-model-len"
  - "16384"
  - "--trust-remote-code"
resources:
  limits:
    nvidia.com/gpu: "4"
nodeSelector:
  nvidia.com/gpu.product: "H100-SXM"

Qwen3.5 Family Comparison

| Model               | Total  | Active | GPUs (FP16) | Multimodal | Downloads |
|----------------------|--------|--------|-------------|------------|-----------|
| Qwen3.5-0.8B        | 0.9B   | 0.9B   | CPU/L4      | Yes        | 662K      |
| Qwen3.5-2B          | 2B     | 2B     | T4/L4       | Yes        | 454K      |
| Qwen3.5-4B          | 5B     | 5B     | A10G        | Yes        | 751K      |
| Qwen3.5-9B          | 10B    | 10B    | 1x A100     | Yes        | 1.54M     |
| Qwen3.5-27B         | 28B    | 28B    | 1x A100 80G | Yes        | 1.1M      |
| Qwen3.5-35B-A3B     | 36B    | 3B     | 1x A100 40G | Yes (MoE)  | 1.46M     |
| Qwen3.5-397B-A17B   | 403B   | 17B    | 8x A100 80G | Yes (MoE)  | 1.66M     |
flowchart TD
    A[Image + Text] --> B[Vision Encoder]
    A --> C[Text Tokenizer]
    B --> D[Visual Tokens]
    C --> E[Text Tokens]
    D --> F[MoE Transformer - 397B total]
    E --> F
    F --> G{Router selects experts}
    G --> H[17B active compute]
    H --> I[Frontier-quality response]
    subgraph 8x A100 80GB
        F
    end

Common Issues

397B model needs fast storage

# ~800GB in FP16 β€” NVMe PVC is essential
# Loading from network storage takes 30-60 minutes
# Pre-download with an init container or dedicated pull Job

Image processing at this scale

# Limit images per request β€” each image adds significant KV cache
--limit-mm-per-prompt image=2  # max 2 images per request
# For image-heavy workloads, use Qwen3.5-9B which is more efficient

Best Practices

  • 8x A100 80GB minimum at FP16, or 4x H100 with FP8
  • NVLink/NVSwitch β€” mandatory for 8-GPU tensor parallelism
  • Limit images per prompt β€” vision adds significant memory overhead at 397B scale
  • Use smaller Qwen3.5 models first β€” 9B or 35B-A3B handle most use cases
  • Reserve 397B for frontier tasks β€” complex multi-image analysis, research, benchmarking

Key Takeaways

  • Qwen3.5-397B-A17B is the flagship Qwen3.5 MoE model β€” 397B total, 17B active
  • 1.66M downloads β€” the most downloaded model in the Qwen3.5 MoE family
  • Requires 8x A100 80GB (FP16) or 4x H100 (FP8)
  • Multimodal β€” processes both images and text with frontier quality
  • MoE means 17B compute cost with 397B knowledge β€” best quality per FLOP
#qwen3.5 #mixture-of-experts #moe #multimodal #vision-language #ultra-large #vllm #nvidia #inference #ai
Luca Berton
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.

Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens