Deploy Qwen3.5 397B MoE on Kubernetes
Deploy Alibaba Qwen3.5-397B-A17B MoE multimodal model on Kubernetes. 397B total parameters with only 17B active per token for frontier VLM inference.
π‘ Quick Answer: Deploy Qwen3.5-397B-A17B with vLLM using
--tensor-parallel-size 8on 8x A100 80GB. A MoE vision-language model with 397B total parameters but only 17B active per token. Frontier multimodal quality (1.66M downloads, 1.3K likes) at efficient inference cost.
The Problem
You need the best open multimodal model available β one that can:
- Analyze complex images β medical scans, satellite imagery, detailed diagrams
- Reason across modalities β combine visual and textual understanding for deep analysis
- Handle frontier tasks β tasks where 9B and 35B models fall short
- Stay cost-efficient β MoE architecture keeps inference fast despite 397B total parameters
Qwen3.5-397B-A17B is Alibabaβs flagship multimodal MoE model β the largest in the Qwen3.5 family.
The Solution
Deploy Qwen3.5-397B-A17B
apiVersion: apps/v1
kind: Deployment
metadata:
name: qwen35-397b
namespace: ai-inference
labels:
app: qwen35-397b
spec:
replicas: 1
selector:
matchLabels:
app: qwen35-397b
template:
metadata:
labels:
app: qwen35-397b
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "Qwen/Qwen3.5-397B-A17B"
- "--tensor-parallel-size"
- "8"
- "--max-model-len"
- "16384"
- "--gpu-memory-utilization"
- "0.92"
- "--max-num-seqs"
- "16"
- "--enable-chunked-prefill"
- "--trust-remote-code"
- "--limit-mm-per-prompt"
- "image=2"
- "--port"
- "8000"
ports:
- containerPort: 8000
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: huggingface-token
key: token
- name: NCCL_DEBUG
value: "WARN"
- name: VLLM_WORKER_MULTIPROC_METHOD
value: "spawn"
resources:
limits:
nvidia.com/gpu: "8"
memory: 256Gi
cpu: "64"
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
startupProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 600
periodSeconds: 60
failureThreshold: 20
readinessProbe:
httpGet:
path: /health
port: 8000
periodSeconds: 30
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: qwen35-397b-cache
- name: shm
emptyDir:
medium: Memory
sizeLimit: 64Gi
terminationGracePeriodSeconds: 300
---
apiVersion: v1
kind: Service
metadata:
name: qwen35-397b
namespace: ai-inference
spec:
selector:
app: qwen35-397b
ports:
- port: 8000
targetPort: 8000FP8 on H100 (Fewer GPUs)
# FP8 cuts VRAM in half β fits on 4x H100
args:
- "--model"
- "Qwen/Qwen3.5-397B-A17B"
- "--tensor-parallel-size"
- "4"
- "--quantization"
- "fp8"
- "--max-model-len"
- "16384"
- "--trust-remote-code"
resources:
limits:
nvidia.com/gpu: "4"
nodeSelector:
nvidia.com/gpu.product: "H100-SXM"Qwen3.5 Family Comparison
| Model | Total | Active | GPUs (FP16) | Multimodal | Downloads |
|----------------------|--------|--------|-------------|------------|-----------|
| Qwen3.5-0.8B | 0.9B | 0.9B | CPU/L4 | Yes | 662K |
| Qwen3.5-2B | 2B | 2B | T4/L4 | Yes | 454K |
| Qwen3.5-4B | 5B | 5B | A10G | Yes | 751K |
| Qwen3.5-9B | 10B | 10B | 1x A100 | Yes | 1.54M |
| Qwen3.5-27B | 28B | 28B | 1x A100 80G | Yes | 1.1M |
| Qwen3.5-35B-A3B | 36B | 3B | 1x A100 40G | Yes (MoE) | 1.46M |
| Qwen3.5-397B-A17B | 403B | 17B | 8x A100 80G | Yes (MoE) | 1.66M |flowchart TD
A[Image + Text] --> B[Vision Encoder]
A --> C[Text Tokenizer]
B --> D[Visual Tokens]
C --> E[Text Tokens]
D --> F[MoE Transformer - 397B total]
E --> F
F --> G{Router selects experts}
G --> H[17B active compute]
H --> I[Frontier-quality response]
subgraph 8x A100 80GB
F
endCommon Issues
397B model needs fast storage
# ~800GB in FP16 β NVMe PVC is essential
# Loading from network storage takes 30-60 minutes
# Pre-download with an init container or dedicated pull JobImage processing at this scale
# Limit images per request β each image adds significant KV cache
--limit-mm-per-prompt image=2 # max 2 images per request
# For image-heavy workloads, use Qwen3.5-9B which is more efficientBest Practices
- 8x A100 80GB minimum at FP16, or 4x H100 with FP8
- NVLink/NVSwitch β mandatory for 8-GPU tensor parallelism
- Limit images per prompt β vision adds significant memory overhead at 397B scale
- Use smaller Qwen3.5 models first β 9B or 35B-A3B handle most use cases
- Reserve 397B for frontier tasks β complex multi-image analysis, research, benchmarking
Key Takeaways
- Qwen3.5-397B-A17B is the flagship Qwen3.5 MoE model β 397B total, 17B active
- 1.66M downloads β the most downloaded model in the Qwen3.5 MoE family
- Requires 8x A100 80GB (FP16) or 4x H100 (FP8)
- Multimodal β processes both images and text with frontier quality
- MoE means 17B compute cost with 397B knowledge β best quality per FLOP

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
