Deploy Qwen3.5 35B MoE on Kubernetes
Deploy Alibaba's Qwen3.5-35B-A3B mixture-of-experts multimodal model on Kubernetes: 35B total parameters with only 3B active for ultra-efficient inference.
💡 Quick Answer: Qwen3.5-35B-A3B is a MoE vision-language model: 35B total parameters but only 3B active per token. It runs on a single GPU with near-3B-model speed but 35B-model quality. Ideal for multimodal workloads where cost efficiency matters.
The Problem
You need multimodal (image + text) AI but face a trade-off:
- Small models (3B): fast and cheap, but limited reasoning
- Large models (35B dense): great quality, but need expensive multi-GPU setups
- MoE solves this: 35B parameters of knowledge, only 3B active per forward pass
Qwen3.5-35B-A3B gives you the quality of a 35B model at the inference cost of a 3B model, with vision capabilities.
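The MoE trick can be sketched in a few lines: a learned router scores every expert per token, but only the top-k actually execute, so compute scales with the active parameters rather than the total. This is an illustrative sketch with toy dimensions and expert counts, not Qwen's actual configuration:

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    """Route one token through only the top-k of N experts.

    x: (d,) token hidden state
    router_w: (d, n_experts) learned router weights
    experts: list of callables, each standing in for a small FFN
    """
    logits = x @ router_w                       # score every expert (cheap)
    top_k = np.argsort(logits)[-k:]             # pick the k highest-scoring
    # softmax over the selected experts only
    w = np.exp(logits[top_k] - logits[top_k].max())
    w /= w.sum()
    # only k experts run; the other N-k contribute zero compute
    return sum(wi * experts[i](x) for wi, i in zip(w, top_k))

# Toy example: 8 experts, only 2 run per token
rng = np.random.default_rng(0)
d, n = 16, 8
experts = [lambda x, W=rng.normal(size=(d, d)) / d**0.5: x @ W for _ in range(n)]
router_w = rng.normal(size=(d, n))
y = moe_layer(rng.normal(size=d), router_w, experts, k=2)
print(y.shape)  # (16,)
```

All N expert weight matrices still have to sit in memory (that is the VRAM catch discussed below), but per-token FLOPs are those of the k chosen experts only.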
The Solution
Deploy Qwen3.5-35B-A3B
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen35-35b-moe
  namespace: ai-inference
  labels:
    app: qwen35-35b-moe
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen35-35b-moe
  template:
    metadata:
      labels:
        app: qwen35-35b-moe
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen3.5-35B-A3B"
            - "--max-model-len"
            - "32768"
            - "--gpu-memory-utilization"
            - "0.92"
            - "--max-num-seqs"
            - "64"
            - "--enable-chunked-prefill"
            - "--trust-remote-code"
            - "--limit-mm-per-prompt"
            - "image=4"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: 48Gi
              cpu: "8"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 15
            failureThreshold: 20
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: qwen35-moe-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 4Gi
---
apiVersion: v1
kind: Service
metadata:
  name: qwen35-35b-moe
  namespace: ai-inference
spec:
  selector:
    app: qwen35-35b-moe
  ports:
    - port: 8000
      targetPort: 8000
```
GGUF Version for llama.cpp
```yaml
# Use unsloth GGUF quantization for even lower resource usage
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen35-35b-gguf
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen35-35b-gguf
  template:
    metadata:
      labels:
        app: qwen35-35b-gguf
    spec:
      containers:
        - name: llamacpp
          image: ghcr.io/ggerganov/llama.cpp:server
          args:
            - "--model"
            - "/models/Qwen3.5-35B-A3B-Q4_K_M.gguf"
            - "--ctx-size"
            - "16384"
            - "--n-gpu-layers"
            - "99"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8000"
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: 32Gi
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: gguf-models
```
MoE Efficiency Comparison
| Model | Total | Active | VRAM (FP16) | Tokens/sec | Quality |
|----------------------|-------|--------|-------------|------------|---------|
| Qwen3.5-0.8B | 0.9B | 0.9B | ~2GB | ~5000 | Basic |
| Qwen3.5-4B | 5B | 5B | ~10GB | ~3000 | Good |
| Qwen3.5-9B | 10B | 10B | ~18GB | ~2000 | Great |
| Qwen3.5-35B-A3B (MoE)| 36B | 3B | ~35GB* | ~3500 | Great |
* All expert weights must be in VRAM even though only 3B are active

```mermaid
flowchart TD
    A[Image + Text Input] --> B[Vision Encoder]
    A --> C[Text Tokenizer]
    B --> D[Visual Tokens]
    C --> E[Text Tokens]
    D --> F[MoE Transformer]
    E --> F
    F --> G{Router selects experts}
    G --> H[Expert 1 of N]
    G --> I[Expert 2 of N]
    H --> J[Combine - only 3B compute]
    I --> J
    J --> K[Response]
    L[35B total weights in VRAM] -.-> F
```
Common Issues
All 35B must fit in VRAM
```
# MoE doesn't load experts on demand; all weights stay in GPU memory
# FP16: ~35GB VRAM needed (fits on A100 40GB or L40S 48GB)
# GGUF Q4: ~18GB (fits on RTX 4090, A10G, L4)
```
Model slower than expected for 3B active
```
# MoE routing adds overhead vs a pure 3B dense model
# But quality is much higher: 35B-class quality at speed closer to 3B
# Still roughly 10x faster than a 35B dense model
```
Best Practices
- Single GPU is enough: despite 35B total parameters, the model fits on an A100 40GB at FP16
- GGUF Q4 for smaller GPUs: unsloth quantization fits on 24GB cards
- Use for multimodal tasks: the MoE architecture especially benefits vision+text workloads
- Higher concurrency than 9B dense: 3B active parameters means less compute per request
- 1.46M+ downloads, 1K+ likes: proven community adoption
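The Deployments above mount PVCs (`qwen35-moe-cache`, `gguf-models`) and read a `huggingface-token` Secret, none of which are defined in the manifests themselves. A minimal sketch for the vLLM variant; the storage size, access mode, and placeholder token value are assumptions to tune for your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen35-moe-cache
  namespace: ai-inference
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi   # assumed; must hold the full Hugging Face download
---
apiVersion: v1
kind: Secret
metadata:
  name: huggingface-token
  namespace: ai-inference
type: Opaque
stringData:
  token: hf_xxx   # replace with your Hugging Face token
```

The GGUF Deployment needs an analogous `gguf-models` claim, pre-populated with the quantized model file at the path the container expects.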
Key Takeaways
- 35B total, 3B active: MoE gives 35B quality at near-3B inference cost
- Vision + text multimodal model in one deployment
- Fits on a single A100 40GB at FP16 or an RTX 4090 with GGUF Q4
- ~3500 tokens/sec: faster than 9B dense despite higher quality
- Available in GGUF format (unsloth) for llama.cpp deployments
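Both vLLM and the llama.cpp server expose an OpenAI-compatible API, so a multimodal request is a standard chat completion with an image content part. A sketch of the payload; the in-cluster service DNS name follows from the Service above, and the image URL is a placeholder:

```python
import json

# OpenAI-style chat completion with one image part and one text part
payload = {
    "model": "Qwen/Qwen3.5-35B-A3B",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/chart.png"}},
            {"type": "text", "text": "Summarize this chart."},
        ],
    }],
    "max_tokens": 256,
}

# POST this body to:
#   http://qwen35-35b-moe.ai-inference.svc:8000/v1/chat/completions
body = json.dumps(payload)
print(len(json.loads(body)["messages"][0]["content"]))  # 2
```

The `--limit-mm-per-prompt image=4` flag in the Deployment caps how many image parts a single prompt may carry.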

