Deploy Kimi K2.5 1.1T MoE on Kubernetes
Deploy Moonshot AI's Kimi-K2.5, a 1.1T-parameter multimodal MoE model, on Kubernetes. It is the largest open MoE model, with 2.69M downloads, and is aimed at frontier AI tasks.
💡 Quick Answer: Kimi-K2.5 (1.1 trillion parameters) is the largest open MoE model on Hugging Face, with 2.69M downloads. It requires a multi-node deployment: 16x H100 80GB with FP8, i.e. 2 nodes of 8x H100 each. Only deploy it if smaller models genuinely don't meet your quality bar.
The Problem
When frontier-quality output is non-negotiable:
- Smaller models plateau on complex reasoning, multi-image analysis, and deep domain expertise
- API-only models (GPT-4, Claude) aren't an option due to data sovereignty or cost at scale
- You need the strongest open model available, and Kimi-K2.5, at 1.1T parameters and 2.69M downloads, represents the frontier
The Solution
Deploy Kimi-K2.5 with Multi-Node vLLM
# Node 1: Head node with 8x H100
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kimi-k25
  namespace: ai-inference
  labels:
    app: kimi-k25
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kimi-k25
  template:
    metadata:
      labels:
        app: kimi-k25
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "moonshotai/Kimi-K2.5"
            # TP 8 x PP 2 = 16 GPUs in total. This pod provides 8, so a second
            # 8-GPU worker pod must join the same distributed vLLM runtime
            # (e.g. a Ray cluster); that worker is not shown here.
            - "--tensor-parallel-size"
            - "8"
            - "--pipeline-parallel-size"
            - "2"
            - "--quantization"
            - "fp8"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.95"
            - "--max-num-seqs"
            - "4"
            - "--enable-chunked-prefill"
            - "--trust-remote-code"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
            - name: NCCL_DEBUG
              value: "WARN"
            # "0" keeps InfiniBand enabled for inter-node traffic
            - name: NCCL_IB_DISABLE
              value: "0"
            - name: NCCL_NET_GDR_LEVEL
              value: "5"
          resources:
            limits:
              nvidia.com/gpu: "8"
              memory: 512Gi
              cpu: "128"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
          # Allows up to 1200s + 30 x 60s = 50 minutes for weights to load
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 1200
            periodSeconds: 60
            failureThreshold: 30
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: kimi-model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 128Gi
      nodeSelector:
        # Label value varies by cluster; check the GPU labels on your nodes
        nvidia.com/gpu.product: "H100-SXM"
      terminationGracePeriodSeconds: 600

Ultra-Large Model Comparison
| Model              | Total | Active | GPUs Needed  | Downloads | Type    |
|--------------------|-------|--------|--------------|-----------|---------|
| Llama 3.1 8B       | 8B    | 8B     | 1x A100 40GB | 7.35M     | Dense   |
| Qwen3.5-35B-A3B    | 36B   | 3B     | 1x A100 40GB | 1.46M     | MoE VLM |
| Qwen3-235B-A22B    | 235B  | 22B    | 4x A100 80GB | 1.66M     | MoE     |
| MiniMax M2.5       | 229B  | 10B    | 4x A100 80GB | 485K      | MoE     |
| Qwen3.5-397B-A17B  | 403B  | 17B    | 8x A100 80GB | 1.66M     | MoE VLM |
| GLM-5              | 754B  | 40B    | 8x H100      | 251K      | MoE     |
| Kimi-K2.5          | 1.1T  | 32B    | 16x H100     | 2.69M     | MoE VLM |

flowchart TD
A[Model Selection] --> B{Quality Requirements}
B -->|Good enough| C[8B-35B models]
B -->|High quality| D[200B-400B models]
B -->|Frontier only| E[700B+ models]
C --> F[1 GPU - $1-2/hr]
D --> G[4-8 GPUs - $10-20/hr]
E --> H[8-16 GPUs - $30-80/hr]
F --> I[Llama 3.1 8B or Qwen3.5-35B MoE]
G --> J[Qwen3-235B or MiniMax M2.5]
H --> K[GLM-5 or Kimi-K2.5]
style K fill:#f96

Common Issues
Model too large for single node
# 1.1T parameters at FP8 (1 byte each) need ~1.1TB of VRAM for the weights alone
# A single 8x H100 node has 640GB: not enough
# Use pipeline parallelism across 2 nodes:
--tensor-parallel-size 8 --pipeline-parallel-size 2
# Requires InfiniBand between nodes

Cost justification
# 16x H100 at ~$5/hr each = $80/hr
# Only justified when:
# 1. Smaller models measurably fail on your task
# 2. Volume justifies self-hosting vs API
# 3. Data sovereignty requires on-prem
# Start with Qwen3-235B or MiniMax M2.5 first

Best Practices
- Start smaller: benchmark Qwen3-235B or MiniMax M2.5 first
- Multi-node with InfiniBand: pipeline parallelism needs fast inter-node links
- FP8 quantization: mandatory to fit on 16 GPUs instead of 32
- Very low concurrency: set --max-num-seqs to 2-4 for a 1.1T model
- Pre-stage model weights: 2TB+ download, pre-pull to NVMe before deployment
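The 16-versus-32 GPU figure in the FP8 bullet follows from weight-size arithmetic, which can be sketched in Python. This is a rough estimate under stated assumptions: it counts weight memory only (1 byte per parameter for FP8, 2 for a 16-bit format), ignores KV cache and activations, and rounds up to whole 8-GPU nodes. The function name and constants are illustrative, not part of vLLM or Kubernetes.

```python
import math

H100_VRAM_GB = 80      # per-GPU memory, from the "16x H100 80GB" figure
GPUS_PER_NODE = 8      # assumption: 8-GPU nodes, as in the manifest above


def gpus_for_weights(params_billions: float, bytes_per_param: float) -> int:
    """Smallest whole-node GPU count whose combined VRAM holds the weights.

    Weights only: KV cache and activations need extra headroom on top.
    """
    weight_gb = params_billions * bytes_per_param  # 1B params x 1 byte = 1 GB
    min_gpus = math.ceil(weight_gb / H100_VRAM_GB)
    nodes = math.ceil(min_gpus / GPUS_PER_NODE)
    return nodes * GPUS_PER_NODE


fp8_gpus = gpus_for_weights(1100, 1)     # FP8: 1 byte per parameter
bf16_gpus = gpus_for_weights(1100, 2)    # 16-bit: 2 bytes per parameter
print(f"FP8: {fp8_gpus} GPUs, 16-bit: {bf16_gpus} GPUs")
```

With FP8 the weights take about 1.1TB of the 1.28TB available on 16 GPUs, so little is left for KV cache. That is consistent with the short context length and low concurrency used in the manifest.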
Key Takeaways
- Kimi-K2.5 at 1.1 trillion parameters is the largest open MoE model
- 2.69M downloads: the most downloaded ultra-large model on Hugging Face
- Requires 16x H100 80GB minimum with FP8 and multi-node pipeline parallelism
- Multimodal: processes images and text
- Cost: ~$80/hr on cloud, only for frontier tasks where smaller models fail
- Start with smaller models and only scale up when quality justifies the cost
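To put the ~$80/hr figure in context, here is a short cost sketch in Python. The $5 per GPU-hour rate is the estimate quoted in this recipe; the 730 hours per month and the helper name `cluster_cost` are assumptions added for illustration, and real cloud pricing will differ.

```python
GPU_COUNT = 16
USD_PER_GPU_HOUR = 5.0    # approximate rate quoted in this recipe
HOURS_PER_MONTH = 730     # assumption: average month (8760 h / 12)


def cluster_cost(hours: float, gpus: int = GPU_COUNT,
                 rate: float = USD_PER_GPU_HOUR) -> float:
    """Cost in USD of keeping `gpus` GPUs allocated for `hours` hours."""
    return gpus * rate * hours


hourly = cluster_cost(1)
monthly = cluster_cost(HOURS_PER_MONTH)
print(f"${hourly:,.0f}/hr, ${monthly:,.0f}/month if left running")
```

Under these assumptions an always-on deployment costs about $58,400 per month, which is why benchmarking a smaller model first is worth the effort.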

