Deploy MiniMax M2.5 229B on Kubernetes
Deploy MiniMax M2.5 229B model on Kubernetes with vLLM. High-performance LLM with 485K downloads, optimized for multi-turn conversation and long context.
π‘ Quick Answer: Deploy MiniMax M2.5 (229B parameters) with vLLM using
--tensor-parallel-size 4on 4x A100 80GB. A strong large-scale LLM with 485K+ downloads, optimized for long multi-turn conversations and complex reasoning tasks.
The Problem
Large-scale LLMs in the 200B+ range offer significantly better reasoning than smaller models, but:
- GPU cost β 229B in FP16 needs ~460GB VRAM
- Serving efficiency β must maximize throughput across multiple GPUs
- Long conversations β multi-turn context needs efficient KV cache management
- Competition β how does M2.5 compare to Qwen3-235B, Llama, and GPT-4?
MiniMax M2.5 (485K downloads, 1.17K likes) is a strong contender in the 200B+ open model space.
The Solution
Deploy MiniMax M2.5
apiVersion: apps/v1
kind: Deployment
metadata:
name: minimax-m25
namespace: ai-inference
labels:
app: minimax-m25
spec:
replicas: 1
selector:
matchLabels:
app: minimax-m25
template:
metadata:
labels:
app: minimax-m25
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "MiniMaxAI/MiniMax-M2.5"
- "--tensor-parallel-size"
- "4"
- "--max-model-len"
- "32768"
- "--gpu-memory-utilization"
- "0.92"
- "--max-num-seqs"
- "16"
- "--enable-chunked-prefill"
- "--trust-remote-code"
- "--port"
- "8000"
ports:
- containerPort: 8000
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: huggingface-token
key: token
- name: NCCL_DEBUG
value: "WARN"
resources:
limits:
nvidia.com/gpu: "4"
memory: 192Gi
cpu: "32"
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
startupProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 300
periodSeconds: 30
failureThreshold: 30
readinessProbe:
httpGet:
path: /health
port: 8000
periodSeconds: 15
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: minimax-model-cache
- name: shm
emptyDir:
medium: Memory
sizeLimit: 32Gi
terminationGracePeriodSeconds: 120
---
apiVersion: v1
kind: Service
metadata:
name: minimax-m25
namespace: ai-inference
spec:
selector:
app: minimax-m25
ports:
- port: 8000
targetPort: 8000FP8 on H100 (2 GPUs)
args:
- "--model"
- "MiniMaxAI/MiniMax-M2.5"
- "--tensor-parallel-size"
- "2"
- "--quantization"
- "fp8"
- "--max-model-len"
- "32768"
- "--trust-remote-code"
resources:
limits:
nvidia.com/gpu: "2"
nodeSelector:
nvidia.com/gpu.product: "H100-SXM"200B+ Model Landscape
| Model | Params | Type | GPUs (FP16) | Downloads | Focus |
|--------------------|--------|-------|-------------|-----------|-----------------|
| MiniMax M2.5 | 229B | Dense | 4x A100 80GB| 485K | Long context |
| Qwen3-235B-A22B | 235B | MoE | 4x A100 80GB| 1.66M | Thinking mode |
| GLM-5 | 754B | Dense | 8x H100 | 251K | Frontier |
| Kimi-K2.5 | 1.1T | MoE | 16x H100 | 2.69M | Ultra frontier |flowchart TD
A{Budget and Use Case} --> B[1-2 GPUs]
A --> C[4 GPUs]
A --> D[8+ GPUs]
B --> E[Qwen3-235B MoE FP8 on H100]
C --> F[MiniMax M2.5 Dense on A100]
C --> G[Qwen3-235B MoE on A100]
D --> H[GLM-5 754B]
E --> I[Best value - MoE efficiency]
F --> J[Best for long conversations]
G --> K[Best for reasoning with thinking]
H --> L[Maximum quality]Common Issues
Dense 229B vs MoE 235B
# MiniMax M2.5 is DENSE β all 229B parameters active per token
# Qwen3-235B-A22B is MoE β only 22B active
# Same GPU count, but:
# - MiniMax: slower inference, all parameters used β potentially deeper reasoning
# - Qwen3 MoE: faster inference, broader but sparser knowledgeMemory for long multi-turn conversations
# Long conversations accumulate KV cache
# With 229B dense model + 32K context, KV cache is substantial
--max-num-seqs 8 # Reduce concurrency for longer conversations
--max-model-len 16384 # Cap context if memory constrainedBest Practices
- 4x A100 80GB at FP16, or 2x H100 with FP8
- Dense advantage β all parameters active means deeper per-token reasoning
- NVMe PVC β ~460GB model weights, fast storage essential
- Generous shared memory β 32Gi
/dev/shmfor 4-GPU NCCL - Compare against Qwen3 MoE β benchmark both for your use case
Key Takeaways
- MiniMax M2.5 is a 229B dense model β all parameters active per token
- Requires 4x A100 80GB (FP16) or 2x H100 (FP8)
- 485K+ downloads β strong community adoption
- Best for long multi-turn conversations and complex reasoning
- Dense vs MoE trade-off: deeper reasoning per token vs faster throughput

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
