Deploy GLM-5 754B on Kubernetes
Deploy Zhipu AI GLM-5 754B model on Kubernetes with vLLM. One of the largest open-weight models with multi-node tensor parallelism across 8+ GPUs.
π‘ Quick Answer: Deploy GLM-5 (754B parameters) with vLLM using
--tensor-parallel-size 8on 8x H100 80GB GPUs. One of the largest open-weight models available β needs 1.5TB+ of VRAM in FP16 or 8x H100 with FP8 quantization. For most teams, FP8 on H100 is the practical deployment path.
The Problem
Ultra-large language models (700B+) push the boundaries of whatβs possible with open weights:
- Frontier reasoning β complex multi-step problems that smaller models struggle with
- Deep knowledge β broader coverage of specialized domains
- GPU requirements β 754B in FP16 needs ~1.5TB VRAM, far beyond a single node
- Inference optimization β tensor parallelism, quantization, and efficient KV cache management are critical
GLM-5 from Zhipu AI (251K+ downloads, 1.78K+ likes) is one of the largest open models on HuggingFace.
The Solution
Step 1: Deploy GLM-5 with FP8 on 8x H100
apiVersion: apps/v1
kind: Deployment
metadata:
name: glm5-754b
namespace: ai-inference
labels:
app: glm5-754b
spec:
replicas: 1
selector:
matchLabels:
app: glm5-754b
template:
metadata:
labels:
app: glm5-754b
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "zai-org/GLM-5"
- "--tensor-parallel-size"
- "8"
- "--quantization"
- "fp8"
- "--max-model-len"
- "16384"
- "--gpu-memory-utilization"
- "0.92"
- "--max-num-seqs"
- "8"
- "--enable-chunked-prefill"
- "--trust-remote-code"
- "--dtype"
- "bfloat16"
- "--port"
- "8000"
ports:
- containerPort: 8000
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: huggingface-token
key: token
- name: NCCL_DEBUG
value: "WARN"
- name: VLLM_WORKER_MULTIPROC_METHOD
value: "spawn"
resources:
limits:
nvidia.com/gpu: "8"
memory: 256Gi
cpu: "64"
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
startupProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 600
periodSeconds: 60
failureThreshold: 30
readinessProbe:
httpGet:
path: /health
port: 8000
periodSeconds: 30
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: glm5-model-cache
- name: shm
emptyDir:
medium: Memory
sizeLimit: 64Gi
nodeSelector:
nvidia.com/gpu.product: "H100-SXM"
terminationGracePeriodSeconds: 300
---
apiVersion: v1
kind: Service
metadata:
name: glm5-754b
namespace: ai-inference
spec:
selector:
app: glm5-754b
ports:
- port: 8000
targetPort: 8000GPU Requirements
| Precision | Total VRAM | Configuration | Context |
|-----------|-------------|---------------------------|----------|
| FP16 | ~1.5TB | 16x A100 80GB or 20x 80GB | 8K |
| FP8 | ~750GB | 8x H100 80GB | 16K |
| INT4 AWQ | ~375GB | 4x H100 80GB or 8x A100 | 16K |flowchart TD
A[GLM-5 754B] --> B{Quantization}
B -->|FP16 ~1.5TB| C[16x A100 80GB]
B -->|FP8 ~750GB| D[8x H100 80GB]
B -->|INT4 ~375GB| E[4x H100 80GB]
C --> F[Max quality - research]
D --> G[Best balance - production]
E --> H[Most cost-effective]
subgraph NVLink or NVSwitch Required
C
D
E
endCommon Issues
Model loading takes 30+ minutes
# 754B at FP8 is ~750GB of weights
# NVMe-backed PVC is essential
# Pre-download weights as an init container or CronJob
startupProbe:
initialDelaySeconds: 600 # 10 minutes
periodSeconds: 60
failureThreshold: 30 # total 40 minutesNCCL timeout with 8 GPUs
env:
- name: NCCL_SOCKET_IFNAME
value: "eth0"
- name: NCCL_IB_DISABLE
value: "0" # Enable InfiniBand if available
- name: NCCL_NET_GDR_LEVEL
value: "5" # GPUDirect RDMA
- name: NCCL_TIMEOUT
value: "1800" # 30 min timeout for large modelsBest Practices
- 8x H100 with FP8 β the practical deployment path for 754B
- NVLink/NVSwitch mandatory β PCIe interconnect is too slow for 8-GPU TP
- NVMe PVC β network storage is impractical for 750GB+ model weights
- Low concurrency β
--max-num-seqs 4-8to avoid OOM - 64Gi
/dev/shmβ NCCL needs large shared memory for 8-GPU communication
Key Takeaways
- GLM-5 is 754B parameters β one of the largest open-weight models available
- Minimum 8x H100 80GB with FP8 quantization for practical deployment
- 251K+ downloads β significant community adoption despite extreme hardware requirements
- Use for frontier-level reasoning tasks that smaller models canβt handle
- NVLink/NVSwitch is mandatory β PCIe bandwidth is insufficient for 8-way tensor parallelism

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
