AI Inference Optimization Kubernetes
Optimize AI inference performance on Kubernetes. Request batching, KV cache tuning, speculative decoding, continuous batching.
π‘ Quick Answer: Enable continuous batching in vLLM/TRT-LLM to process multiple requests simultaneously (3-5x throughput vs naive batching). Tune KV cache to use 90% of remaining GPU memory. Enable speculative decoding for 1.5-2x speedup on supported models. Use FP8 KV cache for 50% memory savings.
The Problem
Default LLM inference configurations waste 50-70% of GPU capacity. Naive sequential processing serves one request at a time, KV cache is either too small (high latency) or too large (wastes memory), and quantization isnβt applied. Optimizing these parameters can 3-5x throughput without additional hardware.
The Solution
Continuous Batching (vLLM)
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-optimized
spec:
template:
spec:
containers:
- name: vllm
image: registry.example.com/vllm:0.6.0
args:
# Continuous batching config
- --model=/models/llama-3-70b
- --tensor-parallel-size=2
- --max-num-batched-tokens=8192
- --max-num-seqs=64
- --enable-chunked-prefill
# KV cache optimization
- --gpu-memory-utilization=0.92
- --kv-cache-dtype=fp8_e5m2
- --enable-prefix-caching
# Speculative decoding
- --speculative-model=meta-llama/Llama-3-8B
- --num-speculative-tokens=5
# Performance
- --disable-log-requests
- --enforce-eager=false
resources:
limits:
nvidia.com/gpu: 2
memory: 128GiOptimization Parameters Explained
| Parameter | Default | Optimized | Impact |
|---|---|---|---|
max-num-batched-tokens | 2048 | 8192 | 3x throughput |
max-num-seqs | 16 | 64 | More concurrent requests |
gpu-memory-utilization | 0.80 | 0.92 | Larger KV cache |
kv-cache-dtype | auto (fp16) | fp8_e5m2 | 50% KV memory savings |
enable-prefix-caching | false | true | Cache repeated prompts |
enable-chunked-prefill | false | true | Better batching |
TensorRT-LLM Optimization
apiVersion: apps/v1
kind: Deployment
metadata:
name: trtllm-server
spec:
template:
spec:
containers:
- name: triton
image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
env:
- name: DECOUPLED_MODE
value: "true"
- name: BATCH_SCHEDULER_POLICY
value: "guaranteed_no_evict"
- name: MAX_BATCH_SIZE
value: "64"
- name: ENABLE_KV_CACHE_REUSE
value: "true"
- name: KV_CACHE_FREE_GPU_MEM_FRACTION
value: "0.9"Benchmarking Configuration
# Benchmark with realistic workload
python benchmark_serving.py \
--model meta-llama/Llama-3-70B \
--num-prompts 1000 \
--request-rate 10 \
--input-len 512 \
--output-len 256 \
--endpoint http://vllm-svc:8000/v1/completions
# Key metrics to track:
# - Tokens/second (throughput)
# - Time to First Token (TTFT)
# - Inter-Token Latency (ITL)
# - Request latency p50/p95/p99graph TD
subgraph Default Config
D_IN[10 requests] -->|Sequential| D_GPU[GPU 30% utilized]
D_GPU --> D_OUT[10 tok/s throughput]
end
subgraph Optimized Config
O_IN[64 requests] -->|Continuous batching| O_GPU[GPU 90% utilized]
O_GPU --> O_OUT[50 tok/s throughput]
O_KV[FP8 KV Cache<br/>50% less memory] --> O_GPU
O_SPEC[Speculative Decode<br/>1.5x speedup] --> O_GPU
O_PREFIX[Prefix Caching<br/>Skip repeated prompts] --> O_GPU
endCommon Issues
OOM with high gpu-memory-utilization
Start at 0.85 and increase gradually. Monitor with nvidia-smi. Leave headroom for activation memory during long sequences.
Speculative decoding slower than expected
Draft model too large or too different from target. Use a model from the same family (Llama 8B for Llama 70B). Reduce num-speculative-tokens from 5 to 3 if acceptance rate is low.
Best Practices
- Always enable continuous batching β 3-5x throughput improvement over naive serving
- FP8 KV cache β 50% memory savings, minimal quality impact on most models
- Prefix caching for repeated prompts β system prompts, few-shot examples
- Benchmark before deploying β measure TTFT, ITL, and throughput under realistic load
- Speculative decoding for latency-sensitive use cases β 1.5-2x speedup
Key Takeaways
- Continuous batching processes multiple requests simultaneously β 3-5x throughput
- FP8 KV cache halves memory usage with negligible quality impact
- Speculative decoding uses a small draft model to speed up generation by 1.5-2x
- Prefix caching avoids recomputing shared prompt prefixes across requests
- Default configs waste 50-70% of GPU capacity β always tune before production

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
