ai advanced ⏱ 20 minutes K8s 1.28+

AI Inference Optimization on Kubernetes

Optimize AI inference performance on Kubernetes with request batching, KV cache tuning, speculative decoding, and continuous batching.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Enable continuous batching in vLLM/TRT-LLM to process multiple requests simultaneously (3-5x throughput vs naive batching). Tune the KV cache to use 90% of the remaining GPU memory. Enable speculative decoding for a 1.5-2x speedup on supported models. Use an FP8 KV cache to cut KV cache memory by about 50%.

The Problem

Default LLM inference configurations waste 50-70% of GPU capacity. Naive sequential processing serves one request at a time, the KV cache is either too small (high latency) or too large (wasted memory), and quantization isn't applied. Tuning these parameters can raise throughput 3-5x without additional hardware.
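To see why the batching strategy matters, the toy simulation below counts decode steps under three scheduling policies. It is a rough sketch under simplifying assumptions, not a model of how vLLM or TRT-LLM schedule internally: every decode step is assumed to cost the same regardless of batch size, prefill is ignored, and the values in `output_lens` are made up for illustration.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens, batch_size):
    """A finished sequence frees its slot for the next waiting request."""
    waiting = deque(output_lens)
    running = []  # remaining tokens for each in-flight sequence
    steps = 0
    while waiting or running:
        # Fill free slots before every decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Each running sequence emits one token; drop the finished ones.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical output lengths (tokens) for eight requests.
output_lens = [8, 64, 16, 256, 32, 8, 128, 16]

print("sequential:", static_batching_steps(output_lens, 1))      # 528
print("static x4: ", static_batching_steps(output_lens, 4))      # 384
print("continuous:", continuous_batching_steps(output_lens, 4))  # 256
```

In this example the static policy idles three slots while the 256-token request finishes, whereas the continuous policy refills them immediately. Real gains depend on the length distribution and on how step cost grows with batch size.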

The Solution

Continuous Batching (vLLM)

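apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-optimized
spec:
  # A Deployment needs a selector that matches the pod template labels
  # (the label value here is illustrative)
  selector:
    matchLabels:
      app: vllm-optimized
  template:
    metadata:
      labels:
        app: vllm-optimized
    spec:
      containers:
        - name: vllm
          image: registry.example.com/vllm:0.6.0
          args:
            # Continuous batching config
            - --model=/models/llama-3-70b
            - --tensor-parallel-size=2
            - --max-num-batched-tokens=8192
            - --max-num-seqs=64
            - --enable-chunked-prefill
            # KV cache optimization
            - --gpu-memory-utilization=0.92
            - --kv-cache-dtype=fp8_e5m2
            - --enable-prefix-caching
            # Speculative decoding (some vLLM releases reject this together
            # with --enable-chunked-prefill; check your version)
            - --speculative-model=meta-llama/Llama-3-8B
            - --num-speculative-tokens=5
            # Performance
            - --disable-log-requests
            # --enforce-eager is a boolean switch that disables CUDA graphs;
            # leave it out to keep them enabled (the default)
          resources:
            limits:
              nvidia.com/gpu: 2
              memory: 128Gi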

Optimization Parameters Explained

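| Parameter | Default (vLLM 0.6) | Optimized | Impact |
|---|---|---|---|
| max-num-batched-tokens | 2048 | 8192 | Up to 3x throughput |
| max-num-seqs | 256 | 64 | Caps concurrent sequences so each keeps enough KV cache |
| gpu-memory-utilization | 0.90 | 0.92 | Slightly larger KV cache |
| kv-cache-dtype | auto (model dtype, typically fp16/bf16) | fp8_e5m2 | 50% KV memory savings |
| enable-prefix-caching | false | true | Cache repeated prompts |
| enable-chunked-prefill | false | true | Better batching |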
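The KV cache saving can be checked with simple arithmetic. The sketch below estimates KV cache bytes per token and how many tokens fit in a given budget. The architecture values (80 layers, 8 KV heads, head dimension 128) are the published Llama 3 70B configuration, restated here as assumptions, and the 20 GiB budget is a hypothetical figure, not a measurement.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def tokens_that_fit(kv_budget_gib, bytes_per_token):
    """Number of cached tokens that fit in a budget given in GiB."""
    return int(kv_budget_gib * 1024**3 // bytes_per_token)


# Assumed Llama 3 70B shape (grouped-query attention with 8 KV heads).
LLAMA3_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

fp16_bytes = kv_bytes_per_token(**LLAMA3_70B, dtype_bytes=2)
fp8_bytes = kv_bytes_per_token(**LLAMA3_70B, dtype_bytes=1)

print(fp16_bytes // 1024, "KiB per token (fp16)")  # 320
print(fp8_bytes // 1024, "KiB per token (fp8)")    # 160

# Hypothetical 20 GiB left for the KV cache after weights and activations.
print(tokens_that_fit(20, fp16_bytes))  # 65536
print(tokens_that_fit(20, fp8_bytes))   # 131072
```

Halving the bytes per element doubles the number of tokens the same budget can hold, which is where the 50% figure comes from. With tensor parallelism the KV heads are split across GPUs, so the per-GPU footprint is smaller than the totals shown.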

TensorRT-LLM Optimization

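apiVersion: apps/v1
kind: Deployment
metadata:
  name: trtllm-server
spec:
  # A Deployment needs a selector that matches the pod template labels
  # (the label value here is illustrative)
  selector:
    matchLabels:
      app: trtllm-server
  template:
    metadata:
      labels:
        app: trtllm-server
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
          # These variables only take effect if a startup script copies them
          # into the Triton model config (config.pbtxt); the stock image does
          # not read them on its own.
          env:
            - name: DECOUPLED_MODE
              value: "true"
            - name: BATCH_SCHEDULER_POLICY
              value: "guaranteed_no_evict"
            - name: MAX_BATCH_SIZE
              value: "64"
            - name: ENABLE_KV_CACHE_REUSE
              value: "true"
            - name: KV_CACHE_FREE_GPU_MEM_FRACTION
              value: "0.9"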

Benchmarking Configuration

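# Benchmark with a realistic workload (random prompts of fixed length)
# --model must match the model name the server exposes
python benchmark_serving.py \
  --backend vllm \
  --base-url http://vllm-svc:8000 \
  --endpoint /v1/completions \
  --model meta-llama/Llama-3-70B \
  --dataset-name random \
  --random-input-len 512 \
  --random-output-len 256 \
  --num-prompts 1000 \
  --request-rate 10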

# Key metrics to track:
# - Tokens/second (throughput)
# - Time to First Token (TTFT)
# - Inter-Token Latency (ITL)
# - Request latency p50/p95/p99

graph TD
    subgraph Default Config
        D_IN[10 requests] -->|Sequential| D_GPU[GPU 30% utilized]
        D_GPU --> D_OUT[10 tok/s throughput]
    end
    
    subgraph Optimized Config
        O_IN[64 requests] -->|Continuous batching| O_GPU[GPU 90% utilized]
        O_GPU --> O_OUT[50 tok/s throughput]
        O_KV[FP8 KV Cache<br/>50% less memory] --> O_GPU
        O_SPEC[Speculative Decode<br/>1.5x speedup] --> O_GPU
        O_PREFIX[Prefix Caching<br/>Skip repeated prompts] --> O_GPU
    end

Common Issues

OOM with high gpu-memory-utilization

Start at 0.85 and increase gradually. Monitor with nvidia-smi. Leave headroom for activation memory during long sequences.

Speculative decoding slower than expected

The draft model is probably too large or too different from the target. Use a model from the same family (Llama 8B for Llama 70B). If the acceptance rate is low, reduce num-speculative-tokens from 5 to 3.
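A back-of-the-envelope estimate shows why fewer speculative tokens can help when acceptance is low. The sketch below uses a simplified model from the speculative decoding literature that assumes every draft token is accepted independently with the same probability. The `draft_cost_ratio` of 0.1 (cost of one draft-model step relative to one target-model step) is an illustrative assumption, not a measured value.

```python
def expected_speedup(acceptance_rate, num_speculative_tokens, draft_cost_ratio):
    """Estimated speedup over plain decoding under an i.i.d. acceptance model."""
    a, k = acceptance_rate, num_speculative_tokens
    if a >= 1.0:
        tokens_per_step = k + 1  # every draft token plus the bonus token
    else:
        # Expected tokens produced per verification step (geometric series).
        tokens_per_step = (1 - a ** (k + 1)) / (1 - a)
    # One target-model pass plus k draft-model passes, in target-pass units.
    cost_per_step = k * draft_cost_ratio + 1
    return tokens_per_step / cost_per_step


for rate in (0.8, 0.5):
    for k in (5, 3):
        print(f"acceptance={rate} k={k} -> {expected_speedup(rate, k, 0.1):.2f}x")
```

Under these assumptions, 5 tokens beat 3 at an acceptance rate of 0.8, while 3 tokens beat 5 at 0.5, because rejected draft tokens still cost draft-model time. Measure the real acceptance rate from your serving metrics before changing the setting.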

Best Practices

  • Always enable continuous batching: 3-5x throughput improvement over naive serving
  • FP8 KV cache: 50% memory savings, minimal quality impact on most models
  • Prefix caching for repeated prompts: system prompts, few-shot examples
  • Benchmark before deploying: measure TTFT, ITL, and throughput under realistic load
  • Speculative decoding for latency-sensitive use cases: 1.5-2x speedup
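The latency metrics named above can be derived from per-token timestamps. The sketch below is a minimal illustration with hypothetical helper names (`request_metrics`, `percentile`) and made-up timings. It is not the code any benchmark tool uses, and it applies a simple nearest-rank percentile.

```python
import statistics


def request_metrics(sent_at, token_times):
    """TTFT, mean ITL, and end-to-end latency for one request, in seconds."""
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft": token_times[0] - sent_at,             # time to first token
        "itl": statistics.mean(gaps) if gaps else 0.0,  # mean inter-token gap
        "latency": token_times[-1] - sent_at,         # full request latency
    }


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


# Made-up example: request sent at t=0.0, tokens arriving every 100 ms.
m = request_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
print(m)

# Made-up end-to-end latencies for ten requests.
latencies = [0.8, 0.9, 1.1, 0.7, 2.5, 0.9, 1.0, 0.8, 1.2, 4.0]
print("p50:", percentile(latencies, 50))
print("p95:", percentile(latencies, 95))
```

Tail percentiles (p95/p99) usually matter more than the mean for interactive workloads, since a few slow requests dominate perceived latency.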

Key Takeaways

  • Continuous batching processes multiple requests simultaneously: 3-5x throughput
  • FP8 KV cache halves memory usage with negligible quality impact
  • Speculative decoding uses a small draft model to speed up generation by 1.5-2x
  • Prefix caching avoids recomputing shared prompt prefixes across requests
  • Default configs waste 50-70% of GPU capacity, so always tune before production

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
