AIPerf Trace Replay Benchmarks on K8s
Replay production traffic traces with AIPerf on Kubernetes. Use moon_cake format, ShareGPT datasets, and fixed schedules for realistic LLM benchmarks.
Quick Answer: Use aiperf profile --input-file payload:trace.jsonl to replay production traffic patterns with exact timestamps. For quick realistic benchmarks, use --input-file sharegpt to load ShareGPT conversation data with real input/output length distributions.
The Problem
Synthetic benchmarks with uniform input/output lengths don't represent real production traffic:
- Input lengths vary wildly, from 10 tokens ("summarize this") to 8000+ (RAG context)
- Output lengths are unpredictable: short answers vs long code generation
- Traffic is bursty: spikes after product launches, quiet at night
- Multi-turn conversations have session state and KV cache implications
Trace replay benchmarks use actual traffic patterns to answer: "Can my deployment handle yesterday's peak?"
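To put numbers on the burstiness and length spread described above, a trace can be summarized before it is replayed. The following is a minimal sketch, assuming the timestamp (milliseconds), input_length, and output_length fields used by the trace in Step 1; the function name and the one-second window are illustrative choices, not part of AIPerf.

```python
import json
from collections import Counter


def summarize_trace(lines, window_ms=1000):
    """Return request count, peak requests per window, and input length range."""
    entries = [json.loads(line) for line in lines if line.strip()]
    # Bucket each request into a fixed-size time window by its timestamp.
    per_window = Counter(e["timestamp"] // window_ms for e in entries)
    inputs = [e["input_length"] for e in entries]
    return {
        "requests": len(entries),
        "peak_per_window": max(per_window.values()),
        "min_input": min(inputs),
        "max_input": max(inputs),
    }


# A few lines in the same shape as the Step 1 trace.
sample = [
    '{"timestamp": 0, "input_length": 6955, "output_length": 52}',
    '{"timestamp": 3500, "input_length": 512, "output_length": 128}',
    '{"timestamp": 3520, "input_length": 4096, "output_length": 64}',
    '{"timestamp": 3580, "input_length": 2048, "output_length": 512}',
]
print(summarize_trace(sample))
```

A peak-per-window value well above the average rate is a sign that a uniform-rate synthetic benchmark would understate the load.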
The Solution
Step 1: Replay Production Traces (Moon Cake Format)
apiVersion: v1
kind: ConfigMap
metadata:
  name: production-trace
  namespace: ai-inference
data:
  trace.jsonl: |
    {"timestamp": 0, "input_length": 6955, "output_length": 52}
    {"timestamp": 1053, "input_length": 6472, "output_length": 26}
    {"timestamp": 2748, "input_length": 1024, "output_length": 256}
    {"timestamp": 3500, "input_length": 512, "output_length": 128}
    {"timestamp": 3520, "input_length": 4096, "output_length": 64}
    {"timestamp": 3580, "input_length": 2048, "output_length": 512}
    {"timestamp": 5000, "input_length": 256, "output_length": 1024}
    {"timestamp": 8200, "input_length": 8192, "output_length": 32}
    {"timestamp": 10500, "input_length": 550, "output_length": 256}
    {"timestamp": 12000, "input_length": 1500, "output_length": 100}
---
apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-trace-replay
  namespace: ai-inference
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: aiperf
          image: python:3.11-slim
          command:
            - /bin/bash
            - -c
            - |
              pip install aiperf
              aiperf profile \
                --model llama3-8b \
                --streaming \
                --endpoint-type chat \
                --url http://vllm-server.ai-inference:8000 \
                --tokenizer meta-llama/Llama-3-8B-Instruct \
                --input-file payload:/trace/trace.jsonl \
                --ui simple \
                --artifact-dir /results/trace-replay
          volumeMounts:
            - name: trace
              mountPath: /trace
            - name: results
              mountPath: /results
      volumes:
        - name: trace
          configMap:
            name: production-trace
        - name: results
          persistentVolumeClaim:
            claimName: benchmark-results

Step 2: ShareGPT Dataset Benchmarks
# Use ShareGPT for realistic conversation distributions
aiperf profile \
--model llama3-8b \
--streaming \
--endpoint-type chat \
--url http://vllm-server:8000 \
--input-file sharegpt \
--concurrency 16 \
--request-count 200 \
--ui simple
# ShareGPT provides real conversation data with:
# - Varied input lengths (short questions to long contexts)
# - Realistic output length distributions
# - Multi-turn conversation structures

Step 3: Custom Prompt Benchmarks
# Send exact prompts from a file (no tokenization/synthesis)
cat > /tmp/prompts.jsonl << 'EOF'
{"text": "Explain the difference between a Pod and a Deployment in Kubernetes"}
{"text": "Write a Python script that monitors Kubernetes pod health and sends Slack alerts when pods crash"}
{"text": "What is the best practice for managing secrets in Kubernetes? Include examples with External Secrets Operator"}
EOF
aiperf profile \
--model llama3-8b \
--streaming \
--endpoint-type chat \
--url http://vllm-server:8000 \
--input-file /tmp/prompts.jsonl \
--concurrency 4 \
--ui simple

Step 4: Multi-Turn Session Benchmarks
# Simulate multi-turn conversations
aiperf profile \
--model llama3-8b \
--streaming \
--endpoint-type chat \
--url http://vllm-server:8000 \
--num-sessions 10 \
--session-concurrency 5 \
--session-turns-mean 4 \
--session-turns-stddev 2 \
--session-turn-delay-mean 2000 \
--session-turn-delay-stddev 500 \
--ui simple
# This simulates 10 users having multi-turn conversations
# Average 4 turns per session with 2s think time between turns
# Tests KV cache efficiency and session management

Step 5: Mixed Input/Output Length Distributions
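# Simulate varied input and output lengths around one mean each.
# One mean/stddev pair gives a single-peaked distribution, not a bimodal one;
# to cover short Q&A plus long generation, run a second profile with a larger --output-tokens-mean.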
aiperf profile \
--model llama3-8b \
--streaming \
--endpoint-type chat \
--url http://vllm-server:8000 \
--synthetic-input-tokens-mean 550 \
--synthetic-input-tokens-stddev 200 \
--output-tokens-mean 256 \
--output-tokens-stddev 128 \
--concurrency 16 \
--request-count 500 \
--random-seed 42

Step 6: Capture and Replay Your Own Traffic
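When request logs already exist, a trace in the same three-field format can be derived offline instead of being captured live. The following is a minimal sketch, assuming hypothetical log records with ts (epoch seconds), prompt_tokens, and completion_tokens keys; those names are illustrative and are not defined by AIPerf or by any particular serving stack. The collector that follows covers the live-capture case.

```python
import json


def logs_to_trace(records):
    """Convert log records into trace entries relative to the first request."""
    # Sort by time so the first entry defines offset 0.
    ordered = sorted(records, key=lambda r: r["ts"])
    if not ordered:
        return []
    t0 = ordered[0]["ts"]
    return [
        {
            # Offset from the first request, in milliseconds.
            "timestamp": int(round((r["ts"] - t0) * 1000)),
            "input_length": r["prompt_tokens"],
            "output_length": r["completion_tokens"],
        }
        for r in ordered
    ]


# Hypothetical log records, deliberately out of order.
records = [
    {"ts": 1700000001.5, "prompt_tokens": 512, "completion_tokens": 128},
    {"ts": 1700000000.0, "prompt_tokens": 6955, "completion_tokens": 52},
    {"ts": 1700000000.25, "prompt_tokens": 1024, "completion_tokens": 256},
]
for entry in logs_to_trace(records):
    print(json.dumps(entry))
```

Each printed line can be appended to a .jsonl file and mounted the same way as the trace in Step 1.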
# traffic_collector.py: capture real requests for replay
import json
import time


class TrafficCollector:
    """Capture inference requests in moon_cake format."""

    def __init__(self, output_file="traffic.jsonl"):
        self.output_file = output_file
        # Timestamps are recorded relative to collector start-up.
        self.start_time = time.time()

    def log_request(self, input_tokens, output_tokens):
        # Milliseconds elapsed since the collector was created.
        timestamp_ms = int((time.time() - self.start_time) * 1000)
        entry = {
            "timestamp": timestamp_ms,
            "input_length": input_tokens,
            "output_length": output_tokens,
        }
        # Append one JSON object per line (JSONL).
        with open(self.output_file, "a") as f:
            f.write(json.dumps(entry) + "\n")

# Mount collected traces into AIPerf job
apiVersion: batch/v1
kind: Job
metadata:
  name: replay-yesterday
  namespace: ai-inference
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: aiperf
          image: python:3.11-slim
          command:
            - /bin/bash
            - -c
            - |
              pip install aiperf
              aiperf profile \
                --model llama3-8b \
                --streaming \
                --endpoint-type chat \
                --url http://vllm-server:8000 \
                --input-file payload:/traces/yesterday.jsonl \
                --session-delay-ratio 0.5 \
                --ui simple
          volumeMounts:
            - name: traces
              mountPath: /traces
      volumes:
        - name: traces
          persistentVolumeClaim:
            claimName: traffic-traces

flowchart TD
A[Traffic Sources] --> B{Trace Format}
B -->|Production logs| C[Moon Cake JSONL]
B -->|Open dataset| D[ShareGPT]
B -->|Custom prompts| E[Prompt JSONL]
C --> F[AIPerf Trace Replay]
D --> F
E --> F
F --> G[Exact timestamp replay]
F --> H[Configurable delay ratio]
F --> I[Multi-turn sessions]
G --> J[Results match production patterns]
H --> K[Speed up or slow down replay]
I --> L[KV cache efficiency testing]

Common Issues
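Some replay problems start with a malformed trace file, so a quick pre-flight check can save a failed Job. The following is a minimal sketch that checks for the three fields from Step 1, non-negative integer values, and non-decreasing timestamps. These rules are assumptions about what a clean trace looks like; whether AIPerf itself rejects such files is not covered in this article.

```python
import json

REQUIRED = ("timestamp", "input_length", "output_length")


def validate_trace(lines):
    """Return a list of human-readable problems found in a JSONL trace."""
    problems = []
    previous = None
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # Ignore blank lines.
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {number}: not valid JSON")
            continue
        if not isinstance(entry, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = [key for key in REQUIRED if key not in entry]
        if missing:
            problems.append(f"line {number}: missing {', '.join(missing)}")
            continue
        if any(not isinstance(entry[k], int) or entry[k] < 0 for k in REQUIRED):
            problems.append(f"line {number}: fields must be non-negative integers")
            continue
        # Compare against the last valid timestamp seen so far.
        if previous is not None and entry["timestamp"] < previous:
            problems.append(f"line {number}: timestamp goes backwards")
        previous = entry["timestamp"]
    return problems


bad = [
    '{"timestamp": 10, "input_length": 10, "output_length": 5}',
    '{"timestamp": 5, "input_length": 10}',
    'oops',
    '{"timestamp": 3, "input_length": 1, "output_length": 1}',
]
for problem in validate_trace(bad):
    print(problem)
```

An empty result means the file passed these checks, not that the replay is guaranteed to succeed.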
Trace timestamps too spread out
# Speed up replay with delay ratio
# 0.5 = replay at 2x speed, 0.1 = 10x speed
aiperf profile \
--input-file payload:trace.jsonl \
--session-delay-ratio 0.1

ShareGPT outputs longer than model supports
# Cap output length
--output-tokens-mean 512 \
--extra-inputs max_tokens:512

Multi-turn sessions overwhelm server
# Reduce session concurrency
--session-concurrency 2 \
--num-sessions 5
# Or increase turn delay
--session-turn-delay-mean 5000

Best Practices
- Capture production traffic in moon_cake format: synthetic benchmarks miss real-world distributions
- Use --session-delay-ratio to speed up or slow down replays without altering relative timing
- Use ShareGPT for quick realistic tests: better than uniform synthetic data, with zero setup
- Multi-turn sessions test KV cache reuse, which is critical for chat applications
- Store traces on PVCs and build a library of traffic patterns (peak, normal, bursty) for regression testing
- Compare trace replay results before and after changes (model swap, quantization, backend migration)
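The delay-ratio practice above comes down to simple arithmetic: every time offset is multiplied by the ratio, so the gaps shrink or grow together and their proportions stay the same. The following is a minimal sketch of that arithmetic only. How AIPerf applies the ratio internally is not shown in this article, and scale_timestamps is a hypothetical helper.

```python
def scale_timestamps(timestamps_ms, ratio):
    """Multiply every offset by ratio, so 0.5 halves each gap (2x replay speed)."""
    return [int(t * ratio) for t in timestamps_ms]


# First four timestamps from the Step 1 trace.
original = [0, 1053, 2748, 3500]
faster = scale_timestamps(original, 0.5)
print(faster)
```

Offsets are truncated to whole milliseconds, so scaled gaps can differ from the exact product by under one millisecond.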
Key Takeaways
- Trace replay uses real traffic patterns with exact timestamps for production-realistic benchmarks
- Moon cake format captures timestamp, input_length, output_length per request as JSONL
- ShareGPT provides instant access to realistic conversation data with varied distributions
- Multi-turn sessions test KV cache efficiency with configurable think time between turns
- Delay ratio lets you speed up or slow down replay without altering relative request patterns
- Combine trace replay with concurrency sweeps to find how much headroom your deployment has
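A concurrency sweep can be scripted by generating one command per concurrency level. The following is a minimal sketch that only builds and prints the commands. The flags are copied from Step 2 of this article and have not been checked against a specific AIPerf release; the /results/sweep-c<level> directories are hypothetical names.

```python
import shlex


def sweep_commands(levels, url="http://vllm-server:8000", model="llama3-8b"):
    """Build one aiperf command (as an argument list) per concurrency level."""
    commands = []
    for level in levels:
        commands.append([
            "aiperf", "profile",
            "--model", model,
            "--streaming",
            "--endpoint-type", "chat",
            "--url", url,
            "--input-file", "sharegpt",
            "--concurrency", str(level),
            "--request-count", "200",
            # Separate artifact directory per level so results are not overwritten.
            "--artifact-dir", f"/results/sweep-c{level}",
            "--ui", "simple",
        ])
    return commands


for command in sweep_commands([4, 8, 16, 32]):
    print(shlex.join(command))
```

Comparing latency across the resulting artifact directories shows the concurrency level at which the deployment starts to degrade.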
