AIPerf: Benchmark LLMs on Kubernetes
Deploy NVIDIA AIPerf to benchmark LLM inference performance on Kubernetes. Measure TTFT, ITL, and throughput with a real-time dashboard and GPU telemetry.
💡 Quick Answer: Run
aiperf profile --model <model> --streaming --endpoint-type chat --url http://<server>:8000
to benchmark any OpenAI-compatible LLM endpoint. AIPerf is NVIDIA's next-generation benchmarking tool, replacing GenAI-Perf, with a real-time TUI dashboard, plugin architecture, and multiprocess scalability.
The Problem
Benchmarking LLM inference in production Kubernetes clusters requires:
- Realistic load generation: concurrency sweeps, request rate control, trace replay
- Real-time visibility: watching metrics live during the benchmark, not just at the end
- GPU telemetry: correlating inference latency with GPU utilization and memory pressure
- Reproducibility: deterministic datasets and configurable random seeds
- Flexibility: testing across Triton, vLLM, TGI, Ollama, and OpenAI-compatible endpoints
GenAI-Perf handled some of these but is now deprecated. AIPerf is its successor, built on a scalable multiprocess architecture with 9 ZMQ-connected services, extensible plugins, and three UI modes.
The Solution
Step 1: Deploy AIPerf as a Kubernetes Job
apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
  namespace: ai-inference
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: aiperf
          image: python:3.11-slim
          command:
            - /bin/bash
            - -c
            - |
              pip install aiperf
              # Benchmark vLLM deployment
              aiperf profile \
                --model llama3-8b \
                --streaming \
                --endpoint-type chat \
                --url http://vllm-server.ai-inference:8000 \
                --concurrency 16 \
                --request-count 200 \
                --tokenizer meta-llama/Llama-3-8B-Instruct \
                --ui simple \
                --artifact-dir /results/llama3-c16
              echo "=== Benchmark Complete ==="
              cat /results/llama3-c16/*_aiperf.csv
          resources:
            limits:
              cpu: "4"
              memory: 8Gi
          volumeMounts:
            - name: results
              mountPath: /results
      volumes:
        - name: results
          persistentVolumeClaim:
            claimName: benchmark-results

Step 2: Quick Benchmarks from a Debug Pod
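For the ad-hoc commands in this step, a throwaway debug pod works well. A minimal sketch with kubectl (the pod name is illustrative; the image matches the Job above):

```shell
# Start an interactive throwaway pod next to the inference server
kubectl run aiperf-debug -n ai-inference --image=python:3.11-slim \
  --restart=Never -it -- /bin/bash

# Delete it when finished
kubectl delete pod aiperf-debug -n ai-inference
```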
# Install AIPerf
pip install aiperf
# Benchmark OpenAI-compatible endpoint (vLLM, TGI, etc.)
aiperf profile \
--model llama3-8b \
--streaming \
--endpoint-type chat \
--url http://vllm-server:8000 \
--concurrency 10 \
--request-count 100
# Benchmark Triton with TensorRT-LLM backend
aiperf profile \
--model llama3-8b \
--streaming \
--endpoint-type chat \
--url http://triton-server:8000 \
--concurrency 32
# Benchmark with dashboard UI (requires TTY)
aiperf profile \
--model llama3-8b \
--streaming \
--endpoint-type chat \
--url http://vllm-server:8000 \
  --ui dashboard

Step 3: Understanding the Output
NVIDIA AIPerf | LLM Metrics
| Metric                               | avg    | min    | p99     | p50    |
|--------------------------------------|--------|--------|---------|--------|
| Time to First Token (ms)             | 45.20  | 32.10  | 98.50   | 42.30  |
| Time to Second Token (ms)            | 12.50  | 8.20   | 28.90   | 11.80  |
| Inter Token Latency (ms)             | 11.30  | 8.50   | 25.60   | 10.90  |
| Request Latency (ms)                 | 892.40 | 456.20 | 1845.30 | 812.50 |
| Output Token Throughput (tokens/sec) | 1420.5 | N/A    | N/A     | N/A    |
| Request Throughput (requests/sec)    | 11.2   | N/A    | N/A     | N/A    |

Key metrics:
- TTFT: time to first token; determines perceived responsiveness
- TTST: time to second token (new in AIPerf); captures KV cache allocation overhead
- ITL: inter-token latency; affects streaming UX quality
- Output token throughput: total tokens/sec across all concurrent requests
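For a streaming endpoint, these metrics relate roughly as request latency ≈ TTFT + (output tokens − 1) × ITL, which lets you turn a latency SLO into an output-token budget. A quick sketch with awk, using the average values from the table above (the 1 s SLO is illustrative):

```shell
# Token budget under a 1 s latency SLO, from avg TTFT/ITL:
# latency ~= TTFT + (n - 1) * ITL  =>  n = floor((SLO - TTFT) / ITL) + 1
awk 'BEGIN { ttft=45.2; itl=11.3; slo=1000.0;
             print int((slo - ttft) / itl) + 1 }'
# -> 85
```

In other words, at these averages a request can emit about 85 tokens before its end-to-end latency crosses one second.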
Step 4: GPU Telemetry with DCGM
# Collect GPU metrics during benchmark
aiperf profile \
--model llama3-8b \
--streaming \
--endpoint-type chat \
--url http://vllm-server:8000 \
--concurrency 32 \
--server-metrics-urls http://dcgm-exporter.gpu-operator:9400/metrics \
--verbose
# GPU metrics collected:
# - GPU utilization, SM utilization
# - Memory used/free, power usage
# - PCIe throughput, NVLink errors
# - Temperature, clock speeds

flowchart TD
A[AIPerf Pod] -->|HTTP/gRPC| B[Inference Server]
A -->|Prometheus scrape| C[DCGM Exporter]
B --> D[GPU Workers]
C --> D
A --> E[Results]
E --> F[Console Table]
E --> G[CSV Export]
E --> H[JSON Export]
E --> I[PNG Plots]
subgraph AIPerf Architecture
J[Request Generator] -->|ZMQ| K[Transport Layer]
K -->|ZMQ| L[Response Collector]
L -->|ZMQ| M[Metrics Aggregator]
  end

Common Issues
Dashboard UI not rendering in Kubernetes Job
# Jobs don't have a TTY β use simple or none UI mode
aiperf profile --ui simple # progress bars
aiperf profile --ui none # headless, logs only
# For interactive debugging, use kubectl exec with TTY
kubectl exec -it debug-pod -- aiperf profile --ui dashboard

Tokenizer download fails in air-gapped clusters
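One way to stage the tokenizer files is the Hugging Face CLI on a machine with internet access, then copy them into the cluster (the repo ID matches the model used above; the target path is illustrative, and gated repos also need an access token):

```shell
# Download only the tokenizer files from the model repo
pip install -U huggingface_hub
huggingface-cli download meta-llama/Llama-3-8B-Instruct \
  --include "tokenizer*" \
  --local-dir /models/tokenizers/llama3
```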
# Pre-download tokenizer and mount as volume
# Or use --tokenizer with a local path
aiperf profile \
--tokenizer /models/tokenizers/llama3 \
  --model llama3-8b

High concurrency causes port exhaustion
# AIPerf note: >15,000 concurrency may exhaust ports
# Adjust system limits if needed:
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# Or reduce concurrency to realistic levels

Best Practices
- Use --ui simple for Kubernetes Jobs: the dashboard requires a TTY; simple mode shows progress bars
- Set --random-seed for reproducible benchmarks across runs
- Match --tokenizer to your model: token counts affect all per-token metrics
- Start with --warmup-request-count 10 to eliminate cold-start effects
- Export results to a PVC: use --artifact-dir mounted to a PersistentVolumeClaim for post-analysis
- Migrate from GenAI-Perf: AIPerf is the successor with the same CLI patterns plus new features
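Putting several of these practices together, a reproducible run might look like this (the model name, URL, and seed value are illustrative):

```shell
aiperf profile \
  --model llama3-8b \
  --streaming \
  --endpoint-type chat \
  --url http://vllm-server:8000 \
  --tokenizer meta-llama/Llama-3-8B-Instruct \
  --random-seed 42 \
  --warmup-request-count 10 \
  --concurrency 16 \
  --request-count 200 \
  --ui simple \
  --artifact-dir /results/llama3-c16-seed42
```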
Key Takeaways
- AIPerf replaces GenAI-Perf as NVIDIA's official LLM benchmarking tool
- Built on a 9-service multiprocess architecture with ZMQ for scalability
- Three UI modes: dashboard (real-time TUI), simple (progress bars), none (headless)
- Measures TTFT, TTST, ITL, output token throughput, and request throughput
- Works with any OpenAI-compatible endpoint: vLLM, Triton, TGI, Ollama, OpenAI
- Plugin system for custom endpoints, datasets, transports, and metrics
- Collects GPU telemetry from DCGM Exporter during benchmarks
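To tie the pieces together, the Job from Step 1 can be submitted and its results collected with standard kubectl (the manifest filename is assumed; resource names match the Job above):

```shell
kubectl apply -f aiperf-job.yaml
kubectl wait --for=condition=complete job/aiperf-benchmark \
  -n ai-inference --timeout=30m
kubectl logs job/aiperf-benchmark -n ai-inference | tail -n 40
```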

