AIPerf LLM Benchmarking on K8s
Benchmark generative AI inference on Kubernetes with NVIDIA AIPerf. Measure TTFT, ITL, throughput, and latency across vLLM and NIM.
💡 Quick Answer: AIPerf (aiperf profile) is NVIDIA's comprehensive LLM benchmarking tool that measures TTFT, ITL, output throughput, and request latency against any OpenAI-compatible endpoint. Deploy it as a Kubernetes Job targeting your inference service, with configurable concurrency, request rates, arrival patterns, and dataset workloads.
The Problem
Before deploying LLM inference to production, you need answers to:
- What's the Time to First Token (TTFT) under load?
- How does Inter-Token Latency (ITL) degrade at concurrency 50 vs 200?
- What's the maximum throughput (tokens/sec) before SLA violations?
- How does your inference engine (vLLM, NIM, TGI) perform with realistic traffic patterns?
- Is your GPU utilization optimal or are you over-provisioned?
Generic HTTP benchmarking tools (wrk, hey) don't understand streaming tokens, can't measure TTFT/ITL, and don't generate realistic LLM workloads.
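To make that gap concrete, here is a minimal Python sketch of how TTFT and ITL fall out of per-token arrival timestamps on a streamed response, which is the information a plain HTTP load generator never records. The function name and the sample timestamps are illustrative assumptions and are not part of AIPerf.

```python
from statistics import mean


def streaming_latency_metrics(request_sent_s, token_times_s):
    """Return (ttft_ms, mean_itl_ms) for one streamed response.

    request_sent_s: time the request was sent, in seconds.
    token_times_s:  arrival time of each generated token, in seconds.
    """
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    # TTFT: delay between sending the request and receiving the first token.
    ttft_ms = (token_times_s[0] - request_sent_s) * 1000.0
    # ITL: average gap between consecutive tokens after the first one.
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    mean_itl_ms = mean(gaps) * 1000.0 if gaps else 0.0
    return ttft_ms, mean_itl_ms


# Hypothetical timestamps: first token after 250 ms, then one token every 62.5 ms.
ttft_ms, itl_ms = streaming_latency_metrics(0.0, [0.25, 0.3125, 0.375])
print(f"TTFT={ttft_ms:.1f} ms, mean ITL={itl_ms:.1f} ms")
```

A tool that only times the full HTTP response would report a single number here (375 ms) and hide both components.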
The Solution
Install AIPerf
# In a Python virtual environment
pip install aiperf
# Or use the container image
# nvcr.io/nvidia/aiperf:0.7.0
Quick Benchmark Against a K8s Inference Service
# Profile a vLLM deployment exposed via Service
aiperf profile \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--streaming \
--endpoint-type chat \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--url http://vllm-service.ai-inference:8000 \
--concurrency 10 \
--request-count 100
Kubernetes Benchmark Job
apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
  namespace: ai-inference
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: aiperf
          image: python:3.12-slim
          command:
            - bash
            - -c
            - |
              pip install aiperf -q
              # Warmup run
              aiperf profile \
                --model "meta-llama/Llama-3.1-8B-Instruct" \
                --streaming \
                --endpoint-type chat \
                --tokenizer meta-llama/Llama-3.1-8B-Instruct \
                --url http://vllm-service:8000 \
                --concurrency 1 \
                --request-count 5 \
                --ui none
              # Actual benchmark
              aiperf profile \
                --model "meta-llama/Llama-3.1-8B-Instruct" \
                --streaming \
                --endpoint-type chat \
                --tokenizer meta-llama/Llama-3.1-8B-Instruct \
                --url http://vllm-service:8000 \
                --concurrency 50 \
                --request-count 500 \
                --ui none
              echo "Results:"
              cat artifacts/*/profile_export_aiperf.json
          volumeMounts:
            - name: results
              mountPath: /artifacts
          resources:
            requests:
              cpu: "4"
              memory: 4Gi
      volumes:
        - name: results
          emptyDir: {}
Concurrency Sweep Job
apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-sweep
  namespace: ai-inference
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: aiperf
          image: python:3.12-slim
          command:
            - bash
            - -c
            - |
              pip install aiperf -q
              MODEL="meta-llama/Llama-3.1-8B-Instruct"
              URL="http://vllm-service:8000"
              for CONC in 1 5 10 25 50 100; do
                echo "=== Concurrency: $CONC ==="
                aiperf profile \
                  --model "$MODEL" \
                  --streaming \
                  --endpoint-type chat \
                  --tokenizer "$MODEL" \
                  --url "$URL" \
                  --concurrency "$CONC" \
                  --request-count 200 \
                  --ui none
              done
          resources:
            requests:
              cpu: "4"
              memory: 4Gi
Benchmark NIM Deployment
# Profile NVIDIA NIM
aiperf profile \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--streaming \
--endpoint-type chat \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--url http://nim-service.ai-inference:8000 \
--concurrency 20 \
--request-count 300
Request Rate Control
# Fixed request rate (requests per second)
aiperf profile \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--streaming \
--endpoint-type chat \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--url http://vllm-service:8000 \
--request-rate 10.0 \
--request-count 500
# Request rate with max concurrency cap
aiperf profile \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--streaming \
--endpoint-type chat \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--url http://vllm-service:8000 \
--request-rate 10.0 \
--concurrency 50 \
--request-count 500
Arrival Patterns
# Poisson arrivals (realistic)
aiperf profile \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--streaming \
--endpoint-type chat \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--url http://vllm-service:8000 \
--request-rate 10.0 \
--arrival-pattern poisson \
--request-count 500
# Gradual ramp-up
aiperf profile \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--streaming \
--endpoint-type chat \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--url http://vllm-service:8000 \
--concurrency 100 \
--ramp-up-duration 60 \
--request-count 1000
Custom Dataset / ShareGPT
# Use ShareGPT dataset for realistic prompts
aiperf profile \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--streaming \
--endpoint-type chat \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--url http://vllm-service:8000 \
--dataset sharegpt \
--concurrency 20 \
--request-count 500
# Synthetic dataset with controlled ISL/OSL
aiperf profile \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--streaming \
--endpoint-type chat \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--url http://vllm-service:8000 \
--input-tokens-mean 512 \
--output-tokens-mean 256 \
--concurrency 20 \
--request-count 500
Benchmark Embeddings / Rankings
# Embedding model
aiperf profile \
--model "nvidia/nv-embedqa-e5-v5" \
--endpoint-type embeddings \
--url http://embedding-service:8000 \
--concurrency 50 \
--request-count 1000
# Ranking model
aiperf profile \
--model "nvidia/nv-rerankqa-mistral-4b-v3" \
--endpoint-type rankings \
--url http://ranking-service:8000 \
--concurrency 20 \
--request-count 500
Multi-URL Load Balancing
# Distribute across multiple inference replicas
aiperf profile \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--streaming \
--endpoint-type chat \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--url http://vllm-0.vllm-headless:8000 \
--url http://vllm-1.vllm-headless:8000 \
--url http://vllm-2.vllm-headless:8000 \
--concurrency 60 \
--request-count 1000
Goodput (SLO-Based Throughput)
# Measure requests meeting SLA targets
aiperf profile \
--model "meta-llama/Llama-3.1-8B-Instruct" \
--streaming \
--endpoint-type chat \
--tokenizer meta-llama/Llama-3.1-8B-Instruct \
--url http://vllm-service:8000 \
--concurrency 50 \
--request-count 500 \
--goodput ttft:500 itl:100
# Only count requests with TTFT < 500ms AND ITL < 100ms
UI Modes
# Real-time TUI dashboard (interactive)
aiperf profile ... --ui dashboard
# Simple progress bars (CI-friendly)
aiperf profile ... --ui simple
# Headless (no output, just results files)
aiperf profile ... --ui none
graph TD
subgraph AIPerf Architecture
CLI[aiperf profile] --> SM[Session Manager]
SM --> RG[Request Generator]
SM --> DP[Data Plane]
SM --> MP[Metrics Plane]
RG -->|Concurrency / Rate| DP
DP -->|HTTP/SSE| INF[Inference Service]
DP -->|Token Events| MP
MP --> DASH[Dashboard UI]
MP --> CSV[CSV/JSON Export]
end
subgraph Kubernetes
INF --> VLLM[vLLM Pod]
INF --> NIM[NIM Pod]
INF --> TGI[TGI Pod]
end
subgraph Metrics
TTFT[Time to First Token]
ITL[Inter-Token Latency]
OTT[Output Throughput]
RL[Request Latency]
GP[Goodput SLO]
end
Key Metrics Explained
| Metric | Description | What Good Looks Like |
|---|---|---|
| TTFT (Time to First Token) | Latency until first generated token | < 200ms at target concurrency |
| TTST (Time to Second Token) | Latency from first to second token | Close to ITL (no startup spike) |
| ITL (Inter-Token Latency) | Average time between consecutive tokens | < 50ms for interactive use |
| Output Throughput (tokens/sec) | Total tokens generated per second | Model/GPU dependent |
| Per-User Throughput (tok/sec/user) | Throughput experienced per user | Decreases with concurrency |
| Request Latency | End-to-end time per request | Approximately TTFT + (output_tokens × ITL) |
| Goodput | Requests meeting SLA thresholds | > 95% of requests within SLA |
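The relationships in the table can be sketched in a few lines of Python. This is an illustration under stated assumptions: the RequestResult record and both helper functions are hypothetical and are not AIPerf's export schema or API, the latency estimate is the approximation from the table, and goodput is computed here as the fraction of requests that meet both thresholds.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    """Hypothetical per-request record (not AIPerf's export format)."""
    ttft_ms: float
    itl_ms: float
    output_tokens: int


def estimated_latency_ms(result):
    # Table approximation: TTFT + (output_tokens x ITL).
    return result.ttft_ms + result.output_tokens * result.itl_ms


def goodput_fraction(results, max_ttft_ms, max_itl_ms):
    # A request counts as "good" only if it meets BOTH latency thresholds.
    if not results:
        return 0.0
    good = sum(
        1 for r in results
        if r.ttft_ms < max_ttft_ms and r.itl_ms < max_itl_ms
    )
    return good / len(results)


# Made-up sample values, chosen only to exercise each branch.
results = [
    RequestResult(ttft_ms=180.0, itl_ms=40.0, output_tokens=256),
    RequestResult(ttft_ms=450.0, itl_ms=60.0, output_tokens=256),
    RequestResult(ttft_ms=620.0, itl_ms=45.0, output_tokens=128),  # TTFT too slow
    RequestResult(ttft_ms=210.0, itl_ms=120.0, output_tokens=64),  # ITL too slow
]

first_latency_ms = estimated_latency_ms(results[0])
fraction = goodput_fraction(results, max_ttft_ms=500.0, max_itl_ms=100.0)
print(f"estimated latency: {first_latency_ms:.0f} ms, goodput: {fraction:.0%}")
```

With these sample values, two of the four requests meet the ttft:500 and itl:100 targets, so the goodput fraction is 50%, well short of the 95% the table suggests as a goal.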
Supported Endpoint Types
| Type | Flag | APIs |
|---|---|---|
| Chat completions | --endpoint-type chat | OpenAI /v1/chat/completions |
| Text completions | --endpoint-type completions | OpenAI /v1/completions |
| Embeddings | --endpoint-type embeddings | OpenAI + NIM embeddings |
| Rankings | --endpoint-type rankings | NIM ranking/reranking |
| Audio | --endpoint-type audio | OpenAI audio |
| Vision | --endpoint-type vision | Vision LLMs (LLaVA, etc.) |
| Image generation | --endpoint-type image | OpenAI images |
Common Issues
TTFT spikes at concurrency > 1
First request triggers model loading or KV cache warmup. Use --warmup-requests or run a warmup phase first:
# 5 warmup requests before measurement
aiperf profile ... --warmup-requests 5
Token counts don't match expected output
AIPerf needs the correct tokenizer to count tokens. Always specify --tokenizer:
--tokenizer meta-llama/Llama-3.1-8B-Instruct
# Or local path
--tokenizer /models/tokenizer
Connection refused to inference service
From the AIPerf pod, verify the service is reachable:
kubectl exec -it aiperf-pod -- curl -s http://vllm-service:8000/v1/models
Output tokens truncated: OSL lower than expected
Inference servers may apply max_tokens defaults. Use --extra-inputs to control:
aiperf profile ... --extra-inputs max_tokens:512
Very high concurrency causes port exhaustion
Linux limits the number of ephemeral ports (about 28K with the default range of 32768-60999). For concurrency > 15K, widen the range:
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
Dashboard mode not rendering in pod
Use --ui none or --ui simple for non-interactive environments (Jobs, CI pipelines).
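For the port-exhaustion issue above, a rough budget check can be sketched in Python. It counts the ports in an ip_local_port_range value and compares that with a target concurrency. The helper names and the 50% headroom factor are assumptions made for illustration, not a documented AIPerf or kernel rule. On a Linux node the current value can be read from /proc/sys/net/ipv4/ip_local_port_range.

```python
def ephemeral_port_count(port_range):
    """Count ports in a 'low high' string as used by ip_local_port_range."""
    low, high = (int(part) for part in port_range.split())
    if low > high:
        raise ValueError("low port must not exceed high port")
    return high - low + 1  # the range is inclusive on both ends


def fits_port_budget(concurrency, port_range, headroom=0.5):
    # Assumption: leave half of the range free for sockets in TIME_WAIT
    # and for other processes on the node. Tune this for your environment.
    return concurrency <= ephemeral_port_count(port_range) * headroom


default_ports = ephemeral_port_count("32768 60999")
widened_ports = ephemeral_port_count("1024 65535")
print(default_ports, widened_ports)
print(fits_port_budget(15000, "32768 60999"))  # default range
print(fits_port_budget(15000, "1024 65535"))   # widened range
```

Under this headroom assumption, a concurrency of 15,000 does not fit the default range but does fit the widened one, which matches the threshold quoted above.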
Best Practices
- Warmup before measuring: run 5-10 warmup requests to fill KV caches and JIT-compile kernels
- Use realistic workloads: ShareGPT or custom datasets over synthetic random tokens
- Sweep concurrency: test 1, 5, 10, 25, 50, 100 to find the throughput-latency curve
- Set SLOs with goodput: --goodput ttft:200 itl:50 measures real production fitness
- Use Poisson arrivals: --arrival-pattern poisson models real traffic better than constant rate
- Pin tokenizer: always specify --tokenizer to get accurate token counts
- Compare engines fairly: same model, same dataset, same concurrency, same hardware
- Export results: JSON/CSV artifacts in the artifacts/ directory for post-analysis
- Run from within the cluster: deploy AIPerf as a Job to avoid network latency from external clients
- Multi-URL for distributed: pass multiple --url flags to benchmark across inference replicas
- Combine with GPU telemetry: use DCGM metrics to correlate throughput with GPU utilization
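To see why Poisson arrivals differ from a constant rate, the sketch below generates request send times whose gaps are exponentially distributed, which is the defining property of a Poisson process. It is a standalone illustration with a hypothetical function name. It does not show how AIPerf schedules requests internally.

```python
import random


def poisson_arrival_times(rate_per_s, count, seed=None):
    """Return `count` send times (seconds) for a Poisson process.

    Gaps between consecutive requests are drawn from an exponential
    distribution with mean 1 / rate_per_s, so requests arrive in bursts
    and lulls instead of at perfectly even intervals.
    """
    if rate_per_s <= 0:
        raise ValueError("rate_per_s must be positive")
    rng = random.Random(seed)
    now = 0.0
    times = []
    for _ in range(count):
        now += rng.expovariate(rate_per_s)
        times.append(now)
    return times


# 500 requests at an average of 10 requests/second (seeded for repeatability).
times = poisson_arrival_times(rate_per_s=10.0, count=500, seed=42)
gaps = [later - earlier for earlier, later in zip([0.0] + times, times)]
print(f"mean gap: {sum(gaps) / len(gaps):.3f} s")
print(f"shortest gap: {min(gaps):.4f} s, longest gap: {max(gaps):.3f} s")
```

The mean gap stays near 0.1 s, but individual gaps vary widely. Those short bursts are what stress batching and queueing in the inference server, and a constant-rate test never produces them.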
Key Takeaways
- AIPerf replaces generic HTTP benchmark tools with LLM-aware metrics (TTFT, ITL, per-user throughput)
- Supports all major inference APIs: OpenAI chat/completions, embeddings, rankings, vision, audio
- Scalable multiprocess architecture with 9 ZMQ-connected services
- Three benchmark modes: concurrency-based, request-rate, and trace replay
- Arrival patterns (constant, Poisson, gamma) model realistic traffic
- Goodput measures SLO compliance (% of requests meeting latency targets)
- Extensive dataset support: ShareGPT, AIMO, MMStar, synthetic, custom, and multi-turn
- Plugin system for custom endpoints, datasets, transports, and metrics
- Export to CSV/JSON + visualization/plotting for multi-run comparison
- Deploy as K8s Job for in-cluster benchmarking: --ui none for headless mode
- Always benchmark before production: find the concurrency cliff where latency degrades
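One way to locate the concurrency cliff from a sweep is to take the highest concurrency level that still meets a latency target before the first violation. The sketch below assumes you have already extracted one (concurrency, p99 TTFT, output tokens/sec) tuple per run. The function name, the tuple layout, and every number in the sample sweep are hypothetical and are not real measurements.

```python
def find_concurrency_cliff(sweep, ttft_slo_ms):
    """Return (concurrency, tokens_per_s) for the last level within the SLO.

    sweep: iterable of (concurrency, p99_ttft_ms, output_tokens_per_s).
    Returns None if even the lowest concurrency violates the SLO.
    """
    best = None
    for concurrency, p99_ttft_ms, tokens_per_s in sorted(sweep):
        if p99_ttft_ms > ttft_slo_ms:
            break  # first violation: everything beyond this is past the cliff
        best = (concurrency, tokens_per_s)
    return best


# Illustrative, made-up sweep results (NOT benchmark data).
sample_sweep = [
    (1, 45.0, 90.0),
    (5, 60.0, 410.0),
    (10, 85.0, 760.0),
    (25, 140.0, 1500.0),
    (50, 310.0, 2100.0),
    (100, 980.0, 2250.0),
]

cliff = find_concurrency_cliff(sample_sweep, ttft_slo_ms=200.0)
print(cliff)
```

In this made-up sweep, throughput keeps rising past concurrency 25, but p99 TTFT exceeds the 200 ms target from 50 onward, so 25 is the highest level that is safe to run.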
