GenAI-Perf Benchmark LLM Kubernetes
Benchmark LLM inference with GenAI-Perf on Kubernetes. Use --service-kind openai for vLLM, NIM, and TGI. Measure TTFT, ITL, and throughput.
💡 Quick Answer: GenAI-Perf is NVIDIA's tool for benchmarking LLM inference endpoints. Use --service-kind openai to test any OpenAI-compatible API (vLLM, NIM, TGI, Ollama). Run: genai-perf profile --model llama3 --service-kind openai --endpoint-type chat --url http://llm-service:8000 --concurrency 10. It measures request latency, time-to-first-token (TTFT), inter-token latency (ITL), output token throughput (tokens/s), and request throughput (requests/s).
The Problem
Deploying LLMs on Kubernetes requires performance validation:
- What's the max throughput at acceptable latency?
- How does concurrency affect time-to-first-token?
- Is the model GPU-bound or network-bound?
- How does batching perform under load?
- Does the endpoint handle sustained traffic without degradation?
GenAI-Perf provides standardized benchmarking for all OpenAI-compatible inference servers.
The Solution
Install GenAI-Perf
# Option 1: pip install
pip install genai-perf
# Option 2: NVIDIA Triton SDK container (includes genai-perf)
kubectl run genai-perf \
  --image=nvcr.io/nvidia/tritonserver:24.07-py3-sdk \
  --restart=Never \
  -- sleep infinity
kubectl exec -it genai-perf -- bash
Basic Benchmark with --service-kind openai
# Benchmark a vLLM endpoint
genai-perf profile \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --service-kind openai \
  --endpoint-type chat \
  --url http://vllm-service:8000 \
  --num-prompts 100 \
  --concurrency 10 \
  --streaming
# Output:
# LLM Metrics
# ┌──────────────────────┬──────────┬──────────┬──────────┐
# │ Metric               │ avg      │ p50      │ p99      │
# ├──────────────────────┼──────────┼──────────┼──────────┤
# │ Request latency (ms) │ 1,245    │ 1,180    │ 2,890    │
# │ TTFT (ms)            │ 45       │ 38       │ 142      │
# │ ITL (ms)             │ 12       │ 11       │ 28       │
# │ Output tokens/req    │ 156      │ 148      │ 312      │
# │ Throughput (tok/s)   │ 1,250    │          │          │
# │ Request throughput   │ 8.0/s    │          │          │
# └──────────────────────┴──────────┴──────────┴──────────┘
Endpoint Types
# Chat completions (OpenAI chat format)
genai-perf profile \
  --service-kind openai \
  --endpoint-type chat \
  --model llama3 \
  --url http://llm-service:8000
# Text completions (legacy /v1/completions)
genai-perf profile \
  --service-kind openai \
  --endpoint-type completions \
  --model llama3 \
  --url http://llm-service:8000
# Embeddings
genai-perf profile \
  --service-kind openai \
  --endpoint-type embeddings \
  --model text-embedding-ada-002 \
  --url http://embedding-service:8000
Concurrency Sweep
# Test increasing concurrency to find the saturation point
for c in 1 2 4 8 16 32 64; do
  echo "=== Concurrency: $c ==="
  # Case-insensitive match, because metric labels can differ between GenAI-Perf versions
  genai-perf profile \
    --model llama3 \
    --service-kind openai \
    --endpoint-type chat \
    --url http://vllm-service:8000 \
    --concurrency "$c" \
    --num-prompts 50 \
    --streaming \
    2>&1 | grep -iE "throughput|TTFT|time to first token|ITL|inter token latency|request latency"
done
Input/Output Token Control
# Control prompt and output length
genai-perf profile \
  --model llama3 \
  --service-kind openai \
  --endpoint-type chat \
  --url http://vllm-service:8000 \
  --concurrency 10 \
  --num-prompts 100 \
  --streaming \
  --synthetic-input-tokens-mean 512 \
  --synthetic-input-tokens-stddev 50 \
  --output-tokens-mean 256 \
  --output-tokens-stddev 25 \
  --extra-inputs max_tokens:256
Custom Prompts Dataset
# Use your own prompts
cat > prompts.jsonl << 'EOF'
{"text_input": "Explain Kubernetes pod scheduling in detail"}
{"text_input": "Write a Python function to parse YAML"}
{"text_input": "What are the best practices for container security?"}
EOF
genai-perf profile \
  --model llama3 \
  --service-kind openai \
  --endpoint-type chat \
  --url http://vllm-service:8000 \
  --input-file prompts.jsonl \
  --concurrency 10 \
  --streaming
Kubernetes Job for Benchmarking
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-benchmark
  namespace: ai-workloads
spec:
  template:
    spec:
      containers:
        - name: genai-perf
          image: nvcr.io/nvidia/tritonserver:24.07-py3-sdk
          command:
            - genai-perf
            - profile
            - --model=meta-llama/Llama-3.1-8B-Instruct
            - --service-kind=openai
            - --endpoint-type=chat
            - --url=http://vllm-service:8000
            - --concurrency=10
            - --num-prompts=200
            - --streaming
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
      restartPolicy: Never
  backoffLimit: 0
Key Metrics Explained
| Metric | What It Measures | Good Value (8B model, A100) |
|---|---|---|
| TTFT | Time to first token | < 100ms |
| ITL | Inter-token latency | < 20ms |
| Throughput | Output tokens/second | > 1000 tok/s |
| Request latency | End-to-end per request | Depends on output length |
| Request throughput | Requests/second | > 5/s at concurrency 10 |
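The metrics in this table are related, which gives a cheap way to sanity-check a run. As a rough sketch (not something GenAI-Perf computes for you), output token throughput should be close to request throughput times average output tokens per request, and the average number of in-flight requests should be close to request throughput times average request latency. The function name check_metrics is hypothetical; the inputs are the averages from the sample output shown earlier.

```shell
# Hypothetical helper: cross-check benchmark averages against each other.
# Args: request_throughput (req/s), avg_request_latency (ms), avg_output_tokens_per_request
check_metrics() {
  awk -v rps="$1" -v lat_ms="$2" -v toks="$3" 'BEGIN {
    # Expected output token throughput = requests/s * tokens/request
    printf "expected_tok_per_s=%.0f\n", rps * toks
    # Little-law estimate: in-flight requests = requests/s * latency in seconds
    printf "implied_concurrency=%.1f\n", rps * lat_ms / 1000
  }'
}

# Averages taken from the sample output: 8.0 req/s, 1245 ms, 156 tokens/request
check_metrics 8.0 1245 156
```

With the sample figures this prints expected_tok_per_s=1248 and implied_concurrency=10.0, in line with the reported 1,250 tok/s and the configured concurrency of 10. A large gap between these estimates and the reported values usually means failed requests, a warm-up effect, or a client that cannot keep the target concurrency.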
Compare Inference Servers
# Same benchmark against different backends
MODELS="llama3"
BACKENDS=(
  "http://vllm-service:8000"
  "http://nim-service:8000"
  "http://tgi-service:8080"
)
for backend in "${BACKENDS[@]}"; do
  echo "=== $backend ==="
  genai-perf profile \
    --model "$MODELS" \
    --service-kind openai \
    --endpoint-type chat \
    --url "$backend" \
    --concurrency 10 \
    --num-prompts 100 \
    --streaming
done
Common Issues
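"Connection refused" to inference endpoint
The Service is not reachable from the benchmark pod. Check that the Service exists and has endpoints (kubectl get svc vllm-service, kubectl get endpoints vllm-service), that the port in --url matches the Service port, that port forwarding is active if you benchmark from outside the cluster, and that no NetworkPolicy blocks traffic from the benchmark pod.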
TTFT is high but ITL is normal
Prompt processing (prefill) is the bottleneck. Check whether the model is compute-bound during prefill, which is typical for long prompts. Prefix caching (for prompts that share a common prefix) or more GPU capacity may help.
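To judge whether TTFT really dominates, it can help to split request latency into a prefill part and a decode part. The sketch below assumes the common approximation for a streaming request, latency ≈ TTFT + ITL × (output tokens − 1). The function name latency_split and the input numbers are hypothetical and purely illustrative, not measurements.

```shell
# Hypothetical helper: split streaming latency into prefill and decode parts.
# Args: ttft_ms, itl_ms, output_tokens
latency_split() {
  awk -v ttft="$1" -v itl="$2" -v n="$3" 'BEGIN {
    decode = itl * (n - 1)   # time spent generating tokens 2..n
    total  = ttft + decode   # approximate end-to-end request latency
    printf "decode_ms=%.0f total_ms=%.0f prefill_share=%.1f%%\n", decode, total, 100 * ttft / total
  }'
}

# Illustrative numbers only: 900 ms TTFT, 12 ms ITL, 128 output tokens
latency_split 900 12 128
```

For these inputs the helper prints decode_ms=1524 total_ms=2424 prefill_share=37.1%. A prefill share this large with a normal ITL points at prompt length or prefill scheduling rather than at decoding speed.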
Throughput plateaus at low concurrency
Continuous batching may not be enabled. For vLLM, it's enabled by default. For NIM, check the model profile settings.
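One way to see where the plateau starts is to record output token throughput for each concurrency level in a sweep and look for the first level where the gain drops below a threshold. The sketch below is a minimal version of that idea; the function name find_saturation, the 10 percent default threshold, and the data points are all assumptions made for illustration.

```shell
# Hypothetical helper: find where throughput stops scaling in a sweep.
# Reads "concurrency tokens_per_sec" pairs on stdin, sorted by concurrency.
# Prints the last concurrency level before the gain fell below $1 percent (default 10).
find_saturation() {
  awk -v min_gain="${1:-10}" '
    NR > 1 && ($2 - prev_tps) / prev_tps * 100 < min_gain { print prev_c; found = 1; exit }
    { prev_c = $1; prev_tps = $2 }
    END { if (!found) print prev_c }   # never flattened: report the highest level tested
  '
}

# Made-up sweep results: throughput flattens after concurrency 16
printf '%s\n' "1 120" "2 235" "4 450" "8 820" "16 1180" "32 1230" "64 1240" | find_saturation 10
```

With this sample data the helper prints 16: going from 16 to 32 concurrent requests adds only about 4 percent throughput. If the reported level is very low (for example 1 or 2), batching on the server is the first thing to check.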
"--service-kind openai" not recognized
Old genai-perf version. Update: pip install --upgrade genai-perf.
Best Practices
- Always benchmark with --streaming: it matches real-world LLM usage, and TTFT and ITL can only be measured on streamed responses
- Run a concurrency sweep: find the saturation point before production deployment
- Control input/output tokens: standardize them for reproducible benchmarks
- Benchmark from within the cluster: avoid network latency skewing results
- Compare TTFT across configs: it is the most important metric for user experience
- Run multiple iterations: use --num-prompts 200+ for statistical significance
Key Takeaways
- --service-kind openai works with any OpenAI-compatible API (vLLM, NIM, TGI, Ollama)
- TTFT and ITL are the key metrics for LLM serving quality
- Concurrency sweeps reveal the throughput saturation point
- Run benchmarks from inside the cluster to avoid external network noise
- GenAI-Perf is the standard NVIDIA tool for LLM inference benchmarking
