GenAI-Perf Benchmarking LLM Inference on Kubernetes
Benchmark LLM inference performance with NVIDIA GenAI-Perf on Kubernetes. Profile vLLM, TensorRT-LLM, and Triton endpoints with concurrency sweeps, token
💡 Quick Answer: GenAI-Perf is NVIDIA's benchmarking tool for LLM inference endpoints. Run genai-perf profile -m <model> --service-kind openai --endpoint-type chat against vLLM/NIM/Triton services. It measures time-to-first-token (TTFT), inter-token latency (ITL), output token throughput, and request latency at configurable concurrency levels.
The Problem
- No standardized way to benchmark LLM inference throughput and latency
- Manual curl tests don't represent real concurrent workload patterns
- Need to compare vLLM vs TensorRT-LLM vs Triton performance objectively
- Latency percentiles (P50/P90/P99) are critical but hard to measure manually
- Token throughput varies by prompt length, output length, batch size, and concurrency
The Solution
Install GenAI-Perf
# GenAI-Perf comes with the Triton SDK container
# Or install standalone:
pip install genai-perf
# Verify
genai-perf --version
Profile vLLM with OpenAI-Compatible API
# Basic profiling against vLLM endpoint
genai-perf profile \
-m "llama-70b" \
--service-kind openai \
--endpoint-type chat \
--url http://vllm-service.llm.svc:8000 \
--concurrency 10 \
--num-prompts 100 \
--random-seed 42 \
--synthetic-input-tokens-mean 128 \
--synthetic-input-tokens-stddev 16 \
--output-tokens-mean 256
# Concurrency sweep (find saturation point)
for c in 1 2 4 8 16 32 64; do
  echo "=== Concurrency: $c ==="
  genai-perf profile \
    -m "llama-70b" \
    --service-kind openai \
    --endpoint-type chat \
    --url http://vllm-service.llm.svc:8000 \
    --concurrency "$c" \
    --num-prompts 50 \
    --synthetic-input-tokens-mean 128 \
    --output-tokens-mean 256
done
Profile as Kubernetes Job
apiVersion: batch/v1
kind: Job
metadata:
  name: genai-perf-benchmark
  namespace: llm
spec:
  template:
    spec:
      containers:
        - name: genai-perf
          image: nvcr.io/nvidia/tritonserver:24.05-py3-sdk
          command:
            - bash
            - -c
            - |
              # --concurrency takes a single value, so sweep in a loop
              for c in 1 2 4 8 16 32; do
                genai-perf profile \
                  -m "llama-70b" \
                  --service-kind openai \
                  --endpoint-type chat \
                  --url http://vllm-service:8000 \
                  --concurrency "$c" \
                  --num-prompts 200 \
                  --synthetic-input-tokens-mean 128 \
                  --synthetic-input-tokens-stddev 32 \
                  --output-tokens-mean 512 \
                  --output-tokens-stddev 64 \
                  --streaming \
                  --profile-export-file "benchmark-c${c}.json"
              done
              # Copy results
              cp -r artifacts/ /results/
          volumeMounts:
            - name: results
              mountPath: /results
      volumes:
        - name: results
          persistentVolumeClaim:
            claimName: benchmark-results
      restartPolicy: Never
  backoffLimit: 0
Key Metrics Explained
GenAI-Perf output metrics:
Time To First Token (TTFT):
  • Time from request sent to first token received
  • Measures prefill/prompt processing latency
  • Target: <500ms for interactive, <2s for batch
Inter-Token Latency (ITL):
  • Time between consecutive output tokens
  • Measures decode step latency
  • Target: <50ms for smooth streaming
Output Token Throughput:
  • Total output tokens / total time (across all requests)
  • Measures system-wide generation capacity
  • Higher = better utilization
Request Throughput:
  • Completed requests per second
  • Depends on output length: shorter outputs = more req/s
End-to-End Latency:
  • Total time from request to last token
  • ≈ TTFT + (output_tokens − 1) × ITL
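The relationships between these metrics can be checked with quick arithmetic. The following is a minimal sketch, not part of GenAI-Perf: the function names are hypothetical, and it assumes ITL applies to the gaps between tokens, so a response of N tokens has N − 1 of them. The inputs are the averages from the example output below (TTFT 98 ms, ITL 35 ms, 2847 tokens/sec) and an assumed 256 output tokens per request.

```python
def estimate_e2e_latency_s(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """Approximate end-to-end latency in seconds for one streamed request."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    # The first token arrives after TTFT; each later token adds one ITL gap.
    total_ms = ttft_ms + (output_tokens - 1) * itl_ms
    return total_ms / 1000.0


def requests_per_second(output_tokens_per_s: float, mean_output_tokens: float) -> float:
    """Request throughput implied by token throughput and mean output length."""
    if mean_output_tokens <= 0:
        raise ValueError("mean_output_tokens must be positive")
    return output_tokens_per_s / mean_output_tokens


if __name__ == "__main__":
    # 256 output tokens per request is an assumption, not a measured value.
    print(f"{estimate_e2e_latency_s(98, 35, 256):.2f} s")    # 9.02 s
    print(f"{requests_per_second(2847, 256):.1f} req/s")     # 11.1 req/s
```

With these inputs the estimate is about 9.0 s, close to the 9.1 s average request latency in the example. At a fixed token throughput, halving the output length roughly doubles the request throughput.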
Example output:
┌─────────────────────────────────────────────────────┐
│ LLM Metrics (concurrency=16)                        │
├─────────────────────┬───────┬───────┬───────┬───────┤
│ Metric              │ P50   │ P90   │ P99   │ Avg   │
├─────────────────────┼───────┼───────┼───────┼───────┤
│ TTFT (ms)           │ 85    │ 142   │ 310   │ 98    │
│ ITL (ms)            │ 32    │ 45    │ 67    │ 35    │
│ Request latency (s) │ 8.4   │ 11.2  │ 15.8  │ 9.1   │
├─────────────────────┼───────┼───────┼───────┼───────┤
│ Output throughput   │       │       │       │ 2847  │
│ (tokens/sec)        │       │       │       │       │
│ Request throughput  │       │       │       │ 11.1  │
│ (req/sec)           │       │       │       │       │
└─────────────────────┴───────┴───────┴───────┴───────┘
Compare vLLM vs TensorRT-LLM
# Profile vLLM
genai-perf profile \
-m "llama-70b" \
--service-kind openai \
--endpoint-type chat \
--url http://vllm-svc:8000 \
--concurrency 16 \
--num-prompts 200 \
--streaming \
--profile-export-file vllm-results.json
# Profile TensorRT-LLM (via Triton)
genai-perf profile \
-m "llama-70b" \
--service-kind triton \
--backend tensorrtllm \
--url triton-svc:8001 \
--concurrency 16 \
--num-prompts 200 \
--streaming \
--profile-export-file trtllm-results.json
# Compare results
genai-perf compare \
--files vllm-results.json trtllm-results.json
Advanced Options
# Custom prompts from file
genai-perf profile \
-m "llama-70b" \
--service-kind openai \
--endpoint-type chat \
--url http://vllm-svc:8000 \
--input-file prompts.jsonl \
--concurrency 8
# Completions endpoint (not chat)
genai-perf profile \
-m "llama-70b" \
--service-kind openai \
--endpoint-type completions \
--url http://vllm-svc:8000 \
--concurrency 16
# Warmup requests before measurement
genai-perf profile \
-m "llama-70b" \
--service-kind openai \
--endpoint-type chat \
--url http://vllm-svc:8000 \
--warmup-request-count 20 \
--num-prompts 100 \
--concurrency 16
# Export to CSV for graphing
genai-perf profile \
-m "llama-70b" \
--service-kind openai \
--endpoint-type chat \
--url http://vllm-svc:8000 \
--profile-export-file results.csv
Common Issues
"Connection refused" to vLLM service
- Cause: Service not ready, or wrong port (vLLM default: 8000)
- Fix: Verify the service with kubectl port-forward svc/vllm-svc 8000:8000 first
TTFT very high at low concurrency
- Cause: Model not warmed up; the first requests can trigger one-time initialization such as CUDA compilation
- Fix: Use --warmup-request-count 10 to exclude cold-start from measurements
Throughput doesn't scale with concurrency
- Cause: GPU saturated, or KV cache full, causing request queuing
- Fix: Check GPU utilization; increase --max-num-seqs in vLLM; add more GPU replicas
"Model not found" error
- Cause: Model name doesn't match vLLM's --served-model-name
- Fix: Check curl http://vllm-svc:8000/v1/models for the exact model name
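To check the served model name before a run, the /v1/models lookup can be scripted. This is a minimal sketch using only the Python standard library. It assumes the endpoint returns the OpenAI-style list shape ({"data": [{"id": ...}]}); the function names are hypothetical and the http://vllm-svc:8000 base URL is only an example.

```python
import json
from urllib.request import urlopen


def extract_model_ids(payload: dict) -> list[str]:
    """Return the model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def fetch_model_ids(base_url: str, timeout_s: float = 5.0) -> list[str]:
    """Query <base_url>/v1/models and return the IDs the server advertises."""
    url = f"{base_url.rstrip('/')}/v1/models"
    with urlopen(url, timeout=timeout_s) as resp:
        return extract_model_ids(json.load(resp))


# Example call (assumed in-cluster Service name and port):
#   fetch_model_ids("http://vllm-svc:8000")
```

Pass one of the returned IDs to -m exactly as printed; a mismatch in case or path prefix is enough to cause the "Model not found" error.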
Best Practices
- Sweep concurrency: find the saturation point (throughput plateaus, latency spikes)
- Use realistic input/output lengths: match your actual workload distribution
- Warmup before measuring: exclude cold-start from results
- Test with streaming: matches real chat/completion use cases
- Run multiple iterations: single runs have high variance; average 3-5 runs
- Profile after changes: quantify impact of model optimization, scaling, config changes
- Export results: track performance over time as code/infra changes
- Test at target SLA: find max concurrency that meets your P99 latency target
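Two of these practices, locating the saturation point and finding the highest concurrency that still meets a P99 target, can be automated once a sweep has been run. The sketch below is a hypothetical helper, not a GenAI-Perf feature. It assumes you have already collected one record per sweep step (concurrency, P99 latency, output tokens/sec), and the numbers in the usage example are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SweepPoint:
    concurrency: int
    p99_latency_ms: float
    output_tokens_per_s: float


def max_concurrency_within_sla(
    points: list[SweepPoint], p99_target_ms: float
) -> Optional[SweepPoint]:
    """Return the highest-concurrency point whose P99 latency meets the target."""
    passing = [p for p in points if p.p99_latency_ms <= p99_target_ms]
    return max(passing, key=lambda p: p.concurrency, default=None)


def throughput_gain(points: list[SweepPoint]) -> list[tuple[int, float]]:
    """Relative throughput gain at each step versus the previous concurrency.

    Gains near zero suggest saturation. Assumes positive throughput everywhere.
    """
    ordered = sorted(points, key=lambda p: p.concurrency)
    return [
        (cur.concurrency, cur.output_tokens_per_s / prev.output_tokens_per_s - 1.0)
        for prev, cur in zip(ordered, ordered[1:])
    ]


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    sweep = [
        SweepPoint(8, 9000.0, 1500.0),
        SweepPoint(16, 12000.0, 2800.0),
        SweepPoint(32, 21000.0, 3000.0),
    ]
    best = max_concurrency_within_sla(sweep, p99_target_ms=15000.0)
    print("Max concurrency within SLA:", best.concurrency if best else "none")
    for concurrency, gain in throughput_gain(sweep):
        print(f"concurrency {concurrency}: {gain:+.0%} throughput vs previous step")
```

In this made-up sweep, going from 16 to 32 concurrent requests adds only about 7% throughput while P99 latency nearly doubles, which is the plateau-plus-spike pattern that marks saturation.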
Key Takeaways
- GenAI-Perf profiles LLM endpoints with realistic concurrent workloads
- Works with vLLM (--service-kind openai), TensorRT-LLM, Triton, and NIM
- Key metrics: TTFT (prefill speed), ITL (decode speed), output token throughput
- Concurrency sweep reveals the saturation point, where latency degrades
- Use --streaming for chat workloads; --endpoint-type completions for batch
- Run as a Kubernetes Job for reproducible, in-cluster benchmarking
- Compare backends objectively with genai-perf compare

