Benchmark Triton on Kubernetes with GenAI-Perf
Benchmark NVIDIA Triton Inference Server performance on Kubernetes using GenAI-Perf. Measure TTFT, inter-token latency, throughput, and GPU telemetry.
💡 Quick Answer: Run `genai-perf profile -m <model> --backend tensorrtllm --streaming` against your Triton endpoint. It measures time to first token (TTFT), inter-token latency (ITL), request throughput, and output token throughput, the key metrics for LLM serving performance.
The Problem
Deploying an LLM on Triton is only half the battle. You need to answer:
- How fast is the first token? TTFT determines perceived responsiveness
- What's the token generation speed? ITL affects streaming UX
- How many concurrent users can I serve? Throughput determines capacity planning
- Where's the bottleneck? GPU utilization, memory, or network
- How do backends compare? TensorRT-LLM vs vLLM on your actual hardware
GenAI-Perf (NVIDIA's benchmarking tool) answers all of these with a single command, including GPU telemetry from DCGM.
The Solution
Step 1: Deploy GenAI-Perf as a Kubernetes Job
apiVersion: batch/v1
kind: Job
metadata:
  name: genai-perf-benchmark
  namespace: ai-inference
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: genai-perf
          image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
          command:
            - /bin/bash
            - -c
            - |
              pip install genai-perf
              # Basic LLM benchmark
              genai-perf profile \
                -m llama3-8b \
                --backend tensorrtllm \
                --streaming \
                --url triton-trtllm.ai-inference:8001 \
                --concurrency 10 \
                --request-count 100 \
                --synthetic-input-tokens-mean 550 \
                --output-tokens-mean 256 \
                --artifact-dir /results/benchmark-run-1
              # Copy results
              cp -r /results /shared/
          resources:
            limits:
              cpu: "4"
              memory: 8Gi
          volumeMounts:
            - name: results
              mountPath: /shared
      volumes:
        - name: results
          persistentVolumeClaim:
            claimName: benchmark-results
Step 2: Quick Benchmark Commands
Run these from inside a pod with network access to Triton:
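Before launching a long benchmark, it can help to confirm the endpoint is actually up: Triton exposes an HTTP readiness probe at /v2/health/ready on its HTTP port (8000). A minimal sketch; the service URL is an assumption taken from the examples in this article:

```python
import urllib.request
import urllib.error

def triton_ready(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if Triton's /v2/health/ready probe answers HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v2/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: not ready
        return False

# Hypothetical in-cluster service name from the examples in this article:
# triton_ready("http://triton-trtllm.ai-inference:8000")
```

Note this checks the HTTP port; GenAI-Perf itself talks to the gRPC port (8001) by default.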
# Install GenAI-Perf
pip install genai-perf
# Basic streaming benchmark (TensorRT-LLM backend)
genai-perf profile \
-m llama3-8b \
--backend tensorrtllm \
--streaming \
--url triton-trtllm:8001
# Benchmark with specific concurrency levels
genai-perf profile \
-m llama3-8b \
--backend tensorrtllm \
--streaming \
--url triton-trtllm:8001 \
--concurrency 32 \
--request-count 200
# Benchmark vLLM backend
genai-perf profile \
-m mistral-7b \
--backend vllm \
--streaming \
--url triton-vllm:8001 \
  --concurrency 16
Step 3: Sweep Concurrency Levels
Find the optimal concurrency for your deployment:
apiVersion: batch/v1
kind: Job
metadata:
  name: genai-perf-sweep
  namespace: ai-inference
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: sweep
          image: nvcr.io/nvidia/tritonserver:25.01-trtllm-python-py3
          command:
            - /bin/bash
            - -c
            - |
              pip install genai-perf
              for CONCURRENCY in 1 2 4 8 16 32 64 128; do
                echo "=== Concurrency: $CONCURRENCY ==="
                genai-perf profile \
                  -m llama3-8b \
                  --backend tensorrtllm \
                  --streaming \
                  --url triton-trtllm.ai-inference:8001 \
                  --concurrency $CONCURRENCY \
                  --request-count 100 \
                  --synthetic-input-tokens-mean 550 \
                  --output-tokens-mean 256 \
                  --artifact-dir /results/concurrency-$CONCURRENCY \
                  --generate-plots
              done
          resources:
            limits:
              cpu: "4"
              memory: 8Gi
          volumeMounts:
            - name: results
              mountPath: /results
      volumes:
        - name: results
          persistentVolumeClaim:
            claimName: benchmark-results
Step 4: GPU Telemetry Collection
GenAI-Perf collects GPU metrics from DCGM Exporter automatically:
# Ensure DCGM Exporter is running
kubectl get pods -n gpu-operator | grep dcgm
# Run benchmark with GPU telemetry
genai-perf profile \
-m llama3-8b \
--backend tensorrtllm \
--streaming \
--url triton-trtllm:8001 \
--concurrency 32 \
--server-metrics-urls http://dcgm-exporter.gpu-operator:9400/metrics \
--verbose
# GPU metrics collected automatically:
# - GPU utilization and SM utilization
# - GPU memory used/free/total
# - GPU power usage and energy consumption
# - GPU temperature
# - PCIe throughput
# - NVLink errors
Step 5: Compare TensorRT-LLM vs vLLM
# Benchmark TensorRT-LLM
genai-perf profile \
-m llama3-8b \
--backend tensorrtllm \
--streaming \
--url triton-trtllm:8001 \
--concurrency 32 \
--request-count 200 \
--synthetic-input-tokens-mean 550 \
--output-tokens-mean 256 \
--artifact-dir /results/trtllm-c32
# Benchmark vLLM (same model, same parameters)
genai-perf profile \
-m llama3-8b-vllm \
--backend vllm \
--streaming \
--url triton-vllm:8001 \
--concurrency 32 \
--request-count 200 \
--synthetic-input-tokens-mean 550 \
--output-tokens-mean 256 \
--artifact-dir /results/vllm-c32
# Compare results side by side
echo "=== TensorRT-LLM ===" && cat /results/trtllm-c32/*genai_perf.csv
echo "=== vLLM ===" && cat /results/vllm-c32/*genai_perf.csv
Step 6: Use a Config File for Reproducible Benchmarks
# genai_perf_config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: genai-perf-config
  namespace: ai-inference
data:
  config.yaml: |
    endpoint:
      model_selection_strategy: round_robin
      backend: tensorrtllm
      type: kserve
      streaming: True
      server_metrics_urls: http://dcgm-exporter.gpu-operator:9400/metrics
      url: triton-trtllm.ai-inference:8001
    input:
      num_dataset_entries: 200
      synthetic_input_tokens_mean: 550
      synthetic_input_tokens_stddev: 50
      output_tokens_mean: 256
      output_tokens_stddev: 32
      warmup_request_count: 10
    profiling:
      concurrency: 32
      request_count: 200
    output:
      artifact_dir: /results
      generate_plots: True

# Run with config file
genai-perf config -f /config/config.yaml
# Override specific settings
genai-perf config -f /config/config.yaml \
  --override-config --concurrency 64 --request-count 500
Understanding the Output
NVIDIA GenAI-Perf | LLM Metrics

| Statistic | avg | min | max | p99 | p90 |
|---|---|---|---|---|---|
| Time to first token (ms) | 16.26 | 12.39 | 17.25 | 17.09 | 16.68 |
| Inter token latency (ms) | 1.85 | 1.55 | 2.04 | 2.02 | 1.97 |
| Request latency (ms) | 499.20 | 451.01 | 554.61 | 548.69 | 526.13 |
| Output sequence length | 261.90 | 256.00 | 298.00 | 296.60 | 270.00 |
| Input sequence length | 550.06 | 550.00 | 553.00 | 551.60 | 550.00 |
| Output token throughput (per sec) | 520.87 | N/A | N/A | N/A | N/A |
| Request throughput (per sec) | 1.99 | N/A | N/A | N/A | N/A |

Key metrics to watch:
- TTFT: under 100 ms is excellent for chat, under 500 ms is acceptable
- ITL: under 30 ms feels real-time, under 50 ms is good streaming UX
- Output token throughput: tokens/sec across all concurrent requests
- Request throughput: completed requests per second
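The reported metrics should also be mutually consistent, which makes a quick arithmetic check a good way to validate any run: output token throughput is roughly request throughput times mean output length, and for a streaming run request latency decomposes into TTFT plus (OSL - 1) times ITL. Using the sample values from the table above:

```python
# Values from the sample GenAI-Perf table above
ttft_ms = 16.26          # time to first token (avg)
itl_ms = 1.85            # inter-token latency (avg)
req_latency_ms = 499.20  # end-to-end request latency (avg)
osl = 261.90             # mean output sequence length
req_throughput = 1.99    # requests/sec
tok_throughput = 520.87  # output tokens/sec

# 1) Token throughput ~= request throughput x output length
assert abs(req_throughput * osl - tok_throughput) < 5

# 2) Streaming request latency ~= TTFT + (OSL - 1) x ITL
assert abs(ttft_ms + (osl - 1) * itl_ms - req_latency_ms) < 10

# 3) Little's law: in-flight requests ~= throughput x latency (seconds)
print(round(req_throughput * req_latency_ms / 1000, 2))  # → 0.99
```

If these identities are far off on your own runs, the measurement is usually suspect, e.g. truncated streams or a non-streaming endpoint.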
flowchart TD
  A[GenAI-Perf Pod] -->|gRPC port 8001| B[Triton Inference Server]
  A -->|HTTP port 9400| C[DCGM Exporter]
  B --> D[GPU - Model Inference]
  C --> E[GPU Telemetry Metrics]
  A --> F[Results]
  F --> G[Console Table]
  F --> H[CSV Export]
  F --> I[JSON Export]
  F --> J[Plots - TTFT, Latency, Throughput]
Common Issues
Connection refused to Triton gRPC
# GenAI-Perf uses gRPC (port 8001) by default, not HTTP (8000)
# Verify Triton gRPC is accessible
grpc_health_probe -addr=triton-trtllm:8001
# If using HTTP instead:
genai-perf profile \
-m llama3-8b \
--url triton-trtllm:8000 \
  --endpoint-type chat
Results vary between runs
# Use warmup requests to stabilize
genai-perf profile \
-m llama3-8b \
--warmup-request-count 20 \
--request-count 200 \
--stability-percentage 5
# Run multiple times and compare artifacts
GPU telemetry not collected
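GenAI-Perf reads GPU telemetry by scraping DCGM Exporter's Prometheus text endpoint. To check what it will see, you can fetch that endpoint and parse it yourself; a minimal sketch (DCGM_FI_DEV_GPU_UTIL is DCGM's GPU-utilization field, the gpu label follows dcgm-exporter's default output, and the sample text below is made up):

```python
import re

def dcgm_gpu_util(metrics_text: str) -> dict:
    """Extract per-GPU utilization (%) from DCGM Exporter's Prometheus text output."""
    utils = {}
    for line in metrics_text.splitlines():
        # Skip # HELP / # TYPE comment lines; match only the utilization samples
        if line.startswith("DCGM_FI_DEV_GPU_UTIL"):
            m = re.match(r'DCGM_FI_DEV_GPU_UTIL\{[^}]*gpu="(\d+)"[^}]*\}\s+([\d.]+)', line)
            if m:
                utils[int(m.group(1))] = float(m.group(2))
    return utils

# Made-up sample of dcgm-exporter output for illustration
sample = '''# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-abc",device="nvidia0"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-def",device="nvidia1"} 12
'''
print(dcgm_gpu_util(sample))  # → {0: 87.0, 1: 12.0}
```

If the field is absent from the scrape, the exporter's metric set is the problem, not GenAI-Perf.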
# Verify DCGM Exporter is accessible from the benchmark pod
curl http://dcgm-exporter.gpu-operator:9400/metrics | head -5
# If using custom namespace or port:
--server-metrics-urls http://<dcgm-service>:<port>/metrics
# Use --verbose to see telemetry in console output
Benchmarking OpenAI-compatible endpoints
# For vLLM or other OpenAI-compatible servers
genai-perf profile \
-m llama3-8b \
--endpoint-type chat \
--streaming \
--url http://vllm-server:8000
# With API key authentication
genai-perf profile \
-m gpt-4 \
--endpoint-type chat \
--streaming \
--url https://api.openai.com \
-H "Authorization: Bearer ${OPENAI_API_KEY}" \
  -H "Accept: text/event-stream"
Best Practices
- Warm up before measuring: use `--warmup-request-count` 10-20 to fill KV cache and stabilize GPU clocks
- Test realistic input/output lengths: set `--synthetic-input-tokens-mean` and `--output-tokens-mean` to match your workload
- Sweep concurrency levels: find the sweet spot where throughput plateaus without latency degradation
- Collect GPU telemetry: connect to DCGM Exporter to understand GPU utilization and memory pressure
- Use config files: reproducible benchmarks are essential for comparing deployments
- Generate plots: `--generate-plots` creates visual TTFT and latency analysis automatically
- Run from inside the cluster: benchmark from a pod in the same namespace to eliminate network variability
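The concurrency-sweep practice above can be automated: walk the sweep results in order and stop where throughput gains flatten or tail latency blows the budget. A sketch with made-up numbers; substitute the values from your Step 3 artifact directories, and treat the thresholds as tunable assumptions:

```python
# (concurrency, output tokens/sec, p99 TTFT ms) per sweep level.
# These numbers are illustrative, not measured.
sweep = [
    (1, 520, 17), (2, 1010, 19), (4, 1900, 24), (8, 3400, 38),
    (16, 5600, 70), (32, 8100, 140), (64, 8900, 380), (128, 9100, 950),
]

def pick_operating_point(sweep, ttft_budget_ms=500, min_gain=0.10):
    """Highest concurrency whose step-over-step throughput gain is still
    meaningful and whose p99 TTFT stays inside the latency budget."""
    best = sweep[0]
    for prev, cur in zip(sweep, sweep[1:]):
        gain = (cur[1] - prev[1]) / prev[1]
        if cur[2] > ttft_budget_ms or gain < min_gain:
            break  # throughput has plateaued or latency budget exceeded
        best = cur
    return best[0]

print(pick_operating_point(sweep))  # → 32
```

With these illustrative numbers, doubling from 32 to 64 buys under 10% more throughput while p99 TTFT nearly triples, so 32 is the sensible operating point.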
Key Takeaways
- GenAI-Perf measures the metrics that matter: TTFT, ITL, token throughput, and request throughput
- Run it as a Kubernetes Job against your Triton Service for reproducible, in-cluster benchmarks
- Sweep concurrency (1 to 128) to find the optimal operating point for your model and GPU
- GPU telemetry from DCGM Exporter reveals whether you're compute-bound, memory-bound, or network-bound
- Use config files with `--override-config` for systematic A/B comparisons between backends
- GenAI-Perf is being phased out in favor of AIPerf, but it remains the standard tool for Triton benchmarking today

