NVIDIA GenAI-Perf Inference Benchmarking
Benchmark LLM inference throughput and latency on Kubernetes using NVIDIA GenAI-Perf. Covers vLLM, Run:ai, concurrency testing, and multi-location client runs.
π‘ Quick Answer: NVIDIA GenAI-Perf (formerly AI-Perf) benchmarks LLM inference endpoints by measuring tokens/second, latency percentiles, and time-to-first-token. Run it from multiple locations (cluster, admin node, laptop) to measure real network impact on inference performance.
The Problem
Deploying an LLM on Kubernetes is only half the battle. You need to know:
- Throughput: How many tokens/second can it generate?
- Latency: Whatβs the P50/P95/P99 time-to-first-token (TTFT)?
- Concurrency: How does performance degrade under load?
- Network impact: Does HA Proxy / Ingress routing add latency?
- Scaling: When do you need more replicas or GPUs?
The Solution
Deploy vLLM with Run:ai
# Deploy model for inference benchmarking
runai training pytorch submit inference-bench \
--image registry.example.com/inference/vllm:0.8.0 \
--gpu-devices-request 1 \
--cpu-memory-request 3846 \
--large-shm \
--run-as-uid 2000 \
--run-as-gid 2000 \
--environment-variable MODEL_NAME="mistralai/Mistral-Small-4-11B" \
--environment-variable TENSOR_PARALLEL_SIZE="1" \
--existing-pvc claimname=model-cache,path=/models \
--command -- vllm serve $MODEL_NAME --port 8000Run GenAI-Perf from Inside the Cluster
# Install GenAI-Perf
pip install genai-perf
# Basic benchmark: single concurrency
genai-perf \
--endpoint-type chat \
--backend vllm \
--url http://inference-bench:8000/v1 \
--model mistralai/Mistral-Small-4-11B \
--concurrency 1 \
--input-tokens-mean 200 \
--output-tokens-mean 200 \
--num-requests 100 \
--output-format csv
# High concurrency test
genai-perf \
--endpoint-type chat \
--backend vllm \
--url http://inference-bench:8000/v1 \
--model mistralai/Mistral-Small-4-11B \
--concurrency 2 \
--input-tokens-mean 200 \
--output-tokens-mean 100 \
--num-requests 200 \
--output-format csvBenchmark from Multiple Locations
Testing from 3 locations reveals network bottlenecks:
# Location 1: From within the cluster (Pod-to-Pod, lowest latency)
kubectl run genai-perf-client --image=nvcr.io/nvidia/tritonserver:24.12-py3 \
--rm -it -- genai-perf \
--url http://inference-svc.default:8000/v1 \
--concurrency 1 \
--input-tokens-mean 200
# Location 2: From admin node (behind HA Proxy)
genai-perf \
--url https://inference.apps.cluster.example.com/v1 \
--concurrency 1 \
--input-tokens-mean 200
# Location 3: From developer laptop (full network path)
genai-perf \
--url https://inference.apps.cluster.example.com/v1 \
--concurrency 1 \
--input-tokens-mean 200Key Metrics to Capture
Metric Target (Single H200)
βββββββββββββββββββββββββββββββββββββββββββββ
Tokens/sec (output) > 50 tok/s per request
Time to First Token P50 < 200ms (cluster), < 500ms (external)
Time to First Token P99 < 1000ms
Inter-token latency P50 < 30ms
Requests/sec @ conc=8 > 5 req/s
GPU utilization > 70% during benchmarkBenchmark Script (Automated Multi-Concurrency)
#!/bin/bash
set -euo pipefail
URL="${1:-http://inference-svc:8000/v1}"
MODEL="${2:-mistralai/Mistral-Small-4-11B}"
OUTPUT_DIR="${3:-/data/output/benchmarks}"
mkdir -p "$OUTPUT_DIR"
echo "=== GenAI-Perf Benchmark Suite ==="
echo "URL: $URL"
echo "Model: $MODEL"
echo "Date: $(date -Iseconds)"
for CONCURRENCY in 1 2 4 8 16; do
for INPUT_TOKENS in 100 200 500; do
echo "--- concurrency=$CONCURRENCY, input_tokens=$INPUT_TOKENS ---"
genai-perf \
--endpoint-type chat \
--backend vllm \
--url "$URL" \
--model "$MODEL" \
--concurrency $CONCURRENCY \
--input-tokens-mean $INPUT_TOKENS \
--output-tokens-mean 200 \
--num-requests 50 \
--output-format csv \
--output-file "$OUTPUT_DIR/bench_c${CONCURRENCY}_i${INPUT_TOKENS}.csv" \
2>&1 | tee -a "$OUTPUT_DIR/benchmark.log"
done
done
echo "=== Benchmark complete ==="Compare Results Across Locations
import pandas as pd
import glob
# Load all benchmark CSVs
results = []
for f in glob.glob("benchmarks/*.csv"):
df = pd.read_csv(f)
df["source"] = f.split("/")[-1]
results.append(df)
combined = pd.concat(results)
# Summary table
summary = combined.groupby(["concurrency", "location"]).agg({
"output_tokens_per_sec": "mean",
"time_to_first_token_ms": ["p50", "p95", "p99"],
"inter_token_latency_ms": "mean",
}).round(2)
print(summary)Run:ai Multi-Node Distributed Inference Benchmark
# Deploy distributed inference (tensor parallel across 2 GPUs)
runai training pytorch submit inference-distributed \
--image registry.example.com/inference/vllm:0.8.0 \
--workers 2 \
--gpu-devices-request 1 \
--large-shm \
--environment-variable NCCL_SOCKET_IFNAME="net1" \
--annotation "k8s.v1.cni.cncf.io/networks=sriov-rdma" \
--extended-resource "openshift.io/mellanoxnics=1" \
--existing-pvc claimname=model-cache,path=/models \
--command -- vllm serve meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 --port 8000
# Then benchmark the distributed endpoint
genai-perf \
--url http://inference-distributed:8000/v1 \
--model meta-llama/Llama-3.1-70B-Instruct \
--concurrency 2 \
--input-tokens-mean 200 \
--num-requests 100Common Issues
GenAI-Perf timeout on first request
- Cause: Model still loading into GPU memory
- Fix: Wait for
/healthendpoint to return 200, then run benchmark
Different results from different locations
- Expected: Cluster (best) > Admin node > Laptop (worst)
- Investigation: HA Proxy adds ~10-50ms; TLS termination adds ~5-10ms
Low throughput despite high GPU utilization
- Cause: Batch size too small or KV cache undersized
- Fix: Increase concurrent requests to fill GPU batching pipeline
Best Practices
- Warm up before benchmarking β send 10 requests before measuring
- Test from multiple locations β reveals network vs compute bottlenecks
- Vary concurrency β find the saturation point for capacity planning
- Use fixed input/output token counts β reproducible comparisons
- Capture GPU metrics simultaneously β correlate throughput with utilization
- Base image: Use NVIDIA NGC PyTorch (
nvidia/pytorch:26.02-py3, PyTorch 2.11)
Key Takeaways
- GenAI-Perf measures tokens/sec, TTFT, inter-token latency at various concurrency levels
- Run from 3 locations (cluster, admin, laptop) to isolate network overhead
- Use
--input-tokens-meanand--concurrencyto simulate realistic workloads - Compare single-GPU vs tensor-parallel to validate scaling efficiency
- Results drive capacity planning: when to add replicas vs GPUs

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
