AI • Intermediate • ⏱ 15 minutes • K8s 1.28+

Benchmark LLM Serving with GenAI-Perf

Benchmark LLM inference endpoints with NVIDIA GenAI-Perf for throughput, latency percentiles, time-to-first-token, and ITL metrics.

By Luca Berton β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Run genai-perf -m <model> --endpoint-type chat --streaming against your Triton/vLLM/TGI endpoint. GenAI-Perf measures LLM-specific metrics: time-to-first-token (TTFT), inter-token latency (ITL), output token throughput, and request throughput β€” critical for SLO validation.

The Problem

Standard HTTP benchmarking tools (wrk, hey, k6) measure request latency but miss LLM-specific metrics. Users care about how fast the first token appears (TTFT), how smooth streaming feels (ITL), and total throughput (tokens/sec). Without GenAI-Perf, you’re flying blind on inference SLOs.
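The gap is easy to see with a toy model of streaming latency: total latency is roughly TTFT + (output tokens βˆ’ 1) Γ— ITL, so two endpoints can post identical request latency while delivering very different experiences. A quick sketch with made-up numbers:

```shell
# Two hypothetical endpoints, both taking ~3.1 s to stream 256 tokens.
# A request-level tool reports them as identical; TTFT/ITL do not.
#   A: first token at 45 ms, then a steady stream
#   B: nothing for 2 s, then a fast burst
A_ITL=$(awk 'BEGIN { printf "%.1f", (3100 - 45)   / 255 }')
B_ITL=$(awk 'BEGIN { printf "%.1f", (3100 - 2000) / 255 }')
echo "A: TTFT 45 ms,   ITL ${A_ITL} ms/token"   # feels responsive
echo "B: TTFT 2000 ms, ITL ${B_ITL} ms/token"   # feels stalled
```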

The Solution

Install GenAI-Perf

# GenAI-Perf ships inside the Triton SDK container
# or install via pip
pip install genai-perf

# Or use the NGC container
docker pull nvcr.io/nvidia/tritonserver:24.12-py3-sdk

Basic LLM Benchmark

# Benchmark a vLLM OpenAI-compatible endpoint
genai-perf \
  -m llama-3.1-70b \
  --endpoint-type chat \
  --service-kind openai \
  --url http://vllm-service:8000 \
  --streaming \
  --num-prompts 100 \
  --concurrency 10 \
  --synthetic-input-tokens-mean 128 \
  --synthetic-input-tokens-stddev 32 \
  --output-tokens-mean 256

# Output:
#                        LLM Metrics
# β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
# β”‚ Metric               β”‚ p50      β”‚ p95      β”‚ p99      β”‚
# β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
# β”‚ Time to First Token  β”‚ 45ms     β”‚ 120ms    β”‚ 250ms    β”‚
# β”‚ Inter Token Latency  β”‚ 12ms     β”‚ 18ms     β”‚ 25ms     β”‚
# β”‚ Request Latency      β”‚ 3.2s     β”‚ 5.1s     β”‚ 7.8s     β”‚
# β”‚ Output Token Tput    β”‚ 82 tok/s β”‚          β”‚          β”‚
# β”‚ Request Throughput   β”‚ 8.5 req/sβ”‚          β”‚          β”‚
# β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
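The sample numbers are internally consistent, which is a useful sanity check on any run: with streaming, p50 request latency should land near TTFT + (output tokens βˆ’ 1) Γ— ITL:

```shell
# p50 sanity check using the table above: 45 ms TTFT plus 255 further
# tokens at 12 ms each should be close to the reported 3.2 s latency
EST_MS=$(awk 'BEGIN { printf "%.0f", 45 + 255 * 12 }')
echo "estimated p50 request latency: ${EST_MS} ms (reported: ~3.2 s)"
```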

Benchmark as Kubernetes Job

apiVersion: batch/v1
kind: Job
metadata:
  name: genai-perf-benchmark
  namespace: tenant-alpha
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: genai-perf
          image: nvcr.io/nvidia/tritonserver:24.12-py3-sdk
          command:
            - /bin/bash
            - -c
            - |
              # genai-perf takes a single --concurrency value; sweep in a loop
              for c in 1 5 10 20 50; do
                genai-perf \
                  -m llama-3.1-70b \
                  --endpoint-type chat \
                  --service-kind openai \
                  --url http://vllm-service:8000 \
                  --streaming \
                  --num-prompts 200 \
                  --concurrency "$c" \
                  --synthetic-input-tokens-mean 128 \
                  --output-tokens-mean 256 \
                  --artifact-dir "/results/concurrency-$c"
              done
          volumeMounts:
            - name: results
              mountPath: /results
      volumes:
        - name: results
          persistentVolumeClaim:
            claimName: benchmark-results
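The Job mounts a PVC named benchmark-results, which must exist before the Job runs. A minimal claim to pair with it (the 1Gi size is an assumption, and the cluster's default StorageClass is used; adjust both for your retention needs):

```yaml
# Minimal PVC backing the Job's /results mount (size is an assumption)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: benchmark-results
  namespace: tenant-alpha
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```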

Concurrency Sweep

# Sweep concurrency to find the saturation point
# (genai-perf takes a single --concurrency value, so loop over levels)
for c in 1 2 4 8 16 32 64; do
  genai-perf \
    -m llama-3.1-70b \
    --endpoint-type chat \
    --service-kind openai \
    --url http://vllm-service:8000 \
    --streaming \
    --num-prompts 100 \
    --concurrency "$c" \
    --synthetic-input-tokens-mean 128 \
    --output-tokens-mean 256 \
    --artifact-dir "./sweep-results/concurrency-$c"
done

# Generates per-concurrency reports:
# concurrency=1:  TTFT p50=25ms,  throughput=15 tok/s
# concurrency=4:  TTFT p50=35ms,  throughput=55 tok/s
# concurrency=8:  TTFT p50=50ms,  throughput=95 tok/s
# concurrency=16: TTFT p50=120ms, throughput=130 tok/s  ← sweet spot
# concurrency=32: TTFT p50=350ms, throughput=140 tok/s  ← diminishing returns
# concurrency=64: TTFT p50=900ms, throughput=135 tok/s  ← saturated
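One way to pick the sweet spot programmatically is to look at the marginal trade: how many extra tokens per second each step buys, and at what TTFT cost. A sketch over the illustrative sweep numbers above (not real measurements):

```shell
# Columns: concurrency, p50 TTFT (ms), throughput (tok/s); rows are the
# illustrative sweep results from above
SWEEP="1 25 15
4 35 55
8 50 95
16 120 130
32 350 140
64 900 135"
MARGINS=$(echo "$SWEEP" | awk '
  NR > 1 { printf "c=%s: %+d tok/s for +%d ms TTFT\n", $1, $3 - pt, $2 - pttft }
         { pttft = $2; pt = $3 }')
echo "$MARGINS"   # the knee is where tok/s gains collapse but TTFT cost soars
```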

Triton Native Protocol

# Benchmark Triton's native protocol instead of an OpenAI-compatible one;
# --backend names the serving backend, and 8001 is Triton's default gRPC port
genai-perf \
  -m ensemble_model \
  --service-kind triton \
  --backend tensorrtllm \
  --url triton-service:8001 \
  --streaming \
  --num-prompts 100 \
  --concurrency 10

Compare Backends

# Compare vLLM vs TensorRT-LLM serving the same model
# vLLM
genai-perf -m llama-70b --service-kind openai \
  --endpoint-type chat --url http://vllm:8000 --streaming \
  --concurrency 16 --artifact-dir ./results-vllm

# TensorRT-LLM (behind an OpenAI-compatible frontend)
genai-perf -m llama-70b --service-kind openai \
  --endpoint-type chat --url http://trtllm:8000 --streaming \
  --concurrency 16 --artifact-dir ./results-trtllm

# Compare the exported profiles (check your artifact dirs for the
# exact export filenames in your genai-perf version)
genai-perf compare --files \
  ./results-vllm/profile_export_genai_perf.json \
  ./results-trtllm/profile_export_genai_perf.json
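For a quick headline number before digging into the full comparison artifacts, the percentage delta between two runs is a one-liner (the two p95 values below are placeholders standing in for your measured results):

```shell
# Placeholder p95 TTFT values standing in for the two runs' exports
VLLM_TTFT_P95=120
TRTLLM_TTFT_P95=95
DELTA=$(awk -v a="$VLLM_TTFT_P95" -v b="$TRTLLM_TTFT_P95" \
  'BEGIN { printf "%.1f", (a - b) / a * 100 }')
echo "TensorRT-LLM TTFT p95 is ${DELTA}% lower than vLLM's"
```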

SLO Validation Script

#!/bin/bash
# validate-slo.sh -- fail CI if SLOs are not met
genai-perf -m "$MODEL" --service-kind openai --endpoint-type chat \
  --url "$ENDPOINT" --streaming \
  --concurrency 16 --num-prompts 200 \
  --artifact-dir ./slo-run

# genai-perf writes a JSON export into the artifact dir; verify the
# filename, field names, and units (ms, tok/s) against your version
EXPORT=$(ls ./slo-run/*genai_perf*.json | head -1)

TTFT_P95=$(jq '.time_to_first_token.p95' "$EXPORT")
ITL_P95=$(jq '.inter_token_latency.p95' "$EXPORT")
THROUGHPUT=$(jq '.output_token_throughput.avg' "$EXPORT")

echo "TTFT p95: ${TTFT_P95}ms (SLO: <200ms)"
echo "ITL p95: ${ITL_P95}ms (SLO: <30ms)"
echo "Throughput: ${THROUGHPUT} tok/s (SLO: >50)"

PASS=true
[ "$(echo "$TTFT_P95 > 200" | bc)" -eq 1 ] && { echo "❌ TTFT SLO FAIL"; PASS=false; }
[ "$(echo "$ITL_P95 > 30" | bc)" -eq 1 ] && { echo "❌ ITL SLO FAIL"; PASS=false; }
[ "$(echo "$THROUGHPUT < 50" | bc)" -eq 1 ] && { echo "❌ Throughput SLO FAIL"; PASS=false; }

$PASS && echo "βœ… All SLOs PASS" || exit 1
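Wiring validate-slo.sh into CI is then a short workflow step. A sketch for GitHub Actions (the trigger, model name, and endpoint URL are placeholders for your setup):

```yaml
# .github/workflows/inference-slo.yml (trigger, model, and endpoint
# are placeholders for your environment)
name: inference-slo
on:
  workflow_dispatch:
jobs:
  validate-slo:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install GenAI-Perf
        run: pip install genai-perf
      - name: Validate inference SLOs
        env:
          MODEL: llama-3.1-70b
          ENDPOINT: http://vllm-service:8000
        run: bash validate-slo.sh
```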

How It Works

graph TD
    A[GenAI-Perf] --> B[Generate Prompts]
    B --> C[Send to Endpoint]
    C --> D[vLLM or Triton or TGI]
    
    D --> E[Streaming Response]
    E --> F[Measure TTFT]
    E --> G[Measure ITL per token]
    E --> H[Measure Total Latency]
    
    F --> I[Results Report]
    G --> I
    H --> I
    
    I --> J[p50 p95 p99 percentiles]
    I --> K[Token throughput]
    I --> L[Concurrency sweep plots]

Common Issues

  • Connection refused β€” verify endpoint URL; vLLM uses port 8000 by default, Triton gRPC uses 8001
  • TTFT extremely high β€” model may be loading on first request; run warmup requests first
  • Throughput drops at high concurrency β€” GPU memory saturated; check nvidia-smi for memory usage; reduce --max-tokens or batch size
  • Streaming metrics missing β€” must pass --streaming flag for TTFT and ITL; without it, only request-level latency is measured
  • Results vary between runs β€” use --num-prompts 200+ for statistical significance; short runs have high variance

Best Practices

  • Always benchmark with --streaming for LLM endpoints β€” TTFT and ITL are the metrics users feel
  • Run concurrency sweeps to find the saturation point β€” sweet spot is usually where TTFT p95 starts inflecting
  • Compare backends (vLLM vs TensorRT-LLM) with identical prompts and concurrency for fair comparison
  • Integrate SLO validation into CI/CD β€” fail deploys that regress inference performance
  • Store results in persistent volume for historical comparison across model versions and configs
  • Warm up the model before benchmarking β€” cold start skews TTFT

Key Takeaways

  • GenAI-Perf measures LLM-specific metrics: TTFT, ITL, output token throughput
  • Concurrency sweeps reveal the optimal load for your GPU and model size
  • SLO validation (TTFT p95 < 200ms, ITL p95 < 30ms) gates deployments
  • Works with vLLM, Triton (TensorRT-LLM), TGI, and any OpenAI-compatible endpoint
  • Successor path: AIPerf (next-gen) adds multiprocess architecture and plugin system
#genai-perf #benchmark #llm #inference #nvidia #latency
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
