ai • intermediate • 15 minutes • K8s 1.28+

GenAI-Perf Benchmarking LLM Inference on Kubernetes

Benchmark LLM inference performance with NVIDIA GenAI-Perf on Kubernetes. Profile vLLM, TensorRT-LLM, and Triton endpoints with concurrency sweeps, token

By Luca Berton • 5 min read

Quick Answer: GenAI-Perf is NVIDIA’s benchmarking tool for LLM inference endpoints. Run genai-perf profile -m <model> --service-kind openai --endpoint-type chat against vLLM/NIM/Triton services. It measures time-to-first-token (TTFT), inter-token latency (ITL), output token throughput, and request latency at configurable concurrency levels.

The Problem

  • No standardized way to benchmark LLM inference throughput and latency
  • Manual curl tests don’t represent real concurrent workload patterns
  • Need to compare vLLM vs TensorRT-LLM vs Triton performance objectively
  • Latency percentiles (P50/P90/P99) are critical but hard to measure manually
  • Token throughput varies by prompt length, output length, batch size, and concurrency

The Solution

Install GenAI-Perf

# GenAI-Perf comes with the Triton SDK container
# Or install standalone:
pip install genai-perf

# Verify
genai-perf --version

Profile vLLM with OpenAI-Compatible API

# Basic profiling against vLLM endpoint
genai-perf profile \
  -m "llama-70b" \
  --service-kind openai \
  --endpoint-type chat \
  --url http://vllm-service.llm.svc:8000 \
  --concurrency 10 \
  --num-prompts 100 \
  --random-seed 42 \
  --synthetic-input-tokens-mean 128 \
  --synthetic-input-tokens-stddev 16 \
  --output-tokens-mean 256 \
  --streaming  # TTFT and ITL are only reported in streaming mode

# Concurrency sweep (find saturation point)
for c in 1 2 4 8 16 32 64; do
  echo "=== Concurrency: $c ==="
  genai-perf profile \
    -m "llama-70b" \
    --service-kind openai \
    --endpoint-type chat \
    --url http://vllm-service.llm.svc:8000 \
    --concurrency "$c" \
    --num-prompts 50 \
    --synthetic-input-tokens-mean 128 \
    --output-tokens-mean 256 \
    --streaming
done
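
Once the sweep has finished, the saturation point can be picked out programmatically. The sketch below is one possible heuristic, not part of GenAI-Perf: it assumes you have collected (concurrency, output tokens/sec) pairs by hand from each run, and it flags the first level after which throughput grows by less than a chosen threshold. The function name, the 10% threshold, and the sample numbers are all hypothetical.

```python
from typing import List, Optional, Tuple

# (concurrency, output tokens/sec) pairs; hypothetical numbers for illustration.
SweepPoint = Tuple[int, float]


def find_saturation(points: List[SweepPoint], min_gain: float = 0.10) -> Optional[int]:
    """Return the first concurrency after which throughput grows by less than min_gain.

    Returns None if throughput is still scaling at the highest level tested.
    Assumes every throughput value is positive.
    """
    ordered = sorted(points)
    for (c_prev, t_prev), (_c_next, t_next) in zip(ordered, ordered[1:]):
        gain = (t_next - t_prev) / t_prev
        if gain < min_gain:
            return c_prev
    return None


sweep = [(1, 40.0), (2, 78.0), (4, 150.0), (8, 280.0), (16, 450.0), (32, 470.0), (64, 472.0)]
print(find_saturation(sweep))  # 16
```

With these sample numbers, doubling from 16 to 32 adds under 5% throughput, so 16 is reported as the knee. A stricter or looser threshold moves that point, so treat it as a tuning knob.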

Profile as Kubernetes Job

apiVersion: batch/v1
kind: Job
metadata:
  name: genai-perf-benchmark
  namespace: llm
spec:
  template:
    spec:
      containers:
        - name: genai-perf
          image: nvcr.io/nvidia/tritonserver:24.05-py3-sdk
          command:
            - bash
            - -c
            - |
              # --concurrency takes a single value, so sweep with a loop
              for c in 1 2 4 8 16 32; do
                genai-perf profile \
                  -m "llama-70b" \
                  --service-kind openai \
                  --endpoint-type chat \
                  --url http://vllm-service:8000 \
                  --concurrency "$c" \
                  --num-prompts 200 \
                  --synthetic-input-tokens-mean 128 \
                  --synthetic-input-tokens-stddev 32 \
                  --output-tokens-mean 512 \
                  --output-tokens-stddev 64 \
                  --streaming \
                  --profile-export-file "benchmark-c${c}.json"
              done

              # Copy results
              cp -r artifacts/ /results/
          volumeMounts:
            - name: results
              mountPath: /results
      volumes:
        - name: results
          persistentVolumeClaim:
            claimName: benchmark-results
      restartPolicy: Never
  backoffLimit: 0

Key Metrics Explained

GenAI-Perf output metrics:

Time To First Token (TTFT):
  • Time from request sent to first token received
  • Measures prefill/prompt processing latency
  • Target: <500ms for interactive, <2s for batch

Inter-Token Latency (ITL):
  • Time between consecutive output tokens
  • Measures decode step latency
  • Target: <50ms for smooth streaming

Output Token Throughput:
  • Total output tokens / total time (across all requests)
  • Measures system-wide generation capacity
  • Higher = better utilization

Request Throughput:
  • Completed requests per second
  • Depends on output length: shorter outputs mean more req/s

End-to-End Latency:
  • Total time from request to last token
  • ≈ TTFT + (output_tokens − 1) × ITL
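
The end-to-end relationship can be checked with a quick calculation. This is a minimal sketch, assuming a single streamed request with a constant ITL; the function name and the inputs are hypothetical. A response of n tokens has n − 1 gaps between tokens, so the estimate multiplies ITL by output_tokens − 1, which for long outputs is practically the same as multiplying by output_tokens.

```python
def estimate_request_latency_s(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """Approximate end-to-end latency: TTFT plus one ITL gap per token after the first."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return (ttft_ms + (output_tokens - 1) * itl_ms) / 1000.0


# Hypothetical inputs: 98 ms TTFT, 35 ms ITL, 256 output tokens.
print(round(estimate_request_latency_s(98, 35, 256), 2))  # 9.02
```

If the measured request latency is far above this estimate, the gap usually points at queuing or uneven token pacing rather than raw decode speed.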

Example output:
┌─────────────────────────────────────────────────────┐
│ LLM Metrics (concurrency=16)                        │
├─────────────────────┬───────┬───────┬───────┬───────┤
│ Metric              │  P50  │  P90  │  P99  │  Avg  │
├─────────────────────┼───────┼───────┼───────┼───────┤
│ TTFT (ms)           │   85  │  142  │  310  │   98  │
│ ITL (ms)            │   32  │   45  │   67  │   35  │
│ Request latency (s) │  8.4  │ 11.2  │ 15.8  │  9.1  │
├─────────────────────┼───────┼───────┼───────┼───────┤
│ Output throughput   │       │       │       │ 2847  │
│ (tokens/sec)        │       │       │       │       │
│ Request throughput  │       │       │       │  11.1 │
│ (req/sec)           │       │       │       │       │
└─────────────────────┴───────┴───────┴───────┴───────┘

Compare vLLM vs TensorRT-LLM

# Profile vLLM
genai-perf profile \
  -m "llama-70b" \
  --service-kind openai \
  --endpoint-type chat \
  --url http://vllm-svc:8000 \
  --concurrency 16 \
  --num-prompts 200 \
  --streaming \
  --profile-export-file vllm-results.json

# Profile TensorRT-LLM (via Triton)
genai-perf profile \
  -m "llama-70b" \
  --service-kind triton \
  --backend tensorrtllm \
  --url triton-svc:8001 \
  --concurrency 16 \
  --num-prompts 200 \
  --streaming \
  --profile-export-file trtllm-results.json

# Compare results
genai-perf compare \
  --files vllm-results.json trtllm-results.json
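
For a quick numeric read of two runs, percent deltas are often easier to scan than side-by-side tables. The sketch below does not parse GenAI-Perf's export files; it assumes you have copied a few summary values by hand into dictionaries. The metric names and all numbers are hypothetical.

```python
from typing import Dict

# Hypothetical summary values for two backends measured at the same concurrency.
backend_a = {"ttft_p99_ms": 300.0, "itl_p99_ms": 60.0, "output_tokens_per_s": 2000.0}
backend_b = {"ttft_p99_ms": 270.0, "itl_p99_ms": 66.0, "output_tokens_per_s": 2200.0}


def percent_change(baseline: Dict[str, float], candidate: Dict[str, float]) -> Dict[str, float]:
    """Percent change of candidate relative to baseline for every shared metric."""
    return {
        name: round((candidate[name] - baseline[name]) / baseline[name] * 100.0, 1)
        for name in baseline
        if name in candidate
    }


for metric, delta in percent_change(backend_a, backend_b).items():
    print(f"{metric}: {delta:+.1f}%")
```

Read the signs per metric: a negative delta is an improvement for latency metrics, while a positive delta is an improvement for throughput.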

Advanced Options

# Custom prompts from file
genai-perf profile \
  -m "llama-70b" \
  --service-kind openai \
  --endpoint-type chat \
  --url http://vllm-svc:8000 \
  --input-file prompts.jsonl \
  --concurrency 8

# Completions endpoint (not chat)
genai-perf profile \
  -m "llama-70b" \
  --service-kind openai \
  --endpoint-type completions \
  --url http://vllm-svc:8000 \
  --concurrency 16

# Warmup requests before measurement (flag available in recent GenAI-Perf releases)
genai-perf profile \
  -m "llama-70b" \
  --service-kind openai \
  --endpoint-type chat \
  --url http://vllm-svc:8000 \
  --warmup-request-count 20 \
  --num-prompts 100 \
  --concurrency 16

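# Export for graphing: GenAI-Perf writes a CSV summary alongside the JSON export
# (results.json -> results_genai_perf.csv in the artifact directory)
genai-perf profile \
  -m "llama-70b" \
  --service-kind openai \
  --endpoint-type chat \
  --url http://vllm-svc:8000 \
  --profile-export-file results.json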

Common Issues

“Connection refused” to vLLM service

  • Cause: Service not ready, or wrong port (vLLM default: 8000)
  • Fix: Verify service with kubectl port-forward svc/vllm-svc 8000:8000 first

TTFT very high at low concurrency

  • Cause: Model not warmed up; the first requests pay one-time initialization costs (for example, kernel compilation or CUDA graph capture)
  • Fix: Use --warmup-request-count 10 to exclude cold start from measurements

Throughput doesn’t scale with concurrency

  • Cause: GPU saturated; or KV-cache full causing request queuing
  • Fix: Check GPU utilization; increase --max-num-seqs in vLLM; add more GPU replicas
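
To judge whether the KV cache is a plausible bottleneck, a back-of-the-envelope size estimate helps. The sketch below assumes a Llama-2-70B-style architecture (80 layers, 8 KV heads with grouped-query attention, head dimension 128) and FP16 cache values; these shape parameters are assumptions about the model behind the llama-70b name, so substitute the values from your model's config. The function name is hypothetical.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed Llama-2-70B-style shape: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 values.
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)

# 16 concurrent requests, each holding 128 prompt + 256 output tokens at completion.
tokens_in_flight = 16 * (128 + 256)
kv_gib = tokens_in_flight * per_token / 2**30

print(per_token)  # 327680 bytes, i.e. 320 KiB per token
print(kv_gib)     # 1.875 GiB
```

Under these assumptions the cache demand grows linearly with both concurrency and sequence length, so doubling either doubles the memory that must be free after the model weights are loaded.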

“Model not found” error

  • Cause: Model name doesn’t match vLLM’s --served-model-name
  • Fix: Check curl http://vllm-svc:8000/v1/models for exact model name
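
The same check can be scripted. This is a minimal sketch that assumes the endpoint returns the standard OpenAI-style model list (an object with a data array whose entries carry an id). The function name and the sample body are hypothetical; in practice the body would come from GET /v1/models.

```python
import json
from typing import List


def served_model_ids(models_response: str) -> List[str]:
    """Extract model IDs from an OpenAI-compatible /v1/models JSON response body."""
    payload = json.loads(models_response)
    return [entry["id"] for entry in payload.get("data", [])]


# Hypothetical response body; the real one comes from GET /v1/models.
sample = '{"object": "list", "data": [{"id": "llama-70b", "object": "model"}]}'
print(served_model_ids(sample))  # ['llama-70b']
```

Pass one of the returned IDs to -m exactly as printed, since the match is case-sensitive.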

Best Practices

  1. Sweep concurrency: find the saturation point (throughput plateaus, latency spikes)
  2. Use realistic input/output lengths: match your actual workload distribution
  3. Warmup before measuring: exclude cold-start from results
  4. Test with streaming: matches real chat/completion use cases
  5. Run multiple iterations: single runs have high variance; average 3-5 runs
  6. Profile after changes: quantify impact of model optimization, scaling, config changes
  7. Export results: track performance over time as code/infra changes
  8. Test at target SLA: find max concurrency that meets your P99 latency target
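
Practices 5 and 8 combine naturally: average the repeated runs first, then pick the highest concurrency whose averaged P99 still meets the target. The sketch below assumes P99 request latencies were recorded by hand from three runs per level; the function name, the 10-second SLA, and the numbers are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional

# Hypothetical P99 request latencies (seconds) from three runs per concurrency level.
p99_runs: Dict[int, List[float]] = {
    4: [5.9, 6.1, 6.0],
    8: [7.8, 8.1, 8.4],
    16: [15.2, 15.8, 16.1],
    32: [31.0, 29.5, 30.4],
}


def max_concurrency_within_sla(runs: Dict[int, List[float]], sla_s: float) -> Optional[int]:
    """Average repeated runs, then return the highest concurrency whose mean P99 meets the SLA."""
    passing = [c for c, values in runs.items() if mean(values) <= sla_s]
    return max(passing) if passing else None


print(max_concurrency_within_sla(p99_runs, sla_s=10.0))  # 8
```

With these sample numbers, concurrency 8 averages 8.1 s and passes, while 16 averages 15.7 s and fails, so 8 is the reported capacity per replica.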

Key Takeaways

  • GenAI-Perf profiles LLM endpoints with realistic concurrent workloads
  • Works with vLLM (--service-kind openai), TensorRT-LLM, Triton, and NIM
  • Key metrics: TTFT (prefill speed), ITL (decode speed), output token throughput
  • Concurrency sweep reveals saturation point, where latency degrades
  • Use --streaming for chat workloads; --endpoint-type completions for batch
  • Run as Kubernetes Job for reproducible, in-cluster benchmarking
  • Compare backends objectively with genai-perf compare
#genai-perf #benchmarking #vllm #tensorrt-llm #triton #performance
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.
