AI β€’ Intermediate β€’ ⏱ 20 minutes β€’ K8s 1.28+

AIPerf: Benchmark LLMs on Kubernetes

Deploy NVIDIA AIPerf to benchmark LLM inference performance on Kubernetes. Measure TTFT, ITL, and throughput with a real-time dashboard and GPU telemetry.

By Luca Berton β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Run aiperf profile --model <model> --streaming --endpoint-type chat --url http://<server>:8000 to benchmark any OpenAI-compatible LLM endpoint. AIPerf is NVIDIA’s next-gen benchmarking tool replacing GenAI-Perf, with a real-time TUI dashboard, plugin architecture, and multiprocess scalability.

The Problem

Benchmarking LLM inference in production Kubernetes clusters requires:

  • Realistic load generation β€” concurrency sweeps, request rate control, trace replay
  • Real-time visibility β€” watching metrics live during the benchmark, not just at the end
  • GPU telemetry β€” correlating inference latency with GPU utilization and memory pressure
  • Reproducibility β€” deterministic datasets and configurable random seeds
  • Flexibility β€” testing across Triton, vLLM, TGI, Ollama, and OpenAI-compatible endpoints

GenAI-Perf handled some of these but is now deprecated. AIPerf is its successor β€” built on a scalable multiprocess architecture with 9 ZMQ-connected services, extensible plugins, and three UI modes.

The Solution

Step 1: Deploy AIPerf as a Kubernetes Job

apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
  namespace: ai-inference
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: aiperf
          image: python:3.11-slim
          command:
            - /bin/bash
            - -c
            - |
              pip install aiperf

              # Benchmark vLLM deployment
              aiperf profile \
                --model llama3-8b \
                --streaming \
                --endpoint-type chat \
                --url http://vllm-server.ai-inference:8000 \
                --concurrency 16 \
                --request-count 200 \
                --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
                --ui simple \
                --artifact-dir /results/llama3-c16

              echo "=== Benchmark Complete ==="
              cat /results/llama3-c16/*_aiperf.csv
          resources:
            limits:
              cpu: "4"
              memory: 8Gi
          volumeMounts:
            - name: results
              mountPath: /results
      volumes:
        - name: results
          persistentVolumeClaim:
            claimName: benchmark-results
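Applying and monitoring the Job is standard kubectl work. A sketch, assuming the manifest above is saved as aiperf-job.yaml and the benchmark-results PVC already exists in the ai-inference namespace:

```shell
# Apply the Job manifest
kubectl apply -f aiperf-job.yaml

# Follow benchmark progress from the pod logs
kubectl logs -n ai-inference -f job/aiperf-benchmark

# Wait for completion (backoffLimit: 0 means no retries on failure)
kubectl wait -n ai-inference --for=condition=complete \
  --timeout=30m job/aiperf-benchmark
```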

Step 2: Quick Benchmarks from a Debug Pod

# Install AIPerf
pip install aiperf

# Benchmark OpenAI-compatible endpoint (vLLM, TGI, etc.)
aiperf profile \
  --model llama3-8b \
  --streaming \
  --endpoint-type chat \
  --url http://vllm-server:8000 \
  --concurrency 10 \
  --request-count 100

# Benchmark Triton with TensorRT-LLM backend
aiperf profile \
  --model llama3-8b \
  --streaming \
  --endpoint-type chat \
  --url http://triton-server:8000 \
  --concurrency 32

# Benchmark with dashboard UI (requires TTY)
aiperf profile \
  --model llama3-8b \
  --streaming \
  --endpoint-type chat \
  --url http://vllm-server:8000 \
  --ui dashboard
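To find the saturation point of a deployment, it helps to sweep concurrency and keep each run's artifacts separate. A sketch (the concurrency values and directory layout are illustrative):

```shell
# Sweep concurrency levels against the same endpoint, one artifact dir per run
for c in 1 4 16 64; do
  aiperf profile \
    --model llama3-8b \
    --streaming \
    --endpoint-type chat \
    --url http://vllm-server:8000 \
    --concurrency "$c" \
    --request-count 200 \
    --ui none \
    --artifact-dir "/results/sweep-c${c}"
done
```

Comparing TTFT and throughput across the sweep shows where added concurrency stops buying throughput and starts costing latency.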

Step 3: Understanding the Output

                    NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Metric                               ┃    avg ┃    min ┃    p99 ┃    p50 ┃
┑━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
β”‚ Time to First Token (ms)             β”‚  45.20 β”‚  32.10 β”‚  98.50 β”‚  42.30 β”‚
β”‚ Time to Second Token (ms)            β”‚  12.50 β”‚   8.20 β”‚  28.90 β”‚  11.80 β”‚
β”‚ Inter Token Latency (ms)             β”‚  11.30 β”‚   8.50 β”‚  25.60 β”‚  10.90 β”‚
β”‚ Request Latency (ms)                 β”‚ 892.40 β”‚ 456.20 β”‚1845.30 β”‚ 812.50 β”‚
β”‚ Output Token Throughput (tokens/sec) β”‚ 1420.5 β”‚    N/A β”‚    N/A β”‚    N/A β”‚
β”‚ Request Throughput (requests/sec)    β”‚   11.2 β”‚    N/A β”‚    N/A β”‚    N/A β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key metrics:

  • TTFT β€” time to first token, determines perceived responsiveness
  • TTST β€” time to second token (new in AIPerf), captures KV cache allocation overhead
  • ITL β€” inter-token latency, affects streaming UX quality
  • Output token throughput β€” total tokens/sec across all concurrent requests
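These metrics are related: average request latency should be roughly TTFT plus (N βˆ’ 1) inter-token gaps, where N is the average output length. A quick check against the averages in the table above, assuming about 76 output tokens per request (a figure inferred from the sample numbers, not reported by the tool):

```shell
# request_latency β‰ˆ TTFT + (N - 1) * ITL, with N = 76 output tokens (assumed)
awk 'BEGIN { printf "%.1f ms\n", 45.20 + 75 * 11.30 }'
# prints 892.7 ms, close to the reported 892.40 ms average request latency
```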

Step 4: GPU Telemetry with DCGM

# Collect GPU metrics during benchmark
aiperf profile \
  --model llama3-8b \
  --streaming \
  --endpoint-type chat \
  --url http://vllm-server:8000 \
  --concurrency 32 \
  --server-metrics-urls http://dcgm-exporter.gpu-operator:9400/metrics \
  --verbose

# GPU metrics collected:
# - GPU utilization, SM utilization
# - Memory used/free, power usage
# - PCIe throughput, NVLink errors
# - Temperature, clock speeds
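Before pointing AIPerf at the exporter, it is worth verifying that the DCGM endpoint is reachable from inside the cluster. The service name and namespace below match the example above; adjust them for your GPU Operator install:

```shell
# Confirm the exporter serves GPU metrics before the benchmark run
kubectl run dcgm-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://dcgm-exporter.gpu-operator:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```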

The overall architecture and data flow:

flowchart TD
    A[AIPerf Pod] -->|HTTP/gRPC| B[Inference Server]
    A -->|Prometheus scrape| C[DCGM Exporter]
    B --> D[GPU Workers]
    C --> D
    A --> E[Results]
    E --> F[Console Table]
    E --> G[CSV Export]
    E --> H[JSON Export]
    E --> I[PNG Plots]
    subgraph AIPerf Architecture
        J[Request Generator] -->|ZMQ| K[Transport Layer]
        K -->|ZMQ| L[Response Collector]
        L -->|ZMQ| M[Metrics Aggregator]
    end

Common Issues

Dashboard UI not rendering in Kubernetes Job

# Jobs don't have a TTY β€” use simple or none UI mode
aiperf profile --ui simple  # progress bars
aiperf profile --ui none    # headless, logs only

# For interactive debugging, use kubectl exec with TTY
kubectl exec -it debug-pod -- aiperf profile --ui dashboard
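If no debug pod exists yet, one can be started ad hoc with a TTY. A sketch (pod name and image are arbitrary choices):

```shell
# Throwaway pod with a TTY: install AIPerf, run the dashboard UI, clean up on exit
kubectl run aiperf-debug -n ai-inference -it --rm --restart=Never \
  --image=python:3.11-slim -- bash -c \
  "pip install aiperf && aiperf profile --model llama3-8b --streaming \
     --endpoint-type chat --url http://vllm-server:8000 --ui dashboard"
```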

Tokenizer download fails in air-gapped clusters

# Pre-download tokenizer and mount as volume
# Or use --tokenizer with a local path
aiperf profile \
  --tokenizer /models/tokenizers/llama3 \
  --model llama3-8b
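One way to pre-stage the tokenizer is the Hugging Face CLI on a machine with internet access, then copy the files into the cluster volume. The repo id and target path are examples; gated models like Llama 3 also require an accepted license and an HF token:

```shell
# Download only the tokenizer files, not the model weights
pip install -U huggingface_hub
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
  --include "tokenizer*" "special_tokens_map.json" \
  --local-dir /models/tokenizers/llama3
```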

High concurrency causes port exhaustion

# AIPerf note: >15,000 concurrency may exhaust ports
# Adjust system limits if needed:
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# Or reduce concurrency to realistic levels

Best Practices

  • Use --ui simple for Kubernetes Jobs β€” dashboard requires TTY, simple mode shows progress bars
  • Set --random-seed for reproducible benchmarks across runs
  • Match --tokenizer to your model β€” token counts affect all per-token metrics
  • Start with --warmup-request-count 10 to eliminate cold-start effects
  • Export results to PVC β€” use --artifact-dir mounted to a PersistentVolumeClaim for post-analysis
  • Migrate from GenAI-Perf β€” AIPerf is the successor with the same CLI patterns plus new features

Key Takeaways

  • AIPerf replaces GenAI-Perf as NVIDIA’s official LLM benchmarking tool
  • Built on a 9-service multiprocess architecture with ZMQ for scalability
  • Three UI modes: dashboard (real-time TUI), simple (progress bars), none (headless)
  • Measures TTFT, TTST, ITL, output token throughput, and request throughput
  • Works with any OpenAI-compatible endpoint β€” vLLM, Triton, TGI, Ollama, OpenAI
  • Plugin system for custom endpoints, datasets, transports, and metrics
  • Collects GPU telemetry from DCGM Exporter during benchmarks
#aiperf #benchmarking #nvidia #inference #llm #gpu #ai #performance
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
