ai advanced ⏱ 20 minutes K8s 1.28+

AIPerf LLM Benchmarking on K8s

Benchmark generative AI inference on Kubernetes with NVIDIA AIPerf. Measure TTFT, ITL, throughput, and latency across engines such as vLLM and NIM.

By Luca Berton • 📖 8 min read

💡 Quick Answer: AIPerf (aiperf profile) is NVIDIA's comprehensive LLM benchmarking tool that measures TTFT, ITL, output throughput, and request latency against any OpenAI-compatible endpoint. Deploy it as a Kubernetes Job targeting your inference service, with configurable concurrency, request rates, arrival patterns, and dataset workloads.

The Problem

Before deploying LLM inference to production, you need answers to:

  • What's the Time to First Token (TTFT) under load?
  • How does Inter-Token Latency (ITL) degrade at concurrency 50 vs 200?
  • What's the maximum throughput (tokens/sec) before SLA violations?
  • How does your inference engine (vLLM, NIM, TGI) perform with realistic traffic patterns?
  • Is your GPU utilization optimal or are you over-provisioned?

Generic HTTP benchmarking tools (wrk, hey) don't understand streaming tokens, can't measure TTFT/ITL, and don't generate realistic LLM workloads.

The Solution

Install AIPerf

# In a Python virtual environment
pip install aiperf

# Or use the container image
# nvcr.io/nvidia/aiperf:0.7.0

Quick Benchmark Against a K8s Inference Service

# Profile a vLLM deployment exposed via Service
aiperf profile \
  --model "meta-llama/Llama-3.1-8B-Instruct" \
  --streaming \
  --endpoint-type chat \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  --url http://vllm-service.ai-inference:8000 \
  --concurrency 10 \
  --request-count 100

Kubernetes Benchmark Job

apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-benchmark
  namespace: ai-inference
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: aiperf
          image: python:3.12-slim
          command:
            - bash
            - -c
            - |
              pip install aiperf -q
              
              # Warmup run
              aiperf profile \
                --model "meta-llama/Llama-3.1-8B-Instruct" \
                --streaming \
                --endpoint-type chat \
                --tokenizer meta-llama/Llama-3.1-8B-Instruct \
                --url http://vllm-service:8000 \
                --concurrency 1 \
                --request-count 5 \
                --ui none
              
              # Actual benchmark
              aiperf profile \
                --model "meta-llama/Llama-3.1-8B-Instruct" \
                --streaming \
                --endpoint-type chat \
                --tokenizer meta-llama/Llama-3.1-8B-Instruct \
                --url http://vllm-service:8000 \
                --concurrency 50 \
                --request-count 500 \
                --ui none
              
              echo "Results:"
              cat artifacts/*/profile_export_aiperf.json
          volumeMounts:
            - name: results
              mountPath: /artifacts
          resources:
            requests:
              cpu: "4"
              memory: 4Gi
      volumes:
        - name: results
          emptyDir: {}

Concurrency Sweep Job

apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-sweep
  namespace: ai-inference
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: aiperf
          image: python:3.12-slim
          command:
            - bash
            - -c
            - |
              pip install aiperf -q
              MODEL="meta-llama/Llama-3.1-8B-Instruct"
              URL="http://vllm-service:8000"
              
              for CONC in 1 5 10 25 50 100; do
                echo "=== Concurrency: $CONC ==="
                aiperf profile \
                  --model "$MODEL" \
                  --streaming \
                  --endpoint-type chat \
                  --tokenizer "$MODEL" \
                  --url "$URL" \
                  --concurrency "$CONC" \
                  --request-count 200 \
                  --ui none
              done
          resources:
            requests:
              cpu: "4"
              memory: 4Gi
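
Once the sweep has run, the useful question is where latency falls off a cliff. The sketch below walks sweep results in order of increasing concurrency and reports the last level that still meets a TTFT target. The row layout (`concurrency`, `ttft_p99_ms`) and every number in it are made up for illustration; they are not AIPerf's export schema or real measurements, so adapt the field names to whatever your result files contain.

```python
def last_level_before_cliff(sweep, ttft_slo_ms):
    """Return the highest concurrency reached before p99 TTFT first violates the SLO.

    Returns None when even the lowest concurrency level misses the target.
    """
    best = None
    for row in sorted(sweep, key=lambda r: r["concurrency"]):
        if row["ttft_p99_ms"] >= ttft_slo_ms:
            break  # first violation: everything beyond this point is past the cliff
        best = row["concurrency"]
    return best


# Hypothetical sweep results (illustrative values only, not measured).
example_sweep = [
    {"concurrency": 1, "ttft_p99_ms": 90},
    {"concurrency": 5, "ttft_p99_ms": 110},
    {"concurrency": 10, "ttft_p99_ms": 140},
    {"concurrency": 25, "ttft_p99_ms": 210},
    {"concurrency": 50, "ttft_p99_ms": 480},
    {"concurrency": 100, "ttft_p99_ms": 1900},
]

print(last_level_before_cliff(example_sweep, ttft_slo_ms=500))  # 50
print(last_level_before_cliff(example_sweep, ttft_slo_ms=200))  # 10
```

Stopping at the first violation, rather than taking the maximum passing level, avoids trusting a lucky run at a higher concurrency when a lower one already failed.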

Benchmark NIM Deployment

# Profile NVIDIA NIM
aiperf profile \
  --model "meta-llama/Llama-3.1-8B-Instruct" \
  --streaming \
  --endpoint-type chat \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  --url http://nim-service.ai-inference:8000 \
  --concurrency 20 \
  --request-count 300

Request Rate Control

# Fixed request rate (requests per second)
aiperf profile \
  --model "meta-llama/Llama-3.1-8B-Instruct" \
  --streaming \
  --endpoint-type chat \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  --url http://vllm-service:8000 \
  --request-rate 10.0 \
  --request-count 500

# Request rate with max concurrency cap
aiperf profile \
  --model "meta-llama/Llama-3.1-8B-Instruct" \
  --streaming \
  --endpoint-type chat \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  --url http://vllm-service:8000 \
  --request-rate 10.0 \
  --concurrency 50 \
  --request-count 500
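
Whether the concurrency cap in the second command ever takes effect depends on how long requests stay in flight. Little's Law gives a rough estimate: average in-flight requests equal arrival rate times mean request latency. The sketch below applies it; the latencies are assumed values chosen for illustration, not measurements.

```python
def expected_in_flight(request_rate: float, mean_latency_s: float) -> float:
    """Little's Law: average in-flight requests = arrival rate x mean latency."""
    return request_rate * mean_latency_s


def cap_is_binding(request_rate: float, mean_latency_s: float, concurrency_cap: int) -> bool:
    """True when the offered load needs more in-flight requests than the cap allows."""
    return expected_in_flight(request_rate, mean_latency_s) > concurrency_cap


# 10 req/s as in the commands above; the 3 s and 6 s latencies are assumptions.
print(expected_in_flight(10.0, 3.0))  # 30.0
print(cap_is_binding(10.0, 3.0, 50))  # False
print(cap_is_binding(10.0, 6.0, 50))  # True
```

If the cap is binding, the benchmark cannot sustain the requested rate, and the achieved request rate in the results will be lower than the one you asked for.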

Arrival Patterns

# Poisson arrivals (realistic)
aiperf profile \
  --model "meta-llama/Llama-3.1-8B-Instruct" \
  --streaming \
  --endpoint-type chat \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  --url http://vllm-service:8000 \
  --request-rate 10.0 \
  --arrival-pattern poisson \
  --request-count 500

# Gradual ramp-up
aiperf profile \
  --model "meta-llama/Llama-3.1-8B-Instruct" \
  --streaming \
  --endpoint-type chat \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  --url http://vllm-service:8000 \
  --concurrency 100 \
  --ramp-up-duration 60 \
  --request-count 1000
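
To see why Poisson arrivals stress a server differently from a constant rate, it helps to look at the send times themselves. In a Poisson process the gaps between requests are exponentially distributed with mean 1/rate, so requests arrive in bursts and lulls. The sketch below generates both schedules with the standard library; it only illustrates the statistics and makes no claim about how AIPerf schedules requests internally.

```python
import random


def poisson_arrival_times(rate_per_s: float, count: int, seed: int = 0) -> list[float]:
    """Send times (seconds) for a Poisson process with the given mean rate."""
    rng = random.Random(seed)
    times = []
    now = 0.0
    for _ in range(count):
        now += rng.expovariate(rate_per_s)  # exponential gap, mean 1 / rate_per_s
        times.append(now)
    return times


def constant_arrival_times(rate_per_s: float, count: int) -> list[float]:
    """Evenly spaced send times, for comparison."""
    return [(i + 1) / rate_per_s for i in range(count)]


poisson = poisson_arrival_times(10.0, 500)
constant = constant_arrival_times(10.0, 500)
gaps = [later - earlier for earlier, later in zip(poisson, poisson[1:])]

print(f"constant: last request sent at {constant[-1]:.1f}s, every gap is 0.1000s")
print(f"poisson:  last request sent at {poisson[-1]:.1f}s")
print(f"poisson gaps: min={min(gaps):.4f}s max={max(gaps):.4f}s")
```

Both schedules average 10 requests per second, but the Poisson one produces short bursts that queue up on the server, which is where tail latency usually shows.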

Custom Dataset / ShareGPT

# Use ShareGPT dataset for realistic prompts
aiperf profile \
  --model "meta-llama/Llama-3.1-8B-Instruct" \
  --streaming \
  --endpoint-type chat \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  --url http://vllm-service:8000 \
  --dataset sharegpt \
  --concurrency 20 \
  --request-count 500

# Synthetic dataset with controlled ISL/OSL
aiperf profile \
  --model "meta-llama/Llama-3.1-8B-Instruct" \
  --streaming \
  --endpoint-type chat \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  --url http://vllm-service:8000 \
  --input-tokens-mean 512 \
  --output-tokens-mean 256 \
  --concurrency 20 \
  --request-count 500

Benchmark Embeddings / Rankings

# Embedding model
aiperf profile \
  --model "nvidia/nv-embedqa-e5-v5" \
  --endpoint-type embeddings \
  --url http://embedding-service:8000 \
  --concurrency 50 \
  --request-count 1000

# Ranking model
aiperf profile \
  --model "nvidia/nv-rerankqa-mistral-4b-v3" \
  --endpoint-type rankings \
  --url http://ranking-service:8000 \
  --concurrency 20 \
  --request-count 500

Multi-URL Load Balancing

# Distribute across multiple inference replicas
aiperf profile \
  --model "meta-llama/Llama-3.1-8B-Instruct" \
  --streaming \
  --endpoint-type chat \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  --url http://vllm-0.vllm-headless:8000 \
  --url http://vllm-1.vllm-headless:8000 \
  --url http://vllm-2.vllm-headless:8000 \
  --concurrency 60 \
  --request-count 1000

Goodput (SLO-Based Throughput)

# Measure requests meeting SLA targets
aiperf profile \
  --model "meta-llama/Llama-3.1-8B-Instruct" \
  --streaming \
  --endpoint-type chat \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  --url http://vllm-service:8000 \
  --concurrency 50 \
  --request-count 500 \
  --goodput ttft:500 itl:100
  # Only count requests with TTFT < 500ms AND ITL < 100ms
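
The idea behind goodput fits in a few lines: a request counts only if it meets every threshold at once. The sketch below uses a hypothetical `RequestResult` record and invented sample values; it is not AIPerf's implementation or output format.

```python
from dataclasses import dataclass


@dataclass
class RequestResult:
    ttft_ms: float  # time to first token for this request
    itl_ms: float   # mean inter-token latency for this request


def goodput(results, duration_s, ttft_slo_ms=500.0, itl_slo_ms=100.0):
    """Return (good requests per second, fraction of requests meeting every SLO)."""
    good = [r for r in results if r.ttft_ms < ttft_slo_ms and r.itl_ms < itl_slo_ms]
    return len(good) / duration_s, len(good) / len(results)


# Hypothetical results collected over a 2-second window.
sample = [
    RequestResult(ttft_ms=180, itl_ms=40),   # meets both targets
    RequestResult(ttft_ms=620, itl_ms=45),   # TTFT too slow
    RequestResult(ttft_ms=300, itl_ms=130),  # ITL too slow
    RequestResult(ttft_ms=450, itl_ms=95),   # meets both targets
]

rate, fraction = goodput(sample, duration_s=2.0)
print(rate, fraction)  # 1.0 0.5
```

Raw throughput for the same window would be 2.0 requests per second; goodput is half of that because two requests missed a target.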

UI Modes

# Real-time TUI dashboard (interactive)
aiperf profile ... --ui dashboard

# Simple progress bars (CI-friendly)
aiperf profile ... --ui simple

# Headless (no output, just results files)
aiperf profile ... --ui none

graph TD
    subgraph AIPerf Architecture
        CLI[aiperf profile] --> SM[Session Manager]
        SM --> RG[Request Generator]
        SM --> DP[Data Plane]
        SM --> MP[Metrics Plane]
        
        RG -->|Concurrency / Rate| DP
        DP -->|HTTP/SSE| INF[Inference Service]
        DP -->|Token Events| MP
        MP --> DASH[Dashboard UI]
        MP --> CSV[CSV/JSON Export]
    end
    
    subgraph Kubernetes
        INF --> VLLM[vLLM Pod]
        INF --> NIM[NIM Pod]
        INF --> TGI[TGI Pod]
    end
    
    subgraph Metrics
        TTFT[Time to First Token]
        ITL[Inter-Token Latency]
        OTT[Output Throughput]
        RL[Request Latency]
        GP[Goodput SLO]
    end

Key Metrics Explained

| Metric | Description | What Good Looks Like |
| --- | --- | --- |
| TTFT (Time to First Token) | Latency until first generated token | < 200ms at target concurrency |
| TTST (Time to Second Token) | Latency from first to second token | Close to ITL (no startup spike) |
| ITL (Inter-Token Latency) | Average time between consecutive tokens | < 50ms for interactive use |
| Output Throughput (tokens/sec) | Total tokens generated per second | Model/GPU dependent |
| Per-User Throughput (tok/sec/user) | Throughput experienced per user | Decreases with concurrency |
| Request Latency | End-to-end time per request | TTFT + (output_tokens × ITL) |
| Goodput | Requests meeting SLA thresholds | > 95% of requests within SLA |
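
These metrics all fall out of one thing: the arrival time of each streamed token. The sketch below derives them for a single request under one common set of definitions (ITL as the mean gap between consecutive tokens). The function name and the timeline are hypothetical, and AIPerf's own calculations may differ in detail.

```python
def streaming_metrics(request_sent_ms, token_times_ms):
    """Derive per-request latency metrics from streamed token arrival times (ms)."""
    if len(token_times_ms) < 2:
        raise ValueError("need at least two tokens to compute TTST and ITL")
    gaps = [later - earlier for earlier, later in zip(token_times_ms, token_times_ms[1:])]
    return {
        "ttft_ms": token_times_ms[0] - request_sent_ms,   # wait for the first token
        "ttst_ms": gaps[0],                               # first-to-second token gap
        "itl_ms": sum(gaps) / len(gaps),                  # mean gap between tokens
        "request_latency_ms": token_times_ms[-1] - request_sent_ms,
        "output_tokens": len(token_times_ms),
    }


# Hypothetical timeline: sent at t=0, five tokens 40 ms apart after a 200 ms wait.
m = streaming_metrics(0, [200, 240, 280, 320, 360])
print(m["ttft_ms"], m["ttst_ms"], m["itl_ms"], m["request_latency_ms"])  # 200 40 40.0 360

# Request latency is TTFT plus one ITL per gap between tokens.
assert m["request_latency_ms"] == m["ttft_ms"] + (m["output_tokens"] - 1) * m["itl_ms"]
```

With n output tokens there are n - 1 gaps, so the table's TTFT + (output_tokens × ITL) is a close approximation that overshoots by one ITL.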

Supported Endpoint Types

| Type | Flag | APIs |
| --- | --- | --- |
| Chat completions | --endpoint-type chat | OpenAI /v1/chat/completions |
| Text completions | --endpoint-type completions | OpenAI /v1/completions |
| Embeddings | --endpoint-type embeddings | OpenAI + NIM embeddings |
| Rankings | --endpoint-type rankings | NIM ranking/reranking |
| Audio | --endpoint-type audio | OpenAI audio |
| Vision | --endpoint-type vision | Vision LLMs (LLaVA, etc.) |
| Image generation | --endpoint-type image | OpenAI images |

Common Issues

TTFT spikes at concurrency > 1

The first requests often pay for model loading or KV cache warmup, which inflates TTFT. Use --warmup-requests or run a separate warmup phase first:

# 5 warmup requests before measurement
aiperf profile ... --warmup-requests 5

Token counts don't match expected output

AIPerf needs the correct tokenizer to count tokens. Always specify --tokenizer:

--tokenizer meta-llama/Llama-3.1-8B-Instruct
# Or local path
--tokenizer /models/tokenizer

Connection refused to inference service

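From the AIPerf pod, verify the service is reachable. The python:3.12-slim image used in the Jobs above does not ship curl, so use Python's standard library instead:

kubectl exec aiperf-pod -- python -c "import urllib.request as u; print(u.urlopen('http://vllm-service:8000/v1/models', timeout=5).read().decode())"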

Output tokens truncated: OSL lower than expected

Inference servers may apply their own max_tokens default. Use --extra-inputs to set it explicitly:

aiperf profile ... --extra-inputs max_tokens:512

Very high concurrency causes port exhaustion

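Linux draws client source ports from an ephemeral range, 32768-60999 by default (about 28K ports). For concurrency above 15K, widen the range. Inside a pod, /proc/sys is normally read-only, so set it through the pod's securityContext.sysctls field (net.ipv4.ip_local_port_range is one of Kubernetes' safe sysctls); on a node or VM, run: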

sysctl -w net.ipv4.ip_local_port_range="1024 65535"
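
The arithmetic behind those figures is simple: the number of usable source ports is the size of the configured range. The helper below is a small illustration (the function name is made up) that parses a range string in the same "low high" format the sysctl uses.

```python
def ephemeral_port_count(port_range: str) -> int:
    """Number of source ports in an ip_local_port_range string such as '32768 60999'."""
    low, high = (int(part) for part in port_range.split())
    if high < low:
        raise ValueError("upper bound must not be below lower bound")
    return high - low + 1  # both ends of the range are usable


print(ephemeral_port_count("32768 60999"))  # 28232, the usual Linux default
print(ephemeral_port_count("1024 65535"))   # 64512, after the sysctl above
```

Each open connection to the same destination IP and port needs its own source port, so the range size is a hard ceiling on concurrent connections from one client IP to one service endpoint.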

Dashboard mode not rendering in pod

Use --ui none or --ui simple for non-interactive environments (Jobs, CI pipelines).

Best Practices

  • Warmup before measuring: run 5-10 warmup requests to fill KV caches and JIT-compile kernels
  • Use realistic workloads: ShareGPT or custom datasets over synthetic random tokens
  • Sweep concurrency: test 1, 5, 10, 25, 50, 100 to find the throughput-latency curve
  • Set SLOs with goodput: --goodput ttft:200 itl:50 measures real production fitness
  • Use Poisson arrivals: --arrival-pattern poisson models real traffic better than a constant rate
  • Pin tokenizer: always specify --tokenizer to get accurate token counts
  • Compare engines fairly: same model, same dataset, same concurrency, same hardware
  • Export results: JSON/CSV artifacts in artifacts/ directory for post-analysis
  • Run from within the cluster: deploy AIPerf as a Job to avoid network latency from external clients
  • Multi-URL for distributed setups: pass multiple --url flags to benchmark across inference replicas
  • Combine with GPU telemetry: use DCGM metrics to correlate throughput with GPU utilization

Key Takeaways

  • AIPerf replaces generic HTTP benchmark tools with LLM-aware metrics (TTFT, ITL, per-user throughput)
  • Supports all major inference APIs: OpenAI chat/completions, embeddings, rankings, vision, audio
  • Scalable multiprocess architecture with 9 ZMQ-connected services
  • Three benchmark modes: concurrency-based, request-rate, and trace replay
  • Arrival patterns (constant, Poisson, gamma) model realistic traffic
  • Goodput measures SLO compliance (% of requests meeting latency targets)
  • Extensive dataset support: ShareGPT, AIMO, MMStar, synthetic, custom, and multi-turn
  • Plugin system for custom endpoints, datasets, transports, and metrics
  • Export to CSV/JSON + visualization/plotting for multi-run comparison
  • Deploy as K8s Job for in-cluster benchmarking, with --ui none for headless mode
  • Always benchmark before production: find the concurrency cliff where latency degrades
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.