AI · Advanced · ⏱ 25 minutes · K8s 1.28+

AIPerf Goodput and SLO Benchmarks

Measure LLM goodput with AIPerf on Kubernetes. Define SLOs for TTFT and ITL, calculate effective throughput, and benchmark with timeslice analysis.

By Luca Berton · 📖 6 min read

💡 Quick Answer: Use aiperf profile --goodput "time_to_first_token:200 inter_token_latency:50" to measure goodput: the throughput of requests that meet your SLO. A server doing 100 req/s is useless if 40% of requests exceed your latency budget.

The Problem

Raw throughput numbers are misleading. Consider:

  • Server A: 100 req/s, P99 TTFT = 2000ms
  • Server B: 70 req/s, P99 TTFT = 150ms

Server B is better for a chat application with a 200ms TTFT SLO. Goodput measures only the requests that meet your quality bar.

Without goodput:

  • You overprovision based on raw throughput
  • Users experience latency spikes you didn’t plan for
  • Capacity planning is based on fiction

The Solution

Step 1: Define SLOs and Measure Goodput

# Define your SLOs and measure goodput
aiperf profile \
  --model llama3-8b \
  --streaming \
  --endpoint-type chat \
  --url http://vllm-server:8000 \
  --concurrency 32 \
  --request-count 500 \
  --random-seed 42 \
  --goodput "time_to_first_token:200" \
  --goodput "inter_token_latency:50" \
  --ui simple

# Output includes:
# Request Throughput: 45.2 req/s
# Goodput: 38.7 req/s (85.6% of requests met all SLOs)

Step 2: Goodput Sweep to Find Sustainable Capacity

apiVersion: batch/v1
kind: Job
metadata:
  name: aiperf-goodput-sweep
  namespace: ai-inference
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: sweep
          image: python:3.11-slim
          command:
            - /bin/bash
            - -c
            - |
              pip install aiperf

              echo "=== Goodput Sweep ==="
              echo "concurrency,throughput,goodput,goodput_pct"

              for C in 1 4 8 16 32 64 128; do
                aiperf profile \
                  --model llama3-8b \
                  --streaming \
                  --endpoint-type chat \
                  --url http://vllm-server.ai-inference:8000 \
                  --tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
                  --concurrency $C \
                  --request-count 200 \
                  --warmup-request-count 10 \
                  --random-seed 42 \
                  --synthetic-input-tokens-mean 550 \
                  --output-tokens-mean 256 \
                  --goodput "time_to_first_token:200" \
                  --goodput "inter_token_latency:50" \
                  --goodput "request_latency:5000" \
                  --ui none \
                  --artifact-dir /results/goodput-c$C
              done
          resources:
            limits:
              cpu: "4"
              memory: 8Gi
          volumeMounts:
            - name: results
              mountPath: /results
      volumes:
        - name: results
          persistentVolumeClaim:
            claimName: benchmark-results
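Once the sweep finishes, the knee is the highest concurrency at which goodput still tracks throughput. A Python sketch of that selection; the sweep tuples are invented numbers, not real AIPerf results, and the 0.95 ratio is an arbitrary threshold:

```python
# Sketch: pick the highest concurrency whose goodput still tracks throughput.
# The (concurrency, throughput, goodput) tuples are illustrative sweep
# results, not real AIPerf output.

sweep = [
    (1, 4.8, 4.8),
    (8, 31.0, 30.5),
    (16, 44.0, 42.9),
    (32, 52.0, 44.2),   # goodput starts falling behind throughput
    (64, 55.0, 31.0),   # badly overloaded
]

def sustainable_concurrency(sweep, min_ratio=0.95):
    """Highest concurrency where goodput/throughput stays above min_ratio."""
    ok = [c for c, tput, gput in sweep if gput / tput >= min_ratio]
    return max(ok) if ok else None

print(sustainable_concurrency(sweep))  # 16
```

Beyond that concurrency, extra load only produces SLO-violating requests.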

Step 3: Timeslice Analysis

Track how performance changes over time during a benchmark:

# Enable timeslice metrics to see per-interval performance
aiperf profile \
  --model llama3-8b \
  --streaming \
  --endpoint-type chat \
  --url http://vllm-server:8000 \
  --concurrency 32 \
  --request-count 1000 \
  --timeslice-duration 10 \
  --ui simple \
  --artifact-dir /results/timeslice

# Generates per-10-second metrics:
# Interval 0-10s:  TTFT=42ms, ITL=11ms, 45 req/s
# Interval 10-20s: TTFT=48ms, ITL=12ms, 43 req/s
# Interval 20-30s: TTFT=95ms, ITL=18ms, 38 req/s  ← degradation!
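Flagging degraded intervals can be automated: compare each slice against the first interval's baseline. A Python sketch; the slice values mirror the illustrative output above, and the 1.5x tolerance is an arbitrary choice:

```python
# Sketch: flag timeslices whose TTFT drifts well above the first interval's
# baseline. Interval values mirror the illustrative benchmark output; the
# tolerance factor is an arbitrary choice.

slices = [
    {"interval": "0-10s", "ttft_ms": 42},
    {"interval": "10-20s", "ttft_ms": 48},
    {"interval": "20-30s", "ttft_ms": 95},
]

def degraded(slices, tolerance=1.5):
    """Return intervals whose TTFT exceeds baseline * tolerance."""
    baseline = slices[0]["ttft_ms"]
    return [s["interval"] for s in slices if s["ttft_ms"] > baseline * tolerance]

print(degraded(slices))  # ['20-30s']
```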

Step 4: SLO Tiers for Different Use Cases

apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-profiles
  namespace: ai-inference
data:
  chat-interactive.sh: |
    # Real-time chat: strictest SLOs
    aiperf profile \
      --model llama3-8b \
      --streaming --endpoint-type chat \
      --url http://vllm-server:8000 \
      --concurrency 32 --request-count 500 \
      --goodput "time_to_first_token:100" \
      --goodput "inter_token_latency:30" \
      --goodput "request_latency:3000" \
      --ui none --artifact-dir /results/slo-chat

  batch-processing.sh: |
    # Batch processing: relaxed SLOs, maximize throughput
    aiperf profile \
      --model llama3-8b \
      --streaming --endpoint-type chat \
      --url http://vllm-server:8000 \
      --concurrency 128 --request-count 1000 \
      --goodput "request_latency:30000" \
      --ui none --artifact-dir /results/slo-batch

  code-generation.sh: |
    # Code generation: long outputs, moderate latency
    aiperf profile \
      --model llama3-8b \
      --streaming --endpoint-type chat \
      --url http://vllm-server:8000 \
      --concurrency 16 --request-count 300 \
      --output-tokens-mean 1024 \
      --goodput "time_to_first_token:200" \
      --goodput "inter_token_latency:50" \
      --goodput "output_token_throughput_per_user:20" \
      --ui none --artifact-dir /results/slo-code
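If you later drive these tiers from a harness rather than copying commands, the same SLOs can live as plain data. A minimal Python sketch under that assumption; the SLO_TIERS dict and goodput_flags helper are hypothetical names, with values mirroring the ConfigMap above:

```python
# Sketch: the three SLO tiers above expressed as data so one harness can
# sweep them. Metric names match the --goodput keys used in this recipe;
# the dict and helper names are hypothetical.

SLO_TIERS = {
    "chat-interactive": {
        "time_to_first_token": 100,
        "inter_token_latency": 30,
        "request_latency": 3000,
    },
    "batch-processing": {"request_latency": 30000},
    "code-generation": {
        "time_to_first_token": 200,
        "inter_token_latency": 50,
        "output_token_throughput_per_user": 20,
    },
}

def goodput_flags(tier):
    """Render one tier as repeated --goodput CLI arguments."""
    return [f'--goodput "{m}:{v}"' for m, v in SLO_TIERS[tier].items()]

print(goodput_flags("batch-processing"))
# ['--goodput "request_latency:30000"']
```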

Step 5: HTTP Trace Metrics for Network Analysis

# Measure network overhead separately
aiperf profile \
  --model llama3-8b \
  --streaming \
  --endpoint-type chat \
  --url http://vllm-server:8000 \
  --concurrency 16 \
  --request-count 200 \
  --http-trace-metrics \
  --verbose

# HTTP trace includes:
# - DNS resolution time
# - TCP connection time
# - TLS handshake time
# - Time to first byte (TTFB)
# Helps distinguish network latency from inference latency
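A sketch of the subtraction this enables: connection setup (DNS, TCP, TLS) comes out of the client-observed time to first byte, leaving an approximation of server-side latency. The trace dict, field names, and timings are illustrative, not AIPerf's actual trace schema:

```python
# Sketch: split client-observed first-byte latency into network setup vs.
# server-side time using HTTP trace phases. Field names and timings are
# illustrative, not AIPerf's actual trace schema.

trace = {"dns_ms": 2.0, "tcp_connect_ms": 1.5, "tls_ms": 4.5, "ttfb_ms": 180.0}

def network_overhead(trace):
    """Connection setup time spent before the request reaches the server."""
    return trace["dns_ms"] + trace["tcp_connect_ms"] + trace["tls_ms"]

def server_side_ttfb(trace):
    """TTFB minus connection setup approximates server-side first-byte time."""
    return trace["ttfb_ms"] - network_overhead(trace)

print(network_overhead(trace))   # 8.0
print(server_side_ttfb(trace))   # 172.0
```

If network_overhead dominates, fix the network path (zones, service mesh, TLS reuse) before tuning the inference server.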

Goodput vs Throughput Visualization

flowchart TD
    A[AIPerf Benchmark] --> B[All Requests]
    B --> C{Met SLO?}
    C -->|TTFT < 200ms AND ITL < 50ms| D[Good Requests]
    C -->|SLO violated| E[Bad Requests]
    D --> F[Goodput = Good / Duration]
    B --> G[Throughput = Total / Duration]
    F --> H{Goodput vs Throughput}
    H -->|Goodput ≈ Throughput| I[Server healthy - scale up OK]
    H -->|Goodput << Throughput| J[Server overloaded - scale down concurrency]
    J --> K[Find concurrency where Goodput ≈ Throughput]

Common Issues

Goodput is 0 at all concurrency levels

# SLOs may be too strict for your hardware
# Start with relaxed SLOs and tighten:
--goodput "time_to_first_token:1000"  # start at 1s
--goodput "time_to_first_token:500"   # tighten
--goodput "time_to_first_token:200"   # target

# Check baseline TTFT at concurrency=1
aiperf profile --concurrency 1 --request-count 20
# If TTFT > SLO at c=1, the SLO is unreachable

Timeslice shows performance degradation over time

# Likely causes:
# 1. GPU thermal throttling: check temperatures
# 2. Memory fragmentation: KV cache not reclaiming
# 3. Garbage collection pauses (Python-based servers)

# Monitor GPU temp during benchmark:
kubectl exec dcgm-pod -- nvidia-smi --query-gpu=temperature.gpu --format=csv -l 5

Goodput varies between repeat runs on the same server

# Use multi-run for confidence intervals
--multi-run-count 3

# And fix the random seed
--random-seed 42

# Background processes on the GPU (monitoring, other pods)
# can cause variability; isolate the GPU for benchmarking
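To judge whether repeat runs actually agree, look at the spread of goodput across them. A minimal sketch with invented per-run values; if the standard deviation is large relative to the mean, treat the result as noise and isolate the GPU:

```python
# Sketch: summarize goodput across repeat runs to judge run-to-run variance.
# The per-run goodput values are invented for illustration.
import statistics

runs = [38.7, 37.9, 39.4]  # goodput (req/s) from three repeat runs

mean = statistics.mean(runs)
stdev = statistics.stdev(runs)
print(round(mean, 2), round(stdev, 2))  # 38.67 0.75
```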

Best Practices

  • Define SLOs before benchmarking β€” TTFT < 200ms for chat, < 500ms for batch, adjust per use case
  • Goodput sweep > throughput sweep β€” find the concurrency where goodput starts dropping
  • Timeslice analysis catches thermal throttling and memory degradation over time
  • HTTP trace metrics separate network latency from inference latency β€” critical in multi-zone clusters
  • Report goodput, not throughput β€” stakeholders care about requests that meet quality bars
  • Test multiple SLO tiers β€” same server has different capacity for chat vs batch workloads

Key Takeaways

  • Goodput is throughput filtered by SLO compliance β€” the metric that actually matters
  • Use --goodput "metric:value" to define latency and throughput constraints
  • Timeslice analysis reveals performance degradation patterns invisible in aggregate metrics
  • The optimal concurrency is where goodput β‰ˆ throughput β€” beyond that, you’re wasting GPU on SLO-violating requests
  • Different use cases (chat, batch, code gen) have different SLO profiles on the same hardware
  • HTTP trace metrics isolate network overhead from inference latency in Kubernetes clusters
#aiperf #benchmarking #goodput #slo #nvidia #inference #llm #ai #performance
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
