ai · advanced · ⏱ 25 minutes · K8s 1.28+

K8s AI Gateway: Inference Extension Guide

Use the Kubernetes AI Gateway and Inference Extension to route LLM traffic. Model-aware routing, load balancing across inference backends.

By Luca Berton • 📖 5 min read

💡 Quick Answer: The Kubernetes AI Gateway Working Group is building an Inference Extension for Gateway API that adds model-aware routing. It routes requests to the right backend based on model name, balances load across replicas by GPU utilization (not just connection count), and handles model-specific concerns such as token rate limiting. It deploys as an extension to a Gateway API implementation (Envoy Gateway, Istio, Kong).

The Problem

66% of organizations run AI inference on Kubernetes (CNCF 2026 survey), but routing LLM traffic differs from routing web traffic. You need to route by model name (e.g., a request to /v1/chat/completions with model: llama-3-70b in the body), balance load by GPU memory utilization rather than round-robin, handle long-running streaming connections, and enforce per-model rate limits. A standard HTTPRoute matches on paths, headers, query parameters, and methods, but it cannot read the request body, so it has no notion of these inference semantics.

flowchart TB
    CLIENT["Client<br/>model: llama-3-70b"] --> GW["Gateway API<br/>+ Inference Extension"]
    
    GW -->|"Route by model name"| POOL["InferencePool"]
    
    POOL -->|"Balance by<br/>GPU utilization"| POD1["vLLM Pod 1<br/>GPU: 45%"]
    POOL -->|"Balance by<br/>GPU utilization"| POD2["vLLM Pod 2<br/>GPU: 82%"]
    POOL -->|"Balance by<br/>GPU utilization"| POD3["vLLM Pod 3<br/>GPU: 23% ←"]

The Solution

InferencePool: Model Backend Group

# InferencePool groups pods serving the same model
# (alpha API: field names differ between releases, so check the CRD you have installed)
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: llama-3-70b
  namespace: ai-inference
spec:
  # Port the model server listens on
  targetPortNumber: 8000
  # Select pods serving this model (a flat label map in the alpha API, not matchLabels)
  selector:
    model: llama-3-70b
  # Endpoint picker: the extension Service that chooses a replica per request
  extensionRef:
    name: gpu-aware-picker
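
The pool's extensionRef points at a Service named gpu-aware-picker, which the recipe does not define. A minimal sketch of such a Service follows. The pod label `app: gpu-aware-picker` and port 9002 are assumptions: 9002 is the default the alpha API suggests when no port is given, so verify both against the endpoint picker you deploy.

```yaml
# Hypothetical Service fronting the endpoint picker pods
apiVersion: v1
kind: Service
metadata:
  name: gpu-aware-picker        # Must match extensionRef.name in the InferencePool
  namespace: ai-inference       # Same namespace as the InferencePool
spec:
  selector:
    app: gpu-aware-picker       # Assumed label on the endpoint picker pods
  ports:
    - name: grpc
      protocol: TCP
      port: 9002                # Assumed default extension port
      targetPort: 9002
```

The gateway calls this Service for each request and forwards traffic to whichever pod the picker selects, so the picker needs to be running before the pool can serve traffic.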

InferenceModel: Route Configuration

# InferenceModel maps model names to InferencePool
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: llama-3-70b
  namespace: ai-inference
spec:
  modelName: llama-3-70b             # Match from request body
  criticality: Critical               # Priority level
  poolRef:
    name: llama-3-70b
---
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: mistral-7b
  namespace: ai-inference
spec:
  modelName: mistral-7b-instruct
  criticality: Standard
  poolRef:
    name: mistral-7b
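
Both entries above are at Critical or Standard priority. As a sketch of how lower-priority traffic might be declared, the entry below uses the Sheddable level. The name `mistral-7b-batch` is hypothetical, and the `targetModels` rewrite assumes the backend only serves `mistral-7b-instruct`; check both against your CRD version.

```yaml
# Hypothetical low-priority entry; "mistral-7b-batch" is an illustrative name
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: mistral-7b-batch
  namespace: ai-inference
spec:
  modelName: mistral-7b-batch        # Name clients send in the request body
  criticality: Sheddable             # Shed first when the pool is saturated
  poolRef:
    name: mistral-7b
  # Assumed: map the client-facing name to the model the backend serves
  targetModels:
    - name: mistral-7b-instruct
      weight: 100
```

Batch or background clients would then request `mistral-7b-batch`, which lets the gateway drop their traffic under load before touching interactive requests.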

Gateway API HTTPRoute for AI

# Standard Gateway API HTTPRoute pointing to InferencePool
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: ai-inference-route
  namespace: ai-inference
spec:
  parentRefs:
    - name: ai-gateway
      namespace: gateway-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-3-70b
          port: 8000

Gateway with Inference Extension

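# Gateway with inference-aware controller
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ai-gateway
  namespace: gateway-system
spec:
  gatewayClassName: envoy-gateway      # Or istio, kong
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      # Routes live in ai-inference, so cross-namespace attachment must be allowed
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              kubernetes.io/metadata.name: ai-inference
    - name: https
      port: 443
      protocol: HTTPS
      tls:
        mode: Terminate
        certificateRefs:
          - name: ai-gateway-cert
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              kubernetes.io/metadata.name: ai-inference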

vLLM Backend Deployment

# Backend pods with model labels for InferencePool selection
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  namespace: ai-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      model: llama-3-70b
  template:
    metadata:
      labels:
        model: llama-3-70b
        serving-framework: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.3
          args:
            - "--model=meta-llama/Meta-Llama-3-70B-Instruct"
            - "--tensor-parallel-size=4"
            - "--max-model-len=8192"
            - "--gpu-memory-utilization=0.90"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 4
          # vLLM serves Prometheus metrics at /metrics on the serving port by default,
          # so no extra env var is needed for an endpoint picker to read load signals

Multi-Model Routing

# Route different models to different pools
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: multi-model-route
  namespace: ai-inference
spec:
  parentRefs:
    - name: ai-gateway
      namespace: gateway-system
  rules:
    # Small models: fast, cheap
    - matches:
        - headers:
            - name: x-model-class
              value: small
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: mistral-7b
    
    # Large models: powerful, GPU-intensive
    - matches:
        - headers:
            - name: x-model-class
              value: large
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-3-70b
    
    # Default: requests without x-model-class fall through to the llama-3-70b pool
    # (HTTPRoute cannot match on the model field in the request body)
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-3-70b

Token-Based Rate Limiting

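# Per-client rate limit with Envoy Gateway (requires its global rate limit service).
# Note: this policy counts requests, not tokens, and one LLM request can use
# anywhere from 1 to 10,000 tokens, so treat it as a coarse guard only.
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: ai-rate-limit
  namespace: ai-inference            # Same namespace as the targeted HTTPRoute
spec: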
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: ai-inference-route
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: Authorization
                  type: Distinct
          limit:
            requests: 100            # Requests per window
            unit: Minute

Common Issues

Issue | Cause | Fix
Requests not routing to right model | InferenceModel modelName mismatch | Match exact model string from API request
Uneven GPU load | Round-robin ignoring GPU utilization | Use GPU-aware endpoint picker
Streaming disconnects | Gateway timeout too low | Increase timeout for SSE/streaming connections
503 during model loading | Pods not ready | Add readiness probe checking /health
Token rate limit inaccurate | Counting requests not tokens | Use token-counting middleware or KEDA scaler
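
For the streaming disconnects row, one option is the standard HTTPRoute timeouts field. The fragment below is a sketch of the ai-inference-route rule with timeouts added. The 300s values are assumptions, and support for this field varies by Gateway API implementation.

```yaml
# Fragment: the rule from ai-inference-route with timeouts added
rules:
  - matches:
      - path:
          type: PathPrefix
          value: /v1
    timeouts:
      request: 300s          # Assumed ceiling for a full streamed response
      backendRequest: 300s   # Must not exceed the request timeout
    backendRefs:
      - group: inference.networking.x-k8s.io
        kind: InferencePool
        name: llama-3-70b
        port: 8000
```

If your implementation ignores this field, look for its own timeout policy resource instead.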

Best Practices

  • Route by model name: InferenceModel maps the API model field to backend pools
  • Balance by GPU utilization, not round-robin: GPU inference load is asymmetric
  • Set long timeouts for streaming: LLM responses stream for 10-60+ seconds
  • Separate pools by model size: 7B and 70B have very different resource profiles
  • Monitor tokens/second per pool: this is the real throughput metric for LLMs
  • Use criticality levels: shed low-priority traffic before impacting critical models
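
As a sketch of the tokens/second practice, the recording rule below aggregates vLLM's generated-token counter per model. It assumes the Prometheus Operator is installed and already scraping the vLLM pods, and that your vLLM version exposes `vllm:generation_tokens_total` with a `model_name` label. The rule and resource names are illustrative.

```yaml
# Hypothetical recording rule; requires the Prometheus Operator CRDs
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-token-throughput
  namespace: ai-inference
spec:
  groups:
    - name: vllm-throughput
      rules:
        # Generated tokens per second, summed across replicas of each model
        - record: model:vllm_generation_tokens:rate1m
          expr: sum by (model_name) (rate(vllm:generation_tokens_total[1m]))
```

Because each pool in this recipe serves a single model, the per-model series doubles as a per-pool throughput figure.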

Key Takeaways

  • Kubernetes AI Gateway Working Group is building inference-aware routing
  • InferencePool groups backend pods by model; InferenceModel maps model names
  • GPU-aware load balancing routes to least-loaded replica (not round-robin)
  • Standard Gateway API HTTPRoute integrates with InferencePool backends
  • 66% of orgs run AI inference on K8s (CNCF 2026 survey); inference-aware routing targets their routing problem
  • Works with any Gateway API implementation: Envoy Gateway, Istio, Kong