πŸ“šBook Signing at KubeCon EU 2026Meet us at Booking.com HQ (Mon 18:30-21:00) & vCluster booth #521 (Tue 24 Mar, 12:30-1:30pm) β€” free book giveaway!RSVP Booking.com Event
ai advanced ⏱ 25 minutes K8s 1.28+

RetinaNet Object Detection on K8s

Deploy RetinaNet object detection model on Kubernetes with Triton Inference Server, TensorRT optimization, and batch processing pipelines.

By Luca Berton β€’ β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Export RetinaNet to ONNX, optimize with TensorRT, and serve via Triton Inference Server on Kubernetes. Use batch processing for throughput (up to 100+ images/sec on A100) or real-time inference with <20ms latency.

The Problem

Deploying computer vision models like RetinaNet for production object detection requires:

  • Low latency β€” real-time detection for video streams needs <30ms per frame
  • High throughput β€” batch processing thousands of images for offline analysis
  • Model optimization β€” raw PyTorch models are 3-5x slower than TensorRT-optimized versions
  • Scaling β€” handling variable load from multiple camera feeds or batch upload spikes

The Solution

Step 1: Export RetinaNet to ONNX

apiVersion: batch/v1
kind: Job
metadata:
  name: retinanet-export
  namespace: ai-inference
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: export
          image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
          command:
            - /bin/bash
            - -c
            - |
              pip install torchvision onnx onnxruntime

              python3 << 'EOF'
              import torch
              import torchvision

              # Load pretrained RetinaNet
              model = torchvision.models.detection.retinanet_resnet50_fpn_v2(
                  weights=torchvision.models.detection.RetinaNet_ResNet50_FPN_V2_Weights.DEFAULT
              )
              model.eval()

              # Export to ONNX
              dummy_input = torch.randn(1, 3, 800, 800)
              torch.onnx.export(
                  model, dummy_input,
                  "/models/retinanet/1/model.onnx",
                  opset_version=17,
                  input_names=["images"],
                  output_names=["boxes", "scores", "labels"],
                  dynamic_axes={
                      "images": {0: "batch_size"},
                      "boxes": {0: "batch_size"},
                      "scores": {0: "batch_size"},
                      "labels": {0: "batch_size"},
                  }
              )
              print("ONNX export complete")
              EOF
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: 16Gi
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-repository

Step 2: Create Triton Model Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: retinanet-config
  namespace: ai-inference
data:
  config.pbtxt: |
    name: "retinanet"
    platform: "onnxruntime_onnx"
    max_batch_size: 16

    input [
      {
        name: "images"
        data_type: TYPE_FP32
        dims: [ 3, 800, 800 ]
      }
    ]

    output [
      {
        name: "boxes"
        data_type: TYPE_FP32
        dims: [ -1, 4 ]
      },
      {
        name: "scores"
        data_type: TYPE_FP32
        dims: [ -1 ]
      },
      {
        name: "labels"
        data_type: TYPE_INT64
        dims: [ -1 ]
      }
    ]

    dynamic_batching {
      preferred_batch_size: [ 4, 8, 16 ]
      max_queue_delay_microseconds: 100000
    }

    instance_group [
      {
        count: 2
        kind: KIND_GPU
      }
    ]

Step 3: Deploy Triton with RetinaNet

apiVersion: apps/v1
kind: Deployment
metadata:
  name: retinanet-triton
  namespace: ai-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: retinanet-triton
  template:
    metadata:
      labels:
        app: retinanet-triton
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:25.01-py3
          args:
            - "tritonserver"
            - "--model-repository=/models"
            - "--log-verbose=0"
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: 16Gi
              cpu: "8"
          volumeMounts:
            - name: models
              mountPath: /models
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-repository
---
apiVersion: v1
kind: Service
metadata:
  name: retinanet-triton
  namespace: ai-inference
spec:
  selector:
    app: retinanet-triton
  ports:
    - port: 8000
      targetPort: 8000
      name: http
    - port: 8001
      targetPort: 8001
      name: grpc

Step 4: Client Inference

# Test with curl
kubectl run test-retinanet --rm -it --image=python:3.11-slim -- bash -c '
pip install tritonclient[http] pillow numpy

python3 << "EOF"
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("retinanet-triton:8000")

# Create test image (800x800 RGB)
image = np.random.rand(1, 3, 800, 800).astype(np.float32)

inputs = [httpclient.InferInput("images", image.shape, "FP32")]
inputs[0].set_data_from_numpy(image)

outputs = [
    httpclient.InferRequestedOutput("boxes"),
    httpclient.InferRequestedOutput("scores"),
    httpclient.InferRequestedOutput("labels"),
]

result = client.infer("retinanet", inputs, outputs=outputs)
boxes = result.as_numpy("boxes")
scores = result.as_numpy("scores")
labels = result.as_numpy("labels")

# Filter high-confidence detections
mask = scores > 0.5
print(f"Detected {mask.sum()} objects")
print(f"Labels: {labels[mask]}")
print(f"Scores: {scores[mask]}")
EOF
'
flowchart TD
    A[Image Input] --> B[Preprocessing]
    B --> C[Triton Server]
    C --> D[Dynamic Batching]
    D --> E[RetinaNet ONNX/TensorRT]
    E --> F[GPU Inference]
    F --> G[Boxes + Scores + Labels]
    G --> H[Post-processing]
    H --> I[Filtered Detections]

Common Issues

Dynamic batching not triggering

# Increase max queue delay for better batching
max_queue_delay_microseconds: 200000  # 200ms
# Or reduce preferred batch sizes for lower latency
preferred_batch_size: [ 2, 4 ]

ONNX export fails with detection models

# RetinaNet has custom ops β€” use opset 17+
# If export fails, try tracing instead of scripting:
torch.onnx.export(model, dummy_input, "model.onnx",
    opset_version=17, do_constant_folding=True)

TensorRT optimization for maximum throughput

# Convert ONNX to TensorRT engine
trtexec --onnx=model.onnx \
  --saveEngine=model.plan \
  --fp16 \
  --optShapes=images:8x3x800x800 \
  --maxShapes=images:16x3x800x800 \
  --minShapes=images:1x3x800x800

Best Practices

  • TensorRT optimization β€” 2-3x speedup over ONNX Runtime on NVIDIA GPUs
  • Dynamic batching β€” accumulate requests for better GPU utilization
  • FP16 inference β€” halves memory with negligible accuracy loss for detection
  • Multiple model instances β€” run 2 instances per GPU for pipeline overlap
  • Preprocess on CPU β€” resize and normalize images before sending to GPU

Key Takeaways

  • Export RetinaNet to ONNX then optionally convert to TensorRT for maximum performance
  • Triton’s dynamic batching accumulates requests for better GPU utilization
  • Single A100 can process 100+ images/sec with TensorRT FP16 optimization
  • Use 2 model instances per GPU to overlap data transfer and computation
  • COCO-pretrained RetinaNet detects 80 object classes out of the box
#retinanet #object-detection #computer-vision #triton #tensorrt #nvidia #inference #ai
Luca Berton
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.

Kubernetes Recipes book cover

Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.

Luca Berton Ansible Pilot Ansible by Example Open Empower K8s Recipes Terraform Pilot CopyPasteLearn ProteinLens