
Test LLM Inference Endpoints with curl

Validate Kubernetes-hosted LLM inference services using curl against OpenAI-compatible /v1/models, /v1/completions, and /v1/chat/completions endpoints.

By Luca Berton β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: First call /v1/models to discover the exact model ID. Then use that ID in /v1/completions for text generation. Base models (e.g., Mistral-7B-v0.1) only support /v1/completions; instruct-tuned models also support /v1/chat/completions. Use the -k flag if TLS certificates are self-signed.

Both vLLM and NVIDIA NIM expose an OpenAI-compatible REST API. This recipe shows how to test every endpoint systematically.

Step 1: Discover the Model ID

Always start by listing available models:

curl -k https://<inference-endpoint>/v1/models

vLLM Response

{
  "object": "list",
  "data": [{
    "id": "/data/Mistral-7B-v0.1",
    "object": "model",
    "owned_by": "vllm",
    "max_model_len": 32768
  }]
}

NIM Response

{
  "object": "list",
  "data": [{
    "id": "Mistral-7B-v0.1",
    "object": "model",
    "owned_by": "system",
    "max_model_len": 32768
  }]
}

Critical: The id field is the exact string you must use in all subsequent requests. vLLM uses the path (/data/Mistral-7B-v0.1); NIM uses the served name (Mistral-7B-v0.1).
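In scripts, you can extract the id mechanically instead of copying it by hand. A minimal sketch using python3 (the same tool the validation script later in this recipe pipes through), shown here against a captured vLLM response:

```shell
# Captured /v1/models response; in practice use:
#   RESPONSE=$(curl -sk https://<inference-endpoint>/v1/models)
RESPONSE='{"object":"list","data":[{"id":"/data/Mistral-7B-v0.1","object":"model","owned_by":"vllm"}]}'

# Pull the first model id out of the list
MODEL_ID=$(echo "$RESPONSE" | python3 -c 'import sys, json; print(json.load(sys.stdin)["data"][0]["id"])')
echo "$MODEL_ID"   # /data/Mistral-7B-v0.1
```

If the server hosts several models, loop over `data` instead of taking index 0.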

Step 2: Text Completion

curl -k -X POST https://<inference-endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-id-from-step-1>",
    "prompt": "Write a one-line greeting:",
    "max_tokens": 32
  }'

Successful Response

{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "choices": [{
    "text": " Hello! Welcome to the world of AI.",
    "index": 0,
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 7,
    "completion_tokens": 9,
    "total_tokens": 16
  }
}
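Rather than eyeballing the JSON, a script can pull out just the generated text and the stop reason. A sketch against a captured response (pipe the curl output in practice):

```shell
# Captured /v1/completions response; replace with real curl output
RESP='{"object":"text_completion","choices":[{"text":" Hello! Welcome.","index":0,"finish_reason":"stop"}],"usage":{"total_tokens":16}}'

# Print the generated text and why generation stopped
echo "$RESP" | python3 -c 'import sys, json
r = json.load(sys.stdin)
c = r["choices"][0]
print("text:", c["text"].strip())
print("finish_reason:", c["finish_reason"])'
```

A `finish_reason` of `length` instead of `stop` means the output was cut off by `max_tokens`.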

Step 3: Chat Completion (Instruct Models Only)

curl -k -X POST https://<inference-endpoint>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-id>",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Kubernetes?"}
    ],
    "max_tokens": 64
  }'

Note: This only works with instruct-tuned models that have a chat template (e.g., Mistral-7B-Instruct-v0.2). Base models return:

{
  "error": {
    "message": "Model does not have a default chat template defined in the tokenizer.",
    "code": 500
  }
}
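A script can detect this error and fall back to /v1/completions automatically. A sketch, assuming the error shape shown above:

```shell
# Captured error response from a base model on /v1/chat/completions
RESP='{"error":{"message":"Model does not have a default chat template defined in the tokenizer.","code":500}}'

# If the response carries an "error" key, treat the model as a base model
if echo "$RESP" | python3 -c 'import sys, json; sys.exit(0 if "error" in json.load(sys.stdin) else 1)'; then
  echo "base model detected: use /v1/completions instead"
fi
```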

Step 4: Health Check

# Basic health
curl -k https://<inference-endpoint>/v1/models

# Some deployments also expose
curl -k https://<inference-endpoint>/v1/health
curl -k https://<inference-endpoint>/health

If /v1/models returns valid JSON, the backend is alive.
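For automation, it helps to validate the response shape rather than just the HTTP status. A sketch with `is_healthy` as a hypothetical helper name:

```shell
# Returns 0 if the JSON looks like a valid /v1/models listing, 1 otherwise
is_healthy() {
  echo "$1" | python3 -c 'import sys, json
try:
    d = json.load(sys.stdin)
except ValueError:
    sys.exit(1)
sys.exit(0 if d.get("object") == "list" and d.get("data") else 1)' 2>/dev/null
}

# Usage against a live endpoint:
#   is_healthy "$(curl -sk https://<inference-endpoint>/v1/models)" && echo "backend is alive"
```

This catches the common failure modes in one check: HTML error pages, empty bodies, and an empty model list.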

Common Errors and Fixes

| Error | Cause | Fix |
| --- | --- | --- |
| "The model X does not exist" (404) | Model ID mismatch | Copy the exact id from /v1/models |
| "does not have a default chat template" (500) | Using /v1/chat/completions with a base model | Use /v1/completions instead |
| activator request timeout | Backend never initialized | Check pod logs for TRT-LLM errors |
| curl: (60) SSL certificate problem | Self-signed or wrong SAN | Use -k or fix certificate SANs |
| Connection refused | Pod not running or Service misconfigured | Check kubectl get pods and kubectl get svc |
| Empty response / hangs | Model still loading or GPU issue | Wait for startup; check logs |

TLS Certificate Issues

If the inference route uses internal certificates:

# Skip TLS verification (testing only)
curl -k https://<endpoint>/v1/models

# Use custom CA bundle
curl --cacert /path/to/ca-bundle.crt https://<endpoint>/v1/models

The SSL error "no alternative certificate subject name matches target host name" means the route certificate's SAN does not include the hostname. Fix the certificate, not the curl command.
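To see which SANs a certificate actually carries, inspect it with openssl. The sketch below generates a throwaway self-signed certificate purely to illustrate the inspection step; in practice, dump the route's live certificate instead (the s_client line shown in the comment):

```shell
# Illustration only: create a self-signed cert with one SAN.
# For a live endpoint, fetch the presented cert instead:
#   openssl s_client -connect <endpoint>:443 </dev/null 2>/dev/null | openssl x509 -outform pem > /tmp/demo-cert.pem
openssl req -x509 -newkey rsa:2048 -nodes -days 1 -subj "/CN=demo" \
  -addext "subjectAltName=DNS:inference.example.internal" \
  -keyout /tmp/demo-key.pem -out /tmp/demo-cert.pem 2>/dev/null

# Print the Subject Alternative Names; your endpoint hostname must appear here
openssl x509 -in /tmp/demo-cert.pem -noout -ext subjectAltName
```

If the hostname you curl is missing from that list, curl error 60 is expected and -k is only masking the real problem.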

Useful Parameters

# Control output length
"max_tokens": 32

# Adjust randomness
"temperature": 0.7

# Top-p sampling
"top_p": 0.9

# Get multiple responses
"n": 3

# Stream responses
"stream": true

Streaming Example

curl -k -X POST https://<endpoint>/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<model-id>",
    "prompt": "Explain Kubernetes in one paragraph:",
    "max_tokens": 128,
    "stream": true
  }'
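With "stream": true, the response arrives as Server-Sent Events: each line is `data: {json}` and the stream ends with `data: [DONE]` (add -N to curl so it does not buffer). A sketch that reassembles the text chunks from a captured stream:

```shell
# Captured SSE lines; in practice pipe:  curl -skN ... | while ...
STREAM='data: {"choices":[{"text":"Kubernetes is"}]}
data: {"choices":[{"text":" a container orchestrator."}]}
data: [DONE]'

# Strip the "data: " prefix, stop at [DONE], and concatenate the text chunks
echo "$STREAM" | while IFS= read -r line; do
  payload=${line#data: }
  [ "$payload" = "[DONE]" ] && break
  echo "$payload" | python3 -c 'import sys, json; sys.stdout.write(json.load(sys.stdin)["choices"][0]["text"])'
done
echo   # prints: Kubernetes is a container orchestrator.
```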

Quick Validation Script

#!/bin/bash
ENDPOINT="https://<inference-endpoint>"
MODEL_ID="<model-id>"

echo "=== Health Check ==="
curl -sk "$ENDPOINT/v1/models" | python3 -m json.tool

echo ""
echo "=== Completion Test ==="
curl -sk -X POST "$ENDPOINT/v1/completions" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL_ID\",
    \"prompt\": \"Hello, this is a test:\",
    \"max_tokens\": 16
  }" | python3 -m json.tool

echo ""
echo "=== Done ==="
#llm #inference #curl #openai-api #testing #vllm #nvidia-nim
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
