Test LLM Inference Endpoints with curl
Validate Kubernetes-hosted LLM inference services using curl against OpenAI-compatible /v1/models, /v1/completions, and /v1/chat/completions endpoints.
π‘ Quick Answer: First call
/v1/modelsto discover the exact model ID. Then use that ID in/v1/completionsfor text generation. Base models (Mistral-7B-v0.1) only support/v1/completions; instruct-tuned models also support/v1/chat/completions. Use-kflag if TLS certificates are self-signed.
Both vLLM and NVIDIA NIM expose an OpenAI-compatible REST API. This recipe shows how to test every endpoint systematically.
Step 1: Discover the Model ID
Always start by listing available models:
curl -k https://<inference-endpoint>/v1/modelsvLLM Response
{
"object": "list",
"data": [{
"id": "/data/Mistral-7B-v0.1",
"object": "model",
"owned_by": "vllm",
"max_model_len": 32768
}]
}NIM Response
{
"object": "list",
"data": [{
"id": "Mistral-7B-v0.1",
"object": "model",
"owned_by": "system",
"max_model_len": 32768
}]
}Critical: The id field is the exact string you must use in all subsequent requests. vLLM uses the path (/data/Mistral-7B-v0.1); NIM uses the served name (Mistral-7B-v0.1).
Step 2: Text Completion
curl -k -X POST https://<inference-endpoint>/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<model-id-from-step-1>",
"prompt": "Write a one-line greeting:",
"max_tokens": 32
}'Successful Response
{
"id": "cmpl-abc123",
"object": "text_completion",
"choices": [{
"text": " Hello! Welcome to the world of AI.",
"index": 0,
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 7,
"completion_tokens": 9,
"total_tokens": 16
}
}Step 3: Chat Completion (Instruct Models Only)
curl -k -X POST https://<inference-endpoint>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<model-id>",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Kubernetes?"}
],
"max_tokens": 64
}'Note: This only works with instruct-tuned models that have a chat template (e.g., Mistral-7B-Instruct-v0.2). Base models return:
{
"error": {
"message": "Model does not have a default chat template defined in the tokenizer.",
"code": 500
}
}Step 4: Health Check
# Basic health
curl -k https://<inference-endpoint>/v1/models
# Some deployments also expose
curl -k https://<inference-endpoint>/v1/health
curl -k https://<inference-endpoint>/healthIf /v1/models returns valid JSON, the backend is alive.
Common Errors and Fixes
| Error | Cause | Fix |
|---|---|---|
"The model X does not exist" (404) | Model ID mismatch | Copy exact id from /v1/models |
"does not have a default chat template" (500) | Using /chat/completions with base model | Use /v1/completions instead |
activator request timeout | Backend never initialized | Check pod logs for TRT-LLM errors |
curl: (60) SSL certificate problem | Self-signed or wrong SAN | Use -k or fix certificate SANs |
| Connection refused | Pod not running or service misconfigured | Check kubectl get pods and kubectl get svc |
| Empty response / hangs | Model still loading or GPU issue | Wait for startup; check logs |
TLS Certificate Issues
If the inference route uses internal certificates:
# Skip TLS verification (testing only)
curl -k https://<endpoint>/v1/models
# Use custom CA bundle
curl --cacert /path/to/ca-bundle.crt https://<endpoint>/v1/modelsThe SSL error no alternative certificate subject name matches target host name means the route certificate SAN does not include the hostname. Fix the certificate, not the curl command.
Useful Parameters
# Control output length
"max_tokens": 32
# Adjust randomness
"temperature": 0.7
# Top-p sampling
"top_p": 0.9
# Get multiple responses
"n": 3
# Stream responses
"stream": trueStreaming Example
curl -k -X POST https://<endpoint>/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "<model-id>",
"prompt": "Explain Kubernetes in one paragraph:",
"max_tokens": 128,
"stream": true
}'Quick Validation Script
#!/bin/bash
ENDPOINT="https://<inference-endpoint>"
MODEL_ID="<model-id>"
echo "=== Health Check ==="
curl -sk "$ENDPOINT/v1/models" | python3 -m json.tool
echo ""
echo "=== Completion Test ==="
curl -sk -X POST "$ENDPOINT/v1/completions" \
-H "Content-Type: application/json" \
-d "{
\"model\": \"$MODEL_ID\",
\"prompt\": \"Hello, this is a test:\",
\"max_tokens\": 16
}" | python3 -m json.tool
echo ""
echo "=== Done ==="Related Recipes

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
