# Deploy Granite 4.0 Speech on Kubernetes

Deploy the IBM Granite 4.0 1B Speech model on Kubernetes for automatic speech recognition. The lightweight 2B-parameter model runs on CPU or a small GPU for STT workloads.

💡 Quick Answer: Deploy IBM Granite 4.0 1B Speech for automatic speech recognition. At just 2B parameters, it runs on CPU (no GPU needed) or accelerates on a small GPU such as a T4 or L4. Ideal for cost-effective speech-to-text pipelines.
## The Problem

Speech-to-text on Kubernetes doesn't always justify a GPU:

- Whisper (1.5B) is great but often overkill for simple transcription
- Cost: GPU instances are expensive for intermittent STT workloads
- Latency requirements vary: batch processing doesn't need GPU speeds
- Edge deployment: some clusters don't have GPU nodes at all
Granite 4.0 1B Speech from IBM (9.2K downloads) offers a lightweight alternative that runs on CPU.
## The Solution

### Deploy Granite Speech (CPU)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: granite-speech
  namespace: ai-inference
  labels:
    app: granite-speech
spec:
  replicas: 1
  selector:
    matchLabels:
      app: granite-speech
  template:
    metadata:
      labels:
        app: granite-speech
    spec:
      containers:
        - name: granite-stt
          image: python:3.11-slim
          command:
            - /bin/bash
            - -c
            - |
              apt-get update && apt-get install -y ffmpeg
              pip install transformers torch torchaudio fastapi uvicorn soundfile
              python3 << 'PYEOF'
              import io

              import soundfile as sf
              import uvicorn
              from fastapi import FastAPI, File, UploadFile
              from transformers import pipeline

              app = FastAPI()

              pipe = pipeline(
                  "automatic-speech-recognition",
                  model="ibm-granite/granite-4.0-1b-speech",
                  device="cpu",  # or "cuda" if a GPU is available
              )

              @app.get("/health")
              def health():
                  return {"status": "ready", "model": "granite-4.0-1b-speech"}

              @app.post("/transcribe")
              async def transcribe(file: UploadFile = File(...)):
                  audio_bytes = await file.read()
                  audio, sr = sf.read(io.BytesIO(audio_bytes))
                  if audio.ndim > 1:  # downmix stereo to mono
                      audio = audio.mean(axis=1)
                  result = pipe({"raw": audio, "sampling_rate": sr})
                  return {"text": result["text"]}

              uvicorn.run(app, host="0.0.0.0", port=8000)
              PYEOF
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: 4Gi
              cpu: "4"
            limits:
              memory: 8Gi
              cpu: "8"
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
            failureThreshold: 60  # installing torch on CPU can take several minutes
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: granite-speech
  namespace: ai-inference
spec:
  selector:
    app: granite-speech
  ports:
    - port: 8000
      targetPort: 8000
```

### GPU-Accelerated Version
```yaml
# Add a GPU for 5-10x faster inference
resources:
  limits:
    nvidia.com/gpu: "1"  # T4, L4, or A10G; any small GPU works
    memory: 8Gi
    cpu: "4"
```

When a GPU is attached, also change `device="cpu"` to `device="cuda"` in the server code above.

### STT Model Comparison
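Once the Deployment is up, the Service can be smoke-tested from a workstation with `kubectl port-forward` and `curl` (a sketch; `sample.wav` is a placeholder for your own audio file):

```shell
# Forward the Service port to localhost (assumes the manifests above are applied)
kubectl -n ai-inference port-forward svc/granite-speech 8000:8000 &

# Health check: should return {"status": "ready", ...} once the model is loaded
curl -s http://localhost:8000/health

# Transcribe a local WAV file (replace sample.wav with a real recording)
curl -s -X POST http://localhost:8000/transcribe \
  -F "file=@sample.wav;type=audio/wav"
```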
| Model | Params | GPU Required | Languages | Speed (CPU) |
|---|---|---|---|---|
| Granite 4.0 1B Speech | 2B | No | Multi | ~1x realtime |
| Whisper Large v3 | 1.5B | Recommended | 99+ | ~0.3x |
| faster-whisper Large | 1.5B | Recommended | 99+ | ~1.2x |
| Whisper Tiny | 39M | No | 99+ | ~5x |

```mermaid
flowchart LR
    A[Audio Input] --> B{Deployment Type}
    B -->|CPU only| C[Granite 4.0 1B]
    B -->|GPU available| D[Whisper Large v3]
    C --> E[~1x realtime on CPU]
    D --> F[~10x realtime on GPU]
    E --> G[Cost-effective STT]
    F --> H[High-throughput STT]
```

## Common Issues
### CPU inference speed

```bash
# The 2B model on CPU processes audio at roughly real-time speed:
#   60s audio ≈ 60s processing
# For faster inference: add a T4 GPU (~$0.35/hr) for a 5-10x speedup,
# or use an HPA to scale replicas during peaks
```
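Scaling out under load with an HPA could look like the following (a sketch; the 70% CPU target and replica bounds are assumptions to tune per workload):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: granite-speech
  namespace: ai-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: granite-speech
  minReplicas: 1
  maxReplicas: 5  # assumed upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # assumed target; tune for your latency needs
```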
### Audio format compatibility

```bash
# Ensure ffmpeg is installed for format conversion
# Supported: WAV, FLAC, MP3, OGG
# Convert before sending:
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```

## Best Practices
- CPU-first deployment: no GPU needed, dramatically reduces cost
- HPA on CPU utilization: scale replicas during peak transcription load
- ffmpeg for preprocessing: normalize to 16kHz mono WAV for best results
- Pair with an LLM: transcribe, then summarize/analyze with Llama or Phi-4
- Batch processing: use Kubernetes Jobs for bulk audio transcription
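The batch-processing practice above can be sketched as a Kubernetes Job that posts every file from a mounted volume to the in-cluster Service (the PVC name `audio-pvc` and file layout are hypothetical):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bulk-transcribe
  namespace: ai-inference
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: transcriber
          image: curlimages/curl:8.8.0
          command:
            - /bin/sh
            - -c
            - |
              # POST every WAV in the mounted volume to the STT service,
              # saving each transcript next to its source file
              for f in /audio/*.wav; do
                curl -s -X POST \
                  http://granite-speech.ai-inference.svc.cluster.local:8000/transcribe \
                  -F "file=@$f;type=audio/wav" \
                  > "/audio/$(basename "$f" .wav).json"
              done
          volumeMounts:
            - name: audio
              mountPath: /audio
      volumes:
        - name: audio
          persistentVolumeClaim:
            claimName: audio-pvc  # hypothetical PVC holding the audio files
```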
## Key Takeaways

- IBM Granite 4.0 1B Speech: a 2B-parameter ASR model that runs on CPU
- No GPU required: cost-effective speech-to-text for any Kubernetes cluster
- ~1x realtime on CPU, 5-10x with a small GPU (T4, L4)
- 9.2K downloads: IBM's latest speech model
- Pair with Fish Audio TTS for a complete CPU-friendly speech pipeline
