ai | beginner | 15 minutes | K8s 1.28+

Deploy Granite 4.0 Speech on Kubernetes

Deploy the IBM Granite 4.0 1B Speech model on Kubernetes for automatic speech recognition (ASR). The lightweight 2B-parameter model runs on a CPU or a small GPU for speech-to-text (STT) workloads.

By Luca Berton • 5 min read

💡 Quick Answer: Deploy IBM Granite 4.0 1B Speech for automatic speech recognition. At just 2B parameters, it runs on CPU with no GPU required, and runs faster on a small GPU such as a T4 or L4. It is well suited to cost-effective speech-to-text pipelines.

The Problem

Speech-to-text on Kubernetes doesn't always justify a GPU:

  • Whisper (1.5B) is great but often overkill for simple transcription
  • Cost: GPU instances are expensive for intermittent STT workloads
  • Latency requirements vary: batch processing doesn't need GPU speeds
  • Edge deployment: some clusters don't have GPU nodes at all

Granite 4.0 1B Speech from IBM (9.2K downloads) offers a lightweight alternative that runs on CPU.

The Solution

Deploy Granite Speech (CPU)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: granite-speech
  namespace: ai-inference
  labels:
    app: granite-speech
spec:
  replicas: 1
  selector:
    matchLabels:
      app: granite-speech
  template:
    metadata:
      labels:
        app: granite-speech
    spec:
      containers:
        - name: granite-stt
          image: python:3.11-slim
          command:
            - /bin/bash
            - -c
            - |
              apt-get update && apt-get install -y ffmpeg
              # python-multipart is required by FastAPI to parse UploadFile/File form uploads
              pip install transformers torch torchaudio fastapi uvicorn soundfile python-multipart

              python3 << 'PYEOF'
              import torch
              from transformers import pipeline
              from fastapi import FastAPI, UploadFile, File
              import soundfile as sf
              import io

              app = FastAPI()

              pipe = pipeline(
                  "automatic-speech-recognition",
                  model="ibm-granite/granite-4.0-1b-speech",
                  device="cpu",  # or "cuda" if GPU available
              )

              @app.get("/health")
              def health():
                  return {"status": "ready", "model": "granite-4.0-1b-speech"}

              @app.post("/transcribe")
              async def transcribe(file: UploadFile = File(...)):
                  audio_bytes = await file.read()
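                  audio, sr = sf.read(io.BytesIO(audio_bytes), dtype="float32")
                  if audio.ndim > 1:  # downmix stereo/multi-channel input; the pipeline expects mono
                      audio = audio.mean(axis=1)
                  result = pipe({"raw": audio, "sampling_rate": sr})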
                  return {"text": result["text"]}

              import uvicorn
              uvicorn.run(app, host="0.0.0.0", port=8000)
              PYEOF
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: 4Gi
              cpu: "4"
            limits:
              memory: 8Gi
              cpu: "8"
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
            # Package install and model download run at startup; allow up to 120s + 60 x 10s
            failureThreshold: 60
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: granite-speech
  namespace: ai-inference
spec:
  selector:
    app: granite-speech
  ports:
    - port: 8000
      targetPort: 8000
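
Once the Service is up, other pods can call the `/transcribe` endpoint defined above. The following is a minimal client sketch using only the Python standard library. The form field name `file` matches the FastAPI handler; the helper names `build_multipart` and `transcribe` are illustrative, not part of any library.

```python
import json
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "audio/wav") -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def transcribe(base_url: str, wav_path: str, timeout: float = 300.0) -> str:
    """POST a WAV file to /transcribe and return the recognized text."""
    with open(wav_path, "rb") as f:
        body, header = build_multipart("file", "audio.wav", f.read())
    req = urllib.request.Request(
        f"{base_url}/transcribe",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
    # CPU inference runs at roughly real-time speed, so allow a generous timeout.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["text"]
```

From a pod in the same cluster, a call would look like `transcribe("http://granite-speech.ai-inference:8000", "sample.wav")`, assuming the Service name and namespace from the manifest; `sample.wav` is a placeholder file.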

GPU-Accelerated Version

# Add a GPU for 5-10x faster inference (also set device="cuda" in the pipeline call)
resources:
  limits:
    nvidia.com/gpu: "1"  # T4, L4, or A10G; any small GPU works
    memory: 8Gi
    cpu: "4"
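
To use the same container on both CPU-only and GPU nodes, the device can be chosen at startup instead of being hardcoded. This is a small sketch; `pick_device` is a hypothetical helper name, and it relies only on `torch.cuda.is_available()`.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works where torch is absent
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

In the server script, the pipeline call would then pass `device=pick_device()` in place of `device="cpu"`.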

STT Model Comparison

| Model                 | Params | GPU Required | Languages | Speed on CPU (x realtime) |
|-----------------------|--------|--------------|-----------|---------------------------|
| Granite 4.0 1B Speech | 2B     | No           | Multi     | ~1x                       |
| Whisper Large v3      | 1.5B   | Recommended  | 99+       | ~0.3x                     |
| faster-whisper Large  | 1.5B   | Recommended  | 99+       | ~1.2x                     |
| Whisper Tiny          | 39M    | No           | 99+       | ~5x                       |

flowchart LR
    A[Audio Input] --> B{Deployment Type}
    B -->|CPU only| C[Granite 4.0 1B]
    B -->|GPU available| D[Whisper Large v3]
    C --> E[~1x realtime on CPU]
    D --> F[~10x realtime on GPU]
    E --> G[Cost-effective STT]
    F --> H[High-throughput STT]

Common Issues

CPU inference speed

# The 2B model on CPU processes audio at roughly real-time speed
# 60s of audio ≈ 60s of processing
# For faster inference: add a T4 GPU ($0.35/hr) → 5-10x speedup
# Or use an HPA to scale replicas during peaks
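
The real-time figure above translates directly into a replica estimate for HPA limits. The sketch below is back-of-the-envelope arithmetic, not a benchmark. It assumes each replica handles one request at a time, and the 0.7 target utilization is an illustrative default.

```python
import math


def replicas_needed(audio_seconds_per_hour: float,
                    realtime_factor: float = 1.0,
                    target_utilization: float = 0.7) -> int:
    """Estimate replicas needed to keep up with a given audio volume.

    realtime_factor: seconds of audio transcribed per wall-clock second
    by one replica (about 1.0 on CPU, per the figures above).
    target_utilization: fraction of each replica's capacity to plan for.
    """
    if realtime_factor <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("realtime_factor must be > 0 and target_utilization in (0, 1]")
    capacity_per_replica = 3600 * realtime_factor * target_utilization
    return max(1, math.ceil(audio_seconds_per_hour / capacity_per_replica))
```

For example, a peak of 10 hours of audio arriving per hour (36,000 seconds) at 1x realtime and 70% utilization gives `replicas_needed(36000)`, which evaluates to 15 CPU replicas.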

Audio format compatibility

# Ensure ffmpeg is installed for format conversion
# Supported: WAV, FLAC, MP3, OGG
# Convert before sending: ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
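
The same conversion can be scripted before upload. This sketch wraps the ffmpeg command shown above and adds `-y` to overwrite existing output. The function names are illustrative, and `ffmpeg` is assumed to be on the `PATH`.

```python
import subprocess


def ffmpeg_to_wav_cmd(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build the ffmpeg command that converts src to mono WAV at sample_rate Hz."""
    return ["ffmpeg", "-y", "-i", src, "-ar", str(sample_rate), "-ac", "1", dst]


def convert_to_wav(src: str, dst: str, sample_rate: int = 16000) -> None:
    """Run ffmpeg; raises subprocess.CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_to_wav_cmd(src, dst, sample_rate),
                   check=True, capture_output=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.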

Best Practices

  • CPU-first deployment: no GPU needed, which dramatically reduces cost
  • HPA on CPU utilization: scale replicas during peak transcription load
  • ffmpeg for preprocessing: normalize to 16 kHz mono WAV for best results
  • Pair with an LLM: transcribe → summarize/analyze with Llama or Phi-4
  • Batch processing: use Kubernetes Jobs for bulk audio transcription
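
For the batch-processing practice, one possible approach is an Indexed Job where each pod transcribes its own slice of the file list. The sketch below shows only the sharding step. Kubernetes sets `JOB_COMPLETION_INDEX` for Indexed Jobs; the helper names and the round-robin split are illustrative assumptions.

```python
import os


def shard(files: list[str], index: int, total: int) -> list[str]:
    """Return the files assigned to worker `index` out of `total` workers."""
    if total < 1 or not 0 <= index < total:
        raise ValueError("need total >= 1 and 0 <= index < total")
    # Sort first so every pod sees the same order and the shards never overlap.
    return sorted(files)[index::total]


def my_shard(files: list[str], total: int) -> list[str]:
    """Pick this pod's shard from JOB_COMPLETION_INDEX (defaults to 0 outside a Job)."""
    index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
    return shard(files, index, total)
```

Each pod would call `my_shard(all_files, total)` with `total` equal to the Job's `completions` value, then send its files to the transcription service one at a time.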

Key Takeaways

  • IBM Granite 4.0 1B Speech: 2B parameter ASR model that runs on CPU
  • No GPU required: cost-effective speech-to-text for any Kubernetes cluster
  • ~1x realtime on CPU, 5-10x with a small GPU (T4, L4)
  • 9.2K downloads: IBM's latest speech model
  • Pair with Fish Audio TTS for a complete CPU-friendly speech pipeline
#granite #ibm #speech-recognition #stt #asr #lightweight #cpu-inference #ai
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
