Deploy Fish Audio TTS on Kubernetes
Deploy Fish Audio S2-Pro 5B text-to-speech model on Kubernetes for high-quality voice synthesis with multi-speaker support and streaming audio.
π‘ Quick Answer: Deploy Fish Audio S2-Pro (5B parameters) for high-quality text-to-speech on Kubernetes. Supports multi-speaker voice cloning, streaming audio output, and multiple languages. Pair with Whisper for a complete speech pipeline (STT β LLM β TTS).
The Problem
Production text-to-speech on Kubernetes needs:
- Natural-sounding voices β robotic TTS is unacceptable for modern applications
- Voice cloning β generate speech in a specific personβs voice from a short sample
- Streaming β start playing audio before the full text is synthesized
- Multi-language β support for English, Chinese, Japanese, and other languages
- Low latency β real-time enough for conversational AI
Fish Audio S2-Pro is a 5B parameter TTS model with 1.8K+ downloads, designed for high-quality voice synthesis.
The Solution
Step 1: Deploy Fish Audio S2-Pro
apiVersion: apps/v1
kind: Deployment
metadata:
name: fish-tts
namespace: ai-inference
labels:
app: fish-tts
spec:
replicas: 1
selector:
matchLabels:
app: fish-tts
template:
metadata:
labels:
app: fish-tts
spec:
containers:
- name: fish-tts
image: python:3.11-slim
command:
- /bin/bash
- -c
- |
apt-get update && apt-get install -y ffmpeg
pip install fish-speech fastapi uvicorn
python3 << 'PYEOF'
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import io
app = FastAPI()
# Initialize Fish Audio model
from fish_speech.models import load_model
model = load_model("fishaudio/s2-pro", device="cuda")
@app.get("/health")
def health():
return {"status": "ready", "model": "s2-pro"}
@app.post("/synthesize")
async def synthesize(request: dict):
text = request.get("text", "Hello world")
speaker = request.get("speaker", "default")
format = request.get("format", "wav")
audio = model.synthesize(
text=text,
speaker=speaker,
)
buf = io.BytesIO()
audio.save(buf, format=format)
buf.seek(0)
return StreamingResponse(
buf,
media_type=f"audio/{format}",
headers={"Content-Disposition": f"attachment; filename=speech.{format}"}
)
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
PYEOF
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: "1"
memory: 24Gi
cpu: "8"
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: huggingface-token
key: token
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
startupProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 120
periodSeconds: 15
failureThreshold: 16
readinessProbe:
httpGet:
path: /health
port: 8000
periodSeconds: 10
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: fish-tts-cache
---
apiVersion: v1
kind: Service
metadata:
name: fish-tts
namespace: ai-inference
spec:
selector:
app: fish-tts
ports:
- port: 8000
targetPort: 8000Step 2: Complete Speech Pipeline (STT β LLM β TTS)
apiVersion: v1
kind: ConfigMap
metadata:
name: speech-pipeline
namespace: ai-inference
data:
pipeline.py: |
"""
Full voice AI pipeline:
1. Whisper STT: Audio β Text
2. LLM: Text β Response
3. Fish TTS: Response β Audio
"""
import requests
def speech_pipeline(audio_file_path):
# Step 1: Speech-to-Text (Whisper)
with open(audio_file_path, "rb") as f:
stt_response = requests.post(
"http://whisper-server:8000/transcribe",
files={"file": f}
)
user_text = stt_response.json()["text"]
# Step 2: LLM Response
llm_response = requests.post(
"http://llama31-8b:8000/v1/chat/completions",
json={
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": user_text}],
"max_tokens": 512,
}
)
reply_text = llm_response.json()["choices"][0]["message"]["content"]
# Step 3: Text-to-Speech (Fish Audio)
tts_response = requests.post(
"http://fish-tts:8000/synthesize",
json={"text": reply_text, "format": "wav"}
)
with open("response.wav", "wb") as f:
f.write(tts_response.content)
return reply_textflowchart LR
A[User Audio] --> B[Whisper STT]
B --> C[Transcribed Text]
C --> D[LLM - Llama 3.1]
D --> E[Response Text]
E --> F[Fish Audio TTS]
F --> G[Response Audio]
subgraph Speech Pipeline
B
D
F
endCommon Issues
Audio quality varies
# Use higher sample rate for better quality
# Ensure ffmpeg is installed for format conversion
apt-get install -y ffmpegVoice cloning requires reference audio
# Provide a 10-30 second reference audio sample
# The model extracts voice characteristics from the sample
curl -X POST http://fish-tts:8000/synthesize \
-F "text=Hello, this is a test" \
-F "reference_audio=@speaker_sample.wav"Best Practices
- Single GPU β S2-Pro fits on A100 or L40S
- Stream responses β start audio playback while generating
- Pair with Whisper β complete STT β LLM β TTS pipeline
- Cache voice profiles β pre-extract speaker embeddings for faster cloning
- ffmpeg for format conversion β serve WAV, MP3, or OGG as needed
Key Takeaways
- Fish Audio S2-Pro is a 5B parameter TTS model with high-quality voice synthesis
- Supports voice cloning from short reference audio samples
- Deploy on single GPU β A100 or L40S sufficient
- Combine with Whisper (STT) + LLM for a complete voice AI pipeline
- Streaming audio output for low-latency conversational applications

Recommended
Kubernetes Recipes β The Complete Book100+ production-ready patterns with detailed explanations, best practices, and copy-paste YAML. Everything in one place.
Get the Book βLearn by Doing
CopyPasteLearn β Hands-on Cloud & DevOps CoursesMaster Kubernetes, Ansible, Terraform, and MLOps with interactive, copy-paste-run lessons. Start free.
Browse Courses βπ Deepen Your Skills β Hands-on Courses
Courses by CopyPasteLearn.com β Learn IT by Doing
