ai · intermediate · ⏱ 15 minutes · K8s 1.28+

Deploy Qwen3 TTS on Kubernetes

Deploy Qwen3-TTS-12Hz-1.7B-CustomVoice on Kubernetes for text-to-speech with custom voice cloning: 1.13M downloads, lightweight single-GPU deployment.

By Luca Berton • 📖 5 min read

💡 Quick Answer: Deploy Qwen3-TTS (1.7B parameters) for text-to-speech with custom voice cloning. At just 1.7B params, it runs on a small GPU (T4, L4) or even CPU. 1.13M downloads and 1.28K likes — one of the most popular open TTS models. Supports custom voice creation from short audio samples.

The Problem

Text-to-speech on Kubernetes needs to be:

  • Lightweight — not every TTS task justifies an A100
  • Customizable — enterprise apps need branded voices, not generic ones
  • Multilingual — Chinese, English, and other languages from one model
  • Low latency — conversational AI needs fast synthesis

Qwen3-TTS at 1.7B parameters with 1.13M downloads is the sweet spot — high quality, low resource requirements, with voice cloning.

The Solution

Deploy Qwen3-TTS

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3-tts
  namespace: ai-inference
  labels:
    app: qwen3-tts
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen3-tts
  template:
    metadata:
      labels:
        app: qwen3-tts
    spec:
      containers:
        - name: tts
          image: python:3.11-slim  # demo only; bake dependencies into an image for production
          command:
            - /bin/bash
            - -c
            - |
              # Installing at startup keeps the example self-contained but adds
              # minutes to pod start; the startupProbe below allows for this.
              apt-get update && apt-get install -y ffmpeg
              pip install transformers torch torchaudio \
                fastapi uvicorn soundfile

              python3 << 'PYEOF'
              import torch
              from transformers import AutoModelForCausalLM, AutoTokenizer
              from fastapi import FastAPI
              from fastapi.responses import StreamingResponse
              import io, soundfile as sf

              app = FastAPI()

              model_name = "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"
              tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
              model = AutoModelForCausalLM.from_pretrained(
                  model_name,
                  torch_dtype=torch.float16,
                  device_map="auto",
                  trust_remote_code=True,
              )

              @app.get("/health")
              def health():
                  return {"status": "ready", "model": "qwen3-tts"}

              @app.post("/synthesize")
              async def synthesize(request: dict):
                  text = request.get("text", "Hello world")
                  # Optional reference voice; wiring it into generation depends
                  # on the model's remote code, so it is unused in this sketch.
                  speaker = request.get("speaker", None)

                  # Generate audio tokens
                  inputs = tokenizer(text, return_tensors="pt").to(model.device)
                  with torch.no_grad():
                      outputs = model.generate(
                          **inputs,
                          max_new_tokens=2048,
                      )

                  # Decode audio
                  audio = tokenizer.decode_audio(outputs[0])

                  buf = io.BytesIO()
                  sf.write(buf, audio, 24000, format="WAV")
                  buf.seek(0)

                  return StreamingResponse(buf, media_type="audio/wav")

              import uvicorn
              uvicorn.run(app, host="0.0.0.0", port=8000)
              PYEOF
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: 8Gi
              cpu: "4"
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
            failureThreshold: 30  # dependency install + model download can take several minutes
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 10
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: qwen3-tts-cache
---
apiVersion: v1
kind: Service
metadata:
  name: qwen3-tts
  namespace: ai-inference
spec:
  selector:
    app: qwen3-tts
  ports:
    - port: 8000
      targetPort: 8000
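
The Deployment above mounts its Hugging Face cache from a PersistentVolumeClaim named qwen3-tts-cache that is not defined in the manifest. A minimal sketch (the 20Gi size is an assumption; pick a storage class and capacity that fit your cluster):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-tts-cache
  namespace: ai-inference
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi   # assumption: room for model weights plus the HF cache
```

Caching the weights means only the first pod start pays the download cost.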

TTS Model Comparison

| Model             | Params | GPU Required | Custom Voice | Downloads |
|-------------------|--------|--------------|--------------|-----------|
| Qwen3-TTS 1.7B    | 1.7B   | T4/L4/CPU    | Yes          | 1.13M     |
| Fish Audio S2-Pro | 5B     | A100/L40S    | Yes          | 1.8K      |
| HumeAI TADA 1B    | 2B     | T4/L4        | Emotional    | 5.6K      |
| Bark              | ~1B    | T4/L4/CPU    | Limited      | 800K+     |

Complete Voice AI Pipeline

flowchart LR
    A[User Speech] --> B[Granite Speech STT]
    B --> C[Text]
    C --> D[LLM - Llama or Phi-4]
    D --> E[Response Text]
    E --> F[Qwen3-TTS]
    F --> G[Response Audio]
    subgraph small["CPU or Small GPU"]
        B
        F
    end
    subgraph gpu["GPU Required"]
        D
    end
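
The flow above can be sketched as a thin orchestration step. Each stage is an injectable callable so the wiring can be exercised without a running cluster; the STT and LLM services are hypothetical placeholders, and only the TTS endpoint corresponds to the Service defined in this recipe:

```python
import json


def voice_turn(audio_in: bytes, stt, llm, tts) -> bytes:
    """One conversational turn: speech in -> text -> reply text -> speech out.

    stt/llm/tts are callables; in a real deployment each would POST to its
    Service (e.g. http://qwen3-tts.ai-inference:8000/synthesize for TTS).
    """
    text = stt(audio_in)    # Granite Speech STT (assumed service)
    reply = llm(text)       # Llama / Phi-4 (assumed service)
    return tts(reply)       # Qwen3-TTS (this recipe)


def tts_body(text: str) -> bytes:
    """Build the JSON body expected by the /synthesize endpoint above."""
    return json.dumps({"text": text}).encode("utf-8")
```

Stubbing the three stages makes the orchestration unit-testable before any GPU is provisioned.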

Common Issues

CPU-only deployment

# 1.7B model can run on CPU β€” slower but works
resources:
  requests:
    memory: 8Gi
    cpu: "8"
  limits:
    memory: 16Gi
    cpu: "16"
# Remove nvidia.com/gpu from limits
# Change device_map to "cpu" in code
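
Rather than hand-editing the manifest, the GPU limit can be dropped with a JSON patch; a sketch assuming the Deployment name and namespace from this recipe (`~1` escapes the `/` in the resource name, per JSON Pointer / RFC 6901):

```shell
kubectl patch deployment qwen3-tts -n ai-inference --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/resources/limits/nvidia.com~1gpu"}]'
```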

Custom voice cloning

# Provide 10-30 seconds of reference audio
# The CustomVoice variant specifically supports voice cloning; note that the
# example server above only reads `text` and `speaker`, so extend it before
# relying on `reference_audio_url`.
curl -X POST http://qwen3-tts:8000/synthesize \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Kubernetes pods are the smallest deployable units.",
    "reference_audio_url": "https://example.com/voice-sample.wav"
  }'

Best Practices

  • T4 or L4 GPU — 1.7B model only needs ~4GB VRAM
  • CPU fallback — works without GPU, just slower
  • Custom voice — use the CustomVoice variant for voice cloning
  • 24kHz output — standard high-quality audio
  • Pair with lightweight STT — Granite Speech + Qwen3-TTS = full pipeline on small GPUs

Key Takeaways

  • Qwen3-TTS: 1.7B parameter TTS model with 1.13M downloads
  • Runs on T4, L4, or even CPU — no expensive GPU needed
  • Custom voice cloning from short audio samples
  • 12Hz audio token rate — efficient generation
  • Pair with Granite Speech (STT) for a complete voice pipeline on minimal hardware
#qwen3 #text-to-speech #tts #voice-cloning #custom-voice #ai
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
