AI β€’ Advanced β€’ ⏱ 25 minutes β€’ K8s 1.28+

Deploy Qwen3 Coder 80B on Kubernetes

Deploy Qwen3-Coder-Next 80B on Kubernetes for code generation, review, and refactoring: a production-ready, self-hosted AI coding assistant with multi-GPU serving.

By Luca Berton β€’ πŸ“– 5 min read

πŸ’‘ Quick Answer: Deploy Qwen3-Coder-Next (80B) with vLLM using --tensor-parallel-size 2 on 2x A100 80GB. Purpose-built for code generation, review, and refactoring. 1.16M downloads β€” one of the most popular open coding models. Supports 200+ programming languages.

The Problem

Self-hosted AI coding assistants need:

  • Code quality β€” generate production-ready code, not just snippets
  • Long context β€” understand entire codebases, not just single files
  • Privacy β€” proprietary code stays on your infrastructure
  • Integration β€” OpenAI-compatible API for IDE plugins and CI/CD pipelines
  • Cost β€” avoid per-token API pricing for high-volume teams

Qwen3-Coder-Next at 80B parameters (1.16M downloads, 1.12K likes) is a leading open coding model.

The Solution

Deploy Qwen3-Coder-Next 80B

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen3-coder
  namespace: ai-inference
  labels:
    app: qwen3-coder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen3-coder
  template:
    metadata:
      labels:
        app: qwen3-coder
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "Qwen/Qwen3-Coder-Next"
            - "--tensor-parallel-size"
            - "2"
            - "--max-model-len"
            - "65536"
            - "--gpu-memory-utilization"
            - "0.92"
            - "--max-num-seqs"
            - "16"
            - "--enable-chunked-prefill"
            - "--trust-remote-code"
            - "--port"
            - "8000"
          ports:
            - containerPort: 8000
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
          resources:
            limits:
              nvidia.com/gpu: "2"
              memory: 96Gi
              cpu: "16"
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: shm
              mountPath: /dev/shm
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300
            periodSeconds: 30
            failureThreshold: 20
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 15
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: qwen3-coder-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 16Gi
---
apiVersion: v1
kind: Service
metadata:
  name: qwen3-coder
  namespace: ai-inference
spec:
  selector:
    app: qwen3-coder
  ports:
    - port: 8000
      targetPort: 8000
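
The Deployment mounts a PersistentVolumeClaim named qwen3-coder-cache that is not shown above. A minimal sketch of a matching claim; the size and storage class are assumptions, but the claim needs headroom for roughly 160 GB of BF16 weights plus tokenizer and config files:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen3-coder-cache
  namespace: ai-inference
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 300Gi            # assumption: weights + HF cache overhead
  # storageClassName: fast-ssd  # assumption: set to an SSD-backed class in your cluster
```

An SSD-backed class matters here: the first pod start downloads and reads the full checkpoint, and slow storage can push the startup probe past its failure threshold.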

IDE Integration (Continue.dev / Copilot Alternative)

{
  "models": [
    {
      "title": "Qwen3 Coder",
      "provider": "openai",
      "model": "Qwen/Qwen3-Coder-Next",
      "apiBase": "http://qwen3-coder.ai-inference.svc:8000/v1",
      "apiKey": "not-needed"
    }
  ]
}
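
Under the hood, Continue.dev issues standard OpenAI-style chat completion requests against the apiBase above. A minimal sketch of the same request in Python; note the in-cluster service URL only resolves inside the cluster, so for a laptop test you would port-forward the Service (e.g. kubectl -n ai-inference port-forward svc/qwen3-coder 8000:8000) and point at localhost instead:

```python
import json

# In-cluster base URL from the Continue.dev config; only resolvable inside the cluster.
API_BASE = "http://qwen3-coder.ai-inference.svc:8000/v1"

def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build the JSON body for POST <API_BASE>/chat/completions."""
    return {
        "model": "Qwen/Qwen3-Coder-Next",
        "messages": [
            {"role": "system", "content": "You are an expert coding assistant."},
            {"role": "user", "content": prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.1,  # low temperature: code generation favors determinism
    }

body = build_chat_request("Write a Python function that reverses a linked list.")
print(json.dumps(body, indent=2))
```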

Code Review in CI/CD

# AI-powered code review Job (trigger it from GitHub Actions or any other CI)
apiVersion: batch/v1
kind: Job
metadata:
  name: ai-code-review
  namespace: ai-inference
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: reviewer
          image: alpine:3.20
          command:
            - /bin/sh
            - -c
            - |
              # curlimages/curl does not ship jq, so install both tools here
              apk add --no-cache curl jq >/dev/null
              # pr-diff.txt must be mounted into the pod by your CI tooling (volume not shown)
              DIFF=$(cat /workspace/pr-diff.txt)
              curl -s http://qwen3-coder:8000/v1/chat/completions \
                -H "Content-Type: application/json" \
                -d "{
                  \"model\": \"Qwen/Qwen3-Coder-Next\",
                  \"messages\": [
                    {\"role\": \"system\", \"content\": \"You are a senior code reviewer. Review this diff for bugs, security issues, and best practice violations. Be concise.\"},
                    {\"role\": \"user\", \"content\": $(echo "$DIFF" | jq -Rs .)}
                  ],
                  \"max_tokens\": 2048,
                  \"temperature\": 0.1
                }" | jq -r '.choices[0].message.content'

Architecture

flowchart TD
    A[Developer] --> B{Code Task}
    B -->|Write code| C[IDE Plugin]
    B -->|Review PR| D[CI/CD Pipeline]
    B -->|Refactor| E[IDE Plugin]
    C --> F[Qwen3-Coder 80B]
    D --> F
    E --> F
    F --> G[OpenAI-compatible API]
    G --> H[Generated/Reviewed Code]
    subgraph Kubernetes Cluster
        F
    end

Common Issues

Long file context

# 65K context supports large files but uses significant KV cache
--max-model-len 65536  # for whole-file understanding
--max-num-seqs 8       # reduce concurrency for long contexts

# For shorter completions (autocomplete), reduce context:
--max-model-len 16384 --max-num-seqs 32
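
The trade-off between the two presets above is KV-cache memory. A back-of-the-envelope sizing sketch; the architecture numbers below are illustrative assumptions, not the published Qwen3-Coder-Next config, so substitute the values from the model's config.json:

```python
# Rough KV-cache sizing for tuning --max-model-len and --max-num-seqs.
# layers / kv_heads / head_dim are ASSUMED values for illustration only.
def kv_cache_bytes(seq_len, num_seqs, layers=48, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2 tensors (K and V) per layer, per token, per concurrent sequence
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len * num_seqs

gib = 1024 ** 3
long_ctx = kv_cache_bytes(65536, 8) / gib    # whole-file mode
short_ctx = kv_cache_bytes(16384, 32) / gib  # autocomplete mode
print(f"65K ctx x 8 seqs:  {long_ctx:.1f} GiB")
print(f"16K ctx x 32 seqs: {short_ctx:.1f} GiB")
```

Note that 65536 Γ— 8 and 16384 Γ— 32 budget the same total token count, so the two presets consume the same worst-case KV-cache memory; what changes is how that budget is split between context depth and concurrency.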

Streaming for IDE autocomplete

# Use streaming for responsive autocomplete
curl -N http://qwen3-coder:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-Coder-Next",
    "prompt": "def fibonacci(n):\n    ",
    "max_tokens": 256,
    "stream": true,
    "temperature": 0.1
  }'
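
With stream: true, vLLM returns OpenAI-style Server-Sent Events: each line is "data: {json}" and the stream ends with "data: [DONE]". A minimal client-side parser for the /v1/completions stream; the sample lines below are hand-written for illustration, not a live capture:

```python
import json

def extract_stream_text(sse_lines):
    """Concatenate the text deltas from an OpenAI-style completions SSE stream."""
    text = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        text.append(chunk["choices"][0]["text"])
    return "".join(text)

# Hand-written sample of two streamed chunks plus the terminator:
sample = [
    'data: {"choices": [{"text": "return"}]}',
    'data: {"choices": [{"text": " a + b"}]}',
    "data: [DONE]",
]
print(extract_stream_text(sample))  # -> return a + b
```

In an IDE plugin you would render each delta as it arrives rather than joining at the end; that incremental rendering is what makes autocomplete feel responsive.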

Best Practices

  • 2x A100 80GB for 80B model at FP16, or FP8 on H100 for single GPU
  • 65K context β€” enough for entire files, reduce for higher concurrency
  • Low temperature (0.1) β€” code generation needs determinism
  • IDE integration via Continue.dev, Cody, or custom plugins
  • Streaming β€” essential for autocomplete responsiveness
  • CI/CD integration β€” automate code reviews on PRs
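
For the single-H100 FP8 option in the first bullet, the vLLM container args change roughly as follows. This is a sketch only: --quantization fp8 enables on-the-fly FP8 weight quantization on Hopper GPUs, but you should verify the quantized model plus KV cache actually fits your chosen context length on one 80 GB card:

```yaml
args:
  - "--model"
  - "Qwen/Qwen3-Coder-Next"
  - "--quantization"
  - "fp8"                 # FP8 weights (Hopper / H100)
  - "--tensor-parallel-size"
  - "1"                   # single H100 instead of 2x A100
  - "--max-model-len"
  - "65536"
resources:
  limits:
    nvidia.com/gpu: "1"
```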

Key Takeaways

  • Qwen3-Coder-Next: 80B parameter coding model β€” 1.16M downloads
  • Deploys on 2x A100 80GB with 65K context for whole-file understanding
  • OpenAI-compatible API β€” works with Continue.dev, custom IDE plugins, CI/CD
  • Use for code generation, review, refactoring, and documentation
  • Self-hosted Copilot alternative β€” proprietary code stays on your infrastructure
#qwen3 #code-generation #coding-assistant #llm #vllm #multi-gpu #nvidia #inference #ai
Written by Luca Berton

Principal Solutions Architect specializing in Kubernetes, AI/GPU infrastructure, and cloud-native platforms. Author of Kubernetes Recipes and creator of CopyPasteLearn courses.


Want More Kubernetes Recipes?

This recipe is from Kubernetes Recipes, our 750-page practical guide with hundreds of production-ready patterns.
