Integrate DisaggregatedSet with llm-d on Kubernetes
Deploy disaggregated LLM inference using DisaggregatedSet and llm-d on Kubernetes. Install LWS then DS controller, model prefill/decode roles, wire llm-d
Quick Answer: DisaggregatedSet is the workload-orchestration layer underneath llm-d's serving/routing stack. Install LWS first, then the DisaggregatedSet controller from disaggregatedset/config/default. Replace manually managed prefill/decode LeaderWorkerSets with a single DisaggregatedSet CR, then point llm-d routing at the auto-created per-role Services using label selectors (disaggregatedset.x-k8s.io/role: prefill|decode).
The Problem
- Deploying disaggregated inference requires manually creating and coordinating separate LeaderWorkerSets for prefill and decode
- Rolling updates across roles are uncoordinated: one role can update while the other runs an incompatible version
- Service lifecycle is manual: you must ensure routing only targets ready pods of matching revisions
- Configuration drift between separately managed roles causes subtle failures
- llm-d needs stable service discovery that survives revision changes during rollouts
The Solution
Architecture: Where DisaggregatedSet Fits
Gateway / KServe / llm-d
        ↓
Inference routing / endpoint picking (prefix-cache-aware)
        ↓
Role selectors:
  prefill → DisaggregatedSet-managed LWS pods
  decode → DisaggregatedSet-managed LWS pods
        ↓
DisaggregatedSet (single CRD)
        ↓
LeaderWorkerSet per role/revision
        ↓
vLLM pods on GPU nodes
llm-d / KServe / Gateway API Inference Extension handle model serving APIs, routing, prefix/KV-cache-aware scheduling, and traffic entry. DisaggregatedSet manages the Kubernetes workloads for disaggregated roles, creating and coordinating multiple underlying LeaderWorkerSets. DisaggregatedSet is being co-designed with llm-d (CNCF sandbox project).
Step 1: Install LWS
# Install LeaderWorkerSet via Helm (required before DisaggregatedSet)
CHART_VERSION=0.8.0
helm install lws oci://registry.k8s.io/lws/charts/lws \
--version "${CHART_VERSION}" \
--namespace lws-system \
--create-namespace \
--wait \
--timeout 300s
# Verify installation
kubectl get pods -n lws-system
# NAME                         READY   STATUS    RESTARTS   AGE
# lws-controller-manager-xxx   1/1     Running   0          30s
kubectl api-resources | grep -i leaderworker
# leaderworkersets   lws   leaderworkerset.x-k8s.io/v1   true   LeaderWorkerSet
Step 2: Install DisaggregatedSet Controller
DisaggregatedSet runs as a separate controller in its own namespace (disaggregatedset-system). It must be installed after LWS.
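The ordering requirement can be enforced in a script rather than remembered. The following is a minimal sketch, not an official installer: the function names `lws_crd_present` and `install_disaggregatedset` are hypothetical, while the CRD name and kustomize path are the ones used elsewhere in this recipe.

```shell
# Sketch: install DisaggregatedSet only if the LeaderWorkerSet CRD already exists.
# Function names are illustrative; they are not part of LWS or DisaggregatedSet.

lws_crd_present() {
  # Succeeds only when the LWS CRD is registered in the cluster.
  kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io >/dev/null 2>&1
}

install_disaggregatedset() {
  if ! lws_crd_present; then
    echo "LeaderWorkerSet CRD not found; install LWS first (Step 1)." >&2
    return 1
  fi
  kubectl apply --server-side \
    -k "github.com/kubernetes-sigs/lws/disaggregatedset/config/default?ref=main"
}
```

Calling `install_disaggregatedset` on a cluster without LWS prints the hint and returns non-zero instead of applying manifests that would not reconcile.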
# Install from repo source (kustomize)
kubectl apply --server-side \
-k "github.com/kubernetes-sigs/lws/disaggregatedset/config/default?ref=main"
# Verify
kubectl get pods -n disaggregatedset-system
# NAME                                      READY   STATUS    RESTARTS   AGE
# disaggregatedset-controller-manager-xxx   1/1     Running   0          30s
kubectl api-resources | grep -i disaggregated
# disaggregatedsets   ds   disaggregatedset.x-k8s.io/v1alpha1   true   DisaggregatedSet
If the kustomize path changes, clone and inspect:
git clone https://github.com/kubernetes-sigs/lws.git
cd lws
find disaggregatedset -maxdepth 3 -type f | sort
kubectl apply --server-side -k disaggregatedset/config/default
Step 3: Deploy DisaggregatedSet for llm-d
Instead of manually managing separate LWS objects:
BEFORE (manual):                     AFTER (DisaggregatedSet):
─────────────────                    ─────────────────────────
LeaderWorkerSet: my-model-prefill    DisaggregatedSet: llama-pd
LeaderWorkerSet: my-model-decode       role: prefill (LWS template)
Services: manually managed             role: decode (LWS template)
Rollouts: manually coordinated
                                     Controller auto-creates:
                                       llama-pd-<revision>-prefill
                                       llama-pd-<revision>-decode
                                       llama-pd-<revision>-prefill-prv
                                       llama-pd-<revision>-decode-prv
apiVersion: disaggregatedset.x-k8s.io/v1alpha1
kind: DisaggregatedSet
metadata:
  name: llama-pd
  namespace: llm-d
spec:
  roles:
    - name: prefill
      metadata:
        labels:
          app.kubernetes.io/name: llama-pd
          llm-d.ai/role: prefill
      spec:
        replicas: 2
        leaderWorkerTemplate:
          size: 2 # 1 leader + 1 worker (tensor-parallel across 2 GPUs)
          restartPolicy: RecreateGroupOnPodRestart
          leaderTemplate:
            spec:
              containers:
                - name: vllm
                  image: vllm/vllm-openai:v0.8.0
                  args:
                    - "--model=/models/llama-70b"
                    - "--tensor-parallel-size=2"
                    - "--kv-transfer-config=/etc/vllm/kv-transfer.yaml"
                    - "--enable-disagg"
                    - "--disagg-role=prefill"
                    - "--port=8000"
                  env:
                    - name: NCCL_SOCKET_IFNAME
                      value: "eth0"
                  ports:
                    - containerPort: 8000
                      name: http
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                  volumeMounts:
                    - name: shm
                      mountPath: /dev/shm
                    - name: models
                      mountPath: /models
              volumes:
                - name: shm
                  emptyDir:
                    medium: Memory
                    sizeLimit: 32Gi
                - name: models
                  persistentVolumeClaim:
                    claimName: model-cache
              nodeSelector:
                nvidia.com/gpu.product: "NVIDIA-H100-80GB-HBM3"
          workerTemplate:
            spec:
              containers:
                - name: vllm
                  image: vllm/vllm-openai:v0.8.0
                  args:
                    - "--model=/models/llama-70b"
                    - "--tensor-parallel-size=2"
                    - "--enable-disagg"
                    - "--disagg-role=prefill"
                  env:
                    - name: NCCL_SOCKET_IFNAME
                      value: "eth0"
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                  volumeMounts:
                    - name: shm
                      mountPath: /dev/shm
                    - name: models
                      mountPath: /models
              volumes:
                - name: shm
                  emptyDir:
                    medium: Memory
                    sizeLimit: 32Gi
                - name: models
                  persistentVolumeClaim:
                    claimName: model-cache
              nodeSelector:
                nvidia.com/gpu.product: "NVIDIA-H100-80GB-HBM3"
    - name: decode
      metadata:
        labels:
          app.kubernetes.io/name: llama-pd
          llm-d.ai/role: decode
      spec:
        replicas: 4 # More decode replicas (throughput bottleneck)
        leaderWorkerTemplate:
          size: 1 # Single GPU per decode replica
          restartPolicy: RecreateGroupOnPodRestart
          leaderTemplate:
            spec:
              containers:
                - name: vllm
                  image: vllm/vllm-openai:v0.8.0
                  args:
                    - "--model=/models/llama-70b"
                    - "--tensor-parallel-size=1"
                    - "--kv-transfer-config=/etc/vllm/kv-transfer.yaml"
                    - "--enable-disagg"
                    - "--disagg-role=decode"
                    - "--port=8000"
                  ports:
                    - containerPort: 8000
                      name: http
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                  volumeMounts:
                    - name: shm
                      mountPath: /dev/shm
                    - name: models
                      mountPath: /models
              volumes:
                - name: shm
                  emptyDir:
                    medium: Memory
                    sizeLimit: 32Gi
                - name: models
                  persistentVolumeClaim:
                    claimName: model-cache
              nodeSelector:
                nvidia.com/gpu.product: "NVIDIA-A100-SXM4-80GB"
Step 4: Wire llm-d Routing to Generated Services
DisaggregatedSet creates headless Services per role per revision. Use label selectors, not hard-coded service names, because DS creates revisioned names during rollouts.
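To make the selector-over-name rule concrete, here is a small sketch that builds a role-scoped selector from the two stable labels and uses it to list whichever Services currently back that role. The helper names `ds_role_selector` and `ds_role_services` are hypothetical; the label keys are the ones this recipe relies on.

```shell
# Sketch: address a role by labels so the revision hash never appears in scripts.
# Helper names are illustrative only.

ds_role_selector() {
  # $1 = DisaggregatedSet name, $2 = role name (for example prefill or decode)
  printf 'disaggregatedset.x-k8s.io/name=%s,disaggregatedset.x-k8s.io/role=%s\n' "$1" "$2"
}

ds_role_services() {
  # $1 = namespace, $2 = DisaggregatedSet name, $3 = role name
  kubectl get svc -n "$1" -l "$(ds_role_selector "$2" "$3")" -o name
}
```

For example, `ds_role_services llm-d llama-pd decode` keeps working across rollouts because it matches on labels rather than on the revisioned Service name.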
# Discover auto-created services
kubectl get svc -n llm-d \
-l disaggregatedset.x-k8s.io/name=llama-pd
# NAME                         TYPE        CLUSTER-IP   PORT(S)
# llama-pd-abc12-prefill-prv   ClusterIP   None         8000/TCP
# llama-pd-abc12-decode-prv    ClusterIP   None         8000/TCP
# Discover managed LeaderWorkerSets
kubectl get lws -n llm-d \
-l disaggregatedset.x-k8s.io/name=llama-pd
# NAME                     REPLICAS   READY   AGE
# llama-pd-abc12-prefill   2          2       5m
# llama-pd-abc12-decode    4          4       5m
Labels applied to managed resources:
disaggregatedset.x-k8s.io/name → llama-pd
disaggregatedset.x-k8s.io/role → prefill | decode
disaggregatedset.x-k8s.io/revision → abc12345
Configure llm-d InferencePool with Label Selectors
# Use role labels for endpoint discovery; they survive rollout revision changes
# llm-d InferencePool / EndpointPicker configuration:
# Note: the selector field shape can differ between InferencePool API versions;
# confirm it with: kubectl explain inferencepool.spec.selector
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pd-pool
  namespace: llm-d
spec:
  targetPortNumber: 8000
  selector:
    matchLabels:
      disaggregatedset.x-k8s.io/name: llama-pd
---
# Or separate pools per role for explicit routing control:
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pd-prefill
  namespace: llm-d
spec:
  targetPortNumber: 8000
  selector:
    matchLabels:
      disaggregatedset.x-k8s.io/name: llama-pd
      disaggregatedset.x-k8s.io/role: prefill
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pd-decode
  namespace: llm-d
spec:
  targetPortNumber: 8000
  selector:
    matchLabels:
      disaggregatedset.x-k8s.io/name: llama-pd
      disaggregatedset.x-k8s.io/role: decode
Important: Prefer selectors like:
# ✅ CORRECT: survives rollout revision changes
selector:
  matchLabels:
    disaggregatedset.x-k8s.io/name: llama-pd
    disaggregatedset.x-k8s.io/role: decode
Not hard-coded service names:
# ❌ WRONG: breaks on every rollout (revision hash changes)
serviceName: llama-pd-abc12345-decode-prv
Step 5: Validate the Integration
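Validation can also be scripted. As a minimal sketch, the hypothetical helper `wait_for_roles` below blocks until the pods of each named role report Ready, assuming pods carry the same `disaggregatedset.x-k8s.io/name` and `disaggregatedset.x-k8s.io/role` labels used for discovery in this recipe. The 600-second timeout is an arbitrary choice.

```shell
# Sketch: wait for every pod of each role to become Ready, or fail on timeout.
# wait_for_roles is an illustrative name, not a DisaggregatedSet command.

wait_for_roles() {
  # $1 = namespace, $2 = DisaggregatedSet name, remaining args = role names
  ns=$1
  name=$2
  shift 2
  for role in "$@"; do
    echo "waiting for role: $role"
    kubectl wait pod -n "$ns" \
      -l "disaggregatedset.x-k8s.io/name=$name,disaggregatedset.x-k8s.io/role=$role" \
      --for=condition=Ready --timeout=600s || return 1
  done
}
```

A call such as `wait_for_roles llm-d llama-pd prefill decode` returns non-zero as soon as one role fails to become Ready, which makes it usable as a gate before sending test traffic.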
# Check DisaggregatedSet status
kubectl get disaggregatedset -n llm-d
# NAME       ROLES   READY   AGE
# llama-pd   2       True    10m
kubectl describe disaggregatedset llama-pd -n llm-d
# Check underlying resources
kubectl get lws -n llm-d \
-l disaggregatedset.x-k8s.io/name=llama-pd
kubectl get pods -n llm-d \
-l disaggregatedset.x-k8s.io/name=llama-pd -o wide
kubectl get svc -n llm-d \
-l disaggregatedset.x-k8s.io/name=llama-pd
# Test inference through llm-d/KServe/Gateway endpoint
curl -X POST "http://llm-gateway.example.com/v1/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "llama-70b",
"prompt": "Explain disaggregated inference in one paragraph.",
"max_tokens": 128
}'
Rolling Update Behavior
When you update the DisaggregatedSet spec (e.g., new vLLM image):
1. Controller creates NEW revision LWS + Services for ALL roles
2. Scales UP new revision (surge replicas)
3. Waits for ALL new pods to be Ready across ALL roles
4. Scales DOWN old revision
5. Cleans up old Services
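While these steps run, prefill and decode capacity are meant to stay in a fixed proportion (1:2 for the manifest in Step 3). The sketch below only shows how an observer could check that proportion from ready-replica counts. It does not reproduce the controller's stepping logic, which this recipe does not describe, and `ratio_holds` is a hypothetical name.

```shell
# Sketch: check that observed ready replicas match a target prefill:decode ratio.
# Cross-multiplication avoids fractional arithmetic in the shell.

ratio_holds() {
  # $1 = ready prefill replicas, $2 = ready decode replicas
  # $3 = target prefill part,    $4 = target decode part (1 and 2 for 1:2)
  [ $(( $1 * $4 )) -eq $(( $2 * $3 )) ]
}
```

With the counts from the example LWS listing, `ratio_holds 2 4 1 2` succeeds, while `ratio_holds 3 4 1 2` fails because prefill has moved ahead of decode.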
Key guarantees:
• Capacity ratio maintained at every step (prefill:decode = 1:2)
• No orphaned single-role workloads during interrupted rollouts
• Scale up BEFORE scale down (always maintains serving capacity)
• Controller is stateless, so it is safe to restart at any point
Important: Do NOT set rollout strategy on embedded LWS templates.
DisaggregatedSet owns rollouts and does not propagate RolloutStrategy
to underlying LWS resources.
When to Use LWS vs DisaggregatedSet
Use LeaderWorkerSet when:
• Single multi-node serving pool (one role)
• Standard tensor-parallel inference
• Simple multi-host deployment
Use DisaggregatedSet when:
• Multiple coordinated serving roles (prefill + decode)
• Need coordinated rolling updates across roles
• Want automatic service orchestration per role/revision
• Using llm-d with disaggregated vLLM/SGLang
Red Hat AI Inference / RHAI Integration
For Red Hat AI Inference deployments, LWS may already be installed as a dependency. The llm-d Helm chart can install all dependencies including Gateway API, LWS, and KServe:
# Red Hat AI Inference deploys llm-d with LWS as dependency
# Check if LWS is already present
kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io
# If using RHAI Helm chart, LWS is installed automatically
# for wide expert parallelism with llm-d
helm install rhai-llmd redhat-ai/llm-d \
--set lws.enabled=true \
--set gatewayAPI.enabled=true \
--namespace llm-d
Common Issues
DisaggregatedSet controller not found after install
- Cause: Installed before LWS, or kustomize path changed in repo
- Fix: Ensure LWS is running first; clone repo and inspect disaggregatedset/config/
Pods pending: gang scheduling can't place all pods
- Cause: Not enough GPU nodes in zone for exclusive topology
- Fix: Ensure sufficient GPU capacity per zone; relax topology if needed
llm-d can't discover endpoints after rollout
- Cause: Using hard-coded service names instead of label selectors
- Fix: Use disaggregatedset.x-k8s.io/role label selectors; they survive revision changes
"role names must be unique" validation error
- Cause: Duplicate role names in spec
- Fix: Each role needs a unique name matching ^[a-z0-9]([-a-z0-9]*[a-z0-9])?$
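Role names can be checked locally before applying the manifest. The sketch below tests a name against the pattern quoted in the fix and reports duplicates in a list of names. Both function names are hypothetical, and the authoritative check remains the controller's own validation.

```shell
# Sketch: pre-flight checks for role names. Function names are illustrative.

valid_role_name() {
  # Succeeds when $1 matches the pattern from the validation message.
  # LC_ALL=C keeps [a-z] from matching uppercase letters in some locales.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]([-a-z0-9]*[a-z0-9])?$'
}

duplicate_roles() {
  # Prints each role name that appears more than once in the arguments.
  printf '%s\n' "$@" | sort | uniq -d
}
```

For instance, `valid_role_name Prefill` and `valid_role_name decode-` both fail, and `duplicate_roles prefill decode prefill` prints `prefill`.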
Rollout stuck: new revision not becoming ready
- Cause: Insufficient GPU headroom for surge replicas during update
- Fix: Ensure at least 1 extra replica worth of GPUs available; pre-pull model images
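"One extra replica worth of GPUs" depends on the role, because a replica is a whole leader/worker group. The arithmetic can be sketched as follows, assuming a surge of one replica per role (the actual surge size is not stated in this recipe) and using hypothetical function names.

```shell
# Sketch: GPU budgeting for a rollout surge. Function names are illustrative.

gpus_per_replica() {
  # $1 = leaderWorkerTemplate.size (pods per group), $2 = GPUs per pod
  echo $(( $1 * $2 ))
}

gpus_during_surge() {
  # $1 = replicas, $2 = size, $3 = GPUs per pod, $4 = surge replicas (assumed)
  echo $(( ($1 + $4) * $(gpus_per_replica "$2" "$3") ))
}
```

For the Step 3 manifest, `gpus_during_surge 2 2 1 1` gives 6 for prefill (4 steady plus 2 surge) and `gpus_during_surge 4 1 1 1` gives 5 for decode. Because the two roles use different nodeSelector values, the headroom has to exist in each GPU pool separately.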
Best Practices
- Install order: LWS → DisaggregatedSet → llm-d (DS depends on LWS CRDs)
- Use label selectors for routing: never hard-code revision-specific service names
- Don't set RolloutStrategy on embedded LWS: DS owns the rollout lifecycle
- Scale decode > prefill: decode is typically the throughput bottleneck (2:1 or 4:1)
- Pre-pull model images: avoids 10+ minute delays during rollout surge
- Use RecreateGroupOnPodRestart: ensures tensor-parallel groups restart atomically
- Monitor per-role: track TTFT (prefill latency) and TPS (decode throughput) independently
- Single DisaggregatedSet per model: don't mix different models in one DS
Key Takeaways
- DisaggregatedSet is the workload orchestration layer underneath llm-d's routing stack
- Install order matters: LWS first → DisaggregatedSet → then llm-d/KServe
- DS controller runs in the disaggregatedset-system namespace and manages LWS in the workload namespace
- Replaces manually coordinated LWS objects with a single unified CRD
- Auto-creates revisioned headless Services per role for revision-aware routing
- llm-d discovers endpoints via label selectors (disaggregatedset.x-k8s.io/role)
- N-dimensional rolling updates maintain capacity ratios across all roles throughout rollout
- Co-designed with llm-d (CNCF sandbox): the recommended pattern for production disaggregated inference
- DS is still v1alpha1: the API may change, and docs are catching up (GitHub issue #806)

