Volcano Job minAvailable Gang Schedule
Volcano batch scheduling with minAvailable gang scheduling on Kubernetes. Job configuration, queue policies, and AI training workload scheduling.
Quick Answer: Install Volcano for gang scheduling (all-or-nothing pod groups), fair-share queues, and job lifecycle management. Create Volcano Jobs with `minAvailable` to prevent partial starts of distributed training, and Queues with weights for fair GPU sharing across teams.
The Problem
Distributed training with 4 workers needs all 4 to start simultaneously, but the default scheduler places pods one by one, so workers 1-3 can start and sit idle, holding their GPUs, while worker 4 is still waiting for resources. Gang scheduling ensures all workers start together or none do. Volcano also provides queue-based job management for multi-tenant GPU clusters.
The Solution
Install Volcano
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/release-1.10/installer/volcano-development.yaml
Volcano Queue
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training-queue
spec:
  weight: 3
  capability:
    nvidia.com/gpu: 32
    cpu: 128
    memory: 512Gi
  reclaimable: true
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: inference-queue
spec:
  weight: 5
  capability:
    nvidia.com/gpu: 16
  reclaimable: false
Gang-Scheduled Training Job
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-llm-train
  annotations:
    # Waiting-time limit read by the scheduler's sla plugin
    sla-waiting-time: 30m
spec:
  schedulerName: volcano
  minAvailable: 4            # 1 master + 3 workers: all four or none
  queue: training-queue
  policies:
    - event: PodEvicted
      action: RestartJob
    - event: PodFailed
      action: RestartJob
    - event: TaskCompleted
      action: CompleteJob
  plugins:
    # Job-level plugins only. gang and sla are scheduler plugins and are
    # enabled in the scheduler configuration, not in the Job spec.
    svc: []                  # headless service plus VC_<TASK>_HOSTS variables
    env: []                  # per-pod task index
  tasks:
    - replicas: 1
      name: master
      template:
        spec:
          schedulerName: volcano
          containers:
            - name: pytorch
              image: registry.example.com/training:1.0
              # VC_MASTER_HOSTS is injected by the svc plugin
              command: ["torchrun", "--master_addr=$(VC_MASTER_HOSTS)", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8
              env:
                - name: RANK
                  value: "0"
    - replicas: 3
      name: worker
      template:
        spec:
          schedulerName: volcano
          containers:
            - name: pytorch
              image: registry.example.com/training:1.0
              command: ["torchrun", "--master_addr=$(VC_MASTER_HOSTS)", "train.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8
Volcano Plugins
| Plugin | Purpose |
|---|---|
| gang | All-or-nothing scheduling |
| sla | Waiting time limits, job deadlines |
| proportion | Fair-share queue allocation |
| binpack | Pack pods onto fewer nodes |
| nodeorder | Custom node scoring |
| tdm | Time-division multiplexing |
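The plugins in the table are enabled in the scheduler's own configuration rather than in each Job. The following is a minimal sketch of that configuration, assuming the ConfigMap name and namespace created by the default installer (`volcano-scheduler-configmap` in `volcano-system`). The tier layout and the 30m value are illustrative, so compare against the ConfigMap in your cluster before changing anything.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap   # assumed: name used by the default installer
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
      - plugins:
          - name: priority
          - name: gang                # all-or-nothing admission
          - name: sla
            arguments:
              sla-waiting-time: 30m   # illustrative cluster-wide default
      - plugins:
          - name: predicates
          - name: proportion          # fair share between queues by weight
          - name: nodeorder
          - name: binpack
```

A waiting time set on an individual Job takes the place of the cluster-wide default for that Job. Reclaiming resources between queues is likely to also need the `reclaim` action in the `actions` list, which this sketch leaves out.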
Monitor Queue Status
# Queue utilization
kubectl get queue -o wide
# Job status
kubectl get vcjob
kubectl describe vcjob distributed-llm-train
graph TD
    subgraph Volcano Scheduler
        GANG[Gang Plugin<br/>All-or-nothing]
        PROP[Proportion Plugin<br/>Fair sharing]
        BINPACK[Binpack Plugin<br/>Node consolidation]
        SLA_P[SLA Plugin<br/>Waiting timeout]
    end
    JOB[Volcano Job<br/>minAvailable: 4] --> GANG
    GANG -->|All 4 pods<br/>schedulable?| CHECK{Resources<br/>available?}
    CHECK -->|Yes| SCHEDULE[Schedule all 4<br/>simultaneously]
    CHECK -->|No| QUEUE[Queue job<br/>wait for resources]
    QUEUE -->|SLA timeout 30m| FAIL[Fail job<br/>insufficient resources]
Common Issues
Job stuck in Pending although the queue has capacity
The Volcano scheduler might not be running. Check: `kubectl get pods -n volcano-system`. Ensure `schedulerName: volcano` is set on all pod templates.
Gang scheduling deadlock: two jobs each hold a partial allocation
Volcano's gang plugin prevents this: it only admits a job when at least `minAvailable` of its pods can be scheduled together. If you see partial allocation, check that `minAvailable` equals total replicas.
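When debugging either issue, it can help to inspect the PodGroup that Volcano creates for the Job, because the Job's `minAvailable` is carried over to the PodGroup's `minMember`. The following is a sketch of roughly what that object looks like. The name is hypothetical (the controller generates the real one) and the status is abbreviated.

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-llm-train-example   # hypothetical; the real name is generated
  namespace: default
spec:
  minMember: 4               # taken from the Job's minAvailable
  queue: training-queue
status:
  phase: Pending             # for example Pending, Inqueue, or Running
```

Running `kubectl get podgroup` lists these objects. If `minMember` is lower than the number of pods the training run needs, the scheduler is allowed to start the job with only part of its workers.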
Best Practices
- `minAvailable` equals total pods for gang scheduling: prevents partial starts
- SLA plugin with waiting time: bounds how long a job waits for resources
- Queue weights for priority: higher weight = more resource share
- `reclaimable: true` for training queues: inference can reclaim GPU resources
- `PodEvicted` → `RestartJob`: automatic restart on preemption
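As a worked example of how queue weights translate into shares, take the two queues defined earlier (weights 3 and 5) and assume a hypothetical cluster with 48 GPUs where both queues request more than they can get. This is a simplified model of weight-proportional sharing, not an exact trace of the proportion plugin:

$$
\begin{aligned}
\text{training-queue} &: \tfrac{3}{3+5} \times 48 = 18 \text{ GPUs} \\
\text{inference-queue} &: \tfrac{5}{3+5} \times 48 = 30 \text{ GPUs} \;\rightarrow\; \min(30,\,16) = 16 \text{ GPUs (capped by capability)} \\
\text{training-queue after redistribution} &: 18 + (30 - 16) = 32 \text{ GPUs}
\end{aligned}
$$

Under these assumptions the higher-weight inference queue still ends up with fewer GPUs, because its `capability` of 16 caps its share and the remainder flows to the training queue, up to that queue's own cap of 32.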
Key Takeaways
- Volcano provides gang scheduling: all pods start together or none do
- Prevents the #1 distributed training problem: workers idling waiting for peers
- Queue-based resource management with fair-share proportional allocation
- SLA plugin sets waiting-time limits so jobs do not wait indefinitely
- Integrates with PyTorch, TensorFlow, MPI, and Spark distributed workloads
