GPU Time-Slicing on Kubernetes
Share GPUs across multiple workloads using NVIDIA time-slicing on Kubernetes. Configure the device plugin, set replica counts, and manage fairness.
💡 Quick Answer: Create a ConfigMap with sharing.timeSlicing.replicas: 4 and reference it in the GPU Operator's device plugin config. Each physical GPU appears as 4 nvidia.com/gpu resources, letting 4 pods share one GPU via CUDA time-slicing, with no MIG hardware partitioning needed.
The Problem
GPUs are expensive. A single NVIDIA A100 costs ~$10,000, yet many workloads (notebooks, dev inference, small models) use only 10-30% of GPU capacity. Without sharing, each pod requesting nvidia.com/gpu: 1 gets exclusive access to an entire GPU, wasting resources. You need GPU sharing that works with any NVIDIA GPU, not just MIG-capable ones.
The Solution
How Time-Slicing Works
CUDA time-slicing shares a physical GPU across multiple processes by rapidly switching execution context. Each workload gets a "slice" of GPU time. Unlike MIG (which partitions GPU hardware), time-slicing:
- Works on any NVIDIA GPU (not just A100/H100)
- Shares all GPU memory (no hard memory isolation)
- Provides fair scheduling via CUDA scheduler
- Has minimal overhead (~2-5%)
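To make the context-switching idea concrete, here is a toy round-robin model in Python. It is only a sketch under simplifying assumptions (fixed, equal slices and zero context-switch cost), not the actual CUDA scheduler; the function name, the slice length, and the workload names are hypothetical.

```python
from collections import deque


def simulate_time_slicing(work_ms, slice_ms=2):
    """Toy round-robin model: each process runs for at most slice_ms, then yields.

    work_ms maps a process name to the GPU time it needs (milliseconds).
    Returns the wall-clock time at which each process finishes.
    Context-switch cost is ignored (an assumption of this sketch).
    """
    queue = deque(work_ms.items())
    clock = 0
    finished = {}
    while queue:
        name, remaining = queue.popleft()
        run = min(slice_ms, remaining)
        clock += run
        remaining -= run
        if remaining > 0:
            queue.append((name, remaining))  # go to the back of the line
        else:
            finished[name] = clock
    return finished


if __name__ == "__main__":
    # Three hypothetical workloads sharing one GPU
    print(simulate_time_slicing({"notebook": 4, "inference": 2, "dev-model": 6}))
```

In this model every workload finishes later than it would on an exclusive GPU, which is the trade-off time-slicing makes for higher density.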
Step 1: Create Device Plugin Config
# gpu-time-slicing-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
  namespace: gpu-operator
data:
  default: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
          - name: nvidia.com/gpu
            replicas: 4
  dev: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
          - name: nvidia.com/gpu
            replicas: 8
Key settings:
- replicas: 4 advertises each physical GPU as 4 virtual GPUs
- failRequestsGreaterThanOne: true rejects pods requesting more than one shared GPU (requesting several replicas does not grant a larger share of the GPU)
- renameByDefault: false keeps the nvidia.com/gpu resource name (set to true to use nvidia.com/gpu.shared)
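The effect of these settings can be sketched in a few lines of Python. This is an illustrative model of the behaviour described above, not the device plugin's code: the dict mirrors the "default" profile, and the helper names advertised_resources and request_allowed are hypothetical.

```python
# Mirrors the "default" profile from the ConfigMap (written as a Python dict
# so the sketch needs no YAML library).
DEFAULT_PROFILE = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "renameByDefault": False,
            "failRequestsGreaterThanOne": True,
            "resources": [{"name": "nvidia.com/gpu", "replicas": 4}],
        }
    },
}


def advertised_resources(profile, physical_gpus):
    """Return {resource_name: count} a node would advertise under this profile."""
    ts = profile["sharing"]["timeSlicing"]
    advertised = {}
    for res in ts["resources"]:
        name = res["name"]
        if ts.get("renameByDefault", False):
            name += ".shared"  # e.g. nvidia.com/gpu.shared
        advertised[name] = physical_gpus * res["replicas"]
    return advertised


def request_allowed(profile, requested):
    """Model failRequestsGreaterThanOne: reject requests for more than one replica."""
    ts = profile["sharing"]["timeSlicing"]
    if ts.get("failRequestsGreaterThanOne", False) and requested > 1:
        return False
    return requested >= 1


if __name__ == "__main__":
    print(advertised_resources(DEFAULT_PROFILE, physical_gpus=2))
    print(request_allowed(DEFAULT_PROFILE, requested=2))
```

A node with 2 physical GPUs would advertise 8 nvidia.com/gpu resources under this profile, and a pod asking for 2 of them would be rejected.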
Step 2: Apply and Configure GPU Operator
kubectl apply -f gpu-time-slicing-config.yaml
# Update ClusterPolicy to reference the config
kubectl patch clusterpolicy cluster-policy \
  --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"device-plugin-config","default":"default"}}}}'
Or set during Helm install:
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set devicePlugin.config.name=device-plugin-config \
  --set devicePlugin.config.default=default
Step 3: Label Nodes for Different Profiles
# Dev nodes: 8-way sharing (more pods, less GPU per pod)
kubectl label node dev-gpu-node nvidia.com/device-plugin.config=dev
# Production nodes: 4-way sharing (default)
# No label needed: uses the "default" profile
# Training nodes: no sharing (exclusive GPU access)
kubectl label node train-gpu-node nvidia.com/device-plugin.config=no-sharing
Add a no-sharing profile:
data:
  no-sharing: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 1
Step 4: Verify Time-Slicing
# Check advertised GPU count (should be physical × replicas)
kubectl get node gpu-node -o jsonpath='{.status.allocatable}' | jq '."nvidia.com/gpu"'
# "8" (2 physical GPUs × 4 replicas)
# Deploy test pods
# Note: --overrides must come before "--", otherwise it is passed to sleep as an argument
for i in $(seq 1 4); do
  kubectl run gpu-test-$i --image=nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04 \
    --overrides='{"spec":{"containers":[{"name":"gpu-test-'$i'","image":"nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04","command":["sleep","infinity"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}' \
    --command -- sleep infinity
done
# All 4 pods should be Running, each holding one time-sliced GPU replica
kubectl get pods -o wide
graph TD
    A[Physical GPU: A100 80GB] -->|Time-Slicing ×4| B[Pod 1: Notebook<br>20GB shared memory]
    A -->|Time-Slicing ×4| C[Pod 2: Inference<br>20GB shared memory]
    A -->|Time-Slicing ×4| D[Pod 3: Dev Model<br>20GB shared memory]
    A -->|Time-Slicing ×4| E[Pod 4: Fine-tuning<br>20GB shared memory]
    F[Scheduling] -->|Round-robin| A
    style A fill:#76b900,color:#fff
    style B fill:#e8f5e9
    style C fill:#e8f5e9
    style D fill:#e8f5e9
    style E fill:#e8f5e9
Choosing the Right Replica Count
| Replicas | Sharing Level | Use Case |
|---|---|---|
| 1 | Exclusive | Training, large models |
| 2 | Light sharing | Production inference |
| 4 | Standard | Mixed dev/inference |
| 8 | Heavy sharing | Notebooks, small models |
| 10+ | Maximum | CI/CD, testing |
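One way to reason about a replica count is to work out what each pod can count on if every workload is equally busy and every app limits itself to an even share of memory. The sketch below does that arithmetic; per_pod_budget and its headroom parameter are hypothetical, and the even split is an assumption, since time-slicing itself enforces no memory limit.

```python
def per_pod_budget(total_vram_gib, replicas, headroom=0.0):
    """Estimate an even per-pod share of one GPU under time-slicing.

    total_vram_gib: physical GPU memory in GiB.
    replicas:       the time-slicing replica count for the GPU.
    headroom:       fraction of VRAM kept free (assumption of this sketch).
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    usable_fraction = 1.0 - headroom
    return {
        # Memory each pod may use if all pods self-limit evenly
        "memory_gib": total_vram_gib * usable_fraction / replicas,
        # Value an app could pass to a framework-level memory fraction setting
        "memory_fraction": usable_fraction / replicas,
        # Share of GPU time when every replica is busy at once
        "worst_case_compute_share": 1.0 / replicas,
    }


if __name__ == "__main__":
    for replicas in (1, 2, 4, 8):
        print(replicas, per_pod_budget(80, replicas))
```

For an 80 GB GPU with 4 replicas this gives 20 GiB and a 0.25 memory fraction per pod, which is the kind of value an app would need to apply to itself.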
When to Use Time-Slicing vs MIG
| Feature | Time-Slicing | MIG |
|---|---|---|
| GPU support | Any NVIDIA GPU | MIG-capable GPUs only (e.g., A100, A30, H100, H200) |
| Memory isolation | ❌ Shared | ✅ Hardware-isolated |
| Fault isolation | ❌ Shared | ✅ Independent |
| Configuration | ConfigMap | GPU Operator + device plugin |
| Overhead | ~2-5% | ~0% |
| Flexibility | Easy to change | Requires reconfiguration |
| Best for | Dev, notebooks, small inference | Production, multi-tenant |
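The comparison can be condensed into a small decision helper. This is only one possible reading of the table: the function name is hypothetical, and falling back to an exclusive GPU when isolation is required on hardware without MIG is an assumption of this sketch rather than something the table states.

```python
def suggest_sharing_mode(mig_capable, needs_memory_isolation, needs_fault_isolation=False):
    """Return "mig", "time-slicing", or "exclusive" for a workload (illustrative only)."""
    if needs_memory_isolation or needs_fault_isolation:
        # Time-slicing offers neither kind of isolation, so it is ruled out here.
        return "mig" if mig_capable else "exclusive"
    return "time-slicing"


if __name__ == "__main__":
    print(suggest_sharing_mode(mig_capable=True, needs_memory_isolation=True))
    print(suggest_sharing_mode(mig_capable=False, needs_memory_isolation=False))
```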
Common Issues
OOM When Sharing GPUs
Time-slicing doesn't isolate GPU memory. If one pod allocates too much VRAM, others get OOM. The variable below caps compute only, and only when the workload runs under CUDA MPS; it does not limit memory:
# Limit compute per container (MPS only; does not cap VRAM)
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "all"
  - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
    value: "25"  # Limit to 25% of GPU compute
For memory, use framework-level limits:
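# PyTorch: cap this process at 25% of the GPU's memory
import torch
torch.cuda.set_per_process_memory_fraction(0.25)
# TensorFlow: allocate on demand instead of reserving all VRAM (not a hard cap)
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_memory_growth(gpus[0], True)
Pods Stuck Pending After Config Change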
The device plugin needs to restart to pick up new config:
kubectl -n gpu-operator delete pods -l app=nvidia-device-plugin-daemonset
# Wait for restart, then check allocatable
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
Uneven GPU Utilization
Time-slicing uses round-robin scheduling. For fairer allocation, use KAI Scheduler:
# See kai-scheduler-gpu-sharing recipe for fair queueing
schedulerName: kai-scheduler
Best Practices
- 4 replicas for general use: good balance of sharing and performance
- 8+ replicas for dev/notebooks: maximize density, accept performance variability
- 1 replica for training: never time-slice training workloads
- Set failRequestsGreaterThanOne: true to prevent pods from hogging GPUs
- Monitor with DCGM: watch DCGM_FI_DEV_GPU_UTIL to detect oversubscription
- Use node labels for profiles: different sharing ratios for dev vs production nodes
- Set framework memory limits: time-slicing doesn't isolate memory; apps must self-limit
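As a rough illustration of the monitoring advice, the sketch below flags a GPU as oversubscribed when most of its recent DCGM_FI_DEV_GPU_UTIL samples sit near saturation. It assumes the samples (percent values) have already been collected, for example from a metrics backend; the function name and both thresholds are hypothetical and should be tuned per cluster.

```python
def is_oversubscribed(util_samples, threshold=90, min_fraction=0.8):
    """Return True if at least min_fraction of samples are at or above threshold.

    util_samples: GPU utilization percentages (0-100) for one GPU over a window.
    threshold:    utilization treated as "saturated" (assumed value).
    min_fraction: share of saturated samples that triggers the flag (assumed value).
    """
    if not util_samples:
        return False  # no data, nothing to flag
    saturated = sum(1 for u in util_samples if u >= threshold)
    return saturated / len(util_samples) >= min_fraction


if __name__ == "__main__":
    print(is_oversubscribed([95, 97, 99, 92, 40]))  # mostly saturated window
    print(is_oversubscribed([30, 50, 95]))          # occasional spike only
```

A GPU that is flagged for long periods is a candidate for a lower replica count or for moving some pods to another node.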
Key Takeaways
- Time-slicing multiplies advertised GPU count by replicas in device plugin config
- Works on any NVIDIA GPU; no MIG-capable hardware required
- No GPU memory isolation: workloads share the full VRAM
- Use per-node labels to assign different sharing profiles (dev: 8x, prod: 4x, training: 1x)
- Combine with DCGM monitoring to detect oversubscription
- For hard memory isolation, use MIG on A100/H100/H200 instead

