KAI-Scheduler vs HAMi: Two Ways to Share GPUs in Kubernetes (Soft vs Hard Isolation)
An engineering-oriented comparison of KAI-Scheduler’s Reservation Pod approach and HAMi’s hard isolation path, including trade-offs, failure modes (noisy neighbor), and how the two layers can complement each other.
Sharing a GPU in Kubernetes sounds simple: run multiple jobs on one physical GPU.
In practice, it quickly turns into a set of very concrete questions:
- How do you ask for “0.2 GPU” when Kubernetes only knows nvidia.com/gpu: 1?
- How do you prevent scheduling conflicts with full-GPU pods?
- And the real pain point: how do you prevent one container from consuming all VRAM and killing its neighbors?
KAI-Scheduler (open-sourced after NVIDIA acquired Run:ai) ships a clever GPU sharing design centered around the idea of a Reservation Pod: allocate a whole GPU to a lightweight placeholder pod, then do fractional accounting inside the scheduler.
HAMi represents a different direction: not only scheduling-level allocation, but runtime-enforced limits (hard isolation) so that “shared” GPUs behave predictably in multi-tenant environments.
This post aims to:
- Explain KAI’s Reservation Pod mechanism clearly.
- Explain what “hard isolation” means in the HAMi path.
- Give practical guidance: when to choose which, and how they can be composed.
1) Why “fractional GPU” is hard in vanilla Kubernetes
Kubernetes device resources are modeled as extended resources.
For NVIDIA GPUs, that’s typically nvidia.com/gpu. The model is intentionally simple and works well for whole-device allocation, but it comes with hard constraints:
- Integer only: you can’t request 0.2 of nvidia.com/gpu.
- The default scheduler doesn’t understand fractional semantics.
- There’s no standard way to represent “this GPU is partially used” to the rest of the cluster.
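For reference, the native model only lets a pod ask for whole devices. A minimal sketch (pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: whole-gpu-pod
spec:
  containers:
  - name: app
    image: your-registry/app:latest
    resources:
      limits:
        nvidia.com/gpu: 1   # integers only; a fractional quantity is rejected by the API server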
A common instinct is:
“Fine, I’ll report one physical GPU as multiple virtual devices via a device plugin.”
That’s a valid route (and it’s close to how hard-isolation stacks often start), but KAI takes a different, scheduler-centric approach.
2) KAI’s key trick: the Reservation Pod
2.1 Intuition: reserve the whole GPU, then split the bill internally
KAI’s approach is basically:
Don’t fight Kubernetes over fractional GPU semantics. Use a standard pod to reserve the whole GPU first.
A fractional user pod expresses intent via an annotation, for example:
metadata:
  annotations:
    gpu-fraction: "0.2"
Then KAI-Scheduler does this:
- Detects the gpu-fraction annotation.
- Selects a node and a physical GPU that can be shared.
- If this GPU has no existing sharing group yet, creates a Reservation Pod that requests nvidia.com/gpu: "1".
- Once the Reservation Pod is scheduled, Kubernetes sees the GPU as fully allocated, preventing other schedulers/pods from conflicting.
- Assigns the fractional pods to the same logical group (for example via a gpu-group label) and keeps a private accounting of usage.
2.2 What is it “tricking”?
It’s not tricking the GPU.
It’s “tricking” the resource model: Kubernetes only sees that the GPU is owned by the Reservation Pod, while KAI tracks fractional usage out-of-band.
3) How KAI GPU sharing works (engineering view)
3.1 Scheduling entry: parse annotations and pick a GPU
At a high level, the control flow looks like:
- find GPUs on the node that can fit the fractional request
- pick a GPU that’s preferable for sharing
- attempt to allocate the fractional task
Simplified pseudocode:
func AllocateFractionalGPUTaskToNode(ssn *framework.Session, stmt *framework.Statement,
    pod *pod_info.PodInfo, node *node_info.NodeInfo, isPipelineOnly bool) bool {
    // Which physical GPUs on this node could host the requested fraction?
    fittingGPUs := ssn.FittingGPUs(node, pod)
    // Prefer a GPU that is already shared or otherwise best suited for sharing.
    gpuForSharing := getNodePreferableGpuForSharing(fittingGPUs, node, pod, isPipelineOnly)
    if gpuForSharing == nil {
        return false
    }
    pod.GPUGroups = gpuForSharing.Groups
    success := allocateSharedGPUTask(ssn, stmt, node, pod, isPipelineOnly)
    if !success {
        // Roll back the group assignment if the allocation did not go through.
        pod.GPUGroups = nil
    }
    return success
}
3.2 The critical step: create the Reservation Pod
The Reservation Pod is pinned to a node and requests a whole GPU:
func (rsc *service) createResourceReservationPod(
    nodeName, gpuGroup, podName, appName string,
    resources v1.ResourceRequirements,
) (*v1.Pod, error) {
    podSpec := &v1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            Name:      podName,
            Namespace: "runai-reservation",
            Labels: map[string]string{
                "gpu-group": gpuGroup,
            },
        },
        Spec: v1.PodSpec{
            NodeName: nodeName,
            Containers: []v1.Container{{
                Name:      "reservation",
                Image:     rsc.reservationPodImage,
                Resources: resources, // requests nvidia.com/gpu: "1"
            }},
        },
    }
    return podSpec, rsc.kubeClient.Create(context.Background(), podSpec)
}
Key takeaways:
- It’s node-pinned (not “waiting” for the default scheduler).
- It uses labels to define a logical “sharing group”.
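Put together, the object this code produces looks roughly like the sketch below (names and node are placeholders; the real pod carries more metadata):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-reservation-example        # placeholder name
  namespace: runai-reservation
  labels:
    gpu-group: example-gpu-group       # placeholder sharing-group ID
spec:
  nodeName: worker-1                   # pinned to the chosen node
  containers:
  - name: reservation
    image: reservation-pod-image       # placeholder
    resources:
      limits:
        nvidia.com/gpu: "1"            # reserves the whole physical GPU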
3.3 Internal accounting: tracking VRAM usage per sharing group
KAI tracks sharing state internally, for example:
type GpuSharingNodeInfo struct {
    ReleasingSharedGPUs       map[string]bool  // sharing groups whose pods are on the way out
    UsedSharedGPUsMemory      map[string]int64 // VRAM in use, per sharing group
    ReleasingSharedGPUsMemory map[string]int64 // VRAM being released, per sharing group
    AllocatedSharedGPUsMemory map[string]int64 // VRAM allocated, per sharing group
}
This implies a fundamental property:
KAI’s GPU sharing is implemented at the scheduler level, not via runtime enforcement.
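To make that concrete, here is a minimal, illustrative sketch (not actual KAI code) of what scheduler-level accounting amounts to: in-memory bookkeeping per sharing group, with nothing underneath enforcing the numbers:

// Illustrative only. Checks whether a fractional request fits into a shared
// GPU's remaining memory budget and, if so, records it in the scheduler's
// own state (assumes the maps are initialized). The running container is
// never constrained by this bookkeeping.
func tryAllocateShared(info *GpuSharingNodeInfo, gpuGroup string, requestedBytes, capacityBytes int64) bool {
    if info.AllocatedSharedGPUsMemory[gpuGroup]+requestedBytes > capacityBytes {
        return false // the fraction does not fit on this shared GPU
    }
    info.AllocatedSharedGPUsMemory[gpuGroup] += requestedBytes
    return true
}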
3.4 Cleanup: last fractional pod exits -> delete Reservation Pod
When the sharing group no longer has active fractional pods, KAI deletes the Reservation Pod to return the GPU to the cluster:
func (rsc *service) syncForPods(ctx context.Context, pods []*v1.Pod, gpuGroupToSync string) error {
    reservationPods := map[string]*v1.Pod{}
    fractionPods := map[string][]*v1.Pod{}
    for _, pod := range pods {
        if pod.Namespace == "runai-reservation" {
            reservationPods[gpuGroupToSync] = pod
            continue
        }
        if pod.Status.Phase == v1.PodRunning || pod.Status.Phase == v1.PodPending {
            fractionPods[gpuGroupToSync] = append(fractionPods[gpuGroupToSync], pod)
        }
    }
    for gpuGroup, reservationPod := range reservationPods {
        if _, found := fractionPods[gpuGroup]; !found {
            // No active fractional pods left in this group: release the GPU.
            return rsc.deleteReservationPod(ctx, reservationPod)
        }
    }
    return nil
}
4) KAI’s strengths—and its unavoidable constraint
4.1 Strength: elegant, Kubernetes-native integration
KAI’s design is practical:
- Uses standard K8s primitives (pods, labels, annotations)
- Avoids direct conflict with the default scheduler by reserving the whole GPU
- Allows flexible fractions (not limited to a few predefined profiles)
If your goal is to improve utilization quickly and you can coordinate with application owners, KAI can be a fast win.
4.2 The core limitation: soft isolation
The trade-off is also fundamental:
KAI allocates “fractions” logically, but it can’t enforce those fractions at runtime.
That leads to:
- no hard cap on VRAM/compute
- noisy neighbor risks
- application-side tuning often required (e.g., setting per-process VRAM limits)
Think of it as “sharing a room with house rules” rather than “private rooms with locked doors”.
5) The HAMi path: enforce isolation lower in the stack
Hard isolation solutions aim to avoid relying on application self-discipline.
A typical design combines:
- device plugin logic to expose schedulable units
- runtime/driver-level enforcement to cap VRAM (and sometimes other resources)
Benefits:
- predictable limits per container
- stronger multi-tenant QoS
- less need for application-level configuration
Costs:
- more components to deploy
- more constraints on driver/runtime compatibility
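To make the contrast with the annotation-based request concrete: a hard-isolation stack typically exposes the split as explicit extended resources that the device plugin and runtime then enforce. Below is a hedged sketch using HAMi-style resource names; the exact names and units are configurable and may differ across versions and setups:

apiVersion: v1
kind: Pod
metadata:
  name: shared-inference
spec:
  containers:
  - name: app
    image: your-registry/infer:latest
    resources:
      limits:
        nvidia.com/gpu: 1        # one shared-GPU slice
        nvidia.com/gpumem: 6000  # VRAM cap (MiB), enforced at runtime
        nvidia.com/gpucores: 30  # approximate share of compute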
6) Which one should you use?
Choose a KAI-style approach if:
- you need a lightweight path to better utilization
- you can tolerate some performance variance
- you can coordinate with application teams to set VRAM usage limits
Choose a HAMi-style approach if:
- you run a multi-tenant GPU platform
- you need stable QoS and strict resource guarantees
- you want sharing to be transparent to users
7) A realistic future: compose scheduler policies with hard isolation
It’s tempting to treat this as a “winner vs loser” debate.
In practice, they map nicely to two layers:
- scheduling policies and UX (KAI’s strength)
- enforcement and guarantees (HAMi’s strength)
If the ecosystem aligns interfaces between the two, you can get the best of both worlds:
- rich scheduling policies and a clean fractional request UX
- hard caps underneath to eliminate noisy-neighbor failure modes
7.5 Hands-on: running two workloads on one shared GPU (how to survive soft sharing)
A very common situation: you’ve got a 24GB GPU and you want to run two workloads on it:
- Workload A: an online inference service (needs predictable VRAM)
- Workload B: an offline evaluation / small training job (can be slower, but must not crash A)
With scheduler-level sharing (soft isolation), the scheduler can co-locate A and B on the same physical GPU, but preventing conflicts often requires application-side guardrails.
7.5.1 Pod side: request a fraction via annotation
Example (0.5 + 0.5):
apiVersion: v1
kind: Pod
metadata:
  name: infer-a
  annotations:
    gpu-fraction: "0.5"
spec:
  containers:
  - name: app
    image: your-registry/infer:latest
---
apiVersion: v1
kind: Pod
metadata:
  name: eval-b
  annotations:
    gpu-fraction: "0.5"
spec:
  containers:
  - name: app
    image: your-registry/eval:latest
Note: this intentionally doesn’t use resources.limits: nvidia.com/gpu, because fractional GPU is not a native Kubernetes resource. The exact knobs depend on how KAI is integrated in your cluster.
7.5.2 App side: VRAM self-limiting (PyTorch example)
The simplest guardrail is to cap VRAM usage in-process before loading the model:
import torch
# Limit this process to ~50% of GPU 0 VRAM.
# This is not hard isolation, but it reduces noisy-neighbor incidents in many real deployments.
torch.cuda.set_per_process_memory_fraction(0.5, device=0)
# load model *after* the cap
Practical gotchas:
- call it early (before allocating tensors / loading weights)
- in containers, device=0 means the first visible GPU (respecting CUDA_VISIBLE_DEVICES)
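One way to keep the in-process cap aligned with the scheduler-side fraction is to pass the same value through an environment variable. This is a convention you define yourself (GPU_MEMORY_FRACTION below is not a standard variable), set in the pod spec next to the gpu-fraction annotation:

import os

import torch

# GPU_MEMORY_FRACTION is a convention of this example, not something the
# scheduler sets for you; keep it in sync with the pod's gpu-fraction.
fraction = float(os.environ.get("GPU_MEMORY_FRACTION", "0.5"))

# Apply the cap before any tensors are allocated or weights are loaded.
torch.cuda.set_per_process_memory_fraction(fraction, device=0)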
7.5.3 How to tell if it’s a noisy-neighbor problem
Typical symptoms:
- inference latency spikes without code changes
- occasional CUDA out of memory errors, and a restart “fixes” it
- VRAM usage of one pod oscillates and aligns with another pod’s peaks on the same GPU
Quick checks:
- use nvidia-smi to confirm multiple processes share the same GPU and whether VRAM is saturated (a scripted version follows this list)
- align pod-level GPU VRAM metrics with the incident timeline (who ramped up first?)
- if you can’t require every workload to be “well-behaved”, you usually want hard isolation underneath
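If you would rather script that first check than eyeball nvidia-smi, the NVML Python bindings (the nvidia-ml-py / pynvml package) expose the same data; a minimal sketch for the first GPU on a node:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on this node

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"VRAM: {mem.used / 2**20:.0f} MiB used of {mem.total / 2**20:.0f} MiB")

# More than one entry here means the GPU is shared; per-process memory
# shows who is crowding whom (usedGpuMemory can be None on some setups).
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    used = proc.usedGpuMemory
    used_mib = "n/a" if used is None else f"{used / 2**20:.0f} MiB"
    print(f"pid={proc.pid} vram={used_mib}")

pynvml.nvmlShutdown()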
Closing thoughts
KAI-Scheduler’s Reservation Pod idea is a very Kubernetes-savvy way to bring fractional GPU semantics into a system that only understands whole devices.
HAMi’s direction is a platform-engineering answer to the uncomfortable truth: sharing without enforcement turns into operational risk.
A pragmatic approach is:
- start with scheduler-level sharing to lift utilization
- add hard isolation when you need production-grade multi-tenant guarantees
A follow-up post could dig into the real ops side:
- observability (who is eating VRAM/SM?)
- failure modes (why OOM happens in “shared GPU”, and how to mitigate)
- rollout strategies for mixed sharing + exclusive workloads