KAI-Scheduler vs HAMi: Two Ways to Share GPUs in Kubernetes (Soft vs Hard Isolation)
An engineering-oriented comparison of KAI-Scheduler’s Reservation Pod approach and HAMi’s hard isolation path, including trade-offs, failure modes (noisy neighbor), and how the two layers can complement each other.
Sharing a GPU in Kubernetes sounds simple: run multiple jobs on one physical GPU.
In practice, it quickly turns into a set of very concrete questions:
- How do you ask for “0.2 GPU” when Kubernetes only knows nvidia.com/gpu: 1?
- How do you prevent scheduling conflicts with full-GPU pods?
- And the real pain point: how do you prevent one container from consuming all VRAM and killing its neighbors?
KAI-Scheduler (open-sourced after NVIDIA acquired Run:ai) ships a clever GPU sharing design centered around the idea of a Reservation Pod: allocate a whole GPU to a lightweight placeholder pod, then do fractional accounting inside the scheduler.
HAMi represents a different direction: not only scheduling-level allocation, but runtime-enforced limits (hard isolation) so that “shared” GPUs behave predictably in multi-tenant environments.
This post aims to:
- Explain KAI’s Reservation Pod mechanism clearly.
- Explain what “hard isolation” means in the HAMi path.
- Give practical guidance: when to choose which, and how they can be composed.
1) Why “fractional GPU” is hard in vanilla Kubernetes
Kubernetes device resources are modeled as extended resources.
For NVIDIA GPUs, that’s typically nvidia.com/gpu. The model is intentionally simple and works well for whole-device allocation, but it comes with hard constraints:
- Integer only: you can’t request 0.2 of nvidia.com/gpu.
- The default scheduler doesn’t understand fractional semantics.
- There’s no standard way to represent “this GPU is partially used” to the rest of the cluster.
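For reference, the native model only lets a pod ask for whole devices. A minimal sketch (pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: whole-gpu-pod
spec:
  containers:
  - name: app
    image: your-registry/app:latest
    resources:
      limits:
        nvidia.com/gpu: 1   # integers only; a fractional quantity is rejected by the API server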
A common instinct is:
“Fine, I’ll report one physical GPU as multiple virtual devices via a device plugin.”
That’s a valid route (and it’s close to how hard-isolation stacks often start), but KAI takes a different, scheduler-centric approach.
2) KAI’s key trick: the Reservation Pod
2.1 Intuition: reserve the whole GPU, then split the bill internally
KAI’s approach is basically:
Don’t fight Kubernetes over fractional GPU semantics. Use a standard pod to reserve the whole GPU first.
A fractional user pod expresses intent via an annotation, for example:
metadata:
  annotations:
    gpu-fraction: "0.2"
Then KAI-Scheduler does this:
- Detects the gpu-fraction annotation.
- Selects a node and a physical GPU that can be shared.
- If this GPU has no existing sharing group yet, creates a Reservation Pod that requests nvidia.com/gpu: "1".
- Once the Reservation Pod is scheduled, Kubernetes sees the GPU as fully allocated, preventing other schedulers/pods from conflicting.
- Assigns the fractional pods to the same logical group (for example via a gpu-group label) and keeps a private accounting of usage.
2.2 What is it “tricking”?
It’s not tricking the GPU.
It’s “tricking” the resource model: Kubernetes only sees that the GPU is owned by the Reservation Pod, while KAI tracks fractional usage out-of-band.
3) How KAI GPU sharing works (engineering view)
3.1 Scheduling entry: parse annotations and pick a GPU
At a high level, the control flow looks like:
- find GPUs on the node that can fit the fractional request
- pick a GPU that’s preferable for sharing
- attempt to allocate the fractional task
Simplified pseudocode:
func AllocateFractionalGPUTaskToNode(ssn *framework.Session, stmt *framework.Statement,
    pod *pod_info.PodInfo, node *node_info.NodeInfo, isPipelineOnly bool) bool {
    // Which physical GPUs on this node could host the requested fraction?
    fittingGPUs := ssn.FittingGPUs(node, pod)
    // Prefer a GPU that is already shared or otherwise best suited for sharing.
    gpuForSharing := getNodePreferableGpuForSharing(fittingGPUs, node, pod, isPipelineOnly)
    if gpuForSharing == nil {
        return false
    }
    pod.GPUGroups = gpuForSharing.Groups
    success := allocateSharedGPUTask(ssn, stmt, node, pod, isPipelineOnly)
    if !success {
        // Roll back the group assignment if the allocation did not go through.
        pod.GPUGroups = nil
    }
    return success
}
3.2 The critical step: create the Reservation Pod
The Reservation Pod is pinned to a node and requests a whole GPU:
func (rsc *service) createResourceReservationPod(
    nodeName, gpuGroup, podName, appName string,
    resources v1.ResourceRequirements,
) (*v1.Pod, error) {
    podSpec := &v1.Pod{
        ObjectMeta: metav1.ObjectMeta{
            Name:      podName,
            Namespace: "runai-reservation",
            Labels: map[string]string{
                "gpu-group": gpuGroup,
            },
        },
        Spec: v1.PodSpec{
            NodeName: nodeName,
            Containers: []v1.Container{{
                Name:      "reservation",
                Image:     rsc.reservationPodImage,
                Resources: resources, // requests nvidia.com/gpu: "1"
            }},
        },
    }
    return podSpec, rsc.kubeClient.Create(context.Background(), podSpec)
}
Key takeaways:
- It’s node-pinned (not “waiting” for the default scheduler).
- It uses labels to define a logical “sharing group”.
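Put together, the object this code produces looks roughly like the sketch below (names and node are placeholders; the real pod carries more metadata):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-reservation-example        # placeholder name
  namespace: runai-reservation
  labels:
    gpu-group: example-gpu-group       # placeholder sharing-group ID
spec:
  nodeName: worker-1                   # pinned to the chosen node
  containers:
  - name: reservation
    image: reservation-pod-image       # placeholder
    resources:
      limits:
        nvidia.com/gpu: "1"            # reserves the whole physical GPU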
3.3 Internal accounting: tracking VRAM usage per sharing group
KAI tracks sharing state internally, for example:
type GpuSharingNodeInfo struct {
    ReleasingSharedGPUs       map[string]bool  // sharing groups whose pods are on the way out
    UsedSharedGPUsMemory      map[string]int64 // VRAM in use, per sharing group
    ReleasingSharedGPUsMemory map[string]int64 // VRAM being released, per sharing group
    AllocatedSharedGPUsMemory map[string]int64 // VRAM allocated, per sharing group
}
This implies a fundamental property:
KAI’s GPU sharing is implemented at the scheduler level, not via runtime enforcement.
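To make that concrete, here is a minimal, illustrative sketch (not actual KAI code) of what scheduler-level accounting amounts to: in-memory bookkeeping per sharing group, with nothing underneath enforcing the numbers:

// Illustrative only. Checks whether a fractional request fits into a shared
// GPU's remaining memory budget and, if so, records it in the scheduler's
// own state (assumes the maps are initialized). The running container is
// never constrained by this bookkeeping.
func tryAllocateShared(info *GpuSharingNodeInfo, gpuGroup string, requestedBytes, capacityBytes int64) bool {
    if info.AllocatedSharedGPUsMemory[gpuGroup]+requestedBytes > capacityBytes {
        return false // the fraction does not fit on this shared GPU
    }
    info.AllocatedSharedGPUsMemory[gpuGroup] += requestedBytes
    return true
}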
3.4 Cleanup: last fractional pod exits -> delete Reservation Pod
When the sharing group no longer has active fractional pods, KAI deletes the Reservation Pod to return the GPU to the cluster:
func (rsc *service) syncForPods(ctx context.Context, pods []*v1.Pod, gpuGroupToSync string) error {
    reservationPods := map[string]*v1.Pod{}
    fractionPods := map[string][]*v1.Pod{}
    for _, pod := range pods {
        if pod.Namespace == "runai-reservation" {
            reservationPods[gpuGroupToSync] = pod
            continue
        }
        if pod.Status.Phase == v1.PodRunning || pod.Status.Phase == v1.PodPending {
            fractionPods[gpuGroupToSync] = append(fractionPods[gpuGroupToSync], pod)
        }
    }
    for gpuGroup, reservationPod := range reservationPods {
        if _, found := fractionPods[gpuGroup]; !found {
            // No active fractional pods left in this group: release the GPU.
            return rsc.deleteReservationPod(ctx, reservationPod)
        }
    }
    return nil
}
4) KAI’s strengths—and its unavoidable constraint
4.1 Strength: elegant, Kubernetes-native integration
KAI’s design is practical:
- Uses standard K8s primitives (pods, labels, annotations)
- Avoids direct conflict with the default scheduler by reserving the whole GPU
- Allows flexible fractions (not limited to a few predefined profiles)
If your goal is to improve utilization quickly and you can coordinate with application owners, KAI can be a fast win.
4.2 The core limitation: soft isolation
The trade-off is also fundamental:
KAI allocates “fractions” logically, but it can’t enforce those fractions at runtime.
That leads to:
- no hard cap on VRAM/compute
- noisy neighbor risks
- application-side tuning often required (e.g., setting per-process VRAM limits)
Think of it as “sharing a room with house rules” rather than “private rooms with locked doors”.
5) The HAMi path: enforce isolation lower in the stack
Hard isolation solutions aim to avoid relying on application self-discipline.
A typical design combines:
- device plugin logic to expose schedulable units
- runtime/driver-level enforcement to cap VRAM (and sometimes other resources)
Benefits:
- predictable limits per container
- stronger multi-tenant QoS
- less need for application-level configuration
Costs:
- more components to deploy
- more constraints on driver/runtime compatibility
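To make the contrast with the annotation-based request concrete: a hard-isolation stack typically exposes the split as explicit extended resources that the device plugin and runtime then enforce. Below is a hedged sketch using HAMi-style resource names; the exact names and units are configurable and may differ across versions and setups:

apiVersion: v1
kind: Pod
metadata:
  name: shared-inference
spec:
  containers:
  - name: app
    image: your-registry/infer:latest
    resources:
      limits:
        nvidia.com/gpu: 1        # one shared-GPU slice
        nvidia.com/gpumem: 6000  # VRAM cap (MiB), enforced at runtime
        nvidia.com/gpucores: 30  # approximate share of compute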
6) Which one should you use?
Choose a KAI-style approach if:
- you need a lightweight path to better utilization
- you can tolerate some performance variance
- you can coordinate with application teams to set VRAM usage limits
Choose a HAMi-style approach if:
- you run a multi-tenant GPU platform
- you need stable QoS and strict resource guarantees
- you want sharing to be transparent to users
7) A realistic future: compose scheduler policies with hard isolation
It’s tempting to treat this as a “winner vs loser” debate.
In practice, they map nicely to two layers:
- scheduling policies and UX (KAI’s strength)
- enforcement and guarantees (HAMi’s strength)
If the ecosystem aligns interfaces between the two, you can get the best of both worlds:
- rich scheduling policies and a clean fractional request UX
- hard caps underneath to eliminate noisy-neighbor failure modes
7.5 Hands-on: running two workloads on one shared GPU (how to survive soft sharing)
A very common situation: you’ve got a 24GB GPU and you want to run two workloads on it:
- Workload A: an online inference service (needs predictable VRAM)
- Workload B: an offline evaluation / small training job (can be slower, but must not crash A)
With scheduler-level sharing (soft isolation), the scheduler can co-locate A and B on the same physical GPU, but preventing conflicts often requires application-side guardrails.
7.5.1 Pod side: request a fraction via annotation
Example (0.5 + 0.5):
apiVersion: v1
kind: Pod
metadata:
  name: infer-a
  annotations:
    gpu-fraction: "0.5"
spec:
  containers:
  - name: app
    image: your-registry/infer:latest
---
apiVersion: v1
kind: Pod
metadata:
  name: eval-b
  annotations:
    gpu-fraction: "0.5"
spec:
  containers:
  - name: app
    image: your-registry/eval:latest
Note: this intentionally doesn’t use resources.limits: nvidia.com/gpu, because fractional GPU is not a native Kubernetes resource. The exact knobs depend on how KAI is integrated in your cluster.
7.5.2 App side: VRAM self-limiting (PyTorch example)
The simplest guardrail is to cap VRAM usage in-process before loading the model:
import torch
# Limit this process to ~50% of GPU 0 VRAM.
# This is not hard isolation, but it reduces noisy-neighbor incidents in many real deployments.
torch.cuda.set_per_process_memory_fraction(0.5, device=0)
# load model *after* the cap
Practical gotchas:
- call it early (before allocating tensors / loading weights)
- in containers, device=0 means the first visible GPU (respecting CUDA_VISIBLE_DEVICES)
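One way to keep the in-process cap aligned with the scheduler-side fraction is to pass the same value through an environment variable. This is a convention you define yourself (GPU_MEMORY_FRACTION below is not a standard variable), set in the pod spec next to the gpu-fraction annotation:

import os

import torch

# GPU_MEMORY_FRACTION is a convention of this example, not something the
# scheduler sets for you; keep it in sync with the pod's gpu-fraction.
fraction = float(os.environ.get("GPU_MEMORY_FRACTION", "0.5"))

# Apply the cap before any tensors are allocated or weights are loaded.
torch.cuda.set_per_process_memory_fraction(fraction, device=0)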
7.5.3 How to tell if it’s a noisy-neighbor problem
Typical symptoms:
- inference latency spikes without code changes
- occasional CUDA out of memory errors, and a restart “fixes” it
- VRAM usage of one pod oscillates and aligns with another pod’s peaks on the same GPU
Quick checks:
- use nvidia-smi to confirm multiple processes share the same GPU and whether VRAM is saturated (a scripted version follows this list)
- align pod-level GPU VRAM metrics with the incident timeline (who ramped up first?)
- if you can’t require every workload to be “well-behaved”, you usually want hard isolation underneath
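If you would rather script that first check than eyeball nvidia-smi, the NVML Python bindings (the nvidia-ml-py / pynvml package) expose the same data; a minimal sketch for the first GPU on a node:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on this node

mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"VRAM: {mem.used / 2**20:.0f} MiB used of {mem.total / 2**20:.0f} MiB")

# More than one entry here means the GPU is shared; per-process memory
# shows who is crowding whom (usedGpuMemory can be None on some setups).
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    used = proc.usedGpuMemory
    used_mib = "n/a" if used is None else f"{used / 2**20:.0f} MiB"
    print(f"pid={proc.pid} vram={used_mib}")

pynvml.nvmlShutdown()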
Closing thoughts
KAI-Scheduler’s Reservation Pod idea is a very Kubernetes-savvy way to bring fractional GPU semantics into a system that only understands whole devices.
HAMi’s direction is a platform-engineering answer to the uncomfortable truth: sharing without enforcement turns into operational risk.
A pragmatic approach is:
- start with scheduler-level sharing to lift utilization
- add hard isolation when you need production-grade multi-tenant guarantees
A follow-up post could dig into the real ops side:
- observability (who is eating VRAM/SM?)
- failure modes (why OOM happens in “shared GPU”, and how to mitigate)
- rollout strategies for mixed sharing + exclusive workloads