
hetGPU: Chasing Cross-Vendor GPU Binary Compatibility

An engineering-oriented guide to hetGPU: how a compiler + runtime stack can make one GPU binary run across NVIDIA/AMD/Intel/Tenstorrent, including SIMT vs MIMD, memory model gaps, and live kernel migration.

GPU fleets are getting more heterogeneous: NVIDIA, AMD, Intel, and even newer designs like Tenstorrent can show up in the same cluster.

But the software reality hasn’t caught up: GPU binaries are not portable across vendors.

CUDA code compiled for NVIDIA won’t run on AMD or Intel GPUs, and the mismatch is not just at the API level. It goes deeper:

  • different ISAs and toolchains
  • different execution models (SIMT vs MIMD)
  • different memory hierarchies and consistency rules
  • different driver/runtime stacks

This becomes real operational pain:

  • vendor lock-in for large CUDA codebases
  • inflexible scheduling in mixed clusters
  • live migration/failover is basically “stop and restart”

1. Motivation: why binary portability matters

Source-level portability exists (OpenCL, HIP, SYCL/oneAPI), but it still requires rebuilding for each target. That doesn’t help when you:

  • ship prebuilt GPU libraries
  • want elastic scheduling across mixed GPU pools
  • want to checkpoint and resume a running kernel on a different device type

The core issue is straightforward: lack of binary compatibility blocks heterogeneous scheduling and makes adoption of new accelerators expensive.

2. The hetGPU idea: compiler + runtime as one system

hetGPU is designed as a full stack:

  • a compiler that emits an architecture-neutral GPU IR (instead of vendor machine code)
  • a runtime that dynamically lowers that IR to the detected target GPU (JIT / translation)
  • an abstraction layer that normalizes threads, memory, and synchronization semantics

Put simply:

Don’t bind the binary to NVIDIA SASS or AMD GCN. Bind it to a portable IR, and let the runtime specialize at load time.

2.1 Minimal example: CUDA → hetIR (pseudo)

Let’s use a classic vector-add style kernel to keep things concrete:

__global__ void vadd(const float* A, const float* B, float* C, int N) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < N) C[i] = A[i] + B[i];
}

Instead of emitting NVIDIA SASS or AMD ISA directly, hetGPU emits an architecture-neutral IR (pseudo hetIR style):

tid = GET_GLOBAL_ID(0)
pred = (tid < N)
@pred {
  a = LD_GLOBAL_F32(A + tid * 4)
  b = LD_GLOBAL_F32(B + tid * 4)
  c = FADD_F32(a, b)
  ST_GLOBAL_F32(C + tid * 4, c)
}
RET

What matters here:

  • No hard-coded warp size.
  • Divergence is represented as predicate / structured control-flow so it can map to SIMT hardware or be simulated on non-SIMT designs.
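
To see why the first point matters, here is an ordinary CUDA warp reduction that bakes the 32-lane width into both the starting offset and the shuffle mask. A binary built around this pattern cannot simply land on a 64-wide wavefront or a non-SIMT core, which is why hetIR avoids encoding a warp size at all (this snippet is plain CUDA, not hetGPU output):

// Ordinary CUDA warp reduction: the starting offset (16) and the full
// 32-bit mask both assume exactly 32 lanes per warp.
__device__ float warp_sum_32(float v) {
  for (int offset = 16; offset > 0; offset >>= 1)
    v += __shfl_down_sync(0xffffffffu, v, offset);
  return v;
}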

2.2 How runtime lowering typically looks

The runtime behaves like a multi-backend JIT/translator:

  • NVIDIA: hetIR -> PTX -> driver JIT -> SASS
  • AMD/Intel: hetIR -> SPIR-V -> driver JIT -> native ISA
  • Tenstorrent: hetIR -> Metalium/TT-MLIR/custom backend -> device binary

That’s why it helps to treat hetGPU as a system, not a one-off converter.
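
A minimal host-side sketch of that dispatch logic, with placeholder types and hypothetical lowering entry points (none of these names come from hetGPU itself):

#include <stdexcept>

// Placeholder types; the real runtime's module/kernel types are not public.
struct HetIRModule {};
struct NativeKernel {};

enum class GpuVendor { Nvidia, Amd, Intel, Tenstorrent };

// Hypothetical lowering entry points, one per backend family.
NativeKernel lower_via_ptx(const HetIRModule&);    // hetIR -> PTX -> driver JIT -> SASS
NativeKernel lower_via_spirv(const HetIRModule&);  // hetIR -> SPIR-V -> driver JIT -> native ISA
NativeKernel lower_via_tt(const HetIRModule&);     // hetIR -> TT-MLIR / Metalium -> device binary

NativeKernel jit_lower(const HetIRModule& ir, GpuVendor vendor) {
  switch (vendor) {
    case GpuVendor::Nvidia:      return lower_via_ptx(ir);
    case GpuVendor::Amd:
    case GpuVendor::Intel:       return lower_via_spirv(ir);
    case GpuVendor::Tenstorrent: return lower_via_tt(ir);
  }
  throw std::runtime_error("unknown GPU vendor");
}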

2.3 What does “binary compatibility” actually mean? (It’s not just API compatibility)

When people hear “GPU binary compatibility”, the knee-jerk interpretation is often:

So… I can keep my CUDA API calls and it’ll run on AMD/Intel without changes?

But what hetGPU is really pushing toward is a different kind of “shipping artifact”.

If you think like an engineer who has to deliver a GPU component (a prebuilt kernel library, a custom inference op, a proprietary plugin), the problem naturally splits into three layers:

  • API compatibility: can I call the same functions with the same parameters? (surface-level)
  • semantic compatibility: do threads/sync/memory visibility behave the same way? (the real hard part)
  • instruction-level compatibility: can NVIDIA SASS run on AMD/Intel/Tenstorrent ISAs? (almost never)

hetGPU’s bet is: don’t try to force instruction-level compatibility. Instead, ship a portable IR (hetIR) and do the device-specific “landing” in the runtime.

A very practical packaging story: you want to ship 1 build, not 4

Imagine you’re shipping one GPU operator to customers:

  • Customer A: NVIDIA box
  • Customer B: AMD box
  • Customer C: Intel GPU
  • Customer D: Tenstorrent

A traditional approach typically ends up in one of these:

  1. Ship source and ask users to build (portable, but painful for delivery)
  2. Ship a fat binary (you build 4 targets, user picks the right one)

A fat-binary package looks like:

my-kernel.pkg
  ├─ nvidia.cubin
  ├─ amd.hsaco
  ├─ intel.spv
  └─ tt.bin

It works, but maintenance cost grows quickly as devices and versions multiply.

hetGPU tries to make the package look more like:

my-kernel.pkg
  ├─ kernel.hetir       # the portable IR is the primary artifact
  └─ runtime/           # runtime lowers hetIR to the detected device

In pseudo-code, the runtime boils down to:

native = JIT(hetIR, detectedGPU)
launch(native)

That’s the key point: you ship a single portable intermediate form, not vendor machine code.

And yes—once you do that, your biggest debt moves to “semantic alignment”, especially the execution-model gap (SIMT vs MIMD). That’s the hard part, and it’s exactly what the next section dives into.

3. The hardest gap: SIMT (warp/wave) vs MIMD (multi-core RISC-V)

3.1 NVIDIA/AMD: SIMT is baked into hardware

NVIDIA and AMD GPUs implement SIMT:

  • warps (NVIDIA, typically 32 threads)
  • wavefronts (AMD, 32 or 64 threads)

Hardware handles warp scheduling, divergence masks, and block-level barriers.

3.2 Tenstorrent: MIMD cores, no warp scheduler

Tenstorrent’s Tensix cores behave more like many small RISC-V CPUs with vector/matrix units. There’s no hardware “warp” concept.

To run CUDA-like kernels, there are two common mapping strategies:

  1. Vectorize a warp inside one core: map 32 threads to vector lanes, handle divergence through mask registers.
  2. Split a warp across cores: sync cores explicitly and use a protocol to emulate warp-level coordination.

This is effectively a bridge between “warp-centric” and “core-centric” designs.

3.3 Divergence on a MIMD core (vectorized-warp pseudo)

If a Tenstorrent core can execute vector lanes, a common strategy is: “one core simulates one warp”.

Key idea: single instruction stream + lane masks.

mask = ALL_ONES

// if (x < 0) y = -x; else y = x;
pred = (x < 0)          // per-lane predicate

mask_then = mask & pred
if (ANY(mask_then)) {
  y[mask_then] = -x[mask_then]
}

mask_else = mask & !pred
if (ANY(mask_else)) {
  y[mask_else] = x[mask_else]
}

mask = mask_then | mask_else   // reconverge: the two masks are disjoint and union back to the original mask

This is basically SIMT divergence/convergence expressed in a way that MIMD hardware can execute.

4. Memory model mismatches: what is ‘shared memory’ on a different architecture?

On NVIDIA/AMD, shared memory (CUDA shared / AMD LDS) is a fast on-chip scratchpad visible to threads in the same block, with clear semantics after a barrier.

On Tenstorrent, local memory is per-core, and cross-core sharing requires explicit DMA or mesh communication.

hetGPU’s abstraction layer must present a consistent model:

  • map to real shared memory when hardware supports it
  • emulate it when it doesn’t (with global memory or by constraining block placement)
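
As a rough illustration (not hetGPU’s actual lowering rules), here is what the two mappings of the same block-shared tile look like in CUDA terms: a real on-chip allocation where the hardware has one, and a per-block slice of a preallocated global scratch buffer where it doesn’t, with the barrier still providing the visibility guarantee:

// Backend with real shared memory: the tile is an on-chip scratchpad.
// (Assumes blockDim.x <= 256.)
__global__ void neighbor_sum_shared(const float* in, float* out) {
  __shared__ float tile[256];
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = in[i];
  __syncthreads();                                  // tile visible to the whole block
  out[i] = tile[threadIdx.x] + tile[(threadIdx.x + 1) % blockDim.x];
}

// Backend without shared memory (expressed in CUDA only for comparison):
// the "tile" is this block's slice of a global scratch buffer.
__global__ void neighbor_sum_emulated(const float* in, float* out, float* scratch) {
  float* tile = scratch + blockIdx.x * blockDim.x;  // per-block region
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = in[i];
  __syncthreads();                                  // same barrier semantics, slower memory
  out[i] = tile[threadIdx.x] + tile[(threadIdx.x + 1) % blockDim.x];
}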

5. ISA gaps: vote, shuffle, tensor ops

Vendor-specific binaries contain operations that don’t have 1:1 equivalents elsewhere:

  • warp vote / ballot
  • shuffle
  • tensor core instructions

The approach is to capture these as IR-level primitives, then:

  • use native ops when available
  • fall back to software implementations when not

This is one reason the system prefers a semantic IR over raw machine-code translation.

5.1 Example: vote/ballot without a native instruction (pseudo)

Take ballot(pred), which returns a bitmask of the lanes where pred is true.

On NVIDIA/AMD you can map this to native vote instructions. On a backend without an equivalent, a conservative fallback is “scratch + reduce”:

// team_size = 32, tid in [0..31]
scratch[tid] = pred ? 1 : 0
barrier()

mask = 0
for i in 0..31 {
  if scratch[i] == 1 {
    mask |= (1 << i)
  }
}
barrier()
return mask

It’s slower than hardware vote, but semantics stay consistent across devices.
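
For contrast, on hardware with a native vote instruction the same IR primitive collapses to a single intrinsic. In plain CUDA (ordinary CUDA here, not hetGPU output) it is just:

// CUDA: one bit per lane of the warp, set where pred != 0.
__device__ unsigned warp_ballot(int pred) {
  return __ballot_sync(0xffffffffu, pred);
}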

5.2 Example: shuffle without a native instruction (pseudo)

For shfl(val, srcLane) (read a register value from another lane), you can also fall back to a shared scratch buffer:

scratch[tid] = val
barrier()

out = scratch[srcLane]
barrier()

return out
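
The native shuffle on NVIDIA is likewise a single intrinsic. Note that both fallbacks above need nothing beyond block-shared scratch plus a barrier, i.e., exactly the primitives the section 4 abstraction layer already has to provide:

// CUDA: every lane reads 'val' from lane 'srcLane' of its warp.
__device__ float warp_shfl(float val, int srcLane) {
  return __shfl_sync(0xffffffffu, val, srcLane);
}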

6. State capture and live migration: the “you can’t fake it later” part

Live migration needs device-independent state:

  • registers
  • program counter / control flow state
  • divergence masks
  • shared memory state
  • global memory snapshot

There’s a real practical problem here: optimized low-level machine state often cannot be cleanly mapped back to a high-level state view.

So hetGPU leans on co-design:

  • IR metadata for execution state
  • well-defined safe points (e.g., barriers) for checkpointing
  • serialization in an abstract, device-neutral format

This makes the problem tractable by limiting “where you can pause” to places where semantics are well-defined.

6.1 What a checkpoint needs to capture (shape only)

For cross-device resume, the state must be device-neutral. A simplified shape looks like:

Checkpoint {
  kernel_id
  grid_dim, block_dim
  threads: [ { pc_ir, regs_ir[], pred_mask } ]
  shared_mem_dump
  global_mem_regions: [ {addr, size, bytes...} ]
}

The important part isn’t the exact fields; it’s that the checkpoint is defined at the IR level. Otherwise you only have vendor-specific machine state that can’t be reconstructed elsewhere.

6.2 Why “safe points” (like barriers) matter

In practice, “freeze at any instruction” is unrealistic across architectures.

A practical approach is:

  • define resume-safe points in IR (barriers, explicit sync points)
  • checkpoint only at those points

Only at those points is the state well-defined enough to serialize and restore.

6.3 Example: safe-point metadata in IR (pseudo)

If you want migration to be reliable, you typically need explicit “resume-safe” locations at the IR level.

// ... normal instructions ...
SAFE_POINT(id=42, reason="block_barrier")
barrier()
// ... normal instructions ...

Then the runtime can say: “checkpoint at the next safe point” instead of trying to freeze at an arbitrary instruction.
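
As a purely illustrative sketch (not hetGPU’s documented lowering), a SAFE_POINT on a SIMT backend could expand to a flag check sitting on the block barrier, where the host sets the flag when it wants a checkpoint at the next safe point:

// Illustrative only: 'ckpt_requested' is host-visible device memory;
// reg_dump / pc_dump receive the IR-level register values and the
// safe-point id so the state can be rebuilt on another backend.
__device__ void safe_point(int id, volatile int* ckpt_requested,
                           float* reg_dump, int* pc_dump,
                           float r0, float r1 /* live IR registers */) {
  __syncthreads();                       // safe point coincides with a barrier
  if (*ckpt_requested) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    reg_dump[2 * t + 0] = r0;            // spill live IR registers
    reg_dump[2 * t + 1] = r1;
    pc_dump[t] = id;                     // remember which safe point we stopped at
    // the generated kernel would then branch to its exit path
  }
  __syncthreads();
}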

6.4 Example: split a long kernel into resumable segments (pseudo)

An engineering-friendly pattern is to split a long-running kernel into segments, with explicit boundaries:

kernel_seg0(input, state_out) {
  // do work...
  barrier()
  dump_state(state_out)
}

kernel_seg1(state_in, output) {
  load_state(state_in)
  // continue work...
}

Migration becomes a simple control flow:

  1. run kernel_seg0 on GPU A and export state
  2. move state_out (and required global memory) to GPU B
  3. run kernel_seg1 on GPU B

You give up “freeze-anywhere” flexibility, but you gain a state model that’s actually explainable and reproducible.
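
A host-side sketch of that three-step flow; het_launch, het_sync, and het_copy_state are illustrative names in the spirit of the runtime pseudo-code above, not a real hetGPU API:

// 1. Run the first segment on GPU A and export its state at the barrier.
het_launch(gpuA, "kernel_seg0", {input, state_out});
het_sync(gpuA);

// 2. Move the exported state (plus the global buffers it references) to GPU B.
het_copy_state(gpuA, gpuB, state_out, state_in);

// 3. Resume from the explicit boundary on GPU B, which may be a different vendor.
het_launch(gpuB, "kernel_seg1", {state_in, output});
het_sync(gpuB);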

7. Why existing approaches don’t fully solve it

  • OpenCL/SYCL/HIP: source portability, not binary portability
  • fat binaries: bundle multiple targets (still N builds)
  • API interception (e.g., ZLUDA/Ocelot): complex surface area, partial coverage and correctness challenges
  • checkpoint tools (CRIUgpu-like): usually same-vendor assumptions

hetGPU’s pitch is the combination: portable IR + dynamic lowering + cross-device state management.

8. What to take away (value and boundaries)

I read this direction as a system architecture statement:

  • cross-vendor GPU binaries require a semantic IR, not just instruction translation
  • SIMT vs MIMD is a first-class design problem
  • live migration requires explicit state modeling and safe points

But the trade-offs are real:

  • JIT cost
  • portability often means giving up some vendor-specific peak optimizations
  • some instructions/features may degrade to slower “portable semantics”
