hetGPU: Chasing Cross-Vendor GPU Binary Compatibility
An engineering-oriented guide to hetGPU: how a compiler + runtime stack can make one GPU binary run across NVIDIA/AMD/Intel/Tenstorrent, including SIMT vs MIMD, memory model gaps, and live kernel migration.
GPU fleets are getting more heterogeneous: NVIDIA, AMD, Intel, and even newer designs like Tenstorrent can show up in the same cluster.
But the software reality hasn’t caught up: GPU binaries are not portable across vendors.
CUDA code compiled for NVIDIA won’t run on AMD or Intel GPUs, because the mismatch is not just at the API level. It goes deeper:
- different ISAs and toolchains
- different execution models (SIMT vs MIMD)
- different memory hierarchies and consistency rules
- different driver/runtime stacks
This becomes real operational pain:
- vendor lock-in for large CUDA codebases
- inflexible scheduling in mixed clusters
- live migration/failover is basically “stop and restart”
1. Motivation: why binary portability matters
Source-level portability exists (OpenCL, HIP, SYCL/oneAPI), but it still requires rebuilding for each target. That doesn’t help when you:
- ship prebuilt GPU libraries
- want elastic scheduling across mixed GPU pools
- want to checkpoint and resume a running kernel on a different device type
The core issue is straightforward: lack of binary compatibility blocks heterogeneous scheduling and makes adoption of new accelerators expensive.
2. The hetGPU idea: compiler + runtime as one system
hetGPU is designed as a full stack:
- a compiler that emits an architecture-neutral GPU IR (instead of vendor machine code)
- a runtime that dynamically lowers that IR to the detected target GPU (JIT / translation)
- an abstraction layer that normalizes threads, memory, and synchronization semantics
Put simply:
Don’t bind the binary to NVIDIA SASS or AMD GCN. Bind it to a portable IR, and let the runtime specialize at load time.
2.1 Minimal example: CUDA → hetIR (pseudo)
Let’s use a classic vector-add style kernel to keep things concrete:
__global__ void vadd(const float* A, const float* B, float* C, int N) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < N) C[i] = A[i] + B[i];
}
Instead of emitting NVIDIA SASS or AMD ISA directly, hetGPU emits an architecture-neutral IR (pseudo hetIR style):
tid = GET_GLOBAL_ID(0)
pred = (tid < N)
@pred {
a = LD_GLOBAL_F32(A + tid * 4)
b = LD_GLOBAL_F32(B + tid * 4)
c = FADD_F32(a, b)
ST_GLOBAL_F32(C + tid * 4, c)
}
RET
What matters here:
- No hard-coded warp size.
- Divergence is represented as predicate / structured control-flow so it can map to SIMT hardware or be simulated on non-SIMT designs.
2.2 How runtime lowering typically looks
The runtime behaves like a multi-backend JIT/translator:
- NVIDIA: hetIR -> PTX -> driver JIT -> SASS
- AMD/Intel: hetIR -> SPIR-V -> driver JIT -> native ISA
- Tenstorrent: hetIR -> Metalium/TT-MLIR/custom backend -> device binary
That’s why it helps to treat hetGPU as a system, not a one-off converter.
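To make that dispatch concrete, here is a minimal host-side sketch in C++. Every name here (Vendor, lowerViaPtx, and so on) is invented for illustration and is not the actual hetGPU API; a real backend would call into the respective toolchains or driver JITs instead of returning stubs.

#include <string>
#include <vector>

// All names below are illustrative placeholders, not the actual hetGPU API.
enum class Vendor { Nvidia, Amd, Intel, Tenstorrent };
struct NativeKernel { std::vector<unsigned char> code; };

// Stubs standing in for the per-vendor pipelines listed above; a real backend
// would invoke the PTX / SPIR-V / Tenstorrent toolchains or driver JITs here.
NativeKernel lowerViaPtx(const std::string&)       { return {}; }
NativeKernel lowerViaSpirv(const std::string&)     { return {}; }
NativeKernel lowerViaTtBackend(const std::string&) { return {}; }

// Load-time specialization: pick the pipeline that matches the detected GPU.
NativeKernel lower(const std::string& hetir, Vendor v) {
    switch (v) {
        case Vendor::Nvidia:      return lowerViaPtx(hetir);       // -> driver JIT -> SASS
        case Vendor::Amd:
        case Vendor::Intel:       return lowerViaSpirv(hetir);     // -> driver JIT -> native ISA
        case Vendor::Tenstorrent: return lowerViaTtBackend(hetir); // -> device binary
    }
    return {};                                                     // unreachable
}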
2.3 What does “binary compatibility” actually mean? (It’s not just API compatibility)
When people hear “GPU binary compatibility”, the knee-jerk interpretation is often:
So… I can keep my CUDA API calls and it’ll run on AMD/Intel without changes?
But what hetGPU is really pushing toward is closer to a different “shipping artifact”.
If you think like an engineer who has to deliver a GPU component (a prebuilt kernel library, a custom inference op, a proprietary plugin), the problem naturally splits into three layers:
- API compatibility: can I call the same functions with the same parameters? (surface-level)
- semantic compatibility: do threads/sync/memory visibility behave the same way? (the real hard part)
- instruction-level compatibility: can NVIDIA SASS run on AMD/Intel/Tenstorrent ISAs? (almost never)
hetGPU’s bet is: don’t try to force instruction-level compatibility. Instead, ship a portable IR (hetIR) and do the device-specific “landing” in the runtime.
A very practical packaging story: you want to ship 1 build, not 4
Imagine you’re shipping one GPU operator to customers:
- Customer A: NVIDIA box
- Customer B: AMD box
- Customer C: Intel GPU
- Customer D: Tenstorrent
A traditional approach typically ends up in one of these:
- Ship source and ask users to build (portable, but painful for delivery)
- Ship a fat binary (you build 4 targets, user picks the right one)
A fat-binary package looks like:
my-kernel.pkg
├─ nvidia.cubin
├─ amd.hsaco
├─ intel.spv
└─ tt.bin
It works, but maintenance cost grows quickly as devices and versions multiply.
hetGPU tries to make the package look more like:
my-kernel.pkg
├─ kernel.hetir # the portable IR is the primary artifact
└─ runtime/ # runtime lowers hetIR to the detected device
In one line of pseudo-code, the runtime does:
native = JIT(hetIR, detectedGPU)
launch(native)
That’s the key point: you ship a single portable intermediate form, not vendor machine code.
And yes—once you do that, your biggest debt moves to “semantic alignment”, especially the execution-model gap (SIMT vs MIMD). That’s the hard part, and it’s exactly what the next section dives into.
3. The hardest gap: SIMT (warp/wave) vs MIMD (multi-core RISC-V)
3.1 NVIDIA/AMD: SIMT is baked into hardware
NVIDIA and AMD GPUs implement SIMT:
- warps (NVIDIA, typically 32 threads)
- wavefronts (AMD, 32 or 64 threads)
Hardware handles warp scheduling, divergence masks, and block-level barriers.
3.2 Tenstorrent: MIMD cores, no warp scheduler
Tenstorrent’s Tensix cores behave more like many small RISC-V CPUs with vector/matrix units. There’s no hardware “warp” concept.
To run CUDA-like kernels, there are two common mapping strategies:
- Vectorize a warp inside one core: map 32 threads to vector lanes, handle divergence through mask registers.
- Split a warp across cores: sync cores explicitly and use a protocol to emulate warp-level coordination.
This is effectively a bridge between “warp-centric” and “core-centric” designs.
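Section 3.3 below walks through the first strategy. For the second one, “use a protocol to emulate warp-level coordination” boils down to explicit cross-core synchronization. Here is a hedged sketch using C++ threads and std::atomic as stand-ins for Tensix cores and their sync primitives; the 8-lanes-per-core split and all names are assumptions made purely for illustration.

#include <atomic>
#include <thread>
#include <vector>

// Toy stand-in for warp-level coordination when one "warp" is split across
// cores: every participating core must reach the barrier before any proceeds.
struct CoreBarrier {
    explicit CoreBarrier(int n) : total(n), waiting(0), phase(0) {}
    void arrive_and_wait() {
        int my_phase = phase.load();
        if (waiting.fetch_add(1) + 1 == total) {   // last core to arrive
            waiting.store(0);
            phase.fetch_add(1);                    // release everyone
        } else {
            while (phase.load() == my_phase) { /* spin */ }
        }
    }
    const int total;
    std::atomic<int> waiting, phase;
};

int main() {
    const int cores = 4;                     // e.g. a 32-thread warp split 8 lanes/core
    CoreBarrier bar(cores);
    std::vector<std::thread> pool;
    for (int c = 0; c < cores; ++c)
        pool.emplace_back([&bar, c] {
            int first_lane = 8 * c;          // this core owns lanes [first_lane, first_lane + 8)
            (void)first_lane;                // per-core work elided in this sketch
            bar.arrive_and_wait();           // emulated warp-level convergence point
            // ... post-barrier work may assume all lanes of the warp have arrived ...
        });
    for (auto& t : pool) t.join();
}

A real lowering would of course use the device’s own mesh/NoC synchronization rather than host threads; the point is only that warp-level convergence becomes an explicit protocol.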
3.3 Divergence on a MIMD core (vectorized-warp pseudo)
If a Tenstorrent core can execute vector lanes, a common strategy is: “one core simulates one warp”.
Key idea: single instruction stream + lane masks.
mask = ALL_ONES
// if (x < 0) y = -x; else y = x;
pred = (x < 0) // per-lane predicate
mask_then = mask & pred
if (ANY(mask_then)) {
y[mask_then] = -x[mask_then]
}
mask_else = mask & !pred
if (ANY(mask_else)) {
y[mask_else] = x[mask_else]
}
mask = mask_then | mask_else
This is basically SIMT divergence/convergence expressed in a way that MIMD hardware can execute.
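To see the same discipline outside IR notation, here is a small self-contained C++ version that emulates one 32-lane warp with explicit per-lane predicates. A real backend would use the core’s vector mask registers rather than a bool array, so treat this purely as an intuition aid.

#include <array>
#include <cstdio>

constexpr int kLanes = 32;                       // one emulated warp
using Mask = std::array<bool, kLanes>;

// Executes "if (x < 0) y = -x; else y = x;" for all lanes with a single
// instruction stream, using per-lane masks instead of hardware divergence.
void abs_masked(const std::array<float, kLanes>& x, std::array<float, kLanes>& y) {
    Mask pred{};                                 // per-lane predicate
    for (int l = 0; l < kLanes; ++l) pred[l] = (x[l] < 0.0f);

    for (int l = 0; l < kLanes; ++l)             // "then" side: lanes where pred is true
        if (pred[l]) y[l] = -x[l];

    for (int l = 0; l < kLanes; ++l)             // "else" side: lanes where pred is false
        if (!pred[l]) y[l] = x[l];
}

int main() {
    std::array<float, kLanes> x{}, y{};
    for (int l = 0; l < kLanes; ++l) x[l] = (l % 2 == 0) ? -1.0f * l : 1.0f * l;
    abs_masked(x, y);
    std::printf("y[2]=%g y[3]=%g\n", y[2], y[3]); // prints 2 and 3
}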
4. Memory model mismatches: what is ‘shared memory’ on a different architecture?
On NVIDIA/AMD, shared memory (CUDA __shared__ / AMD LDS) is a fast on-chip scratchpad visible to all threads in the same block, with clear visibility semantics after a barrier.
On Tenstorrent, local memory is per-core, and cross-core sharing requires explicit DMA or mesh communication.
hetGPU’s abstraction layer must present a consistent model:
- map to real shared memory when hardware supports it
- emulate it when it doesn’t (with global memory or by constraining block placement)
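A minimal sketch of the emulation path, assuming (hypothetically) that the runtime carves a per-block slice out of ordinary device memory and pairs it with the block barrier. The struct and names below are illustrative, not hetGPU’s actual abstraction layer.

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative only: backing "shared memory" with plain device/global memory
// when the hardware has no block-visible scratchpad. All names are made up.
struct BlockSharedEmulation {
    std::vector<std::uint8_t> backing;   // one slice per resident block
    std::size_t bytes_per_block;

    BlockSharedEmulation(std::size_t resident_blocks, std::size_t bytes)
        : backing(resident_blocks * bytes), bytes_per_block(bytes) {}

    // The pointer a given block sees as its "shared memory" base address.
    void* base_for_block(std::size_t block_id) {
        return backing.data() + block_id * bytes_per_block;
    }
};

// After every emulated block barrier, the runtime must also guarantee that all
// threads of the block observe each other's writes to this region (barrier plus
// whatever fence/flush the target needs), otherwise the CUDA semantics break.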
5. ISA gaps: vote, shuffle, tensor ops
Vendor-specific binaries contain operations that don’t have 1:1 equivalents elsewhere:
- warp vote / ballot
- shuffle
- tensor core instructions
The approach is to capture these as IR-level primitives, then:
- use native ops when available
- fall back to software implementations when not
This is one reason the system prefers a semantic IR over raw machine-code translation.
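A hedged sketch of what that selection could look like at lowering time; the capability flags and enum are invented for illustration, and the software path is the one spelled out in 5.1 below.

// Hypothetical capability query used while lowering an IR-level BALLOT op.
struct DeviceCaps { bool native_ballot; bool native_shuffle; };

enum class BallotLowering { NativeVote, ScratchAndReduce };

BallotLowering pick_ballot_lowering(const DeviceCaps& caps) {
    // Prefer the hardware instruction; otherwise emit the software fallback.
    return caps.native_ballot ? BallotLowering::NativeVote
                              : BallotLowering::ScratchAndReduce;
}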
5.1 Example: vote/ballot without a native instruction (pseudo)
Take ballot(pred), which returns a bitmask of the lanes where pred is true.
On NVIDIA/AMD you can map this to native vote instructions. On a backend without an equivalent, a conservative fallback is “scratch + reduce”:
// team_size = 32, tid in [0..31]
scratch[tid] = pred ? 1 : 0
barrier()
mask = 0
for i in 0..31 {
if scratch[i] == 1 {
mask |= (1 << i)
}
}
barrier()
return mask
It’s slower than hardware vote, but semantics stay consistent across devices.
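The same fallback written as self-contained C++ for an emulated 32-lane team. On real hardware the scratch buffer and the barrier between the two phases would be the device’s own, so this is only a semantic sketch.

#include <array>
#include <cstdint>

constexpr int kTeam = 32;

// Software ballot: each lane publishes its predicate, then every lane rebuilds
// the same 32-bit mask. A barrier between the two phases stands in for whatever
// synchronization the target device provides.
std::uint32_t ballot_fallback(const std::array<bool, kTeam>& pred) {
    std::array<std::uint8_t, kTeam> scratch{};
    for (int lane = 0; lane < kTeam; ++lane)      // phase 1: publish
        scratch[lane] = pred[lane] ? 1 : 0;
    // barrier() would go here on real hardware
    std::uint32_t mask = 0;
    for (int lane = 0; lane < kTeam; ++lane)      // phase 2: reduce
        if (scratch[lane]) mask |= (1u << lane);
    return mask;
}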
5.2 Example: shuffle without a native instruction (pseudo)
For shfl(val, srcLane) (read a register value from another lane), you can also fall back to a shared scratch buffer:
scratch[tid] = val
barrier()
out = scratch[srcLane]
barrier()
return out
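And the analogous C++ sketch for the shuffle fallback, with srcLane given per lane and wrapped to the team size so an out-of-range index stays in bounds. Again, this is illustrative only; the scratch buffer and barrier would be device-side in practice.

#include <array>

constexpr int kShflLanes = 32;

// Software shuffle: every lane writes its value, a barrier makes the writes
// visible, then every lane reads from the lane index it was given.
std::array<float, kShflLanes> shfl_fallback(const std::array<float, kShflLanes>& val,
                                            const std::array<int, kShflLanes>& srcLane) {
    std::array<float, kShflLanes> scratch = val;   // phase 1: publish (plus barrier)
    std::array<float, kShflLanes> out{};
    for (int lane = 0; lane < kShflLanes; ++lane)  // phase 2: gather
        out[lane] = scratch[srcLane[lane] & (kShflLanes - 1)];
    return out;
}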
6. State capture and live migration: the “you can’t fake it later” part
Live migration needs device-independent state:
- registers
- program counter / control flow state
- divergence masks
- shared memory state
- global memory snapshot
There’s a real practical problem here: optimized low-level machine state often cannot be cleanly mapped back to a high-level state view.
So hetGPU leans on co-design:
- IR metadata for execution state
- well-defined safe points (e.g., barriers) for checkpointing
- serialization in an abstract, device-neutral format
This makes the problem tractable by limiting “where you can pause” to places where semantics are well-defined.
6.1 What a checkpoint needs to capture (shape only)
For cross-device resume, the state must be device-neutral. A simplified shape looks like:
Checkpoint {
kernel_id
grid_dim, block_dim
threads: [ { pc_ir, regs_ir[], pred_mask } ]
shared_mem_dump
global_mem_regions: [ {addr, size, bytes...} ]
}
The important part isn’t the exact fields; it’s that the checkpoint is defined at the IR level. Otherwise you only have vendor-specific machine state that can’t be reconstructed elsewhere.
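For concreteness, the same shape as a C++ struct. Field names and types are illustrative, not the actual hetGPU checkpoint format; the only real constraint is that everything is expressed in IR terms.

#include <cstdint>
#include <string>
#include <vector>

// Illustrative device-neutral checkpoint layout; state is expressed in terms of
// the IR (IR program counter, IR virtual registers), never SASS/GCN machine state.
struct ThreadStateIR {
    std::uint64_t pc_ir;                    // position in the hetIR program
    std::vector<std::uint64_t> regs_ir;     // IR-level virtual registers
    std::uint64_t pred_mask;                // active-lane / predicate mask
};

struct MemRegion {
    std::uint64_t addr;
    std::vector<std::uint8_t> bytes;        // region size is bytes.size()
};

struct Checkpoint {
    std::string kernel_id;
    std::uint32_t grid_dim[3], block_dim[3];
    std::vector<ThreadStateIR> threads;
    std::vector<std::uint8_t> shared_mem_dump;
    std::vector<MemRegion> global_mem_regions;
};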
6.2 Why “safe points” (like barriers) matter
In practice, “freeze at any instruction” is unrealistic across architectures.
A practical approach is:
- define resume-safe points in IR (barriers, explicit sync points)
- checkpoint only at those points
Only at those points is the state well-defined enough to serialize and restore.
6.3 Example: safe-point metadata in IR (pseudo)
If you want migration to be reliable, you typically need explicit “resume-safe” locations at the IR level.
// ... normal instructions ...
SAFE_POINT(id=42, reason="block_barrier")
barrier()
// ... normal instructions ...
Then the runtime can say: “checkpoint at the next safe point” instead of trying to freeze at an arbitrary instruction.
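A hedged sketch of what that runtime contract might look like: the scheduler raises a flag, and the lowered code only honors it when it reaches a safe point. All names are invented for illustration, and the actual state capture is elided.

#include <atomic>
#include <cstdio>

// Set by the host/scheduler when it wants to migrate this kernel.
std::atomic<bool> checkpoint_requested{false};

// Hypothetical hook the lowered code calls at every SAFE_POINT; state capture
// itself would serialize the IR-level Checkpoint from 6.1.
void on_safe_point(int safe_point_id) {
    if (checkpoint_requested.load(std::memory_order_acquire)) {
        std::printf("capturing checkpoint at safe point %d\n", safe_point_id);
        // ... serialize Checkpoint{...} and hand it to the runtime ...
        checkpoint_requested.store(false, std::memory_order_release);
    }
}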
6.4 Example: split a long kernel into resumable segments (pseudo)
An engineering-friendly pattern is to split a long-running kernel into segments, with explicit boundaries:
kernel_seg0(input, state_out) {
// do work...
barrier()
dump_state(state_out)
}
kernel_seg1(state_in, output) {
load_state(state_in)
// continue work...
}
Migration becomes a simple control flow:
- run kernel_seg0 on GPU A and export state
- move state_out (and required global memory) to GPU B
- run kernel_seg1 on GPU B
You give up “freeze-anywhere” flexibility, but you gain a state model that’s actually explainable and reproducible.
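On the host side, the segmented pattern reduces to ordinary orchestration code. Here is a sketch with invented helpers standing in for the real runtime calls; only the control flow is the point.

#include <cstdint>
#include <vector>

// Placeholder types and helpers; the real hetGPU runtime calls will differ.
struct Device {};
using Blob = std::vector<std::uint8_t>;

Blob run_seg0(Device&, const Blob& input)            { /* launch kernel_seg0 */ return input; }
Blob move_state(const Blob& state, Device&, Device&) { /* copy state + referenced global memory */ return state; }
Blob run_seg1(Device&, const Blob& state_in)         { /* launch kernel_seg1 */ return state_in; }

// The whole "migration" is plain host-side control flow over the segments.
Blob migrate_between(Device& gpuA, Device& gpuB, const Blob& input) {
    Blob state = run_seg0(gpuA, input);          // 1. run segment 0 on GPU A, export state
    Blob moved = move_state(state, gpuA, gpuB);  // 2. move the exported state to GPU B
    return run_seg1(gpuB, moved);                // 3. resume with segment 1 on GPU B
}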
7. Why existing approaches don’t fully solve it
- OpenCL/SYCL/HIP: source portability, not binary portability
- fat binaries: bundle multiple targets (still N builds)
- API interception (e.g., ZLUDA/Ocelot): complex surface area, partial coverage and correctness challenges
- checkpoint tools (CRIUgpu-like): usually same-vendor assumptions
hetGPU’s pitch is the combination: portable IR + dynamic lowering + cross-device state management.
8. What to take away (value and boundaries)
I read this direction as a system architecture statement:
- cross-vendor GPU binaries require a semantic IR, not just instruction translation
- SIMT vs MIMD is a first-class design problem
- live migration requires explicit state modeling and safe points
But the trade-offs are real:
- JIT cost
- portability often means giving up some vendor-specific peak optimizations
- some instructions/features may degrade to slower “portable semantics”
References
- hetGPU code: https://github.com/Multi-V-VM/hetGPU