Stack buffer overflows have been around since the 1980s, but it took the Morris worm (1988) to wake the industry up. The response was piecemeal — a compiler patch here, a kernel feature there — and each layer tells a story about what attackers were doing at the time.
I’ve spent more nights than I care to count staring at core dumps from production systems where one of these protections was silently disabled by a bad build flag. This article walks through each protection layer from the perspective of someone who’s debugged them failing, disabled them for testing, and watched attackers work around them.
The attack surface: what gets protected
A classic stack overflow overwrites the return address of a function:
[buf (64 bytes)] [saved ebp (4)] [return addr (4)] [args...]
↑ write past here ↑ overwritten ↑ hijacked
Each protection targets a different part of this chain, and the order they were introduced matters:
- Canaries: catch corruption before ret executes. First line of defense, but easily bypassed if the attacker can leak the canary value.
- NX: stop shellcode on the stack dead. This single feature forced the entire ROP ecosystem into existence.
- ASLR: make it hard to find anything. Forces attackers to leak addresses first.
- PIE: close the ASLR gap on the executable itself. Without it, ASLR is half useless.
- RELRO: lock down the GOT so you can’t hijack function pointers as a workaround.
- CFI/CET: the hardware-assisted endgame that’s still rolling out.
1. Compiler-based protections
StackGuard / Stack canary (-fstack-protector)
StackGuard was originally a GCC patch from Immunix. The idea is simple: put a random value (the canary) between local variables and the saved frame pointer/return address, check it before ret, and abort if it changed.
GCC gives you three levels:
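- -fstack-protector: only protects functions that call alloca or have local character arrays of 8+ bytes.
- -fstack-protector-strong: also protects functions with any local array or with locals whose address is taken (much broader, modest code size increase).
- -fstack-protector-all: every single function. Don’t use this in production unless you enjoy explaining 15% performance regressions to your manager.
On 32-bit x86 the canary is stored at %gs:0x14 in thread-local storage (%fs:0x28 on x86-64) and seeded at process start from kernel randomness (the AT_RANDOM bytes on current glibc, /dev/urandom on older setups). It deliberately contains a terminator byte: glibc zeroes the lowest byte of the random value, and the classic StackGuard terminator canary was built entirely from 0x00, 0x0a, 0x0d, and 0xff. The goal is to defeat string-based overflows: strcpy stops at a null, so it cannot write the canary back and carry on to the return address in one shot.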
push %ebp
mov %esp,%ebp
sub $0x??,%esp
mov %gs:0x14,%eax # load canary from TLS
mov %eax,-0x??(%ebp) # store it just below the saved %ebp and return address
# ... function body ...
mov -0x??(%ebp),%eax # load the stack copy back
xor %gs:0x14,%eax # compare with the TLS original
jne .L_stack_chk_fail # if different, call __stack_chk_fail and abort
leave
ret
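The prologue and epilogue logic above can be modeled in a few lines of Python. This is a toy sketch, not how GCC implements anything: the Frame class, the 64-byte buffer, and the frame layout are assumptions chosen to match the diagram earlier in the article.

```python
class StackSmashDetected(Exception):
    """Stands in for __stack_chk_fail aborting the process."""


class Frame:
    """Toy frame, low to high: [buf][canary][saved ebp][return addr]."""

    BUF_SIZE = 64  # hypothetical buffer size
    WORD = 4       # 32-bit words

    def __init__(self, master_canary: bytes, return_addr: int):
        self.master = master_canary
        self.mem = bytearray(self.BUF_SIZE + 3 * self.WORD)
        # Prologue: copy the master canary into the frame, just above buf.
        self.mem[self.BUF_SIZE:self.BUF_SIZE + self.WORD] = master_canary
        self.mem[-self.WORD:] = return_addr.to_bytes(self.WORD, "little")

    def unchecked_copy(self, data: bytes) -> None:
        """Models a strcpy-style write into buf with no bounds check."""
        data = data[:len(self.mem)]  # clamp to the region this model covers
        self.mem[:len(data)] = data

    def epilogue(self) -> int:
        """Check the in-frame canary, then 'return' to the stored address."""
        stored = bytes(self.mem[self.BUF_SIZE:self.BUF_SIZE + self.WORD])
        if stored != self.master:
            raise StackSmashDetected("canary changed")
        return int.from_bytes(self.mem[-self.WORD:], "little")
```

A 76-byte write of filler trips the check, while a payload that reproduces the canary bytes at offset 64 passes it and controls the return address, which is exactly the leak-then-overflow pattern.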
What actually bypasses canaries in practice:
I’ve seen three patterns in the wild:
- Leak the canary first. Format string bugs, uninitialized stack reads, or any info leak that reaches the canary position. Once you have the value, your overflow can reproduce it byte-for-byte. This is the most common bypass and it’s why canary + non-PIE binary is such a dangerous combination: one leak yields the canary, the gadget addresses are already known, and on a forking server every child inherits the same canary, so the attacker leaks once and reuses it across connections.
- Skip the return address entirely. Overwrite a function pointer (GOT entry, vfork callback, C++ vtable) that gets called before the current function returns. The canary check never fires because you never reach it. This is why RELRO matters: without it, you’re one GOT overwrite away from owning the process regardless of canary.
- Target __stack_chk_fail itself. If the binary has partial RELRO (default on most systems), the GOT entry for __stack_chk_fail is writable. Overwrite it with a gadget address, trigger the canary check, and your “abort” becomes a trampoline to your ROP chain. I found this in a CTF once and spent an embarrassing amount of time convincing myself it would actually work in production. It does.
Performance reality: -fstack-protector-strong adds roughly 1-3% code size overhead and negligible CPU cost on modern hardware (the canary load is in L1 cache). -fstack-protector-all can hit 10-15% on hot-path functions and I’ve seen it push functions past icache line boundaries, causing measurable slowdowns. Stick with -strong unless you have a specific reason not to.
StackShield
StackShield takes a different approach — copies the return address to a separate protected stack at function entry and restores it at exit. It uses the regular frame pointer for everything else. The main limitation: single-byte overflows that overwrite only the LSB of %ebp won’t be caught (the saved return address on the protected stack is still valid, but the frame pointer itself is corrupted, which can still be exploited through frame-pointer-dependent code paths).
I’ve never seen StackShield in production. It’s largely of historical interest.
2. Library-based protections
FormatGuard
A glibc patch that adds compile-time checking to *printf() format strings — compares format specifier count against argument count. If they mismatch, it logs and aborts. Won’t protect code that calls write() or syscall() directly, and it never made it into mainline glibc. You’ll only encounter it on heavily customized embedded systems.
Libsafe
Intercepts dangerous functions (strcpy, strcat, sprintf, gets, vsprintf) via LD_PRELOAD and replaces them with bounds-checked versions. The catch: it only protects against stack smashing and format string bugs. Heap overflows are untouched. Legacy systems running RHEL 4-era distributions sometimes still have this.
3. Non-executable stack (NX / DEP)
This is the single most impactful protection ever added. It’s also the one that caused the most breakage when it shipped.
Software-based: Linux kernel patches
Before CPUs had NX support, kernel patches tried to get the same effect through creative means:
- Solar Designer’s kernel patch (1997): reduced the code segment limit so that the stack at high addresses fell outside the executable range. Any ret to a stack address triggered a GPF. Clever, but it meant you couldn’t have more than ~2.7GB of RAM total (the CS limit was 0xABBBFFFF or similar). I inherited a system with this patch once and spent a day figuring out why we could only address 2.7GB of 4GB RAM.
- Exec-shield (Red Hat): maintained a per-process “executable bound.” Everything above it (stack, mmap, heap) was non-executable by default. You could toggle it with sysctl -w kernel.exec-shield=0. This shipped in RHEL3 and broke a lot of JIT compilers, nested function trampolines, and old BSD-ported code that used mprotect on stack pages.
- kNoX, RSX: similar approaches for earlier 2.4 kernels. If you’re running anything this old, you have bigger problems.
Hardware-based: NX bit (AMD) / XD bit (Intel)
AMD added the NX bit in K8 (2003). Intel followed with Prescott (2004). The kernel sets a single bit in the page table entry. If the page has the NX bit set and %rip points into it, you get a page fault.
Check NX status:
$ dmesg | grep -iE 'execute|NX'
NX (Execute Disable) protection: active
The ELF GNU_STACK header controls whether the loader marks the stack executable at runtime:
$ readelf -lW ./a.out | grep GNU_STACK
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RWE 0x10
RWE = executable stack. RW- = non-executable. If your binary has RWE in production, you’d better have a very good reason (JVM with closed-source JIT, old Lisp machine runtime, that one vendor library from 2004).
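If readelf isn’t available on the box you’re auditing, the same check can be done by reading the program headers directly. The sketch below is a minimal parser, assuming a well-formed ELF file; the function name gnu_stack_flags is made up for this example, and the field offsets follow the standard ELF32/ELF64 header layouts.

```python
import struct

PT_GNU_STACK = 0x6474E551
PF_X, PF_W, PF_R = 1, 2, 4


def gnu_stack_flags(elf: bytes):
    """Return 'RW-', 'RWE', etc. for PT_GNU_STACK, or None if it is absent."""
    if elf[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is64 = elf[4] == 2                  # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    end = "<" if elf[5] == 1 else ">"   # EI_DATA: 1 = little-endian
    if is64:
        (phoff,) = struct.unpack_from(end + "Q", elf, 32)
        phentsize, phnum = struct.unpack_from(end + "HH", elf, 54)
        flags_off = 4                   # p_flags follows p_type in ELF64
    else:
        (phoff,) = struct.unpack_from(end + "I", elf, 28)
        phentsize, phnum = struct.unpack_from(end + "HH", elf, 42)
        flags_off = 24                  # p_flags sits near the end in ELF32
    for i in range(phnum):
        base = phoff + i * phentsize
        (p_type,) = struct.unpack_from(end + "I", elf, base)
        if p_type == PT_GNU_STACK:
            (flags,) = struct.unpack_from(end + "I", elf, base + flags_off)
            return "".join(
                ch if flags & bit else "-"
                for ch, bit in (("R", PF_R), ("W", PF_W), ("E", PF_X))
            )
    return None
```

Typical use would be gnu_stack_flags(open("./a.out", "rb").read()). A None result means the header is missing, in which case the kernel falls back to an architecture-specific default, so treat it as a finding rather than a pass.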
The NX bypass that changed everything: ROP
NX forced attackers to stop injecting shellcode on the stack. The immediate response was return-to-libc — call system("/bin/sh") by pointing the return address directly at system() in libc with the argument set up on the stack.
But return-to-libc is limited. You can chain at most a few libc calls. ROP (Return-Oriented Programming) blew that open: find short code sequences ending in ret anywhere in the binary or libraries (gadgets), chain them to execute arbitrary logic. Every ret pops the next gadget address from the stack and jumps to it.
The NX → ROP transition is the single most important evolution in exploitation technique in the last 20 years. Every bypass after this is a variation on ROP.
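To make the “every ret pops the next gadget address” mechanic concrete, here is a harmless toy interpreter. Nothing in it touches real memory: the gadget addresses, the fake_system stand-in, and the register file are all hypothetical, and the stack is just a Python list.

```python
def run_chain(stack, gadgets, max_steps=16):
    """Pop 'return addresses' off the stack and run the gadget at each one."""
    regs = {"rdi": None}
    calls = []
    for _ in range(max_steps):
        if not stack:
            break
        addr = stack.pop(0)      # 'ret': pop the next address and jump to it
        if addr not in gadgets:
            break                # a real CPU would fault here
        gadgets[addr](regs, stack, calls)
    return regs, calls


def pop_rdi(regs, stack, calls):
    """Models the gadget 'pop rdi; ret': load the next stack word into rdi."""
    regs["rdi"] = stack.pop(0)


def fake_system(regs, stack, calls):
    """Stand-in for system(); it only records that it was 'called'."""
    calls.append(("system", regs["rdi"]))


# Hypothetical gadget addresses in a non-PIE binary.
GADGETS = {0x401234: pop_rdi, 0x401300: fake_system}

# The attacker-controlled stack: gadget, its operand, then the target.
chain = [0x401234, 0x601060, 0x401300]
regs, calls = run_chain(list(chain), GADGETS)
```

The attacker never injects code here. The chain is pure data, which is why a non-executable stack does nothing to stop it.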
Practical NX gotchas I’ve hit:
- JIT code on the heap: If you have a JIT that writes to mmap’d memory, make sure it requests executable pages explicitly (PROT_READ | PROT_WRITE | PROT_EXEC in the mmap protection argument). Some JITs assume the heap is executable (it hasn’t been, by default, for a decade). If your JIT dies with SIGSEGV and the instruction pointer is in a heap page, this is why.
- mprotect as an attack primitive: If the attacker controls an mprotect call (via ROP), they can mark any page executable and skip NX entirely. This is hard to prevent without PaX-level MPROTECT restrictions.
- Signal handler trampolines: Old kernels used the stack for sigreturn trampolines. If the stack was non-executable, signal delivery broke. Fixed in 2.6.x by moving trampolines to the vsyscall page (today’s vDSO).
4. ASLR (Address Space Layout Randomization)
NX forced ROP, and ROP forced knowledge of where code lives. ASLR denies that knowledge.
ASLR levels on Linux
Controlled by /proc/sys/kernel/randomize_va_space:
| Value | Meaning |
|---|---|
| 0 | Disabled |
| 1 | Partial — stack, mmap, shared libraries randomized |
| 2 | Full — above + brk-based heap randomized |
The knob has existed since kernel 2.6.12, and 2 is the default on any kernel you are likely to meet today. You’ll practically never see 1 in the wild unless someone explicitly set it.
$ cat /proc/sys/kernel/randomize_va_space
2
Disable for testing (don’t do this on a production system unless you enjoy explaining security incidents):
$ sudo sysctl -w kernel.randomize_va_space=0
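For fleet audits it helps to turn that number into something a report can show. A minimal sketch, assuming the three levels from the table above; describe_aslr and current_aslr are names invented for this example.

```python
from pathlib import Path

ASLR_LEVELS = {
    0: "disabled",
    1: "partial: stack, mmap, shared libraries",
    2: "full: partial + brk heap",
}


def describe_aslr(raw: str) -> str:
    """Map the sysctl value (as read from procfs) to a readable label."""
    level = int(raw.strip())
    return ASLR_LEVELS.get(level, f"unknown level {level}")


def current_aslr(path: str = "/proc/sys/kernel/randomize_va_space"):
    """Return the label for this machine, or None where procfs is missing."""
    p = Path(path)
    return describe_aslr(p.read_text()) if p.exists() else None
```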
What actually gets randomized
- Stack: random offset applied at exec. 22 bits of entropy on x86-64, 11 bits on 32-bit.
- mmap base: where ld.so, shared libraries, vdso, and mmap() allocations land. ~28 bits on x86-64, 8 bits on 32-bit (kernel defaults, tunable through vm.mmap_rnd_bits).
- Heap: brk base gets a random offset only at level 2.
- Executable: only randomized if compiled as PIE. This is the most commonly overlooked gap: for a non-PIE binary, readelf -h reports type EXEC and a fixed entry point.
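Those entropy figures translate directly into brute-force cost. The sketch below assumes every randomized position is equally likely; the rerandomized flag distinguishes a forking server (layout stays fixed, so the attacker never repeats a guess) from a service that re-execs and gets a fresh layout on every attempt. The function name is made up for this example.

```python
def guesses(bits: int, rerandomized: bool = False):
    """Return (search space, expected number of guesses) for `bits` of entropy.

    Fixed layout: guessing without repetition averages (N + 1) / 2 attempts.
    Fresh layout per attempt: each guess succeeds with probability 1/N,
    so the expected count is N.
    """
    space = 2 ** bits
    expected = float(space) if rerandomized else (space + 1) / 2
    return space, expected
```

With the numbers above, guesses(8) gives a search space of 256 and guesses(28) gives 268,435,456, which is the gap between an afternoon of reconnecting and an attack that is noisy enough to get noticed.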
Checking on a running process
$ cat /proc/1234/maps | head
559a7a0e9000-559a7a0ea000 r-xp ... /bin/pie-binary # moves between runs
7f8c4a0e9000-7f8c4a0ea000 r-xp ... /lib/libc.so # moves between runs
7ffc9a0e9000-7ffc9a0ea000 rw-p ... [stack] # moves between runs
Without ASLR, these addresses are identical every time. A non-PIE binary shows a fixed address for its text segment, which means the attacker knows gadget addresses without any leak.
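Comparing maps output by eye gets old quickly. A small parser makes it easy to diff base addresses between runs; this is a sketch that assumes the usual six-column /proc/<pid>/maps layout, and the Mapping and base_of names are invented here.

```python
from typing import NamedTuple, Optional


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    path: str


def parse_maps_line(line: str) -> Mapping:
    """Parse one line: 'start-end perms offset dev inode [path]'."""
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(start, end, fields[1], path)


def base_of(maps_text: str, needle: str) -> Optional[int]:
    """Start address of the first mapping whose path contains `needle`."""
    for line in maps_text.splitlines():
        if not line.strip():
            continue
        m = parse_maps_line(line)
        if needle in m.path:
            return m.start
    return None
```

Run the target twice, feed each process’s maps text to base_of(text, "libc"), and compare: identical values across runs mean that region is not being randomized.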
ASLR bypasses I’ve seen in the wild
- Information leak is the only game in town. Format strings, uninitialized stack reads, side-channel timing, speculative execution (Spectre). If the attacker can read any address from the process, ASLR collapses. This is why hardening guides that only enable ASLR without also enabling PIE and Full RELRO are dangerously incomplete.
- 32-bit entropy is laughable. 8 bits of mmap entropy on 32-bit x86 means guessing a libc base address takes at most 256 tries, and even the 11-bit stack offset falls within 2048. On a forking server (Apache prefork, old SSH), the attacker can brute-force across connection attempts. This is why 32-bit systems were abandoned for security-sensitive deployments years ago.
- Partial overwrite. A single-byte overflow on the return address only changes the low byte. Randomization is page-granular, so the low bits are never randomized, and the high bytes (which carry ASLR entropy) stay intact. If the attacker can reach a nearby function that does something useful, they don’t need to know the full address.
- vsyscall page. Older kernels mapped a vsyscall page at a fixed address (0xffffffffff600000). Even with ASLR and PIE, this page was always at the same address, and it contained useful gadgets (ret, syscall). That is what the ret2vsyscall technique exploited. Modern kernels replaced vsyscall with vdso (properly randomized) or use emulation mode (trap-and-emulate, very slow).
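The partial-overwrite trick is easier to believe once you see the arithmetic. This sketch assumes page-granular randomization (4 KiB pages) and uses made-up offsets: a legitimate return site at offset 0x1234 inside a module and an attacker-preferred target at 0x12a0 in the same 256-byte window.

```python
import random

PAGE = 0x1000


def overwrite_low_byte(addr: int, new_byte: int) -> int:
    """Model a one-byte overflow on a little-endian return address."""
    return (addr & ~0xFF) | (new_byte & 0xFF)


RETURN_SITE = 0x1234   # hypothetical offset of the real return address
TARGET = 0x12A0        # hypothetical offset of the code the attacker wants

# Whatever page-aligned base ASLR picks, the same single byte works.
for _ in range(1000):
    base = random.randrange(1 << 28) * PAGE
    hijacked = overwrite_low_byte(base + RETURN_SITE, TARGET & 0xFF)
    assert hijacked == base + TARGET
```

The attacker never learns base. The entropy lives entirely in bytes the overflow leaves alone.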
5. PIE (Position Independent Executable)
Without PIE, the executable’s text segment is at a fixed address. ASLR randomizes libraries and stack, but the main binary sits at the same spot every time. The attacker reads gadgets from the binary (prologue, __libc_csu_init, etc.) without needing any leak.
Check:
$ readelf -h ./a.out | grep Type
Type: DYN (Shared object file) ← PIE
Type: EXEC (Executable file) ← not PIE
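The same check is a two-byte read if you’d rather script it. A minimal sketch, assuming a well-formed ELF header; elf_type is a name invented for this example. Note that DYN is also the type of ordinary shared libraries, so this only tells you an executable is position-independent if you already know the file is an executable.

```python
import struct

ET_EXEC, ET_DYN = 2, 3


def elf_type(header: bytes) -> str:
    """Return 'EXEC', 'DYN', or 'other (n)' from the ELF e_type field."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    end = "<" if header[5] == 1 else ">"   # EI_DATA: 1 = little-endian
    # e_type sits at offset 16 in both ELF32 and ELF64 headers.
    (e_type,) = struct.unpack_from(end + "H", header, 16)
    return {ET_EXEC: "EXEC", ET_DYN: "DYN"}.get(e_type, f"other ({e_type})")
```

Reading the first 18 bytes of the file is enough: elf_type(open("./a.out", "rb").read(18)).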
Performance cost of PIE:
PIE is cheap on x86-64, typically low single-digit percent, because RIP-relative addressing makes position-independent code nearly free. On x86-32 it is more like 10-15%, due to register pressure: PIC on 32-bit needs an extra register for the GOT base. This is measurable but rarely critical. The one exception: databases and latency-sensitive network services where every microsecond counts. I’ve seen organizations pin binaries to specific addresses to avoid PIE overhead, and then get owned because a non-PIE binary made ASLR bypass trivial.
Modern distributions default to PIE. If you’re building from source with an old toolchain, -no-pie is the default and you must explicitly pass -fpie -pie. I’ve audited Docker images where the base image was Debian Buster but the app was compiled on Ubuntu 16.04 with no PIE flags. Check your build containers.
6. RELRO (RELocation Read-Only)
RELRO controls GOT writability. This matters because GOT overwrite is the standard way to bypass canary + NX: you overwrite a frequently-called function’s GOT entry (say, strlen or free) to point at system(), and the next call is your shell.
Three states:
- No RELRO: entire GOT writable. -Wl,-z,norelro. Almost no modern toolchain defaults to this.
- Partial RELRO: .got (global variables) is made read-only, but .got.plt (function pointers used by lazy binding) remains writable. This is the GCC/Clang default.
- Full RELRO: -Wl,-z,relro,-z,now. All relocations resolved at load time, then the entire GOT is mprotected to read-only. Lazy binding is disabled.
Check:
$ readelf -l ./a.out | grep GNU_RELRO
GNU_RELRO 0x000e78 0x00000000000e78 ... R 0x1
When Full RELRO breaks things:
- Startup time increases because all symbols are resolved at load time instead of on first call. For large C++ applications with thousands of symbols, this can add 200-500ms to startup. Usually fine. Not fine for CLI tools that need to start and exit in <50ms.
- Plugins or dlopen’d libraries that depend on lazy binding will break. If you have a plugin system that loads .so files with RTLD_LAZY, Full RELRO on the main binary forces immediate resolution, which can fail if the plugin introduces new symbol dependencies.
- Glibc’s dlsym still works: Full RELRO doesn’t prevent runtime symbol lookup, it just makes the GOT read-only after initialization.
A gotcha I hit: Partial RELRO + Full ASLR + PIE + NX looks like a hardened binary. But one format string bug in a server that uses lazy binding gives the attacker a writable .got.plt. They overwrite free → system, wait for the next free(buf) where buf contains /bin/sh, and you’re done. Full RELRO closes this, but most build systems don’t add -z now by default.
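Since the Partial vs Full distinction is the one people get wrong, it’s worth scripting. The sketch below classifies a binary from text you’ve already captured with readelf -lW and readelf -d. It assumes the usual binutils output, where BIND_NOW appears either as its own dynamic tag, as a FLAGS value, or as NOW on the FLAGS_1 line; classify_relro is a name made up for this example.

```python
import re

# Matches a BIND_NOW tag/flag, or NOW on the same line as FLAGS_1.
_NOW = re.compile(r"\bBIND_NOW\b|\(FLAGS_1\).*\bNOW\b")


def classify_relro(program_headers: str, dynamic_section: str) -> str:
    """Classify RELRO from `readelf -lW` and `readelf -d` output text."""
    if "GNU_RELRO" not in program_headers:
        return "No RELRO"
    return "Full RELRO" if _NOW.search(dynamic_section) else "Partial RELRO"
```

Anything that comes back as Partial RELRO still has a writable .got.plt, no matter how hardened the rest of the checklist looks.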
7. Kernel-level MAC and hardening
PaX
PaX is the comprehensive kernel patch that goes beyond what mainline Linux offers. It’s part of Grsecurity (the maintained version). Key features that matter:
- MPROTECT: you cannot create a mapping that is both writable and executable at the same time. This stops the common “allocate RWX memory, write shellcode, jump there” pattern used by JIT spray and shellcode loaders.
- RANDMMAP: randomizes the mmap base address independently per call, not just per-exec. Even if an attacker leaks one allocation address, they can’t infer others.
- Kernel const hardening: syscall table, IDT, GDT marked read-only after initialization.
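The MPROTECT rule is simple enough to state as a predicate. This is a simplified model loosely following the PaX documentation, not the kernel’s actual logic: it rejects any request for a page that is writable and executable at once, and it rejects adding execute permission to a mapping that wasn’t executable already. The function name is invented; the PROT_* values are the Linux ones from sys/mman.h.

```python
PROT_READ, PROT_WRITE, PROT_EXEC = 0x1, 0x2, 0x4  # Linux <sys/mman.h> values


def mprotect_allowed(current: int, requested: int) -> bool:
    """Would a PaX-MPROTECT-style policy allow changing current -> requested?"""
    if requested & PROT_WRITE and requested & PROT_EXEC:
        return False  # never writable and executable at the same time
    if requested & PROT_EXEC and not current & PROT_EXEC:
        return False  # data pages cannot become code later
    return True


RW = PROT_READ | PROT_WRITE
RX = PROT_READ | PROT_EXEC
RWX = RW | PROT_EXEC
```

Under this model both mprotect_allowed(RW, RWX) and mprotect_allowed(RW, RX) come back False, which is exactly why JITs need an exception: write-then-flip-to-executable is how they work.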
Standalone PaX is dead. Grsecurity is the only way to get it, and it requires a paid subscription for modern kernels.
PaX in practice: I’ve run Grsecurity kernels in production for about three years. The only breakage was a JVM that used mprotect to set JIT code pages to RWX. We had to add a PaX exception for the Java process. If you don’t have legacy JIT or binary blob drivers, PaX is remarkably compatible.
Grsecurity
PaX + MAC (Mandatory Access Control). Key features:
- /tmp hardening against symlink races.
- ptrace restricted to child processes only.
- Chroot jail hardening.
- Network stack hardening.
SELinux
Not a buffer overflow protection per se, but it limits blast radius. If a process is compromised, SELinux restricts what it can do (execute code in certain directories, write to certain files, connect to certain ports). Android’s app sandbox leans heavily on SELinux for exactly this kind of containment; Chrome on desktop Linux gets a similar effect mainly from namespaces and seccomp-bpf.
SELinux is not a substitute for the protections above — it’s a complement. A binary with no canary, no NX, no ASLR, and no PIE is still trivially exploitable even under SELinux. You need both.
8. Modern CFI (Control Flow Integrity)
Hardware-assisted forward-edge and backward-edge CFI. This is where the industry is heading.
Shadow Stack (CET)
A separate, hardware-protected stack that stores return addresses. On CALL, the return address is pushed to both the regular stack and the shadow stack. On RET, the CPU compares them. Mismatch → #CP (Control Protection) fault.
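Ordinary stores can’t touch the shadow stack: its pages carry a special page-table marking, and only CALL, RET, and a handful of dedicated shadow-stack instructions may write to them. Even with an arbitrary read/write primitive built from normal loads and stores, the attacker can’t change shadow stack entries.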
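The CALL/RET behavior described above can be modeled in a few lines. This is a toy, assuming nothing about the real microarchitecture: the class and exception names are invented, and “hardware protection” is reduced to the convention that attacker code only ever touches the stack list.

```python
class ControlProtectionFault(Exception):
    """Stands in for the #CP fault raised on a shadow stack mismatch."""


class ShadowStackCPU:
    def __init__(self):
        self.stack = []     # regular stack: attacker-writable in this model
        self._shadow = []   # shadow stack: only call() and ret() touch it

    def call(self, return_addr: int) -> None:
        """CALL pushes the return address to both stacks."""
        self.stack.append(return_addr)
        self._shadow.append(return_addr)

    def ret(self) -> int:
        """RET pops both copies and faults if they disagree."""
        addr = self.stack.pop()
        expected = self._shadow.pop()
        if addr != expected:
            raise ControlProtectionFault(
                f"return to {addr:#x}, shadow stack says {expected:#x}"
            )
        return addr
```

Overwriting cpu.stack[-1] after a call, which is all a classic overflow can do, makes the next ret() raise instead of transferring control.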
IBT (Indirect Branch Tracking)
Marks valid indirect jump targets with ENDBRANCH instructions (ENDBR64 or ENDBR32). If an indirect jump or call lands on an instruction that isn’t ENDBRANCH, the CPU faults. This stops JOP/COP chains that jump into the middle of functions; ret-based ROP is the shadow stack’s job, since IBT does not check returns.
Current state: Intel CET ships in Tiger Lake and later (2020+). AMD supports Shadow Stack since Zen 3 (2020) but does not implement IBT. Linux kernel support: 5.18+ for kernel-mode IBT, 6.6+ for userspace Shadow Stack. glibc has carried CET build support since 2.28.
Reality check: Most x86-64 systems in production as of 2026 don’t have CET-capable CPUs. Shadow Stack especially is rare in cloud environments, where EC2 instances with CET support are still a minority. That said, IBT is becoming more common (Sapphire Rapids Xeons ship with it). If you’re building for a modern fleet, enable CET at compile time (-fcf-protection=full on GCC). It’s forward-compatible: binaries with CET markings run fine on pre-CET hardware (the markings are NOPs on older CPUs).
9. Checking protections on a binary
I keep this checklist handy — I’ve lost count of how many times a “hardened” binary was missing one of these:
# PIE
readelf -h ./a.out | grep Type
# NX stack
readelf -lW ./a.out | grep GNU_STACK
# RELRO (check if NOW is set — Partial vs Full)
readelf -a ./a.out | grep BIND_NOW
readelf -l ./a.out | grep GNU_RELRO
# Canary
objdump -d ./a.out | grep __stack_chk_fail
# ASLR (system-wide)
cat /proc/sys/kernel/randomize_va_space
Summary table:
| Protection | Check | Disable (testing only) |
|---|---|---|
| Canary | `objdump -d \| grep __stack_chk_fail` | `-fno-stack-protector` |
| NX | `readelf -lW \| grep GNU_STACK` | `-z execstack` |
| PIE | `readelf -h \| grep Type` | `-no-pie` |
| RELRO | `readelf -l \| grep GNU_RELRO` plus `readelf -d \| grep BIND_NOW` | `-Wl,-z,norelro` |
| ASLR | `/proc/sys/kernel/randomize_va_space` | `sysctl -w kernel.randomize_va_space=0` |
Putting it all together: what a production binary should have
A modern, hardened x86-64 binary:
- PIE (Type DYN)
- Full RELRO (GNU_RELRO segment + BIND_NOW)
- NX stack (GNU_STACK marked RW-)
- Canary (-fstack-protector-strong)
- CET (-fcf-protection=full) if your target supports it
- ASLR must be at level 2 system-wide
If any of these are missing, you have a gap. Whether that gap is exploitable depends on the rest of your attack surface, but I’ve learned the hard way that attackers find the missing piece faster than you think.
References
- PaX documentation: https://pax.grsecurity.net/
- Grsecurity features: https://grsecurity.net/features.php
- Intel CET spec: https://www.intel.com/content/www/us/en/developer/articles/technical/technical-look-control-flow-enforcement-technology.html
- Linux ASLR docs: https://www.kernel.org/doc/html/latest/admin-guide/sysctl/kernel.html#randomize-va-space