Dynamic linking is something no one thinks about until their binary won’t start, a library upgrade breaks production, or a seemingly innocent function call turns into a performance hotspot. I’ve debugged each of those scenarios enough times that the ELF structures and loader behavior are burned into my memory. This article walks through the machinery from the ground up — not as a reference manual, but as someone who has traced through PLT stubs with gdb more times than I’d like to admit.
You will hit these problems
A few scenarios where knowing dynamic linking saves your day:
- ldd shows the library is found, but the binary crashes with undefined symbol. The library was updated and the SONAME changed, but the old symlink wasn’t updated.
- A shared library function mysteriously adds 50ms latency on first call because it triggers lazy binding in a latency-sensitive path.
- You LD_PRELOAD a debug library, but the symbols don’t intercept. The application calls through PLT, and the preloaded library doesn’t export the expected symbol names.
- A GOT overwrite exploit works in your test environment but fails in production because you tested against a Partial RELRO build and the production binary has Full RELRO.
These are not theoretical. I’ve hit each one.
Why dynamic linking costs real performance
Static linking merges all code into one binary. Every function call is a direct call instruction. Every global variable access is a direct memory reference. The instruction encoding is minimal: call offset takes 5 bytes, mov addr, %reg takes 5–7 bytes.
Dynamic linking adds indirection. For every symbol imported from a shared library:
- Global variables: compiled to load from GOT → dereference instead of load from absolute address. That’s one extra memory load per access; on a cache miss, that’s 50–100ns.
- Function calls: compiled to call PLT_stub → jmp *(GOT+n). After resolution, that’s one extra indirect jump (which the CPU’s branch predictor may or may not handle well). On first call, it’s hundreds of instructions in the dynamic linker.
The often-quoted 1–5% overhead is real but misleading. In CPU-bound code that calls library functions in a hot loop, the overhead can be 10–15% because of GOT register pressure (on 32-bit x86 you lose one register to hold the GOT base) and PLT-related icache pollution. In I/O-bound or startup-dominated code, you won’t notice. The worst case is a latency-sensitive path that triggers lazy binding at runtime: the first call can take tens of microseconds as the dynamic linker resolves symbols.
If you’re building high-performance shared libraries and every instruction matters, consider whether those functions really need to be exported, or whether you can use -fvisibility=hidden and export only the API surface. Many projects export every symbol by default (-fvisibility=default), and the linker has to generate PLT entries and GOT slots for all of them, even if they’re only used internally.
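As a rough sketch of that export discipline, here is how a hypothetical libstack-style source file might mark its API. The names STACK_API, STACK_LOCAL, stack_push, stack_pop and stack_has_room are illustrative assumptions, not taken from any real project. The helper is marked hidden explicitly so the example behaves the same with or without -fvisibility=hidden on the command line.

```c
/* Hypothetical library source; one possible build line:
 *   gcc -shared -fPIC -fvisibility=hidden stack.c -o libstack.so
 * All identifiers below are made up for illustration. */
#define STACK_API   __attribute__((visibility("default")))
#define STACK_LOCAL __attribute__((visibility("hidden")))
#define STACK_CAPACITY 16

static int storage[STACK_CAPACITY];
static int depth;

/* Hidden: kept out of the dynamic symbol table, so calls from inside
 * the library can bind directly instead of going through a PLT stub. */
STACK_LOCAL int stack_has_room(void)
{
    return depth < STACK_CAPACITY;
}

/* Exported: returns 0 on success, -1 if the stack is full. */
STACK_API int stack_push(int value)
{
    if (!stack_has_room())
        return -1;
    storage[depth++] = value;
    return 0;
}

/* Exported: stores the top value in *out; returns 0 on success, -1 if empty. */
STACK_API int stack_pop(int *out)
{
    if (depth == 0)
        return -1;
    *out = storage[--depth];
    return 0;
}
```

After building it as a shared object, nm -D libstack.so should list stack_push and stack_pop but not stack_has_room.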
PIC: Position Independent Code
Why PIC exists
ASLR randomizes where shared libraries are loaded. A library cannot know its base address at compile time. Without PIC, every absolute address would need a relocation entry, and the loader would have to patch every instruction at load time. PIC avoids that by making all memory references relative to the current instruction pointer or indirect through a table.
What it costs you
Let me show the actual cost with real instruction sequences. Consider a simple variable access:
extern int top;
void increment(void) { top++; }
Without PIC (compiled for a fixed-address executable):
mov 0x804a010, %eax ; 5 bytes, absolute address encoded in instruction
add $0x1, %eax
mov %eax, 0x804a010 ; 5 bytes
With PIC (compiled as position-independent):
mov -0xc(%ebx), %eax ; 3 bytes, loads the address of top from its GOT slot
mov (%eax), %eax ; loads the value of top through that address
lea 0x1(%eax), %edx ; increment
mov -0xc(%ebx), %eax ; reload GOT entry (register pressure!)
mov %edx, (%eax) ; store through the address from the GOT
Five instructions instead of three. Three memory loads (two of them from the GOT) instead of one. And %ebx is pinned to the GOT base for the entire function, so you lose a general-purpose register. In register-pressure-heavy code, this forces spilling, which adds more memory traffic.
On x86-64, rip-relative addressing improves this somewhat:
mov top@GOTPCREL(%rip), %rax ; load the address of top from its GOT slot, RIP-relative
addl $0x1, (%rax) ; increment top in place through that address
No GOT base register is needed, but the indirection still exists: the GOT slot holding the address of top is filled in by the loader at load time (an R_X86_64_GLOB_DAT relocation), and every access to top goes through it.
The __i686.get_pc_thunk hack
On 32-bit x86, there is no instruction to read eip directly. Compilers work around this with a trick:
call __i686.get_pc_thunk.bx ; pushes return address = current EIP
add $0x1b6c, %ebx ; now ebx = GOT base
The thunk function is trivial:
__i686.get_pc_thunk.bx:
mov (%esp), %ebx ; copy return address (which IS eip) to ebx
ret
Every function that accesses global data or calls an imported function pays this overhead at entry. On x86-64, the thunk call is not needed because [rip + offset] addressing is native.
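The same trick can be imitated from C, as a rough illustration rather than anything the compiler emits. A function that is never inlined can hand back its own return address, which is the caller’s instruction pointer just after the call. This sketch relies on the GCC/Clang builtin __builtin_return_address; the names current_pc, anchor and pc_to_anchor_distance are made up for the example.

```c
#include <stdint.h>

/* Returns the address of the instruction following the call site,
 * the same value the 32-bit thunk copies into %ebx.
 * noinline keeps a real call instruction in place. */
__attribute__((noinline)) void *current_pc(void)
{
    return __builtin_return_address(0);
}

/* Any object that lives in the same module as the code above. */
static int anchor;

/* Distance from one fixed code location to a data object in the same
 * module. The linker knows this constant even though the absolute
 * addresses change with every ASLR layout, which is what lets the
 * compiler emit "add $constant, %ebx" after the thunk call. */
__attribute__((noinline)) intptr_t pc_to_anchor_distance(void)
{
    return (intptr_t)((uintptr_t)&anchor - (uintptr_t)current_pc());
}
```

Printing pc_to_anchor_distance() across several runs of a PIE binary should give the same number each time, while the raw value of current_pc() moves around.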
GOT and PLT: the runtime resolution layer
GOT (Global Offset Table)
The GOT is an array of pointers, one per imported symbol and per internal data reference that needs relocation. It lives in its own sections (.got and .got.plt). The key things to understand about the GOT:
- It is writable (under Partial RELRO). The dynamic linker writes resolved addresses into it.
- It is an attack surface. If an attacker can write to the GOT, they can redirect any library call. This is why Full RELRO exists.
- It is per module and per process. The executable and each shared library carry their own GOT, and because it is writable data, each process that loads a shared library gets a private copy of that GOT in its own address space.
For variables, the GOT entry is filled in at load time by the dynamic linker (before main() runs). These are R_*_GLOB_DAT relocations. The loader reads the symbol’s actual address from the library’s data section and writes it into the GOT.
For functions, the GOT entry may be resolved lazily — or not, depending on the BIND_NOW setting.
PLT (Procedure Linkage Table)
The PLT is a set of trampoline stubs, one per imported function. On x86-64, a typical entry:
push@plt:
jmp *GOT[push](%rip) ; via GOT
push $reloc_index ; identify function to resolver
jmp plt0 ; call resolver
PLT[0] (the first entry, also called the resolver stub) is shared by all functions. It pushes the GOT[1] link map pointer and jumps to the resolver function pointed to by GOT[2].
When I first traced this with gdb, the penny-drop moment was seeing the GOT entry point back into the PLT itself — a self-referential pointer that makes lazy binding a single atomic swap to activate.
Lazy binding, step by step
Here is the exact sequence when main() calls push() from libstack.so for the first time:
Step 1: Call to PLT
main:
call 80483d8 <push@plt> ; call PLT stub
Step 2: PLT stub reads GOT
80483d8 jmp *0x804a008 ; GOT entry for push
GOT[push] currently holds 0x80483de — the address of the next instruction in the PLT itself. The jmp falls through.
Step 3: Push relocation index and call resolver
80483de push $0x10 ; relocation index (offset into .rel.plt)
80483e3 jmp 80483a8 ; PLT[0] → dynamic linker
Step 4: Dynamic linker resolves
The dynamic linker (ld-linux.so.2 or ld-linux-x86-64.so.2) receives the relocation index, looks it up in the .rel.plt table, finds the symbol name push, searches the loaded shared objects, finds libstack.so, computes the runtime address, and writes it to GOT[push]:
Before resolution: GOT[push] = 0x80483de (placeholder, points back into the PLT)
After resolution: GOT[push] = 0xb803f47c (real address)
Then the dynamic linker jumps to the resolved address.
Step 5: Second call skips all this
The next call to push@plt executes jmp *0x804a008, which now reads 0xb803f47c, and jumps directly to push() in libstack.so. No more resolver invocations.
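The five steps can be modelled in plain C with a function pointer standing in for the GOT slot. This is only a toy model under stated assumptions: got_push, push_resolver, real_push and push_via_plt are invented names, and the real resolver does a symbol lookup where this one just assigns a pointer.

```c
typedef int (*push_fn)(int);

static int resolver_runs;             /* how many times the slow path ran */
static int real_push(int value);      /* stands in for push() in libstack.so */
static int push_resolver(int value);  /* stands in for PLT[0] plus the dynamic linker */

/* Model of GOT[push]: starts out pointing at the resolver path. */
static push_fn got_push = push_resolver;

static int real_push(int value)
{
    return value * 2;
}

static int push_resolver(int value)
{
    resolver_runs++;          /* steps 3 and 4: the lookup would happen here */
    got_push = real_push;     /* patch the slot with the resolved address */
    return got_push(value);   /* continue into the resolved function */
}

/* Model of push@plt: always exactly one indirect jump through the slot. */
int push_via_plt(int value)
{
    return got_push(value);
}

int resolver_run_count(void)
{
    return resolver_runs;
}
```

Calling push_via_plt twice runs the resolver once; the second call goes straight through the patched slot, which is the behavior described in step 5.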
Tracing it in gdb
$ gdb ./main
(gdb) start
(gdb) s ; step into push() call
0x080483d8 in push@plt ()
(gdb) x/wx 0x804a008 ; examine GOT entry (32-bit word)
0x804a008: 0x080483de ; ← not resolved yet, points back
(gdb) si ; jmp *GOT, falls through
0x080483de in push@plt ()
(gdb) si ; push relocation index
0x080483e3 in push@plt ()
(gdb) si ; jmp to resolver
0x080483a8 in ?? ()
(gdb) si ; into ld-linux
0xb806a080 in ?? () from /lib/ld-linux.so.2
(gdb) finish ; let resolver do its work
(gdb) x/wx 0x804a008 ; examine GOT entry again
0x804a008: 0xb803f47c ; ← now resolved!
(gdb) x/5i 0xb803f47c
0xb803f47c: push %ebp ; real function, real code
A key observation: the jmp *GOT[push] is an indirect jump through memory. The CPU has to wait for the memory load to complete before it can fetch the next instruction. On a cold cache, this costs. In tight loops, this matters.
The dynamic linker’s relocation tables
Use readelf -r to see what the dynamic linker will resolve:
$ readelf -r libstack.so
Relocation section '.rel.dyn' at offset 0x2bc contains 4 entries:
Offset Info Type Sym.Value Sym. Name
000016bc 00000606 R_386_GLOB_DAT 00000000 g_share
000016c0 00000806 R_386_GLOB_DAT 00000000 __cxa_finalize
Relocation section '.rel.plt' at offset 0x2ec contains 2 entries:
Offset Info Type Sym.Value Sym. Name
000016d8 00000407 R_386_JUMP_SLOT 00000000 g_func
000016dc 00000807 R_386_JUMP_SLOT 00000000 __cxa_finalize
The four important relocation types for dynamic linking:
| Type | Table | When resolved | What it contains |
|---|---|---|---|
| R_*_GLOB_DAT | .got | Load time | Address of a global variable |
| R_*_JUMP_SLOT | .got.plt | Lazily (or at load time with BIND_NOW) | Address of a function |
| R_*_RELATIVE | .got | Load time | base_address + addend (for data within the library itself) |
| R_*_TPOFF | .got or .tls | Load time | Thread-local storage offset |
The Offset column in readelf -r output is the address of the GOT slot itself (relative to the load base for a shared library). The dynamic linker writes the resolved address there.
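The Info column packs two fields into one word, and decoding it by hand helps when reading that output. The sketch below uses the ELF32_R_SYM and ELF32_R_TYPE macros from glibc’s elf.h; the wrapper function names are assumptions for the example. Note that elf.h spells the constant R_386_JMP_SLOT while readelf prints R_386_JUMP_SLOT.

```c
#include <elf.h>

/* Symbol table index: the upper 24 bits of a 32-bit r_info value. */
unsigned reloc_symbol_index(Elf32_Word info)
{
    return ELF32_R_SYM(info);
}

/* Relocation type: the low 8 bits of r_info. */
unsigned reloc_type(Elf32_Word info)
{
    return ELF32_R_TYPE(info);
}

/* Maps the types discussed in this article to readelf's spelling. */
const char *reloc_type_name(Elf32_Word info)
{
    switch (ELF32_R_TYPE(info)) {
    case R_386_GLOB_DAT: return "R_386_GLOB_DAT";
    case R_386_JMP_SLOT: return "R_386_JUMP_SLOT";
    case R_386_RELATIVE: return "R_386_RELATIVE";
    default:             return "other";
    }
}
```

Applied to the listing above, 00000606 decodes to symbol index 6 with type R_386_GLOB_DAT (g_share), and 00000407 to symbol index 4 with type R_386_JUMP_SLOT (g_func).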
One weird detail: RELRO in practice
Without extra hardening flags, many Linux toolchains give you Partial RELRO by default. Here is what that means for you as someone who might need to evaluate exploitability:
- No RELRO: the entire GOT (.got and .got.plt) stays in writable memory. A single write primitive can overwrite any function’s GOT entry. This is rare on modern systems but common in embedded or legacy code.
- Partial RELRO: the linker reorders sections so that .got (used by R_*_GLOB_DAT, variable addresses) is placed before .data and marked read-only after initial relocation. But .got.plt (used by R_*_JUMP_SLOT, function addresses) is left writable because lazy binding needs to write to it at runtime. This is the default because it preserves lazy binding and its startup time benefit.
- Full RELRO (-Wl,-z,relro,-z,now): the dynamic linker resolves all R_*_JUMP_SLOT entries before main() starts. Then it calls mprotect() to make the entire GOT read-only. Lazy binding is disabled. Startup is slower (all symbols resolved upfront), but the GOT overwrite attack surface is eliminated.
Check which one a binary has:
$ readelf -l ./a.out | grep -A1 GNU_RELRO
GNU_RELRO 0x000e78 0x00000000000e78 ... R 0x1
$ readelf -S ./a.out | grep -E 'got|plt'
[22] .got PROGBITS 0000000e78 ...
[23] .got.plt PROGBITS 0000000f08 ...
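If .got.plt falls within the GNU_RELRO segment’s range, it’s Full RELRO. If it’s after, it’s Partial. A quicker cross-check is readelf -d: a Full RELRO binary carries BIND_NOW, or a FLAGS / FLAGS_1 entry with the NOW bit set, in its dynamic section.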
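The same check can be scripted. The sketch below reads the program headers and dynamic section straight from the file using the structures in elf.h. It is a minimal version under stated assumptions: 64-bit ELF only, host byte order, and the function name relro_level and its return codes are invented for this example.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Returns 0 = no RELRO, 1 = Partial, 2 = Full,
 * -1 = unreadable or not a 64-bit ELF file. */
int relro_level(const char *path)
{
    Elf64_Ehdr eh;
    int has_relro = 0, bind_now = 0, ok = 0;
    FILE *f = fopen(path, "rb");

    if (!f)
        return -1;
    if (fread(&eh, sizeof eh, 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        goto done;

    for (unsigned i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        long phdr_pos = (long)(eh.e_phoff + (Elf64_Off)i * eh.e_phentsize);

        if (fseek(f, phdr_pos, SEEK_SET) != 0 ||
            fread(&ph, sizeof ph, 1, f) != 1)
            goto done;
        if (ph.p_type == PT_GNU_RELRO)
            has_relro = 1;
        if (ph.p_type != PT_DYNAMIC ||
            fseek(f, (long)ph.p_offset, SEEK_SET) != 0)
            continue;

        /* Walk the dynamic section looking for any bind-now marker. */
        for (Elf64_Xword n = ph.p_filesz / sizeof(Elf64_Dyn); n > 0; n--) {
            Elf64_Dyn d;

            if (fread(&d, sizeof d, 1, f) != 1 || d.d_tag == DT_NULL)
                break;
            if (d.d_tag == DT_BIND_NOW ||
                (d.d_tag == DT_FLAGS && (d.d_un.d_val & DF_BIND_NOW)) ||
                (d.d_tag == DT_FLAGS_1 && (d.d_un.d_val & DF_1_NOW)))
                bind_now = 1;
        }
    }
    ok = 1;
done:
    fclose(f);
    if (!ok)
        return -1;
    if (!has_relro)
        return 0;
    return bind_now ? 2 : 1;
}
```

Calling relro_level("./a.out") on the lab build and on the production binary is a cheap way to catch the mismatch before spending time on an exploit that cannot work.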
In practice: most distro binaries (Ubuntu 22.04+, Fedora 35+) use Full RELRO. But if you compile with just gcc -o test test.c locally, on many toolchains you still get Partial RELRO, so check the binary rather than assuming. This mismatch has caused plenty of exploit demos to work in “lab conditions” and fail on production systems.
Performance: real advice
A few things I’ve learned from profiling:
- Lazy binding latency spikes are real. In a server that calls dlopen at runtime for plugins, the first invocation of each plugin function triggers resolution. If that happens on a request path, you’ll see latency outliers in the 1–10ms range (depending on symbol table size). The fix: pre-call the functions during initialization to warm up the GOT, or use LD_BIND_NOW.
- PLT in hot paths steals icache. Each PLT entry is 16 bytes on x86-64. With hundreds of imported functions, the PLT can consume multiple cache lines. The indirect jump at the start of each entry also interferes with branch prediction: the CPU’s indirect branch predictor has limited entries (typically 4K–16K BTB). If your hot path calls many library functions, you will see branch mispredictions at the PLT site.
- GOT indirection in tight loops costs. A loop that reads a global variable from a shared library pays an extra load through the GOT every iteration. The fix: load the value into a local at loop entry:
// Slow: reloads from GOT every iteration
for (int i = 0; i < n; i++)
buf[i] = global_config.max_size;
// Fast: capture once
size_t max = global_config.max_size;
for (int i = 0; i < n; i++)
buf[i] = max;
- Diagnose PLT overhead with perf: if you see PLT entries in perf report, the library boundary is costing you. Consider:
  - Build with -fvisibility=hidden and put an explicit __attribute__((visibility("default"))) on the API
  - Move the function to a header as static inline
  - Use -flto (link-time optimization) to let the linker inline across translation units
Disable lazy binding at runtime for diagnosis:
LD_BIND_NOW=1 ./a.out # resolve everything at startup
Compare startup time and steady-state latency between LD_BIND_NOW=1 and default. If steady-state improves, lazy binding was interfering with your hot path (maybe through icache pollution from the resolver). If startup gets noticeably slower instead, you’re paying up-front resolution cost for functions that are rarely or never called.
FAQ
Q1: Why does the PLT use jmp instead of call for the first instruction?
A: jmp *GOT[n] is a tail-call. If it goes to the real function (after resolution), the ret in the target returns directly to the original caller — not through the PLT. If it falls through (before resolution), the resolver eventually jumps to the target, maintaining the same behavior.
Q2: Can I manually call the dynamic linker to resolve a symbol?
A: Yes, via dlsym(RTLD_NEXT, "funcname") or dlvsym. This is how LD_PRELOAD wrappers call the original function after intercepting it.
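A minimal sketch of that wrapper pattern, assuming glibc and using rand as the intercepted function purely for illustration. The counter and rand_call_count are hypothetical additions, and the code is not thread-safe.

```c
#define _GNU_SOURCE          /* exposes RTLD_NEXT; must precede all includes */
#include <dlfcn.h>
#include <stdlib.h>

#ifndef RTLD_NEXT            /* fallback if headers were pulled in without  */
#define RTLD_NEXT ((void *)-1L) /* _GNU_SOURCE; value assumed from glibc    */
#endif

static int (*real_rand)(void);   /* cached pointer to the next rand() */
static int calls;                /* how many times the wrapper ran */

/* Same name and signature as libc's rand(). When this object comes
 * earlier in the search order, callers bind to this definition. */
int rand(void)
{
    if (!real_rand)
        real_rand = (int (*)(void))dlsym(RTLD_NEXT, "rand");
    calls++;
    return real_rand ? real_rand() : 0;
}

int rand_call_count(void)
{
    return calls;
}
```

Built with gcc -shared -fPIC (plus -ldl on older glibc versions) and loaded through LD_PRELOAD, the wrapper sees every rand() call the application makes through its PLT and forwards it to the libc implementation.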
Q3: How does LD_PRELOAD interact with PLT resolution?
A: LD_PRELOAD libraries are placed ahead of all other shared libraries in the symbol search order, so their definitions win over the ones in regular dependencies. When the dynamic linker resolves R_*_JUMP_SLOT func, it finds the preloaded library’s func first and writes its address into the GOT.
Q4: What is __gmon_start__ in the PLT?
A: A profiling hook used by gprof. It’s normally unresolved. If you see it in your PLT, it’s harmless — just an unused entry.