Dynamic linking is something no one thinks about until their binary won’t start, a library upgrade breaks production, or a seemingly innocent function call turns into a performance hotspot. I’ve debugged each of those scenarios enough times that the ELF structures and loader behavior are burned into my memory. This article walks through the machinery from the ground up — not as a reference manual, but as someone who has traced through PLT stubs with gdb more times than I’d like to admit.

You will hit these problems

A few scenarios where knowing dynamic linking saves your day:

  • ldd shows the library is found, but the binary crashes with undefined symbol. The library was updated and the SONAME changed, but the old symlink wasn’t updated.
  • A shared library function mysteriously adds 50ms latency on first call because it triggers lazy binding in a latency-sensitive path.
  • You LD_PRELOAD a debug library, but the symbols don’t intercept. The application calls through PLT, and the preloaded library doesn’t export the expected symbol names.
  • A GOT overwrite exploit works in your test environment but fails in production because you tested against a Partial RELRO build and prod ships Full RELRO.

These are not theoretical. I’ve hit each one.

Why dynamic linking costs real performance

Static linking merges all code into one binary. Every function call is a direct call instruction. Every global variable access is a direct memory reference. The instruction encoding is minimal: call offset takes 5 bytes, mov addr, %reg takes 5–7 bytes.

Dynamic linking adds indirection. For every symbol imported from a shared library:

  • Global variables: compiled to load from GOT → dereference instead of load from absolute address. That’s one extra memory load per access — on a cache miss, that’s 50–100ns.
  • Function calls: compiled to call PLT_stub → jmp *(GOT+n). After resolution, that’s one extra indirect jump (which the CPU’s branch predictor may or may not handle well). On first call, it’s hundreds of instructions in the dynamic linker.

The often-quoted 1–5% overhead is real but misleading. In CPU-bound code that calls library functions in a hot loop, the overhead can be 10–15% because of GOT register pressure (on 32-bit x86 you lose one register to hold the GOT base) and PLT-related icache pollution. In I/O-bound or startup-dominated code, you won’t notice. The worst case is a latency-sensitive path that triggers lazy binding at runtime: the first call can take tens of microseconds as the dynamic linker resolves symbols.

If you’re building high-performance shared libraries and every instruction matters, consider whether those functions really need to be exported, or whether you can use -fvisibility=hidden and export only the API surface. Many projects export every symbol by default (-fvisibility=default). Because any exported function can be interposed by another module, calls to it from inside the same library still go through PLT entries and GOT slots, even if nothing outside the library ever uses it.
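As a minimal sketch of the export-only-the-API pattern, assuming GCC or Clang: the macro STACK_API and the functions stack_sum and clamp_nonneg are hypothetical names made up for illustration, not part of any real library.

```c
/* Assumed build line for a shared library (GCC/Clang):
 *   cc -O2 -fPIC -fvisibility=hidden -shared stack.c -o libstack.so
 */
#include <stddef.h>

/* Hypothetical export macro: marks the public API surface. */
#define STACK_API __attribute__((visibility("default")))

/* Internal helper: hidden, so calls to it bind inside the library
 * and need no PLT stub or GOT slot. */
__attribute__((visibility("hidden")))
int clamp_nonneg(int v)
{
    return v < 0 ? 0 : v;
}

/* Public entry point: the only symbol other modules can import. */
STACK_API int stack_sum(const int *items, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += clamp_nonneg(items[i]);
    return total;
}
```

If the library is built this way, nm -D --defined-only libstack.so should list stack_sum but not clamp_nonneg.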

PIC: Position Independent Code

Why PIC exists

ASLR randomizes where shared libraries are loaded. A library cannot know its base address at compile time. Without PIC, every absolute address would need a relocation entry, and the loader would have to patch every instruction at load time. PIC avoids that by making all memory references relative to the current instruction pointer or indirect through a table.

What it costs you

Let me show the actual cost with real instruction sequences. Consider a simple variable access:

extern int top;
void increment(void) { top++; }

Without PIC (compiled for a fixed-address executable):

mov    0x804a010, %eax    ; 5 bytes, absolute address encoded in instruction
add    $0x1, %eax
mov    %eax, 0x804a010     ; 5 bytes

With PIC (compiled as position-independent):

mov    -0xc(%ebx), %eax   ; 3 bytes: load the address of top from its GOT slot
mov    (%eax), %eax       ; load the value of top through that address
lea    0x1(%eax), %edx    ; increment
mov    -0xc(%ebx), %eax   ; reload the GOT entry (register pressure!)
mov    %edx, (%eax)       ; store the new value through the pointer

Five instructions instead of three. Three memory loads instead of one (two GOT loads plus the value itself). And %ebx is pinned to the GOT base for the entire function, so you lose a general-purpose register. In register-pressure-heavy code, this forces spilling, which adds more memory traffic.

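On x86-64, RIP-relative addressing improves this somewhat:

mov    top@GOTPCREL(%rip), %rax   ; load the address of top from its GOT slot, RIP-relative
addl   $0x1, (%rax)               ; increment top in place through that address

No GOT base register is needed, but the indirection still exists: the address of top comes from a GOT slot that the loader fills in at load time. The instruction itself is never patched, because the distance from the code to the GOT is fixed at link time. Only a variable known to live in the same module (static, or hidden visibility) can be accessed directly as top(%rip) with no GOT load at all.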

The __i686.get_pc_thunk hack

On 32-bit x86, there is no instruction to read eip directly. Compilers work around this with a trick:

call    __i686.get_pc_thunk.bx   ; pushes return address = current EIP
add     $0x1b6c, %ebx            ; now ebx = GOT base

The thunk function is trivial:

__i686.get_pc_thunk.bx:
    mov    (%esp), %ebx           ; copy return address (which IS eip) to ebx
    ret

Every function that accesses global data or calls an imported function pays this overhead at entry. (Newer GCC releases name the thunk __x86.get_pc_thunk.bx; the mechanism is the same.) On x86-64, the thunk is not needed because RIP-relative addressing is native.

GOT and PLT: the runtime resolution layer

GOT (Global Offset Table)

The GOT is an array of pointers, one per imported symbol and per internal data reference that needs relocation. It lives in its own section (.got and .got.plt). The key thing to understand about the GOT is:

  • It is writable (under Partial RELRO). The dynamic linker writes resolved addresses into it.
  • It is an attack surface. If an attacker can write to the GOT, they can redirect any library call. This is why Full RELRO exists.
  • Every module has its own. The executable and each shared library carry a GOT in their own data segment. Those pages are private to the process (copy-on-write), so each process that loads a shared library gets its own copy of that library’s GOT in its own address space.

For variables, the GOT entry is filled in at load time by the dynamic linker (before main() runs). These are R_*_GLOB_DAT relocations. The loader looks the symbol up in the symbol table of the object that defines it, adds that object’s load base, and writes the resulting address into the GOT slot.

For functions, the GOT entry may be resolved lazily — or not, depending on the BIND_NOW setting.

PLT (Procedure Linkage Table)

The PLT is a set of trampoline stubs, one per imported function. On x86-64, a typical entry:

push@plt:
    jmp    *GOT[push](%rip)       ; via GOT
    push   $reloc_index           ; identify function to resolver
    jmp    plt0                   ; call resolver

PLT[0] (the first entry, also called the resolver stub) is shared by all functions. It pushes the GOT[1] link map pointer and jumps to the resolver function pointed to by GOT[2].

When I first traced this with gdb, the penny-drop moment was seeing the GOT entry point back into the PLT itself. Because the slot starts out pointing at the stub’s own fall-through path, resolving a function takes nothing more than one pointer-sized store into that slot.

Lazy binding, step by step

Here is the exact sequence when main() calls push() from libstack.so for the first time:

Step 1: Call to PLT

main:
    call   80483d8 <push@plt>     ; call PLT stub

Step 2: PLT stub reads GOT

80483d8  jmp    *0x804a008        ; GOT entry for push

GOT[push] currently holds 0x80483de — the address of the next instruction in the PLT itself. The jmp falls through.

Step 3: Push relocation index and call resolver

80483de  push   $0x10             ; relocation index (offset into .rel.plt)
80483e3  jmp    80483a8           ; PLT[0] → dynamic linker

Step 4: Dynamic linker resolves

The dynamic linker (ld-linux.so.2 or ld-linux-x86-64.so.2) receives the relocation index, looks it up in the .rel.plt table, finds the symbol name push, searches the loaded shared objects, finds it in libstack.so, computes the runtime address, and writes it to GOT[push]:

Before resolution: GOT[push] = 0x80483de (placeholder, points back into the PLT)
After resolution:  GOT[push] = 0xb803f47c (real address)

Then the dynamic linker jumps to the resolved address.

Step 5: Second call skips all this

The next call to push@plt executes jmp *0x804a008, which now reads 0xb803f47c, and jumps directly to push() in libstack.so. No more resolver invocations.
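The five steps can be mimicked in plain C with a single function pointer standing in for the GOT slot. This is only a toy model under that assumption: got_push, push_stub, real_push, call_push and resolver_calls are hypothetical names, and the real mechanism lives in the PLT and the dynamic linker rather than in C code.

```c
/* Toy model of lazy binding: one "GOT slot" holding a function pointer. */
typedef int (*push_fn)(int);

int resolver_calls = 0;             /* how many times we "resolved" */

static int real_push(int v);        /* stands in for push() in libstack.so */
static int push_stub(int v);        /* stands in for the PLT fall-through path */

/* Initially the slot points at the stub, like GOT[push] pointing
 * back into the PLT before the first call. */
static push_fn got_push = push_stub;

static int real_push(int v)
{
    return v + 1;
}

static int push_stub(int v)
{
    resolver_calls++;               /* the expensive symbol lookup would happen here */
    got_push = real_push;           /* patch the slot with the resolved address */
    return got_push(v);             /* then continue into the real function */
}

/* What the call site does on every call: jump through the slot. */
int call_push(int v)
{
    return got_push(v);
}
```

Calling call_push twice runs the stub only on the first call; the second call goes straight through the patched slot, which is the behaviour Step 5 describes.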

Tracing it in gdb

$ gdb ./main
(gdb) start
(gdb) s                                  ; step into push() call
0x080483d8 in push@plt ()

(gdb) x/wx 0x804a008                     ; examine GOT entry (32-bit word)
0x804a008: 0x080483de                    ; ← not resolved yet, points back

(gdb) si                                 ; jmp *GOT — falls through
0x080483de in push@plt ()

(gdb) si                                 ; push relocation index
0x080483e3 in push@plt ()

(gdb) si                                 ; jmp to resolver
0x080483a8 in ?? ()

(gdb) si                                 ; into ld-linux
0xb806a080 in ?? () from /lib/ld-linux.so.2

(gdb) finish                             ; let resolver do its work
(gdb) x/wx 0x804a008                     ; examine GOT entry again
0x804a008: 0xb803f47c                    ; ← now resolved!
(gdb) x/5i 0xb803f47c
0xb803f47c: push   %ebp                  ; real function, real code

A key observation: the jmp *GOT[push] is an indirect jump through memory. Unless the branch predictor already knows the target, the CPU has to wait for the load to complete before it can fetch the next instruction. On a cold cache, this costs. In tight loops, this matters.

The dynamic linker’s relocation tables

Use readelf -r to see what the dynamic linker will resolve:

$ readelf -r libstack.so

Relocation section '.rel.dyn' at offset 0x2bc contains 4 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
000016bc  00000606 R_386_GLOB_DAT   00000000   g_share
000016c0  00000806 R_386_GLOB_DAT   00000000   __cxa_finalize

Relocation section '.rel.plt' at offset 0x2ec contains 2 entries:
 Offset     Info    Type            Sym.Value  Sym. Name
000016d8  00000407 R_386_JUMP_SLOT  00000000   g_func
000016dc  00000807 R_386_JUMP_SLOT  00000000   __cxa_finalize

The four important relocation types for dynamic linking:

Type           Table         When resolved                           What it contains
R_*_GLOB_DAT   .got          Load time                               Address of a global variable
R_*_JUMP_SLOT  .got.plt      Lazily (or at load time with BIND_NOW)  Address of a function
R_*_RELATIVE   .got          Load time                               base_address + addend (for data within the library itself)
R_*_TPOFF      .got or .tls  Load time                               Thread-local storage offset

The Offset column in readelf -r output is the address of the GOT slot as laid out at link time. For a shared library, the dynamic linker adds the load base to it and writes the resolved address there.
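The Info column packs the symbol index and the relocation type into one 32-bit word. A small sketch of the decoding, using the same bit layout as ELF32_R_SYM and ELF32_R_TYPE from <elf.h>; the macro and function names below are hypothetical, redefined here so the snippet builds without that header.

```c
#include <stdint.h>

/* Same layout as ELF32_R_SYM / ELF32_R_TYPE: upper 24 bits are the
 * symbol table index, low 8 bits are the relocation type. */
#define REL32_SYM(info)   ((uint32_t)(info) >> 8)
#define REL32_TYPE(info)  ((uint32_t)(info) & 0xffu)

/* Type numbers from the i386 psABI. */
enum { TYPE_386_GLOB_DAT = 6, TYPE_386_JUMP_SLOT = 7 };

/* Returns a short label for the two relocation types shown above. */
const char *rel32_type_name(uint32_t info)
{
    switch (REL32_TYPE(info)) {
    case TYPE_386_GLOB_DAT:  return "GLOB_DAT";
    case TYPE_386_JUMP_SLOT: return "JUMP_SLOT";
    default:                 return "other";
    }
}
```

Applied to the listing above, g_share’s Info value 0x00000606 decodes to symbol index 6 with type 6 (GLOB_DAT), and g_func’s 0x00000407 decodes to symbol index 4 with type 7 (JUMP_SLOT).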

One weird detail: RELRO in practice

Upstream toolchain defaults typically produce Partial RELRO, while many distributions now build their packages with Full RELRO. Here is what each level means for you as someone who might need to evaluate exploitability:

  • No RELRO: the entire GOT (.got and .got.plt) stays writable for the life of the process. A single write primitive can overwrite any function’s GOT entry. This is rare on modern systems but common in embedded or legacy code.

  • Partial RELRO: the linker reorders sections so that .got (used by R_*_GLOB_DAT, variable addresses) is placed before .data and marked read-only after initial relocation. But .got.plt (used by R_*_JUMP_SLOT, function addresses) is left writable because lazy binding needs to write to it at runtime. This level preserves lazy binding and its startup time benefit.

  • Full RELRO (-Wl,-z,relro,-z,now): the dynamic linker resolves all R_*_JUMP_SLOT entries before main() starts. Then it calls mprotect() to make the entire GOT read-only. Lazy binding is disabled. Startup is slower (all symbols resolved upfront), but the GOT overwrite attack surface is eliminated.

Check which one a binary has:

$ readelf -l ./a.out | grep -A1 GNU_RELRO
  GNU_RELRO      0x000e78 0x00000000000e78 ... R   0x1
$ readelf -S ./a.out | grep -E 'got|plt'
  [22] .got               PROGBITS  0000000e78 ...
  [23] .got.plt           PROGBITS  0000000f08 ...

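If .got.plt falls within the GNU_RELRO segment’s range, it’s Full RELRO. If it’s after, it’s Partial. A more direct check is readelf -d ./a.out | grep -E 'BIND_NOW|FLAGS': a GNU_RELRO segment plus BIND_NOW (or the NOW flag) means Full RELRO.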
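The range rule can be written down as a short check. This is a sketch only: section_in_segment and relro_level are hypothetical helper names, and the addresses and sizes would have to come from your own readelf output, since the sizes are elided in the listing above.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the section [sec_addr, sec_addr + sec_size) lies entirely
 * inside the segment [seg_addr, seg_addr + seg_size). */
bool section_in_segment(uint64_t sec_addr, uint64_t sec_size,
                        uint64_t seg_addr, uint64_t seg_size)
{
    return sec_addr >= seg_addr &&
           sec_addr + sec_size <= seg_addr + seg_size;
}

/* Classify per the rule above.
 * has_relro:       a GNU_RELRO program header exists.
 * gotplt_covered:  .got.plt lies inside that segment. */
const char *relro_level(bool has_relro, bool gotplt_covered)
{
    if (!has_relro)
        return "none";
    return gotplt_covered ? "full" : "partial";
}
```

For example, with a made-up .got.plt of 0x30 bytes at 0xf08, a GNU_RELRO segment covering 0xe78 through 0x1000 contains it (full), while one that ends at 0xf08 does not (partial).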

In practice: most distro binaries (Ubuntu 22.04+, Fedora 35+) use Full RELRO. But a local build with a plain gcc -o test test.c may give you only Partial RELRO, depending on how your toolchain was configured. This mismatch has caused plenty of exploit demos to work in “lab conditions” and fail on production systems.

Performance: real advice

A few things I’ve learned from profiling:

  1. Lazy binding latency spikes are real. In a server that calls dlopen at runtime for plugins, the first invocation of each plugin function triggers resolution. If that happens on a request path, you’ll see latency outliers in the 1–10ms range (depending on symbol table size). The fix: pre-call the functions during initialization to warm up the GOT, pass RTLD_NOW to dlopen so the plugin is fully bound when it loads, or use LD_BIND_NOW.

  2. PLT in hot paths steals icache. Each PLT entry is 16 bytes on x86-64. With hundreds of imported functions, the PLT can consume multiple cache lines. The indirect jump at the start of each entry also interferes with branch prediction — the CPU’s indirect branch predictor has limited entries (typically 4K–16K BTB). If your hot path calls many library functions, you will see branch mispredictions at the PLT site.

  3. GOT indirection in tight loops costs. A loop that reads a global variable from a shared library pays an extra load through the GOT every iteration. The fix: load the value into a local at loop entry:

// Slow: reloads from GOT every iteration
for (int i = 0; i < n; i++)
    buf[i] = global_config.max_size;

// Fast: capture once
size_t max = global_config.max_size;
for (int i = 0; i < n; i++)
    buf[i] = max;

  4. Diagnose PLT overhead with perf. If you see PLT entries in perf report, the library boundary is costing you. Consider:
    • -fvisibility=hidden + explicit __attribute__((visibility("default"))) on the API
    • Move the function to a header as static inline
    • Use -flto (link-time optimization) to let the compiler inline across translation units within the same module; LTO cannot inline across a shared-library boundary

Disable lazy binding at runtime for diagnosis:

LD_BIND_NOW=1 ./a.out     # resolve everything at startup

Compare startup time and steady-state latency between LD_BIND_NOW=1 and default. If steady-state improves, lazy binding was interfering with your hot path (maybe through icache pollution from the resolver). If it gets worse, you’re paying resolution cost for rarely-used functions.

FAQ

Q1: Why does the PLT use jmp instead of call for the first instruction?
A: jmp *GOT[n] is a tail-call. If it goes to the real function (after resolution), the ret in the target returns directly to the original caller — not through the PLT. If it falls through (before resolution), the resolver eventually jumps to the target, maintaining the same behavior.

Q2: Can I manually call the dynamic linker to resolve a symbol?
A: Yes, via dlsym(RTLD_NEXT, "funcname") or dlvsym. This is how LD_PRELOAD wrappers call the original function after intercepting it.

Q3: How does LD_PRELOAD interact with PLT resolution?
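A: LD_PRELOAD libraries are loaded before every other shared library and sit right after the executable in the symbol search order, so their definitions win over those in ordinary libraries. When the dynamic linker resolves R_*_JUMP_SLOT func, it finds the preloaded library’s func first and writes its address into the GOT.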

Q4: What is __gmon_start__ in PLT[0]?
A: A profiling hook used by gprof. It is a weak reference that normally stays unresolved, so if you see it among your PLT or GOT entries, it’s harmless: just an unused entry.

References