Every C programmer knows malloc() returns a pointer to usable memory. What happens before that pointer arrives is a journey through at least three layers: glibc’s internal bins, a system call (either brk or mmap), and finally the kernel’s page allocator. Each layer has its own constraints, and understanding them is the difference between “my process RSS keeps growing” and knowing why.

This article focuses on the system call boundary: when glibc decides it needs more memory from the kernel, what actually happens.

Before malloc: where memory comes from

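When the kernel loads an ELF binary, the process’s address space is carved into regions that the kernel tracks in struct mm_struct (abridged here; the exact fields vary by kernel version, and kernels since 6.1 keep VMAs in a maple tree rather than the mmap linked list shown):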

struct mm_struct {
    unsigned long start_code, end_code;   // text segment bounds
    unsigned long start_data, end_data;   // data segment bounds
    unsigned long start_brk, brk;         // heap boundaries
    unsigned long start_stack;            // stack top
    unsigned long arg_start, arg_end;     // argv
    unsigned long env_start, env_end;     // environ
    struct vm_area_struct *mmap;          // linked list of VMAs
    pgd_t *pgd;                           // page table root
};

Two fields matter for heap allocation:

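  • start_brk: the initial program break, set once at exec time. Without heap randomization (kernel.randomize_va_space set to 0 or 1), the heap starts right where the BSS ends, rounded up to a page boundary; this article calls that address end_data. With randomize_va_space set to 2, the default on most distributions, the kernel inserts a random offset between the end of the BSS and start_brk.
  • brk: the current program break, i.e., the top of the heap. It starts equal to start_brk and moves upward as the heap grows. It can move downward, but only from the top: you can’t punch a hole in the middle.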

The heap lives between the data segment and the mmap region:

High addresses
+-----------------------------+
| Kernel space                |  (3GB+ on 32-bit)
+-----------------------------+
| Stack (grows down)          |
| mmap region (libs, threads) |  ← libc.so, ld.so, thread stacks
| Heap (grows up via brk)     |  ← brk pointer moves up
+-----------------------------+
| BSS (uninitialized globals) |
+-----------------------------+
| Data (initialized globals)  |
+-----------------------------+
| Text (code)                 |
+-----------------------------+
Low addresses

brk: the simple allocator

brk is one of the oldest Unix syscalls. It does one thing: move the program break pointer. That’s it. The kernel responds by mapping or unmapping pages between the old and new break.

What brk looks like at the syscall level

void *sbrk(intptr_t increment);   // library wrapper
int brk(void *addr);              // set break to absolute address

sbrk(0) returns the current break without changing it — useful for introspection. brk(new_addr) sets the break to an arbitrary value. If the new break is above the old one, the kernel maps anonymous pages. If below, it unmaps them. There is no granularity finer than a page: moving the break by 1 byte still maps a full 4KB page.

Experiment

#include <stdio.h>
#include <unistd.h>

int main(void) {
    char *curr;   // char * so the pointer arithmetic below is standard C

    printf("PID: %d\n", (int)getpid());
    curr = sbrk(0);
    printf("Initial program break: %p\n", (void *)curr);
    getchar();  // ← check /proc/PID/maps here

    if (brk(curr + 4096) != 0) {   // extend by one page
        perror("brk");
        return 1;
    }
    curr = sbrk(0);
    printf("After brk(+4KB): %p\n", (void *)curr);
    getchar();  // ← check again

    if (brk(curr - 4096) != 0) {   // shrink back
        perror("brk");
        return 1;
    }
    curr = sbrk(0);
    printf("After shrink: %p\n", (void *)curr);
    getchar();
    return 0;
}

Before brk: the process has no [heap] mapping yet. The last mapping that belongs to the binary is its data segment, which ends at end_data:

0804a000-0804b000 rw-p 00001000 08:01 539624  /home/user/sbrk_test

After brk(+4096): a [heap] entry appears:

0804b000-0804c000 rw-p 00000000 00:00 0       [heap]

start_brk = 0x0804b000 = end_data. brk = 0x0804c000.

After shrink: the [heap] entry disappears because the pages are unmapped, and sbrk(0) returns 0x0804b000 again.

The limitation that matters: brk is a stack

You can only shrink from the top. If thread A allocates at address X, thread B allocates at address X+1000, and thread A frees its memory, the brk cannot be lowered because thread B’s allocation is still active at the top. This is why glibc’s free() rarely calls brk with a negative increment. The memory is kept in glibc’s bins and reused for future allocations, but the kernel doesn’t get it back.

The practical consequence: a process that builds up 100MB of small allocations, frees the 99MB that sit below the most recent 1MB, and then goes idle still shows ~100MB RSS. The heap can’t shrink because the surviving 1MB sits at the top.

How glibc uses brk for the main arena

When main() calls malloc(1000) for the first time, glibc does not call sbrk(1000). It extends the break by the request plus M_TOP_PAD (128KB by default), rounded up to a whole page, which for a small request comes to 132KB on both 32-bit and 64-bit x86. This region is the initial heap of the main arena.

Why allocate 132KB when the program only asked for 1KB? Because system calls are expensive (a switch into kernel mode, plus kernel page table and VMA operations). glibc treats the kernel as a wholesale supplier and handles retail distribution itself. The extra 131KB stays in the arena’s top chunk and serves subsequent malloc() calls without touching the kernel again.

This explains a common “problem”: you malloc(100), check /proc/pid/maps, and see 132KB of heap. You haven’t leaked anything — glibc just pre-allocated.

Only when the top chunk is exhausted does glibc call sbrk again to extend the heap. In the other direction, free() lowers the break only when the free space at the very top of the heap exceeds M_TRIM_THRESHOLD (128KB by default), so freeing most of your memory often leaves the break where it was. Pages that were touched stay resident, which is why long-running processes with bursty allocation patterns tend to keep their high-water RSS.

When brk fails

brk cannot extend into the mmap region. If the shared libraries and thread stacks are mapped close to the heap, the program break may hit the mmap base. In practice, this happens on 32-bit systems where the 3GB user address space is tight. On 64-bit, the address space is large enough that this is not a concern.

When brk cannot extend, glibc falls back to mmap for the allocation — even for small requests. This is why a process may have multiple heap-like regions in its maps after heavy allocation.

mmap: the direct allocator

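For allocations at or above a threshold (default 128KB, controlled by M_MMAP_THRESHOLD), glibc bypasses brk entirely and calls mmap to create an anonymous mapping. Unless you set the threshold explicitly, glibc adjusts it upward at run time when large mmap’d chunks are freed, so 128KB is only the starting value. mmap’d allocations have different characteristics: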

  • The memory is immediately released back to the kernel on free() (via munmap).
  • The virtual address is randomized by ASLR (independent of the brk heap).
  • No fragmentation in the main arena — each large allocation is its own mapping.
  • Overhead: each mmap/munmap requires a VMA creation/destruction and a TLB flush on some architectures. This is slower than allocating from the arena.

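Thread arenas are also allocated via mmap. When a new thread first calls malloc(), glibc reserves an mmap region of HEAP_MAX_SIZE (1MB on 32-bit, 64MB on 64-bit). Only the first 132KB is made readable and writable as the active arena; the rest is reserved address space that the arena can grow into.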

Experiment

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const size_t len = 132 * 1024;

    printf("PID: %d\n", (int)getpid());
    getchar();  // check maps

    char *p = mmap(NULL, len,
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {   // mmap signals failure with MAP_FAILED, not NULL
        perror("mmap");
        return 1;
    }
    printf("mmap'd at: %p\n", (void *)p);
    getchar();  // check maps again

    munmap(p, len);
    printf("freed\n");
    getchar();  // check maps after free
    return 0;
}

Before:

08048000-0804b000 r-xp/rw-p  ... /home/user/mmap_test  (binary)
b7e21000-b7e22000 rw-p ... (anonymous, likely from ld.so)

After mmap:

b7e00000-b7e22000 rw-p 00000000 00:00 0

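A single 136KB region appears (0x22000 bytes). The request was not rounded up: the new 132KB mapping (b7e00000-b7e21000) sits directly below the pre-existing 4KB anonymous mapping with the same permissions, so the kernel merges the two into one entry. The address (0xb7e00000) is in the mmap region, far from the brk-based heap.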

After munmap:

b7e21000-b7e22000 rw-p 00000000 00:00 0

The 132KB region is gone, leaving only the pre-existing 4KB mapping. This is the key difference from brk: munmap returns the range to the kernel immediately, and any RSS those pages accounted for drops with it.

Reading /proc/pid/maps correctly

address           perms  offset   dev   inode  pathname
559a7a0e9000-559a7a0ea000 r-xp 00000000 08:01 539691  /bin/somebinary
7f8c4a0e9000-7f8c4a0ea000 r-xp 00000000 08:01 123456  /lib/x86_64-linux-gnu/libc.so
7ffc9a0e9000-7ffc9a0ea000 rw-p 00000000 00:00 0       [stack]

Field      What it tells you
559a...    Virtual address range (ASLR-randomized on modern systems)
r-xp       Permissions: r/w/x, p=private or s=shared
offset     Where in the backing file this maps to (0 for anonymous)
dev:inode  Device and inode of backing file (0:0 for anonymous)
pathname   [heap], [stack], [vdso], file path, or blank

A mapping with no pathname and rw-p is anonymous memory — typically mmap’d by malloc, thread stacks, or the loader itself.

Kernel-side allocation: what happens after the syscall

When user-space calls brk or mmap, the kernel does not immediately allocate physical pages. It only sets up virtual memory area (VMA) structures in mm_struct. Physical pages are allocated lazily, on first access, via the page fault handler.

The buddy allocator (page-level)

The kernel’s page allocator uses the buddy system to manage physical pages. Free pages are grouped into lists of order-0 (4KB), order-1 (8KB), order-2 (16KB), etc., up to order-10 (4MB). A request is served from the free list of its own order; if that list is empty, a block from a higher order is split in half, repeatedly if necessary. On free, a block is merged with its buddy whenever the buddy is also free.

The kernel-internal API:

struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);
// Returns 2^order contiguous physical pages

unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
// Same, but returns the kernel virtual address

These return physically contiguous memory. For pages in the kernel’s direct mapping (lowmem), the virtual address has a fixed relationship to the physical address:

#define __pa(x)    ((unsigned long)(x) - PAGE_OFFSET)   // virt → phys
#define __va(x)    ((void *)((unsigned long)(x) + PAGE_OFFSET))  // phys → virt

The slab allocator (sub-page)

kmalloc is the kernel equivalent of malloc. It is built on the slab allocator, which manages caches of fixed-size objects:

void *kmalloc(size_t size, gfp_t flags);
void kfree(const void *p);

kmalloc returns physically contiguous memory, suitable for DMA and hardware interaction. It is limited to ~4MB per allocation on most architectures.

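Each slab cache serves objects of one fixed size, which avoids external fragmentation within the cache. The price is internal fragmentation: a request is rounded up to the next size class, so asking for 20 bytes typically returns a 32-byte object. kmalloc dispatches to the appropriate cache based on size (simplified here, not the actual kernel source):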

static inline void *kmalloc(size_t size, gfp_t flags)
{
    if (__builtin_constant_p(size)) {
        // compile-time known size: direct cache lookup
        int i = 0;
        if (size <= 16) i = 0;
        else if (size <= 32) i = 1;
        // ... dispatch to kmem_cache_alloc
        return kmem_cache_alloc(cache[i], flags);
    }
    return __kmalloc(size, flags);  // runtime path
}

vmalloc (non-contiguous)

When physically contiguous memory is not needed (most data buffers), vmalloc allocates from a reserved virtual address range:

void *vmalloc(unsigned long size);
void vfree(const void *addr);

The virtual addresses are contiguous, but the physical pages are scattered. Page table manipulation is required to map them, which makes vmalloc slower than kmalloc or __get_free_pages. It is used for large allocations (kernel modules, framebuffers, large data structures) where contiguity is not required.

kmem_cache_create / kmem_cache_alloc

For frequently allocated kernel objects of the same size (like struct inode, struct task_struct, struct file), the slab allocator provides a dedicated interface:

struct kmem_cache *c = kmem_cache_create("my_object", sizeof(struct my_obj), 0, 0, NULL);
struct my_obj *o = kmem_cache_alloc(c, GFP_KERNEL);
// ... use o ...
kmem_cache_free(c, o);

This avoids the overhead of size-class dispatch in kmalloc and keeps hot objects in the per-CPU slab cache. On a busy filesystem the inode cache alone handles a very large number of allocations, so the per-allocation savings add up.

Why this matters for user-space developers

The kernel’s allocators are not directly callable from user space, but their behavior affects you:

  • Page allocation is lazy. The first write to a malloc’d page triggers a page fault that allocates a physical page. A read from a never-written anonymous page is mapped to the shared, read-only zero page, and a private page is allocated only on the first write. This is why a process that mallocs 1GB but only writes to 10MB shows only 10MB RSS.
  • Physical fragmentation mostly does not affect ordinary allocations. A malloc(2MB+1) needs contiguous virtual addresses only; each page fault is satisfied with a single order-0 page, so the request does not depend on finding 2MB of contiguous physical memory, even on a system that has been running for months. Fragmentation matters for high-order kernel allocations (one reason vmalloc exists) and for transparent huge pages, which need a free, aligned 2MB block. The kernel’s compaction mechanism migrates pages to rebuild such blocks.
  • Overcommit. By default, Linux overcommits memory: malloc(1GB) can succeed even if only 100MB of physical memory is free. If processes later touch more memory than the system can supply, the OOM killer selects a process to kill, and it is not necessarily the one that made the allocation. This is controlled by vm.overcommit_memory (0 = heuristic, the default; 1 = always allow; 2 = strict accounting) and vm.overcommit_ratio.

The full path from malloc to physical page

malloc(1000)
  glibc: check tcache (glibc 2.26+) → fastbin → small bin → unsorted bin → large bin → top chunk
  top chunk too small → need more memory
  if size < M_MMAP_THRESHOLD (default 128KB):
      sbrk(size + padding)  →  brk syscall  →  kernel extends VMA  →  physical pages on page fault
  else:
      mmap(...)             →  mmap syscall →  kernel creates VMA  →  physical pages on page fault
  glibc splits the new region, returns chunk to user

The “physical pages on page fault” part is the key: the syscall itself only updates metadata. The physical RAM is allocated when your process first touches the returned pointer.

FAQ

Q1: My process shows high RSS even after freeing. What happened?
A: glibc didn’t return the memory to the kernel. The freed chunks went into glibc’s bins for reuse. RSS drops only when glibc trims free space at the top of the heap or munmaps an mmap’d chunk. This is normal and usually not a leak.

Q2: Can I force glibc to return memory to the OS?
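A: malloc_trim(0) asks glibc to release free memory. It lowers the break if the top of the heap is free, and since glibc 2.8 it also uses madvise to release whole free pages in the middle of the heap and in other arenas. For mmap’d allocations, free always calls munmap.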

Q3: How do I know if my allocation used brk or mmap?
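A: Allocations at or above the mmap threshold (128KB by default, though glibc may raise it at run time) use mmap. Smaller ones come from an arena. Check /proc/pid/maps: the [heap] entry is the brk heap; anonymous mappings outside it are mmap’d chunks or thread arenas.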

Q4: What is M_MMAP_THRESHOLD and should I change it?
A: It’s the size threshold that determines the brk/mmap split. Default 128KB. Setting it higher forces more allocations into the brk heap (fewer syscalls, but more fragmentation). Setting it lower forces more mmap usage (immediate release on free, but more TLB flushes). Setting it explicitly also turns off glibc’s dynamic adjustment of the threshold. API: mallopt(M_MMAP_THRESHOLD, value).

Q5: Why does my heap address have no [heap] label in /proc/pid/maps?
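A: The kernel labels the mapping that covers the range from start_brk to brk as [heap], regardless of the ASLR setting. Thread arenas and large mmap’d chunks are anonymous mappings without labels. If no [heap] line appears, the break has most likely never been moved: either the program has not yet allocated from the main arena, or brk failed and glibc fell back to mmap.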

References