Every C programmer knows malloc() returns a pointer to usable memory. What happens before that pointer arrives is a journey through at least three layers: glibc’s internal bins, a system call (either brk or mmap), and finally the kernel’s page allocator. Each layer has its own constraints, and understanding them is the difference between “my process RSS keeps growing” and knowing why.
This article focuses on the system call boundary: when glibc decides it needs more memory from the kernel, what actually happens.
Before malloc: where memory comes from
When the kernel loads an ELF binary, the process’s address space is carved into regions that the kernel tracks in struct mm_struct:
struct mm_struct {
unsigned long start_code, end_code; // text segment bounds
unsigned long start_data, end_data; // data segment bounds
unsigned long start_brk, brk; // heap boundaries
unsigned long start_stack; // stack top
unsigned long arg_start, arg_end; // argv
unsigned long env_start, env_end; // environ
struct vm_area_struct *mmap; // linked list of VMAs
pgd_t *pgd; // page table root
};
Two fields matter for heap allocation:
- start_brk: the initial program break, set once at exec time. With ASLR off, start_brk == end_data (the end of the BSS segment). With ASLR level 2, there is a random offset between them.
- brk: the current program break, i.e., the top of the heap. It starts equal to start_brk and moves upward as the heap grows. It can move downward, but only from the top: you can’t punch a hole in the middle.
The heap lives between the data segment and the mmap region:
High addresses
+-----------------------------+
| Kernel space | (3GB+ on 32-bit)
+-----------------------------+
| Stack (grows down) |
| mmap region (libs, threads) | ← libc.so, ld.so, thread stacks
| Heap (grows up via brk) | ← brk pointer moves up
+-----------------------------+
| BSS (uninitialized globals) |
+-----------------------------+
| Data (initialized globals) |
+-----------------------------+
| Text (code) |
+-----------------------------+
Low addresses
brk: the simple allocator
brk is one of the oldest Unix syscalls. It does one thing: move the program break pointer. That’s it. The kernel responds by mapping or unmapping pages between the old and new break.
What brk looks like at the syscall level
void *sbrk(intptr_t increment); // library wrapper
int brk(void *addr); // set break to absolute address
sbrk(0) returns the current break without changing it — useful for introspection. brk(new_addr) sets the break to an arbitrary value. If the new break is above the old one, the kernel maps anonymous pages. If below, it unmaps them. There is no granularity finer than a page: moving the break by 1 byte still maps a full 4KB page.
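As a minimal sketch of the calling convention described above, the helper below wraps sbrk and checks its failure value. The name grow_break is hypothetical (it is not a libc function), and the code assumes Linux with glibc, where sbrk returns the previous break on success and (void *)-1 on failure.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE           // exposes sbrk() under strict -std= modes
#endif
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

// Move the program break by `delta` bytes (a negative value shrinks it).
// Returns the previous break, or NULL if the kernel refused.
// grow_break is a hypothetical helper name used only in this article.
void *grow_break(intptr_t delta)
{
    void *old = sbrk(delta);      // on success, sbrk returns the old break
    if (old == (void *)-1) {      // (void *)-1 signals failure; errno is set
        perror("sbrk");
        return NULL;
    }
    return old;
}
```

Calling grow_break(4096) and then grow_break(-4096) should leave sbrk(0) where it started, provided nothing else (including malloc) moved the break in between.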
Experiment
#include <stdio.h>
#include <unistd.h>
int main() {
void *curr;
printf("PID: %d\n", getpid());
curr = sbrk(0);
printf("Initial program break: %p\n", curr);
getchar(); // ← check /proc/PID/maps here
brk((char *)curr + 4096); // extend by one page (cast: arithmetic on void * is a GNU extension)
curr = sbrk(0);
printf("After brk(+4KB): %p\n", curr);
getchar(); // ← check again
brk((char *)curr - 4096); // shrink back
curr = sbrk(0);
printf("After shrink: %p\n", curr);
getchar();
return 0;
}
Before brk: the process has no independent [heap] segment. The last mapped region is end_data:
0804a000-0804b000 rw-p 00001000 08:01 539624 /home/user/sbrk_test
After brk(+4096): a [heap] entry appears:
0804b000-0804c000 rw-p 00000000 00:00 0 [heap]
start_brk = 0x0804b000 = end_data. brk = 0x0804c000.
After shrink: the [heap] entry disappears. The pages are unmapped. If we then called sbrk(0), it would return 0x0804b000.
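The same check can be made from inside the process instead of reading the maps file by hand. This is a rough sketch; has_heap_mapping is a hypothetical helper, and it assumes /proc is mounted.

```c
#include <stdio.h>
#include <string.h>

// Scan /proc/self/maps for a line tagged [heap].
// Returns 1 if found, 0 if not, -1 if the file cannot be opened.
// has_heap_mapping is a hypothetical helper name for this article.
int has_heap_mapping(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    char line[512];
    int found = 0;
    while (fgets(line, sizeof line, f)) {
        if (strstr(line, "[heap]")) {   // the pathname column holds the tag
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

One caveat: fopen itself allocates memory, so with glibc the act of checking usually creates the heap. This helper cannot observe the "before brk" state from the experiment; for that, reading /proc/PID/maps from another terminal remains the reliable method.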
The limitation that matters: brk is a stack
You can only shrink from the top. If thread A allocates at address X, thread B allocates at address X+1000, and thread A frees its memory, the brk cannot be lowered because thread B’s allocation is still active at the top. This is why glibc’s free() rarely calls brk with a negative increment. The memory is kept in glibc’s bins and reused for future allocations, but the kernel doesn’t get it back.
The practical consequence: a process that allocates 100MB in small chunks, frees the 99MB it allocated first, and then goes idle still shows ~100MB RSS. The heap can’t shrink because the remaining 1MB sits at the top.
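The stack-like rule can be captured in a toy model. This is not how glibc or the kernel is implemented; every name below is hypothetical and the "heap" is just an offset counter.

```c
#include <stddef.h>

// Toy model of a brk-style heap: the only state is a single top offset.
struct toy_heap {
    size_t top;          // current break, as an offset from the heap start
};

// Grow the heap by `size` bytes and return the offset of the new block.
size_t toy_alloc(struct toy_heap *h, size_t size)
{
    size_t start = h->top;
    h->top += size;
    return start;
}

// Try to hand a block back. The break can only drop if the block is the
// topmost one; otherwise the space stays inside the heap.
// Returns 1 if the break moved, 0 if it could not.
int toy_release(struct toy_heap *h, size_t start, size_t size)
{
    if (start + size != h->top)
        return 0;        // something still lives above this block
    h->top = start;
    return 1;
}
```

Allocating block A and then block B, releasing A fails to move the break while B is live. Releasing B succeeds. A real allocator would also remember that A is free and merge it; the toy leaves that out to keep the rule visible.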
How glibc uses brk for the main arena
When main() calls malloc(1000) for the first time, glibc does not call sbrk(1000). It extends the break by about 132KB: the request plus 128KB of top padding (M_TOP_PAD), rounded up to a whole page. The resulting heap is the same 132KB on typical 32-bit and 64-bit builds. This region backs the main arena.
Why allocate 132KB when the program only asked for 1KB? Because system calls are expensive (a switch into kernel mode, plus kernel page table and VMA operations). glibc treats the kernel as a wholesale supplier and handles retail distribution itself. The extra 131KB sits in glibc’s top chunk and serves subsequent malloc() calls without touching the kernel again.
This explains a common “problem”: you malloc(100), check /proc/pid/maps, and see 132KB of heap. You haven’t leaked anything — glibc just pre-allocated.
Only when the 132KB arena is exhausted does glibc call sbrk again to extend it. When you free() memory, the break moves back only if the free space at the very top of the heap exceeds M_TRIM_THRESHOLD (128KB by default), a condition that fragmented heaps rarely meet. Pages that were touched stay resident as far as the kernel is concerned, which is why long-running processes with bursty allocation patterns keep their high-water RSS.
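The 132KB figure can be reproduced with a little arithmetic. The sketch below is a simplification: it assumes 4KB pages and glibc's default top padding of 128KB, and it ignores the per-chunk header and minimum chunk size that glibc also adds. The function names are hypothetical.

```c
#include <stddef.h>

// Round n up to a multiple of align (align must be a power of two).
static size_t align_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

// Rough estimate of how far the break moves on the first malloc():
// the request plus the top padding, rounded up to whole pages.
// Simplification: glibc's chunk header and minimum chunk size are ignored.
size_t first_heap_extension(size_t request, size_t top_pad, size_t page)
{
    return align_up(request + top_pad, page);
}
```

With these assumptions, first_heap_extension(1000, 128 * 1024, 4096) gives 135168 bytes, which is 33 pages or 132KB. A 100-byte request lands on the same value, which matches the observation that tiny first allocations all produce a 132KB heap.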
When brk fails
brk cannot extend into the mmap region. If the shared libraries and thread stacks are mapped close to the heap, the program break may hit the mmap base. In practice, this happens on 32-bit systems where the 3GB user address space is tight. On 64-bit, the address space is large enough that this is not a concern.
When brk cannot extend, glibc falls back to mmap for the allocation — even for small requests. This is why a process may have multiple heap-like regions in its maps after heavy allocation.
mmap: the direct allocator
For allocations above a threshold (default 128KB, controlled by M_MMAP_THRESHOLD), glibc bypasses brk entirely and calls mmap to create an anonymous mapping. This has different characteristics:
- The memory is immediately released back to the kernel on free() (via munmap).
- The virtual address is randomized by ASLR (independent of the brk heap).
- No fragmentation in the main arena: each large allocation is its own mapping.
- Overhead: each mmap/munmap requires a VMA creation/destruction and a TLB flush on some architectures. This is slower than allocating from the arena.
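Thread arenas are also allocated via mmap. When a new thread first calls malloc(), glibc reserves an mmap region of HEAP_MAX_SIZE (1MB on 32-bit, 64MB on 64-bit). Only the first 132KB is made readable and writable; the rest is reserved address space that the arena can grow into.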
Experiment
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>
int main() {
printf("PID: %d\n", getpid());
getchar(); // check maps
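char *p = mmap(NULL, 132*1024,
PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED) { // mmap reports failure with MAP_FAILED, not NULL
perror("mmap");
return 1;
}
printf("mmap'd at: %p\n", (void *)p);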
getchar(); // check maps again
munmap(p, 132*1024);
printf("freed\n");
getchar(); // check maps after free
return 0;
}
Before:
08048000-0804b000 r-xp/rw-p ... /home/user/mmap_test (binary)
b7e21000-b7e22000 rw-p ... (anonymous, likely from ld.so)
After mmap:
b7e00000-b7e22000 rw-p 00000000 00:00 0
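A 136KB region appears (0x22000 = 136KB). The new 132KB mapping sits directly below the pre-existing 4KB anonymous mapping, and the kernel merged the two adjacent regions into one entry because they have identical permissions. The address (0xb7e00000) is in the mmap region, far from the brk-based heap.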
After munmap:
b7e21000-b7e22000 rw-p 00000000 00:00 0
The 132KB region is gone, leaving only a small remnant from the pre-existing mapping. This is the key difference from brk: munmap genuinely returns memory to the OS. RSS drops.
Reading /proc/pid/maps correctly
address perms offset dev inode pathname
559a7a0e9000-559a7a0ea000 r-xp 00000000 08:01 539691 /bin/somebinary
7f8c4a0e9000-7f8c4a0ea000 r-xp 00000000 08:01 123456 /lib/x86_64-linux-gnu/libc.so
7ffc9a0e9000-7ffc9a0ea000 rw-p 00000000 00:00 0 [stack]
| Field | What it tells you |
|---|---|
| 559a... | Virtual address range (ASLR-randomized on modern systems) |
| r-xp | Permissions: r/w/x, p=private or s=shared |
| offset | Where in the backing file this maps to (0 for anonymous) |
| dev:inode | Device and inode of backing file (0:0 for anonymous) |
| pathname | [heap], [stack], [vdso], file path, or blank |
A mapping with no pathname and rw-p is anonymous memory — typically mmap’d by malloc, thread stacks, or the loader itself.
Kernel-side allocation: what happens after the syscall
When user-space calls brk or mmap, the kernel does not immediately allocate physical pages. It only sets up virtual memory area (VMA) structures in mm_struct. Physical pages are allocated lazily, on first access, via the page fault handler.
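Lazy allocation can be observed from user space with mincore, which reports which pages of a mapping are resident. The sketch below assumes Linux; lazy_demo and resident_pages are hypothetical helper names.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE           // exposes mincore() and MAP_ANONYMOUS
#endif
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

// Count how many pages of [addr, addr + npages * page) are resident.
// Returns -1 if the query itself fails.
static long resident_pages(void *addr, size_t npages, size_t page)
{
    unsigned char *vec = malloc(npages);
    if (!vec)
        return -1;
    long count = -1;
    if (mincore(addr, npages * page, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1;   // bit 0 set means the page is resident
    }
    free(vec);
    return count;
}

// Map npages anonymous pages, write to the first `touch` of them, and
// report residency before and after. Returns 0 on success, -1 on error.
int lazy_demo(size_t npages, size_t touch, long *before, long *after)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, npages * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *before = resident_pages(p, npages, page);
    for (size_t i = 0; i < touch && i < npages; i++)
        p[i * page] = 1;           // the first write triggers a page fault
    *after = resident_pages(p, npages, page);

    munmap(p, npages * page);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

On a typical Linux system one would expect lazy_demo(16, 3, ...) to report no resident pages before the writes and about three after, though transparent huge pages or sandboxed kernels may report more.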
The buddy allocator (page-level)
The kernel’s page allocator uses the buddy system to manage physical pages. Free pages are grouped into lists of order-0 (4KB), order-1 (8KB), order-2 (16KB), etc., up to order-10 (4MB). Allocation requests are satisfied by splitting a larger block; freeing merges adjacent buddies back.
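The order arithmetic is easy to try in user space. The sketch below mirrors the idea behind the kernel's order calculation and buddy lookup, but it is not kernel code; the toy_ names are hypothetical and 4KB pages are assumed, as in the text.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u   // assumption: 4KB pages

// Smallest order such that 2^order pages cover `bytes`.
// `bytes` must be non-zero.
unsigned int toy_order_for(size_t bytes)
{
    size_t pages = (bytes + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
    unsigned int order = 0;
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}

// Page frame number of the buddy of the block that starts at `pfn`.
// Two buddies of a given order differ only in bit `order` of the PFN.
unsigned long toy_buddy_of(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}
```

A 12KB request needs three pages, so it is served from order 2 (four pages) and one page is wasted. When the order-2 block at PFN 12 is freed, its buddy is the block at PFN 8; if that one is free too, the pair merges into an order-3 block.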
The kernel-internal API:
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);
// Returns 2^order contiguous physical pages
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
// Same, but returns the kernel virtual address
These return physically contiguous memory. For directly mapped (low) memory, the kernel virtual address sits at a fixed offset from the physical address:
#define __pa(x) ((unsigned long)(x) - PAGE_OFFSET) // virt → phys
#define __va(x) ((void *)((unsigned long)(x) + PAGE_OFFSET)) // phys → virt
The slab allocator (sub-page)
kmalloc is the kernel equivalent of malloc. It is built on the slab allocator, which manages caches of fixed-size objects:
void *kmalloc(size_t size, gfp_t flags);
void kfree(const void *p);
kmalloc returns physically contiguous memory, suitable for DMA and hardware interaction. It is limited to ~4MB per allocation on most architectures.
A slab cache serves objects of one fixed size, which avoids external fragmentation inside the cache. The cost is internal fragmentation: kmalloc rounds each request up to the nearest size class, so a 20-byte request is served from the 32-byte cache and 12 bytes go unused. kmalloc dispatches to the appropriate cache based on size:
static inline void *kmalloc(size_t size, gfp_t flags)
{
if (__builtin_constant_p(size)) {
// compile-time known size: direct cache lookup
int i = 0;
if (size <= 16) i = 0;
else if (size <= 32) i = 1;
// ... dispatch to kmem_cache_alloc
return kmem_cache_alloc(cache[i], flags);
}
return __kmalloc(size, flags); // runtime path
}
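The cost of size-class rounding can be estimated with a user-space sketch. It assumes power-of-two classes starting at 16 bytes, as in the dispatch code above; real kmalloc caches also include intermediate sizes such as 96 and 192 bytes. The toy_ names are hypothetical.

```c
#include <stddef.h>

// Round a request up to a power-of-two size class, starting at 16 bytes.
// Simplified: intermediate classes such as 96 and 192 are left out.
size_t toy_size_class(size_t size)
{
    size_t cls = 16;
    while (cls < size)
        cls <<= 1;
    return cls;
}

// Bytes lost to rounding for one object of `size` bytes.
size_t toy_wasted_bytes(size_t size)
{
    return toy_size_class(size) - size;
}
```

Under these assumptions a 20-byte request lands in the 32-byte class and wastes 12 bytes, while a 65-byte request lands in the 128-byte class and wastes 63. The intermediate classes in the real allocator exist to trim that worst case.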
vmalloc (non-contiguous)
When physically contiguous memory is not needed (most data buffers), vmalloc allocates from a reserved virtual address range:
void *vmalloc(unsigned long size);
void vfree(const void *addr);
The virtual addresses are contiguous, but the physical pages are scattered. Page table manipulation is required to map them, which makes vmalloc slower than kmalloc or __get_free_pages. It is used for large allocations (kernel modules, framebuffers, large data structures) where contiguity is not required.
kmem_cache_create / kmem_cache_alloc
For frequently allocated kernel objects of the same size (like struct inode, struct task_struct, struct file), the slab allocator provides a dedicated interface:
struct kmem_cache *c = kmem_cache_create("my_object", sizeof(struct my_obj), 0, 0, NULL);
struct my_obj *o = kmem_cache_alloc(c, GFP_KERNEL);
// ... use o ...
kmem_cache_free(c, o);
This avoids the overhead of size-class dispatch in kmalloc and keeps hot objects in the per-CPU slab cache. The inode cache alone can save millions of allocation cycles on a busy filesystem.
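The idea behind a dedicated object cache can be illustrated with a user-space free-list pool. This is an analogy only, not the kernel's slab implementation: there are no per-CPU caches, constructors, or slab pages here, and every name is hypothetical.

```c
#include <stddef.h>

#define POOL_OBJECTS 64

// Hypothetical object type standing in for struct my_obj.
struct toy_obj {
    int id;
    char payload[60];
};

// A free slot stores the link to the next free slot in its own memory.
union toy_slot {
    struct toy_obj obj;
    union toy_slot *next_free;
};

struct toy_pool {
    union toy_slot slots[POOL_OBJECTS];
    union toy_slot *free_list;
};

// Thread every slot onto the free list, lowest index first.
void toy_pool_init(struct toy_pool *p)
{
    p->free_list = NULL;
    for (size_t i = POOL_OBJECTS; i-- > 0; ) {
        p->slots[i].next_free = p->free_list;
        p->free_list = &p->slots[i];
    }
}

// Pop one object; returns NULL when the pool is exhausted.
struct toy_obj *toy_pool_alloc(struct toy_pool *p)
{
    union toy_slot *s = p->free_list;
    if (!s)
        return NULL;
    p->free_list = s->next_free;
    return &s->obj;
}

// Push an object back; the most recently freed slot is reused first.
void toy_pool_free(struct toy_pool *p, struct toy_obj *o)
{
    union toy_slot *s = (union toy_slot *)o;
    s->next_free = p->free_list;
    p->free_list = s;
}
```

Allocation and release are a couple of pointer moves with no size lookup, and a freed object is handed out again first, which tends to keep it warm in the CPU cache. Those are the two properties the paragraph above attributes to kmem_cache.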
Why this matters for user-space developers
The kernel’s allocators are not directly callable from user space, but their behavior affects you:
- Page allocation is lazy. malloc + write touches physical pages. malloc + read of never-written memory maps a shared read-only zero page until the first write. This is why a process that mallocs 1GB but only writes to 10MB shows only 10MB RSS.
- Fragmentation mostly affects the kernel, not malloc. User-space anonymous memory is faulted in one 4KB page at a time, so malloc(2MB+1) does not need 2MB+1 of contiguous physical memory and will not fail because the buddy allocator is fragmented. Contiguity matters for high-order kernel allocations and for transparent huge pages, which need a free 2MB block. On a system that has been running for months, such blocks can be scarce even when plenty of total memory is free. This is why vmalloc exists inside the kernel, and why the kernel has a compaction mechanism (CONFIG_COMPACTION) that moves pages around to rebuild large contiguous blocks.
- Overcommit. By default, Linux overcommits memory: malloc(1GB) succeeds even if only 100MB of physical memory is free. If processes then actually use the memory and the system runs out, the OOM killer picks a process to kill, which is not necessarily the one that made the allocation. This is controlled by vm.overcommit_memory and vm.overcommit_ratio.
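The current overcommit policy can be read from procfs. A minimal sketch, assuming /proc is mounted; read_overcommit_mode is a hypothetical helper name.

```c
#include <stdio.h>

// Read vm.overcommit_memory from procfs.
// Returns the mode (0 = heuristic, 1 = always, 2 = strict),
// or -1 if the file cannot be read.
int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    if (!f)
        return -1;
    int mode = -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

Mode 0 is the default heuristic described above. In mode 2 the kernel refuses allocations beyond the commit limit, so malloc can return NULL up front instead of the process being killed later.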
The full path from malloc to physical page
malloc(1000)
↓
glibc: check fastbin → small bin → unsorted bin → large bin → top chunk
↓
top chunk too small → need more memory
↓
if size < MMAP_THRESHOLD (default 128KB):
sbrk(size) → brk syscall → kernel extends VMA → physical pages on page fault
else:
mmap(...) → mmap syscall → kernel creates VMA → physical pages on page fault
↓
glibc splits the new region, returns chunk to user
The “physical pages on page fault” part is the key: the syscall itself only updates metadata. The physical RAM is allocated when your process first touches the returned pointer.
FAQ
Q1: My process shows high RSS even after freeing. What happened?
A: glibc didn’t return the memory to the kernel. The freed chunks went into glibc’s bins for reuse. RSS only drops when glibc trims free space at the top of the brk heap or when an mmap’d chunk is munmap’d. This is normal and usually not a leak.
Q2: Can I force glibc to return memory to the OS?
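A: malloc_trim(0) asks glibc to give free memory back. It lowers the break with a negative sbrk when there is free space at the top of the heap, and since glibc 2.8 it also releases whole free pages elsewhere in the arenas (see the malloc_trim(3) man page). For mmap’d allocations, free always calls munmap.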
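A rough way to see what malloc_trim does to the break is to measure sbrk(0) on both sides of the call. This sketch is glibc-specific (malloc_trim is a GNU extension), the helper name is hypothetical, and the result depends on heap state, so treat the number as indicative only.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE           // exposes sbrk() under strict -std= modes
#endif
#include <malloc.h>
#include <stdlib.h>
#include <unistd.h>

// Allocate and free `count` chunks of `chunk` bytes, then call
// malloc_trim(0). Returns how many bytes the program break dropped
// across the trim call (0 if it did not move).
long trim_and_measure(size_t count, size_t chunk)
{
    void **ptrs = malloc(count * sizeof *ptrs);
    if (!ptrs)
        return 0;
    for (size_t i = 0; i < count; i++)
        ptrs[i] = malloc(chunk);
    for (size_t i = 0; i < count; i++)
        free(ptrs[i]);             // free(NULL) is a harmless no-op
    free(ptrs);

    char *before = sbrk(0);
    malloc_trim(0);                // ask glibc to release what it can
    char *after = sbrk(0);
    return (long)(before - after);
}
```

The result may well be 0: free() already trims the top of the heap on its own once the free space there passes the trim threshold, so the explicit call often has little left to release at the top.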
Q3: How do I know if my allocation used brk or mmap?
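A: By default, requests at or above the mmap threshold use mmap. The threshold starts at 128KB, and glibc may raise it at run time as large blocks are freed. Smaller requests in the main thread use the brk arena. Check /proc/pid/maps: the [heap] entry is the brk heap; anonymous mappings outside it are mmap’d.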
Q4: What is M_MMAP_THRESHOLD and should I change it?
A: It’s the size threshold that determines the brk/mmap split. Default 128KB. Setting it higher forces more allocations into the brk heap (fewer syscalls, but more fragmentation). Setting it lower forces more mmap usage (immediate release on free, but more TLB flushes). API: mallopt(M_MMAP_THRESHOLD, value).
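The effect of the threshold can be observed by checking where a block lands before and after changing it. The sketch below relies on a heuristic (a pointer below the current break is taken to be in the brk heap) and assumes glibc on Linux with the usual layout where the mmap region sits above the heap. The helper names are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE           // exposes sbrk() under strict -std= modes
#endif
#include <malloc.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

// Returns 1 if a fresh malloc(size) lands below the program break
// (taken to mean the brk heap), 0 if above it (taken to mean mmap),
// -1 if malloc fails. Heuristic only.
static int lands_in_brk_heap(size_t size)
{
    void *p = malloc(size);
    if (!p)
        return -1;
    int below = (uintptr_t)p < (uintptr_t)sbrk(0);
    free(p);
    return below;
}

// Record where a `size`-byte request lands, set M_MMAP_THRESHOLD to
// `new_threshold`, and record where the same request lands afterwards.
// Returns 0 on success, -1 if mallopt rejects the value.
int threshold_demo(int new_threshold, size_t size, int *before, int *after)
{
    *before = lands_in_brk_heap(size);
    if (mallopt(M_MMAP_THRESHOLD, new_threshold) != 1)  // 1 means success
        return -1;
    *after = lands_in_brk_heap(size);
    return 0;
}
```

With the default settings, threshold_demo(512 * 1024, 256 * 1024, ...) would be expected to show the 256KB block moving from an mmap region to the brk heap. Setting the threshold explicitly also turns off glibc's dynamic adjustment of it.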
Q5: Why does my heap address have no [heap] label in /proc/pid/maps?
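A: The kernel labels a mapping [heap] when it overlaps the range between start_brk and brk, regardless of the ASLR setting. Thread arenas and large allocations are mmap’d, so they show up as anonymous mappings without a label. If no [heap] line appears at all, the break has not been moved yet: either the program has not allocated from the main arena, or every allocation so far was served by mmap.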
References
- man 2 brk, man 2 mmap, man 3 mallopt
- Anatomy of a Program in Memory
- Understanding the Linux Kernel (Bovet & Cesati), Chapter 8: Memory Management
- Sploitfun: Syscalls used by malloc
- Linux kernel source: mm/mmap.c, mm/slab.c, mm/vmalloc.c