The Linux memory management subsystem is one of the most complex parts of the kernel. It is responsible for mapping physical RAM into a form that processes and the kernel itself can use safely and efficiently — handling everything from raw page frame allocation to per-object caching, NUMA-aware placement, and memory reclaim under pressure.

Memory models

Linux abstracts physical memory diversity using one of two memory models selected at build time: FLATMEM and SPARSEMEM. FLATMEM suits non-NUMA systems with contiguous physical memory. A global mem_map array maps every page frame number (PFN) directly:
/* PFN to struct page in FLATMEM */
struct page *page = mem_map + (pfn - ARCH_PFN_OFFSET);
SPARSEMEM is the more capable model and the only one that supports memory hot-plug/remove, non-volatile memory devices, and deferred memory map initialization. Physical memory is divided into fixed-size sections, each represented by struct mem_section. With CONFIG_SPARSEMEM_VMEMMAP enabled, a virtually contiguous vmemmap array makes pfn_to_page() as cheap as an array index.
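With the vmemmap layout, both conversions reduce to pointer arithmetic on the global vmemmap array. A rough sketch of the idea (not the literal macro expansion on every architecture):
/* PFN <-> struct page with SPARSEMEM_VMEMMAP: both directions are a
 * single offset into the virtually contiguous vmemmap array. */
struct page *page = vmemmap + pfn;            /* pfn_to_page(pfn) */
unsigned long same_pfn = page - vmemmap;      /* page_to_pfn(page) */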
SPARSEMEM with VMEMMAP is the default on most 64-bit architectures, including x86-64 and arm64.

Memory zones

Within each NUMA node the kernel divides physical memory into zones, each constraining what kind of allocations it satisfies:
Zone           Purpose
ZONE_DMA       Pages reachable by legacy ISA DMA (first 16 MiB on x86)
ZONE_DMA32     Pages reachable by 32-bit DMA devices
ZONE_NORMAL    Directly mapped kernel memory; the primary allocation zone
ZONE_HIGHMEM   Memory above the kernel’s direct mapping (32-bit only)
ZONE_MOVABLE   Physically movable pages for memory hot-remove
ZONE_DEVICE    Memory-mapped device ranges (persistent memory, GPU memory)
When a zone cannot satisfy a request the kernel falls back through a zonelist ordered by NUMA distance, preferring the same zone type on a remote node before trying a different zone type locally.
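Callers steer zone selection with GFP zone modifiers rather than naming a zone directly. A minimal sketch for a device limited to 32-bit DMA addressing:
/* Satisfied from ZONE_DMA32 (falling back to ZONE_DMA), never from
 * memory the device cannot address. */
struct page *page = alloc_page(GFP_KERNEL | GFP_DMA32);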

Page allocator

The buddy allocator is the foundation of physical memory allocation. It tracks free memory in power-of-two orders (order 0 = 4 KiB, order 1 = 8 KiB, …, order MAX_ORDER = 4 MiB by default).
/* Allocate a single page, kernel use */
struct page *page = alloc_page(GFP_KERNEL);

/* Allocate 2^order contiguous pages */
struct page *pages = alloc_pages(GFP_KERNEL, order);

/* Allocate and return the kernel virtual address */
unsigned long addr = __get_free_pages(GFP_KERNEL, order);

/* Free pages back to the buddy allocator */
__free_pages(page, order);
Common GFP flags control allocation behaviour:
Flag           Meaning
GFP_KERNEL     May sleep; standard kernel allocation
GFP_ATOMIC     Must not sleep; for interrupt context
GFP_USER       Userspace allocation; may be reclaimed
GFP_NOWAIT     Non-blocking; fail rather than wait
__GFP_ZERO     Zero the allocated page(s)
__GFP_NOFAIL   Retry until successful (use sparingly)
Per-CPU page caches (PCP lists) short-circuit the buddy allocator for single-page allocations, reducing lock contention on busy systems.

Slab allocator

The slab layer sits above the buddy allocator and provides efficient, cache-friendly allocation of fixed-size kernel objects. The current implementation is SLUB (the default since 2.6.23).
/* General-purpose allocators */
void *buf = kmalloc(size, GFP_KERNEL);    /* allocate */
void *zbuf = kzalloc(size, GFP_KERNEL);   /* allocate and zero */
kfree(buf);                               /* free */
kfree(zbuf);

/* Per-type caches for frequently allocated objects */
struct kmem_cache *cache = kmem_cache_create(
    "my_object",           /* name */
    sizeof(struct my_obj), /* object size */
    0,                     /* alignment */
    SLAB_HWCACHE_ALIGN,    /* flags */
    NULL);                 /* constructor */

struct my_obj *obj = kmem_cache_alloc(cache, GFP_KERNEL);
kmem_cache_free(cache, obj);
kmem_cache_destroy(cache);
SLUB groups objects into slabs (one or more pages). It maintains per-CPU freelists to avoid locking on the fast path, and falls back to per-NUMA-node partial lists before requesting new pages from the buddy allocator.
Use kmem_cache_create() for objects allocated and freed at high frequency. The named cache appears in /proc/slabinfo and /sys/kernel/slab/, making it easy to monitor with slabtop.

Virtual memory areas and mm_struct

Every process has an mm_struct that describes its entire virtual address space. Individual mappings — anonymous memory, file-backed pages, stack, heap — are each represented by a struct vm_area_struct (VMA).
/* Defined in include/linux/mm_types.h */
struct mm_struct {
    struct maple_tree   mm_mt;       /* VMA tree (maple tree, replaces rbtree) */
    unsigned long       mmap_base;   /* base of mmap area */
    unsigned long       task_size;   /* size of task VM space */
    pgd_t               *pgd;        /* page global directory */
    atomic_t            mm_users;    /* How many users with user space? */
    atomic_t            mm_count;    /* How many references to "struct mm_struct" */
    /* ... */
};
VMAs are stored in a maple tree, which replaced the red-black tree in kernel 6.1, enabling efficient range queries during page fault handling, mmap(), and munmap().
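For illustration, kernel code can walk a task's VMAs in address order with the maple-tree-backed VMA iterator. A sketch (error handling omitted):
/* Print every mapping in mm, lowest address first. */
struct vm_area_struct *vma;
VMA_ITERATOR(vmi, mm, 0);                 /* start iterating at address 0 */

mmap_read_lock(mm);
for_each_vma(vmi, vma)
    pr_info("vma %lx-%lx\n", vma->vm_start, vma->vm_end);
mmap_read_unlock(mm);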
When a userspace access hits an unmapped or not-present address, the CPU raises a page fault. The kernel’s fault handler (do_page_fault() on x86) looks up the faulting address in the VMA tree and handles one of the following cases (a simplified sketch of the lookup follows the list):
  1. If no VMA covers the address → SIGSEGV.
  2. If the VMA is present but the page table entry is absent → allocate a physical page, map it, and return.
  3. If the page is in swap → read it back from swap and remap it.
  4. For copy-on-write (COW) faults → allocate a new page, copy the content, and update the PTE.
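A simplified sketch of that lookup and dispatch, with the real handlers' locking, retries, permission checks, and architecture detail omitted:
/* find_vma() returns the first VMA that ends above the address, so
 * the start must still be checked before handling the fault. */
struct vm_area_struct *vma = find_vma(mm, address);

if (!vma || address < vma->vm_start)
    return VM_FAULT_SIGSEGV;                          /* case 1: no VMA covers it */
return handle_mm_fault(vma, address, flags, regs);    /* cases 2-4 */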
mmap(MAP_ANONYMOUS) creates a new VMA without backing storage. Pages are not allocated until first access (demand paging): the kernel maps the shared zero page for the initial read and performs copy-on-write on the first write.
File-backed mappings (mmap(fd, ...)) integrate with the page cache: the same physical page can be shared between multiple processes mapping the same file.
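For an anonymous mapping, demand paging is visible from userspace as a cheap mmap() call followed by per-page faults on first touch. A minimal sketch (needs <sys/mman.h> and <stdio.h>):
/* Create a 16 MiB anonymous, private mapping: the call only sets up a
 * VMA; physical pages are faulted in as each page is first written. */
void *buf = mmap(NULL, 16UL << 20, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (buf == MAP_FAILED)
    perror("mmap");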

NUMA memory topology

On NUMA systems Linux divides hardware into nodes, each with CPUs and local memory. Allocations default to the node of the CPU executing the request (local allocation), minimising cross-interconnect traffic.
/* Allocate from a specific node */
struct page *page = alloc_pages_node(nid, GFP_KERNEL, order);

/* Get the memory node for the current CPU */
int nid = numa_mem_id();
Each node maintains independent free-page lists and zone statistics. The kernel’s zonelist for each node is ordered so that fallback visits the nearest nodes (by NUMA distance) first. System administrators can pin allocations using numactl(1) or the MPOL_BIND memory policy, and can inspect topology via /sys/devices/system/node/.
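The same policies can be set programmatically with set_mempolicy(2) or mbind(2). A minimal sketch, assuming node 0 exists (needs <numaif.h> and <stdio.h>, linked with -lnuma):
/* Restrict all future allocations of the calling thread to node 0. */
unsigned long nodemask = 1UL << 0;

if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)) != 0)
    perror("set_mempolicy");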

Memory reclaim and OOM killer

When free memory falls below a watermark, kswapd wakes and scans the LRU lists for pages to reclaim. The multi-generational LRU (MGLRU, merged in 6.1) tracks access age across multiple generations to make better eviction decisions.
The OOM killer is a last resort. It selects and kills a process using a score based on memory consumption, swap usage, and oom_score_adj. Setting /proc/PID/oom_score_adj to -1000 shields a process from OOM killing.
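For example, a process can exempt itself by writing the minimum score (a sketch using <stdio.h>; lowering the score generally requires CAP_SYS_RESOURCE):
/* Opt the current process out of OOM-killer selection entirely. */
FILE *f = fopen("/proc/self/oom_score_adj", "w");

if (f) {
    fputs("-1000", f);
    fclose(f);
}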
Memory reclaim paths:
  • Anonymous pages: written to swap if a swap device exists.
  • File-backed clean pages: simply discarded; re-read from disk on next access.
  • File-backed dirty pages: written back (flushed) before being freed.

Transparent huge pages

Transparent Huge Pages (THP) allow the kernel to use 2 MiB (or larger) pages for anonymous and file-backed memory without requiring application changes. THP reduces TLB pressure on workloads with large working sets.
# Check THP mode
cat /sys/kernel/mm/transparent_hugepage/enabled
# always [madvise] never

# Advise the kernel to use huge pages for a region
madvise(addr, length, MADV_HUGEPAGE);
When a contiguous 2 MiB region cannot be allocated at fault time, the mapping falls back to regular 4 KiB pages; khugepaged later scans such regions in the background and collapses suitable clusters of 4 KiB pages into huge pages.

DAMON — data access monitor

DAMON (Data Access MONitor) provides lightweight, accurate monitoring of actual memory access patterns. It operates in kernel space but exposes results through a sysfs interface (/sys/kernel/mm/damon/) and can drive memory management actions such as reclaim, THP promotion/demotion, and NUMA migration.
# List DAMON contexts
ls /sys/kernel/mm/damon/admin/kdamonds/
DAMON uses region-based sampling with an adaptive algorithm that keeps overhead below a configurable target ratio regardless of address space size.

Process scheduling

How the kernel schedules tasks across CPUs using CFS, real-time policies, and NUMA-aware load balancing.

Locking primitives

Spinlocks, mutexes, RCU, and other synchronisation mechanisms used throughout the MM subsystem.

Filesystems

How the page cache and VFS layer interact with filesystem implementations.

Networking

Socket buffers and how the networking stack allocates and manages memory.
