The Linux memory management subsystem is responsible for every byte of RAM on the system. It partitions physical memory into zones, maintains free-page lists per node and zone, routes allocation requests through a hierarchy of allocators, and reclaims pages under pressure. Understanding these layers is essential before writing kernel code that allocates or frees memory.
## Memory zones
Linux divides physical memory into zones whose boundaries are set by architectural constraints. Each zone is described by `struct zone` and belongs to a `pg_data_t` node.
| Zone | Description |
|---|---|
| `ZONE_DMA` | Memory below 16 MB on x86. Required by legacy ISA DMA devices. Enabled with `CONFIG_ZONE_DMA`. |
| `ZONE_DMA32` | Memory addressable by 32-bit DMA engines on 64-bit platforms. Enabled with `CONFIG_ZONE_DMA32`. |
| `ZONE_NORMAL` | Directly mapped kernel memory. Always enabled and the most performance-critical zone. |
| `ZONE_HIGHMEM` | Physical memory not permanently mapped by the kernel on 32-bit architectures. Enabled with `CONFIG_HIGHMEM`. |
| `ZONE_MOVABLE` | Normal memory whose pages may be migrated or reclaimed, used mainly for memory hot-plug. |
| `ZONE_DEVICE` | Memory residing on devices such as persistent memory (PMEM) and GPUs. Enabled with `CONFIG_ZONE_DEVICE`. |
Many kernel operations require `ZONE_NORMAL` memory. Requesting memory from `ZONE_DMA` or `ZONE_DMA32` exhausts a scarce resource — avoid those zones unless the hardware explicitly requires them.

## GFP flags

Every allocation in the kernel carries a `gfp_t` bitmask that tells the allocator which zones are acceptable, whether it may sleep, and how hard it should try to reclaim memory. The acronym stands for "get free pages", the name of the underlying page allocator function.
The high-level composite flags defined in `include/linux/gfp_types.h` are:
### GFP_KERNEL
The default flag for most kernel allocations. The allocator may sleep, perform direct reclaim, start I/O, and call into the filesystem. Use this in any process context that can block.
### GFP_ATOMIC
Must not sleep. Grants access to atomic reserves via `__GFP_HIGH`. Use inside interrupt handlers, spinlock-protected sections, or any context where sleeping is forbidden.

### GFP_NOIO

May reclaim clean pages but must not start physical I/O. Use inside block-layer code paths to prevent recursion into I/O submission.
### GFP_NOFS
May start physical I/O but must not call into the filesystem. Use inside filesystem code to prevent re-entrancy.
### GFP_DMA / GFP_DMA32
Restrict allocation to `ZONE_DMA` or `ZONE_DMA32`. Use only when the hardware cannot address higher memory.

These composite flags can be combined with low-level modifier bits such as `__GFP_ZERO` (return a zeroed allocation), `__GFP_NOWARN` (suppress allocation-failure messages), `__GFP_NOFAIL` (retry infinitely — use with extreme care), and `__GFP_NORETRY` (fail quickly without invoking the OOM killer).
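A minimal sketch of how the composite flags and modifier bits combine with bitwise OR; the function and its `len` parameter are hypothetical:

```c
#include <linux/slab.h>

/* Sketch: choosing GFP flags by calling context. */
static void *ctx_alloc(size_t len, bool in_atomic_context)
{
	if (in_atomic_context)
		/* Interrupt handler or under a spinlock: must not sleep. */
		return kmalloc(len, GFP_ATOMIC);

	/* Normal process context: may sleep, reclaim, and start I/O.
	 * __GFP_ZERO hands back the buffer already zeroed. */
	return kmalloc(len, GFP_KERNEL | __GFP_ZERO);
}
```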
## Page allocator
The page allocator (also called the buddy allocator) is the lowest-level allocator. It works in units of physically contiguous pages, with sizes expressed as a power-of-two order. `alloc_pages()` returns a `struct page *`; `__get_free_pages()` returns the virtual address of the first page. `order` is the log₂ of the number of pages — order 0 is one page, order 1 is two pages, order 2 is four pages, and so on.
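For example, an order-2 request yields four physically contiguous pages. A sketch of the allocate/free pairing (the surrounding driver is hypothetical):

```c
#include <linux/gfp.h>

static unsigned long buf;

static int buf_init(void)
{
	/* Order 2 = 2^2 = 4 contiguous pages; returns a virtual address. */
	buf = __get_free_pages(GFP_KERNEL, 2);
	if (!buf)
		return -ENOMEM;
	return 0;
}

static void buf_exit(void)
{
	/* The order passed to free_pages() must match the allocation. */
	free_pages(buf, 2);
}
```

`alloc_pages(GFP_KERNEL, 2)` is the `struct page *` counterpart, paired with `__free_pages()`.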
### NUMA-aware page allocation
On NUMA systems you can constrain the page allocator to a specific node. Pass `NUMA_NO_NODE` to allow allocation from the current CPU's local node with automatic fallback. Use `numa_node_id()` or `cpu_to_node()` to obtain the node ID of the calling CPU. Linux builds an ordered zonelist per node so that, when the local zone overflows, it falls back to the nearest node before trying remote nodes.
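A sketch using `alloc_pages_node()`, the node-constrained variant of `alloc_pages()`; the wrapper function is illustrative:

```c
#include <linux/gfp.h>
#include <linux/topology.h>

/* Sketch: allocate one page near the calling CPU. Passing
 * NUMA_NO_NODE instead of nid would let the allocator pick the
 * local node with automatic fallback. */
static struct page *alloc_near_me(void)
{
	int nid = numa_node_id();	/* node of the calling CPU */

	return alloc_pages_node(nid, GFP_KERNEL, 0);
}
```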
## Slab allocator
The slab allocator sits above the page allocator and carves pages into fixed-size object caches. The current implementation is SLUB. It reduces fragmentation, improves cache locality through per-CPU free lists, and optionally validates allocation/free patterns when `CONFIG_SLUB_DEBUG` is enabled.
### kmalloc and friends
`kmalloc` is the general-purpose slab allocation function, defined in `include/linux/slab.h`:
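A sketch of the basic allocate/check/free pattern; `struct my_data` is hypothetical:

```c
#include <linux/slab.h>

struct my_data {
	int id;
	char name[32];
};

static struct my_data *my_data_alloc(void)
{
	struct my_data *d;

	d = kmalloc(sizeof(*d), GFP_KERNEL);
	if (!d)
		return NULL;	/* allocation can fail: always check */
	d->id = 0;
	return d;
}

static void my_data_free(struct my_data *d)
{
	kfree(d);	/* kfree(NULL) is a safe no-op */
}
```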
`kzalloc` is equivalent to `kmalloc` followed by a `memset` to zero, but expressed as a single call. Prefer it whenever you need a zeroed buffer.
`kmalloc` has a maximum allocation size of `KMALLOC_MAX_SIZE` (`1UL << KMALLOC_SHIFT_MAX`). SLUB maps requests directly to a slab cache for sizes up to `KMALLOC_MAX_CACHE_SIZE` (two pages); larger requests fall through to the page allocator.
### Per-type object caches
When you allocate many objects of the same type, create a dedicated cache with `kmem_cache_create`:
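A sketch of creating and using a dedicated cache; `struct request_ctx` and the cache name are hypothetical:

```c
#include <linux/slab.h>

struct request_ctx {
	u64 tag;
	void *payload;
};

static struct kmem_cache *req_cache;

static int req_cache_init(void)
{
	req_cache = kmem_cache_create("request_ctx",
				      sizeof(struct request_ctx),
				      0,			/* align */
				      SLAB_HWCACHE_ALIGN,	/* flags */
				      NULL);			/* ctor */
	if (!req_cache)
		return -ENOMEM;
	return 0;
}
```

Objects then come from `kmem_cache_alloc(req_cache, GFP_KERNEL)` and return via `kmem_cache_free()`; the cache itself is torn down with `kmem_cache_destroy()`.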
The `KMEM_CACHE` macro provides a convenient shorthand:
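A sketch of the shorthand form, which derives the cache name, object size, and alignment from the struct type; `struct io_ctx` is hypothetical:

```c
#include <linux/slab.h>

struct io_ctx {
	u64 sector;
	int status;
};

static struct kmem_cache *io_cache;

static int io_cache_init(void)
{
	/* Equivalent to kmem_cache_create("io_ctx", sizeof(struct io_ctx),
	 * __alignof__(struct io_ctx), SLAB_HWCACHE_ALIGN, NULL). */
	io_cache = KMEM_CACHE(io_ctx, SLAB_HWCACHE_ALIGN);
	return io_cache ? 0 : -ENOMEM;
}
```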
| Flag | Effect |
|---|---|
| `SLAB_HWCACHE_ALIGN` | Align objects on cache-line boundaries. |
| `SLAB_PANIC` | Panic on allocation failure during cache creation. |
| `SLAB_TYPESAFE_BY_RCU` | Delay page freeing by an RCU grace period (does not delay object freeing). |
| `SLAB_RECLAIM_ACCOUNT` | Objects are reclaimable; pages are charged to `SReclaimable` in `/proc/meminfo`. |
## vmalloc
`vmalloc` allocates virtually contiguous but physically non-contiguous memory. It is slower than `kmalloc` because it must allocate pages individually and map them into a contiguous virtual range with new page table entries.
Use `vmalloc` when:

- The allocation is large (many megabytes) and physical contiguity is not required.
- The allocation is long-lived and `kmalloc` would fragment the buddy allocator.
- You need to map I/O memory or firmware buffers into the kernel virtual address space.
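A sketch of the first case, a large long-lived table; the 16 MB size and the function names are illustrative:

```c
#include <linux/vmalloc.h>

static void *table;

static int table_init(void)
{
	/* vzalloc = vmalloc + zeroing; no physical contiguity needed. */
	table = vzalloc(16 * 1024 * 1024);
	if (!table)
		return -ENOMEM;
	return 0;
}

static void table_exit(void)
{
	vfree(table);	/* never kfree() a vmalloc'd pointer */
}
```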
## Memory reclaim and the OOM killer
When free memory falls below a watermark, the kernel attempts to reclaim pages:

1. **kswapd wakes.** When the low watermark of a zone is crossed, `kswapd` (the background reclaim daemon) wakes and scans the LRU lists looking for reclaimable pages.
2. **Direct reclaim.** If `kswapd` cannot keep up and an allocation is failing, the allocating task itself enters direct reclaim. This is triggered by `GFP_KERNEL` (and other flags that include `__GFP_DIRECT_RECLAIM`).
3. **OOM killer.** If reclaim still cannot satisfy the allocation, the out-of-memory killer selects and kills a victim process. You can read a process's badness score from `/proc/<pid>/oom_score` and influence it with `/proc/<pid>/oom_score_adj` (range −1000 to +1000). Kernel threads and processes that set `oom_score_adj` to −1000 are protected from the OOM killer.
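From user space, `oom_score_adj` is just a `/proc` file. A sketch (raising the value needs no privileges; lowering it below the current value requires `CAP_SYS_RESOURCE`):

```c
#include <stdio.h>

/* Write adj to /proc/self/oom_score_adj and return the value read
 * back, or -1001 on error (the valid range is -1000..1000). */
int set_self_oom_score_adj(int adj)
{
	FILE *f = fopen("/proc/self/oom_score_adj", "w");
	int readback = -1001;

	if (!f)
		return -1001;
	fprintf(f, "%d\n", adj);
	fclose(f);

	f = fopen("/proc/self/oom_score_adj", "r");
	if (!f)
		return -1001;
	if (fscanf(f, "%d", &readback) != 1)
		readback = -1001;
	fclose(f);
	return readback;
}
```

Writing 500 here makes the process a preferred OOM victim relative to its peers.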
## Further reading
### Locking and concurrency
Learn which locking primitives to use when protecting shared memory structures.
### Networking stack
See how `sk_buff` manages packet memory and interacts with the allocator.