Memory management

The Linux memory management subsystem is responsible for every byte of RAM on the system. It partitions physical memory into zones, maintains free-page lists per node and zone, routes allocation requests through a hierarchy of allocators, and reclaims pages under pressure. Understanding these layers is essential before writing kernel code that allocates or frees memory.

Memory zones

Linux divides physical memory into zones whose boundaries are set by architectural constraints. Each zone is described by struct zone and belongs to a pg_data_t node.
Zone           Description
ZONE_DMA       Memory below 16 MB on x86. Required by legacy ISA DMA devices. Enabled with CONFIG_ZONE_DMA.
ZONE_DMA32     Memory addressable by 32-bit DMA engines on 64-bit platforms. Enabled with CONFIG_ZONE_DMA32.
ZONE_NORMAL    Directly mapped kernel memory. Always enabled and the most performance-critical zone.
ZONE_HIGHMEM   Physical memory not permanently mapped by the kernel on 32-bit architectures. Enabled with CONFIG_HIGHMEM.
ZONE_MOVABLE   Normal memory whose pages may be migrated or reclaimed, used mainly for memory hot-plug.
ZONE_DEVICE    Memory residing on devices such as persistent memory (PMEM) or GPUs. Enabled with CONFIG_ZONE_DEVICE.
On a 32-bit x86 system with 2 GB of RAM the layout looks like this:
/* Documentation/mm/physical_memory.rst */
0         16M                    896M                        2G
+----------+-----------------------+--------------------------+
| ZONE_DMA |      ZONE_NORMAL      |       ZONE_HIGHMEM       |
+----------+-----------------------+--------------------------+
Many kernel operations require ZONE_NORMAL memory. Requesting memory from ZONE_DMA or ZONE_DMA32 exhausts a scarce resource — avoid those zones unless the hardware explicitly requires them.
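As a hedged illustration (the function and variable names here are hypothetical), a legacy ISA driver that truly cannot address memory above 16 MB would request ZONE_DMA memory explicitly:
/* Hypothetical sketch: allocating from ZONE_DMA for a legacy ISA device */
#include <linux/slab.h>

static void *isa_buf;

static int isa_setup_buffer(void)
{
    /* GFP_DMA restricts the allocation to ZONE_DMA (below 16 MB on x86) */
    isa_buf = kmalloc(4096, GFP_KERNEL | GFP_DMA);
    if (!isa_buf)
        return -ENOMEM;
    return 0;
}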

GFP flags

Every allocation in the kernel carries a gfp_t bitmask that tells the allocator which zones are acceptable, whether it may sleep, and how hard it should try to reclaim memory. The acronym stands for get free pages, the name of the underlying page allocator function. The high-level composite flags defined in include/linux/gfp_types.h are:
/* include/linux/gfp_types.h */
#define GFP_ATOMIC   (__GFP_HIGH | __GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL   (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_NOIO     (__GFP_RECLAIM)
#define GFP_NOFS     (__GFP_RECLAIM | __GFP_IO)
#define GFP_NOWAIT   (__GFP_KSWAPD_RECLAIM | __GFP_NOWARN)
#define GFP_DMA      __GFP_DMA
#define GFP_DMA32    __GFP_DMA32
GFP_KERNEL: The default flag for most kernel allocations. The allocator may sleep, perform direct reclaim, start I/O, and call into the filesystem. Use it in any process context that can block.
GFP_ATOMIC: Must not sleep. Grants access to atomic reserves via __GFP_HIGH. Use it inside interrupt handlers, spinlock-protected sections, or any context where sleeping is forbidden.
GFP_NOIO: May reclaim clean pages but must not start physical I/O. Use it inside block-layer code paths to prevent recursion into I/O submission.
GFP_NOFS: May start physical I/O but must not call into the filesystem. Use it inside filesystem code to prevent re-entrancy.
GFP_DMA / GFP_DMA32: Restrict the allocation to ZONE_DMA or ZONE_DMA32. Use them only when the hardware cannot address higher memory.
Useful modifier flags include __GFP_ZERO (return a zeroed allocation), __GFP_NOWARN (suppress allocation failure messages), __GFP_NOFAIL (retry infinitely — use with extreme care), and __GFP_NORETRY (fail quickly without invoking the OOM killer).
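As an illustration (a minimal sketch; the helper name is hypothetical), an opportunistic allocation that returns zeroed memory and fails fast without log noise combines several of these modifiers:
/* Hypothetical sketch: a zeroed, fail-fast, quiet allocation */
#include <linux/slab.h>

static void *try_alloc_buffer(size_t size)
{
    /* __GFP_ZERO:    return zeroed memory
     * __GFP_NORETRY: fail quickly instead of retrying hard
     * __GFP_NOWARN:  skip the failure warning; the caller has a fallback */
    return kmalloc(size, GFP_KERNEL | __GFP_ZERO | __GFP_NORETRY | __GFP_NOWARN);
}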

Page allocator

The page allocator (also called the buddy allocator) is the lowest-level allocator. It works in units of physically contiguous pages, with sizes expressed as a power-of-two order.
/* include/linux/gfp.h */
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
unsigned long get_zeroed_page(gfp_t gfp_mask);

void __free_pages(struct page *page, unsigned int order);
void free_pages(unsigned long addr, unsigned int order);
alloc_pages() returns a struct page pointer; __get_free_pages() returns the kernel virtual address of the first page. order is the log₂ of the number of pages: order 0 is one page, order 1 is two pages, order 2 is four pages, and so on.
Order-0 allocations are the only ones always expected to succeed under memory pressure. Higher orders may fail even with __GFP_NOFAIL (which is not supported above order 1). If you need large buffers, consider vmalloc when virtual contiguity suffices, or the DMA API when a device needs physically contiguous memory.
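For illustration, a minimal sketch (the function name is hypothetical) that allocates and frees four contiguous pages:
/* Hypothetical sketch: allocate four contiguous pages (order 2) */
#include <linux/gfp.h>

static int four_pages_example(void)
{
    unsigned long addr = __get_free_pages(GFP_KERNEL, 2);

    if (!addr)
        return -ENOMEM;

    /* ... use the 4-page buffer starting at addr ... */

    free_pages(addr, 2);    /* the order must match the allocation */
    return 0;
}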

NUMA-aware page allocation

On NUMA systems you can constrain the page allocator to a specific node:
/* include/linux/gfp.h */
struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order);
Pass NUMA_NO_NODE to allocate from the current CPU’s local node with automatic fallback. Use numa_node_id() or cpu_to_node() to obtain the node ID of the calling CPU. Linux builds an ordered zonelist per node so that, when the local node’s memory is exhausted, allocations fall back to the nearest node before trying more remote ones.
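A minimal sketch (the function name is hypothetical) that allocates one page on the calling CPU’s local node:
/* Hypothetical sketch: node-local page allocation */
#include <linux/gfp.h>
#include <linux/topology.h>

static struct page *alloc_local_page(void)
{
    int nid = numa_node_id();    /* node of the CPU we are running on */

    /* Falls back along the node's zonelist if local memory is exhausted */
    return alloc_pages_node(nid, GFP_KERNEL, 0);
}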

Slab allocator

The slab allocator sits above the page allocator and carves pages into fixed-size object caches. The current implementation is SLUB. It reduces fragmentation, improves cache locality through per-CPU free lists, and optionally validates allocation/free patterns when CONFIG_SLUB_DEBUG is enabled.

kmalloc and friends

kmalloc is the general-purpose slab allocation function, defined in include/linux/slab.h:
/* include/linux/slab.h */
void *kmalloc(size_t size, gfp_t flags);
void *kzalloc(size_t size, gfp_t flags);    /* zero-initialised */
void *krealloc(const void *p, size_t new_size, gfp_t flags);
void  kfree(const void *objp);
void  kfree_sensitive(const void *objp);    /* zeroes memory before freeing */
size_t ksize(const void *objp);             /* actual usable size */
kzalloc is equivalent to kmalloc(...) + memset(0) but expressed as a single call. Prefer it whenever you need a zeroed buffer.
/* Typical allocation pattern */
struct my_data *data = kzalloc(sizeof(*data), GFP_KERNEL);
if (!data)
    return -ENOMEM;

/* ... use data ... */

kfree(data);
Always check the return value of kmalloc/kzalloc. Under memory pressure both can return NULL even with GFP_KERNEL. Do not use __GFP_NOFAIL to paper over missing error handling.
kmalloc has a maximum allocation size of KMALLOC_MAX_SIZE (1UL << KMALLOC_SHIFT_MAX). SLUB maps requests directly to a slab cache for sizes up to KMALLOC_MAX_CACHE_SIZE (two pages); larger requests fall through to the page allocator.
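When a size may land on either side of that boundary, the kernel’s kvmalloc() family tries kmalloc first and transparently falls back to vmalloc; free the result with kvfree(). A minimal sketch (the struct and function names are hypothetical):
/* Hypothetical sketch: size-flexible allocation with kvmalloc_array() */
#include <linux/slab.h>
#include <linux/types.h>

struct entry {
    u64 key;
    u64 val;
};

static struct entry *table;

static int table_alloc(size_t nentries)
{
    /* Tries kmalloc first, falls back to vmalloc for large sizes;
     * kvmalloc_array() also guards against multiplication overflow.
     * Free with kvfree(), which handles both backing allocators. */
    table = kvmalloc_array(nentries, sizeof(*table), GFP_KERNEL | __GFP_ZERO);
    if (!table)
        return -ENOMEM;
    return 0;
}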

Per-type object caches

When you allocate many objects of the same type, create a dedicated cache with kmem_cache_create:
/* include/linux/slab.h — public macro, two calling conventions: */
/* New form:    kmem_cache_create(name, object_size, args, flags) */
/* Legacy form: kmem_cache_create(name, object_size, align, flags, ctor) */
struct kmem_cache *kmem_cache_create(const char *name,
                                     unsigned int object_size,
                                     struct kmem_cache_args *args, /* or align */
                                     slab_flags_t flags);

void  kmem_cache_destroy(struct kmem_cache *s);
void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags);
void  kmem_cache_free(struct kmem_cache *s, void *objp);
The KMEM_CACHE macro provides a convenient shorthand:
/* include/linux/slab.h */
#define KMEM_CACHE(__struct, __flags)   \
    __kmem_cache_create_args(#__struct, sizeof(struct __struct), \
            &(struct kmem_cache_args) {                          \
                .align = __alignof__(struct __struct),           \
            }, (__flags))

/* Usage */
static struct kmem_cache *my_cache;

my_cache = KMEM_CACHE(my_struct, SLAB_HWCACHE_ALIGN);
if (!my_cache)
    return -ENOMEM;

struct my_struct *obj = kmem_cache_alloc(my_cache, GFP_KERNEL);
if (!obj)
    return -ENOMEM;

kmem_cache_free(my_cache, obj);
kmem_cache_destroy(my_cache);
Commonly used slab flags:
Flag                   Effect
SLAB_HWCACHE_ALIGN     Align objects on cache-line boundaries.
SLAB_PANIC             Panic on allocation failure during cache creation.
SLAB_TYPESAFE_BY_RCU   Delay page freeing by an RCU grace period (does not delay object freeing).
SLAB_RECLAIM_ACCOUNT   Objects are reclaimable; pages are charged to SReclaimable in /proc/meminfo.

vmalloc

vmalloc allocates virtually contiguous but physically non-contiguous memory. It is slower than kmalloc because it must allocate pages individually and stitch them together with fresh page-table entries in the vmalloc area.
/* include/linux/vmalloc.h */
void *vmalloc(unsigned long size);
void *vzalloc(unsigned long size);   /* zero-initialised */
void  vfree(const void *addr);
Use vmalloc when:
  • The allocation is large (many megabytes) and physical contiguity is not required.
  • The allocation is long-lived and kmalloc would fragment the buddy allocator.
  • You need to map existing pages or I/O memory into a contiguous kernel virtual range (the related vmap() and ioremap() helpers use the same virtual area).
Memory returned by vmalloc cannot be used for DMA. DMA engines require physically contiguous memory; use the DMA API (dma_alloc_coherent) instead.
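A minimal sketch (names hypothetical) of a large, long-lived table for which vmalloc is a good fit:
/* Hypothetical sketch: a multi-megabyte table backed by vmalloc */
#include <linux/types.h>
#include <linux/vmalloc.h>

static u32 *big_table;

static int big_table_init(void)
{
    big_table = vzalloc(4 * 1024 * 1024);    /* 4 MB, zeroed */
    if (!big_table)
        return -ENOMEM;
    return 0;
}

static void big_table_exit(void)
{
    vfree(big_table);
    big_table = NULL;
}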

Memory reclaim and the OOM killer

When free memory falls below a watermark, the kernel attempts to reclaim pages:
1. kswapd wakes. When the low watermark of a zone is crossed, kswapd (the background reclaim daemon) wakes and scans the LRU lists looking for reclaimable pages.

2. Direct reclaim. If kswapd cannot keep up and an allocation is about to fail, the allocating task itself enters direct reclaim. This is triggered by GFP_KERNEL (and any other flags that include __GFP_DIRECT_RECLAIM).

3. OOM killer. If direct reclaim fails to free enough memory after repeated retries, the OOM (out-of-memory) killer selects a task to kill based on its oom_score. The selected process receives SIGKILL, freeing its memory and allowing the stalled allocation to proceed.
You can inspect OOM scoring via /proc/<pid>/oom_score and influence it with /proc/<pid>/oom_score_adj (range −1000 to +1000). Kernel threads and processes that set oom_score_adj to −1000 are protected from the OOM killer.
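For example, a userspace process can exempt itself from OOM selection by writing to its own oom_score_adj file (a minimal sketch; lowering the value typically requires CAP_SYS_RESOURCE):
/* Hypothetical userspace sketch: shield this process from the OOM killer */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/self/oom_score_adj", "w");

    if (!f) {
        perror("fopen");
        return 1;
    }
    fprintf(f, "-1000\n");    /* -1000 disables OOM selection entirely */
    fclose(f);
    return 0;
}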

Further reading

Locking and concurrency

Learn which locking primitives to use when protecting shared memory structures.

Networking stack

See how sk_buff manages packet memory and interacts with the allocator.
