Linux Kernel Security Architecture Overview

The Linux kernel implements security as a layered system built from multiple complementary mechanisms. Rather than relying on any single control, it combines discretionary access control, POSIX capabilities, namespace isolation, syscall filtering, and the Linux Security Modules framework to achieve defense in depth. Understanding how these layers interact is essential for anyone building, configuring, or auditing a Linux-based system.

Discretionary access control

Linux inherits UNIX discretionary access control (DAC): every file and process has an owner (UID) and a group (GID), and permissions are enforced for three classes: owner, group, and others. The kernel evaluates the read, write, and execute bits of the matching class on every filesystem access. Beyond the traditional UNIX permission mask, Linux supports POSIX Access Control Lists (ACLs), which let filesystem objects carry per-user and per-group permission entries that are more expressive than the three fixed classes. DAC is the baseline check and generally runs before any MAC policy: even with a mandatory access control module active, an operation that DAC denies fails regardless of what the module's policy would allow.
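The class selection described above can be sketched as a small pure function. This is a simplified model written for illustration, not kernel code: the name `dac_permits` and the `WANT_*` constants are hypothetical, and the sketch deliberately ignores ACLs, supplementary groups, and capabilities such as CAP_DAC_OVERRIDE.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Requested access, using the same bit values as one permission class. */
enum { WANT_X = 1, WANT_W = 2, WANT_R = 4 };

/*
 * Simplified DAC decision: pick exactly one class (owner, group, or
 * others), then test the requested bits against that class only.
 */
static bool dac_permits(mode_t mode, uid_t file_uid, gid_t file_gid,
                        uid_t proc_uid, gid_t proc_gid, unsigned want)
{
    unsigned bits;

    if (proc_uid == file_uid)
        bits = (mode >> 6) & 7;   /* owner class */
    else if (proc_gid == file_gid)
        bits = (mode >> 3) & 7;   /* group class */
    else
        bits = mode & 7;          /* others class */

    return (bits & want) == want;
}
```

Note that the classes are not cumulative: with mode 0077 the owner is denied even though group and others are allowed, because only the owner class is consulted for the owner.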

Linux capabilities

The traditional root/non-root split is too coarse for production systems. A process running as UID 0 traditionally bypasses the kernel's permission checks entirely; a process running as any other UID is subject to all of them. The Linux capabilities system divides the privileges historically associated with root into discrete units that can be granted independently.

Commonly used capabilities

Capability	Purpose
`CAP_NET_ADMIN`	Configure network interfaces, routing tables, and firewall rules
`CAP_SYS_ADMIN`	A wide range of administrative operations — treat as near-equivalent to root
`CAP_SYS_PTRACE`	Trace or inspect arbitrary processes
`CAP_DAC_OVERRIDE`	Bypass file read, write, and execute permission checks
`CAP_SETUID` / `CAP_SETGID`	Change UID/GID of the current process
`CAP_NET_BIND_SERVICE`	Bind to TCP/UDP ports below 1024
`CAP_SYS_MODULE`	Load and unload kernel modules
`CAP_SYS_CHROOT`	Use `chroot()`
`CAP_AUDIT_WRITE`	Write records to the kernel audit log

Capability sets

Each task carries five sets:

Permitted: the capabilities the process may raise into its effective set
Effective: the capabilities currently active and checked by the kernel
Inheritable: the capabilities preserved across execve() and combined with the executed file's inheritable set
Bounding: an upper bound on the capabilities that can be gained during execve(), particularly relevant when executing setuid-root binaries
Ambient: the capabilities preserved across execve() of a program that has no file capabilities and is not setuid or setgid (Linux 4.3 and later)

The capset() system call and file capability extended attributes (security.capability) allow fine-grained privilege assignment without requiring a setuid binary.
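To inspect which capabilities a process actually holds, one option is to read the hexadecimal masks the kernel exposes in /proc/<pid>/status (fields such as CapEff and CapPrm). The sketch below assumes that text format; the helper names `parse_cap_line`, `cap_mask_has`, and `read_effective_caps` are hypothetical and exist only for this example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <linux/capability.h>

/* Parse a status line such as "CapEff:\t0000000000000400". */
static bool parse_cap_line(const char *line, const char *field, uint64_t *mask)
{
    size_t n = strlen(field);
    unsigned long long v;

    if (strncmp(line, field, n) != 0 || line[n] != ':')
        return false;
    if (sscanf(line + n + 1, "%llx", &v) != 1)
        return false;
    *mask = v;
    return true;
}

/* Test one capability bit, e.g. CAP_NET_BIND_SERVICE. */
static bool cap_mask_has(uint64_t mask, int cap)
{
    return (mask >> cap) & 1;
}

/* Read the effective capability set of the calling process. */
static bool read_effective_caps(uint64_t *mask)
{
    char line[256];
    bool found = false;
    FILE *f = fopen("/proc/self/status", "r");

    if (!f)
        return false;
    while (!found && fgets(line, sizeof line, f))
        found = parse_cap_line(line, "CapEff", mask);
    fclose(f);
    return found;
}
```

A service could call `read_effective_caps()` at startup and refuse to run if `cap_mask_has(mask, CAP_SYS_ADMIN)` is true, as a guard against being launched with more privilege than intended.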

Avoid granting CAP_SYS_ADMIN to containers or services unless strictly necessary. Its scope is so broad that it effectively restores root-level kernel access.

Namespaces as security boundaries

Linux namespaces allow the kernel to present different views of global resources to different sets of processes. They are the foundation of container isolation.
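Each namespace a process belongs to is visible as a symbolic link under /proc/<pid>/ns/, with a target of the form `pid:[4026531836]`. Two processes share a namespace when the corresponding links identify the same namespace object. The following sketch reads that identifier; the helper names are hypothetical, and the comparison is simplified to the inode number alone.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Extract the inode number from a link target such as "net:[4026531840]". */
static bool parse_ns_inode(const char *target, unsigned long long *inode)
{
    const char *lb = strchr(target, '[');
    char rb = 0;

    return lb && sscanf(lb, "[%llu%c", inode, &rb) == 2 && rb == ']';
}

/* Read the namespace identifier for e.g. pid = "self", type = "net". */
static bool read_ns_inode(const char *pid, const char *type,
                          unsigned long long *inode)
{
    char path[64], target[64];
    ssize_t n;

    if (snprintf(path, sizeof path, "/proc/%s/ns/%s", pid, type)
            >= (int)sizeof path)
        return false;
    n = readlink(path, target, sizeof target - 1);
    if (n < 0)
        return false;
    target[n] = '\0';
    return parse_ns_inode(target, inode);
}
```

Comparing the result for "self" with the result for PID "1" is a quick, if rough, way to tell whether the caller runs in the same network or PID namespace as the host's init. Reading another process's links may require additional privilege.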

User namespaces

User namespaces map UIDs and GIDs inside the namespace to a different range outside. A process that appears as UID 0 inside a user namespace holds capabilities only over resources governed by that namespace and gains no additional privilege outside it. This allows unprivileged users to create isolated environments without requiring CAP_SYS_ADMIN in the initial user namespace.
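The mapping is expressed as extents in /proc/<pid>/uid_map, each line giving the first ID inside the namespace, the first ID outside, and a length. A minimal sketch of the translation follows; `struct id_extent` and `map_id_to_parent` are hypothetical names, and the 100000 base used in the usage note is only an illustrative value.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One uid_map line: inside-start, outside-start, length. */
struct id_extent {
    uint32_t inside;
    uint32_t outside;
    uint32_t count;
};

/* Translate an ID seen inside the namespace to the ID seen by the parent. */
static bool map_id_to_parent(const struct id_extent *map, size_t n,
                             uint32_t id, uint32_t *out)
{
    for (size_t i = 0; i < n; i++) {
        if (id >= map[i].inside && id - map[i].inside < map[i].count) {
            *out = map[i].outside + (id - map[i].inside);
            return true;
        }
    }
    return false;   /* the ID has no mapping in this namespace */
}
```

With a single extent `{0, 100000, 65536}`, UID 0 inside the namespace corresponds to UID 100000 outside, which is why "root" in such a container owns nothing that host root owns.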

PID namespaces

PID namespaces give a process tree its own PID numbering. Process 1 inside the namespace is the namespace init: if it exits, all other processes in the namespace receive SIGKILL. Processes in the parent namespace can still see and signal namespace-internal processes, using the PIDs those processes have in the parent namespace.
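The fact that one process has a different PID in each nested namespace can be observed through the NSpid field of /proc/<pid>/status, which lists the PID in every namespace the process belongs to, outermost first. The parser below is a sketch under that assumption; `parse_nspid` is a hypothetical helper name.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Parse a line such as "NSpid:\t12345\t1".
 * Stores up to max PIDs (outermost namespace first) and returns the count.
 */
static size_t parse_nspid(const char *line, long *pids, size_t max)
{
    size_t n = 0;
    char *end;

    if (strncmp(line, "NSpid:", 6) != 0)
        return 0;
    line += 6;
    while (n < max) {
        long v = strtol(line, &end, 10);
        if (end == line)        /* no more numbers on the line */
            break;
        pids[n++] = v;
        line = end;
    }
    return n;
}
```

For a container init viewed from the host, the line might contain two values: the PID the host uses to signal it and the value 1 that the process sees for itself.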

Network namespaces

Each network namespace has its own network interfaces, routing tables, firewall rules, and sockets. Containers use this to provide isolated network stacks. The veth device pair is the standard mechanism for connecting a namespace to the host's network namespace.

Mount, UTS, and IPC namespaces

Mount namespaces provide an independent view of the filesystem hierarchy. UTS namespaces give each namespace its own hostname. IPC namespaces isolate System V IPC objects and POSIX message queues.

User namespaces significantly expand the kernel attack surface available to unprivileged users. Some distributions restrict unprivileged user namespace creation via kernel.unprivileged_userns_clone or user.max_user_namespaces.

Seccomp for syscall filtering

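Seccomp (secure computing) lets a process restrict the system calls it can make. The BPF-based filter mode (SECCOMP_MODE_FILTER) allows a process to install a Berkeley Packet Filter program that inspects each syscall number and its arguments, then returns an action. The filter sees only the raw argument values: it cannot dereference pointers, so it cannot inspect strings or structures passed by reference.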

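/* Install a seccomp filter using prctl; prog is a struct sock_fprog (definition omitted) */
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
    perror("PR_SET_NO_NEW_PRIVS");
if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
    perror("PR_SET_SECCOMP");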

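The process must call prctl(PR_SET_NO_NEW_PRIVS, 1) first, or hold CAP_SYS_ADMIN in its user namespace. Without this requirement, an unprivileged process could install a filter and then execute a setuid binary under it, manipulating the behavior of the privileged program. A filter can return one of: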

Return value	Effect
`SECCOMP_RET_ALLOW`	Syscall proceeds normally
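`SECCOMP_RET_ERRNO`	Syscall is not executed; the specified errno is returned to the caller
`SECCOMP_RET_KILL_PROCESS`	Entire process is terminated immediately, as if killed by SIGSYS
`SECCOMP_RET_TRAP`	Syscall is not executed; kernel sends a catchable SIGSYS to the calling thread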
`SECCOMP_RET_USER_NOTIF`	Notification sent to a supervisor process
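As a concrete illustration of these return values, the sketch below builds a small filter that fails one chosen syscall with EPERM and allows everything else. It is a minimal example under stated assumptions, not a production policy: the names `SKETCH_ARCH`, `SKETCH_FILTER_LEN`, `build_deny_filter`, and `install_deny_filter` are hypothetical, only x86-64 and arm64 are covered, and x86-64 specifics such as the x32 ABI are not handled.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

#if defined(__x86_64__)
#define SKETCH_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define SKETCH_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

#define SKETCH_FILTER_LEN 7

/* Fill out[] with a filter that fails syscall number nr with EPERM. */
static void build_deny_filter(struct sock_filter out[SKETCH_FILTER_LEN],
                              unsigned int nr)
{
    const struct sock_filter insns[SKETCH_FILTER_LEN] = {
        /* Kill the process if the syscall ABI is not the expected one. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SKETCH_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Load the syscall number and compare it with the denied one. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };

    memcpy(out, insns, sizeof insns);
}

/* Install the filter for the calling thread; returns 0 on success. */
static int install_deny_filter(unsigned int nr)
{
    struct sock_filter insns[SKETCH_FILTER_LEN];
    struct sock_fprog prog = { .len = SKETCH_FILTER_LEN, .filter = insns };

    build_deny_filter(insns, nr);
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Checking the architecture first matters because syscall numbers differ between ABIs; a filter that compares numbers without it can be bypassed by issuing the call through another ABI. Real deployments usually prefer an allowlist over the denylist shown here.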

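Filters are inherited across fork() and preserved across execve(). Additional filters can be layered, but never removed: every installed filter is evaluated for each syscall, and the most restrictive action returned by any of them takes effect. Seccomp is most effective when combined with an LSM policy.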
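The layering rule can be modeled as a fold over the values returned by each installed filter. The sketch below reflects my understanding that the kernel compares the action portion of each return value as a signed integer and keeps the lowest one; `action_rank` and `combine_filter_results` are hypothetical names used only here.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/* Action part of a filter result as a signed value: lower means stricter. */
static int32_t action_rank(uint32_t ret)
{
    return (int32_t)(ret & SECCOMP_RET_ACTION_FULL);
}

/* Combine the results of all installed filters for a single syscall. */
static uint32_t combine_filter_results(const uint32_t *results, size_t n)
{
    uint32_t chosen = SECCOMP_RET_ALLOW;

    for (size_t i = 0; i < n; i++)
        if (action_rank(results[i]) < action_rank(chosen))
            chosen = results[i];
    return chosen;
}
```

Under this model a later filter can only tighten the policy: if one filter returns SECCOMP_RET_ERRNO and another returns SECCOMP_RET_ALLOW for the same syscall, the call fails with the errno.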

Linux Security Modules framework

The Linux Security Modules (LSM) framework provides a hook-based infrastructure for implementing mandatory access control and other security policies inside the kernel. At critical points, such as file opens, socket connections, and process credential changes, the kernel calls the hook functions registered by each active LSM, any of which can deny the operation. The active LSMs on a running system are listed at:

cat /sys/kernel/security/lsm

A typical output might be:

capability,landlock,lockdown,yama,selinux
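A program that depends on a particular LSM can check this comma-separated list before relying on it. The helper below is a small sketch that matches whole entries only; `lsm_list_contains` is a hypothetical name, and reading the file itself is left to the caller.

```c
#include <stdbool.h>
#include <string.h>

/* True if name appears as a whole comma-separated entry in list. */
static bool lsm_list_contains(const char *list, const char *name)
{
    size_t want = strlen(name);

    while (*list) {
        size_t len = strcspn(list, ",\n");

        if (len == want && strncmp(list, name, len) == 0)
            return true;
        list += len;
        if (*list)
            list++;   /* skip the separator */
    }
    return false;
}
```

Matching whole entries avoids false positives from substring searches, for example treating "land" as present because "landlock" is listed.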

See Linux Security Modules Framework for a detailed explanation of how the framework works, how to configure individual LSMs, and how to write a custom LSM.

Available LSMs

SELinux

Policy-based mandatory access control developed by the NSA. Every subject and object receives a label; a policy database determines which label combinations are allowed. Used by default on RHEL, Fedora, and Android.

AppArmor

Path-based MAC that confines programs to a declared set of files, capabilities, and network access via per-application profiles. Used by default on Ubuntu and Debian.

Smack

Simplified Mandatory Access Control Kernel. Uses short labels on files and processes; a simple rule set controls which labels can read or write to which other labels.

TOMOYO

Pathname-based MAC that builds a learning profile of allowed operations over time and can then enforce that profile. Focused on reducing false positives via training mode.

Landlock

User-space sandboxing API. Any process, including unprivileged ones, can voluntarily restrict its own filesystem access (and, on newer kernels, TCP bind and connect operations) by constructing and enforcing a ruleset.

BPF LSM

Allows eBPF programs to be attached to LSM hooks, enabling policy enforcement that can be loaded and updated at runtime without recompiling the kernel.

Kernel hardening mechanisms

Beyond access control, the kernel contains a suite of self-protection mechanisms that make exploitation of kernel bugs significantly harder.

KASLR (CONFIG_RANDOMIZE_BASE): randomizes the kernel’s load address at each boot so that an attacker cannot rely on fixed memory addresses.
SMEP: Supervisor Mode Execution Prevention, an x86 CPU feature that prevents the kernel from executing code in user-space pages.
SMAP: Supervisor Mode Access Prevention, an x86 CPU feature that prevents the kernel from reading or writing user-space memory outside explicitly marked access windows.
Stack canaries (CONFIG_STACKPROTECTOR): a secret value placed between a function's local variables and its return address, verified before the function returns.
KASAN: Kernel Address Sanitizer, a runtime detector for out-of-bounds accesses and use-after-free bugs. Its overhead makes it primarily a testing and debugging tool.
CONFIG_STRICT_KERNEL_RWX: enforces that kernel text is not writable and kernel data is not executable.
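Whether the CPU-backed protections are available can be checked from user space on x86, where /proc/cpuinfo reports `smep` and `smap` in its flags line. The helper below is a sketch that looks for a whole-word match in such a line; `cpu_flags_have` is a hypothetical name and the sample line in the usage note is abbreviated.

```c
#include <stdbool.h>
#include <string.h>

/* True if flag is a whole whitespace-separated word in a cpuinfo flags line. */
static bool cpu_flags_have(const char *flags_line, const char *flag)
{
    size_t want = strlen(flag);
    const char *p = strchr(flags_line, ':');

    p = p ? p + 1 : flags_line;        /* skip the "flags :" label if present */
    while (*p) {
        p += strspn(p, " \t\n");       /* skip separators */
        size_t len = strcspn(p, " \t\n");

        if (len > 0 && len == want && strncmp(p, flag, len) == 0)
            return true;
        p += len;
    }
    return false;
}
```

For a line such as `flags : fpu vme smep erms smap`, the function reports both `smep` and `smap` as present but does not match the shorter `sme`.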

See Kernel Self-Protection and Hardening for a full breakdown of hardening Kconfig options and deployment guidance.

Reporting vulnerabilities

The Linux kernel security team handles embargoed vulnerability reports at security@kernel.org. Reports should include a description of the vulnerability, affected kernel versions, and reproduction steps if available. The team coordinates with distributors and upstream maintainers to prepare fixes before public disclosure.

LSM framework

How LSM hooks work, configuring SELinux and AppArmor, and writing a custom LSM.

Kernel hardening

KASLR, stack protection, heap integrity, and the Kconfig hardening checklist.
