The Linux scheduler decides which task runs on which CPU at any given moment. Its design has evolved considerably: the current architecture uses a hierarchy of scheduling classes, each implementing a distinct policy. The Completely Fair Scheduler (CFS) handles the majority of tasks, while dedicated classes serve real-time and deadline workloads.
## Scheduling classes
Scheduling classes are implemented through `struct sched_class` and are ordered by priority. The scheduler always picks the highest-priority class that has runnable tasks:
| Class | Policy | Priority order |
|---|---|---|
| `stop_sched_class` | Internal stop-machine tasks | 1 (highest) |
| `dl_sched_class` | `SCHED_DEADLINE` | 2 |
| `rt_sched_class` | `SCHED_FIFO`, `SCHED_RR` | 3 |
| `fair_sched_class` | `SCHED_NORMAL`, `SCHED_BATCH`, `SCHED_IDLE` | 4 |
| `idle_sched_class` | CPU idle loop | 5 (lowest) |
Each class implements a common set of hooks (`enqueue_task`, `dequeue_task`, `pick_next_task`, `task_tick`, etc.) that the core scheduler calls without needing to know policy details.
## Completely Fair Scheduler
CFS models an “ideal, precise multi-tasking CPU” that runs all tasks in parallel, each at `1/nr_running` speed. On real hardware — where only one task runs per CPU at a time — it approximates this ideal using a virtual runtime (vruntime).
### Virtual runtime and the red-black tree
Each task accumulates vruntime as it runs, normalised by its weight (derived from its nice value). CFS maintains a time-ordered red-black tree of all runnable tasks, keyed by `p->se.vruntime`. It always picks the leftmost node — the task with the least vruntime — as the next to run.
CFS uses nanosecond-granularity accounting and does not rely on the HZ timer tick. The only exposed tunable is `/sys/kernel/debug/sched/base_slice_ns`, which adjusts the scheduling granularity between “low-latency desktop” and “high-throughput server” workloads.

### EEVDF
The Earliest Eligible Virtual Deadline First (EEVDF) scheduler, merged in kernel 6.6, is gradually replacing CFS’s pick-next logic. EEVDF assigns each task a virtual deadline in addition to vruntime, allowing it to better handle latency-sensitive tasks without sacrificing fairness.

### Nice levels and weights
Nice values (−20 to +19) map to task weights. A one-unit nice increase results in roughly a 10% reduction in CPU share relative to a nice-0 task. The weight table is defined in `kernel/sched/core.c`.
## Real-time scheduling
Real-time tasks bypass CFS entirely and are handled by the RT scheduling class. There are two policies:
### SCHED_FIFO — first in, first out
A `SCHED_FIFO` task runs until it voluntarily yields, blocks, or is preempted by a higher-priority RT task. There are no timeslices; the task holds the CPU indefinitely while runnable at its priority level.
### SCHED_RR — round robin
`SCHED_RR` is identical to `SCHED_FIFO` but adds a fixed timeslice. When the slice expires the task is moved to the back of the run queue for its priority level, giving other tasks at the same priority a turn. The timeslice length is tunable via `/proc/sys/kernel/sched_rr_timeslice_ms`.
### SCHED_DEADLINE — sporadic task model
`SCHED_DEADLINE` implements the Earliest Deadline First (EDF) algorithm with CBS (Constant Bandwidth Server) admission control. Tasks declare a runtime, deadline, and period; the kernel guarantees that each task receives its declared runtime within each period.
## CPU affinity

CPU affinity restricts which CPUs a task may run on. The affinity mask is stored in `task_struct.cpus_mask`.
## Load balancing and CPU topology
The scheduler models hardware topology as scheduling domains (sched-domains), which form a hierarchy from the CPU core level up through NUMA nodes. Load balancing runs periodically within each domain to migrate tasks from overloaded CPUs to idle ones.

## Scheduler tunables
Key tunables are exposed under `/proc/sys/kernel/` and `/sys/kernel/debug/sched/`:
| Tunable | Description |
|---|---|
| `/sys/kernel/debug/sched/base_slice_ns` | CFS scheduling granularity |
| `/proc/sys/kernel/sched_rt_runtime_us` | RT tasks’ CPU budget per period |
| `/proc/sys/kernel/sched_rt_period_us` | RT throttle period |
| `/proc/sys/kernel/sched_min_granularity_ns` | Minimum CFS task runtime before preemption |
| `/proc/sys/kernel/numa_balancing` | Enable automatic NUMA balancing |
Scheduler statistics are exported system-wide in `/proc/schedstat` and per-task in `/proc/PID/schedstat`.
## cgroups and CPU isolation
The cgroup `cpu` controller integrates with CFS group scheduling (`CONFIG_FAIR_GROUP_SCHED`). Tasks in a cgroup share a CPU allocation determined by `cpu.shares` (v1) or `cpu.weight` / `cpu.max` (v2).
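A cgroup v2 sketch (the group name `build` is an assumption, paths assume cgroup2 mounted at `/sys/fs/cgroup`, and writing these files requires root):

```shell
# Hypothetical setup: group "build" gets double the default weight and
# is capped at 2 CPUs' worth of time (200 ms quota per 100 ms period).
mkdir /sys/fs/cgroup/build
echo 200 > /sys/fs/cgroup/build/cpu.weight      # default is 100
echo "200000 100000" > /sys/fs/cgroup/build/cpu.max
echo $$ > /sys/fs/cgroup/build/cgroup.procs     # move this shell in
```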
`isolcpus=` on the kernel command line removes CPUs from the scheduler’s general-purpose pool, and `nohz_full=` disables the scheduler tick on those CPUs to eliminate jitter.
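For example, a boot-time fragment isolating CPUs 2–5 might look like this (the CPU range is purely illustrative):

```
isolcpus=2-5 nohz_full=2-5
```

Work is then placed on the isolated CPUs explicitly, e.g. with `taskset -c 2-5`.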
## Related pages

- **Memory management**: the page allocator, slab caches, and NUMA memory placement that the scheduler interacts with for per-CPU data structures.
- **Locking primitives**: the per-CPU runqueue spinlocks and RCU usage that make the scheduler fast and safe.
- **Networking**: NAPI polling and softirq processing that compete with tasks for CPU time.
- **Filesystems**: I/O scheduling and how blocking on filesystem operations interacts with task state.
