The Linux scheduler decides which task runs on which CPU at any given moment. Its design has evolved considerably: the current architecture uses a hierarchy of scheduling classes, each implementing a distinct policy. The Completely Fair Scheduler (CFS) handles the majority of tasks, while dedicated classes serve real-time and deadline workloads.

Scheduling classes

Scheduling classes are implemented through struct sched_class and are ordered by priority. The scheduler always picks the highest-priority non-empty class:
Class              Policy                                  Priority order
stop_sched_class   Internal stop-machine tasks             Highest
dl_sched_class     SCHED_DEADLINE                          2
rt_sched_class     SCHED_FIFO, SCHED_RR                    3
fair_sched_class   SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE   4
idle_sched_class   CPU idle loop                           Lowest
Each class implements hooks (enqueue_task, dequeue_task, pick_next_task, task_tick, etc.) that the core scheduler calls without needing to know policy details.
/* From kernel/sched/sched.h — partial sched_class interface */
struct sched_class {
    void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
    struct task_struct *(*pick_next_task)(struct rq *rq);
    void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
    void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);
    /* ... */
};
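As a rough sketch of how the core uses these hooks (the real logic lives in __pick_next_task() in kernel/sched/core.c, which also contains a fair-class fast path), picking the next task amounts to walking the classes in priority order until one returns a runnable task:
/* Conceptual sketch only — not the kernel's actual implementation. */
const struct sched_class *class;
struct task_struct *p;

for_each_class(class) {              /* iterates classes highest-priority first */
    p = class->pick_next_task(rq);
    if (p)
        return p;                    /* idle_sched_class always has the idle task */
}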

Completely Fair Scheduler

CFS models an “ideal, precise multi-tasking CPU” that runs all tasks in parallel, each at 1/nr_running speed. On real hardware — where only one task runs per CPU at a time — it approximates this ideal using a virtual runtime (vruntime).

Virtual runtime and the red-black tree

Each task accumulates vruntime as it runs, normalised by its weight (derived from its nice value). CFS maintains a time-ordered red-black tree of all runnable tasks, keyed by p->se.vruntime. It always picks the leftmost node — the task with the least vruntime — as the next to run.
/* Simplified: task's scheduling entity, embedded in task_struct */
struct sched_entity {
    struct load_weight  load;       /* task weight (from nice) */
    struct rb_node      run_node;   /* node in the CFS rbtree */
    u64                 vruntime;   /* accumulated virtual runtime (ns) */
    /* ... */
};
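As a rough illustration of how vruntime accumulates (a sketch, not the kernel's calc_delta_fair() implementation, which uses precomputed fixed-point inverse weights), heavier tasks gain vruntime more slowly and therefore receive proportionally more CPU:
#include <stdint.h>

#define NICE_0_WEIGHT 1024   /* weight of a nice-0 task in sched_prio_to_weight */

/* Illustrative only: advance a task's vruntime by a weight-scaled delta. */
static uint64_t account_vruntime(uint64_t vruntime, uint64_t delta_exec_ns,
                                 uint64_t weight)
{
    return vruntime + delta_exec_ns * NICE_0_WEIGHT / weight;
}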
When the running task’s vruntime grows past the leftmost task’s vruntime by more than a granularity margin (which prevents excessive context switches), it is preempted in favour of the leftmost task.
CFS uses nanosecond-granularity accounting and does not rely on the HZ timer tick. The main exposed tunable is /sys/kernel/debug/sched/base_slice_ns, which adjusts the scheduling granularity to trade off between “low-latency desktop” and “high-throughput server” behaviour.

EEVDF

The Earliest Eligible Virtual Deadline First (EEVDF) scheduler, merged in kernel 6.6, is gradually replacing CFS’s pick-next logic. EEVDF assigns each task a virtual deadline in addition to vruntime, allowing it to better handle latency-sensitive tasks without sacrificing fairness.
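As a loose illustration of the idea (a sketch, not the kernel's pick_eevdf() logic), the pick step prefers the earliest virtual deadline among tasks that are still “eligible”, i.e. have not already run ahead of their fair share:
#include <stdint.h>
#include <stddef.h>

struct entity {
    uint64_t vruntime;    /* virtual runtime, as under CFS */
    uint64_t vdeadline;   /* roughly vruntime + requested slice scaled by weight */
    int      eligible;    /* has not run ahead of its fair share */
};

/* Illustrative only: pick the eligible entity with the earliest virtual deadline. */
static struct entity *pick_earliest_eligible(struct entity *e, size_t n)
{
    struct entity *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (!e[i].eligible)
            continue;
        if (!best || e[i].vdeadline < best->vdeadline)
            best = &e[i];
    }
    return best;
}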

Nice levels and weights

Nice values (−20 to +19) map to task weights. Each one-unit increase in nice costs a task roughly 10% of its CPU share relative to the adjacent level (adjacent weights differ by a factor of about 1.25). The weight table, sched_prio_to_weight, is defined in kernel/sched/core.c.
# Set nice level at launch
nice -n 10 my_program

# Renice a running process
renice -n -5 -p <pid>
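The same adjustment can be made programmatically; a minimal sketch using setpriority(2), which operates on nice values (lowering the nice value below its current setting requires CAP_SYS_NICE or a suitable RLIMIT_NICE):
#include <sys/resource.h>
#include <stdio.h>

int main(void)
{
    /* Raise this process's nice value to 10 (i.e. lower its priority). */
    if (setpriority(PRIO_PROCESS, 0, 10) != 0)
        perror("setpriority");
    printf("nice is now %d\n", getpriority(PRIO_PROCESS, 0));
    return 0;
}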

Real-time scheduling

Real-time tasks bypass CFS entirely and are handled by the RT scheduling class, which implements two policies: SCHED_FIFO and SCHED_RR.
A SCHED_FIFO task runs until it voluntarily yields, blocks, or is preempted by a higher-priority RT task. There are no timeslices; the task holds the CPU indefinitely while runnable at its priority level.
struct sched_param param = { .sched_priority = 50 };
sched_setscheduler(0, SCHED_FIFO, &param);
SCHED_RR is identical to SCHED_FIFO but adds a fixed timeslice. When the slice expires the task is moved to the back of the run queue for its priority level, giving other tasks at the same priority a turn.
struct sched_param param = { .sched_priority = 50 };
sched_setscheduler(0, SCHED_RR, &param);
The timeslice is visible at /proc/sys/kernel/sched_rr_timeslice_ms.
SCHED_DEADLINE, served by the separate dl_sched_class, implements the Earliest Deadline First (EDF) algorithm with CBS (Constant Bandwidth Server) admission control. Tasks declare a runtime, deadline, and period; the kernel guarantees that each task receives its declared runtime within each period.
struct sched_attr attr = {
    .size           = sizeof(attr),
    .sched_policy   = SCHED_DEADLINE,
    .sched_runtime  =  5000000,  /* 5 ms */
    .sched_deadline = 10000000,  /* 10 ms */
    .sched_period   = 10000000,  /* 10 ms */
};
/* Note: glibc gained a sched_setattr() wrapper only recently; on older
 * systems invoke it via syscall(SYS_sched_setattr, 0, &attr, 0). */
sched_setattr(0, &attr, 0);
RT priorities 1–99 are available to processes with CAP_SYS_NICE. A runaway SCHED_FIFO task at priority 99 can starve all other work including softirqs. The RT throttle (/proc/sys/kernel/sched_rt_runtime_us) limits RT tasks to 95% of CPU time by default to preserve system liveness.
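A minimal sketch that reads the current throttle budget (the value is in microseconds of RT execution allowed per sched_rt_period_us; -1 disables throttling):
#include <stdio.h>

int main(void)
{
    long runtime_us;
    FILE *f = fopen("/proc/sys/kernel/sched_rt_runtime_us", "r");

    if (!f || fscanf(f, "%ld", &runtime_us) != 1)
        return 1;
    fclose(f);
    printf("RT budget: %ld us per period (-1 = unlimited)\n", runtime_us);
    return 0;
}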

CPU affinity

CPU affinity restricts which CPUs a task may run on. The affinity mask is stored in task_struct.cpus_mask.
# Pin process to CPUs 0 and 1
taskset -cp 0,1 <pid>

# Launch with a specific affinity
taskset -c 2,3 my_program
/* Userspace API from <sched.h> (requires _GNU_SOURCE) */
cpu_set_t mask;
CPU_ZERO(&mask);
CPU_SET(2, &mask); CPU_SET(3, &mask);          /* allow CPUs 2 and 3 */
sched_setaffinity(0, sizeof(mask), &mask);     /* 0 = calling process */
sched_getaffinity(0, sizeof(mask), &mask);     /* read the mask back */

Load balancing and CPU topology

The scheduler models hardware topology as scheduling domains (sched-domains), which form a hierarchy from the CPU core level up through NUMA nodes. Load balancing runs periodically within each domain to migrate tasks from overloaded CPUs to idle ones.
Example topology (2-socket, 4-core/socket, 2-thread SMT):
  NUMA domain (node 0 ↔ node 1)
    ├── MC domain (socket 0: CPUs 0-7)
    │     ├── SMT domain (core 0: CPUs 0,1)
    │     ├── SMT domain (core 1: CPUs 2,3)
    │     └── ...
    └── MC domain (socket 1: CPUs 8-15)
          └── ...
The scheduler prefers to keep tasks within the innermost domain (SMT siblings share L1/L2 caches) and only migrates across NUMA boundaries when imbalance is significant, because remote memory access degrades performance.
Use numactl --hardware to inspect your system’s NUMA topology, and cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list to view which CPUs share a physical core with cpu0.
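For programmatic access, a small sketch reading the same sysfs file:
#include <stdio.h>

int main(void)
{
    char buf[64];
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list", "r");

    if (!f || !fgets(buf, sizeof(buf), f))
        return 1;
    fclose(f);
    printf("cpu0 shares its core with CPUs: %s", buf);  /* file ends with '\n' */
    return 0;
}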

Scheduler tunables

Key tunables are exposed under /proc/sys/kernel/ and /sys/kernel/debug/sched/:
Tunable                                      Description
/sys/kernel/debug/sched/base_slice_ns        CFS scheduling granularity
/proc/sys/kernel/sched_rt_runtime_us         RT tasks’ CPU budget per period
/proc/sys/kernel/sched_rt_period_us          RT throttle period
/proc/sys/kernel/sched_min_granularity_ns    Minimum CFS task runtime before preemption (older kernels)
/proc/sys/kernel/numa_balancing              Enable automatic NUMA balancing
Scheduler statistics are available per-CPU in /proc/schedstat and per-task in /proc/PID/schedstat.
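A minimal sketch reading the per-task statistics (the three fields in /proc/PID/schedstat are time spent on CPU in nanoseconds, time spent waiting on a runqueue in nanoseconds, and the number of timeslices run):
#include <stdio.h>

int main(void)
{
    unsigned long long run_ns, wait_ns, slices;
    FILE *f = fopen("/proc/self/schedstat", "r");

    if (!f || fscanf(f, "%llu %llu %llu", &run_ns, &wait_ns, &slices) != 3)
        return 1;
    fclose(f);
    printf("on-CPU: %llu ns, runqueue wait: %llu ns, timeslices: %llu\n",
           run_ns, wait_ns, slices);
    return 0;
}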

cgroups and CPU isolation

The cgroup cpu controller integrates with CFS group scheduling (CONFIG_FAIR_GROUP_SCHED). Tasks in a cgroup share a CPU allocation determined by cpu.shares (v1) or cpu.weight / cpu.max (v2).
# Limit a cgroup to 50% of one CPU (cgroup v2)
echo "50000 100000" > /sys/fs/cgroup/mygroup/cpu.max

# Assign processes to the group
echo <pid> > /sys/fs/cgroup/mygroup/cgroup.procs
For hard real-time isolation, isolcpus= on the kernel command line removes CPUs from the scheduler’s general-purpose pool, and nohz_full= stops the periodic scheduler tick on those CPUs (while only a single task is runnable) to reduce jitter.

Memory management

Page allocator, slab caches, and NUMA memory placement that the scheduler interacts with for per-CPU data structures.

Locking primitives

The per-CPU runqueue spinlocks and RCU usage that make the scheduler fast and safe.

Networking

NAPI polling and softirq processing that compete with tasks for CPU time.

Filesystems

I/O scheduling and how blocking on filesystem operations interacts with task state.
