Power management in the Linux kernel divides into two distinct but complementary models: system sleep, where the entire system transitions to a low-power state like suspend-to-RAM or hibernation, and runtime PM, where individual devices power down independently while the rest of the system keeps running. Most drivers will encounter both, and the same dev_pm_ops structure is used to implement callbacks for each. Understanding when each callback is invoked—and what the kernel guarantees about ordering—is essential to writing correct PM support.

System sleep states

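The kernel supports several system sleep states. On x86 the deeper ones correspond to ACPI S-states, while suspend-to-idle keeps the platform in S0 and relies on its S0ix low-power substates where available: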

Suspend-to-Idle (S0ix)

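The shallowest sleep state, implemented entirely in software (the kernel also calls it s2idle). User space is frozen, devices are suspended, and CPUs enter their deepest idle states while memory remains powered. Resume is fast; modern laptops use it for “connected standby”.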

Standby / Shallow sleep (S1)

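Non-boot CPUs are taken offline and low-level system functions are suspended, while hardware context is preserved. It saves more power than suspend-to-idle but less than suspend-to-RAM, and it resumes more slowly than suspend-to-idle.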

Suspend-to-RAM (S3)

The most commonly implemented suspend state. All device drivers are asked to suspend; main memory is kept alive but almost everything else powers off. Resume restores hardware context.

Suspend-to-Disk / Hibernation (S4)

The kernel writes a memory image to swap or a dedicated partition, then powers off completely. On resume, the image is loaded and memory state is restored. Slowest to resume; survives power loss.

During a system sleep transition the kernel walks the device hierarchy bottom-up to suspend devices and top-down to resume them. This ordering ensures that a parent device (such as a PCIe root port) is not suspended until all of its children have been suspended, and that it is resumed before any of its children.
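The ordering can be illustrated with a small user-space C model. It is a sketch, not kernel code: struct model_dev, struct order_log, and the order_*() helpers are hypothetical, and the order among siblings is arbitrary here. Suspend visits a device's children before the device itself (post-order), and resume visits the device before its children (pre-order).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
#define MAX_CHILDREN 4
#define MAX_LOG 16

struct model_dev {
    const char *name;
    struct model_dev *children[MAX_CHILDREN];
    size_t nr_children;
};

/* Records the order in which devices were visited. */
struct order_log {
    const char *entries[MAX_LOG];
    size_t len;
};

static void log_dev(struct order_log *log, const struct model_dev *dev)
{
    assert(log->len < MAX_LOG);
    log->entries[log->len++] = dev->name;
}

/* Suspend: all children first, so the parent goes down last. */
static void order_suspend(const struct model_dev *dev, struct order_log *log)
{
    for (size_t i = 0; i < dev->nr_children; i++)
        order_suspend(dev->children[i], log);
    log_dev(log, dev);
}

/* Resume: the parent first, so children find it powered up. */
static void order_resume(const struct model_dev *dev, struct order_log *log)
{
    log_dev(log, dev);
    for (size_t i = 0; i < dev->nr_children; i++)
        order_resume(dev->children[i], log);
}
```

The kernel does not recurse over the tree like this; it keeps a list of devices ordered by registration (parents are registered before their children) and walks that list in reverse to suspend.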

The dev_pm_ops struct

All PM callbacks are gathered in struct dev_pm_ops, defined in include/linux/pm.h. You assign a populated instance to driver.pm:
#include <linux/pm.h>
#include <linux/pm_runtime.h>

static int mydrv_suspend(struct device *dev)
{
    struct mydrv_priv *priv = dev_get_drvdata(dev);

    /* Quiesce hardware, save state */
    mydrv_hw_stop(priv);
    return 0;
}

static int mydrv_resume(struct device *dev)
{
    struct mydrv_priv *priv = dev_get_drvdata(dev);

    /* Restore state, re-initialize hardware */
    mydrv_hw_start(priv);
    return 0;
}

static int mydrv_runtime_suspend(struct device *dev)
{
    struct mydrv_priv *priv = dev_get_drvdata(dev);

    mydrv_hw_clock_disable(priv);
    return 0;
}

static int mydrv_runtime_resume(struct device *dev)
{
    struct mydrv_priv *priv = dev_get_drvdata(dev);

    mydrv_hw_clock_enable(priv);
    return 0;
}

static const struct dev_pm_ops mydrv_pm_ops = {
    SYSTEM_SLEEP_PM_OPS(mydrv_suspend, mydrv_resume)
    RUNTIME_PM_OPS(mydrv_runtime_suspend, mydrv_runtime_resume, NULL)
};

static struct platform_driver mydrv_driver = {
    .probe  = mydrv_probe,
    .remove = mydrv_remove,
    .driver = {
        .name           = "my-device",
        .of_match_table = mydrv_of_match,
        .pm             = pm_ptr(&mydrv_pm_ops),
    },
};
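SYSTEM_SLEEP_PM_OPS() and RUNTIME_PM_OPS() are convenience macros that fill in the right fields. SYSTEM_SLEEP_PM_OPS() assigns the suspend callback to .suspend, .freeze, and .poweroff and the resume callback to .resume, .thaw, and .restore, so the same pair also covers hibernation; RUNTIME_PM_OPS() sets .runtime_suspend, .runtime_resume, and .runtime_idle. Combined with pm_ptr() on the driver.pm assignment, they let the compiler discard the callbacks when CONFIG_PM or CONFIG_PM_SLEEP is disabled, without #ifdef blocks or __maybe_unused annotations. Without them you would have to set each of these fields directly.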
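To make the expansion concrete, the following user-space mock mirrors which fields the two macros fill. It is an illustration, not the kernel header: struct mock_dev_pm_ops and the MOCK_* macros are hypothetical stand-ins trimmed to the relevant fields, and the mydrv_* callbacks are stubs. The kernel's definitions additionally wrap the system-sleep callbacks in pm_sleep_ptr().

```c
#include <assert.h>
#include <stddef.h>

struct device; /* opaque in this mock */

/* Hypothetical cut-down stand-in for struct dev_pm_ops. */
struct mock_dev_pm_ops {
    int (*suspend)(struct device *dev);
    int (*resume)(struct device *dev);
    int (*freeze)(struct device *dev);
    int (*thaw)(struct device *dev);
    int (*poweroff)(struct device *dev);
    int (*restore)(struct device *dev);
    int (*runtime_suspend)(struct device *dev);
    int (*runtime_resume)(struct device *dev);
    int (*runtime_idle)(struct device *dev);
};

/* One suspend/resume pair fills all six system-sleep slots. */
#define MOCK_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
    .suspend = suspend_fn, .resume = resume_fn,         \
    .freeze = suspend_fn, .thaw = resume_fn,            \
    .poweroff = suspend_fn, .restore = resume_fn,

#define MOCK_RUNTIME_PM_OPS(suspend_fn, resume_fn, idle_fn)      \
    .runtime_suspend = suspend_fn, .runtime_resume = resume_fn, \
    .runtime_idle = idle_fn,

/* Stub callbacks standing in for the driver's real ones. */
static int mydrv_suspend(struct device *dev) { (void)dev; return 0; }
static int mydrv_resume(struct device *dev) { (void)dev; return 0; }
static int mydrv_runtime_suspend(struct device *dev) { (void)dev; return 0; }
static int mydrv_runtime_resume(struct device *dev) { (void)dev; return 0; }

static const struct mock_dev_pm_ops mydrv_pm_ops = {
    MOCK_SYSTEM_SLEEP_PM_OPS(mydrv_suspend, mydrv_resume)
    MOCK_RUNTIME_PM_OPS(mydrv_runtime_suspend, mydrv_runtime_resume, NULL)
};
```

Reusing one pair for hibernation works for drivers whose freeze/thaw handling is identical to suspend/resume; drivers that need different behavior assign the hibernation fields separately.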

System sleep callback sequence

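When the system suspends, the PM core runs each phase below for every device before moving on to the next phase. For each device it takes the callback from the first of these that provides a struct dev_pm_ops: the PM domain, the device type, the class, or the bus type. If that level does not implement the callback for the current phase, the driver's own dev_pm_ops is used instead; subsystem-level callbacks are normally expected to call into the driver themselves: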
prepare → suspend → suspend_late → suspend_noirq
On resume the sequence reverses:
resume_noirq → resume_early → resume → complete
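Hibernation uses its own phases. Devices are frozen so that a consistent memory snapshot can be created, thawed so that the image can be written to storage, and finally powered off:
prepare → freeze → freeze_late → freeze_noirq
           [memory snapshot created]
thaw_noirq → thaw_early → thaw → complete
           [image written to storage]
prepare → poweroff → poweroff_late → poweroff_noirq
When the saved image is loaded on the next boot, the restored kernel brings devices back with:
restore_noirq → restore_early → restore → complete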
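During the noirq phases (suspend_noirq, resume_noirq, and their hibernation counterparts), the PM core has disabled device interrupt handlers, except for interrupts requested with IRQF_NO_SUSPEND, so the driver's own handler will not run. Callbacks in these phases must not wait for completions or events that are signaled from an interrupt handler. CPU interrupts themselves remain enabled, and the callbacks still run in process context.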
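The callback-selection rule described above can be sketched as a small user-space function. This is a simplified model under stated assumptions: enum pm_level, struct level_ops, and pick_callback() are hypothetical, each level is reduced to two flags, and legacy (non-dev_pm_ops) bus callbacks are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of how the PM core chooses a callback for one phase. */
enum pm_level { LVL_DOMAIN, LVL_TYPE, LVL_CLASS, LVL_BUS, LVL_DRIVER, LVL_NONE };

struct level_ops {
    bool has_ops;      /* this level provides a dev_pm_ops at all */
    bool has_callback; /* ...and implements the current phase */
};

/*
 * The first subsystem level (domain, type, class, bus) with a dev_pm_ops
 * is selected. If it lacks this phase's callback, lower subsystem levels
 * are NOT consulted; the driver's callback is used instead.
 */
static enum pm_level pick_callback(const struct level_ops lv[LVL_DRIVER + 1])
{
    for (int i = LVL_DOMAIN; i <= LVL_BUS; i++) {
        if (!lv[i].has_ops)
            continue;
        if (lv[i].has_callback)
            return (enum pm_level)i;
        break; /* selected level has no callback: fall back to driver */
    }
    if (lv[LVL_DRIVER].has_ops && lv[LVL_DRIVER].has_callback)
        return LVL_DRIVER;
    return LVL_NONE;
}
```

For example, if a device's type provides a dev_pm_ops without a suspend callback while its bus does provide one, the model (like the PM core) skips the bus and falls back to the driver.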

Runtime PM

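Runtime PM lets a device power down while the rest of the system keeps running. The PM core keeps a usage count for each device and, optionally, an autosuspend delay. When the usage count drops to zero and none of the device's children are active, the device becomes eligible for suspend: depending on which put helper was used, the core either runs the optional runtime_idle callback first or schedules the suspend directly, and with autosuspend enabled it waits until the delay has expired. It then calls the driver's runtime_suspend callback.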
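A simplified user-space model of the usage-count and autosuspend bookkeeping is sketched below. It is not the kernel's implementation: struct rpm_model and the model_*() functions are hypothetical, time is passed in explicitly instead of using timers, and children and the runtime_idle callback are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; times are in milliseconds and passed in by the caller. */
struct rpm_model {
    int usage_count;
    bool suspended;
    long last_busy_ms;
    long autosuspend_delay_ms;
};

/* Like a synchronous get: take a reference and make sure the device is up. */
static void model_get(struct rpm_model *d)
{
    d->usage_count++;
    d->suspended = false; /* stands in for calling runtime_resume */
}

/* Like mark_last_busy + put_autosuspend: drop the reference, note the time. */
static void model_put(struct rpm_model *d, long now_ms)
{
    assert(d->usage_count > 0); /* an unbalanced put is a driver bug */
    d->usage_count--;
    d->last_busy_ms = now_ms;
}

/* Timer check: suspend only when unused and the delay has fully elapsed. */
static void model_tick(struct rpm_model *d, long now_ms)
{
    if (d->usage_count == 0 && !d->suspended &&
        now_ms - d->last_busy_ms >= d->autosuspend_delay_ms)
        d->suspended = true; /* the core would call runtime_suspend here */
}
```

With a 2000 ms delay and a put at t=1000, a check at t=2500 leaves the device active and a check at t=3000 suspends it; any new get in between keeps it powered.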

Enabling runtime PM in probe()

static int mydrv_probe(struct platform_device *pdev)
{
    struct mydrv_priv *priv;

    priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
    if (!priv)
        return -ENOMEM;

    priv->dev = &pdev->dev;

    /* ... resource acquisition; leave the hardware powered on ... */

    /* Publish driver data before any runtime PM callback can run */
    platform_set_drvdata(pdev, priv);

    /* Configure autosuspend: power off after 2 seconds of idle */
    pm_runtime_set_autosuspend_delay(&pdev->dev, 2000);
    pm_runtime_use_autosuspend(&pdev->dev);

    /* Mark device as active before enabling runtime PM */
    pm_runtime_set_active(&pdev->dev);
    pm_runtime_enable(&pdev->dev);

    return 0;
}

static void mydrv_remove(struct platform_device *pdev)
{
    /* Undo pm_runtime_use_autosuspend() from probe() */
    pm_runtime_dont_use_autosuspend(&pdev->dev);
    pm_runtime_disable(&pdev->dev);
    pm_runtime_set_suspended(&pdev->dev);
}
Always call pm_runtime_disable() in remove() before the device hardware is torn down. Failing to do so can result in the PM core calling runtime_suspend on an already-removed device.

Acquiring and releasing the device in driver I/O paths

Every time your driver needs the hardware to be powered on, it increments the PM usage count. When it is done, it decrements the count, potentially triggering suspend:
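static irqreturn_t mydrv_irq_handler(int irq, void *data)
{
    struct mydrv_priv *priv = data;

    /*
     * A powered-down device normally cannot raise interrupts, but on a
     * shared IRQ line this handler may run while the device is
     * runtime-suspended. Don't touch its registers in that case.
     */
    if (pm_runtime_suspended(priv->dev))
        return IRQ_NONE;

    mydrv_handle_interrupt(priv);
    return IRQ_HANDLED;
}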

static ssize_t mydrv_read(struct file *filp, char __user *buf,
                           size_t count, loff_t *ppos)
{
    struct mydrv_priv *priv = filp->private_data;
    int ret;

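    /*
     * Power up the device synchronously; blocks until resume completes.
     * pm_runtime_get_sync() increments the usage count even when it
     * fails, so drop that reference before returning the error.
     */
    ret = pm_runtime_get_sync(priv->dev);
    if (ret < 0) {
        pm_runtime_put_noidle(priv->dev);
        return ret;
    }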

    ret = mydrv_hw_read(priv, buf, count);

    /* Decrement usage count; schedule autosuspend if idle */
    pm_runtime_mark_last_busy(priv->dev);
    pm_runtime_put_autosuspend(priv->dev);

    return ret;
}
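The error path in mydrv_read() exists because pm_runtime_get_sync() keeps its usage-count increment even when resuming fails. The user-space model below is a sketch with hypothetical names (resume_ret stands in for the result of the runtime_resume callback). It shows the leak that pm_runtime_put_noidle() prevents, and a helper that drops the count on failure, which is what pm_runtime_resume_and_get() does in kernels that provide it.

```c
#include <assert.h>

/* Hypothetical model of usage-count handling on the resume error path. */
struct rpm_count {
    int usage_count;
    int resume_ret; /* 0 = resume succeeds, negative errno = it fails */
};

/* Mirrors pm_runtime_get_sync(): always increments, returns resume result. */
static int model_get_sync(struct rpm_count *d)
{
    d->usage_count++;
    return d->resume_ret;
}

/* Mirrors pm_runtime_put_noidle(): drop the count without suspending. */
static void model_put_noidle(struct rpm_count *d)
{
    assert(d->usage_count > 0);
    d->usage_count--;
}

/* Mirrors pm_runtime_resume_and_get(): undoes the increment on failure. */
static int model_resume_and_get(struct rpm_count *d)
{
    int ret = model_get_sync(d);

    if (ret < 0) {
        model_put_noidle(d);
        return ret;
    }
    return 0;
}
```

Where pm_runtime_resume_and_get() is available, the get_sync/put_noidle pair in mydrv_read() can be replaced by a single call that returns 0 on success and leaves no reference behind on failure.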
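Key runtime PM functions:

| Function | Effect |
|---|---|
| pm_runtime_enable(dev) | Allow runtime PM for this device |
| pm_runtime_disable(dev) | Disable runtime PM; waits for in-progress callbacks and blocks new suspend/resume requests |
| pm_runtime_get_sync(dev) | Increment usage count and resume the device synchronously if suspended; the count stays incremented even on error |
| pm_runtime_resume_and_get(dev) | Like pm_runtime_get_sync(), but drops the usage count again if resume fails |
| pm_runtime_get_noresume(dev) | Increment usage count without resuming |
| pm_runtime_put(dev) | Decrement usage count; if it reaches zero, queue an asynchronous idle check that may suspend the device |
| pm_runtime_put_autosuspend(dev) | Decrement usage count; if it reaches zero, schedule a suspend after the autosuspend delay |
| pm_runtime_put_noidle(dev) | Decrement usage count without triggering an idle check or suspend |
| pm_runtime_mark_last_busy(dev) | Record the current time as last-busy, restarting the autosuspend countdown |
| pm_runtime_set_active(dev) | Mark the device as active (call before pm_runtime_enable()) |
| pm_runtime_set_suspended(dev) | Mark the device as suspended (call after pm_runtime_disable()) |
| pm_runtime_suspended(dev) | Return true if runtime PM is enabled and the device is runtime-suspended |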

Power domains

Sometimes multiple devices share a clock, voltage rail, or reset signal. Each device can still be runtime-suspended on its own, but the shared resource can only be switched off once every device that depends on it is suspended, so it has to be managed for the group as a unit. The kernel supports this through PM domains, represented by struct dev_pm_domain:
struct dev_pm_domain {
    struct dev_pm_ops ops;
    /* ... */
};
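If dev->pm_domain is set, the PM core uses the domain’s ops callbacks in place of the device type, class, or bus callbacks. Where the domain implements a callback, it is expected to invoke the driver’s corresponding callback (genpd does so) in addition to controlling the shared resource; where it does not, the PM core falls back to the driver’s callback. This lets platform code group devices without modifying individual drivers.
PM domains are set up by platform code (SoC PM drivers), not by individual device drivers. If your platform’s PM domain controller uses the generic PM domain framework (genpd), as is common on ARM SoCs, devices are attached to their domain automatically based on their Device Tree power-domains property.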
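The core rule of a PM domain (the shared resource goes off only after its last member suspends, and comes back on before any member resumes) can be modeled in a few lines of user-space C. struct model_domain and the domain_*() helpers are hypothetical; genpd implements the idea with more machinery, such as subdomains and governors, that this sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one PM domain with a single shared power rail. */
#define MAX_MEMBERS 4

struct model_domain {
    bool member_suspended[MAX_MEMBERS];
    size_t nr_members;
    bool powered;
};

static bool all_suspended(const struct model_domain *pd)
{
    for (size_t i = 0; i < pd->nr_members; i++)
        if (!pd->member_suspended[i])
            return false;
    return true;
}

static void domain_member_suspend(struct model_domain *pd, size_t idx)
{
    assert(idx < pd->nr_members);
    pd->member_suspended[idx] = true;
    if (all_suspended(pd))
        pd->powered = false; /* last member down: cut the shared rail */
}

static void domain_member_resume(struct model_domain *pd, size_t idx)
{
    assert(idx < pd->nr_members);
    pd->powered = true; /* rail must be up before the device resumes */
    pd->member_suspended[idx] = false;
}
```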

ACPI power management

On x86 systems with ACPI firmware, many devices are described in the ACPI namespace, and the kernel integrates them with the ACPI PM layer. That layer hooks into the same dev_pm_ops callbacks and also manages device power states (D0 through D3cold) as defined by the PCI and ACPI specifications. Drivers on ACPI systems generally do not need to interact with ACPI directly: the PCI or platform bus layer makes the ACPI calls, and the driver’s suspend/resume callbacks are invoked at the right time. For devices that need explicit ACPI interaction, use the ACPI_COMPANION() macro to obtain the struct acpi_device associated with a struct device (ACPI_HANDLE() returns the corresponding acpi_handle), and acpi_device_set_power() to change its D-state.

Wakeup sources

Some devices can generate a wakeup event that forces the system out of a sleep state—Ethernet wake-on-LAN, RTC alarms, keyboard activity, USB remote wakeup. A driver declares wakeup capability and manages it with:
/* In probe(): declare that this device can wake the system */
device_init_wakeup(&pdev->dev, true);

/* In suspend(): conditionally enable the wakeup mechanism */
static int mydrv_suspend(struct device *dev)
{
    struct mydrv_priv *priv = dev_get_drvdata(dev);

    /* Only record the wake IRQ as armed if enable_irq_wake() succeeded */
    if (device_may_wakeup(dev) && !enable_irq_wake(priv->irq))
        priv->wakeup_enabled = true;

    mydrv_hw_stop(priv);
    return 0;
}

static int mydrv_resume(struct device *dev)
{
    struct mydrv_priv *priv = dev_get_drvdata(dev);

    mydrv_hw_start(priv);
    if (priv->wakeup_enabled) {
        disable_irq_wake(priv->irq);
        priv->wakeup_enabled = false;
    }
    return 0;
}
device_init_wakeup(dev, true) marks the device as wakeup-capable, enables wakeup for it by default, and creates the /sys/devices/.../power/wakeup sysfs file; use device_set_wakeup_capable() instead if wakeup should start out disabled. User space can write "enabled" or "disabled" to that file to control whether the device’s wakeup mechanism is actually armed during suspend. enable_irq_wake() tells the interrupt controller to treat the specified IRQ as a wakeup source. It must be paired with disable_irq_wake() on resume.
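Tracking whether the wake IRQ was actually armed keeps enable_irq_wake() and disable_irq_wake() balanced, even when arming fails. The user-space model below is a hypothetical sketch: wake_depth stands in for the interrupt core's per-IRQ wake count, arm_fails simulates a controller that cannot wake the system, and wake_suspend()/wake_resume() play the role of the driver callbacks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of wake-IRQ arming in suspend/resume. */
struct wake_model {
    bool may_wakeup;     /* user space left power/wakeup "enabled" */
    bool arm_fails;      /* simulate an IRQ that cannot wake the system */
    int wake_depth;      /* stand-in for the per-IRQ wake count */
    bool wakeup_enabled; /* mirrors priv->wakeup_enabled */
};

static int model_enable_irq_wake(struct wake_model *m)
{
    if (m->arm_fails)
        return -1; /* simulated failure code */
    m->wake_depth++;
    return 0;
}

static void model_disable_irq_wake(struct wake_model *m)
{
    assert(m->wake_depth > 0); /* an unbalanced disable is a bug */
    m->wake_depth--;
}

static void wake_suspend(struct wake_model *m)
{
    /* Record success only, so resume never disables what was not enabled */
    if (m->may_wakeup && model_enable_irq_wake(m) == 0)
        m->wakeup_enabled = true;
}

static void wake_resume(struct wake_model *m)
{
    if (m->wakeup_enabled) {
        model_disable_irq_wake(m);
        m->wakeup_enabled = false;
    }
}
```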
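The PM core tracks active wakeup events so that it does not suspend while an event is still being processed. If your driver needs to hold off system suspend while it handles an event (for example, between receiving a wakeup interrupt and passing the resulting data to user space), use the wakeup source API: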
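struct wakeup_source *ws;

/* In probe */
ws = wakeup_source_register(&pdev->dev, "my-device");
if (!ws)
    return -ENOMEM;

/* When an event occurs: block system suspend until it is handled */
__pm_stay_awake(ws);

/* When processing is done */
__pm_relax(ws);

/*
 * pm_stay_awake(&pdev->dev) and pm_relax(&pdev->dev) do the same with the
 * device's own wakeup source, the one set up by device_init_wakeup().
 */

/* In remove */
wakeup_source_unregister(ws);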
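How active wakeup sources hold off a suspend can be sketched with a user-space model. It is deliberately simplified and hypothetical: struct ws_model, ws_stay_awake(), ws_relax(), and suspend_allowed() only loosely mirror __pm_stay_awake(), __pm_relax(), and the PM core's check, which in reality compares wakeup event counters and also supports timeouts and statistics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a wakeup source is reduced to an "active" flag. */
struct ws_model {
    const char *name;
    bool active;
};

/* Loosely mirrors __pm_stay_awake(): mark an event in progress. */
static void ws_stay_awake(struct ws_model *ws) { ws->active = true; }

/* Loosely mirrors __pm_relax(): the event has been handled. */
static void ws_relax(struct ws_model *ws) { ws->active = false; }

/* A suspend attempt is refused while any source is active. */
static bool suspend_allowed(const struct ws_model *ws, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ws[i].active)
            return false;
    return true;
}
```

Like the real API, the model is not reference-counted: a single relax call ends the activity no matter how many stay-awake calls preceded it.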
If a device is already runtime-suspended when a system suspend begins, the PM core can leave it in its current low-power state and skip its suspend, suspend_late, and suspend_noirq callbacks, along with the matching resume callbacks, so that complete is the next callback after prepare. This requires the device's prepare callback to return a positive value, and the same must hold for all of the device's descendants. This optimization, called direct-complete, reduces suspend latency on systems with many idle devices.
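The direct-complete eligibility rule can be expressed as a recursive check over a device and its descendants. This is a user-space sketch under simplifying assumptions: struct dc_dev and can_direct_complete() are hypothetical, and the PM core also weighs other conditions, such as the DPM_FLAG_NO_DIRECT_COMPLETE driver flag, that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a device for the direct-complete decision. */
#define MAX_KIDS 4

struct dc_dev {
    int prepare_ret;        /* return value of the prepare callback */
    bool runtime_suspended;
    struct dc_dev *kids[MAX_KIDS];
    size_t nr_kids;
};

/* Eligible only if this device and every descendant qualify. */
static bool can_direct_complete(const struct dc_dev *dev)
{
    if (dev->prepare_ret <= 0 || !dev->runtime_suspended)
        return false;
    for (size_t i = 0; i < dev->nr_kids; i++)
        if (!can_direct_complete(dev->kids[i]))
            return false;
    return true;
}
```

The kernel reaches the same result without recursing: when a device turns out not to qualify during suspend, the PM core clears the direct-complete flag on its parent and on its suppliers.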
1. Declare PM ops

Define a const struct dev_pm_ops with the SYSTEM_SLEEP_PM_OPS() and RUNTIME_PM_OPS() macros and assign it to driver.pm.

2. Enable runtime PM in probe()

Call pm_runtime_set_active(), optionally configure the autosuspend delay with pm_runtime_set_autosuspend_delay() and pm_runtime_use_autosuspend(), then call pm_runtime_enable().

3. Guard I/O paths

Wrap hardware access in pm_runtime_get_sync() / pm_runtime_put_autosuspend() pairs, dropping the usage count with pm_runtime_put_noidle() if pm_runtime_get_sync() fails. When using autosuspend, call pm_runtime_mark_last_busy() before the put.

4. Implement suspend/resume

In suspend(), quiesce DMA, save hardware state, and optionally call enable_irq_wake(). In resume(), restore state and re-initialize hardware.

5. Declare wakeup capability (if applicable)

Call device_init_wakeup() in probe(). In suspend(), call enable_irq_wake() when device_may_wakeup() returns true, and undo it with disable_irq_wake() in resume().

6. Disable runtime PM in remove()

If autosuspend was enabled, call pm_runtime_dont_use_autosuspend(); then call pm_runtime_disable() followed by pm_runtime_set_suspended() to bring the PM state machine to a clean stop before the hardware is torn down.
