Implementing Driver Power Management in the Linux Kernel

Power management in the Linux kernel divides into two distinct but complementary models: system sleep, where the entire system transitions to a low-power state like suspend-to-RAM or hibernation, and runtime PM, where individual devices power down independently while the rest of the system keeps running. Most drivers will encounter both, and the same dev_pm_ops structure is used to implement callbacks for each. Understanding when each callback is invoked—and what the kernel guarantees about ordering—is essential to writing correct PM support.

System sleep states

The kernel exposes several system sleep states, corresponding to the ACPI S-states on x86:

Suspend-to-Idle (S0ix)

The shallowest sleep. CPUs enter deep idle states; devices are suspended but memory is self-refreshed. Fast resume; used in modern laptops for “connected standby”.

Standby / Shallow sleep (S1)

A lightly powered state where the CPU is stopped but hardware context is preserved. Less power saving than suspend-to-RAM.

Suspend-to-RAM (S3)

The most commonly implemented suspend state. All device drivers are asked to suspend; main memory is kept alive but almost everything else powers off. Resume restores hardware context.

Suspend-to-Disk / Hibernation (S4)

The kernel writes a memory image to swap or a dedicated partition, then powers off completely. On resume, the image is loaded and memory state is restored. Slowest to resume; survives power loss.

During a system sleep transition the kernel walks the device hierarchy bottom-up to suspend devices and top-down to resume them. This ordering ensures that a parent device (such as a PCIe root port) is not suspended until all its children are already suspended.
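The ordering rule can be illustrated with a small user-space model. This is only a sketch, not kernel code: `struct model_dev`, `model_suspend_all()`, and `model_resume_all()` are hypothetical names, and the model assumes devices are stored in registration order so that a parent always appears before its children.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_LOG 16

/* Hypothetical stand-in for a device; not a kernel structure. */
struct model_dev {
    const char *name;
    int parent;      /* index of the parent in the array, -1 for a root */
    int suspended;
};

/* Records the order in which devices were suspended or resumed. */
static const char *order_log[MODEL_MAX_LOG];
static size_t order_len;

static void model_log(const char *name)
{
    assert(order_len < MODEL_MAX_LOG);
    order_log[order_len++] = name;
}

/* Walk the list backwards so children are suspended before parents. */
static void model_suspend_all(struct model_dev *devs, size_t n)
{
    for (size_t i = n; i-- > 0; ) {
        /* Every child of device i must already be suspended. */
        for (size_t j = i + 1; j < n; j++)
            assert(devs[j].parent != (int)i || devs[j].suspended);
        devs[i].suspended = 1;
        model_log(devs[i].name);
    }
}

/* Walk the list forwards so parents are resumed before children. */
static void model_resume_all(struct model_dev *devs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* A parent must be awake before its child resumes. */
        assert(devs[i].parent < 0 || !devs[devs[i].parent].suspended);
        devs[i].suspended = 0;
        model_log(devs[i].name);
    }
}
```

With a three-device chain (root port, NIC, PHY), the log records the PHY first on suspend and the root port first on resume.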

The dev_pm_ops struct

All PM callbacks are gathered in struct dev_pm_ops, defined in include/linux/pm.h. You assign a populated instance to driver.pm:

#include <linux/pm.h>
#include <linux/pm_runtime.h>

static int mydrv_suspend(struct device *dev)
{
    struct mydrv_priv *priv = dev_get_drvdata(dev);

    /* Quiesce hardware, save state */
    mydrv_hw_stop(priv);
    return 0;
}

static int mydrv_resume(struct device *dev)
{
    struct mydrv_priv *priv = dev_get_drvdata(dev);

    /* Restore state, re-initialize hardware */
    mydrv_hw_start(priv);
    return 0;
}

static int mydrv_runtime_suspend(struct device *dev)
{
    struct mydrv_priv *priv = dev_get_drvdata(dev);

    mydrv_hw_clock_disable(priv);
    return 0;
}

static int mydrv_runtime_resume(struct device *dev)
{
    struct mydrv_priv *priv = dev_get_drvdata(dev);

    mydrv_hw_clock_enable(priv);
    return 0;
}

static const struct dev_pm_ops mydrv_pm_ops = {
    SYSTEM_SLEEP_PM_OPS(mydrv_suspend, mydrv_resume)
    RUNTIME_PM_OPS(mydrv_runtime_suspend, mydrv_runtime_resume, NULL)
};

static struct platform_driver mydrv_driver = {
    .probe  = mydrv_probe,
    .remove = mydrv_remove,
    .driver = {
        .name           = "my-device",
        .of_match_table = mydrv_of_match,
        .pm             = pm_ptr(&mydrv_pm_ops),
    },
};

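SYSTEM_SLEEP_PM_OPS() and RUNTIME_PM_OPS() are convenience macros that populate the correct fields. They are designed to be combined with pm_ptr() or pm_sleep_ptr() around the .pm assignment, which lets the compiler discard the callbacks when CONFIG_PM is disabled without any #ifdef blocks. Without the macros you would set .suspend, .resume, .runtime_suspend, .runtime_resume, and .runtime_idle directly.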

System sleep callback sequence

When the system suspends, the kernel executes callbacks in this order for each device (using whichever callback is present among pm_domain, device type, class, bus type, and driver, in that precedence order):

prepare → suspend → suspend_late → suspend_noirq

On resume the sequence reverses:

resume_noirq → resume_early → resume → complete

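Hibernation adds freeze/thaw phases around the creation of the memory snapshot:

prepare → freeze → freeze_late → freeze_noirq
           [snapshot created]
thaw_noirq → thaw_early → thaw → complete

Devices are thawed so that the image can be written to storage. Once it has been saved, they go through the poweroff, poweroff_late, and poweroff_noirq callbacks before the machine powers down. On the next boot, the restore_noirq, restore_early, and restore callbacks bring them back after the image is loaded.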

The suspend_noirq and resume_noirq phases run while the PM core has device interrupt handlers disabled, except for interrupts requested with IRQF_NO_SUSPEND. Your driver's interrupt handler will not be invoked during these phases, so do not wait for interrupt-driven completion or interact with devices that depend on it.
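The callback precedence described above can be sketched as a small user-space model. Everything here is hypothetical (`struct model_ops`, `struct model_device`, `model_pick_suspend()`, and the three sample callbacks). It assumes the subsystem level is chosen by which ops table is present, and that the driver's callback serves as the fallback when the chosen level supplies none.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*pm_cb)(void);

/* Hypothetical stand-in for struct dev_pm_ops with a single callback. */
struct model_ops {
    pm_cb suspend;
};

/* Hypothetical stand-in for the per-device ops tables. */
struct model_device {
    const struct model_ops *pm_domain;
    const struct model_ops *type;
    const struct model_ops *class_;
    const struct model_ops *bus;
    const struct model_ops *driver;
};

/* Sample callbacks; the return value only identifies who was called. */
static int domain_suspend(void) { return 1; }
static int bus_suspend(void)    { return 2; }
static int driver_suspend(void) { return 3; }

/* Pick the suspend callback using the precedence order from the text. */
static pm_cb model_pick_suspend(const struct model_device *dev)
{
    const struct model_ops *subsys = NULL;
    pm_cb cb = NULL;

    assert(dev != NULL);

    if (dev->pm_domain)
        subsys = dev->pm_domain;
    else if (dev->type)
        subsys = dev->type;
    else if (dev->class_)
        subsys = dev->class_;
    else if (dev->bus)
        subsys = dev->bus;

    if (subsys)
        cb = subsys->suspend;

    /* Fall back to the driver only when no subsystem callback exists. */
    if (!cb && dev->driver)
        cb = dev->driver->suspend;

    return cb;
}
```

In this model a device with both a PM domain and a bus gets the domain callback, and the driver callback is reached only when the selected level leaves the slot empty. In practice bus-level callbacks commonly invoke the driver's callback themselves.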

Runtime PM

Runtime PM lets a device power down while the system is still running. The PM core tracks a usage count and an autosuspend timer for each device. When the usage count drops to zero (and an optional autosuspend delay has expired), the core calls the driver’s runtime_suspend callback.
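The interplay between the usage count and the autosuspend delay can be modeled in user space. The sketch below is not the PM core's implementation: `struct model_rpm` and the `model_*` functions are hypothetical, time is passed in explicitly as milliseconds, and the asynchronous idle and suspend requests of the real core are collapsed into a single tick function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-device runtime PM state. */
struct model_rpm {
    int usage_count;
    bool suspended;
    long last_busy_ms;
    long autosuspend_delay_ms;
};

/* Take a reference and power the device up if it was suspended. */
static void model_get_sync(struct model_rpm *d)
{
    d->usage_count++;
    d->suspended = false;
}

/* Record the time of the most recent activity. */
static void model_mark_last_busy(struct model_rpm *d, long now_ms)
{
    d->last_busy_ms = now_ms;
}

/* Drop a reference; suspension is decided later by the tick. */
static void model_put_autosuspend(struct model_rpm *d)
{
    assert(d->usage_count > 0);
    d->usage_count--;
}

/*
 * Periodic check: suspend only when nobody holds a reference and the
 * device has been idle for at least the autosuspend delay.
 * Returns true if this call suspended the device.
 */
static bool model_autosuspend_tick(struct model_rpm *d, long now_ms)
{
    if (d->suspended || d->usage_count > 0)
        return false;
    if (now_ms - d->last_busy_ms < d->autosuspend_delay_ms)
        return false;
    d->suspended = true;
    return true;
}
```

With a 2000 ms delay, a device released at t = 5000 stays active at t = 6000 and is suspended by the tick at t = 7000; a later get powers it back up.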

Enabling runtime PM in probe()

static int mydrv_probe(struct platform_device *pdev)
{
    struct mydrv_priv *priv;

    priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
    if (!priv)
        return -ENOMEM;

    /* ... resource acquisition ... */

    /* PM callbacks and I/O paths look these up, so set them first */
    priv->dev = &pdev->dev;
    platform_set_drvdata(pdev, priv);

    /* Configure autosuspend: power off after 2 seconds of idle */
    pm_runtime_set_autosuspend_delay(&pdev->dev, 2000);
    pm_runtime_use_autosuspend(&pdev->dev);

    /* Mark device as active before enabling runtime PM */
    pm_runtime_set_active(&pdev->dev);
    pm_runtime_enable(&pdev->dev);

    return 0;
}

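static void mydrv_remove(struct platform_device *pdev)
{
    /* Undo pm_runtime_use_autosuspend() before disabling runtime PM */
    pm_runtime_dont_use_autosuspend(&pdev->dev);
    pm_runtime_disable(&pdev->dev);
    pm_runtime_set_suspended(&pdev->dev);
}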

Always call pm_runtime_disable() in remove() before the device hardware is torn down. Failing to do so can result in the PM core calling runtime_suspend on an already-removed device.

Acquiring and releasing the device in driver I/O paths

Every time your driver needs the hardware to be powered on, it increments the PM usage count. When it is done, it decrements the count, potentially triggering suspend:

static irqreturn_t mydrv_irq_handler(int irq, void *data)
{
    struct mydrv_priv *priv = data;

    /* Assumes the IRQ line is not shared, so it only fires while the device is active */
    mydrv_handle_interrupt(priv);
    return IRQ_HANDLED;
}

static ssize_t mydrv_read(struct file *filp, char __user *buf,
                           size_t count, loff_t *ppos)
{
    struct mydrv_priv *priv = filp->private_data;
    int ret;

    /* Power up device synchronously; blocks until resume completes */
    ret = pm_runtime_get_sync(priv->dev);
    if (ret < 0) {
        pm_runtime_put_noidle(priv->dev);
        return ret;
    }

    ret = mydrv_hw_read(priv, buf, count);

    /* Decrement usage count; schedule autosuspend if idle */
    pm_runtime_mark_last_busy(priv->dev);
    pm_runtime_put_autosuspend(priv->dev);

    return ret;
}

Key runtime PM functions:

Function	Effect
`pm_runtime_enable(dev)`	Allow runtime PM for this device
`pm_runtime_disable(dev)`	Prevent new suspend/resume calls
`pm_runtime_get_sync(dev)`	Increment usage count; power on if suspended (synchronous)
`pm_runtime_get_noresume(dev)`	Increment usage count without resuming
`pm_runtime_put(dev)`	Decrement usage count; may trigger suspend
`pm_runtime_put_autosuspend(dev)`	Decrement count; trigger autosuspend timer
`pm_runtime_put_noidle(dev)`	Decrement without scheduling suspend
`pm_runtime_mark_last_busy(dev)`	Reset the autosuspend timer
`pm_runtime_set_active(dev)`	Mark device as active (use before `pm_runtime_enable`)
`pm_runtime_set_suspended(dev)`	Mark device as suspended (use after `pm_runtime_disable`)
`pm_runtime_suspended(dev)`	Return true if device is runtime-suspended

Power domains

Sometimes multiple devices share a clock, voltage rail, or reset signal. They cannot be individually suspended; the shared resource must be controlled as a unit. The kernel supports this through PM domains, represented by struct dev_pm_domain:

struct dev_pm_domain {
    struct dev_pm_ops ops;
    /* ... */
};

If dev->pm_domain is set, the PM core uses the domain’s ops callbacks instead of the bus or driver callbacks. This allows platform code to group devices and control the shared resource without modifying individual drivers.

PM domains are set up by platform code (SoC PM drivers), not by individual device drivers. If your hardware platform uses a PM domain controller (common on ARM SoCs with genpd), the assignment happens automatically through Device Tree power-domains bindings.
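The idea of a shared resource controlled as a unit can be sketched with a reference count in user space. This is a simplified illustration, not the genpd implementation: `struct model_domain`, `struct model_member`, and the `model_member_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical shared resource, such as a voltage rail or clock. */
struct model_domain {
    int active_devices;
    bool powered;
};

/* Hypothetical device that belongs to a domain. */
struct model_member {
    struct model_domain *domain;
    bool active;
};

/* Resume a member; the first active member powers the domain on. */
static void model_member_resume(struct model_member *m)
{
    if (m->active)
        return;
    if (m->domain->active_devices++ == 0)
        m->domain->powered = true;
    m->active = true;
}

/* Suspend a member; the last one to go idle powers the domain off. */
static void model_member_suspend(struct model_member *m)
{
    if (!m->active)
        return;
    m->active = false;
    assert(m->domain->active_devices > 0);
    if (--m->domain->active_devices == 0)
        m->domain->powered = false;
}
```

The domain stays powered while at least one member is active, which is why the individual drivers do not need to know about each other.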

ACPI power management

On x86 systems with ACPI firmware, many devices are described in the ACPI namespace and the kernel integrates with the ACPI PM layer. The ACPI layer hooks into the same dev_pm_ops callbacks but also handles D-states (D0–D3cold) as defined by the PCI and ACPI specifications. Drivers on ACPI systems generally do not need to interact with ACPI directly. The PCI or platform bus layer handles the ACPI calls, and the driver’s suspend/resume callbacks are invoked at the right time. For devices that need explicit ACPI interaction, use the ACPI_COMPANION() macro to obtain the struct acpi_device associated with a struct device (or ACPI_HANDLE() for the raw handle) and pass it to acpi_device_set_power().

Wakeup sources

Some devices can generate a wakeup event that forces the system out of a sleep state—Ethernet wake-on-LAN, RTC alarms, keyboard activity, USB remote wakeup. A driver declares wakeup capability and manages it with:

/* In probe(): declare that this device can wake the system */
device_init_wakeup(&pdev->dev, true);

/* In suspend(): conditionally enable the wakeup mechanism */
static int mydrv_suspend(struct device *dev)
{
    struct mydrv_priv *priv = dev_get_drvdata(dev);

    if (device_may_wakeup(dev)) {
        enable_irq_wake(priv->irq);
        priv->wakeup_enabled = true;
    }
    mydrv_hw_stop(priv);
    return 0;
}

/* In resume(): undo the wakeup arming done in suspend() */
static int mydrv_resume(struct device *dev)
{
    struct mydrv_priv *priv = dev_get_drvdata(dev);

    mydrv_hw_start(priv);
    if (priv->wakeup_enabled) {
        disable_irq_wake(priv->irq);
        priv->wakeup_enabled = false;
    }
    return 0;
}

device_init_wakeup(dev, true) marks the device as wakeup-capable, enables wakeup for it by default, and makes the /sys/devices/.../power/wakeup sysfs file available. User space can write "enabled" or "disabled" to that file to control whether the device’s wakeup mechanism is actually armed during suspend, which is why suspend() checks device_may_wakeup() first. enable_irq_wake() tells the interrupt controller to treat the specified IRQ as a wakeup source. It must be paired with disable_irq_wake() on resume.

Wakeup source accounting

The PM core tracks active wakeup events to avoid suspending while an event is being processed. If your driver generates software-initiated wakeup events (not hardware IRQs), use the wakeup source API:

struct wakeup_source *ws;

/* In probe */
ws = wakeup_source_register(&pdev->dev, "my-device");

/* When an event occurs */
__pm_stay_awake(ws);  /* or pm_stay_awake(&pdev->dev) */

/* When processing is done */
__pm_relax(ws);       /* or pm_relax(&pdev->dev) */

/* In remove */
wakeup_source_unregister(ws);
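The accounting behind this API can be illustrated with a user-space model. It is a sketch under simplifying assumptions, not the PM core's bookkeeping: `struct model_ws`, `model_stay_awake()`, `model_relax()`, and `model_try_suspend()` are hypothetical, and a single global counter stands in for the core's tracking of events in progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct wakeup_source. */
struct model_ws {
    const char *name;
    bool active;
    unsigned long event_count;
};

/* Number of wakeup events currently being processed system-wide. */
static int model_events_in_progress;

/* Mark the source active; repeated calls while active are no-ops. */
static void model_stay_awake(struct model_ws *ws)
{
    if (ws->active)
        return;
    ws->active = true;
    ws->event_count++;
    model_events_in_progress++;
}

/* Mark the source inactive once the event has been handled. */
static void model_relax(struct model_ws *ws)
{
    if (!ws->active)
        return;
    ws->active = false;
    assert(model_events_in_progress > 0);
    model_events_in_progress--;
}

/* A suspend attempt succeeds only when no event is in progress. */
static bool model_try_suspend(void)
{
    return model_events_in_progress == 0;
}
```

A suspend attempt made between the stay-awake and relax calls is refused in this model, which mirrors the purpose of the accounting: the system should not go to sleep while an event is still being handled.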

Runtime PM and system sleep interaction

If a device is already runtime-suspended when a system suspend begins, the PM core can skip the driver’s suspend callback and leave the device in its current low-power state, provided the driver’s prepare callback returns a positive value. This optimization—called direct-complete—reduces suspend latency significantly on systems with many idle devices.

Declare PM ops

Define a const struct dev_pm_ops with SYSTEM_SLEEP_PM_OPS() and RUNTIME_PM_OPS() macros. Assign it to driver.pm.

Enable runtime PM in probe()

Call pm_runtime_set_active(), optionally configure autosuspend delay with pm_runtime_set_autosuspend_delay() and pm_runtime_use_autosuspend(), then call pm_runtime_enable().

Guard I/O paths

Wrap hardware access with pm_runtime_get_sync() / pm_runtime_put_autosuspend() pairs, and drop the reference with pm_runtime_put_noidle() if pm_runtime_get_sync() fails. Call pm_runtime_mark_last_busy() before putting when using autosuspend.

Implement suspend/resume

In suspend(), quiesce DMA, save hardware state, and optionally call enable_irq_wake(). In resume(), restore state and re-initialize hardware.

Declare wakeup capability (if applicable)

Call device_init_wakeup() in probe(). In suspend(), call enable_irq_wake() when device_may_wakeup() returns true.

Disable runtime PM in remove()

Call pm_runtime_disable() followed by pm_runtime_set_suspended() to bring the PM state machine to a clean stop before the hardware is torn down.
