
EZ RKNN Async supports two approaches to NPU core selection: data-parallel scheduling via the schedule option, which distributes independent tasks across cores in round-robin order, and tensor-parallel mode via tp_mode, which assigns an rknn_core_mask so the RKNN driver handles cross-core coordination internally. The two approaches are mutually exclusive. This page explains how each approach works and how worker threads are created, and covers the supplementary options that control pacing and context duplication.

Data-parallel scheduling (schedule)

When you set schedule, the runtime distributes tasks across the specified cores by computing core_id = schedule[task_id % len(schedule)]. Task IDs increment monotonically from zero, so a schedule of [0, 1, 2] sends task 0 to core 0, task 1 to core 1, task 2 to core 2, task 3 back to core 0, and so on.
task 0 → core 0
task 1 → core 1
task 2 → core 2
task 3 → core 0
task 4 → core 1
...
This lets you run three independent model instances in parallel and is the recommended approach for throughput-oriented workloads on RK3588, which has three NPU cores.
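To see the round-robin mapping outside the runtime, the short simulation below reproduces the core_id formula; it is purely illustrative and does not touch the NPU.
# Illustrative simulation of the round-robin mapping (not library code).
# It reproduces core_id = schedule[task_id % len(schedule)] from above.
schedule = [0, 1, 2]
for task_id in range(6):
    core_id = schedule[task_id % len(schedule)]
    print(f"task {task_id} -> core {core_id}")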

schedule accepts three forms

# Single int — pin all tasks to core 0
make_provider_options(schedule=0)

# Comma-separated string — distribute across cores 0, 1, and 2
make_provider_options(schedule="0,1,2")

# List of ints — fine-grained control; repeat a core to weight it more heavily
make_provider_options(schedule=[0, 0, 1])  # 2/3 of tasks go to core 0
Schedule values must be non-negative integers. An empty schedule raises RuntimeError. Duplicate core IDs are valid and increase the fraction of tasks sent to that core.

Thread count with schedule

One set of worker threads is created per unique core in the schedule. For example, schedule=[0, 1, 2] has three unique cores, so if threads_per_core=2 the session creates six worker threads total (two threads per core, each thread owning its own RKNN context).
total_threads = threads_per_core × len(set(schedule))
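As a quick worked example of the formula (values chosen for illustration):
# Worked example: core 0 appears twice in the schedule but counts once.
schedule = [0, 0, 1, 2]
threads_per_core = 2
total_threads = threads_per_core * len(set(schedule))
print(total_threads)  # 6: three unique cores times two threads per core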

Tensor-parallel mode (tp_mode)

When you set tp_mode, all worker contexts share the same rknn_core_mask and the RKNN driver decides how to split computation across cores. This is useful when a single model is large enough that running it across multiple cores with driver-level coordination yields lower per-request latency.
tp_mode value   rknn_core_mask constant
"auto"          RKNN_NPU_CORE_AUTO (default when neither option is set)
"all"           RKNN_NPU_CORE_ALL
"0"             RKNN_NPU_CORE_0
"1"             RKNN_NPU_CORE_1
"2"             RKNN_NPU_CORE_2
"0,1"           RKNN_NPU_CORE_0_1
"0,1,2"         RKNN_NPU_CORE_0_1_2
# Use the driver's automatic core selection (this is the default)
make_provider_options(tp_mode="auto")

# Spread one model across all three RK3588 cores (tensor-parallel)
make_provider_options(tp_mode="0,1,2")

# Pin to core 1 only
make_provider_options(tp_mode="1")
When neither schedule nor tp_mode is set, the session defaults to tp_mode="auto" (RKNN_NPU_CORE_AUTO), which lets the driver pick the least-loaded core.
schedule and tp_mode are mutually exclusive. Setting both raises ValueError from make_provider_options() and RuntimeError from the session constructor. Set only one.
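As a minimal sketch of what setting both looks like (assuming make_provider_options is in scope, as in the other snippets on this page):
# Combining the two options is rejected up front with ValueError.
try:
    make_provider_options(schedule=[0, 1, 2], tp_mode="all")
except ValueError as err:
    print(err)  # the two options cannot be combined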

threads_per_core

threads_per_core
int
default: 1
Number of worker threads to create for each unique NPU core. With tp_mode, only one unique “core slot” exists (thread 0 gets the initial context), so the total thread count equals threads_per_core. With schedule, total threads = threads_per_core × len(unique cores).

Increasing this value allows more tasks to be processed simultaneously on a single core. On RK3588, values above 2–3 rarely improve throughput because the NPU itself is the bottleneck. Must be greater than 0.
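For example, pairing threads_per_core with a three-core schedule (assuming it is passed through make_provider_options like the other options on this page):
# 3 unique cores x 2 threads per core = 6 worker threads in total.
make_provider_options(schedule="0,1,2", threads_per_core=2)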

Throughput pacing (enable_pacing)

enable_pacing
bool
default:"False"
When True, the session tracks the exponential moving average of per-task inference time and silently drops submissions that arrive faster than the NPU can sustain. The EMA uses α = 0.95 for the previous average and α = 0.05 for the new sample:
avg = 0.95 × prev_avg + 0.05 × latest_inference_us
The inter-submission interval is computed as avg / num_cores, where num_cores is the count of unique cores in the schedule. Submissions that arrive before the interval has elapsed return nullopt internally; the Python-level call retries until the queue accepts the task.Pacing prevents queue saturation under burst load and produces smoother end-to-end throughput at the cost of occasionally delaying a submission for a few milliseconds.
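The sketch below models the pacing arithmetic in plain Python. It is an illustration of the behaviour described above, not the library's implementation; the state variables and function names are invented for clarity.
# Model of the pacing decision (illustrative only; names are invented).
avg_us = 0.0          # EMA of per-task inference time, in microseconds
last_submit_us = 0.0  # timestamp of the last accepted submission
num_cores = 3         # unique cores in the schedule

def on_task_finished(latest_inference_us):
    """Update the EMA after a task completes: avg = 0.95*prev + 0.05*latest."""
    global avg_us
    avg_us = 0.95 * avg_us + 0.05 * latest_inference_us

def try_submit(now_us):
    """Accept a submission only if the pacing interval has elapsed."""
    global last_submit_us
    if now_us - last_submit_us < avg_us / num_cores:
        return None  # too soon; the Python-level caller would retry
    last_submit_us = now_us
    return "accepted"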

Independent context initialisation (disable_dup_context)

disable_dup_context
bool
default:"False"
When False (default), the session calls rknn_dup_context to clone the initial RKNN context for each additional worker thread. This is fast but can cause instability when custom ops are involved. When True, each worker thread calls rknn_init independently with the same model data, which is slower to start but more stable. Loading custom ops forces this flag to True regardless of its value.
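For example, to opt into independent initialisation explicitly:
# Each worker thread calls rknn_init itself instead of cloning via
# rknn_dup_context; slower to start, more stable with custom ops.
make_provider_options(disable_dup_context=True)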

Performance logging with ZTU_EZRKNN_ASYNC_PRINT_PERF

Setting the environment variable ZTU_EZRKNN_ASYNC_PRINT_PERF=1 (or true) before starting the process causes the runtime to print per-task timing statistics to stderr. Each line contains:
[ztu_ez_rknn_async_perf] task_id=<n> thread_id=<t> core_id=<c>
    set_input_us=<µs> infer_us=<µs> copy_output_us=<µs> total_us=<µs>
This is read once at session construction time via std::getenv("ZTU_EZRKNN_ASYNC_PRINT_PERF") and accepts 1, true, on, or yes (case-insensitive).
ZTU_EZRKNN_ASYNC_PRINT_PERF=1 python my_script.py
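If you want to post-process these lines, a small parser such as the sketch below works; it assumes the exact field layout shown above and is not part of the library.
# Hypothetical parser for the perf output (assumes the documented format).
import re
import sys

PERF_RE = re.compile(
    r"\[ztu_ez_rknn_async_perf\] task_id=(\d+) thread_id=(\d+) core_id=(\d+)\s+"
    r"set_input_us=(\d+) infer_us=(\d+) copy_output_us=(\d+) total_us=(\d+)"
)

text = sys.stdin.read()  # e.g. pipe the script's captured stderr in here
for m in PERF_RE.finditer(text):
    task_id, core_id, infer_us = int(m.group(1)), int(m.group(3)), int(m.group(5))
    print(f"task {task_id}: {infer_us} us inference on core {core_id}")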

Scheduling examples for RK3588

The simplest configuration: one worker thread, with the driver choosing the core (tp_mode="auto"). Good for development and profiling.
opts = make_provider_options()
# Equivalent to tp_mode="auto" with one worker thread
