EZ RKNN Async supports two approaches to NPU core selection: data-parallel scheduling via the schedule option, which distributes independent tasks across cores in round-robin order, and tensor-parallel mode via tp_mode, which assigns an rknn_core_mask so the RKNN driver handles cross-core coordination internally. The two approaches are mutually exclusive. This page explains how each works, how worker threads are created, and covers the supplementary options that control pacing and context duplication.
Data-parallel scheduling (schedule)
When you set schedule, the runtime distributes tasks across the specified cores by computing core_id = schedule[task_id % len(schedule)]. Task IDs increment monotonically from zero, so a schedule of [0, 1, 2] sends task 0 to core 0, task 1 to core 1, task 2 to core 2, task 3 back to core 0, and so on.
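The round-robin mapping can be sketched in plain Python (the helper name is illustrative, not part of the library API):

```python
def core_for_task(schedule, task_id):
    """Round-robin core selection: core_id = schedule[task_id % len(schedule)]."""
    return schedule[task_id % len(schedule)]

# With schedule=[0, 1, 2], tasks cycle through the three cores.
schedule = [0, 1, 2]
assignments = [core_for_task(schedule, t) for t in range(5)]
print(assignments)  # [0, 1, 2, 0, 1]
```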
schedule accepts three forms
Schedule values must be non-negative integers. An empty schedule raises RuntimeError. Duplicate core IDs are valid and increase the fraction of tasks sent to that core.
Thread count with schedule
One set of worker threads is created per unique core in the schedule. For example, schedule=[0, 1, 2] has three unique cores, so if threads_per_core=2 the session creates six worker threads total (two threads per core, each thread owning its own RKNN context).
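The thread-count arithmetic can be sketched as follows (the helper is illustrative; only the formula comes from the docs):

```python
def total_worker_threads(schedule, threads_per_core):
    """One set of worker threads per *unique* core in the schedule."""
    unique_cores = set(schedule)
    return threads_per_core * len(unique_cores)

print(total_worker_threads([0, 1, 2], 2))  # three unique cores x 2 threads = 6
print(total_worker_threads([0, 0, 1], 2))  # only two unique cores -> 4
```

Note that duplicate core IDs in the schedule skew task distribution but do not add threads, since the count depends on unique cores only.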
Tensor-parallel mode (tp_mode)
When you set tp_mode, all worker contexts share the same rknn_core_mask and the RKNN driver decides how to split computation across cores. This is useful when a single model is large enough that running it across multiple cores with driver-level coordination yields lower per-request latency.
| tp_mode value | rknn_core_mask constant |
|---|---|
| "auto" | RKNN_NPU_CORE_AUTO (default when neither option is set) |
| "all" | RKNN_NPU_CORE_ALL |
| "0" | RKNN_NPU_CORE_0 |
| "1" | RKNN_NPU_CORE_1 |
| "2" | RKNN_NPU_CORE_2 |
| "0,1" | RKNN_NPU_CORE_0_1 |
| "0,1,2" | RKNN_NPU_CORE_0_1_2 |
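The table can be expressed as a simple lookup; the constant names are from the RKNN C API, but this dict itself is illustrative (the real runtime resolves the strings to C enum values internally):

```python
# Mapping from tp_mode strings to rknn_core_mask constant names,
# mirroring the table above.
TP_MODE_TO_MASK = {
    "auto": "RKNN_NPU_CORE_AUTO",
    "all": "RKNN_NPU_CORE_ALL",
    "0": "RKNN_NPU_CORE_0",
    "1": "RKNN_NPU_CORE_1",
    "2": "RKNN_NPU_CORE_2",
    "0,1": "RKNN_NPU_CORE_0_1",
    "0,1,2": "RKNN_NPU_CORE_0_1_2",
}

print(TP_MODE_TO_MASK["0,1"])
```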
When neither schedule nor tp_mode is set, the session defaults to tp_mode="auto" (RKNN_NPU_CORE_AUTO), which lets the driver pick the least-loaded core.
threads_per_core
Number of worker threads to create for each unique NPU core. With tp_mode, only one unique "core slot" exists (thread 0 gets the initial context), so the total thread count equals threads_per_core. With schedule, total threads = threads_per_core × len(unique cores).
Increasing this value allows more tasks to be processed simultaneously on a single core. On RK3588, values above 2–3 rarely improve throughput because the NPU itself is the bottleneck. Must be greater than 0.
Throughput pacing (enable_pacing)
When True, the session tracks an exponential moving average (EMA) of per-task inference time and silently drops submissions that arrive faster than the NPU can sustain. The EMA uses α = 0.95 for the previous average and α = 0.05 for the new sample: avg_new = 0.95 × avg_prev + 0.05 × sample. The inter-submission interval is computed as avg / num_cores, where num_cores is the count of unique cores in the schedule. Submissions that arrive before the interval has elapsed return nullopt internally; the Python-level call retries until the queue accepts the task.
Pacing prevents queue saturation under burst load and produces smoother end-to-end throughput at the cost of occasionally delaying a submission for a few milliseconds.
Independent context initialisation (disable_dup_context)
When False (default), the session calls rknn_dup_context to clone the initial RKNN context for each additional worker thread. This is fast but can cause instability when custom ops are involved. When True, each worker thread calls rknn_init independently with the same model data, which is slower to start but more stable. Loading custom ops forces this flag to True regardless of its value.
Performance logging with ZTU_EZRKNN_ASYNC_PRINT_PERF
Setting the environment variable ZTU_EZRKNN_ASYNC_PRINT_PERF=1 (or true) before starting the process causes the runtime to print per-task timing statistics to stderr, one line per task.
The runtime reads the variable via std::getenv("ZTU_EZRKNN_ASYNC_PRINT_PERF") and accepts 1, true, on, or yes (case-insensitive).
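A Python sketch of the acceptance check, assuming an exact match against the four documented values (whitespace trimming is an assumption about the parser):

```python
import os

def perf_logging_enabled(env=os.environ):
    """Mirror of the runtime's check: accepts 1, true, on, yes (case-insensitive)."""
    value = env.get("ZTU_EZRKNN_ASYNC_PRINT_PERF", "")
    return value.strip().lower() in {"1", "true", "on", "yes"}

print(perf_logging_enabled({"ZTU_EZRKNN_ASYNC_PRINT_PERF": "TRUE"}))  # True
print(perf_logging_enabled({}))                                       # False
```

The variable must be set before the process starts, since the runtime reads it once at startup rather than watching for changes.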
Scheduling examples for RK3588
- Single core (default): the simplest configuration, with one worker thread on NPU core 0. Good for development and profiling.
- Three-core data-parallel
- Two-core data-parallel
- Tensor-parallel across all cores
- Data-parallel with pacing
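The five configurations above can be written out as option sets. The keyword names match the options documented on this page, but exactly how they are passed to the session constructor is an assumption about the API surface, so this is a sketch rather than copy-paste code:

```python
# Option sets for the scheduling examples above. Remember that
# schedule and tp_mode are mutually exclusive: each config sets
# at most one of them.
configs = {
    "single_core": {"tp_mode": "0", "threads_per_core": 1},      # pin core 0
    "three_core_dp": {"schedule": [0, 1, 2], "threads_per_core": 2},
    "two_core_dp": {"schedule": [0, 1], "threads_per_core": 2},
    "tensor_parallel": {"tp_mode": "0,1,2", "threads_per_core": 2},
    "dp_with_pacing": {"schedule": [0, 1, 2], "enable_pacing": True},
}

# Sanity check: no config sets both scheduling options.
for name, opts in configs.items():
    assert not ("schedule" in opts and "tp_mode" in opts), name
print(sorted(configs))
```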