
EZ RKNN Async supports two approaches to NPU core selection: data-parallel scheduling via the schedule option, which distributes independent tasks across cores in round-robin order, and tensor-parallel mode via tp_mode, which assigns an rknn_core_mask so the RKNN driver handles cross-core coordination internally. The two approaches are mutually exclusive. This page explains how each approach works and how worker threads are created, and covers the supplementary options that control pacing and context duplication.

Data-parallel scheduling (schedule)

When you set schedule, the runtime distributes tasks across the specified cores by computing core_id = schedule[task_id % len(schedule)]. Task IDs increment monotonically from zero, so a schedule of [0, 1, 2] sends task 0 to core 0, task 1 to core 1, task 2 to core 2, task 3 back to core 0, and so on.
task 0 → core 0
task 1 → core 1
task 2 → core 2
task 3 → core 0
task 4 → core 1
...
This lets you run three independent model instances in parallel and is the recommended approach for throughput-oriented workloads on RK3588, which has three NPU cores.
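To see the round-robin mapping outside the runtime, the short simulation below reproduces the core_id formula; it is purely illustrative and does not touch the NPU.
# Illustrative simulation of the round-robin mapping (not library code).
# It reproduces core_id = schedule[task_id % len(schedule)] from above.
schedule = [0, 1, 2]
for task_id in range(6):
    core_id = schedule[task_id % len(schedule)]
    print(f"task {task_id} -> core {core_id}")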

schedule accepts three forms

# Single int — pin all tasks to core 0
make_provider_options(schedule=0)

# Comma-separated string — distribute across cores 0, 1, and 2
make_provider_options(schedule="0,1,2")

# List of ints — fine-grained control; repeat a core to weight it more heavily
make_provider_options(schedule=[0, 0, 1])  # 2/3 of tasks go to core 0
Schedule values must be non-negative integers. An empty schedule raises RuntimeError. Duplicate core IDs are valid and increase the fraction of tasks sent to that core.

Thread count with schedule

One set of worker threads is created per unique core in the schedule. For example, schedule=[0, 1, 2] has three unique cores, so if threads_per_core=2 the session creates six worker threads total (two threads per core, each thread owning its own RKNN context).
total_threads = threads_per_core × len(set(schedule))
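As a quick worked example of the formula (values chosen for illustration):
# Worked example: core 0 appears twice in the schedule but counts once.
schedule = [0, 0, 1, 2]
threads_per_core = 2
total_threads = threads_per_core * len(set(schedule))
print(total_threads)  # 6: three unique cores times two threads per core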

Tensor-parallel mode (tp_mode)

When you set tp_mode, all worker contexts share the same rknn_core_mask and the RKNN driver decides how to split computation across cores. This is useful when a single model is large enough that running it across multiple cores with driver-level coordination yields lower per-request latency.
tp_mode value   rknn_core_mask constant
"auto"          RKNN_NPU_CORE_AUTO (default when neither option is set)
"all"           RKNN_NPU_CORE_ALL
"0"             RKNN_NPU_CORE_0
"1"             RKNN_NPU_CORE_1
"2"             RKNN_NPU_CORE_2
"0,1"           RKNN_NPU_CORE_0_1
"0,1,2"         RKNN_NPU_CORE_0_1_2
# Use the driver's automatic core selection (this is the default)
make_provider_options(tp_mode="auto")

# Spread one model across all three RK3588 cores (tensor-parallel)
make_provider_options(tp_mode="0,1,2")

# Pin to core 1 only
make_provider_options(tp_mode="1")
When neither schedule nor tp_mode is set, the session defaults to tp_mode="auto" (RKNN_NPU_CORE_AUTO), which lets the driver pick the least-loaded core.
schedule and tp_mode are mutually exclusive. Setting both raises ValueError from make_provider_options() and RuntimeError from the session constructor. Set only one.
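As a minimal sketch of what setting both looks like (assuming make_provider_options is in scope, as in the other snippets on this page):
# Combining the two options is rejected up front with ValueError.
try:
    make_provider_options(schedule=[0, 1, 2], tp_mode="all")
except ValueError as err:
    print(err)  # the two options cannot be combined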

threads_per_core

threads_per_core
int
default: 1
Number of worker threads to create for each unique NPU core. With tp_mode, only one unique “core slot” exists (thread 0 gets the initial context), so the total thread count equals threads_per_core. With schedule, total threads = threads_per_core × len(unique cores).

Increasing this value allows more tasks to be processed simultaneously on a single core. On RK3588, values above 2–3 rarely improve throughput because the NPU itself is the bottleneck. Must be greater than 0.
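For example, pairing threads_per_core with a three-core schedule (assuming it is passed through make_provider_options like the other options on this page):
# 3 unique cores x 2 threads per core = 6 worker threads in total.
make_provider_options(schedule="0,1,2", threads_per_core=2)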

Throughput pacing (enable_pacing)

enable_pacing
bool
default:"False"
When True, the session tracks the exponential moving average of per-task inference time and silently drops submissions that arrive faster than the NPU can sustain. The EMA uses α = 0.95 for the previous average and α = 0.05 for the new sample:
avg = 0.95 × prev_avg + 0.05 × latest_inference_us
The inter-submission interval is computed as avg / num_cores, where num_cores is the count of unique cores in the schedule. Submissions that arrive before the interval has elapsed return nullopt internally; the Python-level call retries until the queue accepts the task.Pacing prevents queue saturation under burst load and produces smoother end-to-end throughput at the cost of occasionally delaying a submission for a few milliseconds.
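The sketch below models the pacing arithmetic in plain Python. It is an illustration of the behaviour described above, not the library's implementation; the state variables and function names are invented for clarity.
# Model of the pacing decision (illustrative only; names are invented).
avg_us = 0.0          # EMA of per-task inference time, in microseconds
last_submit_us = 0.0  # timestamp of the last accepted submission
num_cores = 3         # unique cores in the schedule

def on_task_finished(latest_inference_us):
    """Update the EMA after a task completes: avg = 0.95*prev + 0.05*latest."""
    global avg_us
    avg_us = 0.95 * avg_us + 0.05 * latest_inference_us

def try_submit(now_us):
    """Accept a submission only if the pacing interval has elapsed."""
    global last_submit_us
    if now_us - last_submit_us < avg_us / num_cores:
        return None  # too soon; the Python-level caller would retry
    last_submit_us = now_us
    return "accepted"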

Independent context initialisation (disable_dup_context)

disable_dup_context
bool
default:"False"
When False (default), the session calls rknn_dup_context to clone the initial RKNN context for each additional worker thread. This is fast but can cause instability when custom ops are involved. When True, each worker thread calls rknn_init independently with the same model data, which is slower to start but more stable. Loading custom ops forces this flag to True regardless of its value.
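For example, to opt into independent initialisation explicitly:
# Each worker thread calls rknn_init itself instead of cloning via
# rknn_dup_context; slower to start, more stable with custom ops.
make_provider_options(disable_dup_context=True)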

Performance logging with ZTU_EZRKNN_ASYNC_PRINT_PERF

Setting the environment variable ZTU_EZRKNN_ASYNC_PRINT_PERF=1 (or true) before starting the process causes the runtime to print per-task timing statistics to stderr. Each line contains:
[ztu_ez_rknn_async_perf] task_id=<n> thread_id=<t> core_id=<c>
    set_input_us=<µs> infer_us=<µs> copy_output_us=<µs> total_us=<µs>
This is read once at session construction time via std::getenv("ZTU_EZRKNN_ASYNC_PRINT_PERF") and accepts 1, true, on, or yes (case-insensitive).
ZTU_EZRKNN_ASYNC_PRINT_PERF=1 python my_script.py
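If you want to post-process these lines, a small parser such as the sketch below works; it assumes the exact field layout shown above and is not part of the library.
# Hypothetical parser for the perf output (assumes the documented format).
import re
import sys

PERF_RE = re.compile(
    r"\[ztu_ez_rknn_async_perf\] task_id=(\d+) thread_id=(\d+) core_id=(\d+)\s+"
    r"set_input_us=(\d+) infer_us=(\d+) copy_output_us=(\d+) total_us=(\d+)"
)

text = sys.stdin.read()  # e.g. pipe the script's captured stderr in here
for m in PERF_RE.finditer(text):
    task_id, core_id, infer_us = int(m.group(1)), int(m.group(3)), int(m.group(5))
    print(f"task {task_id}: {infer_us} us inference on core {core_id}")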

Scheduling examples for RK3588

The simplest configuration: one worker thread, with the driver choosing the core (tp_mode="auto"). Good for development and profiling.
opts = make_provider_options()
# Equivalent to tp_mode="auto" with one worker thread
