Rockchip SoCs such as the RK3588 contain three independent NPU cores (cores 0, 1, and 2). EZ RKNN Async provides two complementary strategies for spreading work across them: data parallelism via schedule, where each inference task is sent to a different core in round-robin order, and tensor parallelism via tp_mode, where a single inference task is executed collaboratively by multiple cores. The two strategies are mutually exclusive — set one or the other, never both.

Data parallelism with schedule

When schedule is set, the session creates one worker thread group per listed core and dispatches tasks using a modulo assignment:
coreId = schedule[taskId % len(schedule)]
This means a schedule of [0, 1, 2] sends task 0 to core 0, task 1 to core 1, task 2 to core 2, task 3 to core 0 again, and so on. You can bias distribution by repeating a core: [0, 0, 1] sends two-thirds of tasks to core 0.
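The assignment rule is plain modulo arithmetic, so you can preview it without a board. A standalone sketch of the dispatch formula above (illustration only, not library code):

def core_for_task(task_id: int, schedule: list[int]) -> int:
    # Mirrors the dispatch rule: coreId = schedule[taskId % len(schedule)]
    return schedule[task_id % len(schedule)]

# Equal rotation across all three RK3588 cores
print([core_for_task(t, [0, 1, 2]) for t in range(6)])  # [0, 1, 2, 0, 1, 2]

# Biased rotation: two-thirds of tasks land on core 0
print([core_for_task(t, [0, 0, 1]) for t in range(6)])  # [0, 0, 1, 0, 0, 1]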

1. Install and import

from ztu_somemodelruntime_ez_rknn_async import InferenceSession, make_provider_options
import numpy as np

2. Create a session with schedule

Pass a list of core indices to schedule. The example below rotates tasks equally across all three cores on the RK3588:
opts = make_provider_options(
    schedule=[0, 1, 2],   # round-robin across all three cores
    threads_per_core=1,   # one worker thread per core
)

sess = InferenceSession("model.rknn", provider_options=opts)

3. Run inference as usual

Nothing changes in how you call run, run_async, or run_pipeline. The dispatch is transparent.
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = sess.run(None, {"input": input_data})
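To actually overlap work across the scheduled cores, several tasks must be in flight at once; a single blocking run loop exercises only one core at a time. A minimal throughput sketch wrapping run in a thread pool (this assumes concurrent run calls are supported, which the per-core worker design suggests; run_async or run_pipeline may be the more idiomatic route, so treat this as an illustration):

from concurrent.futures import ThreadPoolExecutor

frames = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(9)]

# Submit independent frames concurrently so the session's round-robin
# dispatch can keep cores 0, 1, and 2 busy at the same time.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda f: sess.run(None, {"input": f}), frames))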

threads_per_core

Each core in the schedule gets threads_per_core worker threads (default 1). Increasing it lets a single core process multiple tasks concurrently, each in a separate RKNN context. For latency-sensitive workloads, keep it at 1; increase it to hide I/O stalls within a core.
opts = make_provider_options(
    schedule=[0, 1],
    threads_per_core=2,  # two concurrent inference threads per core
)
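For sizing, the total number of worker threads, and hence separate RKNN contexts, implied by the description above is simply the product of the two settings:

schedule = [0, 1]
threads_per_core = 2

# One worker thread (each with its own RKNN context) per scheduled core,
# times threads_per_core: 2 cores x 2 threads = 4 tasks in flight at once.
total_workers = len(schedule) * threads_per_core  # 4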

Tensor parallelism with tp_mode

In tensor-parallel mode, RKNN splits a single model’s computation across multiple cores simultaneously. The model must be compiled with tensor-parallel support. Set tp_mode to one of the accepted strings:
Value       Cores used
"auto"      RKNN chooses (default when neither option is set)
"all"       All available cores
"0"         Core 0 only
"1"         Core 1 only
"2"         Core 2 only
"0,1"       Cores 0 and 1
"0,1,2"     All three cores on RK3588
opts = make_provider_options(
    tp_mode="0,1,2",  # tensor-parallel across all three cores
)

sess = InferenceSession("model.rknn", provider_options=opts)
Tensor-parallel mode can lower latency for large models that are memory-bandwidth-bound, but it has no benefit for small models. Benchmark both modes for your specific model and SoC.
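A minimal benchmark sketch for that comparison, reusing input_data from the example above (iteration count and input shape are illustrative; adjust for your model):

import time

def bench(opts, iterations=100):
    sess = InferenceSession("model.rknn", provider_options=opts)
    sess.run(None, {"input": input_data})  # warm-up
    start = time.perf_counter()
    for _ in range(iterations):
        sess.run(None, {"input": input_data})
    return (time.perf_counter() - start) / iterations

# Note: this loop measures single-call latency, which is where tp_mode can
# help. schedule pays off in throughput under concurrent submission, so pair
# it with the thread-pool pattern shown earlier for a fair throughput number.
print("tp_mode 0,1,2:  ", bench(make_provider_options(tp_mode="0,1,2")))
print("schedule [0,1,2]:", bench(make_provider_options(schedule=[0, 1, 2])))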

Choosing between schedule and tp_mode

Choose schedule when:
  • You want higher throughput: multiple small tasks processed in parallel.
  • Your model runs fast enough on one core and latency is already acceptable.
  • You are processing a stream of independent frames.
opts = make_provider_options(schedule=[0, 1, 2])
Prefer tp_mode instead when single-task latency for a large, bandwidth-bound model matters more than aggregate throughput. Setting both schedule and tp_mode in the same call to make_provider_options raises a ValueError at session creation time; pick one strategy per session.
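Because the conflict surfaces as a ValueError, it can be guarded like any other configuration error (a small sketch):

try:
    opts = make_provider_options(schedule=[0, 1, 2], tp_mode="all")
    sess = InferenceSession("model.rknn", provider_options=opts)  # raises here
except ValueError as err:
    print(f"conflicting provider options: {err}")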

Default behavior

If you set neither schedule nor tp_mode, the session defaults to tp_mode="auto", which lets the RKNN runtime decide the core assignment. This is equivalent to:
opts = make_provider_options()  # tp_mode defaults to "auto"
