EZ RKNN Async exposes three distinct ways to run inference through
`InferenceSession`. All three share the same session setup — you choose the calling convention that fits your workload. `run` blocks until results are ready, `run_async` fires a callback when inference completes without blocking the caller, and `run_pipeline` keeps a sliding window of in-flight frames so the NPU is never idle between calls.
## The three modes at a glance
| Mode | Blocks caller? | Returns results | Best for |
|---|---|---|---|
| `run` | Yes | Immediately, as `List[np.ndarray]` | Scripts, one-shot inference, ORT drop-in |
| `run_async` | No | Via callback when ready | Real-time systems, event loops |
| `run_pipeline` | Yes (on oldest result) | After pipeline fills | Video/streaming, maximum throughput |
## Synchronous inference with `run`
`run` submits a task internally and waits for the result before returning. The API is identical to `onnxruntime.InferenceSession.run`, making it a direct drop-in replacement for existing code.
Use `run` when:
- You are porting code from `onnxruntime` and want minimal changes.
- You are writing a script or a test that runs inference once or in a simple loop.
- Latency per call matters more than overall throughput.
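A minimal synchronous call might look like the sketch below. The model path, input name, and input shape are placeholders for your own model; the `run` signature is taken from the ORT-compatibility note above.

```python
import numpy as np

def top1(logits: np.ndarray) -> int:
    """Index of the highest-scoring class in a 1-D logits vector."""
    return int(np.argmax(logits))

# Sketch only: "model.rknn", the input name "input", and the input shape
# are placeholders. run(None, feed) matches onnxruntime.InferenceSession.run.
try:
    from ztu_somemodelruntime_ez_rknn_async import InferenceSession

    sess = InferenceSession("model.rknn")
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)
    outputs = sess.run(None, {"input": x})  # blocks until the NPU finishes
    print("top-1 class:", top1(outputs[0][0]))
except ImportError:
    pass  # runtime not installed in this environment
```

Because the call blocks, a simple `for` loop over inputs is all the scheduling you need.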
## Async inference with `run_async`
`run_async` submits a task to the NPU worker queue and returns `None` immediately. Your callback is invoked on a dedicated callback thread when the result is ready.
Use `run_async` when:
- Your main thread drives an event loop or UI that must not stall.
- You want to pipeline NPU work with other CPU processing done in the callback.
- You need fine-grained control over when results are consumed.
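As a sketch, a callback can hand results back to the consuming thread through a thread-safe queue. The exact `run_async` call shape, including the `callback` keyword, is an assumption here; check the async inference guide for the real signature.

```python
import queue

import numpy as np

results: queue.Queue = queue.Queue()

def on_result(outputs):
    # Runs on the library's dedicated callback thread; keep it light and
    # hand the outputs to the consuming thread via a thread-safe queue.
    results.put(outputs)

# Assumed call shape: run_async(output_names, input_feed, callback=...).
try:
    from ztu_somemodelruntime_ez_rknn_async import InferenceSession

    sess = InferenceSession("model.rknn")  # placeholder model path
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)
    sess.run_async(None, {"input": x}, callback=on_result)  # returns at once
    outputs = results.get(timeout=5.0)  # consume whenever the caller is ready
except ImportError:
    pass  # runtime not installed in this environment
```

The queue decouples NPU completion from consumption, so the main thread's event loop or UI never stalls waiting on the NPU.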
`run_async` does not support `ztu_modelrt_dispatch_batch=True` in `run_options`; pass dispatch-batch workloads through `run` instead.

## Pipeline inference with `run_pipeline`
`run_pipeline` is a pseudo-synchronous interface designed for continuous frame streams. Each call submits the current frame and returns the result of a frame submitted `depth` calls earlier. The NPU stays busy during the caller’s inter-frame processing because future frames are already queued.
Use `run_pipeline` when:
- You process a continuous stream of frames and want maximum NPU utilization.
- You can tolerate `depth` frames of added latency in exchange for higher throughput.
- Your workload is single-threaded (one producer, one consumer).
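A frame-loop sketch follows. The frame source is a stand-in, and two details are assumptions to verify against the pipeline inference guide: where `depth` is configured, and that the call yields no result until the pipeline has filled.

```python
import numpy as np

def frames(n: int):
    """Stand-in frame source; replace with your camera or video decoder."""
    for i in range(n):
        yield np.full((1, 3, 224, 224), float(i), dtype=np.float32)

try:
    from ztu_somemodelruntime_ez_rknn_async import InferenceSession

    sess = InferenceSession("model.rknn")  # placeholder model path
    for frame in frames(100):
        # Each call submits `frame` and returns the result of a frame
        # submitted `depth` calls earlier; assumed to return None while
        # the pipeline is still filling.
        outputs = sess.run_pipeline(None, {"input": frame})
        if outputs is not None:
            pass  # consume the older frame's result here
except ImportError:
    pass  # runtime not installed in this environment
```

Note that results are offset from inputs by `depth` frames, so any per-frame metadata (timestamps, frame indices) must be buffered alongside the pipeline.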
## Detailed guides

- **Async inference**: Callbacks, queue limits, and ordering guarantees for `run_async`.
- **Pipeline inference**: Depth tuning, reset behavior, and frame-loop patterns for `run_pipeline`.
- **Multi-core NPU**: Distribute tasks across NPU cores with `schedule` or `tp_mode`.
- **Migrate from onnxruntime**: Side-by-side comparison of the ORT and EZ RKNN Async APIs.