.pftrace). You can open the resulting file in Perfetto UI to inspect exactly where time is spent during session build, training, and inference.
Enabling profiling with MEGANEURA_TRACE
The simplest way to enable profiling is to set the MEGANEURA_TRACE environment variable to the output file path before running your binary. All examples in the repository check this variable, initialize the profiler automatically, and save the trace at the end of the run (so you do not need to call profiler::save explicitly).
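For example, to profile the MNIST example (examples/mnist.rs, referenced later in this page), a run might look like the following; the output filename is arbitrary:

```shell
# Write a Perfetto trace alongside the training run.
# mnist.pftrace is an arbitrary output path of your choosing.
MEGANEURA_TRACE=mnist.pftrace cargo run --release --example mnist
```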
The profiler API
You can also enable profiling programmatically. profiler::init() installs a global tracing subscriber that records span enter/exit events on the CPU track. Subsequent calls are no-ops, so it is safe to call from multiple locations.
profiler::save(path) serializes all collected events — both CPU spans and GPU pass timings — into a Perfetto binary protobuf file at the given path.
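A minimal programmatic setup might look like the sketch below. The meganeura::profiler import path is an assumption; adjust it to wherever the profiler module lives in your crate.

```rust
use meganeura::profiler; // assumed import path; adjust for your crate layout

fn main() {
    // Install the global tracing subscriber once; later calls are no-ops.
    profiler::init();

    // ... build the session, train, run inference ...

    // Serialize all collected CPU spans and GPU pass timings
    // into a Perfetto binary protobuf file.
    profiler::save("trace.pftrace");
}
```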
Example: trace path logic from examples/mnist.rs
The MNIST example shows the idiomatic pattern for optional profiling:
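A sketch of that pattern, reconstructed from the description above (the exact code in examples/mnist.rs may differ, and the meganeura::profiler import path is an assumption):

```rust
use meganeura::profiler; // assumed import path

fn main() {
    // Profiling is opt-in: only initialize when MEGANEURA_TRACE is set.
    let trace_path = std::env::var("MEGANEURA_TRACE").ok();
    if trace_path.is_some() {
        profiler::init();
    }

    // ... session build, training, inference ...

    // If profiling was enabled, write the trace to the requested path.
    if let Some(path) = trace_path {
        profiler::save(&path);
    }
}
```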
Perfetto trace format
Meganeura writes traces in the Perfetto binary protobuf format (.pftrace). Perfetto is an open-source system tracing platform used by Android and Chrome.
Open Perfetto UI
Navigate to https://ui.perfetto.dev/ in your browser.
Load the trace file
Click Open trace file and select the .pftrace file produced in the previous step. Perfetto UI runs entirely in the browser — your trace data is never uploaded to a server.
What appears in the trace
The trace is organized into two tracks:

| Track | Contents |
|---|---|
| CPU | All tracing spans from session build, training epochs, and input staging |
| GPU | Per-pass hardware timestamp durations from blade-graphics |
CPU spans
The following named spans appear on the CPU track:

| Span | When it appears |
|---|---|
| build_session | Wraps the entire session build pipeline |
| optimize_forward | egglog optimization of the forward graph |
| autodiff | Automatic differentiation pass |
| optimize_full | egglog optimization of the combined forward + backward graph |
| compile | Compilation of the optimized graph to an ExecutionPlan |
| gpu_init | GPU resource allocation (Session::new) |
| train_epoch | One full pass over the dataset |
| set_input | Staging input data into GPU buffers before each step |
GPU passes
GPU pass durations are recorded from blade-graphics hardware timestamp queries and placed on the GPU track. Individual kernel names (e.g. matmul, relu) appear as GPU slices, laid out sequentially from the submission timestamp of each command buffer.
Profiling benchmark runs
You can set MEGANEURA_TRACE for any benchmark binary in the same way:
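For instance, with a hypothetical bench target named training (substitute the name of an actual benchmark in the repository):

```shell
# "training" is a placeholder bench target name.
MEGANEURA_TRACE=bench.pftrace cargo bench --bench training
```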