Meganeura includes a built-in profiler that records CPU span timings and GPU pass durations into a single Perfetto binary trace (.pftrace). You can open the resulting file in Perfetto UI to inspect exactly where time is spent during session build, training, and inference.
Profiling adds measurement overhead: every CPU span enter and exit goes through a mutex-protected event buffer. Do not use profiling runs to measure peak performance; use them for structural analysis only.
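
To make the source of that overhead concrete, here is a minimal sketch of a mutex-protected event buffer. It is not Meganeura's implementation: `EventBuffer`, `Event`, `Phase`, and `span` are hypothetical names used only for illustration. The point is that each span costs two lock acquisitions and two clock reads, which is small per span but adds up in hot loops.

```rust
use std::sync::Mutex;
use std::time::Instant;

/// Hypothetical span phase; not a Meganeura type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Enter,
    Exit,
}

/// Hypothetical event record; not a Meganeura type.
#[derive(Debug, Clone)]
pub struct Event {
    pub name: &'static str,
    pub phase: Phase,
    pub at: Instant,
}

/// Illustrative shared buffer: every recorded event takes the lock.
pub struct EventBuffer {
    events: Mutex<Vec<Event>>,
}

impl EventBuffer {
    pub fn new() -> Self {
        Self { events: Mutex::new(Vec::new()) }
    }

    /// Reads the clock, then locks the buffer just long enough to push.
    pub fn record(&self, name: &'static str, phase: Phase) {
        let at = Instant::now();
        self.events.lock().unwrap().push(Event { name, phase, at });
    }

    /// Wraps `f` in an enter/exit pair. The lock is not held while `f` runs,
    /// so nested spans on the same buffer do not deadlock.
    pub fn span<T>(&self, name: &'static str, f: impl FnOnce() -> T) -> T {
        self.record(name, Phase::Enter);
        let out = f();
        self.record(name, Phase::Exit);
        out
    }

    /// Number of events recorded so far.
    pub fn event_count(&self) -> usize {
        self.events.lock().unwrap().len()
    }
}
```

Wrapping a closure with `span` records exactly two events, so a loop that enters a span per step pays that cost on every iteration.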

Enabling profiling with MEGANEURA_TRACE

The simplest way to enable profiling is to set the MEGANEURA_TRACE environment variable to the output file path before running your binary. All examples in the repository check this variable and initialize the profiler automatically.
MEGANEURA_TRACE=trace.pftrace cargo run --release --example mnist
The trace file is written when the process finishes (or when you call profiler::save explicitly).

The profiler API

You can also enable profiling programmatically:
// Install the global tracing subscriber — safe to call multiple times.
// Must be called before any spans you want captured.
meganeura::profiler::init();

// ... build session, train, run inference ...

// Write all collected events to a .pftrace file.
meganeura::profiler::save("trace.pftrace").unwrap();
profiler::init() installs a global tracing subscriber that records span enter/exit events on the CPU track. Subsequent calls are no-ops, so it is safe to call from multiple locations. profiler::save(path) serializes all collected events — both CPU spans and GPU pass timings — into a Perfetto binary protobuf file at the given path.
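
If a program has several exit paths, one way to make sure the save step still runs is a small drop guard. The sketch below is an assumption-laden illustration, not part of Meganeura: `SaveOnDrop` is a hypothetical helper, and the save action is passed in as a closure so the example stays self-contained.

```rust
/// Hypothetical helper (not part of Meganeura): runs a save action when dropped.
pub struct SaveOnDrop<F: FnMut()> {
    save: F,
}

impl<F: FnMut()> SaveOnDrop<F> {
    pub fn new(save: F) -> Self {
        Self { save }
    }
}

impl<F: FnMut()> Drop for SaveOnDrop<F> {
    fn drop(&mut self) {
        // Runs on normal scope exit, early return, and panic unwinding.
        (self.save)();
    }
}

// Intended use, assuming the profiler API described above:
//
// meganeura::profiler::init();
// let _guard = SaveOnDrop::new(|| {
//     if let Err(e) = meganeura::profiler::save("trace.pftrace") {
//         eprintln!("failed to save profile: {e:?}");
//     }
// });
```

Note that destructors do not run when the process ends through `std::process::exit` or an abort, so in those cases an explicit call to `profiler::save` is still needed.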

Example: trace path logic from examples/mnist.rs

The MNIST example shows the idiomatic pattern for optional profiling:
use std::path::Path;

// Set up Perfetto profiling: MEGANEURA_TRACE=path.pftrace
let trace_path = std::env::var("MEGANEURA_TRACE").ok();
if trace_path.is_some() {
    meganeura::profiler::init();
}

// ... build session, train ...

// Save Perfetto trace when profiling.
if let Some(ref trace_file) = trace_path {
    let path = Path::new(trace_file);
    meganeura::profiler::save(path).expect("failed to save profile");
    println!("profile saved to {}", path.display());
}
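
One edge case in the pattern above: `std::env::var` returns `Ok("")` when the variable is set but empty (for example `MEGANEURA_TRACE= cargo run ...`), so profiling would be enabled with an empty output path. The following sketch treats empty or whitespace-only values as "not set". The function names `resolve_trace_path` and `trace_path_from_env` are hypothetical and not part of the repository.

```rust
use std::path::PathBuf;

/// Hypothetical helper: turns the raw variable value into an optional path.
/// Empty or whitespace-only values are treated as "profiling disabled".
/// Surrounding whitespace is trimmed, which assumes trace paths never
/// intentionally start or end with spaces.
pub fn resolve_trace_path(raw: Option<String>) -> Option<PathBuf> {
    raw.map(|s| s.trim().to_owned())
        .filter(|s| !s.is_empty())
        .map(PathBuf::from)
}

/// Hypothetical convenience wrapper that reads MEGANEURA_TRACE.
pub fn trace_path_from_env() -> Option<PathBuf> {
    resolve_trace_path(std::env::var("MEGANEURA_TRACE").ok())
}
```

Keeping the parsing in a function that takes the raw value makes it easy to test without mutating the process environment.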

Perfetto trace format

Meganeura writes traces in the Perfetto binary protobuf format (.pftrace). Perfetto is an open-source system tracing platform used by Android and Chrome.
1. Run your binary with tracing enabled

MEGANEURA_TRACE=my_trace.pftrace cargo run --release --example mnist

2. Open Perfetto UI

Navigate to https://ui.perfetto.dev/ in your browser.

3. Load the trace file

Click Open trace file and select the .pftrace file produced in the previous step.

4. Inspect the timeline

The trace contains two tracks under the meganeura process: CPU and GPU. Use the WASD keys or scroll wheel to navigate the timeline.
Perfetto UI runs entirely in the browser — your trace data is never uploaded to a server.

What appears in the trace

The trace is organized into two tracks:
Track | Contents
CPU | All tracing spans from session build, training epochs, and input staging
GPU | Per-pass hardware timestamp durations from blade-graphics

CPU spans

The following named spans appear on the CPU track:
Span | When it appears
build_session | Wraps the entire session build pipeline
optimize_forward | egglog optimization of the forward graph
autodiff | Automatic differentiation pass
optimize_full | egglog optimization of the combined forward + backward graph
compile | Compilation of the optimized graph to an ExecutionPlan
gpu_init | GPU resource allocation (Session::new)
train_epoch | One full pass over the dataset
set_input | Staging input data into GPU buffers before each step

GPU passes

GPU pass durations are recorded from blade-graphics hardware timestamp queries and placed on the GPU track. Individual kernel names (e.g. matmul, relu) appear as GPU slices, laid out sequentially from the submission timestamp of each command buffer.
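
The layout rule described above can be sketched as follows. This is an illustration of "sequential from the submission timestamp", not Meganeura's code: `GpuSlice` and `layout_passes` are hypothetical names, and the sketch assumes each command buffer supplies one submission timestamp plus an ordered list of pass durations in nanoseconds.

```rust
/// Hypothetical slice record for the GPU track; not a Meganeura type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GpuSlice {
    pub name: String,
    pub start_ns: u64,
    pub duration_ns: u64,
}

/// Places passes back to back, starting at the command buffer's
/// submission timestamp. Each slice starts where the previous one ended.
pub fn layout_passes(submit_ns: u64, passes: &[(&str, u64)]) -> Vec<GpuSlice> {
    let mut cursor = submit_ns;
    let mut slices = Vec::with_capacity(passes.len());
    for &(name, duration_ns) in passes {
        slices.push(GpuSlice {
            name: name.to_owned(),
            start_ns: cursor,
            duration_ns,
        });
        // Saturate rather than overflow on pathological inputs.
        cursor = cursor.saturating_add(duration_ns);
    }
    slices
}
```

Under this reading, slice durations are measured but slice start times are derived, so idle gaps between passes inside one command buffer would not be visible on the GPU track.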

Profiling benchmark runs

You can attach MEGANEURA_TRACE to any benchmark binary in the same way:
MEGANEURA_TRACE=bench_trace.pftrace \
  cargo run --release --example bench_smolvla_meganeura -- --runs 1 --warmup 0
This lets you inspect the dispatch sequence for a single SmolVLA denoising step alongside GPU kernel durations on the same timeline.
