Meganeura ships a benchmark suite under bench/ that measures latency for training and inference workloads and compares results against PyTorch. Results are also published on Infermark, a site that aggregates framework comparisons.

Running the benchmarks

1. Set up the Python virtual environment (one-time)

The benchmark suite runs both the Meganeura Rust binary and the equivalent PyTorch Python script, then prints a side-by-side comparison table.

bash bench/ensure_venv.sh

2. Run the default benchmark (SmolVLA training)

bash bench/compare.sh

This builds and runs bench_smolvla_train against the PyTorch equivalent and prints a comparison table.

3. Run a specific model benchmark

# SmolVLA action expert inference
bash bench/compare.sh --model smolvla

# SmolLM2-135M text generation
bash bench/compare.sh --model smollm2

# Stable Diffusion U-Net training
bash bench/compare.sh --model sd_unet_train

# All benchmarks
bash bench/compare.sh --model all

Environment overrides

You can tune the benchmark parameters with environment variables:
RUNS=10 WARMUP=5 bash bench/compare.sh --model smolvla
| Variable | Default | Description |
|---|---|---|
| RUNS | 5 | Number of timed runs |
| WARMUP | 3 | Warm-up runs before timing |
| PYTORCH_DTYPE | float32 | PyTorch dtype for comparison |
| STEPS | 10 | Denoising steps (SmolVLA only) |
| CHUNK_SIZE | 50 | Action chunk size (SmolVLA only) |
| VLM_SEQ_LEN | 16 | VLM context length (SmolVLA only) |
| MAX_TOKENS | 32 | Tokens to generate (SmolLM2 only) |
Pass --no-venv to use the system Python instead of the managed venv:
bash bench/compare.sh --no-venv
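The environment-override pattern above is easy to mirror in Rust if you are wiring up your own benchmark harness. The sketch below shows one way to read an integer parameter with a fallback default; `env_param` is a hypothetical helper, not part of Meganeura's API (the real compare.sh handles this in shell):

```rust
use std::env;

/// Read an integer benchmark parameter from the environment,
/// falling back to a default when the variable is unset or unparsable.
fn env_param(name: &str, default: u32) -> u32 {
    env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

fn main() {
    // Mirrors the RUNS=5 / WARMUP=3 defaults from the table above.
    let runs = env_param("RUNS", 5);
    let warmup = env_param("WARMUP", 3);
    println!("timed runs: {runs}, warm-up runs: {warmup}");
}
```

Unparsable values fall back to the default rather than aborting, which keeps a typo in one variable from silently changing the run configuration to something invalid.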

SmolVLA training results

The table below shows SmolVLA action expert training results (chunk_size=50, vlm_seq_len=16, float32, random weights) with full GQA (15/5 heads, head_dim=64) and exact backward through all ops including fused MHA and RmsNorm:
| GPU | Framework | Compile | Forward | Backward |
|---|---|---|---|---|
| Radeon 890M (RADV) | Meganeura | 0 s | 14.2 ms | 36.4 ms |
| Radeon 890M (RADV) | PyTorch 2.10.0 ROCm | 7.30 s | 20.9 ms | 48.0 ms |
| GeForce RTX 5080 (590/Linux) | Meganeura | 0 s | 6.1 ms | 35.1 ms |
| GeForce RTX 5080 (590/Linux) | PyTorch 2.11.0+cu128 | 3.41 s | 1.57 ms | 4.68 ms |
| GeForce RTX 3050 (566.36/Windows) | Meganeura | 0 s | 11.2 ms | 53.3 ms |
| GeForce RTX 3050 (566.36/Windows) | PyTorch 2.11.0+cu128 | 0 s (unsupported) | 12.3 ms | 33.8 ms |
| Apple M3 | Meganeura | 0 s | 45.5 ms | 87.0 ms |
| Apple M3 | PyTorch 2.11.0 | 5.92 s | 17.7 ms | 78.1 ms |
Meganeura reports 0 s compile time because the execution plan is cached after the first build — subsequent runs skip autodiff, egglog, and compilation entirely. See Execution plan caching for details.
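The caching behaviour described above can be sketched as a map from a graph hash to its compiled plan. The names here (`PlanCache`, `Plan`, `get_or_compile`) are illustrative, not Meganeura's actual API; the point is that the expensive pipeline runs at most once per distinct graph:

```rust
use std::collections::HashMap;

#[derive(Clone)]
struct Plan {
    kernels: Vec<String>, // compiled kernel identifiers
}

struct PlanCache {
    plans: HashMap<u64, Plan>, // keyed by a hash of the graph structure
    compiles: usize,           // times we paid the full compile cost
}

impl PlanCache {
    fn new() -> Self {
        Self { plans: HashMap::new(), compiles: 0 }
    }

    /// Return the cached plan, or run the expensive pipeline
    /// (autodiff -> egglog -> compilation) once and cache the result.
    fn get_or_compile(&mut self, graph_key: u64) -> Plan {
        if let Some(plan) = self.plans.get(&graph_key) {
            return plan.clone(); // cache hit: "0 s compile"
        }
        self.compiles += 1; // stand-in for the real compile pipeline
        let plan = Plan { kernels: vec![format!("kernel_for_{graph_key}")] };
        self.plans.insert(graph_key, plan.clone());
        plan
    }
}

fn main() {
    let mut cache = PlanCache::new();
    cache.get_or_compile(42); // first run: compiles
    cache.get_or_compile(42); // second run: cache hit
    println!("compiles: {}", cache.compiles); // prints 1
}
```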

Gradient verification

Gradients are verified against PyTorch running on CPU. 88 out of 136 parameters pass the strict threshold (cosine similarity > 0.99, norm error < 5%). Failures are in attention and layernorm weights of deeper layers, where f32 precision differences compound across the network. Gradient magnitudes (norm error) remain below 2% for all parameters. You can reproduce gradient verification with:
cargo run --release --example grad_check
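The two metrics used by the strict threshold are standard: cosine similarity checks gradient direction, and relative norm error checks gradient magnitude. A minimal sketch of the check (the function names are illustrative, not the grad_check example's actual code):

```rust
/// Cosine similarity between two gradient vectors (1.0 = same direction).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Relative error between gradient norms, as a fraction of the reference norm.
fn norm_error(a: &[f32], reference: &[f32]) -> f32 {
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nr: f32 = reference.iter().map(|x| x * x).sum::<f32>().sqrt();
    (na - nr).abs() / nr
}

/// The strict threshold: cosine similarity > 0.99 and norm error < 5%.
fn passes_strict(grad: &[f32], reference: &[f32]) -> bool {
    cosine_similarity(grad, reference) > 0.99 && norm_error(grad, reference) < 0.05
}

fn main() {
    let ours = [0.101_f32, -0.2, 0.302];
    let torch_cpu = [0.1_f32, -0.2, 0.3];
    println!("passes: {}", passes_strict(&ours, &torch_cpu));
}
```

Note that the two checks are independent: a gradient scaled by a constant has perfect cosine similarity but a large norm error, which is why both thresholds are needed.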

Benchmark examples

bench_meganeura — SmolLM2-135M inference

bench/bench_meganeura.rs measures per-step latency and tokens per second for SmolLM2-135M text generation. It outputs a JSON file with timing statistics for comparison:
use meganeura::{
    Graph, build_inference_session,
    data::safetensors::SafeTensorsModel,
    models::smollm2::{self, SmolLM2Config},
};

// Context length for the run (set here for the example; the real
// benchmark derives it from the tokenized prompt).
let seq_len = 32;

let config = SmolLM2Config::smollm2_135m();
let mut g = Graph::new();
let logits = smollm2::build_graph(&mut g, &config, seq_len);
g.set_outputs(vec![logits]);

let mut session = build_inference_session(&g);
Run it directly with:
cargo run --release --example bench_meganeura -- \
    --prompt "The meaning of life is" \
    --max-tokens 32 \
    --warmup 3 \
    --runs 5
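The timing statistics in the JSON output follow the usual warm-up/timed-runs split. A sketch of that arithmetic, assuming per-run latencies are already collected (`summarize` is a hypothetical helper, not the benchmark's actual code):

```rust
/// Discard warm-up runs, then report (mean per-run latency in ms,
/// tokens generated per second) over the timed runs.
fn summarize(latencies_ms: &[f64], warmup: usize, tokens_per_run: usize) -> (f64, f64) {
    let timed = &latencies_ms[warmup..];
    let mean_ms = timed.iter().sum::<f64>() / timed.len() as f64;
    let tokens_per_sec = tokens_per_run as f64 / (mean_ms / 1000.0);
    (mean_ms, tokens_per_sec)
}

fn main() {
    // 3 warm-up runs followed by 5 timed runs of 32 tokens each
    // (the defaults from the command above). The slow first runs
    // are excluded from the reported numbers.
    let runs = [120.0, 90.0, 85.0, 80.0, 80.0, 80.0, 80.0, 80.0];
    let (mean_ms, tok_s) = summarize(&runs, 3, 32);
    println!("mean: {mean_ms:.1} ms, {tok_s:.0} tok/s"); // mean: 80.0 ms, 400 tok/s
}
```

Excluding warm-up runs matters here because the first iterations include one-time costs (kernel compilation, cache population) that would otherwise skew the mean.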

bench_smolvla_meganeura — SmolVLA action expert

bench/bench_smolvla_meganeura.rs measures the SmolVLA denoising loop latency using synthetic inputs. On Linux it checks preconditions (AC power, GPU load, clock speed) before running to improve result reliability:
use meganeura::{
    Graph, build_inference_session,
    data::safetensors::SafeTensorsModel,
    models::smolvla::{self, SmolVLAConfig},
};

// Input lengths for the denoising loop (set here for the example;
// CHUNK_SIZE=50 and VLM_SEQ_LEN=16 are the benchmark defaults).
let (action_seq_len, vlm_seq_len) = (50, 16);

let config = SmolVLAConfig::smolvla_base();
let mut g = Graph::new();
let action_out = smolvla::build_action_expert(&mut g, &config, action_seq_len, vlm_seq_len);
g.set_outputs(vec![action_out]);

let mut session = build_inference_session(&g);
Run it directly with:
cargo run --release --example bench_smolvla_meganeura -- \
    --steps 10 \
    --warmup 3 \
    --runs 5 \
    --force
Pass --force to skip the precondition check (useful in CI or when GPU is partially loaded).
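The decision logic of such a precondition check can be sketched as follows. The struct, field names, and the 10% load threshold are illustrative only; the real check also inspects clock speed and reads the actual values from the system:

```rust
/// Illustrative model of the Linux precondition check: refuse to run
/// when results would be unreliable, unless --force overrides it.
struct Preconditions {
    on_ac_power: bool,
    gpu_load_pct: u32,
    force: bool, // set by the --force flag
}

impl Preconditions {
    fn allow_run(&self) -> Result<(), String> {
        if self.force {
            return Ok(()); // e.g. CI, or a deliberately loaded GPU
        }
        if !self.on_ac_power {
            return Err("on battery power; clocks may be throttled".into());
        }
        if self.gpu_load_pct > 10 {
            return Err(format!("GPU already {}% loaded", self.gpu_load_pct));
        }
        Ok(())
    }
}

fn main() {
    let pre = Preconditions { on_ac_power: true, gpu_load_pct: 2, force: false };
    match pre.allow_run() {
        Ok(()) => println!("preconditions met, running benchmark"),
        Err(why) => eprintln!("refusing to run: {why}"),
    }
}
```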

Profiling benchmark runs

Set MEGANEURA_TRACE on any benchmark run to capture a Perfetto trace alongside the timing numbers:
MEGANEURA_TRACE=bench.pftrace \
  cargo run --release --example bench_smolvla_meganeura -- \
    --steps 1 --warmup 0 --runs 1 --force
Open bench.pftrace at https://ui.perfetto.dev/ to see CPU spans and GPU kernel durations on the same timeline. See Profiling for a full walkthrough.
Use a single warm-up run when profiling so the trace reflects steady-state GPU behaviour rather than cold-start compilation artefacts.

Infermark

Aggregated benchmark results across frameworks and hardware are published on Infermark. Infermark provides a reproducible comparison methodology — results are gathered from the same bench/compare.sh script using standardized hardware configurations.
