The bench/ directory contains a benchmark suite that measures latency for training and inference workloads and compares the results against PyTorch. Results are also published on Infermark, a comprehensive cross-framework comparison.
## Running the benchmarks
### Set up the Python virtual environment (one-time)
The benchmark suite runs both the Meganeura Rust binary and the equivalent PyTorch Python script, then prints a side-by-side comparison table.
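The shape of that side-by-side comparison can be sketched in a few lines of Python. This is an illustrative stand-in, not the actual harness; the latency samples below are made up:

```python
import statistics

def summarize(name, samples_ms):
    """Reduce a list of per-run latencies (ms) to mean and population stdev."""
    return name, statistics.mean(samples_ms), statistics.pstdev(samples_ms)

# Made-up timed runs (ms) for the same workload on both frameworks.
rows = [
    summarize("Meganeura", [14.3, 14.1, 14.2, 14.4, 14.2]),
    summarize("PyTorch", [21.0, 20.8, 20.9, 21.1, 20.9]),
]

print(f"{'Framework':<12}{'Mean (ms)':>10}{'Stdev':>8}")
for name, mean, stdev in rows:
    print(f"{name:<12}{mean:>10.2f}{stdev:>8.2f}")
```

Reporting a mean over several timed runs (with warm-up runs excluded from the samples) is what makes the two columns comparable despite run-to-run jitter.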
### Run the default benchmark (SmolVLA training)
The default invocation runs bench_smolvla_train against the PyTorch equivalent and prints a comparison table.

### Environment overrides
You can tune the benchmark parameters with environment variables:

| Variable | Default | Description |
|---|---|---|
| RUNS | 5 | Number of timed runs |
| WARMUP | 3 | Warm-up runs before timing |
| PYTORCH_DTYPE | float32 | PyTorch dtype for comparison |
| STEPS | 10 | Denoising steps (SmolVLA only) |
| CHUNK_SIZE | 50 | Action chunk size (SmolVLA only) |
| VLM_SEQ_LEN | 16 | VLM context length (SmolVLA only) |
| MAX_TOKENS | 32 | Tokens to generate (SmolLM2 only) |
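A harness might consume these overrides roughly like this. This is a hypothetical sketch; the variable names and defaults match the table above, but the real scripts may parse them differently:

```python
import os

def env_int(name, default):
    """Read an integer benchmark parameter from the environment, with a default."""
    return int(os.environ.get(name, default))

config = {
    "runs": env_int("RUNS", 5),
    "warmup": env_int("WARMUP", 3),
    "pytorch_dtype": os.environ.get("PYTORCH_DTYPE", "float32"),
    "steps": env_int("STEPS", 10),
    "chunk_size": env_int("CHUNK_SIZE", 50),
    "vlm_seq_len": env_int("VLM_SEQ_LEN", 16),
    "max_tokens": env_int("MAX_TOKENS", 32),
}
print(config)
```

An invocation such as `RUNS=10 WARMUP=5 ./bench/compare.sh` would then time ten runs after five warm-ups.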
Pass --no-venv to use the system Python instead of the managed venv:
## SmolVLA training results
The table below shows SmolVLA action expert training results (chunk_size=50, vlm_seq_len=16, float32, random weights) with full GQA (15/5 heads, head_dim=64) and exact backward through all ops, including fused MHA and RmsNorm:

| GPU | Framework | Compile | Forward | Backward |
|---|---|---|---|---|
| Radeon 890M (RADV) | Meganeura | 0 s | 14.2 ms | 36.4 ms |
| Radeon 890M (RADV) | PyTorch 2.10.0 ROCm | 7.30 s | 20.9 ms | 48.0 ms |
| GeForce RTX 5080 (590/Linux) | Meganeura | 0 s | 6.1 ms | 35.1 ms |
| GeForce RTX 5080 (590/Linux) | PyTorch 2.11.0+cu128 | 3.41 s | 1.57 ms | 4.68 ms |
| GeForce RTX 3050 (566.36/Windows) | Meganeura | 0 s | 11.2 ms | 53.3 ms |
| GeForce RTX 3050 (566.36/Windows) | PyTorch 2.11.0+cu128 | 0 s (unsupported) | 12.3 ms | 33.8 ms |
| Apple M3 | Meganeura | 0 s | 45.5 ms | 87.0 ms |
| Apple M3 | PyTorch 2.11.0 | 5.92 s | 17.7 ms | 78.1 ms |
Meganeura reports 0 s compile time because the execution plan is cached after the first build — subsequent runs skip autodiff, egglog, and compilation entirely. See Execution plan caching for details.
## Gradient verification
Gradients are verified against PyTorch running on CPU. 88 out of 136 parameters pass the strict threshold (cosine similarity > 0.99, norm error < 5%). Failures are in attention and layernorm weights of deeper layers, where f32 precision differences compound across the network. Gradient magnitudes (norm error) remain below 2% for all parameters. You can reproduce gradient verification with:

## Benchmark examples
### bench_meganeura — SmolLM2-135M inference
bench/bench_meganeura.rs measures per-step latency and tokens per second for SmolLM2-135M text generation. It outputs a JSON file with timing statistics for comparison:
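A sketch of how such a statistics file might be computed is shown below. The field names, the output filename, and the latency samples are all hypothetical; the real schema is defined by the benchmark binary:

```python
import json
import statistics

# Hypothetical per-step decode latencies in milliseconds.
step_ms = [9.8, 10.1, 9.9, 10.0, 10.2]

stats = {
    "model": "SmolLM2-135M",
    "steps": len(step_ms),
    "mean_ms": statistics.mean(step_ms),
    "p50_ms": statistics.median(step_ms),
    "tokens_per_s": 1000.0 / statistics.mean(step_ms),  # one token per decode step
}
with open("bench_meganeura.json", "w") as f:
    json.dump(stats, f, indent=2)
print(json.dumps(stats, indent=2))
```

Emitting JSON rather than a formatted table keeps the Rust and Python sides comparable: both produce the same machine-readable shape, and the comparison step only has to diff two files.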
### bench_smolvla_meganeura — SmolVLA action expert
bench/bench_smolvla_meganeura.rs measures the SmolVLA denoising loop latency using synthetic inputs. On Linux it checks preconditions (AC power, GPU load, clock speed) before running to improve result reliability:
Pass --force to skip the precondition check (useful in CI or when the GPU is partially loaded).
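A minimal version of the AC-power precondition could look like the sketch below. It illustrates the idea using the Linux sysfs power-supply interface; the actual checks in bench/bench_smolvla_meganeura.rs are written in Rust and also cover GPU load and clock speed:

```python
from pathlib import Path

def on_ac_power(sysfs_root="/sys/class/power_supply"):
    """Best-effort AC-power check via Linux sysfs; errs on the side of True."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return True  # not Linux, or no sysfs: don't block the run
    for supply in root.iterdir():
        type_file = supply / "type"
        if type_file.is_file() and type_file.read_text().strip() == "Mains":
            online = supply / "online"
            if online.is_file():
                return online.read_text().strip() == "1"
    return True  # no mains adapter reported: assume a desktop machine

if on_ac_power():
    print("preconditions OK")
else:
    print("warning: on battery power; clocks may throttle and skew results")
```

Failing open (returning True when the answer is unknown) matches the benchmark's intent: the check exists to warn about throttling, not to make the suite unrunnable on unusual hardware.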
## Profiling benchmark runs
Attach MEGANEURA_TRACE to any benchmark run to capture a Perfetto trace alongside timing numbers:
Open bench.pftrace at https://ui.perfetto.dev/ to see CPU spans and GPU kernel durations on the same timeline. See Profiling for a full walkthrough.
## Infermark
Aggregated benchmark results across frameworks and hardware are published on Infermark. Infermark provides a reproducible comparison methodology — results are gathered from the same bench/compare.sh script using standardized hardware configurations.