Why Meganeura?
- **Portable.** Meganeura uses blade-graphics to target GPUs across Linux, Windows, macOS, iOS, and Android through Vulkan and Metal. You write one model definition and it runs everywhere.
- **Zero runtime compilation.** The execution plan is compiled once when you call `build_session`. From that point on, every forward and backward pass runs as a fixed sequence of GPU dispatches — no recompilation, no tracing overhead.
- **E-graph optimized.** During compilation, Meganeura explores the search space of equivalent kernel combinations using egglog equality saturation — the same technique used by Luminal. Operations like matmul+bias, SwiGLU, and RmsNorm are automatically fused without any manual annotation.
- **HuggingFace integration.** Load safetensors weights directly from the HuggingFace Hub. The `SafeTensorsModel` API downloads and maps weights to your graph parameters with a single call.
- **Built-in transformer primitives.** The `nn` module ships with layers for multi-head attention (including grouped-query attention), RoPE positional encoding, RmsNorm, SwiGLU, and more — everything you need to run or fine-tune modern architectures.
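To make the primitives concrete, here is a rough sketch of how a pre-norm decoder block could be assembled from them. The builder methods shown (`rms_norm`, `multi_head_attention`, `swiglu`, `add`) and their signatures are assumptions about the `nn` module's surface, not confirmed API — consult the API reference for the real names.

```rust
// Hypothetical sketch: a pre-norm transformer block from nn primitives.
// Method names and signatures are assumptions, not confirmed API.
let mut graph = Graph::new();
let x = graph.input("hidden_states", &[batch, seq_len, d_model]);

// Attention sub-block: RmsNorm, then GQA attention with RoPE, then residual.
let normed = graph.rms_norm(x, "attn_norm");
let attn = graph.multi_head_attention(normed, n_heads, n_kv_heads, /* rope: */ true);
let x = graph.add(x, attn);

// Feed-forward sub-block: RmsNorm, then SwiGLU, then residual.
let normed = graph.rms_norm(x, "ffn_norm");
let ffn = graph.swiglu(normed, d_ff);
let x = graph.add(x, ffn);
```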
Architecture overview
A model moves through five stages from definition to GPU execution:

- **Graph** — You build a model as a `Graph` of typed tensor operations: `matmul`, `bias_add`, `relu`, `cross_entropy_loss`, and so on. Parameters and inputs are named nodes.
- **Autodiff** — `build_session` automatically extends the graph with backward-pass operations for every trainable parameter using reverse-mode automatic differentiation.
- **Egglog optimization** — The extended graph is passed to an e-graph rewriting engine. It searches for equivalent programs with fewer or cheaper operations and fuses compatible kernels.
- **Compile** — The optimized graph is lowered to WGSL shaders and compiled to GPU-native code via blade-graphics. The result is a fixed list of buffer allocations and compute dispatches.
- **Session** — At runtime, `Session` holds the GPU buffers and executes the dispatch plan. You set inputs and parameters, call `step()`, and read outputs.
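From the user's side, the whole pipeline might look like the following sketch. Only the names `Graph`, `build_session`, `Session`, `step()`, and the listed ops come from this page; the constructor and accessor signatures (`input`, `parameter`, `set_input`, `read_output`) are assumptions for illustration.

```rust
// Hypothetical end-to-end sketch; exact signatures are assumptions.
let mut graph = Graph::new();
let x = graph.input("x", &[batch, 784]);
let w = graph.parameter("w", &[784, 10]);
let b = graph.parameter("b", &[10]);
let logits = graph.bias_add(graph.matmul(x, w), b);
let y = graph.input("y", &[batch, 10]);
let loss = graph.cross_entropy_loss(logits, y);

// Autodiff, egglog optimization, and WGSL compilation all happen here, once.
let mut session = build_session(&graph)?;

session.set_input("x", &images);
session.set_input("y", &labels);
session.step()?; // replays the same fixed dispatch plan every step
let loss_value = session.read_output(loss)?;
```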
Key capabilities
- **Training** — Full gradient-descent training with `Trainer` and `TrainConfig`. Supports SGD and configurable learning rates out of the box.
- **Inference** — Build read-only sessions with `build_inference_session` for faster, memory-efficient forward-only execution.
- **HuggingFace Hub** — Download and load safetensors models directly with `SafeTensorsModel::download`.
- **Built-in models** — Pre-built graph definitions for SmolLM2, SmolVLM2, and Stable Diffusion UNet.
- **Profiling** — Emit binary Perfetto traces with `MEGANEURA_TRACE=<path>` for detailed GPU timeline analysis.
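A training loop with these pieces might look roughly like this. Beyond the type names `Trainer` and `TrainConfig`, every field and method shown (`learning_rate`, `Trainer::new`, `set_input`, `step`) is an assumption for illustration, not confirmed API.

```rust
// Hypothetical training sketch; TrainConfig fields are assumptions.
let config = TrainConfig {
    learning_rate: 1e-3,
    ..Default::default()
};
let mut trainer = Trainer::new(&graph, config)?;

for (images, labels) in batches {
    trainer.set_input("x", &images);
    trainer.set_input("y", &labels);
    let loss = trainer.step()?; // forward + backward + SGD update
    println!("loss = {loss}");
}
```

Run the same binary with `MEGANEURA_TRACE=<path>` set to capture a Perfetto trace of the resulting GPU timeline.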
Get started
Quickstart
Train a two-layer MLP on MNIST in under five minutes with the complete working example.
System requirements
Check supported GPUs, drivers, and platforms before you start.
Concepts
Understand how computation graphs, autodiff, and e-graph optimization work together.
API reference
Full reference for `Graph`, `Session`, `Trainer`, and all neural network layers.