What equality saturation is
A conventional compiler applies rewrite rules in a fixed order: if rule A fires and prevents rule B from firing later, you get a locally optimal but globally suboptimal result. Equality saturation avoids this by maintaining an e-graph (equality graph) — a compact data structure that represents many equivalent program variants at once. Rules are applied until no new equivalences can be discovered (saturation), then the cheapest variant is extracted. The result is that the optimizer can simultaneously discover MatMul+Add → FusedMatMulAdd and SwiGLU(MatMul, MatMul) → SwiGLUConcat(MatMul) without either rule blocking the other.
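The idea can be sketched in a few lines of Python. This is a deliberately naive "saturation": instead of a real e-graph that shares variants compactly, it enumerates every rewrite-reachable variant explicitly and then extracts the cheapest. The node names and cost model are illustrative, not Meganeura's IR.

```python
# Naive saturation sketch: collect the closure of an expression under
# rewrites, then extract the cheapest variant. Real e-graphs share
# subterms instead of materializing every variant like this.

def rewrites(expr):
    """Yield single-step rewrites of expr (tuples are (op, *args))."""
    if not isinstance(expr, tuple):
        return
    op, *args = expr
    # Rule: Add(MatMul(a, b), c) -> FusedMatMulAdd(a, b, c)
    if op == "add" and isinstance(args[0], tuple) and args[0][0] == "matmul":
        yield ("fused_matmul_add", args[0][1], args[0][2], args[1])
    # Recurse into children so rules can fire anywhere in the tree.
    for i, a in enumerate(args):
        for r in rewrites(a):
            yield (op, *args[:i], r, *args[i + 1:])

def saturate(expr):
    """All variants reachable from expr: rewrite until nothing new appears."""
    seen = {expr}
    frontier = [expr]
    while frontier:
        e = frontier.pop()
        for r in rewrites(e):
            if r not in seen:
                seen.add(r)
                frontier.append(r)
    return seen

def cost(expr):
    """Toy cost model: one unit per operator node."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + sum(cost(a) for a in expr[1:])

variants = saturate(("add", ("matmul", "x", "w"), "b"))
best = min(variants, key=cost)
# best is the fused form: ("fused_matmul_add", "x", "w", "b")
```

Because extraction happens only after saturation, adding more rules can only grow the set of candidates; no rule application forecloses another.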
Why egglog
Meganeura uses egglog as its e-graph engine. Egglog combines the equality saturation algorithm with a Datalog-style rule language, making it concise to express algebraic rewrite rules directly over the graph IR. Each graph node is encoded as an egglog term. Rewrite rules express the algebraic equivalences the optimizer is allowed to exploit. Both operand orders of Add are listed explicitly rather than using a general commutativity rule, because a fully commutative Add causes exponential e-class growth on large transformer graphs.
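In egglog syntax, such rules might look like the following sketch (the datatype and constructor names here are illustrative, not Meganeura's actual encoding). Note the two explicitly spelled-out Add orderings in place of a commutativity rule:

```egglog
(datatype Expr
  (MatMul Expr Expr)
  (Add Expr Expr)
  (FusedMatMulAdd Expr Expr Expr))

; Both operand orders listed explicitly, instead of the exponential
; (rewrite (Add a b) (Add b a))
(rewrite (Add (MatMul a b) c) (FusedMatMulAdd a b c))
(rewrite (Add c (MatMul a b)) (FusedMatMulAdd a b c))
```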
For graphs with more than 300 nodes, egglog saturation is skipped and Meganeura falls back to direct pattern matching. Shared-weight graphs such as transformer stacks create large e-classes that make saturation slow at that scale. The same fusion rules are applied via an iterative fixpoint loop instead.
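The fallback can be pictured as a sketch like the following. The graph representation and helper names are hypothetical; the point is the shape of the loop: rescan, apply one fusion destructively, repeat until a full pass changes nothing.

```python
# Hypothetical sketch of a direct pattern-matching fixpoint loop over a
# flat node list. Node dicts and field names are illustrative.

def consumers(nodes, target_id):
    """All nodes that read the output of node target_id."""
    return [n for n in nodes if target_id in n["inputs"]]

def fuse_matmul_add(nodes):
    """One pass: fuse a MatMul whose only consumer is an Add.
    Returns True if anything changed."""
    for mm in nodes:
        if mm["op"] != "MatMul":
            continue
        cons = consumers(nodes, mm["id"])
        if len(cons) == 1 and cons[0]["op"] == "Add":
            add = cons[0]
            other = [i for i in add["inputs"] if i != mm["id"]][0]
            add["op"] = "FusedMatMulAdd"
            add["inputs"] = mm["inputs"] + [other]
            nodes.remove(mm)
            return True
    return False

def optimize(nodes):
    while fuse_matmul_add(nodes):   # iterate to a fixpoint
        pass
    return nodes

graph = [
    {"id": 0, "op": "MatMul", "inputs": ["x", "w"]},
    {"id": 1, "op": "Add", "inputs": [0, "b"]},
]
optimize(graph)
# graph is now a single FusedMatMulAdd node with inputs ["x", "w", "b"]
```

Unlike saturation, this loop commits to each rewrite immediately, so rule ordering matters; it is acceptable here because the fusion rules do not compete with each other.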
Fusions applied
MatMul+Add → FusedMatMulAdd
When a MatMul, MatMulAT, or MatMulBT node is consumed only by an Add node, both are replaced by a single fused node (FusedMatMulAdd, FusedMatMulATAdd, or FusedMatMulBTAdd). This eliminates the intermediate buffer write and saves one round-trip to GPU memory. The fusion applies to both forward and backward passes. For example, the backward input gradient grad_out @ W^T + prev_grad becomes a single FusedMatMulBTAdd dispatch.
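To make the backward example concrete, here is a numpy shape-level sketch. The fused_matmul_bt_add helper is an illustrative stand-in defined directly as its mathematical definition; on the GPU the value is identical and the win is skipping the intermediate buffer round-trip.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 4, 8, 3
W = rng.normal(size=(d_in, d_out))
grad_out = rng.normal(size=(n, d_out))     # gradient w.r.t. the layer output
prev_grad = rng.normal(size=(n, d_in))     # accumulated gradient to add in

# Unfused: MatMulBT (multiply by the transposed weight), then a separate
# Add. The intermediate product is written to memory before the addition.
intermediate = grad_out @ W.T
unfused = intermediate + prev_grad

def fused_matmul_bt_add(a, b, c):
    # Illustrative stand-in for a FusedMatMulBTAdd dispatch.
    return a @ b.T + c

assert np.allclose(unfused, fused_matmul_bt_add(grad_out, W, prev_grad))
```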
SwiGLU(MatMul, MatMul) → SwiGLUConcat(MatMul)
When a SwiGLU node consumes two MatMul nodes that share the same input tensor (h), the optimizer fuses them into one wide matmul followed by SwiGLUConcat. This halves the number of matmul dispatches for the MLP gate+up projection. The optimizer creates a derived concat parameter node and records the split point in DerivedParam so the runtime can fill it correctly when you call session.set_parameter().
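The algebraic identity behind the fusion can be checked in numpy. This sketch assumes the common convention SwiGLU(g, u) = silu(g) * u; the variable names are illustrative:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, d, ff = 4, 8, 16
h = rng.normal(size=(n, d))
w_gate = rng.normal(size=(d, ff))
w_up = rng.normal(size=(d, ff))

# Unfused: two matmuls against the shared input h, then SwiGLU.
unfused = silu(h @ w_gate) * (h @ w_up)

# Fused: one wide matmul against the concatenated weight, then a
# SwiGLUConcat-style op that splits its input and applies SwiGLU.
w_cat = np.concatenate([w_gate, w_up], axis=1)   # the derived concat parameter
wide = h @ w_cat
gate, up = wide[:, :ff], wide[:, ff:]            # split point recorded at ff
fused = silu(gate) * up

assert np.allclose(unfused, fused)
```

The split point (ff here) is exactly what DerivedParam must record so that setting the original gate and up weights at runtime fills the right halves of the concatenated parameter.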
FusedRmsNormMatMul (disabled by default)
MatMul(RmsNorm(x, w_norm), w_proj) → FusedRmsNormMatMul(x, w_norm, w_proj) saves one intermediate write. This fusion is currently disabled for integrated GPUs because the per-element normalization overhead inside the tiled kernel outweighs the bandwidth savings. It can be enabled manually via apply_rms_norm_matmul_fusions.
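As a numerical sketch of the equivalence, assuming the usual RmsNorm convention (scale by the reciprocal RMS, then by a learned per-channel weight; the eps placement here is illustrative):

```python
import numpy as np

def rms_norm(x, w, eps=1e-6):
    # Assumed RmsNorm: x / rms(x) * w, computed row-wise.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * w

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_norm = rng.normal(size=(8,))
w_proj = rng.normal(size=(8, 3))

# Unfused: normalize, materialize the intermediate, then project.
unfused = rms_norm(x, w_norm) @ w_proj

# Fused: fold the row-wise scale into the matmul input on the fly
# (a sketch of the idea; the real kernel does this per tile instead
# of materializing the normalized tensor).
inv_rms = 1.0 / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + 1e-6)
fused = ((x * inv_rms) * w_norm) @ w_proj

assert np.allclose(unfused, fused)
```

The per-row scale is cheap arithmetically, but inside a tiled matmul each tile must recompute or reload it, which is the overhead that makes the fusion a loss on integrated GPUs.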
GroupNorm+Silu → GroupNormSilu (inference only)
Applied by apply_group_norm_silu_fusions during build_inference_session. Not applied to training graphs because the fused backward pass is not implemented.
OptimizeReport
build_session_with_report() returns an OptimizeReport alongside the session. Use it to inspect what the optimizer did.
Getting the report
When the optimizer runs
The optimizer runs twice during build_session():
- Before autodiff — on the forward graph only. This runs before differentiate() so that the backward pass differentiates the fused ops (e.g. SwiGLUConcat) rather than the unfused originals.
- After autodiff — on the full graph (forward + backward). This catches new fusion opportunities in the backward pass, such as fusing backward MatMulBT+Add nodes.
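The ordering can be sketched as follows. Both optimize and differentiate here are toy stand-ins over a tuple "graph" encoding (only the function name differentiate comes from the source); the point is which pass sees which graph:

```python
# Toy sketch of the two optimization passes around autodiff.

def optimize(expr):
    """Toy fusion: Add(MatMul(a, b), c) -> FusedMatMulAdd(a, b, c)."""
    if isinstance(expr, tuple):
        expr = tuple(optimize(a) for a in expr)
        if expr[0] == "add" and isinstance(expr[1], tuple) and expr[1][0] == "matmul":
            return ("fused_matmul_add", *expr[1][1:], expr[2])
    return expr

def differentiate(forward):
    """Toy autodiff: just records that the backward pass was derived
    from the (already fused) forward graph."""
    return ("backward_of", forward)

# Pass 1: before autodiff, so the backward pass differentiates fused ops.
forward = optimize(("add", ("matmul", "x", "w"), "b"))

# Pass 2: after autodiff, over forward + backward together, to catch
# fusion opportunities that only exist in the backward graph.
full = optimize(("graph", forward, differentiate(forward)))
```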
Both passes are recorded in the OptimizeReport returned by build_session_with_report.
Next steps
Automatic differentiation
How the backward pass is constructed from the optimized forward graph.
GPU execution
How the compiled plan is executed on Vulkan and Metal hardware.