Meganeura uses reverse-mode automatic differentiation (backpropagation) to compute gradients. Rather than executing gradient computations at runtime, differentiate() statically appends backward nodes to the graph at build time. The resulting graph contains both the forward and backward passes as a single static compute sequence.
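As a minimal sketch of this idea (with toy types, not Meganeura's actual Graph), the node list below holds the forward pass for y = x * x followed by its appended backward nodes, and a single linear sweep computes both the loss and the gradient:

```rust
// Toy static graph: forward and backward nodes live in one flat list.
// Node ids are indices into the list; backward nodes simply come later.
#[derive(Clone, Copy)]
enum Node {
    Input(f64),
    Const(f64),
    Mul(usize, usize),
    Add(usize, usize),
}

// One linear sweep evaluates the whole sequence, forward then backward.
fn eval(nodes: &[Node]) -> Vec<f64> {
    let mut v = Vec::with_capacity(nodes.len());
    for n in nodes {
        let val = match *n {
            Node::Input(x) | Node::Const(x) => x,
            Node::Mul(a, b) => v[a] * v[b],
            Node::Add(a, b) => v[a] + v[b],
        };
        v.push(val);
    }
    v
}

fn run() -> (f64, f64) {
    let nodes = vec![
        Node::Input(3.0), // 0: x
        Node::Mul(0, 0),  // 1: y = x * x            (forward)
        Node::Const(1.0), // 2: dy = 1               (backward starts here)
        Node::Mul(2, 0),  // 3: dy * x
        Node::Mul(0, 2),  // 4: x * dy
        Node::Add(3, 4),  // 5: dx = dy*x + x*dy = 2x
    ];
    let vals = eval(&nodes);
    (vals[1], vals[5]) // (loss, d loss / d x)
}

fn main() {
    let (loss, dx) = run();
    assert_eq!(loss, 9.0);
    assert_eq!(dx, 6.0); // d(x*x)/dx at x = 3
}
```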

The differentiate() function

pub fn differentiate(forward: &Graph) -> Graph
differentiate() takes a forward graph that ends in a scalar loss and returns a new graph containing:
  1. All original forward nodes (copied verbatim).
  2. New gradient nodes appended after the forward nodes.
  3. An updated output list: [loss, grad_param_0, grad_param_1, ...] — one gradient output per Parameter node, in the order they appear in the forward graph.
The function iterates over forward nodes in reverse topological order, propagating gradients using the chain rule. When multiple forward paths contribute gradients to the same node (e.g. a shared weight used in two matmuls), the gradients are accumulated with Add:
fn accumulate_grad(
    graph: &mut Graph,
    grads: &mut HashMap<NodeId, NodeId>,
    node: NodeId,
    grad: NodeId,
) {
    match grads.get(&node) {
        Some(&existing) => {
            let sum = graph.add(existing, grad);
            grads.insert(node, sum);
        }
        None => {
            grads.insert(node, grad);
        }
    }
}
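A toy, self-contained demonstration of this accumulation behavior (the stand-in Graph below only records Add nodes; the node ids and the id offset are made up and do not match Meganeura's real Graph):

```rust
use std::collections::HashMap;

type NodeId = usize;

// Stand-in graph that records only Add nodes as (lhs, rhs) input pairs.
struct Graph {
    nodes: Vec<(NodeId, NodeId)>,
}

impl Graph {
    fn new() -> Self {
        Graph { nodes: Vec::new() }
    }
    fn add(&mut self, a: NodeId, b: NodeId) -> NodeId {
        self.nodes.push((a, b));
        self.nodes.len() - 1 + 100 // arbitrary offset so ids don't collide
    }
}

// Same shape as the function in the docs: first contribution is stored
// directly; later contributions are summed with an appended Add node.
fn accumulate_grad(
    graph: &mut Graph,
    grads: &mut HashMap<NodeId, NodeId>,
    node: NodeId,
    grad: NodeId,
) {
    match grads.get(&node) {
        Some(&existing) => {
            let sum = graph.add(existing, grad);
            grads.insert(node, sum);
        }
        None => {
            grads.insert(node, grad);
        }
    }
}

fn main() {
    let mut g = Graph::new();
    let mut grads: HashMap<NodeId, NodeId> = HashMap::new();
    // first contribution to node 7 is stored directly; no Add is emitted
    accumulate_grad(&mut g, &mut grads, 7, 1);
    assert_eq!(grads[&7], 1);
    assert!(g.nodes.is_empty());
    // second contribution: Add(1, 2) is appended and becomes the gradient
    accumulate_grad(&mut g, &mut grads, 7, 2);
    assert_eq!(g.nodes, vec![(1, 2)]);
    assert_eq!(grads[&7], 100);
}
```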

Gradient rules

Each Op has a corresponding gradient rule. Here are the most important ones:
For C = A @ B:
  • dA = dC @ B^T → MatMulBT(dC, B)
  • dB = A^T @ dC → MatMulAT(A, dC)
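A quick numeric check of the dA rule in plain Rust, independent of Meganeura: with loss = sum(A @ B), the upstream gradient dC is all ones, so dA = dC @ B^T reduces to dA[i][p] = sum_j B[p][j], which we confirm against a finite difference on each entry of A.

```rust
// Naive dense matmul over nested Vecs, just for the check.
fn matmul(a: &[Vec<f64>], b: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (m, k, n) = (a.len(), b.len(), b[0].len());
    let mut c = vec![vec![0.0; n]; m];
    for i in 0..m {
        for p in 0..k {
            for j in 0..n {
                c[i][j] += a[i][p] * b[p][j];
            }
        }
    }
    c
}

fn sum_all(m: &[Vec<f64>]) -> f64 {
    m.iter().flatten().sum()
}

fn main() {
    let a = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let b = vec![vec![5.0, 6.0], vec![7.0, 8.0]];
    let eps = 1e-6;
    for i in 0..2 {
        for p in 0..2 {
            // analytic dA[i][p] = (dC @ B^T)[i][p] = sum_j B[p][j], dC = ones
            let analytic: f64 = b[p].iter().sum();
            let mut a2 = a.clone();
            a2[i][p] += eps;
            let fd = (sum_all(&matmul(&a2, &b)) - sum_all(&matmul(&a, &b))) / eps;
            assert!((fd - analytic).abs() < 1e-4);
        }
    }
}
```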
For C = A @ B + D:
  • Same dA and dB as MatMul, plus dD = dC: the addend's gradient passes straight through.
For out = input + bias (broadcast):
  • d_input = d_out
  • d_bias = SumRows(d_out) — column sums of the upstream gradient.
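The bias rule in a few lines of plain Rust: because each bias element was added to every row of the output, its gradient is the sum of the upstream gradient down that column (the reduction SumRows performs).

```rust
// Reduce a [rows, 2] upstream gradient over the row index, yielding the
// per-column sums that form d_bias.
fn col_sums(d_out: &[[f64; 2]]) -> [f64; 2] {
    let mut d_bias = [0.0; 2];
    for row in d_out {
        for j in 0..2 {
            d_bias[j] += row[j];
        }
    }
    d_bias
}

fn main() {
    let d_out = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];
    assert_eq!(col_sums(&d_out), [9.0, 12.0]);
}
```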
For y = RmsNorm(x, w, eps):
  • d_w → RmsNormGradW(dy, x, w) — exact formula: d_w[j] = sum_i(dy[i,j] * x[i,j] * rsqrt_i)
  • d_x → RmsNormGradX(dy, x, w) — exact formula: d_x[i,j] = rsqrt_i * (dy[i,j]*w[j] - x[i,j]*s_i)
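A finite-difference check of the RmsNormGradW formula in plain Rust, assuming the standard RMSNorm definition y[i,j] = x[i,j] * rsqrt_i * w[j] with rsqrt_i = 1/sqrt(mean_j(x[i,j]^2) + eps); none of Meganeura's types are involved.

```rust
// RMSNorm over a fixed [2, 3] input, written out explicitly for clarity.
fn rmsnorm(x: &[[f64; 3]; 2], w: &[f64; 3], eps: f64) -> [[f64; 3]; 2] {
    let mut y = [[0.0; 3]; 2];
    for i in 0..2 {
        let ms: f64 = x[i].iter().map(|v| v * v).sum::<f64>() / 3.0;
        let rsqrt = 1.0 / (ms + eps).sqrt();
        for j in 0..3 {
            y[i][j] = x[i][j] * rsqrt * w[j];
        }
    }
    y
}

// Scalar loss = sum of all outputs, so dy is all ones.
fn loss(x: &[[f64; 3]; 2], w: &[f64; 3], eps: f64) -> f64 {
    rmsnorm(x, w, eps).iter().flatten().sum()
}

fn main() {
    let x = [[1.0, -2.0, 0.5], [0.3, 1.5, -1.0]];
    let w = [0.9, 1.1, 1.0];
    let eps = 1e-5;
    // analytic d_w[j] = sum_i dy[i,j] * x[i,j] * rsqrt_i, with dy = ones
    let mut dw = [0.0; 3];
    for i in 0..2 {
        let ms: f64 = x[i].iter().map(|v| v * v).sum::<f64>() / 3.0;
        let rsqrt = 1.0 / (ms + eps).sqrt();
        for j in 0..3 {
            dw[j] += x[i][j] * rsqrt;
        }
    }
    let h = 1e-6;
    for j in 0..3 {
        let mut w2 = w;
        w2[j] += h;
        let fd = (loss(&x, &w2, eps) - loss(&x, &w, eps)) / h;
        assert!((fd - dw[j]).abs() < 1e-4);
    }
}
```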
For out = silu(gate) * up:
  • d_gate → SwiGLUGradGate(d_out, gate, up)
  • d_up → SwiGLUGradUp(d_out, gate)
For out[M,N] = silu(input[:,:N]) * input[:,N:]:
  • d_input → SwiGLUConcatGrad(d_out, input) — produces [M, 2*N].
For y = silu(x):
  • d_x → SiluGrad(d_out, x) — computes d/dx silu(x) = sigmoid(x) * (1 + x * (1 - sigmoid(x))).
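The SiluGrad formula is easy to verify against a central finite difference in plain Rust:

```rust
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

fn silu(x: f64) -> f64 {
    x * sigmoid(x)
}

// The closed form quoted above: silu'(x) = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
fn silu_grad(x: f64) -> f64 {
    let s = sigmoid(x);
    s * (1.0 + x * (1.0 - s))
}

fn main() {
    let eps = 1e-6;
    for &x in &[-2.0, -0.5, 0.0, 0.5, 2.0] {
        // central difference of silu at x
        let fd = (silu(x + eps) - silu(x - eps)) / (2.0 * eps);
        assert!((fd - silu_grad(x)).abs() < 1e-6);
    }
}
```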
For O = MultiHeadAttn(Q, K, V), three separate gradient nodes are emitted:
  • d_Q → MultiHeadAttnGradQ(dO, Q, K, V) (references the forward node id for the LSE buffer)
  • d_K → MultiHeadAttnGradK(dO, Q, K, V)
  • d_V → MultiHeadAttnGradV(dO, Q, K, V)
For a softmax cross-entropy loss over logits:
  • d_logits = softmax(logits) - labels
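This compact form can be checked numerically in plain Rust: differentiating loss = -sum(labels * log softmax(logits)) by finite differences matches softmax(logits) - labels elementwise.

```rust
// Numerically stable softmax (subtract the max before exponentiating).
fn softmax(z: &[f64]) -> Vec<f64> {
    let m = z.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let e: Vec<f64> = z.iter().map(|v| (v - m).exp()).collect();
    let s: f64 = e.iter().sum();
    e.iter().map(|v| v / s).collect()
}

fn cross_entropy(logits: &[f64], labels: &[f64]) -> f64 {
    let p = softmax(logits);
    -labels.iter().zip(&p).map(|(y, pi)| y * pi.ln()).sum::<f64>()
}

fn main() {
    let logits = [1.0, 2.0, 0.5];
    let labels = [0.0, 1.0, 0.0];
    let p = softmax(&logits);
    let eps = 1e-6;
    for i in 0..3 {
        let mut shifted = logits;
        shifted[i] += eps;
        let fd = (cross_entropy(&shifted, &labels) - cross_entropy(&logits, &labels)) / eps;
        // finite difference agrees with softmax(logits) - labels
        assert!((fd - (p[i] - labels[i])).abs() < 1e-4);
    }
}
```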
For out = relu(x):
  • d_x = d_out * (x > 0) — uses a Greater node to create the mask.
For out = mean over all N elements of x:
  • d_x = (1/N) broadcast to the shape of x — emitted as a constant tensor.
For an embedding lookup out = table[indices]:
  • d_table → ScatterAdd(indices, d_out, vocab_size) — accumulates gradient rows by index.
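The scatter-add pattern in a toy, self-contained form: rows of d_out are accumulated into d_table at the positions given by indices, and repeated indices sum, matching how a token embedding used twice receives both gradient contributions.

```rust
// Accumulate each row of d_out into d_table[indices[k]]; repeated
// indices add their rows together.
fn scatter_add(indices: &[usize], d_out: &[[f64; 2]], vocab_size: usize) -> Vec<[f64; 2]> {
    let mut d_table = vec![[0.0; 2]; vocab_size];
    for (row, &idx) in d_out.iter().zip(indices) {
        for j in 0..2 {
            d_table[idx][j] += row[j];
        }
    }
    d_table
}

fn main() {
    // token 2 appears twice, so its row receives both contributions
    let d_table = scatter_add(&[2, 0, 2], &[[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]], 4);
    assert_eq!(d_table[2], [4.0, 4.0]);
    assert_eq!(d_table[0], [2.0, 2.0]);
    assert_eq!(d_table[1], [0.0, 0.0]); // untouched rows stay zero
}
```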
For out = Conv2d(input, kernel):
  • d_input → Conv2dGradInput(d_out, kernel)
  • d_kernel → Conv2dGradWeight(d_out, input)
Input nodes, Constant nodes, and inference-only ops (RoPE, CausalAttention, LayerNorm, CacheWrite, etc.) do not accumulate gradients. Calling differentiate() on a graph containing inference-only ops emits a log warning and leaves those ops without gradients.

Autodiff runs on the optimized graph

A critical detail: differentiate() is called on the optimized forward graph, not the original one. This means the backward pass differentiates through fused ops like SwiGLUConcat and FusedMatMulAdd directly, rather than re-deriving them from their unfused components. If you call differentiate() on an unoptimized graph, the optimizer still runs on the full graph afterward and will fuse the backward matmuls, but you will not benefit from SwiGLUConcat on the forward pass.

The full pipeline

1. Build the forward graph

Construct your model ending in a scalar loss and call set_outputs().

2. Optimize the forward graph

let optimized_forward = meganeura::optimize::optimize(&forward_graph);
This fuses SwiGLU, MatMul+Add, and other patterns before differentiation.

3. Toposort

let sorted = optimized_forward.toposort();
The optimizer may have appended nodes out of order; toposort restores dependency order.

4. Differentiate

let full_graph = meganeura::autodiff::differentiate(&sorted);
The returned graph has outputs [loss, grad_w1, grad_b1, grad_w2, ...].

5. Optimize the full graph

let (optimized, report) = meganeura::optimize::optimize_with_report(&full_graph);
This second pass fuses backward MatMulBT+Add patterns and similar opportunities.

6. Compile

let plan = meganeura::compile::compile(&optimized);
All six steps are wrapped by build_session_with_report(). Call compile_training_graph() to run the pipeline through compilation without creating a GPU session, which is useful in environments without a GPU:
use meganeura::compile_training_graph;

let mut g = Graph::new();
let x  = g.input("x",      &[4, 3]);
let w  = g.parameter("w",  &[3, 2]);
let y  = g.matmul(x, w);
let loss = g.mean_all(y);
g.set_outputs(vec![loss]);

let (plan, report) = compile_training_graph(&g);
println!("dispatches: {}", plan.dispatches.len());
println!("param/grad pairs: {}", plan.param_grad_pairs.len());
println!("{}", report);
compile_training_graph skips the forward-graph optimization step (it calls differentiate on the raw graph). For production use, prefer build_session_with_report which runs the full two-pass pipeline.

Gradient outputs and parameter alignment

The output list produced by differentiate() has a fixed structure:
outputs = [loss_node, grad_param_0, grad_param_1, ..., grad_param_N]
Every Parameter node in the forward graph gets exactly one gradient output, in the same positional order. If the optimizer fused a parameter away (marking it Nop), a scalar zero placeholder is emitted so that the positional mapping stays intact for compile.rs to build param_grad_pairs.
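A sketch of how a consumer can recover (parameter, gradient) pairs from this fixed layout; the node ids and the helper function below are made up for illustration and are not part of Meganeura's API.

```rust
// outputs[0] is the loss; outputs[1..] align positionally with the
// Parameter nodes in forward-graph order.
fn pair_params(param_ids: &[usize], outputs: &[usize]) -> Vec<(usize, usize)> {
    param_ids
        .iter()
        .copied()
        .zip(outputs[1..].iter().copied())
        .collect()
}

fn main() {
    let outputs = vec![40, 41, 42, 43]; // [loss, grad_p0, grad_p1, grad_p2]
    let param_ids = vec![1, 2, 3];      // Parameter nodes in forward order
    assert_eq!(
        pair_params(&param_ids, &outputs),
        vec![(1, 41), (2, 42), (3, 43)]
    );
}
```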

Next steps

GPU execution

How the compiled plan is executed on GPU hardware.

Trainers and optimizers

How to use SGD and Adam with a compiled training session.
