Graph is the central data structure in Meganeura. Every call to a builder method appends a node and returns its NodeId. Nodes are identified by a u32 handle and stored in definition order. The graph can be topologically sorted and compiled to a GPU program.
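The append-and-return-NodeId pattern can be sketched in pure Python. The `Graph` class below is a hypothetical stand-in for illustration, not Meganeura's actual API; it shows why definition order is already a valid topological order, and how a compile step could re-derive one with Kahn's algorithm.

```python
# Minimal sketch of an append-only graph builder (hypothetical, not the real API).
# Each builder call appends a node and returns its integer handle (the NodeId).
from collections import deque

class Graph:
    def __init__(self):
        self.nodes = []  # nodes kept in definition order

    def add(self, op, inputs=()):
        self.nodes.append((op, tuple(inputs)))
        return len(self.nodes) - 1  # NodeId is just the index

    def topo_order(self):
        # Kahn's algorithm. Definition order is already topological (a node can
        # only reference NodeIds that exist), but a compiler may re-sort after
        # graph transformations.
        indegree = [len(set(inputs)) for _, inputs in self.nodes]
        users = [[] for _ in self.nodes]
        for nid, (_, inputs) in enumerate(self.nodes):
            for inp in set(inputs):
                users[inp].append(nid)
        ready = deque(i for i, d in enumerate(indegree) if d == 0)
        order = []
        while ready:
            nid = ready.popleft()
            order.append(nid)
            for u in users[nid]:
                indegree[u] -= 1
                if indegree[u] == 0:
                    ready.append(u)
        return order

g = Graph()
x = g.add("input")           # NodeId 0
w = g.add("parameter")       # NodeId 1
y = g.add("matmul", (x, w))  # NodeId 2, depends on 0 and 1
```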
Tensor creation
g.input
  Declares an f32 runtime input (e.g. data or activations passed at inference time).
  Parameters:
    - Unique name used to bind values at runtime.
    - Tensor shape.
  Returns: node of type f32 with the given shape.

g.input_u32
  Declares a u32 runtime input. Required for token indices, position counters, and other integer data.
  Parameters:
    - Unique name used to bind values at runtime.
    - Tensor shape.
  Returns: node of type u32 with the given shape.

g.parameter
  Declares a learnable f32 parameter (weight or bias). Parameters are loaded from a checkpoint and updated by the optimizer.
  Parameters:
    - Unique name used to look up the parameter in the weight file.
    - Parameter shape.
  Returns: node of type f32 with the given shape.

g.constant
  Embeds a fixed f32 tensor whose values are known at graph construction time.
  Parameters:
    - Flat data buffer. Length must equal the product of all shape dimensions.
    - Tensor shape.
  Returns: node holding the constant values.
g.scalar
  Convenience wrapper that creates a constant with shape [1].
  Parameters:
    - The scalar value.
  Returns: a shape [1] constant node.

Matrix operations
All matrix ops require 2D tensors.

g.matmul
  Standard matrix multiply: C = A @ B.
  Parameters:
    - A: shape [M, K].
    - B: shape [K, N].
  Returns: shape [M, N].

g.matmul_at
  Transposed-A matrix multiply: C = A^T @ B.
  A is stored as [K, M] (i.e. the transpose is implicit; no actual transpose is performed).
  Parameters:
    - A: shape [K, M] (stored transposed).
    - B: shape [K, N].
  Returns: shape [M, N].
g.matmul_bt
  Transposed-B matrix multiply: C = A @ B^T.
  B is stored as [N, K] (transposed layout).
  Parameters:
    - A: shape [M, K].
    - B: shape [N, K] (stored transposed).
  Returns: shape [M, N].

Elementwise ops
g.add
  Element-wise addition. Both inputs must have the same shape.
  Parameters:
    - a: shape [...].
    - b: same shape as a.
  Returns: same shape as inputs.
g.mul
  Element-wise multiplication. Both inputs must have the same shape.
  Parameters:
    - a: shape [...].
    - b: same shape as a.
  Returns: same shape as inputs.
g.bias_add
  Adds a 1D bias to each row of a 2D tensor: out[i, j] = a[i, j] + bias[j].
  Parameters:
    - a: 2D tensor of shape [M, N].
    - bias: 1D bias of shape [N].
  Returns: shape [M, N].

g.broadcast_add
  Adds a [1, N] tensor to every row of an [M, N] tensor. Uses the same BiasAdd shader as bias_add.
  Parameters:
    - a: 2D tensor of shape [M, N].
    - b: 2D tensor of shape [1, N].
  Returns: shape [M, N].
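The shared semantics of bias_add and broadcast_add reduce to one formula, sketched here in pure Python (illustrative only; the real op runs as a shader):

```python
# bias_add semantics: out[i][j] = a[i][j] + bias[j].
# broadcast_add is the same computation with the bias supplied as a [1, N] row.
def bias_add(a, bias):
    assert all(len(row) == len(bias) for row in a), "bias length must equal N"
    return [[x + b for x, b in zip(row, bias)] for row in a]

a = [[1.0, 2.0],
     [3.0, 4.0]]                  # [M=2, N=2]
out = bias_add(a, [10.0, 20.0])  # bias broadcast down the rows
```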
g.greater
  Element-wise greater-than comparison, used in autodiff (e.g. the ReLU gradient). Both inputs must have the same shape.
  Parameters:
    - a: shape [...].
    - b: same shape as a.
  Returns: same shape; values are 1.0 where a > b, 0.0 otherwise.

g.neg
  Element-wise negation: out = -x.
  Parameters:
    - x: any shape.
  Returns: same shape.

g.abs
  Element-wise absolute value: out = |x|.
  Parameters:
    - x: any shape.
  Returns: same shape.

g.log
  Element-wise natural logarithm: out = ln(x).
  Parameters:
    - x: any shape.
  Returns: same shape.

g.recip
  Element-wise reciprocal: out = 1 / x.
  Parameters:
    - x: any shape.
  Returns: same shape.

g.div
  Element-wise division: out = a / b. Implemented as a * recip(b).
  Parameters:
    - a: any shape.
    - b: same shape as a.
  Returns: same shape.
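The a * recip(b) decomposition of div can be written out directly (a tiny pure-Python sketch of the documented identity):

```python
# g.div is documented as a * recip(b): elementwise division computed via
# the reciprocal op rather than a dedicated divide shader.
def recip(xs):
    return [1.0 / v for v in xs]

def div(a, b):
    return [x * r for x, r in zip(a, recip(b))]

q = div([1.0, 9.0, -4.0], [2.0, 4.0, 8.0])
```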
Activations
g.relu
  Rectified linear unit: out = max(0, x).
  Parameters:
    - x: any shape.
  Returns: same shape.

g.sigmoid
  Logistic sigmoid: out = 1 / (1 + exp(-x)).
  Parameters:
    - x: any shape.
  Returns: same shape.

g.silu
  Sigmoid linear unit: out = x * sigmoid(x).
  Parameters:
    - x: any shape.
  Returns: same shape.

g.gelu
  Gaussian error linear unit: out = x * 0.5 * (1 + erf(x / sqrt(2))).
  Parameters:
    - x: any shape.
  Returns: same shape.
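The scalar forms of silu and gelu above translate directly to pure Python (a reference sketch of the formulas, not the shaders):

```python
import math

# gelu(x) = x * 0.5 * (1 + erf(x / sqrt(2)))
def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# silu(x) = x * sigmoid(x) = x / (1 + exp(-x))
def silu(x):
    return x / (1.0 + math.exp(-x))
```

Both are smooth relaxations of ReLU: near-zero for large negative x, approximately the identity for large positive x.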
g.swiglu
  Fused SwiGLU: out = silu(gate) * up. Both inputs must have the same shape.
  Parameters:
    - gate: tensor of shape [M, N].
    - up: tensor of shape [M, N].
  Returns: shape [M, N].

g.swiglu_concat
  SwiGLU on a concatenated input of shape [M, 2*N]. Reads gate from the first half of the last dimension and up from the second half.
  Parameters:
    - 2D tensor of shape [M, 2*N]. The last dimension must be even.
  Returns: shape [M, N].
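The half-splitting behavior of swiglu_concat can be sketched row-by-row in pure Python (illustrative only):

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

# swiglu_concat: each row has 2*N columns; gate = first N, up = last N;
# out = silu(gate) * up, giving an [M, N] result.
def swiglu_concat(rows):
    out = []
    for row in rows:
        assert len(row) % 2 == 0, "last dimension must be even"
        n = len(row) // 2
        gate, up = row[:n], row[n:]
        out.append([silu(gv) * uv for gv, uv in zip(gate, up)])
    return out

y = swiglu_concat([[0.0, 1.0, 5.0, 7.0]])  # M=1, 2*N=4 -> result [M=1, N=2]
```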
Reductions

g.sum_all
  Sums all elements to a scalar.
  Parameters:
    - Input tensor of any shape.
  Returns: shape [1].

g.mean_all
  Averages all elements to a scalar.
  Parameters:
    - Input tensor of any shape.
  Returns: shape [1].

g.softmax
  Row-wise softmax for 2D inputs.
  Parameters:
    - Input of any shape (applied row-wise for 2D).
  Returns: same shape.
g.log_softmax
  Numerically stable log-softmax (row-wise for 2D inputs).
  Parameters:
    - Input of any shape.
  Returns: same shape.
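"Numerically stable" here means the usual max-subtraction trick; a single-row pure-Python sketch (the real op is a shader):

```python
import math

# Stable row-wise log-softmax: subtract the row max before exponentiating,
# so large logits do not overflow exp().
def log_softmax_row(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))  # log-sum-exp
    return [v - lse for v in row]

ls = log_softmax_row([1000.0, 1000.0])  # naive exp(1000.0) would overflow
```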
g.transpose
  Swaps the two dimensions of a 2D tensor: [M, N] → [N, M].
  Parameters:
    - 2D tensor of shape [M, N].
  Returns: shape [N, M].

Embedding ops
g.embedding
  Looks up rows from an embedding table.
  Parameters:
    - indices: 1D u32 tensor of shape [seq_len].
    - 2D f32 parameter of shape [vocab_size, embed_dim] (the embedding table).
  Returns: shape [seq_len, embed_dim].

g.scatter_add
  Accumulates source rows into an output tensor indexed by indices. The backward of embedding.
  Parameters:
    - indices: 1D u32 tensor of shape [seq_len].
    - src: 2D f32 tensor of shape [seq_len, embed_dim].
    - Number of rows in the output accumulator (vocab_size).
  Returns: shape [vocab_size, embed_dim], where output[indices[i]] += src[i].
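The gather/scatter pairing can be sketched in pure Python; note how repeated indices accumulate in scatter_add, which is exactly why it serves as the backward of embedding (reference sketch only):

```python
# embedding: gather rows from the table; scatter_add: its backward, adding
# gradient rows back into table positions selected by the same indices.
def embedding(indices, table):
    return [list(table[i]) for i in indices]

def scatter_add(indices, src, num_rows):
    embed_dim = len(src[0])
    out = [[0.0] * embed_dim for _ in range(num_rows)]
    for i, row_idx in enumerate(indices):
        for j in range(embed_dim):
            out[row_idx][j] += src[i][j]  # output[indices[i]] += src[i]
    return out

table = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]       # [vocab_size=3, embed_dim=2]
rows = embedding([2, 0, 2], table)                 # [seq_len=3, embed_dim=2]
grad = scatter_add([2, 0, 2], [[1.0, 0.0]] * 3, 3) # row 2 accumulates twice
```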
Spatial ops
All spatial ops work on tensors stored as flat 1D arrays in NCHW order.

g.conv2d
  2D convolution: input[N, C_in, H, W] * kernel[C_out, C_in, kH, kW] → output[N, C_out, oH, oW].
  Parameters:
    - input: flat tensor of size N * C_in * H * W.
    - kernel: flat kernel of size C_out * C_in * kH * kW.
    - Batch size N.
    - Input channels C_in.
    - Input height H.
    - Input width W.
    - Output channels C_out.
    - Kernel height kH.
    - Kernel width kW.
    - Convolution stride.
    - Zero-padding on each edge.
  Returns: flat tensor of size N * C_out * oH * oW, where oH = (H + 2*padding - kH) / stride + 1 (and oW likewise).
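A naive pure-Python reference for the NCHW indexing and the output-shape formula (illustrative; the real op is a GPU shader and will not loop like this):

```python
# Naive conv2d over flat NCHW buffers.
# oH = (H + 2*padding - kH) // stride + 1, and likewise for oW.
def conv2d(inp, ker, n, c_in, h, w, c_out, kh, kw, stride, padding):
    oh = (h + 2 * padding - kh) // stride + 1
    ow = (w + 2 * padding - kw) // stride + 1
    out = [0.0] * (n * c_out * oh * ow)
    for b in range(n):
        for co in range(c_out):
            for oy in range(oh):
                for ox in range(ow):
                    acc = 0.0
                    for ci in range(c_in):
                        for ky in range(kh):
                            for kx in range(kw):
                                iy = oy * stride + ky - padding
                                ix = ox * stride + kx - padding
                                if 0 <= iy < h and 0 <= ix < w:  # zero padding
                                    acc += (inp[((b * c_in + ci) * h + iy) * w + ix]
                                            * ker[((co * c_in + ci) * kh + ky) * kw + kx])
                    out[((b * c_out + co) * oh + oy) * ow + ox] = acc
    return out

# 1x1x3x3 input of ones, 1x1x2x2 kernel of ones, stride 1, no padding -> 2x2 output
y = conv2d([1.0] * 9, [1.0] * 4, 1, 1, 3, 3, 1, 2, 2, 1, 0)
```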
g.concat
  Concatenates two flat NCHW tensors along the channel dimension: [N, Ca, H, W] ++ [N, Cb, H, W] → [N, Ca+Cb, H, W].
  Parameters:
    - a: flat tensor of size N * Ca * H * W.
    - b: flat tensor of size N * Cb * H * W.
    - Batch size N.
    - Number of channels in a (Ca).
    - Number of channels in b (Cb).
    - Spatial size H * W.
  Returns: flat tensor of size N * (Ca + Cb) * H * W.

g.split_a
  Extracts the first channels_a channels from a concatenated [N, Ca+Cb, H, W] tensor.
  Parameters:
    - Flat tensor of size N * (Ca + Cb) * H * W.
    - Batch size N.
    - Channels to extract (the first Ca).
    - Remaining channels Cb.
    - Spatial size H * W.
  Returns: flat tensor of size N * Ca * H * W.

g.split_b
  Extracts the last channels_b channels from a concatenated [N, Ca+Cb, H, W] tensor.
  Parameters:
    - Flat tensor of size N * (Ca + Cb) * H * W.
    - Batch size N.
    - Leading channels Ca.
    - Channels to extract (the last Cb).
    - Spatial size H * W.
  Returns: flat tensor of size N * Cb * H * W.
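concat, split_a, and split_b form a round trip over the channel dimension; a pure-Python sketch of the flat-buffer index arithmetic (illustrative only):

```python
# Flat NCHW channel concat and its inverse splits. hw is the spatial size H*W.
def concat(a, b, n, ca, cb, hw):
    out = []
    for i in range(n):  # per batch element: Ca channels of a, then Cb of b
        out += a[i * ca * hw:(i + 1) * ca * hw]
        out += b[i * cb * hw:(i + 1) * cb * hw]
    return out

def split_a(x, n, ca, cb, hw):
    c = ca + cb
    return [v for i in range(n) for v in x[i * c * hw:i * c * hw + ca * hw]]

def split_b(x, n, ca, cb, hw):
    c = ca + cb
    return [v for i in range(n) for v in x[i * c * hw + ca * hw:(i + 1) * c * hw]]

a = [1.0, 2.0]            # N=1, Ca=1, H*W=2
b = [3.0, 4.0, 5.0, 6.0]  # N=1, Cb=2, H*W=2
x = concat(a, b, 1, 1, 2, 2)
```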
g.upsample_2x
  Nearest-neighbor 2× upsampling: [N, C, H, W] → [N, C, 2H, 2W].
  Parameters:
    - Flat input tensor of size N * C * H * W.
    - Batch size N.
    - Number of channels C.
    - Input height H.
    - Input width W.
  Returns: flat tensor of size N * C * (2H) * (2W).
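Nearest-neighbor 2× upsampling simply replicates each input pixel into a 2×2 block; a flat-buffer reference sketch (illustrative, not the shader):

```python
# Nearest-neighbor 2x upsample over a flat NCHW buffer:
# each input pixel expands into a 2x2 output block.
def upsample_2x(x, n, c, h, w):
    oh, ow = 2 * h, 2 * w
    out = [0.0] * (n * c * oh * ow)
    for img in range(n * c):  # iterate over each (batch, channel) plane
        for oy in range(oh):
            for ox in range(ow):
                out[(img * oh + oy) * ow + ox] = x[(img * h + oy // 2) * w + ox // 2]
    return out

y = upsample_2x([1.0, 2.0, 3.0, 4.0], 1, 1, 2, 2)  # [1,1,2,2] -> [1,1,4,4]
```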