This section covers the nn:: structs for common normalization patterns and the low-level Graph ops for direct use in custom architectures.
nn::RmsNorm
RMS normalization: scales the input by the inverse RMS, then multiplies element-wise by a learned weight.
Formula: y = x / sqrt(mean(x²) + eps) * weight
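As a sanity check on the formula, here is a minimal sketch of the math for one row of the input as plain Rust. The free function `rms_norm` below is illustrative only, not the library's graph API, and operates on slices rather than graph tensors:

```rust
// Reference implementation of the RmsNorm formula over one row:
// y = x / sqrt(mean(x^2) + eps) * weight
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    // mean(x^2) over the feature dimension
    let mean_sq: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    // Scale by the inverse RMS, then multiply element-wise by the weight.
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

fn main() {
    let x = [3.0_f32, 4.0]; // mean(x^2) = 12.5, rms ≈ 3.5355
    let w = [1.0_f32, 1.0];
    let y = rms_norm(&x, &w, 1e-6);
    println!("{:?}", y); // ≈ [0.8485, 1.1314]
}
```

Note that unlike LayerNorm below, no mean is subtracted: RMS normalization rescales the row without recentering it.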
Fields
- Scale parameter of shape [dim].
- Small constant added to the denominator for numerical stability.
RmsNorm::new
- The computation graph to register the parameter into.
- Name for the weight parameter.
- Feature dimension (last dimension of the input).
- Stability epsilon (e.g. 1e-5 or 1e-6).
forward
- The computation graph to append ops to.
- 2D input tensor of shape [seq, dim].
Returns the normalized output tensor of shape [seq, dim].
nn::LayerNorm
Layer normalization with both a learned scale (weight) and a learned shift (bias).
Formula: y = (x - mean(x)) / sqrt(var(x) + eps) * weight + bias
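The formula can likewise be sketched as plain Rust for one row of the input. As before, `layer_norm` here is an illustrative free function over slices, not the library's graph API:

```rust
// Reference implementation of the LayerNorm formula over one row:
// y = (x - mean(x)) / sqrt(var(x) + eps) * weight + bias
fn layer_norm(x: &[f32], weight: &[f32], bias: &[f32], eps: f32) -> Vec<f32> {
    let n = x.len() as f32;
    let mean: f32 = x.iter().sum::<f32>() / n;
    // Biased (population) variance, as is conventional for layer norm.
    let var: f32 = x.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    x.iter()
        .zip(weight.iter().zip(bias))
        .map(|(v, (w, b))| (v - mean) * inv_std * w + b)
        .collect()
}

fn main() {
    // mean = 2, var = 1, so the row normalizes to roughly [-1, 1].
    let y = layer_norm(&[1.0_f32, 3.0], &[1.0, 1.0], &[0.0, 0.0], 1e-5);
    println!("{:?}", y); // ≈ [-1.0, 1.0]
}
```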
Fields
- Scale parameter of shape [dim]. Registered as {name}.weight.
- Shift parameter of shape [dim]. Registered as {name}.bias.
- Small constant added to the variance for numerical stability.
LayerNorm::new
- The computation graph to register parameters into.
- Name prefix. Registers {name}.weight and {name}.bias.
- Feature dimension (last dimension of the input).
- Stability epsilon (e.g. 1e-5).
forward
- The computation graph to append ops to.
- 2D input tensor of shape [seq, dim].
Returns the normalized output tensor of shape [seq, dim].
Graph normalization ops
g.rms_norm
Applies RMS normalization directly.
- 2D input tensor of shape [seq, dim].
- 1D weight tensor of shape [dim].
- Stability epsilon.
Returns the normalized tensor of shape [seq, dim].
g.layer_norm
Applies standard layer normalization.
- 2D input tensor of shape [seq, dim].
- 1D scale tensor of shape [dim].
- 1D shift tensor of shape [dim].
- Stability epsilon.
Returns the normalized tensor of shape [seq, dim].
g.group_norm
Group normalization over a flat NCHW tensor. Splits the channel dimension into num_groups groups and normalizes each group independently.
- Flat 1D input tensor representing [N, C, H, W] in NCHW order (total size N*C*H*W).
- Scale parameter of shape [C].
- Shift parameter of shape [C].
- Batch size N.
- Number of channels C. Must be divisible by num_groups.
- Spatial size H * W.
- Number of groups to divide channels into.
- Stability epsilon.
Returns the normalized flat tensor of the same shape as the input.
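To make the flat NCHW indexing concrete, here is a sketch of the group-norm math over a plain buffer. Again, `group_norm` is an illustrative free function under the parameter conventions listed above, not the library's graph op:

```rust
// Reference implementation of group norm on a flat NCHW buffer.
// n: batch size N, c: channels C (divisible by groups), hw: H*W, groups: num_groups.
fn group_norm(
    x: &[f32], weight: &[f32], bias: &[f32],
    n: usize, c: usize, hw: usize, groups: usize, eps: f32,
) -> Vec<f32> {
    assert_eq!(x.len(), n * c * hw);
    assert_eq!(c % groups, 0);
    let cg = c / groups; // channels per group
    let mut y = vec![0.0; x.len()];
    for b in 0..n {
        for g in 0..groups {
            // Contiguous slice covering one (batch, group): cg channels * hw elements.
            let start = b * c * hw + g * cg * hw;
            let len = cg * hw;
            let group = &x[start..start + len];
            let mean: f32 = group.iter().sum::<f32>() / len as f32;
            let var: f32 =
                group.iter().map(|v| (v - mean).powi(2)).sum::<f32>() / len as f32;
            let inv_std = 1.0 / (var + eps).sqrt();
            for i in 0..len {
                // Per-channel scale/shift: recover the channel index within [0, C).
                let ch = g * cg + i / hw;
                y[start + i] = (x[start + i] - mean) * inv_std * weight[ch] + bias[ch];
            }
        }
    }
    y
}

fn main() {
    // N=1, C=2, H*W=2, groups=1: all four elements are normalized together.
    let y = group_norm(&[1.0, 3.0, 1.0, 3.0], &[1.0, 1.0], &[0.0, 0.0], 1, 2, 2, 1, 1e-5);
    println!("{:?}", y); // ≈ [-1.0, 1.0, -1.0, 1.0]
}
```

With groups == C this reduces to instance norm, and with groups == 1 it normalizes all channels of a sample jointly.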
During inference optimization, the compiler may automatically fuse adjacent
GroupNorm + SiLU sequences into a single GroupNormSilu kernel. This fusion is applied transparently and does not require changes to your graph construction code.