The optimizer package provides update rules that adjust model parameters using gradients computed by the autograd engine. All optimizers accept any value that satisfies the Model interface and support pre-update gradient hooks.
Model interface
Any value with a Params() method that returns layer.Parameters can be passed to an optimizer. Both model.MLP and model.LSTM satisfy this interface.
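As a rough sketch, the interface is equivalent to the following (the exact definition lives in the optimizer package):

```go
// Model is satisfied by any value that can hand its trainable
// parameters to an optimizer.
type Model interface {
	Params() layer.Parameters
}
```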
Hook type
Hook is a function that receives the list of parameters with non-nil gradients before the parameter update step. Use hooks to apply regularization or gradient clipping globally without changing the optimizer implementation.
The hook package provides two ready-made hooks: WeightDecay and ClipGrad.
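The concrete signature is defined in the optimizer package; a hook is essentially a function over the filtered parameter slice. As a sketch (the []layer.Parameter element type is an assumption here), a custom hook that only reports how many parameters are about to be updated could look like:

```go
// countParams is a hypothetical custom hook: it runs before each update
// step and reports how many parameters have a gradient and will be updated.
func countParams(params []layer.Parameter) {
	fmt.Printf("updating %d parameters\n", len(params))
}
```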
Params helper
Params collects the parameters of m that have a non-nil gradient, applies each hook in order, then returns the filtered parameter slice. All optimizers call this internally, so you rarely need to call it directly.
Arguments:
- m: the model to collect parameters from.
- hooks: hook functions to run on the collected parameters before they are returned.
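If you do call it directly, for example to inspect gradients before a manual update, the call looks roughly like this (the hook argument and the Grad field shown are assumptions):

```go
// Collect every parameter of m that has a gradient, applying the
// weight-decay hook first.
params := optimizer.Params(m, hook.WeightDecay(1e-4))
for _, p := range params {
	fmt.Println(p.Grad) // inspect gradients before updating
}
```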
SGD
Stochastic gradient descent. Updates each parameter by subtracting the gradient scaled by the learning rate:

W ← W − lr·grad

Fields:
- Learning rate: step size applied to each gradient update.
- Hooks: gradient hooks that run before each update step.
Update
Applies the update rule to all parameters of model that have a gradient.
Momentum
SGD with momentum. Accumulates a velocity vector v per parameter and updates parameters using:

v ← momentum·v − lr·grad
W ← W + v

Fields:
- Learning rate: step size applied to each gradient.
- Momentum: fraction of the previous velocity retained at each step. Typical value: 0.9.
- Hooks: gradient hooks that run before each update step.
Update
Applies the Momentum update rule to all parameters of model that have a gradient.
Adam
Adaptive moment estimation. Maintains per-parameter first and second moment estimates and applies bias correction:

m ← β1·m + (1 − β1)·grad
v ← β2·v + (1 − β2)·grad²
m̂ = m / (1 − β1^t),  v̂ = v / (1 − β2^t)
W ← W − α·m̂ / (√v̂ + ε)

where ε is a small constant for numerical stability.

Fields:
- Alpha: base learning rate. Typical value: 0.001.
- Beta1: exponential decay rate for the first moment estimate. Typical value: 0.9.
- Beta2: exponential decay rate for the second moment estimate. Typical value: 0.999.
- Hook: gradient hooks that run before each update step.
Update
Applies the Adam update rule to all parameters of model that have a gradient.
The Adam struct maintains internal state (the ms and vs maps and an iteration counter). Reuse the same Adam instance across training steps; creating a new one each step discards the accumulated moments.

AdamW
AdamW extends Adam with decoupled weight decay applied directly to the parameters rather than through the gradient. This avoids the interaction between adaptive learning rates and L2 regularization.

Fields:
- Adam: embedded Adam optimizer. Set Alpha, Beta1, Beta2, and Hook here.
- Weight decay coefficient λ: typical value 0.01.

Update
Applies the AdamW update rule to all parameters of model that have a gradient.
Examples
SGD
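A minimal training-step sketch. The model.NewMLP constructor, the Cleargrads method, and the LearningRate field name are assumptions made for illustration; check the package source for the exact names.

```go
package main

import (
	"github.com/itsubaki/autograd/model"
	"github.com/itsubaki/autograd/optimizer"
)

func main() {
	m := model.NewMLP(10, 1) // hypothetical constructor: 10 hidden units, 1 output

	opt := optimizer.SGD{
		LearningRate: 0.01, // assumed field name for the step size
	}

	for i := 0; i < 100; i++ {
		m.Cleargrads() // assumed method: reset gradients from the previous step

		// Forward pass and loss.Backward() go here; Backward fills the
		// gradient of every parameter returned by m.Params().

		opt.Update(m) // subtract LearningRate * grad from each parameter
	}
}
```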
Momentum
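The same loop works with Momentum; only the optimizer changes. The field names below are assumptions:

```go
// Assumed field names; m is the model from the SGD example.
opt := optimizer.Momentum{
	LearningRate: 0.01, // step size applied to each gradient
	Momentum:     0.9,  // fraction of the previous velocity retained
}

// After loss.Backward():
opt.Update(m) // v = 0.9*v - 0.01*grad; W = W + v
```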
Adam
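Adam keeps its moment estimates between steps, so construct it once outside the training loop and reuse it. Alpha, Beta1, and Beta2 are the field names documented above:

```go
// Construct once; the ms/vs maps and iteration counter live inside opt.
opt := optimizer.Adam{
	Alpha: 0.001, // base learning rate
	Beta1: 0.9,   // decay rate for the first moment estimate
	Beta2: 0.999, // decay rate for the second moment estimate
}

for i := 0; i < 100; i++ {
	// forward pass, m.Cleargrads(), loss.Backward() ...
	opt.Update(m) // reusing opt keeps the accumulated moments
}
```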
AdamW
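AdamW embeds Adam, so the Adam fields are set on the embedded struct. The WeightDecay field name is an assumption for the coefficient λ:

```go
opt := optimizer.AdamW{
	Adam: optimizer.Adam{ // embedded Adam: same Alpha/Beta1/Beta2 fields
		Alpha: 0.001,
		Beta1: 0.9,
		Beta2: 0.999,
	},
	WeightDecay: 0.01, // assumed field name for the decay coefficient λ
}

// After loss.Backward():
opt.Update(m) // Adam step, plus the decay term applied directly to the parameters
```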
Attaching hooks
WeightDecay adds L2 regularization to the gradients; ClipGrad rescales gradients whose global norm exceeds the threshold.
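A sketch of attaching both hooks to Adam. The Hook field name comes from the Adam section above; the numeric arguments to WeightDecay and ClipGrad, and the []optimizer.Hook field type, are assumptions:

```go
// Assumes: import "github.com/itsubaki/autograd/hook"
opt := optimizer.Adam{
	Alpha: 0.001,
	Beta1: 0.9,
	Beta2: 0.999,
	Hook: []optimizer.Hook{
		hook.WeightDecay(1e-4), // assumed signature: L2 penalty rate
		hook.ClipGrad(5.0),     // assumed signature: max global gradient norm
	},
}

// Both hooks run on the collected parameters before every update.
opt.Update(m)
```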
See also
- model package: the MLP and LSTM model types
- hook package: the WeightDecay and ClipGrad hooks
- Gradient descent guide: walkthrough of the training loop