The gradient descent module implements the weight and bias update step of the training process. It applies the gradient descent optimization algorithm to adjust network parameters based on computed gradients, enabling the network to learn from training data.
Module ports
- System clock signal
- Active-high reset signal
- Learning rate (η) as 16-bit fixed-point value
- `value_old_in`: Current parameter value (weight or bias) before update
- Computed gradient for this parameter
- `grad_descent_valid_in`: Start signal indicating valid gradient and parameter inputs
- `grad_bias_or_weight`: Parameter type selector: 0 = weight, 1 = bias
- `value_updated_out`: Updated parameter value after gradient descent step
- `grad_descent_done_out`: Completion signal indicating update is finished
Gradient descent algorithm
The module implements the standard gradient descent update rule:

$$\theta_{\text{new}} = \theta_{\text{old}} - \eta \, \nabla L$$

where:
- θ represents a parameter (weight or bias)
- η is the learning rate
- ∇L is the gradient of the loss with respect to the parameter
- The negative sign indicates moving opposite to the gradient (downhill)
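As a minimal floating-point sketch of this rule (the hardware itself works in fixed point, and the function name `gradient_descent_step` is hypothetical, chosen only for illustration):

```python
def gradient_descent_step(theta: float, grad: float, lr: float) -> float:
    """Return theta moved one step against the gradient."""
    return theta - lr * grad


# Minimise L(theta) = theta**2, whose gradient is 2 * theta.
theta = 1.0
for _ in range(3):
    theta = gradient_descent_step(theta, grad=2.0 * theta, lr=0.25)
# Each step halves theta: 1.0 -> 0.5 -> 0.25 -> 0.125
```

Because the step is taken against the gradient, `theta` moves toward the minimum at zero on every iteration.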
Operation
Update process
- Gradient scaling: Multiply gradient by learning rate using `fxp_mul`
  - Computes: `scaled_gradient = grad × lr`
  - Uses fixed-point multiplication to preserve precision
- Parameter update: Subtract scaled gradient from old value using `fxp_addsub`
  - Computes: `value_new = value_old - scaled_gradient`
  - Subtraction mode (`sub=1`) implements the negative gradient direction
- Output registration: On clock edge, register updated value and assert done signal
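The steps above can be sketched on raw 16-bit words in Python. This is a behavioural approximation, not the RTL: the helper names are hypothetical, and it assumes 8 fractional bits, round-to-nearest in the multiplier, and 16-bit wraparound rather than saturation. The actual rounding and overflow behaviour of `fxp_mul` and `fxp_addsub` should be checked in `fixedpoint.sv`.

```python
FRAC_BITS = 8        # assumed: 8 fractional bits
WORD_MASK = 0xFFFF   # 16-bit word


def to_signed16(x: int) -> int:
    """Interpret the low 16 bits of x as a two's-complement value."""
    x &= WORD_MASK
    return x - 0x10000 if x & 0x8000 else x


def fxp_mul_model(a: int, b: int) -> int:
    """Assumed model of fxp_mul: full product, round to nearest, drop 8 fractional bits."""
    product = to_signed16(a) * to_signed16(b)  # carries 16 fractional bits
    rounded = (product + (1 << (FRAC_BITS - 1))) >> FRAC_BITS
    return rounded & WORD_MASK


def fxp_sub_model(a: int, b: int) -> int:
    """Assumed model of fxp_addsub with sub=1: a - b, wrapping at 16 bits."""
    return (to_signed16(a) - to_signed16(b)) & WORD_MASK


def grad_descent_step_q88(value_old: int, grad: int, lr: int) -> int:
    scaled_gradient = fxp_mul_model(grad, lr)               # gradient scaling
    value_new = fxp_sub_model(value_old, scaled_gradient)   # parameter update
    return value_new                                        # value to be registered
```

Keeping the multiply and subtract as separate helpers mirrors the two arithmetic units, which makes it easier to swap in a different rounding assumption later.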
Pipeline stages
- Combinational computation:
  - Multiply: `mul_out = grad × lr`
  - Subtract: `sub_value_out = sub_in_a - mul_out`
- Registered output:
  - When `grad_descent_valid_in` is high, latch `value_updated_out = sub_value_out`
  - Assert `grad_descent_done_out` to signal completion
Weight vs. bias handling
The module uses different update strategies based on `grad_bias_or_weight`:
Weight updates (grad_bias_or_weight = 0)
Weights may require accumulated updates:
- If `grad_descent_done_out` is already asserted, use `value_updated_out` as the base
- Otherwise, use `value_old_in` as the base
- This allows multiple gradient contributions to be accumulated

See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv:50-55 for the implementation.
Bias updates (grad_bias_or_weight = 1)
Biases use simple updates:
- Always use `value_old_in` as the base
- Each update is independent

See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv:58-60.
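The base selection for weights and biases can be illustrated with a small cycle-level model. This is a sketch under stated assumptions, not a transcription of the RTL: it uses floats instead of fixed point for readability, and the class name and the `grad_in` and `lr_in` argument names are hypothetical.

```python
class GradientDescentModel:
    """Behavioural sketch: combinational mux and arithmetic, registered outputs."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        # Reset clears all outputs to zero.
        self.value_updated_out = 0.0
        self.grad_descent_done_out = False

    def tick(self, valid_in: bool, bias_or_weight: int,
             value_old_in: float, grad_in: float, lr_in: float) -> None:
        """Advance one clock edge."""
        # Combinational: choose the base for the subtraction.
        if bias_or_weight == 0 and self.grad_descent_done_out:
            base = self.value_updated_out   # weight: build on the previous result
        else:
            base = value_old_in             # bias, or first weight update
        sub_value_out = base - grad_in * lr_in
        # Sequential: register the result and the done flag.
        if valid_in:
            self.value_updated_out = sub_value_out
        self.grad_descent_done_out = valid_in


weight_unit = GradientDescentModel()
weight_unit.tick(True, 0, 1.0, 0.5, 0.5)   # 1.0 - 0.25 = 0.75
weight_unit.tick(True, 0, 1.0, 0.5, 0.5)   # done is set, so base is 0.75 -> 0.5
```

With `bias_or_weight = 1` the same two calls would both produce 0.75, since each bias update starts from `value_old_in`.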
Fixed-point arithmetic
The module uses 16-bit fixed-point representation (Q8.8 format):
- Multiplication (`fxp_mul`):
  - Computes `grad × lr` with proper binary point handling
  - See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv:33-38
  - Implementation at https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/fixedpoint.sv:278
- Subtraction (`fxp_addsub`):
  - Computes `value - (grad × lr)` with `sub=1` mode
  - See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv:40-46
  - Implementation at https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/fixedpoint.sv:186
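To make the Q8.8 encoding and the binary point handling concrete, here is a small Python sketch. The helper names `to_q88` and `from_q88` are hypothetical, and the rounding used when encoding (Python's built-in `round`) is an assumption for illustration rather than something the hardware defines.

```python
FRAC_BITS = 8
SCALE = 1 << FRAC_BITS  # 256


def to_q88(x: float) -> int:
    """Encode a real number as a 16-bit Q8.8 word (Python round(), wraps on overflow)."""
    return round(x * SCALE) & 0xFFFF


def from_q88(word: int) -> float:
    """Decode a 16-bit Q8.8 word as a signed real number."""
    signed = word - 0x10000 if word & 0x8000 else word
    return signed / SCALE


# Multiplying two Q8.8 integers yields 16 fractional bits,
# so the raw product is shifted right by 8 to return to Q8.8.
a, b = to_q88(1.5), to_q88(0.25)      # 384 and 64
product_q88 = (a * b) >> FRAC_BITS    # 24576 >> 8 = 96, i.e. 0.375
```

The shift is what "proper binary point handling" refers to: without it, the product would be interpreted as 256 times too large.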
Control flow
The module uses combinational and sequential logic:
Combinational logic
- Multiplexer for selecting subtraction input based on parameter type
- Fixed-point arithmetic units (multiply and subtract)

See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv:48-62.
Sequential logic
- Done signal registered from valid input signal
- Updated value registered when valid input is high
- Reset behavior clears all outputs to zero

See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv:64-77.
Integration with system
The gradient descent module operates outside the main VPU pipeline:
Typical usage flow
- Gradient computation: VPU computes gradients for all parameters
- Gradient accumulation: Gradients may be summed across batch (external to this module)
- Parameter update: For each parameter:
  - Load old value from unified buffer
  - Load corresponding gradient
  - Assert `grad_descent_valid_in`
  - Wait for `grad_descent_done_out`
  - Write updated value back to unified buffer
- Iteration: Repeat for all weights and biases in the network
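In software terms, the flow above amounts to a read, update, write-back loop over the stored parameters. The sketch below models the unified buffer as a plain dictionary and uses floats; the function name `update_all` and the parameter names are hypothetical, and the valid/done handshake is collapsed into a single expression.

```python
def update_all(unified_buffer: dict[str, float],
               gradients: dict[str, float],
               lr: float) -> None:
    """Apply one gradient descent step to every parameter, in place."""
    for name in list(unified_buffer):
        value_old = unified_buffer[name]    # load old value
        grad = gradients[name]              # load corresponding gradient
        value_new = value_old - lr * grad   # what the module computes
        unified_buffer[name] = value_new    # write updated value back


params = {"w0": 1.0, "b0": 0.5}
update_all(params, {"w0": 2.0, "b0": -1.0}, lr=0.25)
# params is now {"w0": 0.5, "b0": 0.75}
```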
Learning rate
The learning rate is:
- Set by the host system based on training hyperparameters
- Represented as fixed-point (e.g., η = 0.01 ≈ 0x0003 in Q8.8, which decodes to about 0.0117)
- Typically remains constant during training (though can be adjusted for learning rate schedules)
- Shared across all parameter updates in an epoch
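Because a format with 8 fractional bits has a resolution of 1/256 (about 0.0039), small learning rates are quantized coarsely. The following sketch shows the effect; it assumes round-to-nearest encoding on the host side, and `quantize_lr` is a hypothetical helper, not part of the project.

```python
def quantize_lr(lr: float, frac_bits: int = 8) -> tuple[int, float]:
    """Return the nearest fixed-point code for lr and the value it actually represents."""
    code = round(lr * (1 << frac_bits))
    return code, code / (1 << frac_bits)


for lr in (0.1, 0.01, 0.001):
    code, actual = quantize_lr(lr)
    print(f"lr={lr}: code=0x{code:04X}, effective={actual:.5f}")
```

Under these assumptions 0.01 maps to code 3 (an effective rate of about 0.0117), and 0.001 rounds to code 0, which would disable learning entirely, so very small rates need either more fractional bits or gradient scaling.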
Implementation details
- Latency: 1 clock cycle (combinational arithmetic + registered output)
- Throughput: 1 parameter update per cycle
- Parallelism: Module can be instantiated multiple times for concurrent updates
- Reset behavior: All outputs cleared to zero
- Done signal timing: Asserted one cycle after valid input
Example update
Consider updating a weight:
- Old weight: `w_old = 0.5` (0x0080 in Q8.8)
- Gradient: `∂L/∂w = 0.2` (0x0033 in Q8.8)
- Learning rate: `η = 0.1` (0x0019 in Q8.8)

The update proceeds as follows:
- Scale gradient: `0.2 × 0.1 = 0.02` (0x0005 in Q8.8)
- Update weight: `0.5 - 0.02 = 0.48` (0x0080 - 0x0005 = 0x007B in Q8.8)
- Output: `w_new ≈ 0.48`
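The same update can be traced on the raw Q8.8 integers (each real value multiplied by 256). This trace assumes the multiplier rounds to nearest; a truncating multiplier would yield 4 rather than 5 for the scaled gradient.

$$
\begin{aligned}
\eta \cdot \frac{\partial L}{\partial w} &: \quad \frac{51 \times 25}{256} = \frac{1275}{256} \approx 4.98 \;\to\; 5 \;(\texttt{0x0005}) \\
w_{\text{new}} &: \quad 128 - 5 = 123 \;(\texttt{0x007B}), \qquad \frac{123}{256} \approx 0.4805
\end{aligned}
$$

Converting the decimal result 0.48 directly gives 0.48 × 256 = 122.88, which is 122 (0x007A) when truncated and 123 (0x007B) when rounded, so the last hex digit depends on the rounding mode.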
Training loop integration
The gradient descent module is used during the parameter update phase of each training iteration:
- Forward pass: Compute predictions (VPU forward pathway)
- Loss computation: Compare predictions to targets (VPU transition pathway)
- Backward pass: Compute gradients (VPU backward pathway + systolic array)
- Parameter update: Apply gradient descent (this module)
- Repeat: Next training iteration with updated parameters
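The role of the update step within the loop can be seen in a toy software analogue. This sketch fits a single linear neuron in floating point; the model, data, and function name are illustrative assumptions and do not reflect the network or data path used on the TPU.

```python
def train(xs: list[float], ys: list[float], lr: float, epochs: int) -> tuple[float, float]:
    """Fit y = w * x + b with per-sample gradient descent on squared error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            pred = w * x + b                # forward pass
            err = pred - y                  # loss L = 0.5 * err**2
            grad_w, grad_b = err * x, err   # backward pass
            w -= lr * grad_w                # parameter update (this module's role)
            b -= lr * grad_b
    return w, b


# Data generated from y = 2x + 1; training should recover w close to 2 and b close to 1.
w, b = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], lr=0.05, epochs=500)
```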
Optimization considerations
Current implementation
- Basic gradient descent (no momentum, no adaptive learning rates)
- Single parameter updated per cycle
- Simple accumulation logic for weight updates
Potential enhancements
- Momentum: Add velocity term to smooth updates
- Adaptive learning rates: Per-parameter learning rate adjustment (Adam, RMSprop)
- Parallelization: Multiple gradient descent modules for faster updates
- Learning rate decay: Automatic reduction over time
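As an illustration of the first enhancement, classical momentum keeps a running velocity per parameter and steps along it instead of the raw gradient. The sketch below is a generic floating-point formulation, not a proposal for the actual RTL; the function name and the `beta` parameter are hypothetical.

```python
def momentum_step(theta: float, velocity: float, grad: float,
                  lr: float, beta: float = 0.9) -> tuple[float, float]:
    """Classical momentum: v <- beta * v + grad; theta <- theta - lr * v."""
    velocity = beta * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity


theta, v = 1.0, 0.0
theta, v = momentum_step(theta, v, grad=2.0, lr=0.25, beta=0.5)
```

In hardware this would require storing one extra velocity value per parameter alongside the weights and biases.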
Source files
- Module implementation: https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/gradient_descent.sv