The loss module computes the gradient of the mean squared error (MSE) loss function with respect to the network’s output. This gradient serves as the starting point for backpropagation through the neural network layers.
Architecture
The loss module follows a parent-child structure:
- loss_parent: Top-level module instantiating two loss_child modules
- loss_child: Individual processing unit computing gradient for one output column
Module ports
loss_parent
System clock signal
Active-high reset signal
Network output (prediction) for column 1
Network output (prediction) for column 2
Target value (ground truth) for column 1 from unified buffer
Target value (ground truth) for column 2 from unified buffer
Valid signal for column 1 inputs
Valid signal for column 2 inputs
Precomputed scaling factor (2/N) as fixed-point value from unified buffer
Computed gradient for column 1
Computed gradient for column 2
Valid signal for column 1 output
Valid signal for column 2 output
loss_child
System clock signal
Active-high reset signal
Network output (prediction)
Target value (ground truth)
Input valid signal
Scaling factor (2/N) as fixed-point value
Computed gradient
Output valid signal
Loss function
The mean squared error loss is defined as:
L = (1/N) × Σ (H - Y)², summed over the N samples in the batch
where:
- N is the batch size
- H is the network output (prediction)
- Y is the target value
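Differentiating the MSE with respect to each prediction gives the per-element gradient (2/N) × (H - Y), which is the quantity the hardware produces. The floating-point sketch below illustrates that relationship. The function names `mse_loss` and `mse_gradient` are hypothetical and are not part of the tiny-tpu sources.

```python
def mse_loss(h, y):
    """Mean squared error over N samples: (1/N) * sum((h_i - y_i)^2)."""
    n = len(h)
    return sum((hi - yi) ** 2 for hi, yi in zip(h, y)) / n


def mse_gradient(h, y):
    """Per-element gradient dL/dh_i = (2/N) * (h_i - y_i)."""
    scale = 2.0 / len(h)  # the same 2/N factor the hardware receives precomputed
    return [scale * (hi - yi) for hi, yi in zip(h, y)]


h = [0.75, 0.25, 0.5, 1.0]   # example predictions (illustrative values)
y = [1.0, 0.0, 0.5, 0.0]     # example targets (illustrative values)

print(mse_loss(h, y))        # 0.28125
print(mse_gradient(h, y))    # [-0.125, 0.125, 0.0, 0.5]
```

With N = 4 the scale is 0.5, so each gradient element is half of the prediction error for that sample.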
Operation
The loss_child module implements a two-stage pipeline for MSE gradient computation:
Pipeline stages
- Stage 1 - Difference computation:
  - Compute diff = H - Y using fxp_addsub with subtraction mode
  - This represents the prediction error
- Stage 2 - Scaling:
  - Multiply difference by 2/N using fxp_mul
  - Results in final gradient: gradient = (2/N) × (H - Y)
- Registered output:
  - On clock edge, register the gradient and propagate valid signal
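The two stages and the output register can be mirrored in a small software model, which may help when checking expected values in a testbench. This is a minimal sketch, not the RTL: the class `LossChildModel`, its method names, and the choice to clamp on overflow are assumptions, and the real loss_child may treat overflow and invalid inputs differently.

```python
Q_FRAC_BITS = 8                           # Q8.8 has 8 fractional bits
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1  # signed 16-bit range


def clamp16(value):
    """Clamp to the signed 16-bit range (assumed overflow policy)."""
    return max(Q_MIN, min(Q_MAX, value))


class LossChildModel:
    """Hypothetical software model of one loss_child lane."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Reset clears the gradient and valid registers to zero.
        self.gradient = 0
        self.valid = False

    def tick(self, h, y, scale, valid_in):
        """Advance one clock edge; h, y and scale are Q8.8 integers."""
        diff = clamp16(h - y)                            # stage 1: H - Y
        scaled = clamp16((diff * scale) >> Q_FRAC_BITS)  # stage 2: (2/N) * diff
        self.gradient = scaled                           # registered output
        self.valid = bool(valid_in)                      # valid follows input by one tick


lane = LossChildModel()
# H = 0.75 (0x00C0), Y = 1.0 (0x0100), 2/N = 0.5 (0x0080)
lane.tick(h=0x00C0, y=0x0100, scale=0x0080, valid_in=True)
print(lane.gradient / 256, lane.valid)  # -0.125 True
```

The model updates the gradient register on every tick and relies on the valid flag to mark meaningful results; that is a modelling choice made here, so consumers should only read the gradient when valid is set.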
Fixed-point arithmetic
The module uses 16-bit signed fixed-point (Q8.8 format) for all operations:
- Subtraction: fxp_addsub with sub=1 computes H - Y
  - See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/fixedpoint.sv:186 for implementation
  - Handles sign extension and overflow detection
- Multiplication: fxp_mul scales the difference by 2/N
  - See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/fixedpoint.sv:278 for implementation
  - Properly positions binary point after multiplication
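To make the Q8.8 arithmetic concrete, the sketch below encodes values as 16-bit integers with 8 fractional bits and models a subtract and a multiply that each report overflow. The helper names (`to_q88`, `q88_sub`, `q88_mul`) are hypothetical. Saturating on overflow and truncating the product by an arithmetic right shift are assumptions; fxp_addsub and fxp_mul in fixedpoint.sv may round or handle overflow differently.

```python
Q_FRAC_BITS = 8
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1


def saturate(value):
    """Clamp to signed 16 bits; also return whether clamping occurred."""
    clamped = max(Q_MIN, min(Q_MAX, value))
    return clamped, clamped != value


def to_q88(x):
    """Encode a real number as a Q8.8 integer (round to nearest, then clamp)."""
    return saturate(round(x * (1 << Q_FRAC_BITS)))[0]


def from_q88(v):
    """Decode a Q8.8 integer back to a float."""
    return v / (1 << Q_FRAC_BITS)


def q88_sub(a, b):
    """Model of a Q8.8 subtract: returns (result, overflow)."""
    return saturate(a - b)


def q88_mul(a, b):
    """Model of a Q8.8 multiply: returns (result, overflow).

    The full product carries 16 fractional bits, so shifting right by 8
    puts the binary point back in the Q8.8 position.
    """
    return saturate((a * b) >> Q_FRAC_BITS)


diff, sub_ovf = q88_sub(to_q88(0.75), to_q88(1.0))
grad, mul_ovf = q88_mul(diff, to_q88(0.5))
print(from_q88(grad), sub_ovf, mul_ovf)        # -0.125 False False
print(q88_sub(to_q88(100.0), to_q88(-100.0)))  # (32767, True)
```

Q8.8 covers roughly -128 to +127.996 in steps of 1/256, so a difference such as 100 - (-100) cannot be represented, which is the kind of case the overflow detection exists for.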
Precomputed scaling factor
The factor 2/N is:
- Precomputed by the host and stored in unified buffer
- Represented as fixed-point (e.g., for N=4, 2/N = 0.5 = 0x0080 in Q8.8)
- Shared across all gradient computations in the batch
- Avoids expensive division operations in hardware
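On the host side, producing the scaling factor amounts to one division and one Q8.8 encode per batch size. The sketch below shows one way this could be done. The function name `scale_factor_q88` is hypothetical, and rounding to nearest is an assumption, since the source does not say how the host quantizes the value.

```python
def scale_factor_q88(batch_size):
    """Return 2/N encoded as a 16-bit Q8.8 word (round to nearest, assumed)."""
    if batch_size <= 0:
        raise ValueError("batch size must be positive")
    word = round((2.0 / batch_size) * 256)  # 256 = 2**8 fractional steps
    if word > 0x7FFF:
        raise ValueError("2/N does not fit in signed Q8.8")
    return word


for n in (1, 2, 4, 8, 3):
    word = scale_factor_q88(n)
    print(f"N={n}: 0x{word:04X} (decodes to {word / 256})")
```

For N = 4 this yields 0x0080, matching the example in the list. Batch sizes that are not powers of two introduce a small quantization error (N = 3 encodes as 0x00AB, which decodes to 0.66796875), and very large batches push 2/N below the 1/256 resolution of Q8.8.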
Integration with VPU
The loss module is active only during the transition pathway:
- Pathway 1111 (transition): systolic → bias → leaky_relu → loss → leaky_relu_derivative → output
When vpu_data_pathway[1] is set to 1:
- Leaky ReLU outputs (H matrix) route to loss module H inputs
- Target values (Y) provided from unified buffer
- Scaling factor (2/N) provided from unified buffer
- Loss gradients route to leaky ReLU derivative module
- H values are simultaneously cached for use in backward pass
See https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/vpu.sv:268-302 for the loss routing logic.
Data flow
Implementation details
- Latency: 1 clock cycle (both stages are combinational and feed a single registered output)
- Throughput: 2 gradients per cycle
- Overflow handling: Both subtraction and multiplication detect overflow
- Reset behavior: Gradient output and valid signals cleared to zero
- Valid signal propagation: Input valid signal is registered and becomes output valid signal
Transition phase operation
The loss module is critical during the transition between forward and backward passes:
- Final forward layer: Produces output H (predictions)
- Loss computation: Computes gradients comparing H to targets Y
- Backward pass start: Gradients seed the backpropagation process
- H caching: Output layer activations are cached for computing activation derivatives
Source files
- Parent module: https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/loss_parent.sv
- Child module: https://github.com/tiny-tpu-v2/tiny-tpu/blob/main/src/loss_child.sv