FinetuneableZoobot classes share a common set of parameters for controlling the finetuning process. These can have a big effect on performance.
Finetuning is fast and easy to experiment with, so we recommend trying different parameters to see what works best for your dataset. The full parameter list lives in FinetuneableZoobotAbstract.
The parameters below are listed in rough order of importance.
Parameters
learning_rate (default: 1e-4)
Learning rate sets how fast the model parameters are updated during training.

Zoobot uses the adaptive optimizer AdamW. Adaptive optimizers adjust the learning rate for each parameter based on the mean and variance of previous gradients, which means you don’t need to tune the learning rate as carefully as you would with a fixed-rate optimizer like SGD.

`1e-4` is a good starting point for most tasks.

- If the model is not learning, try increasing the learning rate.
- If the training loss varies wildly, or the train loss decreases much faster than the validation loss (a sign of overfitting), try decreasing it.
- Using `training_mode='full'` often requires a lower learning rate than `training_mode='head_only'`, because more parameters are being updated per batch.
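
To illustrate why an adaptive optimizer is less sensitive to the exact learning rate, the sketch below applies a single AdamW-style update to one scalar parameter in plain Python. It is only an illustration of the update rule, not Zoobot's or PyTorch's actual code; the function name `adamw_step` and the `beta1`, `beta2` and `eps` values are assumptions chosen for the example.

```python
import math

def adamw_step(param, grad, m, v, step, lr=1e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One AdamW-style update for a single scalar parameter (illustrative)."""
    # Decoupled weight decay: shrink the parameter directly.
    param = param * (1.0 - lr * weight_decay)
    # Running estimates of the gradient mean and (uncentered) variance.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction for the zero-initialised running estimates.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    # The step is normalised by the gradient scale, so its size is set by lr.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two parameters whose gradients differ by six orders of magnitude.
small, _, _ = adamw_step(param=1.0, grad=1e-3, m=0.0, v=0.0, step=1)
large, _, _ = adamw_step(param=1.0, grad=1e3, m=0.0, v=0.0, step=1)
print(small, large)
```

On this first step both parameters move by roughly `lr`, despite the very different gradient magnitudes. A fixed-rate optimizer like SGD would instead move them by `lr * grad`.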
training_mode ('full' vs 'head_only')
Deep learning models are often divided into an encoder (which extracts features from images) and a head (which makes predictions from those features). In Zoobot, when you load `FinetuneableZoobotClassifier(name='hf_hub:mwalmsley/zoobot-encoder-convnext_nano', ...)`, the encoder is the ConvNeXt model.

`training_mode` controls which parts of the model are updated during training:

| Mode | Description | Also known as |
|---|---|---|
| `'full'` (default) | Trains both encoder and head end-to-end | End-to-end finetuning |
| `'head_only'` | Freezes the encoder; trains only the new head | Transfer learning, linear probing |

End-to-end finetuning (`'full'`) can give better results, but often requires more labelled data (or a smaller pretrained model) and more careful tuning of the learning rate and other hyperparameters.

Linear probing (`'head_only'`) is a useful starting point when you have very little data, or as a quick sanity check before committing to full finetuning.
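
As a rough sketch of how the two modes might be compared, the snippet below builds two sets of keyword arguments that differ only in `training_mode` and `learning_rate`. The parameter names and defaults come from this page; the `head_only` learning rate of `1e-3` is an illustrative assumption rather than a recommendation, and any other constructor arguments your task needs are left out.

```python
# Shared settings, using the defaults quoted on this page.
common_kwargs = {
    "name": "hf_hub:mwalmsley/zoobot-encoder-convnext_nano",
    "layer_decay": 0.75,
    "weight_decay": 0.05,
    "head_dropout_prob": 0.5,
}

# Quick sanity check: freeze the encoder and train only the head.
# The learning rate here is an assumed, illustrative value.
head_only_kwargs = {
    **common_kwargs,
    "training_mode": "head_only",
    "learning_rate": 1e-3,
}

# Full finetuning: update encoder and head, with a lower learning rate.
full_kwargs = {
    **common_kwargs,
    "training_mode": "full",
    "learning_rate": 1e-4,
}

# Each dict would then be unpacked into the model constructor, e.g.
# FinetuneableZoobotClassifier(**head_only_kwargs, ...)
```

Keeping the shared settings in one dict makes it easier to confirm that a difference in results comes from the training mode and learning rate rather than from another parameter.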
layer_decay (default: 0.75)
The common intuition in deep learning is that lower layers (closer to the input) learn simple, general features, while higher layers (closer to the output) learn more complex, task-specific features. It is often beneficial to use a lower learning rate for lower layers that have already learned to recognise basic galaxy features.

Layer decay reduces the learning rate for each successive encoder block from the top down.

For example, with `learning_rate=1e-4` and `layer_decay=0.75` (the default):

| Block | Learning Rate |
|---|---|
| Highest (nearest output) | 1e-4 × (0.75 ** 0) = 1e-4 |
| Second-highest | 1e-4 × (0.75 ** 1) = 7.5e-5 |
| Third-highest | 1e-4 × (0.75 ** 2) = 5.6e-5 |
| … and so on | … |

The head always uses the full learning rate, regardless of layer decay.

In the extreme cases:

- `layer_decay=0` disables learning in all encoder blocks except the topmost (0 ** 0 = 1).
- `layer_decay=1` gives every block the same learning rate (no decay).

This is slightly counterintuitive: a lower `layer_decay` value means a faster learning rate reduction for lower blocks.
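
The per-block arithmetic can be reproduced with a few lines of Python. This is a minimal sketch of the formula described in this section, not Zoobot's internal implementation; the helper name `block_learning_rates` and the `num_blocks` argument are hypothetical, and the real number of blocks depends on the encoder.

```python
def block_learning_rates(learning_rate, layer_decay, num_blocks):
    """Per-block learning rates, ordered from the topmost block downwards."""
    # Block i (counting down from the top, starting at 0) gets
    # learning_rate * layer_decay ** i.
    return [learning_rate * layer_decay ** i for i in range(num_blocks)]

rates = block_learning_rates(learning_rate=1e-4, layer_decay=0.75, num_blocks=4)
for depth, rate in enumerate(rates):
    print(f"block {depth} from the top: {rate:.2e}")

# Extreme cases: only the top block learns, or every block learns equally.
top_only = block_learning_rates(1e-4, layer_decay=0, num_blocks=3)
uniform = block_learning_rates(1e-4, layer_decay=1, num_blocks=3)
```

Printing the rates for a few candidate `layer_decay` values is a quick way to check how small the learning rate becomes in the lowest blocks before starting a run.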
weight_decay (default: 0.05)
Weight decay is a regularization term that penalizes large weight values. When using Zoobot’s default AdamW optimizer, it is closely related to L2 regularization (see Decoupled Weight Decay Regularization for the subtlety).
- Increasing weight decay strengthens the penalty on large weights, which can help prevent overfitting.
- Decreasing it can help if the model is underfitting or training too slowly.

The default is `0.05`. The head does not use weight decay.
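
To give a feel for the size of the effect, the sketch below applies only the decoupled decay term, leaving out the gradient-based part of the update. It assumes the AdamW-style formulation in which each step multiplies the weight by `1 - learning_rate * weight_decay`; the function name `decay_only` is hypothetical.

```python
def decay_only(weight, learning_rate, weight_decay, num_steps):
    """Apply only the decoupled weight decay term for `num_steps` updates."""
    for _ in range(num_steps):
        # Decoupled decay: scale the weight towards zero, independently of
        # the gradient-based update (which is omitted in this sketch).
        weight *= 1.0 - learning_rate * weight_decay
    return weight

# With the defaults quoted on this page, 10,000 steps shrink a weight by ~5%.
w = decay_only(weight=1.0, learning_rate=1e-4, weight_decay=0.05, num_steps=10_000)
print(f"{w:.4f}")
```

Because the decay is multiplied by the learning rate, lowering `learning_rate` also weakens the effective regularization per step.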
head_dropout_prob (default: 0.5)
Dropout is a regularization technique that randomly sets a fraction of activations to zero during training. This prevents the model from becoming overly dependent on any single feature, which helps guard against overfitting.

Zoobot applies dropout before the final linear output layer in the head. The default probability is `0.5` (i.e. 50% of activations are zeroed per forward pass during training).

- If the model overfits, try increasing `head_dropout_prob`.
- If the model underfits or the head is not learning, try decreasing it.
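
The mechanism can be sketched in plain Python. The example below uses the common "inverted dropout" formulation, in which surviving activations are rescaled by `1 / (1 - p)` during training and nothing happens at evaluation time. It is an illustration only, not Zoobot's code, and the function name `dropout` and its arguments are assumptions.

```python
import random

def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout on a list of floats (illustrative sketch)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    if not training or p == 0.0:
        # At evaluation time dropout is a no-op.
        return list(activations)
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)
    # Zero each activation with probability p; rescale the survivors so the
    # expected value of each activation is unchanged.
    return [0.0 if rng.random() < p else a * scale for a in activations]

features = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(features, p=0.5, rng=random.Random(0))
eval_out = dropout(features, training=False)
print(train_out, eval_out)
```

With `p=0.5`, each training-time output is either zero or twice the input, while the evaluation output equals the input.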
scheduler_kwargs (default: None)
Gradually reducing the learning rate during training can slightly improve results by finding a better minimum near convergence. This is called learning rate scheduling.

Zoobot supports the full suite of timm learning rate schedulers. Pass a dict of scheduler arguments to `scheduler_kwargs`.

By default, no scheduler is used (`scheduler_kwargs=None`). We recommend only adding a scheduler after you have already tuned the other parameters, as it adds another degree of freedom to your search.
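
To make the idea of scheduling concrete, the sketch below computes a cosine-annealed learning rate per epoch in plain Python. Cosine annealing is one common schedule, chosen here as an assumption for illustration. The function `cosine_lr` is hypothetical and does not show the keys that `scheduler_kwargs` expects, so check the timm scheduler documentation for those.

```python
import math

def cosine_lr(epoch, num_epochs, base_lr=1e-4, min_lr=0.0):
    """Cosine-annealed learning rate for a given epoch (illustrative)."""
    # Fraction of training completed, clamped to 1 after the final epoch.
    progress = min(epoch, num_epochs) / num_epochs
    # Smoothly interpolate from base_lr (progress=0) to min_lr (progress=1).
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [cosine_lr(epoch, num_epochs=10) for epoch in range(11)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.2e}")
```

The learning rate starts at `base_lr`, falls slowly at first, then faster, and flattens out again as it approaches `min_lr` near the end of training.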