Citizen science projects like Galaxy Zoo record the total votes each answer received for each morphology question. These questions are arranged as a decision tree: the question a volunteer is asked next depends on how they answered the previous ones. As a result, some questions receive many votes while others, asked only after specific prior answers, receive very few. A standard cross-entropy loss does not account for how many volunteers answered each question, so it is not well suited to this variable-votes setting.
This guide is only relevant if you are training on raw volunteer vote counts from a Galaxy Zoo-style survey. For classification (e.g. smooth vs. featured), use FinetuneableZoobotClassifier instead; see the Finetuning guide.

The Dirichlet-Multinomial Loss

Zoobot includes a custom Dirichlet-Multinomial loss designed specifically for Galaxy Zoo decision trees. It works by:
  1. Predicting a Dirichlet distribution over the probability of a typical volunteer giving each answer.
  2. Comparing the predicted distribution (given k volunteers were asked) to the true vote counts using the Dirichlet-Multinomial likelihood.
Both FinetuneableZoobotTree and ZoobotTree use this loss. To use it, you need to provide:
  • The vote counts for each image, as columns in a catalog.
  • A Schema object describing which answers belong to which questions and how the decision tree is structured.
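The two steps above can be sketched for a single question using only the standard library. This is a minimal illustration of the Dirichlet-Multinomial log-likelihood, not Zoobot's own implementation (which works on batched tensors across every question in the schema); the function name and the example concentrations are assumptions made for this sketch.

```python
import math

def dirichlet_multinomial_log_prob(counts, concentrations):
    """Log-likelihood of the observed vote counts for ONE question.

    counts: votes per answer (non-negative integers).
    concentrations: predicted Dirichlet concentrations, one per answer (all > 0).
    """
    total_votes = sum(counts)
    total_conc = sum(concentrations)
    # Terms that depend only on the totals.
    log_prob = (
        math.lgamma(total_votes + 1)
        + math.lgamma(total_conc)
        - math.lgamma(total_votes + total_conc)
    )
    # Per-answer terms.
    for n, a in zip(counts, concentrations):
        log_prob += math.lgamma(n + a) - math.lgamma(a) - math.lgamma(n + 1)
    return log_prob

# The loss for a question is the negative log-likelihood.
# Hypothetical predictions for a galaxy with 12 "smooth" and 28 "featured" votes:
loss_matching = -dirichlet_multinomial_log_prob([12, 28], [3.0, 7.0])
loss_mismatched = -dirichlet_multinomial_log_prob([12, 28], [7.0, 3.0])
print(loss_matching < loss_mismatched)

# A question nobody was asked has all-zero counts.
loss_unasked = -dirichlet_multinomial_log_prob([0, 0], [3.0, 7.0])
```

Under this formula, a question with zero total votes has a log-likelihood of exactly zero, so it adds nothing to the loss. That property is what lets one loss cover questions with very different numbers of votes.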

Step 1: Create the Vote Count Catalog

Create a catalog (pandas DataFrame) where each row represents a unique galaxy. The required columns are:
| Column | Description |
| --- | --- |
| id_str | Unique string identifier (e.g. J012345 or 1856_67919) |
| file_loc | Absolute path to the galaxy image (.jpg or .png) |
| (vote count columns) | One column per answer, e.g. smooth-or-featured_smooth |
For example, a GZ2-style catalog might look like:
| id_str | file_loc | smooth-or-featured_smooth | smooth-or-featured_featured-or-disk |
| --- | --- | --- | --- |
| J101419 | /path/to/J101419.jpg | 12 | 28 |
| J101420 | /path/to/J101420.jpg | 17 | 23 |
Answers with zero votes must be listed as 0, not left blank or set to NaN. Zoobot sums vote columns to get the total votes per question, so any NaN value produces an incorrect total and corrupts the loss calculation.
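One way to enforce the zero-votes rule is to clean the vote columns with pandas before training. The sketch below uses a hypothetical two-galaxy catalog and assumes that a missing value means the answer received no votes; if missing values in your data mean something else (for example, a failed join), fix the source instead of filling.

```python
import pandas as pd

# Hypothetical catalog; the second galaxy is missing one vote count.
catalog = pd.DataFrame({
    'id_str': ['J101419', 'J101420'],
    'file_loc': ['/path/to/J101419.jpg', '/path/to/J101420.jpg'],
    'smooth-or-featured_smooth': [12, 17],
    'smooth-or-featured_featured-or-disk': [28, None],
})

vote_cols = [
    'smooth-or-featured_smooth',
    'smooth-or-featured_featured-or-disk',
]

# Replace missing vote counts with 0 and store them as integers.
catalog[vote_cols] = catalog[vote_cols].fillna(0).astype(int)

# Sanity checks before training.
assert not catalog[vote_cols].isna().any().any()  # no NaN left
assert (catalog[vote_cols] >= 0).all().all()      # counts are non-negative
assert catalog['id_str'].is_unique                # one row per galaxy
```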

Step 2: Specify the Decision Tree with a Schema

A Schema object tells Zoobot which answers belong to which questions and which questions depend on prior answers. The galaxy-datasets library provides pre-written pairs and dependencies dicts for all major Galaxy Zoo surveys in label_metadata.py. For example:
# Inside galaxy-datasets/shared/label_metadata.py
gz2_pairs = {
    'smooth-or-featured': ['_smooth', '_featured-or-disk'],
    'disk-edge-on': ['_yes', '_no'],
    'has-spiral-arms': ['_yes', '_no']
    # etc.
}

gz2_dependencies = {
    'smooth-or-featured': None,  # always asked first
    'disk-edge-on': 'smooth-or-featured_featured-or-disk',
    'has-spiral-arms': 'smooth-or-featured_featured-or-disk'
    # etc.
}
Pass these to Schema to create the structured decision tree object:
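from zoobot.shared.schemas import Schema
# Import path inferred from the label_metadata.py location noted above
from galaxy_datasets.shared.label_metadata import gz2_pairs, gz2_dependencies

schema = Schema(gz2_pairs, gz2_dependencies)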
Decision trees for all major GZ projects are already specified in label_metadata.py. For a custom survey, define your own pairs and dependencies dicts following the same pattern.
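For a custom survey, the two dicts might look like the sketch below. The survey, question names and answer suffixes are hypothetical. The column-naming rule (question name plus answer suffix) is inferred from the GZ2 example above, and the consistency checks are an illustrative helper, not part of Zoobot.

```python
# Hypothetical two-question survey, for illustration only.
my_pairs = {
    'has-bar': ['_yes', '_no'],
    'bar-strength': ['_weak', '_strong'],
}
my_dependencies = {
    'has-bar': None,                # always asked
    'bar-strength': 'has-bar_yes',  # only asked after answering "yes"
}

# Answer column names are the question name plus the answer suffix.
label_cols = [q + a for q, answers in my_pairs.items() for a in answers]

# Every question needs a dependency entry, and every dependency
# must point at a real answer column.
assert set(my_pairs) == set(my_dependencies)
for question, parent_answer in my_dependencies.items():
    assert parent_answer is None or parent_answer in label_cols, question

print(label_cols)
```

These column names should match the vote count columns in your catalog exactly, since the same names are later passed as label_cols.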

Step 3: Finetune with FinetuneableZoobotTree

Pass the schema to FinetuneableZoobotTree. All standard finetuning parameters (learning rate, layer decay, etc.) are inherited from FinetuneableZoobotAbstract and work exactly as described in Choosing Parameters.
from zoobot.pytorch.training.finetune import FinetuneableZoobotTree
from zoobot.shared.schemas import Schema

model = FinetuneableZoobotTree(
    name='hf_hub:mwalmsley/zoobot-encoder-convnext_nano',
    schema=schema
)
Then set up a CatalogDataModule with all the vote count columns as label_cols:
from galaxy_datasets.pytorch.galaxy_datamodule import CatalogDataModule

datamodule = CatalogDataModule(
    train_catalog=train_catalog,
    val_catalog=val_catalog,
    label_cols=schema.label_cols,  # all vote count answer columns
    batch_size=32
)
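The train_catalog and val_catalog passed above are pandas DataFrames with the columns described in Step 1. If you start from a single catalog, one simple way to produce them is a random split, sketched here on a hypothetical ten-galaxy catalog. The 80/20 ratio and the seed are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical catalog; a real one also carries file_loc and the vote columns.
catalog = pd.DataFrame({'id_str': [f'galaxy_{i}' for i in range(10)]})

# Hold out 20% of galaxies for validation, with a fixed seed for reproducibility.
train_catalog = catalog.sample(frac=0.8, random_state=42)
val_catalog = catalog.drop(train_catalog.index)

# No galaxy should appear in both splits.
assert set(train_catalog['id_str']).isdisjoint(val_catalog['id_str'])
```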
Finally, train with get_trainer just as you would for any other finetuning task:
from zoobot.pytorch.training import finetune

save_dir = 'results/finetune_tree'  # hypothetical output directory
trainer = finetune.get_trainer(save_dir, accelerator='gpu', max_epochs=100)
trainer.fit(model, datamodule)
FinetuneableZoobotTree logs training and validation loss only. It does not report accuracy or RMSE, as these metrics are not meaningful in the Dirichlet-Multinomial setting.

Full Working Example

See finetune_counts_full_tree.py for a complete script that finetunes Zoobot on a GZ-style decision tree from start to finish.
