Training Zoobot on Galaxy Zoo Vote Count Decision Trees

Citizen science projects like Galaxy Zoo record the total votes each answer received for each morphology question. These questions are arranged as a decision tree: the question a volunteer is asked depends on how they answered previous questions. As a result, some questions receive many votes while others, asked only after specific prior answers, receive very few. A standard cross-entropy loss is not well suited to this variable-votes setting.

This guide is only relevant if you are training on raw volunteer vote counts from a Galaxy Zoo-style survey. For classification (e.g. smooth vs. featured), use FinetuneableZoobotClassifier instead — see the Finetuning guide.

The Dirichlet-Multinomial Loss

Zoobot includes a custom Dirichlet-Multinomial loss designed specifically for Galaxy Zoo decision trees. It works by:

Predicting a Dirichlet distribution over the probability of a typical volunteer giving each answer.
Comparing the predicted distribution (given k volunteers were asked) to the true vote counts using the Dirichlet-Multinomial likelihood.
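The two steps above can be illustrated for a single question with a small, standard-library sketch of the Dirichlet-Multinomial log-likelihood. This is not Zoobot's own implementation: the function names are hypothetical, and it assumes the model has already produced one positive concentration per answer.

```python
from math import exp, lgamma


def dirichlet_multinomial_log_prob(votes, concentrations):
    """Log-probability of the observed vote counts for one question.

    votes: observed count for each answer, e.g. [12, 28]
    concentrations: predicted Dirichlet parameters (all > 0), one per answer
    """
    n = sum(votes)             # total volunteers who answered this question
    a0 = sum(concentrations)   # total concentration
    log_prob = lgamma(n + 1) + lgamma(a0) - lgamma(n + a0)
    for x, a in zip(votes, concentrations):
        log_prob += lgamma(x + a) - lgamma(a) - lgamma(x + 1)
    return log_prob


def question_loss(votes, concentrations):
    """Negative log-likelihood for one question (hypothetical helper)."""
    return -dirichlet_multinomial_log_prob(votes, concentrations)


votes = [12, 28]
flat_loss = question_loss(votes, [1.0, 1.0])      # uninformative prediction
matched_loss = question_loss(votes, [3.0, 7.0])   # prediction centred on 30% / 70%
print(flat_loss, matched_loss)
```

Because the likelihood is conditioned on the number of volunteers who actually answered, a question with few votes constrains the prediction less than a question with many votes.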

Both FinetuneableZoobotTree and ZoobotTree use this loss. To use it, you need to provide:

The vote counts for each image, as columns in a catalog.
A Schema object describing which answers belong to which questions and how the decision tree is structured.

Step 1: Create the Vote Count Catalog

Create a catalog (pandas DataFrame) where each row represents a unique galaxy. The required columns are:

Column	Description
`id_str`	Unique string identifier (e.g. `J012345` or `1856_67919`)
`file_loc`	Absolute path to the galaxy image (.jpg or .png)
(vote count columns)	One column per answer, e.g. `smooth-or-featured_smooth`

For example, a GZ2-style catalog might look like:

id_str	file_loc	smooth-or-featured_smooth	smooth-or-featured_featured-or-disk
J101419	/path/to/J101419.jpg	12	28
J101420	/path/to/J101420.jpg	17	23

Answers with zero votes must be listed as 0, not left blank or set to NaN. Zoobot sums vote columns to get the total votes per question — any NaN values will cause incorrect totals and corrupt the loss calculation.
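A quick check before training might look like the sketch below. It assumes every column other than `id_str` and `file_loc` is a vote count column, and that a blank cell really means "no votes"; the third row is a made-up example added to show a missing value.

```python
import numpy as np
import pandas as pd

catalog = pd.DataFrame({
    'id_str': ['J101419', 'J101420', 'J101421'],
    'file_loc': ['/path/to/J101419.jpg', '/path/to/J101420.jpg', '/path/to/J101421.jpg'],
    'smooth-or-featured_smooth': [12, 17, 40],
    # hypothetical third row where the answer received no votes and was left blank
    'smooth-or-featured_featured-or-disk': [28, 23, np.nan],
})

# Assumption: everything except the two metadata columns is a vote count column
vote_cols = [col for col in catalog.columns if col not in ('id_str', 'file_loc')]

# Replace blanks with explicit zeros and store the counts as integers
catalog[vote_cols] = catalog[vote_cols].fillna(0).astype(int)

# Confirm no NaN values remain before passing the catalog to Zoobot
assert not catalog[vote_cols].isna().any().any()

# Total votes for the smooth-or-featured question, per galaxy
total_votes = catalog[vote_cols].sum(axis=1)
print(total_votes)
```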

Step 2: Specify the Decision Tree with a Schema

A Schema object tells Zoobot which answers belong to which questions and which questions are dependent on prior answers. The galaxy-datasets library provides pre-written pairs and dependencies for all major Galaxy Zoo surveys in label_metadata.py. For example:

# Inside galaxy-datasets/shared/label_metadata.py
gz2_pairs = {
    'smooth-or-featured': ['_smooth', '_featured-or-disk'],
    'disk-edge-on': ['_yes', '_no'],
    'has-spiral-arms': ['_yes', '_no'],
    # etc.
}

gz2_dependencies = {
    'smooth-or-featured': None,  # always asked first
    'disk-edge-on': 'smooth-or-featured_featured-or-disk',
    'has-spiral-arms': 'smooth-or-featured_featured-or-disk',
    # etc.
}

Pass these to Schema to create the structured decision tree object:

from zoobot.shared.schemas import Schema

# gz2_pairs and gz2_dependencies are the dicts shown above
schema = Schema(gz2_pairs, gz2_dependencies)

Decision trees for all major GZ projects are already specified in label_metadata.py. For a custom survey, define your own pairs and dependencies dicts following the same pattern.
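As a rough sketch of that pattern, the snippet below defines pairs and dependencies for a made-up two-question survey and sanity-checks them before they are handed to Schema. The `has-ring` question and both helper functions are hypothetical and are not part of Zoobot or galaxy-datasets. The column names are built by joining each question to its answer suffix, matching the `smooth-or-featured_smooth` naming used in the catalog.

```python
def expected_label_cols(pairs):
    """Build vote count column names by joining each question to its answer suffixes."""
    return [question + answer for question, answers in pairs.items() for answer in answers]


def check_dependencies(pairs, dependencies):
    """Raise ValueError if a question lacks a dependency entry or points at an unknown answer."""
    label_cols = set(expected_label_cols(pairs))
    for question in pairs:
        if question not in dependencies:
            raise ValueError(f'No dependency listed for question {question!r}')
        parent_answer = dependencies[question]
        if parent_answer is not None and parent_answer not in label_cols:
            raise ValueError(f'{question!r} depends on unknown answer {parent_answer!r}')


# Hypothetical custom survey: one root question and one follow-up question
my_pairs = {
    'smooth-or-featured': ['_smooth', '_featured-or-disk'],
    'has-ring': ['_yes', '_no'],
}
my_dependencies = {
    'smooth-or-featured': None,  # always asked first
    'has-ring': 'smooth-or-featured_featured-or-disk',
}

check_dependencies(my_pairs, my_dependencies)
print(expected_label_cols(my_pairs))
```

The catalog should contain one vote count column for each name this produces.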

Step 3: Finetune with FinetuneableZoobotTree

Pass the schema to FinetuneableZoobotTree. All standard finetuning parameters (learning rate, layer decay, etc.) are inherited from FinetuneableZoobotAbstract and work exactly as described in Choosing Parameters.

from zoobot.pytorch.training.finetune import FinetuneableZoobotTree

model = FinetuneableZoobotTree(
    name='hf_hub:mwalmsley/zoobot-encoder-convnext_nano',
    schema=schema  # the Schema object created in Step 2
)

Then set up a CatalogDataModule with all the vote count columns as label_cols:

from galaxy_datasets.pytorch.galaxy_datamodule import CatalogDataModule

datamodule = CatalogDataModule(
    train_catalog=train_catalog,
    val_catalog=val_catalog,
    label_cols=schema.label_cols,  # all vote count answer columns
    batch_size=32
)
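The snippet above assumes `train_catalog` and `val_catalog` already exist. One simple way to produce them, sketched here with plain pandas on a made-up ten-galaxy catalog, is a random 80/20 split. The identifiers, paths, and vote counts are placeholders, and other splitting strategies may suit your data better.

```python
import pandas as pd

# Hypothetical catalog with placeholder identifiers, paths, and vote counts
catalog = pd.DataFrame({
    'id_str': [f'galaxy_{i}' for i in range(10)],
    'file_loc': [f'/path/to/galaxy_{i}.jpg' for i in range(10)],
    'smooth-or-featured_smooth': [12, 17, 5, 30, 22, 8, 19, 25, 3, 14],
    'smooth-or-featured_featured-or-disk': [28, 23, 35, 10, 18, 32, 21, 15, 37, 26],
})

# Randomly assign 80% of galaxies to training; the remainder form the validation set
train_catalog = catalog.sample(frac=0.8, random_state=42)
val_catalog = catalog.drop(train_catalog.index)

print(len(train_catalog), len(val_catalog))
```

Splitting by row keeps each galaxy in exactly one of the two catalogs, provided each row is a unique galaxy as described in Step 1.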

Finally, train with get_trainer just as you would for any other finetuning task:

from zoobot.pytorch.training import finetune

# save_dir: directory where checkpoints and logs will be written (set this yourself)
trainer = finetune.get_trainer(save_dir, accelerator='gpu', max_epochs=100)
trainer.fit(model, datamodule)

FinetuneableZoobotTree logs training and validation loss only — it does not report accuracy or RMSE metrics, as these are not meaningful for the Dirichlet-Multinomial setting.

Full Working Example

See finetune_counts_full_tree.py for a complete script that finetunes Zoobot on a GZ-style decision tree from start to finish.
