Citizen science projects like Galaxy Zoo record the total votes each answer received for each morphology question. These questions are arranged as a decision tree: the question a volunteer is asked depends on how they answered previous questions. As a result, some questions receive many votes while others, asked only after specific prior answers, receive very few. A standard cross-entropy loss is not well suited to this variable-votes setting.
This guide is only relevant if you are training on raw volunteer vote counts from a Galaxy Zoo-style survey. For classification (e.g. smooth vs. featured), use `FinetuneableZoobotClassifier` instead; see the Finetuning guide.
The Dirichlet-Multinomial Loss
Zoobot includes a custom Dirichlet-Multinomial loss designed specifically for Galaxy Zoo decision trees. It works by:
- Predicting a Dirichlet distribution over the probability of a typical volunteer giving each answer.
- Comparing the predicted distribution (given k volunteers were asked) to the true vote counts using the Dirichlet-Multinomial likelihood.
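The likelihood in the second step can be illustrated with a minimal sketch in plain Python. This is not Zoobot's implementation: the function name `dirichlet_multinomial_log_prob` and the concentration values are assumptions chosen for illustration, and the sketch handles a single question for a single galaxy.

```python
import math


def dirichlet_multinomial_log_prob(votes, concentrations):
    """Log-probability of the observed vote counts for one question.

    votes: number of volunteers who chose each answer.
    concentrations: predicted Dirichlet concentration for each answer (all > 0).
    Hypothetical helper for illustration, not part of Zoobot.
    """
    if len(votes) != len(concentrations):
        raise ValueError("need exactly one concentration per answer")
    total_votes = sum(votes)
    total_conc = sum(concentrations)
    # Terms that depend only on the totals.
    log_prob = (
        math.lgamma(total_votes + 1)
        + math.lgamma(total_conc)
        - math.lgamma(total_votes + total_conc)
    )
    # One term per answer.
    for k, alpha in zip(votes, concentrations):
        log_prob += math.lgamma(k + alpha) - math.lgamma(alpha) - math.lgamma(k + 1)
    return log_prob


# 12 and 28 votes for the two answers of one question.
well_matched = dirichlet_multinomial_log_prob([12, 28], [3.0, 7.0])
poorly_matched = dirichlet_multinomial_log_prob([12, 28], [7.0, 3.0])
# A question nobody was asked: zero votes for every answer.
unasked = dirichlet_multinomial_log_prob([0, 0], [3.0, 7.0])
```

In this sketch, concentrations whose proportions match the observed vote fractions give a higher log-probability, and a question with zero votes contributes a log-probability of exactly zero. A training loss would typically be the negative of this quantity; how Zoobot combines the terms across questions is not shown here.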
FinetuneableZoobotTree and ZoobotTree use this loss. To use it, you need to provide:
- The vote counts for each image, as columns in a catalog.
- A `Schema` object describing which answers belong to which questions and how the decision tree is structured.
Step 1: Create the Vote Count Catalog
Create a catalog (a pandas DataFrame) where each row represents a unique galaxy. The required columns are:

| Column | Description |
|---|---|
| `id_str` | Unique string identifier (e.g. `J012345` or `1856_67919`) |
| `file_loc` | Absolute path to the galaxy image (`.jpg` or `.png`) |
| (vote count columns) | One column per answer, e.g. `smooth-or-featured_smooth` |

For example:

| id_str | file_loc | smooth-or-featured_smooth | smooth-or-featured_featured-or-disk |
|---|---|---|---|
| J101419 | /path/to/J101419.jpg | 12 | 28 |
| J101420 | /path/to/J101420.jpg | 17 | 23 |
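The example catalog above could be built and sanity-checked along these lines. This is a sketch only: `check_catalog` is a hypothetical helper written for this illustration, not a Zoobot function, and the checks it performs (required columns present, unique `id_str`, non-negative counts) are assumptions about what is worth verifying before training.

```python
import pandas as pd

# The two example galaxies from the table above.
catalog = pd.DataFrame(
    {
        "id_str": ["J101419", "J101420"],
        "file_loc": ["/path/to/J101419.jpg", "/path/to/J101420.jpg"],
        "smooth-or-featured_smooth": [12, 17],
        "smooth-or-featured_featured-or-disk": [28, 23],
    }
)

vote_cols = ["smooth-or-featured_smooth", "smooth-or-featured_featured-or-disk"]


def check_catalog(df, vote_cols):
    """Hypothetical pre-training check; raises ValueError on the first problem found."""
    missing = [c for c in ["id_str", "file_loc", *vote_cols] if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df["id_str"].duplicated().any():
        raise ValueError("id_str must be unique: one row per galaxy")
    if (df[vote_cols] < 0).any().any():
        raise ValueError("vote counts must be non-negative")
    return True


catalog_ok = check_catalog(catalog, vote_cols)
```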
Step 2: Specify the Decision Tree with a Schema
A `Schema` object tells Zoobot which answers belong to which questions and which questions depend on prior answers.
The galaxy-datasets library provides pre-written `pairs` and `dependencies` for all major Galaxy Zoo surveys in `label_metadata.py`. Pass them to `Schema` to create the structured decision tree object. For a custom survey, define your own `pairs` and `dependencies` dicts following the same pattern as those in `label_metadata.py`.
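As a rough sketch of what custom `pairs` and `dependencies` dicts might look like, the example below uses plain Python and does not import Zoobot or galaxy-datasets. It assumes that `pairs` maps each question to its answer suffixes and that `dependencies` maps each question to the answer column that must be chosen before the question is asked (`None` for the first question); check `label_metadata.py` to confirm the exact convention. The `disk-edge-on` question and both helper functions are illustrative placeholders.

```python
# Assumed layout: question -> list of answer suffixes.
pairs = {
    "smooth-or-featured": ["_smooth", "_featured-or-disk"],
    "disk-edge-on": ["_yes", "_no"],  # illustrative second question
}

# Assumed layout: question -> answer column that leads to it (None for the root question).
dependencies = {
    "smooth-or-featured": None,
    "disk-edge-on": "smooth-or-featured_featured-or-disk",
}


def label_cols_from_pairs(pairs):
    """Hypothetical helper: join each question with its answer suffixes."""
    return [question + answer for question, answers in pairs.items() for answer in answers]


def check_dependencies(pairs, dependencies):
    """Hypothetical helper: every question needs a dependency entry that names a known answer column."""
    known = set(label_cols_from_pairs(pairs))
    for question in pairs:
        if question not in dependencies:
            raise ValueError(f"no dependency entry for question {question!r}")
        parent = dependencies[question]
        if parent is not None and parent not in known:
            raise ValueError(f"{question!r} depends on unknown answer {parent!r}")
    return True


label_cols = label_cols_from_pairs(pairs)
dependencies_ok = check_dependencies(pairs, dependencies)
```

Under these assumptions, `label_cols` lists the vote count columns the catalog is expected to contain, which makes it easy to compare against the DataFrame before training.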
Step 3: Finetune with FinetuneableZoobotTree
Pass the `schema` to `FinetuneableZoobotTree`. All standard finetuning parameters (learning rate, layer decay, etc.) are inherited from `FinetuneableZoobotAbstract` and work exactly as described in Choosing Parameters.
Create a `CatalogDataModule` with all the vote count columns as `label_cols`, then train with `get_trainer` just as you would for any other finetuning task.
`FinetuneableZoobotTree` logs training and validation loss only. It does not report accuracy or RMSE metrics, as these are not meaningful for the Dirichlet-Multinomial setting.
Full Working Example
See `finetune_counts_full_tree.py` for a complete script that finetunes Zoobot on a GZ-style decision tree from start to finish.