Zoobot’s ultimate purpose is to enable science. Beyond providing a finetunable model, the Zoobot project releases science-ready data products: compact galaxy representations suitable for unsupervised applications, and detailed volunteer-calibrated morphology catalogs covering millions of galaxies. These outputs let you do meaningful research without needing to run any deep learning yourself.

Precalculated Representations

New in Zoobot v2! Precalculated representations are a brand-new data product. We’re excited to see what you build with them, so reach out if you need help getting started.

Zoobot v2 ships with precalculated representations for every galaxy in the Galaxy Zoo DESI data release. Rather than working with raw images, you get a compact 40-dimensional PCA-compressed vector per galaxy that summarises its visual morphology.

Download: representations_pca_40_with_coords.parquet (2.5 GB)

Schema

| Column | Description |
| --- | --- |
| id_str | Unique galaxy identifier ({brickid}_{objid}) from DESI Legacy Surveys DR8 |
| ra | Right ascension in degrees |
| dec | Declination in degrees |
| feat_pca_0 | First PCA component of the Zoobot representation |
| feat_pca_1 | Second PCA component |
| ... | Components up to feat_pca_39 (40 total) |

id_str is formed as {brickid}_{objid}, where brickid is the unique identifier for the sky brick in the Legacy Surveys and objid is the unique identifier for the source within that brick. Use id_str to cross-match with the GZ DESI morphology catalog (below) via the dr8_id key.
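
As a minimal sketch of the cross-match described above, the snippet below joins representations to catalog rows with pandas, matching `id_str` against `dr8_id`. The tiny DataFrames are synthetic stand-ins for the real files, `example_morphology_column` is a hypothetical column name used only for illustration, and the helper names `make_id_str` and `cross_match` are not part of Zoobot.

```python
import pandas as pd


def make_id_str(brickid: int, objid: int) -> str:
    """Build an identifier in the {brickid}_{objid} form used by id_str."""
    return f"{brickid}_{objid}"


def cross_match(representations: pd.DataFrame, catalog: pd.DataFrame) -> pd.DataFrame:
    """Inner-join representations to catalog rows where id_str == dr8_id."""
    return representations.merge(
        catalog,
        left_on="id_str",
        right_on="dr8_id",
        how="inner",
        validate="one_to_one",  # raise if either key contains duplicates
    )


# Synthetic stand-ins. With the real files you would instead load, e.g.:
#   representations = pd.read_parquet("representations_pca_40_with_coords.parquet")
representations = pd.DataFrame(
    {
        "id_str": [make_id_str(1001, 5), make_id_str(1001, 9), make_id_str(2040, 3)],
        "ra": [150.1, 150.2, 210.7],
        "dec": [2.3, 2.4, -1.1],
        "feat_pca_0": [0.12, -0.40, 1.05],
    }
)
catalog = pd.DataFrame(
    {
        "dr8_id": ["1001_9", "2040_3", "9999_1"],
        "example_morphology_column": [0.8, 0.1, 0.5],  # hypothetical column
    }
)

matched = cross_match(representations, catalog)
print(matched[["id_str", "ra", "dec", "example_morphology_column"]])
```

An inner join keeps only galaxies present in both tables. Switching to `how="left"` would instead keep every representation and leave the catalog columns empty where no match exists.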

Use Cases

The precalculated representations are well-suited for tasks such as:
  • Similarity search — find galaxies that look like a query example
  • Anomaly detection — identify rare or unusual morphologies at scale
  • Multi-modal models — use the representation as the vision branch alongside spectroscopic or photometric data
  • Any application that needs a short vector summarising the morphology of a galaxy image
The PCA-compressed 40-dimensional representations offer a practical trade-off between information content and file size. The full encoder output is impractically large for most downstream applications. Starting with these 40 components is strongly recommended.
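
To make the first two use cases concrete, here is a small NumPy sketch. It assumes Euclidean distance in the PCA space is a reasonable proxy for visual similarity, which is an illustrative choice rather than a method documented by the Zoobot team. The helper names `most_similar` and `isolation_scores` are hypothetical, and the toy array uses 2 dimensions instead of 40.

```python
import numpy as np


def most_similar(features: np.ndarray, query_index: int, k: int = 5) -> np.ndarray:
    """Return indices of the k rows closest to the query row, excluding the query."""
    distances = np.linalg.norm(features - features[query_index], axis=1)
    distances[query_index] = np.inf  # never return the query itself
    return np.argsort(distances)[:k]


def isolation_scores(features: np.ndarray, k: int = 1) -> np.ndarray:
    """Distance from each row to its k-th nearest neighbour; larger means more unusual.

    Brute force: builds an n x n distance matrix, so use it on a subsample only.
    """
    diffs = features[:, None, :] - features[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(distances, np.inf)  # ignore each row's distance to itself
    return np.sort(distances, axis=1)[:, k - 1]


# Toy stand-in for the feature matrix. With the real file you might build it as:
#   features = df[[f"feat_pca_{i}" for i in range(40)]].to_numpy()
features = np.array(
    [
        [0.0, 0.0],
        [0.1, 0.0],
        [5.0, 5.0],  # far from the others
        [0.0, 0.3],
    ]
)

neighbours = most_similar(features, query_index=0, k=2)
scores = isolation_scores(features, k=1)
print("nearest to row 0:", neighbours)
print("most isolated row:", int(np.argmax(scores)))
```

The brute-force distance matrix does not scale to millions of galaxies. For the full dataset, an approximate nearest-neighbour index or a tree-based search would be the more practical choice.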

Galaxy Zoo Morphology Catalogs

GZ DESI — 8.7 Million Galaxies

Zoobot was used to produce a detailed morphology catalog for every extended galaxy brighter than r = 19 in the DESI Legacy Surveys: 8.7 million galaxies in total. The catalog and full schema are available from Zenodo.

Download: https://zenodo.org/records/8360385

If you are new to the catalog, start with gz_desi_deep_learning_catalog_friendly.parquet. This file contains the most useful columns in a ready-to-use format, without requiring familiarity with the full schema.

GZ DECaLS DR5 (Superseded)

A previous Zoobot-powered morphology catalog was created for DECaLS DR5.

Download: https://zenodo.org/records/4573248

GZ DECaLS DR5 has been superseded by the GZ DESI catalog above. GZ DESI covers all the same galaxies and many more. New projects should use GZ DESI.

Future Catalogs

The Zoobot team is actively working on expanding coverage to additional surveys. Planned releases, roughly in order of priority:
  1. DESI-LS DR10 — an updated morphology catalog using the full DR10 footprint (image redownload in progress)
  2. HSC — Hyper Suprime-Cam morphologies at greater depth
  3. JWST — high-redshift morphology measurements with JWST imaging
  4. Euclid — wide-field morphology from the Euclid satellite

Zoobot is already deployed in the Euclid processing pipeline to produce the OU-MER morphology catalog. The first public results from Euclid Q1 are documented in "Euclid preparation: Measuring detailed galaxy morphologies for Euclid with Machine Learning" (2024) and the "Euclid Q1 first visual morphology catalogue" (2025).
