The GitHub Pages Discovery Set Protocol v1 is a working contract for building the machine-readable discovery layer on top of a GitHub Pages artifact. It provides infrastructure that helps crawlers, bots, LLM readers, and external systems understand what has been published, without competing with the human-facing page. Use it after a GitHub Pages deployment has been committed to, once the owner, repository, deployment type, and canonical URLs are known.
Core Principle
Discovery Set is infrastructure. It helps crawlers, bots, LLM readers, and external systems understand the published GitHub Pages artifact. It must not compete with the human page.
GitHub Pages path rules are part of the artifact. Never generate discovery URLs before the GitHub Pages publication root is closed.
GitHub Pages Gate
Before generating any discovery files, all required variables must be closed. If any required value is missing, do not guess paths: ask for the missing deployment root or produce placeholders only.
| Variable | Description |
|---|---|
| OWNER | GitHub user or organisation name |
| REPO | Repository name |
| PAGES_TYPE | USER_OR_ORG_SITE, PROJECT_SITE, or CUSTOM_DOMAIN_SITE |
| SOURCE_BRANCH | Branch used by GitHub Pages, when known |
| SOURCE_PATH | root or /docs, when known |
| CUSTOM_DOMAIN | Active custom domain, or NONE |
| DOMAIN_ROOT | Absolute URL of the bare Pages host or custom domain |
| SITE_ROOT | Absolute URL where the GitHub Pages artifact begins |
| DISCOVERY_ROOT | Root where llms.txt, raw-manifest.json, sitemap.xml, and companion files live |
| CURRENT_PAGE_URL | Absolute canonical URL of each canonical HTML page |
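The gate can be pictured as a small check that runs before any file is generated. The sketch below is illustrative only: the function name `open_variables` is hypothetical, and it assumes that every variable in the table is required except the two marked "when known" (SOURCE_BRANCH and SOURCE_PATH). The protocol text does not spell out that split, so adjust the tuple if your reading differs.

```python
from typing import List, Mapping, Optional

# Assumption: all table variables are required except the two marked
# "when known" (SOURCE_BRANCH, SOURCE_PATH).
REQUIRED_VARIABLES = (
    "OWNER", "REPO", "PAGES_TYPE", "CUSTOM_DOMAIN",
    "DOMAIN_ROOT", "SITE_ROOT", "DISCOVERY_ROOT", "CURRENT_PAGE_URL",
)
PAGES_TYPES = {"USER_OR_ORG_SITE", "PROJECT_SITE", "CUSTOM_DOMAIN_SITE"}


def open_variables(values: Mapping[str, Optional[str]]) -> List[str]:
    """Return the names of required variables that are still open."""
    missing = [
        name for name in REQUIRED_VARIABLES
        if not (values.get(name) or "").strip()
    ]
    # A PAGES_TYPE outside the three known types is treated as open too.
    pages_type = (values.get("PAGES_TYPE") or "").strip()
    if pages_type and pages_type not in PAGES_TYPES:
        missing.append("PAGES_TYPE")
    return missing


# Hypothetical example values for a project site.
example = {
    "OWNER": "example-owner",
    "REPO": "demo",
    "PAGES_TYPE": "PROJECT_SITE",
    "CUSTOM_DOMAIN": "NONE",  # NONE is a closed value, not a missing one
    "DOMAIN_ROOT": "https://example-owner.github.io/",
    "SITE_ROOT": "https://example-owner.github.io/demo/",
    "DISCOVERY_ROOT": "https://example-owner.github.io/demo/",
    "CURRENT_PAGE_URL": "https://example-owner.github.io/demo/",
}
```

When the returned list is non-empty, the gate stays shut: ask for the listed values or emit placeholders rather than guessing paths.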
GitHub Pages Deployment Types
- USER_OR_ORG_SITE
- PROJECT_SITE
- CUSTOM_DOMAIN_SITE
USER_OR_ORG_SITE
Repository pattern: OWNER.github.io
Default SITE_ROOT: https://OWNER.github.io/
DOMAIN_ROOT usually equals SITE_ROOT. Root-relative discovery paths may be valid only when DOMAIN_ROOT and SITE_ROOT are the same.
Publication Variables and Default Resolution
| Variable | Default |
|---|---|
| DISCOVERY_ROOT | equals SITE_ROOT |
| LLMS_URL | URL_JOIN(DISCOVERY_ROOT, "llms.txt") |
| RAW_MANIFEST_URL | URL_JOIN(DISCOVERY_ROOT, "raw-manifest.json") |
| SITEMAP_URL | URL_JOIN(DISCOVERY_ROOT, "sitemap.xml") |
| ROBOTS_URL (authoritative) | URL_JOIN(DOMAIN_ROOT, "robots.txt") |
| ROBOTS_URL (companion) | URL_JOIN(DISCOVERY_ROOT, "robots.txt") |
When constructing URLs, join path segments with exactly one slash. Do not construct discovery URLs through raw string concatenation. Prefer absolute URLs when in doubt.
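As a rough illustration of the defaults and the one-slash rule, the sketch below resolves the roots per deployment type and then derives the discovery URLs. The helper names (`url_join`, `default_roots`, `discovery_urls`) are hypothetical. The PROJECT_SITE and CUSTOM_DOMAIN_SITE roots follow the usual GitHub Pages URL layout and assume HTTPS with the custom domain serving the artifact at its root; verify them against the actual deployment before relying on them.

```python
from typing import Dict, Optional


def url_join(base: str, segment: str) -> str:
    """Join base and segment with exactly one slash between them."""
    return base.rstrip("/") + "/" + segment.lstrip("/")


def default_roots(owner: str, repo: str, pages_type: str,
                  custom_domain: str = "NONE") -> Dict[str, str]:
    """Resolve default DOMAIN_ROOT and SITE_ROOT (assumed layouts)."""
    host = f"https://{owner}.github.io/"
    if pages_type == "USER_OR_ORG_SITE":
        domain_root = site_root = host
    elif pages_type == "PROJECT_SITE":
        domain_root = host
        site_root = url_join(host, repo) + "/"
    elif pages_type == "CUSTOM_DOMAIN_SITE":
        if custom_domain == "NONE":
            raise ValueError("CUSTOM_DOMAIN is still open")
        # Assumption: the custom domain serves the artifact at its root.
        domain_root = site_root = f"https://{custom_domain}/"
    else:
        raise ValueError(f"unknown PAGES_TYPE: {pages_type}")
    return {"DOMAIN_ROOT": domain_root, "SITE_ROOT": site_root}


def discovery_urls(domain_root: str, site_root: str,
                   discovery_root: Optional[str] = None) -> Dict[str, str]:
    """Apply the defaults from the table; DISCOVERY_ROOT falls back to SITE_ROOT."""
    root = discovery_root or site_root
    return {
        "DISCOVERY_ROOT": root,
        "LLMS_URL": url_join(root, "llms.txt"),
        "RAW_MANIFEST_URL": url_join(root, "raw-manifest.json"),
        "SITEMAP_URL": url_join(root, "sitemap.xml"),
        "ROBOTS_URL_AUTHORITATIVE": url_join(domain_root, "robots.txt"),
        "ROBOTS_URL_COMPANION": url_join(root, "robots.txt"),
    }


roots = default_roots("example-owner", "demo", "PROJECT_SITE")
urls = discovery_urls(roots["DOMAIN_ROOT"], roots["SITE_ROOT"])
```

A dedicated join helper is used instead of plain concatenation because a base without a trailing slash, or a segment with a leading one, would otherwise produce either a missing or a doubled slash.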
Site vs. Page Classification
Before generating discovery files, classify the artifact using the binary classifier: does this GitHub Pages artifact expose more than one canonical HTML URL under the same SITE_ROOT?
| Answer | Type |
|---|---|
| Yes | SITE — multiple canonical HTML pages |
| No | PAGE — single canonical HTML URL |
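One possible way to express the classifier in code is shown below. It is a sketch under assumptions: the function name `classify_artifact` is hypothetical, and the caller is assumed to pass only URLs of HTML pages (not discovery files, assets, or external links). Fragments are stripped so that anchors on one page never count as separate pages.

```python
from typing import Iterable
from urllib.parse import urldefrag


def classify_artifact(site_root: str, html_urls: Iterable[str]) -> str:
    """Return "SITE" if more than one canonical HTML URL lives under site_root."""
    pages = set()
    for url in html_urls:
        page, _fragment = urldefrag(url)  # drop "#anchor" parts
        if page.startswith(site_root):    # ignore pages outside SITE_ROOT
            pages.add(page)
    return "SITE" if len(pages) > 1 else "PAGE"


# Hypothetical example: one page plus an anchor on that same page.
root = "https://example-owner.github.io/demo/"
single = classify_artifact(root, [root, root + "#install"])
multi = classify_artifact(root, [root, root + "guide.html"])
```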
Anchor links, GitHub repository links, links to llms.txt, raw-manifest.json, sitemap.xml, assets, or external pages do not make a SITE. Only multiple canonical HTML pages under the same SITE_ROOT make a SITE.
What the Discovery Set Includes
- TYPE: PAGE
- TYPE: SITE
TYPE: PAGE
Definition: Single canonical HTML URL, usually one index.html at SITE_ROOT.
HEAD (required elements):
- `<title>`
- `<meta name="description">`
- `<meta name="robots" content="index, follow">`
- `<link rel="canonical" href="[CURRENT_PAGE_URL]">`
- `<link rel="alternate" type="text/plain" href="[LLMS_URL]" title="LLM-readable index">`
- `<link rel="alternate" type="application/json" href="[RAW_MANIFEST_URL]" title="Machine-readable manifest">`
- Open Graph: og:title, og:description, og:type="website", og:url, og:image (absolute)
- JSON-LD: SoftwareSourceCode for repos/protocols/tools; WebPage for editorial; ProfilePage for profile surfaces
- Favicon
Files in the set:
- index.html
- llms.txt
- raw-manifest.json
- sitemap.xml: single `<url>` entry for CURRENT_PAGE_URL
- robots.txt: authoritative only when DOMAIN_ROOT is controlled; otherwise optional companion
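To make the discovery-related HEAD elements concrete, the following sketch renders the robots, canonical, alternate, and og:url tags from already-resolved URLs. The function name `discovery_head_links` is hypothetical, and the output covers only the URL-bearing elements from the list above, not the title, description, JSON-LD, or favicon.

```python
from html import escape


def discovery_head_links(current_page_url: str, llms_url: str,
                         raw_manifest_url: str) -> str:
    """Render the URL-bearing HEAD elements for one canonical page."""
    def attr(value: str) -> str:
        # Escape the value so it is safe inside a double-quoted attribute.
        return escape(value, quote=True)

    lines = [
        '<meta name="robots" content="index, follow">',
        f'<link rel="canonical" href="{attr(current_page_url)}">',
        f'<link rel="alternate" type="text/plain" '
        f'href="{attr(llms_url)}" title="LLM-readable index">',
        f'<link rel="alternate" type="application/json" '
        f'href="{attr(raw_manifest_url)}" title="Machine-readable manifest">',
        f'<meta property="og:url" content="{attr(current_page_url)}">',
    ]
    return "\n".join(lines)


# Hypothetical project-site values.
page = "https://example-owner.github.io/demo/"
head = discovery_head_links(page, page + "llms.txt", page + "raw-manifest.json")
```

Absolute URLs are passed in here, which sidesteps the question of whether a relative path would resolve correctly on a project site.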
Robots Authority Rule
robots.txt has a crawler-standard location at DOMAIN_ROOT/robots.txt.
- If the GitHub Pages deployment controls DOMAIN_ROOT, place authoritative robots.txt there.
- If the artifact is a PROJECT_SITE under OWNER.github.io/REPO/ and does not control DOMAIN_ROOT, a robots.txt at DISCOVERY_ROOT may be included as a project-level discovery companion, but it must not be treated as authoritative crawler control.
- sitemap.xml may be cited from an authoritative domain-root robots.txt when domain-root control exists.
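The authority rule reduces to a single decision, which the sketch below encodes. The names `RobotsPlan` and `plan_robots` are hypothetical, and whether the deployment controls DOMAIN_ROOT is passed in as a flag because it cannot be inferred from the URLs alone.

```python
from typing import NamedTuple


class RobotsPlan(NamedTuple):
    url: str             # where robots.txt would be published
    authoritative: bool  # True only at the crawler-standard location


def plan_robots(domain_root: str, discovery_root: str,
                controls_domain_root: bool) -> RobotsPlan:
    """Pick the robots.txt location and mark whether it is authoritative."""
    if controls_domain_root:
        return RobotsPlan(domain_root.rstrip("/") + "/robots.txt", True)
    # Companion file only: crawlers are not expected to honor it.
    return RobotsPlan(discovery_root.rstrip("/") + "/robots.txt", False)


# Hypothetical project site that does not control the bare host.
plan = plan_robots(
    "https://example-owner.github.io/",
    "https://example-owner.github.io/demo/",
    controls_domain_root=False,
)
```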
Path Rules Summary
| Element | Rule |
|---|---|
| canonical | Always absolute URL; equals CURRENT_PAGE_URL |
| og:url | Always absolute URL; equals CURRENT_PAGE_URL |
| og:image | Always absolute URL |
| robots sitemap entry | Must be absolute URL |
| llms.txt / raw-manifest.json in HEAD | Relative allowed at DISCOVERY_ROOT; absolute preferred for subpages and PROJECT_SITE |
| Root-relative paths (/llms.txt) | Only allowed when DISCOVERY_ROOT equals DOMAIN_ROOT |
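Several of these rules lend themselves to an automated check. The sketch below covers the canonical, og:url, og:image, and root-relative rows; the function name `check_path_rules` and the message wording are hypothetical, and the robots sitemap entry is left out for brevity.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


def is_absolute(url: str) -> bool:
    """True for http(s) URLs that carry a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def check_path_rules(current_page_url: str, canonical: str, og_url: str,
                     og_image: str, discovery_hrefs: Iterable[str],
                     domain_root: str, discovery_root: str) -> List[str]:
    """Return a list of rule violations; an empty list means no findings."""
    problems = []
    for name, value in (("canonical", canonical), ("og:url", og_url)):
        if not is_absolute(value) or value != current_page_url:
            problems.append(f"{name} must be absolute and equal CURRENT_PAGE_URL")
    if not is_absolute(og_image):
        problems.append("og:image must be absolute")
    same_root = domain_root.rstrip("/") == discovery_root.rstrip("/")
    for href in discovery_hrefs:
        # "/llms.txt" is root-relative; "//host/x" is protocol-relative.
        root_relative = href.startswith("/") and not href.startswith("//")
        if root_relative and not same_root:
            problems.append(
                f"root-relative {href} requires DISCOVERY_ROOT == DOMAIN_ROOT"
            )
    return problems


# Hypothetical project site where "/llms.txt" would point at the wrong host path.
page = "https://example-owner.github.io/demo/"
findings = check_path_rules(
    page, page, page, page + "cover.png", ["/llms.txt"],
    "https://example-owner.github.io/", page,
)
```

On a project site the root-relative href resolves against the bare host rather than the repository path, which is why the check flags it.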