The HTML and Website Discovery Set Protocol v1.2 is a working contract for building the machine-readable discovery layer on top of any static HTML page or website. It provides infrastructure that helps crawlers, bots, LLM readers, and external systems understand a published artifact, without competing with the human-facing page. This protocol applies to static HTML publications in general. If the deployment target is specifically GitHub Pages, see the GitHub Pages Discovery Set Protocol, which handles GitHub Pages path rules, deployment types, and project-site risks as a dedicated scope.
How This Protocol Differs from the GitHub Pages Discovery Set Protocol
This protocol applies to any static HTML publication regardless of hosting provider. The GitHub Pages Discovery Set Protocol applies only to artifacts deployed through GitHub Pages. Use this protocol as the general case; use the GitHub Pages one only when the deployment is confirmed as GitHub Pages.
| Dimension | This protocol (v1.2) | GitHub Pages Discovery Set (v1) |
|---|---|---|
| Scope | Any static HTML page or website | GitHub Pages only |
| Deployment model | Generic static hosting, CDN, subpath, custom domain | GitHub Pages user/org site, project site, custom domain via GitHub Pages |
| Path risk | Subpath deployments, DOMAIN_ROOT vs SITE_ROOT | PROJECT_SITE root-relative path confusion |
| Required gate variables | SITE_ROOT, DOMAIN_ROOT, PROJECT_ROOT, CURRENT_PAGE_URL, DISCOVERY_ROOT | OWNER, REPO, PAGES_TYPE, CUSTOM_DOMAIN, SITE_ROOT, DOMAIN_ROOT, DISCOVERY_ROOT |
Core Principle
Discovery Set is infrastructure. It helps crawlers, bots, LLM readers, and external systems understand the published artifact. It must not compete with the human page.
Publication Model
This protocol’s default model is:
- A static HTML artifact
- One absolute publication root
- One machine-readable discovery root
- One or more canonical HTML URLs
- Discovery files that help machines understand the artifact without competing with the human page
Publication Variables
Before generating any discovery files, close all required variables. If any value is missing, do not guess paths; ask for the missing deployment root or produce placeholders only.
| Variable | Description | Default |
|---|---|---|
| SITE_ROOT | Absolute URL where the published artifact begins | Must be provided |
| DOMAIN_ROOT | Absolute URL of the bare host or domain | Must be provided |
| PROJECT_ROOT | Optional subpath root below DOMAIN_ROOT | When present, SITE_ROOT usually equals PROJECT_ROOT |
| CURRENT_PAGE_URL | Absolute canonical URL of the current HTML page | Must be provided per page |
| DISCOVERY_ROOT | Root where llms.txt, raw-manifest.json, sitemap.xml, and companion files live | Equals SITE_ROOT |
| LLMS_URL | Resolved URL to llms.txt | URL_JOIN(DISCOVERY_ROOT, "llms.txt") |
| RAW_MANIFEST_URL | Resolved URL to raw-manifest.json | URL_JOIN(DISCOVERY_ROOT, "raw-manifest.json") |
| SITEMAP_URL | Resolved URL to sitemap.xml | URL_JOIN(DISCOVERY_ROOT, "sitemap.xml") |
| ROBOTS_URL (authoritative) | Resolved URL to authoritative robots.txt | URL_JOIN(DOMAIN_ROOT, "robots.txt") |
| ROBOTS_URL (companion) | Resolved URL to project-level companion robots.txt | URL_JOIN(DISCOVERY_ROOT, "robots.txt") |
| Scenario | Variables |
|---|---|
| Domain-root site | SITE_ROOT = https://example.com/ · DISCOVERY_ROOT = https://example.com/ |
| Subpath site | SITE_ROOT = https://example.com/project/ · DISCOVERY_ROOT = https://example.com/project/ |
| GitHub Pages project site | SITE_ROOT = https://USER.github.io/REPO/ · DISCOVERY_ROOT = https://USER.github.io/REPO/ |
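The derived URLs in the variables table can be sketched in Python, treating URL_JOIN as the standard library's `urljoin`. The function name `resolve_discovery_urls` and the dictionary keys are hypothetical, and the trailing-slash check is an assumption of this sketch rather than a stated rule of the protocol.

```python
from urllib.parse import urljoin


def resolve_discovery_urls(site_root, domain_root, discovery_root=None):
    """Resolve the derived URLs from the gate variables (illustrative helper)."""
    # DISCOVERY_ROOT equals SITE_ROOT unless a different root is given.
    if discovery_root is None:
        discovery_root = site_root
    roots = {
        "SITE_ROOT": site_root,
        "DOMAIN_ROOT": domain_root,
        "DISCOVERY_ROOT": discovery_root,
    }
    for name, value in roots.items():
        # Missing values are rejected rather than guessed.
        if not value:
            raise ValueError(f"{name} is missing; ask for the deployment root")
        if not value.startswith(("http://", "https://")):
            raise ValueError(f"{name} must be an absolute URL: {value!r}")
        # Assumption of this sketch: roots end with "/". Without it, urljoin
        # replaces the last path segment instead of appending to it.
        if not value.endswith("/"):
            raise ValueError(f"{name} must end with '/': {value!r}")
    return {
        "LLMS_URL": urljoin(discovery_root, "llms.txt"),
        "RAW_MANIFEST_URL": urljoin(discovery_root, "raw-manifest.json"),
        "SITEMAP_URL": urljoin(discovery_root, "sitemap.xml"),
        "ROBOTS_URL_AUTHORITATIVE": urljoin(domain_root, "robots.txt"),
        "ROBOTS_URL_COMPANION": urljoin(discovery_root, "robots.txt"),
    }


urls = resolve_discovery_urls("https://example.com/project/", "https://example.com/")
print(urls["LLMS_URL"])
```

For the subpath scenario this resolves llms.txt under the project path while the authoritative robots.txt stays at the domain root. Rejecting a root without a trailing slash is a defensive choice: `urljoin("https://example.com/project", "llms.txt")` would silently resolve to the domain root.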
GitHub Pages Special Case
If the deployment target is GitHub Pages, also determine before generating any URL: USER or organisation name, repository name, whether the site is a user/org site or project site, whether a custom domain is active, exact SITE_ROOT, exact DOMAIN_ROOT, and exact DISCOVERY_ROOT.
Site vs. Page Classification
Does this published artifact expose more than one canonical HTML URL under the same SITE_ROOT?
| Answer | Type |
|---|---|
| Yes | SITE — multiple canonical HTML pages |
| No | PAGE — single canonical HTML URL |
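The classification question can be sketched as a small Python check. The function name `classify_artifact` is hypothetical, and the sketch assumes the caller already has the list of canonical HTML URLs and a SITE_ROOT that ends with a slash. Fragments are stripped first, so an anchor on the same page does not count as a second page.

```python
from urllib.parse import urldefrag


def classify_artifact(canonical_html_urls, site_root):
    """Return "SITE" or "PAGE" for the given canonical HTML URLs."""
    pages = set()
    for url in canonical_html_urls:
        # Drop "#fragment" so anchor links collapse onto their page.
        page = urldefrag(url).url
        # Only URLs under SITE_ROOT count toward the classification.
        if page.startswith(site_root):
            pages.add(page)
    if not pages:
        raise ValueError("no canonical HTML URL found under SITE_ROOT")
    return "SITE" if len(pages) > 1 else "PAGE"


root = "https://example.com/project/"
print(classify_artifact([root, root + "#usage"], root))
print(classify_artifact([root, root + "docs.html"], root))
```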
Anchor links, repository links outside SITE_ROOT, and links to llms.txt, raw-manifest.json, sitemap.xml, assets, or external pages do not make a SITE. Only multiple canonical HTML pages under the same SITE_ROOT make a SITE.
What the Discovery Set Includes
- TYPE: PAGE
- TYPE: SITE
TYPE: PAGE definition: Single canonical HTML URL, usually one index.html.
HEAD (required elements):
- `<title>`
- `<meta name="description">`
- `<meta name="robots" content="index, follow">`
- `<link rel="canonical" href="[CURRENT_PAGE_URL]">`
- `<link rel="alternate" type="text/plain" href="[LLMS_URL]" title="LLM-readable index">`
- `<link rel="alternate" type="application/json" href="[RAW_MANIFEST_URL]" title="Machine-readable manifest">`
- Open Graph: og:title, og:description, og:type="website", og:url (absolute), og:image (absolute)
- JSON-LD: SoftwareSourceCode for repos/protocols/tools/code artefacts; WebPage for editorial/static pages
- Favicon
Discovery files:
- llms.txt
- raw-manifest.json
- sitemap.xml: single `<url>` entry for CURRENT_PAGE_URL
- robots.txt: required only when DOMAIN_ROOT is controlled; optional companion otherwise
Footer: Low-noise machine links for a page at DISCOVERY_ROOT. If CURRENT_PAGE_URL is not at DISCOVERY_ROOT, use correct relative paths or absolute URLs.
Discovery File Location Rules
| File | Location | Notes |
|---|---|---|
| llms.txt | DISCOVERY_ROOT only | Readable project or site index for LLM/crawler use. Not stored in assets/. |
| raw-manifest.json | DISCOVERY_ROOT only | Structured machine-readable project or site manifest. Not stored in assets/. |
| sitemap.xml | DISCOVERY_ROOT only | Lists canonical HTML URLs only. |
| robots.txt (authoritative) | DOMAIN_ROOT only | Only when authoritative crawler control is required and domain-root control exists. |
| robots.txt (companion) | DISCOVERY_ROOT only | Only as a project-level discovery companion when domain-root control does not exist. |
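As an illustration of the sitemap.xml row, the following Python sketch builds a sitemap that lists canonical HTML URLs only. The function name `build_sitemap` is hypothetical. The namespace is the one defined by the sitemaps.org protocol. Filtering out discovery files is left to the caller, which is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(canonical_html_urls):
    """Serialize a sitemap with one <url> entry per canonical HTML URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in canonical_html_urls:
        # Sitemap entries must be absolute URLs.
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"sitemap entries must be absolute: {url!r}")
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    # The XML declaration is prepended by hand to keep the output predictable.
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


print(build_sitemap(["https://example.com/project/"]))
```

For a TYPE: PAGE artifact the list holds a single entry, CURRENT_PAGE_URL. The resulting file would be written to DISCOVERY_ROOT, not to assets/.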
Robots Authority Rule
robots.txt has a crawler-standard location at DOMAIN_ROOT/robots.txt.
- If the deployment controls DOMAIN_ROOT, place the authoritative robots.txt there.
- If the artifact is published under a subpath and does not control DOMAIN_ROOT, a project-level robots.txt at DISCOVERY_ROOT may be included as a discovery companion, but it must not be treated as authoritative crawler control.
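The two cases above can be sketched as a placement decision in Python. The function name `robots_placement` and the `controls_domain_root` flag are hypothetical. Whether a deployment controls DOMAIN_ROOT is a fact the caller has to supply, since it cannot be derived from the URLs alone.

```python
from urllib.parse import urljoin


def robots_placement(domain_root, discovery_root, controls_domain_root):
    """Decide where robots.txt goes and whether it is authoritative."""
    if controls_domain_root:
        # Crawler-standard location: DOMAIN_ROOT/robots.txt.
        return {"url": urljoin(domain_root, "robots.txt"), "authoritative": True}
    # Companion only: crawlers are not expected to treat this as control.
    return {"url": urljoin(discovery_root, "robots.txt"), "authoritative": False}


placement = robots_placement(
    "https://example.com/", "https://example.com/project/", controls_domain_root=False
)
print(placement)
```

Returning the `authoritative` flag alongside the URL keeps the distinction visible to whatever step writes the file, so a companion robots.txt is never reported as crawler control.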
Path Rules Summary
| Element | Rule |
|---|---|
| canonical | Always absolute URL; equals CURRENT_PAGE_URL |
| og:url | Always absolute URL; equals CURRENT_PAGE_URL |
| og:image | Always absolute URL |
| robots sitemap entry | Must be absolute URL |
| llms.txt / raw-manifest.json in HEAD | Relative allowed at DISCOVERY_ROOT; absolute preferred for subpages and subpath deployments |
| Root-relative paths (/llms.txt) | Only allowed when DISCOVERY_ROOT equals DOMAIN_ROOT |
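A minimal Python sketch of these path rules follows. It renders the canonical, og:url, and alternate links with absolute URLs, which the table always allows, and adds a separate check for the root-relative case. The function names `render_discovery_head` and `root_relative_allowed` are hypothetical, and the roots are assumed to end with a slash.

```python
from html import escape
from urllib.parse import urljoin


def render_discovery_head(current_page_url, discovery_root):
    """Render the path-sensitive HEAD elements using absolute URLs only."""
    llms_url = urljoin(discovery_root, "llms.txt")
    manifest_url = urljoin(discovery_root, "raw-manifest.json")
    page = escape(current_page_url)
    lines = [
        # canonical and og:url both equal CURRENT_PAGE_URL.
        f'<link rel="canonical" href="{page}">',
        f'<meta property="og:url" content="{page}">',
        f'<link rel="alternate" type="text/plain" href="{escape(llms_url)}" '
        'title="LLM-readable index">',
        f'<link rel="alternate" type="application/json" href="{escape(manifest_url)}" '
        'title="Machine-readable manifest">',
    ]
    return "\n".join(lines)


def root_relative_allowed(discovery_root, domain_root):
    """Root-relative paths such as /llms.txt are valid only at the domain root."""
    return discovery_root == domain_root


print(render_discovery_head(
    "https://example.com/project/docs.html", "https://example.com/project/"
))
print(root_relative_allowed("https://example.com/project/", "https://example.com/"))
```

Emitting absolute URLs everywhere is the conservative choice for subpages and subpath deployments. A relative `llms.txt` link would only resolve correctly for a page that sits at DISCOVERY_ROOT.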
JSON-LD Rules
| Schema type | Use when |
|---|---|
| SoftwareSourceCode | Page represents a repo, protocol, framework, kernel, compiler, developer tool, operational system, or technical artefact |
| WebPage | Page is an article, documentation page, static editorial page, or human-facing explainer without repo/tool identity |
| WebSite | Multiple canonical pages exist under the same SITE_ROOT |
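The type selection can be sketched in Python as follows. The helper names and the `is_technical_artifact` flag are hypothetical, and only a few common schema.org properties are filled in. How WebSite combines with the per-page types is not stated in the table, so this sketch treats it as a separate object, which is an assumption.

```python
import json


def page_jsonld(name, url, description, is_technical_artifact):
    """Pick SoftwareSourceCode or WebPage for a single page."""
    schema_type = "SoftwareSourceCode" if is_technical_artifact else "WebPage"
    return {
        "@context": "https://schema.org",
        "@type": schema_type,
        "name": name,
        "url": url,
        "description": description,
    }


def site_jsonld(name, site_root):
    """WebSite object for artifacts with multiple canonical pages."""
    return {"@context": "https://schema.org", "@type": "WebSite", "name": name, "url": site_root}


def as_script_tag(data):
    """Wrap JSON-LD in a script element, escaping "</" so the tag cannot close early."""
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


data = page_jsonld(
    "Example Protocol", "https://example.com/project/", "A protocol page.", True
)
print(as_script_tag(data))
```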
Footer Rules
Footer links must be visible, low-noise, small, non-dominant, machine-oriented, and not part of the main editorial hierarchy. A small machine-readable row in the footer is correct. A large visible manifest block inside the main page is wrong.
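One way such a footer row could be generated is sketched below in Python. The function name `render_machine_footer` and the `machine-links` class are hypothetical, and the markup is only one possible shape for a small, non-dominant row.

```python
from html import escape


def render_machine_footer(llms_url, raw_manifest_url, sitemap_url):
    """Render a single small row of machine-oriented links."""
    links = [
        ("llms.txt", llms_url),
        ("raw-manifest.json", raw_manifest_url),
        ("sitemap.xml", sitemap_url),
    ]
    anchors = " · ".join(
        f'<a href="{escape(url)}">{escape(label)}</a>' for label, url in links
    )
    # <small> keeps the row visually secondary to the editorial content.
    return f'<footer class="machine-links"><small>{anchors}</small></footer>'


print(render_machine_footer(
    "https://example.com/project/llms.txt",
    "https://example.com/project/raw-manifest.json",
    "https://example.com/project/sitemap.xml",
))
```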
When to Use This Protocol vs. the GitHub Pages Discovery Set Protocol
Use this protocol (v1.2)
- Any static site hosted on a CDN, custom server, or shared hosting
- Subpath deployments under a domain you control
- GitHub Pages user/org sites or project sites when you want the general rules
- Any deployment where GitHub-specific path constraints are not the primary concern
Use GitHub Pages Discovery Set (v1)
- When the deployment is specifically GitHub Pages
- When you need explicit handling of USER_OR_ORG_SITE vs PROJECT_SITE vs CUSTOM_DOMAIN_SITE classification
- When GitHub Pages path rules and project-site root-relative path confusion are the primary risk