The HTML and Website Discovery Set Protocol v1.2 is a working contract for building the machine-readable discovery layer on top of any static HTML page or website. It provides infrastructure that helps crawlers, bots, LLM readers, and external systems understand a published artifact — without competing with the human-facing page. This protocol applies to static HTML publications in general. If the deployment target is specifically GitHub Pages, see the GitHub Pages Discovery Set Protocol, which handles GitHub Pages path rules, deployment types, and project-site risks as a dedicated scope.

How This Protocol Differs from the GitHub Pages Discovery Set Protocol

This protocol applies to any static HTML publication regardless of hosting provider. The GitHub Pages Discovery Set Protocol applies only to artifacts deployed through GitHub Pages. Use this protocol as the general case; use the GitHub Pages one only when the deployment is confirmed as GitHub Pages.
| Dimension | This protocol (v1.2) | GitHub Pages Discovery Set (v1) |
| --- | --- | --- |
| Scope | Any static HTML page or website | GitHub Pages only |
| Deployment model | Generic static hosting, CDN, subpath, custom domain | GitHub Pages user/org site, project site, custom domain via GitHub Pages |
| Path risk | Subpath deployments, DOMAIN_ROOT vs SITE_ROOT | PROJECT_SITE root-relative path confusion |
| Required gate variables | SITE_ROOT, DOMAIN_ROOT, PROJECT_ROOT, CURRENT_PAGE_URL, DISCOVERY_ROOT | OWNER, REPO, PAGES_TYPE, CUSTOM_DOMAIN, SITE_ROOT, DOMAIN_ROOT, DISCOVERY_ROOT |

Core Principle

Discovery Set is infrastructure. It helps crawlers, bots, LLM readers, and external systems understand the published artifact. It must not compete with the human page.

Publication Model

This protocol’s default model is:
  • A static HTML artifact
  • One absolute publication root
  • One machine-readable discovery root
  • One or more canonical HTML URLs
  • Discovery files that help machines understand the artifact without competing with the human page
Use GitHub Pages only as a special deployment case under this model, not as the default assumption.

Publication Variables

Before generating any discovery files, close all required variables. If any value is missing, do not guess paths — ask for the missing deployment root or produce placeholders only.
| Variable | Description | Default |
| --- | --- | --- |
| SITE_ROOT | Absolute URL where the published artifact begins | Must be provided |
| DOMAIN_ROOT | Absolute URL of the bare host or domain | Must be provided |
| PROJECT_ROOT | Optional subpath root below DOMAIN_ROOT | When present, SITE_ROOT usually equals PROJECT_ROOT |
| CURRENT_PAGE_URL | Absolute canonical URL of the current HTML page | Must be provided per page |
| DISCOVERY_ROOT | Root where llms.txt, raw-manifest.json, sitemap.xml, and companion files live | Equals SITE_ROOT |
| LLMS_URL | Resolved URL to llms.txt | URL_JOIN(DISCOVERY_ROOT, "llms.txt") |
| RAW_MANIFEST_URL | Resolved URL to raw-manifest.json | URL_JOIN(DISCOVERY_ROOT, "raw-manifest.json") |
| SITEMAP_URL | Resolved URL to sitemap.xml | URL_JOIN(DISCOVERY_ROOT, "sitemap.xml") |
| ROBOTS_URL (authoritative) | Resolved URL to authoritative robots.txt | URL_JOIN(DOMAIN_ROOT, "robots.txt") |
| ROBOTS_URL (companion) | Resolved URL to project-level companion robots.txt | URL_JOIN(DISCOVERY_ROOT, "robots.txt") |
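The gate on required variables can be sketched as a small check that refuses to proceed while a value is still open. This is a minimal sketch, not part of the protocol text: the function names `missing_variables` and `close_variables` and the dictionary shape are assumptions chosen for illustration.

```python
# Hypothetical helper names; the protocol defines the variables, not this API.
REQUIRED = ("SITE_ROOT", "DOMAIN_ROOT", "CURRENT_PAGE_URL")


def missing_variables(values: dict) -> list:
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED if not values.get(name)]


def close_variables(values: dict) -> dict:
    """Return a closed variable set, or raise instead of guessing paths."""
    missing = missing_variables(values)
    if missing:
        raise ValueError("Missing deployment variables: " + ", ".join(missing))
    closed = dict(values)
    # DISCOVERY_ROOT defaults to SITE_ROOT when it is not given explicitly.
    if not closed.get("DISCOVERY_ROOT"):
        closed["DISCOVERY_ROOT"] = closed["SITE_ROOT"]
    return closed
```

Raising on a missing value mirrors the rule to ask for the deployment root rather than guess it; a caller that prefers placeholders could catch the error and emit bracketed names such as `[SITE_ROOT]` instead.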
Example variable sets
| Scenario | Variables |
| --- | --- |
| Domain-root site | SITE_ROOT = https://example.com/ · DISCOVERY_ROOT = https://example.com/ |
| Subpath site | SITE_ROOT = https://example.com/project/ · DISCOVERY_ROOT = https://example.com/project/ |
| GitHub Pages project site | SITE_ROOT = https://USER.github.io/REPO/ · DISCOVERY_ROOT = https://USER.github.io/REPO/ |
When constructing URLs, join path segments with exactly one slash. Do not construct discovery URLs through raw string concatenation. Never assume a root-relative discovery path until SITE_ROOT and DOMAIN_ROOT are closed. Prefer absolute URLs when in doubt.
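One way to honor the single-slash rule is a small join helper that normalizes the boundary before combining the parts. The sketch below is an assumption about how `URL_JOIN` might behave for plain file names; the protocol names the operation but does not define an implementation, and the dictionary keys for the two robots.txt URLs are illustrative.

```python
def url_join(root: str, name: str) -> str:
    """Join a root URL and a file name with exactly one slash between them."""
    # Strip slashes on both sides of the boundary, then add a single one back.
    return root.rstrip("/") + "/" + name.lstrip("/")


def discovery_urls(discovery_root: str, domain_root: str) -> dict:
    """Resolve the discovery file URLs from closed roots (sketch only)."""
    return {
        "LLMS_URL": url_join(discovery_root, "llms.txt"),
        "RAW_MANIFEST_URL": url_join(discovery_root, "raw-manifest.json"),
        "SITEMAP_URL": url_join(discovery_root, "sitemap.xml"),
        # Hypothetical key names for the two robots.txt variants.
        "ROBOTS_URL_AUTHORITATIVE": url_join(domain_root, "robots.txt"),
        "ROBOTS_URL_COMPANION": url_join(discovery_root, "robots.txt"),
    }


urls = discovery_urls("https://example.com/project", "https://example.com/")
print(urls["LLMS_URL"])  # https://example.com/project/llms.txt
```

The helper gives the same result whether or not the root carries a trailing slash, which is the failure mode raw concatenation tends to hit.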

GitHub Pages Special Case

GitHub Pages project sites are a special risk case. For a GitHub Pages project site, SITE_ROOT is usually https://USER.github.io/REPO/. Root-relative paths such as /llms.txt, /raw-manifest.json, or /sitemap.xml usually point to https://USER.github.io/ — not to https://USER.github.io/REPO/. For GitHub Pages project sites, use absolute URLs or correct relative paths that preserve /REPO/.
If the deployment target is GitHub Pages, also determine before generating any URL: USER or organisation name, repository name, whether the site is a user/org site or project site, whether a custom domain is active, exact SITE_ROOT, exact DOMAIN_ROOT, and exact DISCOVERY_ROOT.
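The root-relative risk can be reproduced with standard URL resolution. The snippet below uses Python's `urllib.parse.urljoin`, which resolves references the way a browser would; the `user` and `repo` names are placeholders, not a real site.

```python
from urllib.parse import urljoin

# Hypothetical project site; "user" and "repo" are placeholders.
SITE_ROOT = "https://user.github.io/repo/"

# A root-relative path is resolved against the host and drops /repo/.
root_relative = urljoin(SITE_ROOT, "/llms.txt")

# A plain relative path is resolved against the directory and keeps /repo/.
relative = urljoin(SITE_ROOT, "llms.txt")

# Without the trailing slash, "repo" is treated as a file name and is replaced.
no_trailing_slash = urljoin(SITE_ROOT.rstrip("/"), "llms.txt")

print(root_relative)      # https://user.github.io/llms.txt
print(relative)           # https://user.github.io/repo/llms.txt
print(no_trailing_slash)  # https://user.github.io/llms.txt
```

The third case suggests keeping the trailing slash on SITE_ROOT whenever relative paths are resolved against it.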

Site vs. Page Classification

Does this published artifact expose more than one canonical HTML URL under the same SITE_ROOT?
| Answer | Type |
| --- | --- |
| Yes | SITE: multiple canonical HTML pages |
| No | PAGE: single canonical HTML URL |
Anchor links, repository links outside SITE_ROOT, and links to llms.txt, raw-manifest.json, sitemap.xml, assets, or external pages do not make a SITE. Only multiple canonical HTML pages under the same SITE_ROOT make a SITE.
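The classification rule can be sketched as a filter followed by a count. The function below is illustrative only: the name `classify`, the `DISCOVERY_FILES` set, and the assumption that the caller has already excluded asset URLs (images, stylesheets, scripts) are not defined by the protocol.

```python
from urllib.parse import urldefrag, urlsplit

# Discovery files never count as canonical HTML pages.
DISCOVERY_FILES = {"llms.txt", "raw-manifest.json", "sitemap.xml", "robots.txt"}


def classify(site_root: str, candidate_urls: list) -> str:
    """Return "SITE" or "PAGE" for the URLs linked from an artifact."""
    pages = set()
    for raw in candidate_urls:
        url = urldefrag(raw).url  # anchor links collapse onto their page
        if not url.startswith(site_root):
            continue  # repository links and external pages are ignored
        if urlsplit(url).path.rsplit("/", 1)[-1] in DISCOVERY_FILES:
            continue
        pages.add(url)
    return "SITE" if len(pages) > 1 else "PAGE"


links = [
    "https://example.com/project/",
    "https://example.com/project/#install",
    "https://example.com/project/llms.txt",
    "https://github.com/example/project",
]
print(classify("https://example.com/project/", links))  # PAGE
```

Note that this sketch treats `.../project/` and `.../project/index.html` as different URLs, so inputs should already be in canonical form.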

What the Discovery Set Includes

Definition (PAGE): Single canonical HTML URL, usually one index.html.
HEAD: required elements
  • <title>
  • <meta name="description">
  • <meta name="robots" content="index, follow">
  • <link rel="canonical" href="[CURRENT_PAGE_URL]">
  • <link rel="alternate" type="text/plain" href="[LLMS_URL]" title="LLM-readable index">
  • <link rel="alternate" type="application/json" href="[RAW_MANIFEST_URL]" title="Machine-readable manifest">
  • Open Graph: og:title, og:description, og:type="website", og:url (absolute), og:image (absolute)
  • JSON-LD: SoftwareSourceCode for repos/protocols/tools/code artefacts; WebPage for editorial/static pages
  • Favicon
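The canonical and alternate link elements from this list can be rendered from the closed variables. This is a minimal sketch assuming a hypothetical helper name, `head_discovery_tags`; it covers only the robots meta tag and the three link elements, and leaves title, description, Open Graph, JSON-LD, and favicon to the page author.

```python
from html import escape


def head_discovery_tags(current_page_url: str, llms_url: str, manifest_url: str) -> str:
    """Render the robots meta tag plus canonical and alternate links."""

    def attr(value: str) -> str:
        # Escape quotes and ampersands so the URL is safe inside an attribute.
        return escape(value, quote=True)

    return "\n".join([
        '<meta name="robots" content="index, follow">',
        f'<link rel="canonical" href="{attr(current_page_url)}">',
        f'<link rel="alternate" type="text/plain" href="{attr(llms_url)}" '
        'title="LLM-readable index">',
        f'<link rel="alternate" type="application/json" href="{attr(manifest_url)}" '
        'title="Machine-readable manifest">',
    ])


print(head_discovery_tags(
    "https://example.com/project/",
    "https://example.com/project/llms.txt",
    "https://example.com/project/raw-manifest.json",
))
```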
Discovery root files
  • llms.txt
  • raw-manifest.json
  • sitemap.xml — single <url> entry for CURRENT_PAGE_URL
  • robots.txt — required only when DOMAIN_ROOT is controlled; optional companion otherwise
sitemap.xml minimum
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>[CURRENT_PAGE_URL]</loc>
  </url>
</urlset>
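A sitemap of this shape can also be generated rather than written by hand. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `build_sitemap` is an assumption, and the output is not pretty-printed.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(canonical_urls: list) -> str:
    """Build sitemap.xml listing canonical HTML URLs only."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in canonical_urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the text for us.
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# For a PAGE, pass the single CURRENT_PAGE_URL.
print(build_sitemap(["https://example.com/project/"]))
```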
Footer
Low-noise machine links for a page at DISCOVERY_ROOT:
<!-- Discovery Set: low-noise links for crawlers, bots, and LLM readers. -->
<div class="machine-links" aria-label="Machine-readable project files">
  <span>Machine-readable:</span>
  <a href="llms.txt">llms.txt</a>
  <a href="raw-manifest.json">manifest</a>
  <a href="sitemap.xml">sitemap</a>
</div>
If CURRENT_PAGE_URL is not at DISCOVERY_ROOT, use correct relative paths or absolute URLs.
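For a page below DISCOVERY_ROOT, the correct relative path can be computed from the two URLs. This is a sketch under the assumption that both URLs share a host; the helper name `relative_link` is hypothetical, and it falls back to the absolute URL when the hosts differ.

```python
import posixpath
from urllib.parse import urlsplit


def relative_link(current_page_url: str, target_url: str) -> str:
    """Return a relative href from the current page to a discovery file."""
    page = urlsplit(current_page_url)
    target = urlsplit(target_url)
    if (page.scheme, page.netloc) != (target.scheme, target.netloc):
        return target_url  # different origin: keep the absolute URL
    # A path ending in "/" is a directory; otherwise use the parent directory.
    if page.path.endswith("/"):
        page_dir = page.path
    else:
        page_dir = posixpath.dirname(page.path)
    return posixpath.relpath(target.path, start=page_dir or "/")


print(relative_link(
    "https://example.com/project/docs/guide.html",
    "https://example.com/project/llms.txt",
))  # ../llms.txt
```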

Discovery File Location Rules

| File | Location | Notes |
| --- | --- | --- |
| llms.txt | DISCOVERY_ROOT only | Readable project or site index for LLM/crawler use. Not stored in assets/. |
| raw-manifest.json | DISCOVERY_ROOT only | Structured machine-readable project or site manifest. Not stored in assets/. |
| sitemap.xml | DISCOVERY_ROOT only | Lists canonical HTML URLs only. |
| robots.txt (authoritative) | DOMAIN_ROOT only | Only when authoritative crawler control is required and domain-root control exists. |
| robots.txt (companion) | DISCOVERY_ROOT only | Only as a project-level discovery companion when domain-root control does not exist. |

Robots Authority Rule

robots.txt has a crawler-standard location at DOMAIN_ROOT/robots.txt.
  • If the deployment controls DOMAIN_ROOT, place authoritative robots.txt there.
  • If the artifact is published under a subpath and does not control DOMAIN_ROOT, a project-level robots.txt at DISCOVERY_ROOT may be included as a discovery companion — but it must not be treated as authoritative crawler control.
robots.txt minimum — authoritative domain root
User-agent: *
Allow: /

Sitemap: [SITEMAP_URL]
robots.txt minimum — project-level companion
User-agent: *
Allow: /

Sitemap: [SITEMAP_URL]
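The authority rule amounts to one decision: where the file goes and whether it may be treated as authoritative. The sketch below assumes a boolean supplied by the deployer, `controls_domain_root`, and a hypothetical function name, `robots_plan`; neither is defined by the protocol.

```python
def robots_plan(domain_root: str, discovery_root: str,
                sitemap_url: str, controls_domain_root: bool) -> dict:
    """Decide where robots.txt lives and whether it is authoritative."""
    if not sitemap_url.startswith(("http://", "https://")):
        raise ValueError("The Sitemap entry must be an absolute URL")
    body = f"User-agent: *\nAllow: /\n\nSitemap: {sitemap_url}\n"
    if controls_domain_root:
        root, authoritative = domain_root, True
    else:
        # Companion copy only: crawlers are not expected to honor it.
        root, authoritative = discovery_root, False
    return {
        "url": root.rstrip("/") + "/robots.txt",
        "authoritative": authoritative,
        "body": body,
    }


plan = robots_plan(
    "https://example.com/",
    "https://example.com/project/",
    "https://example.com/project/sitemap.xml",
    controls_domain_root=False,
)
print(plan["url"])  # https://example.com/project/robots.txt
```

The file body is identical in both cases; only the location and the authority flag change.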

Path Rules Summary

| Element | Rule |
| --- | --- |
| canonical | Always absolute URL; equals CURRENT_PAGE_URL |
| og:url | Always absolute URL; equals CURRENT_PAGE_URL |
| og:image | Always absolute URL |
| robots sitemap entry | Must be absolute URL |
| llms.txt / raw-manifest.json in HEAD | Relative allowed at DISCOVERY_ROOT; absolute preferred for subpages and subpath deployments |
| Root-relative paths (/llms.txt) | Only allowed when DISCOVERY_ROOT equals DOMAIN_ROOT |
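These rules lend themselves to a mechanical check before publishing. The sketch below is one possible linter, not part of the protocol: the dictionary keys (`llms_href`, `manifest_href`, and so on) and the function names are assumptions made for the example.

```python
from urllib.parse import urlsplit


def is_absolute(url: str) -> bool:
    """True for http(s) URLs that include a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def check_path_rules(page: dict, discovery_root: str, domain_root: str) -> list:
    """Return a list of rule violations for one page (empty when clean)."""
    problems = []
    for key in ("canonical", "og:url", "og:image"):
        if not is_absolute(page[key]):
            problems.append(f"{key} must be an absolute URL")
    for key in ("canonical", "og:url"):
        if page[key] != page["CURRENT_PAGE_URL"]:
            problems.append(f"{key} must equal CURRENT_PAGE_URL")
    # Root-relative hrefs are only safe when both roots are the same URL.
    roots_match = discovery_root.rstrip("/") == domain_root.rstrip("/")
    for key in ("llms_href", "manifest_href"):
        href = page[key]
        if href.startswith("/") and not href.startswith("//") and not roots_match:
            problems.append(f"{key} is root-relative but DISCOVERY_ROOT != DOMAIN_ROOT")
    return problems


page = {
    "CURRENT_PAGE_URL": "https://example.com/project/",
    "canonical": "https://example.com/project/",
    "og:url": "https://example.com/project/",
    "og:image": "https://example.com/project/cover.png",
    "llms_href": "/llms.txt",
    "manifest_href": "raw-manifest.json",
}
print(check_path_rules(page, "https://example.com/project/", "https://example.com/"))
```

In this example only `llms_href` is reported, because the site lives under a subpath and `/llms.txt` would resolve to the domain root.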

JSON-LD Rules

| Schema type | Use when |
| --- | --- |
| SoftwareSourceCode | Page represents a repo, protocol, framework, kernel, compiler, developer tool, operational system, or technical artefact |
| WebPage | Page is an article, documentation page, static editorial page, or human-facing explainer without repo/tool identity |
| WebSite | Multiple canonical pages exist under the same SITE_ROOT |
Do not inject generic SEO terms or list concepts that are not actually present in the artifact. Keywords and about fields must reflect the real repo, page, site, or artefact.
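A JSON-LD block following these rules could be assembled as shown below. This is a sketch with several assumptions: the `kind` labels and the function name `build_jsonld` are invented for the example, the caller decides which kind applies, and the example values are placeholders rather than a real project.

```python
import json

# Hypothetical labels mapped onto the schema.org types listed above.
SCHEMA_TYPES = {
    "technical_artifact": "SoftwareSourceCode",
    "editorial_page": "WebPage",
    "multi_page_site": "WebSite",
}


def build_jsonld(kind: str, name: str, description: str,
                 url: str, keywords: list) -> str:
    """Serialize a minimal JSON-LD object for the page or site."""
    data = {
        "@context": "https://schema.org",
        "@type": SCHEMA_TYPES[kind],
        "name": name,
        "description": description,
        "url": url,
    }
    if keywords:
        # Only terms actually present in the artifact belong here.
        data["keywords"] = ", ".join(keywords)
    return json.dumps(data, indent=2)


print(build_jsonld(
    "technical_artifact",
    name="Example Protocol",
    description="Placeholder description of a published protocol.",
    url="https://example.com/project/",
    keywords=["discovery set", "static HTML"],
))
```

The result would go inside a `<script type="application/ld+json">` element in the page HEAD.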
Footer links must be visible, low-noise, small, non-dominant, machine-oriented, and not part of the main editorial hierarchy. A small machine-readable row in the footer is correct. A large visible manifest block inside the main page is wrong.

When to Use This Protocol vs. the GitHub Pages Discovery Set Protocol

Use this protocol (v1.2)

  • Any static site hosted on a CDN, custom server, or shared hosting
  • Subpath deployments under a domain you control
  • GitHub Pages user/org sites or project sites when you want the general rules
  • Any deployment where GitHub-specific path constraints are not the primary concern

Use GitHub Pages Discovery Set (v1)

  • When the deployment is specifically GitHub Pages
  • When you need explicit handling of USER_OR_ORG_SITE vs PROJECT_SITE vs CUSTOM_DOMAIN_SITE classification
  • When GitHub Pages path rules and project-site root-relative path confusion are the primary risk
