The portfolio scraper takes a URL and a browser provider, fetches the fully rendered page, and runs the complete GitResolve resolution pipeline on the extracted links. It is the primary entry point when you have a candidate’s personal website, GitHub Pages site, or any other public portfolio URL.
scrapePortfolio
Fetches a portfolio page using the supplied BrowserProvider, extracts every link and git URL from the HTML, and returns a fully resolved ResolverResult.
Parameters
Absolute URL of the portfolio page to scrape. Must include the scheme (https:// or http://).
The browser provider instance responsible for fetching the page content. Obtain one from createProvider or instantiate a provider class directly. The provider choice affects JavaScript rendering: FetchProvider handles static HTML only; PuppeteerProvider and BrowserlessProvider render SPAs.
Optional pre-resolved owner profile. When supplied, owner disambiguation is bypassed and this profile is used directly with confidence: 'high'. Useful when the caller already knows the candidate’s git username from another source (e.g. their résumé).
Returns
Promise<ResolverResult> — this function never throws. Any network or parsing error is captured in result.error and the function returns normally.
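Because failures are reported through result.error rather than thrown, callers can branch on that field instead of wrapping the call in try/catch. The following is a minimal sketch under stated assumptions: ResultLike is a hypothetical local stand-in that includes only fields named on this page, error is assumed to be a string, and describeResult is an illustrative helper, not part of the library.

```typescript
// Hypothetical stand-in for the result shape. Only fields named on this page
// are included; the real ResolverResult has more.
interface ResultLike {
  url: string;
  confidence: string; // e.g. 'high' or 'none'
  warnings: string[];
  error?: string; // assumed to be a string message
}

// Turn a result into a one-line status. No try/catch is needed here,
// because failures arrive in `error` instead of being thrown.
function describeResult(result: ResultLike): string {
  if (result.error) {
    return `failed: ${result.url} (${result.error})`;
  }
  return `ok: ${result.url} (confidence: ${result.confidence}, ${result.warnings.length} warning(s))`;
}
```

Checking error first matters because, as described in the field list, the arrays are empty and confidence is 'none' whenever error is set.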
The original url argument, unchanged.
Always 'portfolio' for results from this function.
Best-guess git profile for the page owner, or null if none could be determined.
How confident the disambiguator is in the resolved owner. See resolveOwnerAndCategorize for the full confidence logic.
Repos where repo.username matches the resolved owner (case-insensitive).
Explicit PR and issue links found on the page.
Repos whose owner username does not match the resolved owner, such as external references or dependencies.
Every ExtractedGitLink parsed from the page, unfiltered and uncategorised.
Diagnostic messages. Includes a note about which provider was used whenever the page fetch succeeds (see below). May include messages from the disambiguator.
Set when a top-level fetch or parsing error occurred. When error is set, all arrays will be empty and confidence will be 'none'.
Provider warning messages
When the page fetch succeeds, one provider message is prepended to warnings:
| Provider | Warning text |
|---|---|
| FetchProvider | "Used fetch provider (SPA rendering layer like Puppeteer/Browserless is missing)" |
| PuppeteerProvider | "Used puppeteer provider for page content" |
| BrowserlessProvider | "Used browserless provider for page content" |
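As a rough illustration, the table can be read as a lookup whose message is placed at the front of warnings. The keys 'fetch', 'puppeteer', and 'browserless' and the helper name withProviderWarning are hypothetical; only the message strings come from the table.

```typescript
// Hypothetical provider keys; the message strings are copied from the table.
type ProviderKind = "fetch" | "puppeteer" | "browserless";

const PROVIDER_WARNINGS: Record<ProviderKind, string> = {
  fetch: "Used fetch provider (SPA rendering layer like Puppeteer/Browserless is missing)",
  puppeteer: "Used puppeteer provider for page content",
  browserless: "Used browserless provider for page content",
};

// Return a new warnings array with the provider message in first position.
// The input array is left unmodified.
function withProviderWarning(kind: ProviderKind, warnings: string[]): string[] {
  return [PROVIDER_WARNINGS[kind], ...warnings];
}
```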
The provider warning is added only after provider.getPageContent() returns successfully. If the fetch throws, result.error is set and no provider warning is added to warnings.
How it works
Fetch page HTML
Calls provider.getPageContent(url) to retrieve the fully rendered HTML string. If this throws, result.error is set and the function returns early.
Extract all links
Passes the HTML and base URL to extractLinksFromHtml, which combines href attribute scanning with extractGitUrlsFromText on the full HTML body. All links are deduplicated.
Filter for git provider URLs
The full link list is passed through isGitProviderUrl. Non-git URLs are discarded.
Parse each git URL
Each surviving URL is passed to parseGitLink, which classifies it as a profile, repo, gist, PR, issue, or other. null results (reserved paths, static assets) are silently dropped.
Examples
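The four steps above can be sketched end to end as follows. This is a simplified illustration, not the library's implementation: PageProvider, isGitUrl, parseLink, sketchPipeline, and the host list are hypothetical stand-ins, and the extraction step covers href values only. The one detail taken from this page is that a provider exposes getPageContent(url).

```typescript
// Structural stand-in for a browser provider.
interface PageProvider {
  getPageContent(url: string): Promise<string>;
}

interface ParsedLink {
  username: string;
  repo?: string;
}

interface SketchResult {
  url: string;
  links: ParsedLink[];
  error?: string;
}

// Assumed host list; the library's isGitProviderUrl may recognise others.
const GIT_HOSTS = new Set(["github.com", "gitlab.com"]);

function isGitUrl(link: string): boolean {
  try {
    return GIT_HOSTS.has(new URL(link).hostname);
  } catch {
    return false; // relative or malformed links are not git URLs
  }
}

// Much simpler than parseGitLink: reads at most /username/repo from the path.
function parseLink(link: string): ParsedLink | null {
  const [username, repo] = new URL(link).pathname.split("/").filter(Boolean);
  if (!username) return null; // nothing to classify
  return repo ? { username, repo } : { username };
}

async function sketchPipeline(url: string, provider: PageProvider): Promise<SketchResult> {
  let html: string;
  try {
    html = await provider.getPageContent(url); // step 1: fetch
  } catch (err) {
    // Failure is reported in the result rather than thrown.
    return { url, links: [], error: String(err) };
  }

  // Step 2: extract href values and deduplicate them with a Set.
  const hrefPattern = /href\s*=\s*["']([^"']+)["']/gi;
  const seen = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    seen.add(match[1]);
  }

  // Steps 3 and 4: keep git provider URLs, parse them, drop null results.
  const links = Array.from(seen)
    .filter(isGitUrl)
    .map(parseLink)
    .filter((link): link is ParsedLink => link !== null);
  return { url, links };
}
```

Any object with a getPageContent method can stand in for the provider here, which makes the sketch easy to exercise with canned HTML.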
extractLinksFromHtml
Extracts all links from a raw HTML string. Returns both href attribute values (resolved against a base URL) and any git provider URLs found anywhere in the raw HTML text.
Parameters
Raw HTML string. This is typically the value returned by provider.getPageContent().
The URL the HTML was fetched from. Used to resolve relative href values like /projects or ../about into absolute URLs.
Returns
A deduplicated string[] of all extracted URLs. The array contains all links, not just git provider URLs. Callers are responsible for filtering. scrapePortfolio filters this list with isGitProviderUrl before further processing.
How it works
The function uses two complementary strategies and merges the results:
- href attribute scanning: A regex (/href\s*=\s*["']([^"']+)["']/gi) extracts every href value. Relative paths (starting with / or not starting with http) are resolved to absolute URLs using new URL(href, baseUrl). Fragment links (#), mailto: links, and unparseable hrefs are silently dropped.
- Raw text scanning: extractGitUrlsFromText is run on the full HTML string, catching git URLs that appear in data attributes, comments, inline JavaScript, or JSON blobs embedded in the page, anywhere a regex can find them.
The merged results are deduplicated with a Set before returning.
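The two strategies can be sketched as below. This is a hedged approximation rather than the library's code: the href regex and the new URL(href, baseUrl) call are the ones quoted above, while sketchExtractLinks and GIT_URL_PATTERN are hypothetical, and the hosts the real extractGitUrlsFromText recognises may differ.

```typescript
// Hypothetical pattern for git provider URLs in raw text (assumed hosts).
const GIT_URL_PATTERN = /https?:\/\/(?:github\.com|gitlab\.com)\/[^\s"'<>)]+/gi;

function sketchExtractLinks(html: string, baseUrl: string): string[] {
  const found = new Set<string>(); // a Set deduplicates as links are added

  // Strategy 1: href attribute scanning, using the regex quoted above.
  const hrefPattern = /href\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    const href = match[1].trim();
    // Skip fragment links and mailto: links.
    if (href.startsWith("#") || href.toLowerCase().startsWith("mailto:")) continue;
    try {
      // new URL resolves relative hrefs against baseUrl; absolute ones pass through.
      found.add(new URL(href, baseUrl).href);
    } catch {
      // Unparseable href: drop it silently.
    }
  }

  // Strategy 2: raw text scanning for git URLs anywhere in the markup.
  for (const raw of html.match(GIT_URL_PATTERN) ?? []) {
    found.add(raw);
  }

  return Array.from(found);
}
```

A repo URL that appears both in an href and in an inline script is returned once, since both strategies write into the same Set.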