scrapePortfolio — Portfolio Website Scraper Reference

The portfolio scraper takes a URL and a browser provider, fetches the page through that provider, and runs the complete GitResolve resolution pipeline on the extracted links. It is the primary entry point when you have a candidate’s personal website, GitHub Pages site, or any other public portfolio URL.

`scrapePortfolio`

Fetches a portfolio page using the supplied BrowserProvider, extracts every link and git URL from the HTML, and returns a fully resolved ResolverResult.

async function scrapePortfolio(
  url: string,
  provider: BrowserProvider,
  knownOwnerProfile?: ExtractedGitLink
): Promise<ResolverResult>

Parameters

url

string

required

Absolute URL of the portfolio page to scrape. Must include the scheme (https:// or http://).

provider

BrowserProvider

required

The browser provider instance responsible for fetching the page content. Obtain one from createProvider or instantiate a provider class directly. The provider choice affects JavaScript rendering — FetchProvider handles static HTML only; PuppeteerProvider and BrowserlessProvider render SPAs.

knownOwnerProfile

ExtractedGitLink

Optional pre-resolved owner profile. When supplied, owner disambiguation is bypassed and this profile is used directly with confidence: 'high'. Useful when the caller already knows the candidate’s git username from another source (e.g. their résumé).

Returns

Promise<ResolverResult> — this function never throws. Any network or parsing error is captured in result.error and the function returns normally.

source

string

The original url argument, unchanged.

sourceType

'portfolio'

Always 'portfolio' for results from this function.

ownerProfile

ExtractedGitLink | null

Best-guess git profile for the page owner, or null if none could be determined.

confidence

'high' | 'medium' | 'low' | 'none'

How confident the disambiguator is in the resolved owner. See resolveOwnerAndCategorize for the full confidence logic.

ownedRepos

ExtractedGitLink[]

Repos where repo.username matches the resolved owner (case-insensitive).

contributions

ExtractedGitLink[]

Explicit PR and issue links found on the page.

externalRepos

ExtractedGitLink[]

Repos whose owner username does not match the resolved owner — external references or dependencies.

allLinks

ExtractedGitLink[]

Every ExtractedGitLink parsed from the page, unfiltered and uncategorised.

warnings

string[]

Diagnostic messages. Always includes a note about which provider was used (see below). May include messages from the disambiguator.

error

string | undefined

Set when a top-level fetch or parsing error occurred. When error is set, all arrays will be empty and confidence will be 'none'.
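
Because failures are reported through result.error rather than exceptions, callers typically branch on that field instead of wrapping the call in try/catch. The following is a minimal sketch of such a check. ResultSketch and GitLinkSketch are local stand-ins that mirror only the fields documented above (the library's real types may carry more), and summarizeResult is a hypothetical helper, not part of the library.

```typescript
// Local stand-ins for the documented result shape (assumption: only the
// fields listed in this reference; the exported types may differ).
interface GitLinkSketch {
  username: string;
  repo?: string;
}

interface ResultSketch {
  source: string;
  sourceType: 'portfolio';
  ownerProfile: GitLinkSketch | null;
  confidence: 'high' | 'medium' | 'low' | 'none';
  ownedRepos: GitLinkSketch[];
  contributions: GitLinkSketch[];
  externalRepos: GitLinkSketch[];
  allLinks: GitLinkSketch[];
  warnings: string[];
  error?: string;
}

// Hypothetical helper: check `error` first, then read the categorised fields.
function summarizeResult(result: ResultSketch): string {
  if (result.error !== undefined) {
    return `failed: ${result.error}`;
  }
  const owner = result.ownerProfile ? result.ownerProfile.username : 'unknown';
  return (
    `${owner} (${result.confidence}): ` +
    `${result.ownedRepos.length} owned, ` +
    `${result.contributions.length} contributions, ` +
    `${result.externalRepos.length} external`
  );
}
```

Checking error before anything else keeps the success branch free of null handling for the arrays, since they are documented as empty in the failure case.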

Provider warning messages

When the page fetch succeeds, one provider message is prepended to warnings:

| Provider | Warning text |
| --- | --- |
| `FetchProvider` | `"Used fetch provider (SPA rendering layer like Puppeteer/Browserless is missing)"` |
| `PuppeteerProvider` | `"Used puppeteer provider for page content"` |
| `BrowserlessProvider` | `"Used browserless provider for page content"` |

The provider warning is added only after provider.getPageContent() returns successfully. If the fetch throws, result.error is set and no provider warning is added to warnings.
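
The ordering described above (warning only after a successful fetch, error and no warning otherwise) can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: ProviderSketch, its name field, and fetchWithWarning are hypothetical, and only the warning strings are taken from the table above.

```typescript
type ProviderName = 'fetch' | 'puppeteer' | 'browserless';

// Hypothetical provider shape: `name` is an assumed discriminator used only
// to pick the message; getPageContent is the method named in this reference.
interface ProviderSketch {
  name: ProviderName;
  getPageContent(url: string): Promise<string>;
}

// Warning strings as listed in the table above.
const PROVIDER_WARNINGS: Record<ProviderName, string> = {
  fetch: 'Used fetch provider (SPA rendering layer like Puppeteer/Browserless is missing)',
  puppeteer: 'Used puppeteer provider for page content',
  browserless: 'Used browserless provider for page content',
};

interface FetchOutcome {
  html: string | null;
  warnings: string[];
  error?: string;
}

async function fetchWithWarning(
  provider: ProviderSketch,
  url: string,
  warnings: string[] = []
): Promise<FetchOutcome> {
  try {
    const html = await provider.getPageContent(url);
    // Fetch succeeded: prepend the provider note to any existing warnings.
    return { html, warnings: [PROVIDER_WARNINGS[provider.name], ...warnings] };
  } catch (err) {
    // Fetch failed: capture the error and leave warnings untouched.
    const message = err instanceof Error ? err.message : String(err);
    return { html: null, warnings, error: message };
  }
}
```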

How it works

Fetch page HTML

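Calls provider.getPageContent(url) to retrieve the page HTML as a string. PuppeteerProvider and BrowserlessProvider return the rendered DOM; FetchProvider returns the static HTML only. If this call throws, result.error is set and the function returns early.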

Extract all links

Passes the HTML and base URL to extractLinksFromHtml, which combines href attribute scanning with extractGitUrlsFromText on the full HTML body. All links are deduplicated.

Filter for git provider URLs

The full link list is passed through isGitProviderUrl. Non-git URLs are discarded.
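
As a rough illustration of this filtering step, a hostname check could look like the sketch below. The host list (github.com, gitlab.com, bitbucket.org) and the function name isGitProviderUrlSketch are assumptions made for the example; the hosts the real isGitProviderUrl recognises are not stated in this reference.

```typescript
// Assumed host list, for illustration only.
const GIT_HOSTS_SKETCH = new Set<string>(['github.com', 'gitlab.com', 'bitbucket.org']);

// Hypothetical stand-in for isGitProviderUrl: true when the URL parses and
// its hostname (ignoring a leading "www.") is in the assumed list.
function isGitProviderUrlSketch(url: string): boolean {
  try {
    const host = new URL(url).hostname.toLowerCase().replace(/^www\./, '');
    return GIT_HOSTS_SKETCH.has(host);
  } catch {
    // Strings that are not absolute URLs are treated as non-git.
    return false;
  }
}
```

Comparing parsed hostnames rather than searching the raw string avoids false positives such as https://example.com/?ref=github.com.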

Parse each git URL

Each surviving URL is passed to parseGitLink, which classifies it as a profile, repo, gist, PR, issue, or other. null results (reserved paths, static assets) are silently dropped.

Resolve owner and categorise

The collected ExtractedGitLink[] is passed to resolveOwnerAndCategorize with sourceContext: 'portfolio' and the optional knownOwnerProfile. This produces the ownerProfile, confidence, ownedRepos, contributions, and externalRepos fields.
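
The categorisation half of this step can be sketched for the simple case where knownOwnerProfile is supplied. The code below is an assumption-laden illustration: ParsedLinkSketch, its type field values, and categorizeWithKnownOwner are hypothetical, and the sketch performs no owner disambiguation, which the real resolveOwnerAndCategorize handles.

```typescript
// Assumed link categories, following the classification listed above.
type LinkTypeSketch = 'profile' | 'repo' | 'gist' | 'pr' | 'issue' | 'other';

// Hypothetical stand-in for ExtractedGitLink.
interface ParsedLinkSketch {
  url: string;
  type: LinkTypeSketch;
  username: string;
  repo?: string;
}

interface CategorizedSketch {
  ownerProfile: ParsedLinkSketch | null;
  confidence: 'high' | 'none';
  ownedRepos: ParsedLinkSketch[];
  contributions: ParsedLinkSketch[];
  externalRepos: ParsedLinkSketch[];
}

function categorizeWithKnownOwner(
  links: ParsedLinkSketch[],
  knownOwner: ParsedLinkSketch | null
): CategorizedSketch {
  const ownerName = knownOwner ? knownOwner.username.toLowerCase() : null;
  const out: CategorizedSketch = {
    ownerProfile: knownOwner,
    // A supplied owner is trusted directly; without one this sketch gives up.
    confidence: knownOwner ? 'high' : 'none',
    ownedRepos: [],
    contributions: [],
    externalRepos: [],
  };
  for (const link of links) {
    if (link.type === 'pr' || link.type === 'issue') {
      out.contributions.push(link);
    } else if (link.type === 'repo') {
      // Case-insensitive username match decides owned vs external.
      // With no known owner, every repo lands in externalRepos here.
      const owned = ownerName !== null && link.username.toLowerCase() === ownerName;
      (owned ? out.ownedRepos : out.externalRepos).push(link);
    }
  }
  return out;
}
```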

Examples

import { createProvider, scrapePortfolio } from '@clyrisai/gitresolve';

const provider = await createProvider();

try {
  const result = await scrapePortfolio('https://janedoe.dev', provider);

  if (result.error) {
    console.error('Scrape failed:', result.error);
  } else {
    console.log('Owner:', result.ownerProfile?.username);
    console.log('Confidence:', result.confidence);
    console.log('Owned repos:', result.ownedRepos.map(r => r.repo));
    console.log('Contributions:', result.contributions.length);
  }
} finally {
  await provider.cleanup();
}

For SPAs built with React, Vue, or similar frameworks, always use PuppeteerProvider or BrowserlessProvider. FetchProvider only downloads the initial HTML shell and will miss dynamically rendered links.

`extractLinksFromHtml`

Extracts all links from a raw HTML string. Returns both href attribute values (resolved against a base URL) and any git provider URLs found anywhere in the raw HTML text.

function extractLinksFromHtml(html: string, baseUrl: string): string[]

Parameters

html

string

required

Raw HTML string. This is typically the value returned by provider.getPageContent().

baseUrl

string

required

The URL the HTML was fetched from. Used to resolve relative href values like /projects or ../about into absolute URLs.

Returns

A deduplicated string[] of all extracted URLs. The array contains all links — not just git provider URLs. Callers are responsible for filtering. scrapePortfolio filters this list with isGitProviderUrl before further processing.

How it works

The function uses two complementary strategies and merges the results:

href attribute scanning: a regex (/href\s*=\s*["']([^"']+)["']/gi) extracts every href value. Any href that does not start with http, including root-relative paths such as /projects, is resolved to an absolute URL with new URL(href, baseUrl). Fragment links (#), mailto: links, and hrefs that cannot be parsed are silently dropped.
Raw text scanning: extractGitUrlsFromText runs on the full HTML string and catches git URLs that appear in data attributes, comments, inline JavaScript, or JSON blobs embedded in the page, that is, anywhere a regex can find them.

Both result sets are merged and deduplicated with Set before returning.

The returned array includes every href found in the document, including navigation links, icon links, stylesheet references, and non-git URLs. Always filter with isGitProviderUrl before passing to parseGitLink.
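
The href-scanning strategy can be approximated with the regex quoted above plus the standard URL constructor. This is a minimal sketch of that one strategy only, assuming the behaviour described in this section; extractHrefsSketch is a hypothetical name, and the raw text scan performed by extractGitUrlsFromText is not reproduced.

```typescript
// Hypothetical sketch of the href-scanning half of extractLinksFromHtml.
function extractHrefsSketch(html: string, baseUrl: string): string[] {
  // Regex as quoted in the description above; created per call so the
  // global flag's lastIndex state is never shared between calls.
  const pattern = /href\s*=\s*["']([^"']+)["']/gi;
  const found = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const href = match[1].trim();
    // Fragment and mailto links are dropped.
    if (href.startsWith('#') || href.toLowerCase().startsWith('mailto:')) {
      continue;
    }
    // Hrefs that already start with http are kept as written.
    if (href.startsWith('http')) {
      found.add(href);
      continue;
    }
    try {
      // Everything else is resolved against the page URL.
      found.add(new URL(href, baseUrl).href);
    } catch {
      // Unparseable hrefs are silently skipped.
    }
  }
  // The Set removes duplicates while preserving first-seen order.
  return Array.from(found);
}
```

Resolving against the full page URL rather than the site origin matters for document-relative paths such as ../about, which depend on the directory of the page they appear on.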

Examples

import { extractLinksFromHtml, isGitProviderUrl } from '@clyrisai/gitresolve';

const html = `
  <html>
    <body>
      <a href="/about">About</a>
      <a href="https://github.com/janedoe">GitHub</a>
      <a href="https://github.com/janedoe/my-project">My Project</a>
      <p>Data stored at github.com/janedoe/dataset</p>
    </body>
  </html>
`;

const allLinks = extractLinksFromHtml(html, 'https://janedoe.dev');
// [
//   'https://janedoe.dev/about',         ← relative href resolved
//   'https://github.com/janedoe',
//   'https://github.com/janedoe/my-project',
//   'https://github.com/janedoe/dataset', ← caught by text scan
// ]

const gitLinksOnly = allLinks.filter(isGitProviderUrl);
// [
//   'https://github.com/janedoe',
//   'https://github.com/janedoe/my-project',
//   'https://github.com/janedoe/dataset',
// ]
