Input Classification: How GitResolve Reads Your Data

Before GitResolve fetches a single URL or opens a single file, it runs every input through a classification step. This step determines which processing pipeline to invoke — portfolio scraping, PDF parsing, direct profile resolution, or a deliberate skip — so that downstream logic always knows exactly what kind of source it is working with. Getting classification right is essential: sending a resume file path into the portfolio scraper, or treating a bare GitHub profile as a repository URL, would produce empty or incorrect results. classifyInput resolves that ambiguity with a fast, deterministic decision before any network or file I/O takes place.

InputType reference

Every input resolves to one of seven InputType values. The table below lists each value, what it represents, and a concrete example.

| `InputType` | Meaning | Example input |
| --- | --- | --- |
| `repo_url` | A direct link to a specific repository on GitHub, GitLab, or Bitbucket | `https://github.com/torvalds/linux` |
| `git_profile` | A profile page on a supported git host (no repo path) | `https://github.com/torvalds` |
| `portfolio` | Any other fully-qualified URL (personal sites, project pages, etc.) | `https://janedoe.dev` |
| `resume_file` | A local file path ending in `.pdf`, `.doc`, `.docx`, or `.rtf` | `./resumes/jane_doe.pdf` |
| `resume_url` | A URL that points to a hosted resume document | `https://cdn.example.com/resume.pdf` |
| `linkedin` | Any URL whose hostname contains `linkedin.com` | `https://www.linkedin.com/in/janedoe` |
| `unknown` | Anything that cannot be parsed as a URL and does not end in a recognised resume file extension | `github.com/janedoe` (no scheme) |

resume_url is defined in the InputType union but is never returned by classifyInput. A URL ending in .pdf hosted on a non-git domain resolves to 'portfolio' through the classification algorithm. The CLI assigns 'resume_url' as the sourceType after it has downloaded a remote PDF and is about to hand it off for parsing — this assignment happens outside classifyInput entirely.
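The relationship between the full union and what classifyInput can return can be sketched in TypeScript. This is an illustration only: the library's actual type declarations may differ, and `ClassifiedType` and `sourceTypeAfterDownload` are hypothetical names introduced here, not part of the package.

```typescript
// The seven values from the reference table above.
type InputType =
  | "repo_url"
  | "git_profile"
  | "portfolio"
  | "resume_file"
  | "resume_url"
  | "linkedin"
  | "unknown";

// Hypothetical helper type: the subset classifyInput is described as returning.
type ClassifiedType = Exclude<InputType, "resume_url">;

// Hypothetical stand-in for the CLI-side relabelling step: once a remote PDF
// has been downloaded, the source is tagged 'resume_url' before parsing.
function sourceTypeAfterDownload(
  classified: ClassifiedType,
  downloadedRemotePdf: boolean,
): InputType {
  return downloadedRemotePdf ? "resume_url" : classified;
}
```

Under these assumptions, a hosted PDF that classifies as 'portfolio' only becomes 'resume_url' after the download step has run.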

Classification decision flow

classifyInput walks through seven steps in strict order and returns as soon as one matches. The steps are described below in that order, so the URL parse attempt is step 3 and the portfolio fallback is step 7. Step 7 (portfolio) is a catch-all for any valid URL that did not match an earlier rule. The only step that returns without a positive match is step 3, which returns unknown when the URL constructor throws.

Trim whitespace

The raw input string is trimmed of leading and trailing whitespace. All subsequent checks operate on this cleaned value.

File extension check → resume_file

If the trimmed string ends with .pdf, .doc, .docx, or .rtf (case-insensitive), the input is classified as resume_file immediately. No URL parsing is attempted.

// All of these → 'resume_file'
classifyInput("./cv/jane_doe.pdf")     // → 'resume_file'
classifyInput("/tmp/Resume.DOCX")      // → 'resume_file'
classifyInput("C:\\Users\\jane.rtf")   // → 'resume_file'
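The trim and extension steps can be approximated with a few lines of TypeScript. This is a minimal sketch, not the library's source: `RESUME_EXTENSIONS` and `looksLikeResumeFile` are hypothetical names. It models only the suffix test described above and makes no claim about how the real function treats URL-shaped strings.

```typescript
// Extensions listed in the step description.
const RESUME_EXTENSIONS = [".pdf", ".doc", ".docx", ".rtf"];

// Hypothetical helper: trim first, then compare the suffix case-insensitively.
function looksLikeResumeFile(raw: string): boolean {
  const cleaned = raw.trim().toLowerCase();
  return RESUME_EXTENSIONS.some((ext) => cleaned.endsWith(ext));
}

console.log(looksLikeResumeFile("  /tmp/Resume.DOCX  ")); // true
console.log(looksLikeResumeFile("./notes.txt"));          // false
```

Lower-casing the whole string is one simple way to get a case-insensitive match; a regular expression with the `i` flag would work equally well.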

URL parse attempt — failure → unknown

The string is passed to the URL constructor. If parsing throws, the input cannot be a valid web address and is classified as unknown.

classifyInput("github.com/janedoe")    // → 'unknown'  (no https:// scheme)
classifyInput("not a url at all")      // → 'unknown'

classifyInput does not prepend https:// automatically. A bare hostname like github.com/janedoe fails URL parsing and returns unknown. Always pass fully-qualified URLs with a scheme.
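The parse-or-fail behaviour relies on the standard URL constructor, which throws on strings without a scheme. The sketch below shows that pattern and a caller-side normaliser. Both `tryParseUrl` and `ensureScheme` are hypothetical helpers written for this example; they are not exported by GitResolve, and `ensureScheme` should only be applied to inputs you already know are web addresses, never to file paths.

```typescript
// Hypothetical helper: returns a URL object, or null where the classifier
// would report 'unknown'.
function tryParseUrl(input: string): URL | null {
  try {
    return new URL(input);
  } catch {
    return null;
  }
}

// Hypothetical caller-side normaliser: prepend https:// when no scheme is present.
function ensureScheme(input: string): string {
  const trimmed = input.trim();
  return /^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed) ? trimmed : `https://${trimmed}`;
}

console.log(tryParseUrl("github.com/janedoe"));                          // null
console.log(tryParseUrl(ensureScheme("github.com/janedoe"))?.hostname);  // "github.com"
```

Normalising before classification keeps the decision about guessing a scheme in your own code, where you know where the input came from.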

LinkedIn hostname check → linkedin

If the parsed URL’s hostname contains linkedin.com, the input is classified as linkedin. Processing stops here: GitResolve does not attempt to scrape or resolve LinkedIn URLs (see "LinkedIn: intentionally not resolved" below).

classifyInput("https://www.linkedin.com/in/janedoe") // → 'linkedin'
classifyInput("https://linkedin.com/company/acme")   // → 'linkedin'
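A hostname substring test of this kind might look like the following. It is a sketch under the assumption that the check is a plain "contains" on the parsed hostname, as the text describes; `isLinkedIn` is a hypothetical name, not a GitResolve export.

```typescript
// Hypothetical helper: inspects only the hostname, never the path or query.
function isLinkedIn(input: string): boolean {
  const { hostname } = new URL(input); // throws if the input is not a valid URL
  return hostname.includes("linkedin.com");
}

console.log(isLinkedIn("https://www.linkedin.com/in/janedoe"));  // true
console.log(isLinkedIn("https://janedoe.dev/in/linkedin.com"));  // false
```

Because this sketch uses a substring test, a hostname such as `linkedin.com.example.org` would also match; a stricter variant could compare against an exact list of hostnames instead.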

Repo URL validation → repo_url

The URL is passed to parseRepoUrl(). If it returns valid: true — meaning it has a recognised git hostname and a valid owner/repo path structure — the input is classified as repo_url.

classifyInput("https://github.com/torvalds/linux")           // → 'repo_url'
classifyInput("https://gitlab.com/inkscape/inkscape")        // → 'repo_url'
classifyInput("https://bitbucket.org/atlassian/localstack")  // → 'repo_url'
// With contribution paths too:
classifyInput("https://github.com/torvalds/linux/pull/42")   // → 'repo_url'
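To make the idea of "a recognised git hostname and a valid owner/repo path structure" concrete, here is a simplified stand-in for the validation step. It is not the real parseRepoUrl(): the name `parseRepoUrlSketch`, the `owner` and `repo` fields, and the rule "two or more path segments" are assumptions made for illustration. Only the `valid` flag is taken from the description above.

```typescript
interface RepoParseSketch {
  valid: boolean;
  owner?: string; // assumed field
  repo?: string;  // assumed field
}

const GIT_HOSTS = new Set([
  "github.com", "www.github.com",
  "gitlab.com", "www.gitlab.com",
  "bitbucket.org", "www.bitbucket.org",
]);

// Hypothetical stand-in: valid when the host is known and the path has
// at least an owner segment and a repo segment.
function parseRepoUrlSketch(input: string): RepoParseSketch {
  let url: URL;
  try {
    url = new URL(input);
  } catch {
    return { valid: false };
  }
  if (!GIT_HOSTS.has(url.hostname)) return { valid: false };

  const segments = url.pathname.split("/").filter((s) => s.length > 0);
  if (segments.length < 2) return { valid: false };

  return { valid: true, owner: segments[0], repo: segments[1] };
}

console.log(parseRepoUrlSketch("https://github.com/torvalds/linux/pull/42"));
// { valid: true, owner: "torvalds", repo: "linux" }
```

Extra segments such as `/pull/42` are ignored in this sketch, which matches the contribution-path example above.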

Known git host with path segments → git_profile

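If the hostname is github.com, www.github.com, gitlab.com, www.gitlab.com, bitbucket.org, or www.bitbucket.org, and the URL path contains at least one segment, the input is classified as git_profile. Repository URLs never reach this step because the previous step has already returned repo_url for them.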

classifyInput("https://github.com/torvalds")       // → 'git_profile'
classifyInput("https://gitlab.com/gitlab-org")     // → 'git_profile'
classifyInput("https://bitbucket.org/atlassian")   // → 'git_profile'

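A bare root URL with no path segments (https://github.com) does not match this step because there is no username to extract. Like any other valid URL that matches no earlier rule, it falls through to the portfolio catch-all.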
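The host-plus-segment test for this step can be sketched as below. `PROFILE_HOSTS` and `isGitProfileUrl` are hypothetical names; the hostname list is copied from the description above, and the sketch assumes the repository check has already run, so it does not try to exclude owner/repo paths itself.

```typescript
const PROFILE_HOSTS = new Set([
  "github.com", "www.github.com",
  "gitlab.com", "www.gitlab.com",
  "bitbucket.org", "www.bitbucket.org",
]);

// Hypothetical helper: known host and at least one non-empty path segment.
function isGitProfileUrl(url: URL): boolean {
  const segments = url.pathname.split("/").filter((s) => s.length > 0);
  return PROFILE_HOSTS.has(url.hostname) && segments.length >= 1;
}

console.log(isGitProfileUrl(new URL("https://github.com/torvalds"))); // true
console.log(isGitProfileUrl(new URL("https://github.com")));          // false
```

Filtering out empty strings after the split means a trailing slash, as in `https://github.com/torvalds/`, still counts as a single segment.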

Any other valid URL → portfolio

If the URL passed all previous checks without matching, it is classified as portfolio. This covers personal websites, project homepages, hosted slides, and any other web page that might contain git links.

classifyInput("https://janedoe.dev")                      // → 'portfolio'
classifyInput("https://janesmith.io/projects")            // → 'portfolio'
classifyInput("https://cdn.example.com/resume.pdf")       // → 'portfolio'

What GitResolve does for each type

Classification determines which pipeline runs next:

| `InputType` | Processing strategy |
| --- | --- |
| `repo_url` | Owner is extracted directly from the URL. `knownOwnerProfile` is passed to the disambiguator, yielding `confidence: 'high'` with no scraping needed. |
| `git_profile` | Owner is extracted directly from the URL path. Same high-confidence bypass as `repo_url`. |
| `portfolio` | Page HTML is fetched via `scrapePortfolio()`, all `href` attributes and inline git URLs are extracted, then disambiguation runs on the full link set. |
| `resume_file` | The file is read with `parseResume()`, which runs two extraction passes: plain-text extraction via `unpdf` and hyperlink annotation extraction from PDF metadata. Disambiguation then runs on the combined link set. |
| `resume_url` | Set by the CLI after downloading a remote PDF; not produced by `classifyInput`. The downloaded file is then processed the same way as `resume_file`. |
| `linkedin` | Flagged and skipped; no request is made (see below). |
| `unknown` | An error is attached to the result and processing is skipped. |
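If you route inputs yourself, an exhaustive switch over the type is one way to mirror this table. The sketch below returns a short description instead of calling real pipelines: `strategyFor` is a hypothetical function, and the returned strings merely summarise the table rather than name actual GitResolve APIs.

```typescript
type InputType =
  | "repo_url" | "git_profile" | "portfolio"
  | "resume_file" | "resume_url" | "linkedin" | "unknown";

// Hypothetical dispatcher: maps each type to a summary of its strategy.
function strategyFor(type: InputType): string {
  switch (type) {
    case "repo_url":
    case "git_profile":
      return "extract owner from URL (high confidence, no scraping)";
    case "portfolio":
      return "scrape page, then disambiguate";
    case "resume_file":
    case "resume_url":
      return "parse resume, then disambiguate";
    case "linkedin":
      return "skip without making a request";
    case "unknown":
      return "attach error and skip";
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a new
      // InputType value is added without a matching case.
      const unreachable: never = type;
      throw new Error(`Unhandled input type: ${String(unreachable)}`);
    }
  }
}

console.log(strategyFor("linkedin")); // "skip without making a request"
```

The `never` assignment in the default branch is a common TypeScript idiom for catching unhandled union members at compile time.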

LinkedIn: intentionally not resolved

When classifyInput returns linkedin, GitResolve records the type and moves on without issuing any request. LinkedIn’s terms of service prohibit automated scraping, and their login walls make reliable extraction impractical. If a candidate’s LinkedIn URL is the only input available, the resolver returns a result with confidence: 'none' and a warning indicating the source was skipped.

If you need to connect a LinkedIn profile to a GitHub identity, ask candidates to include their GitHub URL directly on their portfolio or resume. GitResolve will pick it up automatically during scraping or PDF parsing.

Code example

import { classifyInput } from "@clyrisai/gitresolve";

const inputs = [
  "./jane_doe_resume.pdf",
  "https://github.com/janedoe",
  "https://github.com/janedoe/my-project",
  "https://janedoe.dev",
  "https://www.linkedin.com/in/janedoe",
  "github.com/janedoe",                    // missing scheme
];

for (const input of inputs) {
  console.log(input, "→", classifyInput(input));
}

// ./jane_doe_resume.pdf               → resume_file
// https://github.com/janedoe          → git_profile
// https://github.com/janedoe/my-project → repo_url
// https://janedoe.dev                 → portfolio
// https://www.linkedin.com/in/janedoe → linkedin
// github.com/janedoe                  → unknown
