Survivorship bias is a well-documented analytical error that occurs when a dataset systematically includes only the “survivors” of a selection process while excluding everything that did not survive. In financial analysis, it manifests as mutual fund performance databases that only contain funds that are still operating — the failed funds are deleted, making average performance appear better than it was. In this dataset, the same mechanism operates at the level of job postings: only postings that were active and visible during the crawl window appear in the data. Every job that was posted and filled before the crawl, pulled by the employer early, filled through a referral before going public, or decided against entirely is invisible. The dataset shows you what “survived” to be crawled — not what the full market looked like.

What’s Missing

The categories of hiring activity absent from the dataset due to survivorship bias include:

Jobs Filled Through Referrals

Many positions, particularly in technology companies, are filled through employee referral programs before a job description is ever written or posted. The requisition opens, a recruiter reaches out to an employee’s contact, and the role is filled without a public listing, so no posting ever appears on LinkedIn.

Postings Pulled Within Hours or Days

Some postings generate overwhelming response immediately upon publication and are pulled by the employer within hours. These represent high-demand, easy-to-fill roles — typically commodity skills at competitive salaries — that are systematically absent from a dataset built from a point-in-time crawl. The fastest-filling jobs leave the smallest footprint.

Roles That Were Never Filled

Companies sometimes open requisitions, begin the hiring process, and then cancel the search because of budget changes, reorgs, or internal candidates emerging. When the search is cancelled before the crawl, these “ghost requisitions” leave no trace in the dataset.

Roles That Were Never Posted

As discussed in the Selection Bias page, a significant share of hiring occurs entirely outside public job boards. Survivorship bias compounds selection bias: even among publicly posted roles, only those that survived to the crawl date are captured.

The closed_time Evidence

The closed_time column, which would normally allow researchers to identify when postings were removed and calculate time-to-fill, has a 99.1% null rate. This is strong evidence that the dataset does not reliably track the job posting lifecycle. The absence of closed_time data means:
  • Time-to-fill cannot be calculated
  • The dataset cannot distinguish between a posting that is genuinely active and one that was filled but not properly closed
  • The “active” status of postings in the dataset is largely inferred from their presence in the crawl, not from explicit status tracking
The 99.1% null rate on closed_time is itself a data quality issue distinct from survivorship bias — it means even for the postings that were captured, we have almost no information about their resolution. See the Temporal Bias page for related discussion of missing lifecycle timestamps.
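
A null-rate check like the one behind the 99.1% figure can be sketched as follows. This is a minimal illustration on a tiny synthetic table, not the real dataset: the `postings` frame, its `job_id` values, and the single timestamp are all made up, and only the `closed_time` column name comes from this page.

```python
import pandas as pd

# Synthetic stand-in for the postings table; every value here is hypothetical.
postings = pd.DataFrame({
    "job_id": [1, 2, 3, 4, 5],
    "closed_time": [None, None, 1713500000000, None, None],
})

# Share of rows with no recorded closing timestamp.
null_rate = postings["closed_time"].isna().mean()
print(f"closed_time null rate: {null_rate:.1%}")

# Only these rows could support a lifecycle calculation such as time-to-fill.
has_close = postings[postings["closed_time"].notna()]
print(f"Postings with a usable closed_time: {len(has_close)} of {len(postings)}")
```

Running the same two expressions against the real table shows how few rows could support any time-to-fill estimate.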

Impact on Analysis

The survivorship mechanism creates a predictable distortion in demand and salary signals:

Harder-to-fill roles are over-represented: roles that stay open for weeks or months (because they require rare skills, offer below-market compensation, or are for niche domains) are more likely to be visible during a crawl than roles filled within 48 hours. The dataset’s skill demand rankings may therefore overstate demand for niche skills relative to commodity skills.

Commodity roles are under-represented: high-volume, easily filled roles (junior analysts, standard software engineers at market rates, common administrative roles) fill quickly and may not be present in the crawl at all.

This under-representation means:
  • The distribution of role types may skew toward senior, specialized, and harder-to-fill positions
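  • Median salary in the dataset may be inflated if the missing quickly filled (presumably at-market) roles pay less than the specialized roles that remain; long-open postings with below-market compensation pull in the opposite direction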
  • Skill frequency counts for common, in-demand skills may be understated
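
The over-representation of slow-filling roles can be illustrated with a small simulation. This is a sketch under invented assumptions, not an estimate for this dataset: the 80/20 split between commodity and niche roles, the average open durations (2 and 45 days), and the crawl day are all hypothetical parameters chosen to make the effect visible.

```python
import random

random.seed(0)

CRAWL_DAY = 180       # single point-in-time crawl (hypothetical)
N_POSTINGS = 20_000   # total postings published over one year (hypothetical)

def simulate_posting():
    """Return (kind, start_day, duration_days) for one hypothetical posting."""
    start = random.uniform(0, 365)
    if random.random() < 0.8:
        # Commodity role: open about 2 days on average (assumed).
        return "commodity", start, random.expovariate(1 / 2)
    # Niche role: open about 45 days on average (assumed).
    return "niche", start, random.expovariate(1 / 45)

postings = [simulate_posting() for _ in range(N_POSTINGS)]

# A posting is captured only if it is open on the crawl day.
captured = [p for p in postings if p[1] <= CRAWL_DAY <= p[1] + p[2]]

def niche_share(rows):
    """Fraction of rows that are niche roles."""
    return sum(1 for kind, _, _ in rows if kind == "niche") / len(rows)

true_share = niche_share(postings)
observed_share = niche_share(captured)
print(f"Niche share of all postings:      {true_share:.1%}")
print(f"Niche share of captured postings: {observed_share:.1%}")
```

Because the chance of being open on the crawl day grows with how long a posting stays up, niche roles make up a much larger share of the captured sample than of the postings actually published.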

Using Applies and Views as Proxies

In the absence of closed_time data, the applies and views columns provide partial proxies for understanding job “heat” — how quickly a role likely filled:
  • High applies: suggests a commodity role generating strong applicant interest — these roles likely fill fast and may be underrepresented in the dataset
  • Low applies: suggests a niche role with limited candidate pool — these roles stay open longer and are more likely to survive to the crawl date

```python
# Assumes `df` is the postings DataFrame, already loaded with pandas,
# with numeric `applies` and `salary_annual` columns.

# High-applies postings (commodity roles: filled fast, may be underrepresented)
commodity = df[df['applies'] > df['applies'].quantile(0.75)]
# Low-applies postings (niche roles: stay open longer)
niche = df[df['applies'] < df['applies'].quantile(0.25)]
print(f"Commodity median salary: ${commodity['salary_annual'].median():,.0f}")
print(f"Niche median salary:     ${niche['salary_annual'].median():,.0f}")
```

The applies column is itself subject to undercounting bias — it only captures LinkedIn Easy Apply submissions, not external application link clicks. Interpret applies-based proxies with this limitation in mind. See Applies Undercounting for details.

Mitigation Strategies

| Strategy | Description |
| --- | --- |
| Scope framing | Frame all demand analyses as applying to “active postings during the crawl window,” not “all market demand” |
| Applies segmentation | Separate high-applies (commodity) and low-applies (niche) postings when analyzing skill demand or salary |
| Views as demand proxy | Use views (less subject to undercounting than applies) as a proxy for role visibility and demand |
| External validation | Cross-reference skill demand findings with real-time labor market APIs (Lightcast, Indeed Trends) to validate survivorship-adjusted rankings |
| Freshness filtering | Filter to recently listed postings (listed_time within 30–60 days of crawl) to minimize staleness effects |
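
Freshness filtering might look like the sketch below. It assumes `listed_time` is stored as a Unix timestamp in milliseconds, which this page does not confirm, so adjust the `unit` argument if the column uses a different encoding. The `filter_fresh` helper, the crawl date, and the sample rows are hypothetical.

```python
import pandas as pd

def filter_fresh(df, crawl_time, max_age_days=30):
    """Keep postings listed within `max_age_days` before `crawl_time`."""
    # Assumption: listed_time is epoch milliseconds.
    listed = pd.to_datetime(df["listed_time"], unit="ms")
    age = crawl_time - listed
    mask = (age >= pd.Timedelta(0)) & (age <= pd.Timedelta(days=max_age_days))
    return df[mask]

crawl_time = pd.Timestamp("2024-04-20")  # hypothetical crawl date

# Synthetic postings listed 5, 50, and 141 days before the crawl.
postings = pd.DataFrame({
    "job_id": [1, 2, 3],
    "listed_time": [
        pd.Timestamp("2024-04-15").value // 10**6,  # nanoseconds -> milliseconds
        pd.Timestamp("2024-03-01").value // 10**6,
        pd.Timestamp("2023-12-01").value // 10**6,
    ],
})

fresh = filter_fresh(postings, crawl_time, max_age_days=30)
print(fresh["job_id"].tolist())
```

A tighter window reduces staleness but also shrinks the sample, so it is worth comparing results at both 30 and 60 days.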