The HRIA project is built on the LinkedIn Job Postings dataset published on Kaggle, a scrape of real job listings from LinkedIn’s public-facing search interface. The dataset was selected for two reasons. The first is breadth: 123,849 postings across hundreds of industries. The second is relational richness: rather than a single flat file, the source is structured as 11 interrelated CSVs that keep employer, job, and skills data in separate tables. This structure suits multi-dimensional EDA, allowing HRIA to join salary ranges, company metadata, skill tags, and benefit listings without relying on a single denormalized export. The dataset also captures real market signals, since the listings are organic postings from a platform used by millions of active job seekers. HRIA therefore treats the biases it observes (missing salary disclosure, sparse remote status, underrepresentation of certain roles) as meaningful findings rather than data artifacts.

Dataset Files

The full dataset spans 11 CSV files in four locations: the root directory, which holds the postings.csv fact table, plus the companies/, jobs/, and mappings/ subdirectories.
| File | Rows | Columns | Description |
|---|---|---|---|
| postings.csv | 123,849 | 31 | Main fact table, one row per job posting |
| companies/companies.csv | 24,473 | 10 | Company profiles |
| companies/company_industries.csv | 24,375 | 2 | Company-to-industry mapping |
| companies/company_specialities.csv | 169,387 | 2 | Company specialities (M:M) |
| companies/employee_counts.csv | 35,787 | 4 | Company headcount snapshots |
| jobs/benefits.csv | 67,943 | 3 | Benefits per job posting |
| jobs/job_industries.csv | 164,808 | 2 | Job-to-industry mapping |
| jobs/job_skills.csv | 213,768 | 2 | Job-to-skill mapping |
| jobs/salaries.csv | 40,785 | 8 | Salary ranges per posting |
| mappings/industries.csv | 422 | 2 | Industry lookup table |
| mappings/skills.csv | 35 | 2 | Skill category lookup (35 categories) |
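
After downloading, the row counts listed above can serve as a quick integrity check. The following is a minimal sketch using only the Python standard library; the helper names count_rows and check_inventory are illustrative and not part of HRIA, and the sketch assumes the files are UTF-8 encoded with a single header row.

```python
import csv
from pathlib import Path

# Data-row counts copied from the Dataset Files table (header row excluded).
EXPECTED_ROWS = {
    "postings.csv": 123_849,
    "companies/companies.csv": 24_473,
    "companies/company_industries.csv": 24_375,
    "companies/company_specialities.csv": 169_387,
    "companies/employee_counts.csv": 35_787,
    "jobs/benefits.csv": 67_943,
    "jobs/job_industries.csv": 164_808,
    "jobs/job_skills.csv": 213_768,
    "jobs/salaries.csv": 40_785,
    "mappings/industries.csv": 422,
    "mappings/skills.csv": 35,
}


def count_rows(path):
    """Count data rows with csv.reader so a quoted multi-line field counts once."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def check_inventory(root, expected=EXPECTED_ROWS):
    """Return {relative_path: (expected, found)} for missing or mismatched files.

    `found` is None when the file does not exist under `root`.
    """
    problems = {}
    for rel_path, want in expected.items():
        path = Path(root) / rel_path
        found = count_rows(path) if path.exists() else None
        if found != want:
            problems[rel_path] = (want, found)
    return problems
```

Calling check_inventory on the raw data directory returns an empty dict when every file is present with the expected row count. Counting through csv.reader rather than counting newlines matters for postings.csv, where free-text fields can span several physical lines.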

Relational Schema

The dataset follows a star-schema-like structure with postings.csv at the center. Every satellite table connects back to the fact table through one of two foreign keys:
  • job_id — links a posting to its salary ranges, skill assignments, benefit listings, and industry tags.
  • company_id — links a posting to company profiles, headcount snapshots, specialities, and company-level industry memberships.
The two lookup tables in mappings/ (industries.csv and skills.csv) serve as dimension tables: industries.csv decodes the numeric industry IDs used in both job_industries.csv and company_industries.csv, while skills.csv decodes the abbreviated skill codes stored in job_skills.csv into 35 human-readable categories. N:M relationships (one job → many skills, one job → many industries, one company → many specialities) are handled during preprocessing by aggregating satellite rows into comma-separated strings, preserving the one-row-per-posting structure throughout the analysis pipeline.
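
The aggregation step described above can be sketched in a few lines. This is a dependency-free illustration under stated assumptions, not HRIA's actual preprocessing code: the helper collapse_to_strings, the toy rows, the skill codes, and the ", " separator are all hypothetical, and the real column names in job_skills.csv and skills.csv may differ.

```python
from collections import defaultdict


def collapse_to_strings(pairs, lookup=None, sep=", "):
    """Collapse (key, value) rows into {key: "v1, v2"}.

    If `lookup` is given, each value is decoded through it first
    (unknown codes are kept as-is). Duplicates are dropped, order is kept.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        label = lookup.get(value, value) if lookup else value
        if label not in grouped[key]:
            grouped[key].append(label)
    return {key: sep.join(values) for key, values in grouped.items()}


# Toy stand-ins for postings.csv, jobs/job_skills.csv, and mappings/skills.csv.
postings = [
    {"job_id": 1, "title": "Data Analyst"},
    {"job_id": 2, "title": "ML Engineer"},
    {"job_id": 3, "title": "Recruiter"},  # no skill rows at all
]
job_skills = [(1, "IT"), (1, "ANLS"), (2, "ENG"), (2, "IT")]
skills = {"IT": "Information Technology", "ANLS": "Analyst", "ENG": "Engineering"}

# Decode the skill codes, then aggregate to one string per job_id.
skills_by_job = collapse_to_strings(job_skills, lookup=skills)

# Left-join onto the fact table: every posting is kept, exactly once.
enriched = [
    {**row, "skills": skills_by_job.get(row["job_id"], "")} for row in postings
]
```

Aggregating before joining is what preserves the one-row-per-posting structure: joining the raw satellite rows directly would duplicate each posting once per skill. In pandas the same pattern is typically a groupby on job_id followed by a string join and a left merge.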

Scope and Limitations

While the dataset is large, several scope boundaries shape what HRIA can and cannot conclude:
  • Geographic bias — The overwhelming majority of postings target the US market. International listings exist but are sparse and inconsistently formatted, so cross-country salary comparisons are not reliable.
  • Platform bias — All postings originate from LinkedIn only. Roles listed exclusively on Indeed, Glassdoor, company career pages, or niche boards are absent, which may over-represent enterprise employers and under-represent startups.
  • Role filtering — HRIA’s core analyses focus on data and tech roles specifically. Applying a keyword filter (see Preprocessing) reduces the working dataset to 19,725 postings — 15.9% of the 123,849 total. Findings about salary distributions and skill demand apply to this filtered subset, not the full dataset.
  • Temporal window — The scrape represents a snapshot in time and does not capture how the market evolved quarter-over-quarter within the collection window.
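
To make the role-filtering figure concrete, the sketch below shows one way a title keyword filter could work and reproduces the 15.9% share from the two counts given above. The keyword list is purely hypothetical; the filter HRIA actually applies is defined in the Preprocessing page and is not reproduced here.

```python
import re

# Hypothetical keywords for illustration only, not HRIA's real list.
KEYWORDS = ["data", "machine learning", "software", "developer"]
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_target_role(title):
    """True if the job title contains any keyword as a whole word or phrase."""
    return bool(PATTERN.search(title or ""))


def share(part, total):
    """Percentage of `total` represented by `part`, rounded to one decimal."""
    return round(100 * part / total, 1)


# Counts stated in the Role filtering bullet above.
filtered_share = share(19_725, 123_849)
```

Whole-word matching avoids false positives such as "database" matching a bare "data" substring check in unrelated compound words, though any keyword filter of this kind will still admit some off-target titles.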

Data Quality Headlines

Two missing-data patterns are prominent enough to affect every analytical decision downstream:
  • Salary data is sparse — only 24% of postings include a salary range. This is not random missingness: companies actively choose whether to disclose compensation, and the decision correlates with industry, company size, and role seniority. Any salary analysis must account for this selection effect.
  • Remote status is almost never disclosed: 87.7% of postings leave remote_allowed blank. The formatted_work_type column (On-site / Remote / Hybrid) is more reliably populated and is used as the primary remote-work indicator throughout HRIA.
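
Both headline figures are missing-value rates, and the selection effect can be inspected by computing the same rate per group. The sketch below is a minimal, dependency-free illustration on toy rows; remote_allowed comes from the text above, while the max_salary and size columns and all sample values are assumptions made for the example.

```python
from collections import defaultdict


def is_missing(value):
    """Treat None and blank or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def missing_share(rows, column):
    """Percentage of rows where `column` is missing, rounded to one decimal."""
    rows = list(rows)
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if is_missing(row.get(column)))
    return round(100 * missing / len(rows), 1)


def missing_share_by(rows, column, group_column):
    """Missing rate of `column` within each value of `group_column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get(group_column)].append(row)
    return {name: missing_share(members, column) for name, members in groups.items()}


# Toy rows; real postings would be read from postings.csv.
sample = [
    {"job_id": 1, "size": "large", "max_salary": "120000", "remote_allowed": ""},
    {"job_id": 2, "size": "large", "max_salary": "", "remote_allowed": ""},
    {"job_id": 3, "size": "small", "max_salary": "", "remote_allowed": "1.0"},
    {"job_id": 4, "size": "small", "max_salary": "", "remote_allowed": ""},
]

remote_blank = missing_share(sample, "remote_allowed")
salary_blank_by_size = missing_share_by(sample, "max_salary", "size")
```

If missing rates differ sharply between groups, as they do between the two toy groups here, the missingness is not random and salary statistics computed only on disclosed values will be skewed toward the groups that disclose.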
The LinkedIn Job Postings dataset is publicly available on Kaggle. To reproduce the HRIA analysis, download the dataset from https://www.kaggle.com/datasets/arshkon/linkedin-job-postings and place the extracted files in your project’s /data/raw/ directory, preserving the original companies/, jobs/, and mappings/ subdirectory structure. A Kaggle account and the kaggle CLI (or manual download) are required.

Continue Reading

Dataset Schema

Full column-level schema for all 11 CSV files, including data types, null rates, and primary/foreign key relationships.

Preprocessing Pipeline

How HRIA merges the 11 CSVs, normalizes salaries to annual USD, removes outliers, and produces three publication-ready output files.
