The HRIA project is built on the LinkedIn Job Postings dataset published on Kaggle, a comprehensive scrape of real job listings from LinkedIn’s public-facing search interface. The dataset was selected for its breadth (nearly 124K postings across hundreds of industries) and its relational richness: rather than a single flat file, the source is structured as 11 interrelated CSVs that mirror how LinkedIn organizes employer, job, and skills data internally. This structure makes it well-suited for multi-dimensional EDA, allowing HRIA to join salary ranges, company metadata, skill tagging, and benefit listings without relying on a single denormalized export. Beyond scale, the dataset captures real market signals (organic listings from a platform used by millions of active job seekers), which means the biases observed (missing salary disclosure, sparse remote status, underrepresentation of certain roles) are themselves meaningful findings rather than data artifacts.
Dataset Files
The full dataset spans 11 CSV files organized into four locations: the root-level postings.csv fact table, plus the companies/, jobs/, and mappings/ subdirectories.
| File | Rows | Columns | Description |
|---|---|---|---|
| postings.csv | 123,849 | 31 | Main fact table: one row per job posting |
| companies/companies.csv | 24,473 | 10 | Company profiles |
| companies/company_industries.csv | 24,375 | 2 | Company-to-industry mapping |
| companies/company_specialities.csv | 169,387 | 2 | Company specialities (M:M) |
| companies/employee_counts.csv | 35,787 | 4 | Company headcount snapshots |
| jobs/benefits.csv | 67,943 | 3 | Benefits per job posting |
| jobs/job_industries.csv | 164,808 | 2 | Job-to-industry mapping |
| jobs/job_skills.csv | 213,768 | 2 | Job-to-skill mapping |
| jobs/salaries.csv | 40,785 | 8 | Salary ranges per posting |
| mappings/industries.csv | 422 | 2 | Industry lookup table |
| mappings/skills.csv | 35 | 2 | Skill category lookup (35 categories) |
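
As a quick sanity check on this layout, the sketch below reports which of the 11 files are missing under a given directory. The function name `missing_files` and the `data/raw` location are assumptions made for illustration, not part of HRIA itself.

```python
from pathlib import Path

# Relative paths of the 11 CSVs, as listed in the table above.
EXPECTED_FILES = [
    "postings.csv",
    "companies/companies.csv",
    "companies/company_industries.csv",
    "companies/company_specialities.csv",
    "companies/employee_counts.csv",
    "jobs/benefits.csv",
    "jobs/job_industries.csv",
    "jobs/job_skills.csv",
    "jobs/salaries.csv",
    "mappings/industries.csv",
    "mappings/skills.csv",
]


def missing_files(raw_dir):
    """Return the expected CSV paths that are absent under raw_dir."""
    root = Path(raw_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # "data/raw" is an assumed location; adjust it to your project layout.
    absent = missing_files("data/raw")
    print("All 11 files found." if not absent else f"Missing: {absent}")
```

Running this before any analysis catches the common mistake of extracting the archive without its subdirectory structure.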
Relational Schema
The dataset follows a star-schema-like structure with postings.csv at the center. Every satellite table connects back to the fact table through one of two foreign keys:
- job_id: links a posting to its salary ranges, skill assignments, benefit listings, and industry tags.
- company_id: links a posting to company profiles, headcount snapshots, specialities, and company-level industry memberships.
The two files in mappings/ (industries.csv and skills.csv) serve as dimension tables: industries.csv decodes the numeric industry IDs used in both job_industries.csv and company_industries.csv, while skills.csv decodes the abbreviated skill codes stored in job_skills.csv into 35 human-readable categories.
N:M relationships (one job → many skills, one job → many industries, one company → many specialities) are handled during preprocessing by aggregating satellite rows into comma-separated strings, preserving the one-row-per-posting structure throughout the analysis pipeline.
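The decode-then-aggregate step described above can be sketched in plain Python as follows. This is a minimal illustration, not HRIA's actual preprocessing code: the column names `skill_abr` and `skill_name`, the helper `aggregate_skills`, and the toy rows are all assumptions.

```python
from collections import defaultdict

# Toy rows standing in for jobs/job_skills.csv, as (job_id, skill_abr) pairs.
# Column names and values are illustrative assumptions.
job_skills = [
    (101, "IT"),
    (101, "ENG"),
    (102, "SALE"),
]

# Toy lookup standing in for mappings/skills.csv (skill_abr -> skill_name).
skill_lookup = {
    "IT": "Information Technology",
    "ENG": "Engineering",
    "SALE": "Sales",
}


def aggregate_skills(rows, lookup):
    """Collapse (job_id, code) rows into one comma-separated string per job."""
    names_by_job = defaultdict(list)
    for job_id, code in rows:
        # Fall back to the raw code if the lookup has no entry for it.
        names_by_job[job_id].append(lookup.get(code, code))
    return {job_id: ", ".join(names) for job_id, names in names_by_job.items()}


skills_per_job = aggregate_skills(job_skills, skill_lookup)
print(skills_per_job[101])  # Information Technology, Engineering
```

Because the result has exactly one entry per job_id, it can be attached to the fact table without multiplying rows, which is the point of aggregating before joining.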
Scope and Limitations
While the dataset is large, several scope boundaries shape what HRIA can and cannot conclude:
- Geographic bias: The overwhelming majority of postings target the US market. International listings exist but are sparse and inconsistently formatted, so cross-country salary comparisons are not reliable.
- Platform bias: All postings originate from LinkedIn only. Roles listed exclusively on Indeed, Glassdoor, company career pages, or niche boards are absent, which may over-represent enterprise employers and under-represent startups.
- Role filtering: HRIA’s core analyses focus on data and tech roles specifically. Applying a keyword filter (see Preprocessing) reduces the working dataset to 19,725 postings, or 15.9% of the 123,849 total. Findings about salary distributions and skill demand apply to this filtered subset, not the full dataset.
- Temporal window: The scrape represents a snapshot in time and does not capture how the market evolved quarter-over-quarter within the collection window.
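To make the role-filtering step concrete, here is a rough sketch of a title keyword filter, followed by the arithmetic behind the 15.9% figure. The keyword list, the `is_tech_role` helper, and the sample titles are hypothetical; the terms HRIA actually uses are defined in Preprocessing.

```python
import re

# Hypothetical keyword list, for illustration only.
KEYWORDS = ["data", "engineer", "developer"]
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_tech_role(title):
    """True if the title contains any keyword as a whole word."""
    return PATTERN.search(title) is not None


# Made-up titles standing in for the title column of postings.csv.
titles = ["Senior Data Engineer", "Registered Nurse", "Backend Developer", "Line Cook"]
kept = [t for t in titles if is_tech_role(t)]
print(kept)  # ['Senior Data Engineer', 'Backend Developer']

# Share of the full dataset retained, using the counts reported above.
share = 19_725 / 123_849
print(f"{share:.1%}")  # 15.9%
```

Whole-word matching keeps a keyword from firing on unrelated substrings, though any keyword filter of this kind will still produce some false positives and negatives.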
Data Quality Headlines
Two missing-data patterns are prominent enough to affect every analytical decision downstream:
- Salary data is sparse: only 24% of postings include a salary range. This is not random missingness: companies actively choose whether to disclose compensation, and the decision correlates with industry, company size, and role seniority. Any salary analysis must account for this selection effect.
- Remote status is almost never disclosed: 87.7% of postings leave remote_allowed blank. The formatted_work_type column (On-site / Remote / Hybrid) is more reliably populated and is used as the primary remote-work indicator throughout HRIA.
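Blank rates like these can be measured per column with a small helper. The sketch below assumes rows shaped like the output of `csv.DictReader`, where a blank cell is an empty string; the function name `blank_rate` and the toy rows are illustrative and do not reproduce the real figures.

```python
def blank_rate(rows, column):
    """Share of rows where `column` is missing or an empty string."""
    if not rows:
        return 0.0
    blanks = sum(1 for row in rows if row.get(column) in (None, ""))
    return blanks / len(rows)


# Toy rows standing in for postings.csv; values are made up.
postings = [
    {"job_id": "1", "remote_allowed": "", "formatted_work_type": "On-site"},
    {"job_id": "2", "remote_allowed": "1", "formatted_work_type": "Remote"},
    {"job_id": "3", "remote_allowed": "", "formatted_work_type": "Hybrid"},
    {"job_id": "4", "remote_allowed": "", "formatted_work_type": "On-site"},
]

print(blank_rate(postings, "remote_allowed"))       # 0.75
print(blank_rate(postings, "formatted_work_type"))  # 0.0
```

Comparing the two rates side by side is one way to justify preferring the better-populated column as the working indicator.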
The LinkedIn Job Postings dataset is publicly available on Kaggle. To reproduce the HRIA analysis, download the dataset from https://www.kaggle.com/datasets/arshkon/linkedin-job-postings and place the extracted files in your project’s /data/raw/ directory, preserving the original companies/, jobs/, and mappings/ subdirectory structure. Downloading requires a Kaggle account, whether you use the kaggle CLI or download manually.
Continue Reading
Dataset Schema
Full column-level schema for all 11 CSV files, including data types, null rates, and primary/foreign key relationships.
Preprocessing Pipeline
How HRIA merges the 11 CSVs, normalizes salaries to annual USD, removes outliers, and produces three publication-ready output files.