LinkedIn Job Postings Dataset: Structure and Scale

The HRIA project is built on the LinkedIn Job Postings dataset published on Kaggle, a scrape of real job listings from LinkedIn’s public-facing search interface. The dataset was selected for its breadth (123,849 postings across hundreds of industries) and its relational richness: rather than a single flat file, the source is structured as 11 interrelated CSVs that keep employer, job, and skills data in separate tables. This structure makes it well suited to multi-dimensional EDA, allowing HRIA to join salary ranges, company metadata, skill tags, and benefit listings without relying on a single denormalized export. Beyond scale, the dataset captures real market signals, since the listings are organic postings from a platform used by millions of active job seekers. The biases observed (missing salary disclosure, sparse remote status, underrepresentation of certain roles) are therefore meaningful findings in their own right rather than data artifacts.

Dataset Files

The full dataset spans 11 CSV files: postings.csv, the fact table, sits at the dataset root, and the remaining ten files are grouped under the companies/, jobs/, and mappings/ subdirectories.

| File | Rows | Columns | Description |
| --- | ---: | ---: | --- |
| `postings.csv` | 123,849 | 31 | Main fact table, one row per job posting |
| `companies/companies.csv` | 24,473 | 10 | Company profiles |
| `companies/company_industries.csv` | 24,375 | 2 | Company-to-industry mapping |
| `companies/company_specialities.csv` | 169,387 | 2 | Company specialities (M:M) |
| `companies/employee_counts.csv` | 35,787 | 4 | Company headcount snapshots |
| `jobs/benefits.csv` | 67,943 | 3 | Benefits per job posting |
| `jobs/job_industries.csv` | 164,808 | 2 | Job-to-industry mapping |
| `jobs/job_skills.csv` | 213,768 | 2 | Job-to-skill mapping |
| `jobs/salaries.csv` | 40,785 | 8 | Salary ranges per posting |
| `mappings/industries.csv` | 422 | 2 | Industry lookup table |
| `mappings/skills.csv` | 35 | 2 | Skill category lookup (35 categories) |
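The row counts listed for each file can double as a quick integrity check after download. The following is a minimal sketch using only the Python standard library. The `EXPECTED_ROWS` mapping is transcribed from the file listing, while the helper names `count_rows` and `check_row_counts` are hypothetical and not part of HRIA's code.

```python
import csv
from pathlib import Path

# Row counts transcribed from the file listing (header rows excluded).
EXPECTED_ROWS = {
    "postings.csv": 123_849,
    "companies/companies.csv": 24_473,
    "companies/company_industries.csv": 24_375,
    "companies/company_specialities.csv": 169_387,
    "companies/employee_counts.csv": 35_787,
    "jobs/benefits.csv": 67_943,
    "jobs/job_industries.csv": 164_808,
    "jobs/job_skills.csv": 213_768,
    "jobs/salaries.csv": 40_785,
    "mappings/industries.csv": 422,
    "mappings/skills.csv": 35,
}


def count_rows(path):
    """Count data rows in a CSV file, excluding the header row."""
    # csv.reader keeps quoted fields with embedded newlines together,
    # which a plain line count would split into several rows.
    with open(path, newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        next(reader, None)  # skip the header
        return sum(1 for _ in reader)


def check_row_counts(root, expected=EXPECTED_ROWS):
    """Return {relative_path: (expected, actual)} for every mismatch.

    `actual` is None when the file is missing.
    """
    problems = {}
    for rel, want in expected.items():
        path = Path(root) / rel
        got = count_rows(path) if path.is_file() else None
        if got != want:
            problems[rel] = (want, got)
    return problems
```

Calling `check_row_counts("data/raw")` (the directory is an assumption) returns an empty dict when every file is present with the listed number of rows. A later version of the Kaggle dataset may legitimately differ from these counts.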

Relational Schema

The dataset follows a star-schema-like structure with postings.csv at the center. Every satellite table connects back to the fact table through one of two foreign keys:

job_id — links a posting to its salary ranges, skill assignments, benefit listings, and industry tags.
company_id — links a posting to company profiles, headcount snapshots, specialities, and company-level industry memberships.

The two lookup tables in mappings/ (industries.csv and skills.csv) serve as dimension tables: industries.csv decodes the numeric industry IDs used in both job_industries.csv and company_industries.csv, while skills.csv decodes the abbreviated skill codes stored in job_skills.csv into 35 human-readable categories. Many-to-many relationships (a job can have many skills and many industries, and a company can have many specialities) are handled during preprocessing by aggregating satellite rows into comma-separated strings, which preserves the one-row-per-posting structure throughout the analysis pipeline.
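The aggregation step described here can be sketched as follows. This is an illustration under stated assumptions, not HRIA's preprocessing code: the column names `skill_abr` and `title`, the sample skill codes, and the function names are all hypothetical, and the toy rows stand in for what `csv.DictReader` would yield.

```python
from collections import defaultdict


def aggregate_skills(job_skills, skill_lookup):
    """Collapse (job_id, skill code) rows into one comma-separated string per job."""
    names_by_job = defaultdict(list)
    for row in job_skills:
        code = row["skill_abr"]  # assumed column name
        # Fall back to the raw code when the lookup has no entry for it.
        name = skill_lookup.get(code, code)
        if name not in names_by_job[row["job_id"]]:
            names_by_job[row["job_id"]].append(name)
    return {job_id: ", ".join(names) for job_id, names in names_by_job.items()}


def attach_skills(postings, skills_by_job):
    """Left-join the aggregated skills onto postings, one output row per posting."""
    return [{**p, "skills": skills_by_job.get(p["job_id"], "")} for p in postings]


# Toy rows with made-up values.
postings = [
    {"job_id": "1", "title": "Data Analyst"},
    {"job_id": "2", "title": "Recruiter"},
]
job_skills = [
    {"job_id": "1", "skill_abr": "IT"},
    {"job_id": "1", "skill_abr": "ANLS"},
]
skill_lookup = {"IT": "Information Technology", "ANLS": "Analyst"}

enriched = attach_skills(postings, aggregate_skills(job_skills, skill_lookup))
```

Aggregating before the join is what keeps the row count stable. Joining the raw satellite rows directly would repeat a posting once per skill and inflate any downstream counts.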

Scope and Limitations

While the dataset is large, several scope boundaries shape what HRIA can and cannot conclude:

Geographic bias — The overwhelming majority of postings target the US market. International listings exist but are sparse and inconsistently formatted, so cross-country salary comparisons are not reliable.
Platform bias — All postings originate from LinkedIn only. Roles listed exclusively on Indeed, Glassdoor, company career pages, or niche boards are absent, which may over-represent enterprise employers and under-represent startups.
Role filtering — HRIA’s core analyses focus on data and tech roles specifically. Applying a keyword filter (see Preprocessing) reduces the working dataset to 19,725 postings — 15.9% of the 123,849 total. Findings about salary distributions and skill demand apply to this filtered subset, not the full dataset.
Temporal window — The scrape represents a snapshot in time and does not capture how the market evolved quarter-over-quarter within the collection window.
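To make the role-filtering figure concrete, the sketch below shows one way a title keyword filter could work and reproduces the share arithmetic. The keyword list and function names are hypothetical placeholders. HRIA's actual filter is the one described in Preprocessing.

```python
import re

# Hypothetical keyword list, for illustration only.
KEYWORDS = ["data", "engineer", "developer", "analyst"]

# \b word boundaries stop "data" from matching inside "Database".
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b", re.IGNORECASE
)


def is_target_role(title):
    """True when the job title contains one of the keywords as a whole word."""
    return bool(PATTERN.search(title or ""))


def filtered_share(kept, total):
    """Percentage of postings retained, rounded to one decimal place."""
    return round(100 * kept / total, 1)


print(filtered_share(19_725, 123_849))  # 15.9
```

Whole-word matching is a deliberate choice in this sketch: a plain substring test is simpler but pulls in unrelated titles, so whichever rule is used should be stated alongside the filtered counts.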

Data Quality Headlines

Two missing-data patterns are prominent enough to affect every analytical decision downstream:

Salary data is sparse — only 24% of postings include a salary range. This is not random missingness: companies actively choose whether to disclose compensation, and the decision correlates with industry, company size, and role seniority. Any salary analysis must account for this selection effect.
Remote status is almost never disclosed — 87.7% of postings leave remote_allowed blank. The formatted_work_type column (On-site / Remote / Hybrid) is more reliably populated and is used as the primary remote-work indicator throughout HRIA.
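Missing-data rates like these can be measured with a small helper, and breaking the rate down by group is one simple way to check whether missingness is random. The sketch below uses only the standard library. The function names and the `industry` grouping key are assumptions for illustration, and the rows mimic `csv.DictReader` output.

```python
from collections import defaultdict


def is_blank(value):
    """Treat None, empty strings, and whitespace-only strings as missing."""
    return value is None or not str(value).strip()


def blank_rate(rows, column):
    """Percentage of rows where `column` is missing, to one decimal place."""
    rows = list(rows)
    if not rows:
        return 0.0
    blank = sum(is_blank(r.get(column)) for r in rows)
    return round(100 * blank / len(rows), 1)


def blank_rate_by_group(rows, column, group_column):
    """Blank rate of `column` computed separately for each value of `group_column`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r.get(group_column)].append(r)
    return {g: blank_rate(members, column) for g, members in groups.items()}
```

If the per-group rates differ widely, for example `blank_rate_by_group(rows, "remote_allowed", "industry")` showing very different values across industries, the missingness is unlikely to be random. In that case, summary statistics computed only on the populated rows describe the disclosing subset rather than the whole market.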

The LinkedIn Job Postings dataset is publicly available on Kaggle. To reproduce the HRIA analysis, download the dataset from https://www.kaggle.com/datasets/arshkon/linkedin-job-postings and place the extracted files in your project’s /data/raw/ directory, preserving the original companies/, jobs/, and mappings/ subdirectory structure. Downloading requires a Kaggle account, and can be done either with the kaggle CLI or manually from the dataset page.

Continue Reading

Dataset Schema

Full column-level schema for all 11 CSV files, including data types, null rates, and primary/foreign key relationships.

Preprocessing Pipeline

How HRIA merges the 11 CSVs, normalizes salaries to annual USD, removes outliers, and produces three publication-ready output files.
