The raw LinkedIn Job Postings dataset arrives as 11 separate CSV files spanning four directories. Before any analysis can run, those files must be joined into a coherent master DataFrame, stripped of structural noise (duplicate columns from joins, redundant index columns), and enriched with derived features — most importantly a singleDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/MajoRodri/HRIA/llms.txt
Use this file to discover all available pages before exploring further.
salary_annual column that places every salary figure on the same annual-USD basis regardless of the pay period it was originally quoted in. The pipeline then filters the full 123,849-row corpus down to the 19,725 data-related roles that are the focus of HRIA’s core analyses, applies targeted imputation to columns where missingness is safe to fill, removes salary outliers using a fenced IQR approach, and writes three publication-ready output files at different levels of completeness. Every step is documented below.
Preprocessing Pipeline
Each file is read into a dedicated pandas DataFrame with explicit
dtype coercion on key identifier columns (job_id, company_id) to prevent silent type mismatches during joins. The mappings/ lookup tables are loaded first so they are available as reference objects throughout the rest of the pipeline.All 11 files are merged into a single flat DataFrame using
pandas.merge(). The merge strategy preserves every row in postings.csv (the fact table) while enriching each row with columns from the satellite tables.postings.csv ← LEFT JOIN → companies/companies.csv on company_idLEFT JOIN → jobs/salaries.csv on job_idLEFT JOIN → companies/employee_counts.csv on company_idN:M tables (one job maps to many skills, industries, or benefits) are not joined row-by-row, which would explode the DataFrame to millions of rows. Instead, they are aggregated prior to joining:
# Aggregate job skills into a comma-separated string per job_id
skills_agg = (
job_skills
.merge(skills_map, on='skill_abr', how='left')
.groupby('job_id')['skill_name']
.apply(lambda x: ', '.join(x.dropna().unique()))
.reset_index()
.rename(columns={'skill_name': 'skills_list'})
)
# Aggregate job industries similarly
industries_agg = (
job_industries
.merge(industries_map, on='industry_id', how='left')
.groupby('job_id')['industry_name']
.apply(lambda x: ', '.join(x.dropna().unique()))
.reset_index()
.rename(columns={'industry_name': 'industries_list'})
)
This approach keeps the master DataFrame at exactly 123,849 rows × 49 columns — one row per posting, no duplication.
Eight free-text and categorical columns are normalized to prevent the same value being counted as multiple distinct categories due to case or whitespace differences:
TEXT_COLS = [
'title', 'company_name', 'location',
'formatted_work_type', 'formatted_experience_level',
'compensation_type', 'pay_period', 'industries_list'
]
for col in TEXT_COLS:
master[col] = (
master[col]
.astype(str)
.str.lower()
.str.strip()
.str.replace(r'\s+', ' ', regex=True)
)
Without this step, “Data Engineer”, “data engineer”, and “Data Engineer ” are treated as three distinct job titles in value-count operations and pivot tables.
The full 123,849-row master dataset is filtered down to data and tech roles using a keyword match against the normalized
title column. A posting is included if its title contains any of the following terms:DATA_KEYWORDS = [
'data', 'analyst', 'scientist', 'engineer',
'machine learning', 'analytics', 'intelligence',
'statistician', 'quantitative', 'etl', 'warehouse',
'big data', 'nlp', 'sql developer', 'python developer',
'ai engineer', 'ml engineer'
]
pattern = '|'.join(DATA_KEYWORDS)
data_roles = master[master['title'].str.contains(pattern, na=False)].copy()
This yields 19,725 postings — 15.9% of the total corpus. The filtered DataFrame gets one additional column (
salary_annual) added in the next step, bringing it to 50 columns.Salary figures in the raw data are quoted in four different pay periods (HOURLY, MONTHLY, WEEKLY, YEARLY). To make them comparable, every salary value is converted to annual USD using the following function, applied row-wise:
HOURS_YEAR = 2080 # 40 h/week × 52 weeks
def annual_salary(row):
"""Normalize salary to annual USD based on pay period."""
# Priority: sal_med > sal_max > sal_min > fields from postings
for field in ['sal_med', 'sal_max', 'sal_min',
'med_salary', 'max_salary', 'min_salary']:
val = row.get(field)
if not pd.isna(val):
break
else:
return np.nan
period = str(row.get('sal_period', row.get('pay_period', ''))).upper()
if 'HOUR' in period: return val * HOURS_YEAR # 40h × 52w
if 'MONTH' in period: return val * 12
if 'WEEK' in period: return val * 52
return val # YEARLY or other -> assumed annual
data_roles['salary_annual'] = data_roles.apply(annual_salary, axis=1)
The function checks multiple salary columns in priority order (preferring
sal_med from the salaries table, then falling back to the postings columns) and applies the appropriate multiplier. Rows with no valid salary value in any column return np.nan.Imputation is applied selectively. Only columns where filling missing values does not introduce analytical bias receive imputation.
formatted_experience_level"mid-senior level")views5)applies4)views; Easy Apply onlycompany_name"Unknown"salary_annualsalary_annual is never imputed. Any analysis that involves salary — distributions, role comparisons, regression models — must be scoped exclusively to the 6,108-row subset (data_roles_salario.csv) that contains real, validated salary records. Applying mean or median imputation to the remaining ~76% of missing salary rows would manufacture an artificial salary distribution and invalidate every downstream finding.Even after normalization, the salary distribution contains extreme values that skew visualizations and distort summary statistics. Outliers are removed using a fenced IQR approach applied to the data roles with salary subset:
Q1 = data_roles_sal['salary_annual'].quantile(0.25) # $90,000
Q3 = data_roles_sal['salary_annual'].quantile(0.75) # $166,400
IQR = Q3 - Q1 # $76,400
lower_bound = max(Q1 - 1.5 * IQR, 10_000) # $10,000 (floor applied)
upper_bound = Q3 + 1.5 * IQR # $281,000
data_roles_sal_clean = data_roles_sal[
data_roles_sal['salary_annual'].between(lower_bound, upper_bound)
]
Output Files
| File | Rows | Columns | Description |
|---|---|---|---|
data_maestro_completo.csv | 123,849 | 49 | Full master dataset — all roles, all postings, no role filter applied |
data_roles_completo.csv | 19,725 | 50 | Data roles only (keyword-filtered), includes salary_annual column, salary not required |
data_roles_salario.csv | 6,108 | 50 | Data roles with clean salary only — outliers removed, ready for compensation analysis |
data_roles_completo.csv. Phases that examine salary use data_roles_salario.csv exclusively.
Column counts differ by one between
data_maestro_completo.csv (49 columns) and the two data-roles files (50 columns) because the salary_annual engineered feature is only added during the role-filtering step in Step 5. The master file retains the raw salary columns (min_salary, max_salary, med_salary) without the normalized annual composite.