Data Preprocessing: Merging, Normalizing, and Cleaning

The raw LinkedIn Job Postings dataset arrives as 11 separate CSV files spanning four directories. Before any analysis can run, those files must be joined into a coherent master DataFrame, stripped of structural noise (duplicate columns from joins, redundant index columns), and enriched with derived features — most importantly a single salary_annual column that places every salary figure on the same annual-USD basis regardless of the pay period it was originally quoted in. The pipeline then filters the full 123,849-row corpus down to the 19,725 data-related roles that are the focus of HRIA’s core analyses, applies targeted imputation to columns where missingness is safe to fill, removes salary outliers using a fenced IQR approach, and writes three publication-ready output files at different levels of completeness. Every step is documented below.

Preprocessing Pipeline

Step 1 — Load All 11 CSV Files

Each file is read into a dedicated pandas DataFrame with explicit dtype coercion on key identifier columns (job_id, company_id) to prevent silent type mismatches during joins. The mappings/ lookup tables are loaded first so they are available as reference objects throughout the rest of the pipeline.

Step 2 — Multi-Table JOIN into a Master DataFrame

All 11 files are merged into a single flat DataFrame using pandas.merge(). The merge strategy preserves every row in postings.csv (the fact table) while enriching each row with columns from the satellite tables.

Join architecture:

postings.csv ← LEFT JOIN → companies/companies.csv on company_id

result ← LEFT JOIN → jobs/salaries.csv on job_id

result ← LEFT JOIN → companies/employee_counts.csv on company_id

N:M tables (one job maps to many skills, industries, or benefits) are not joined row-by-row, which would explode the DataFrame to millions of rows. Instead, they are aggregated prior to joining:

# Aggregate job skills into a comma-separated string per job_id
skills_agg = (
    job_skills
    .merge(skills_map, on='skill_abr', how='left')
    .groupby('job_id')['skill_name']
    .apply(lambda x: ', '.join(x.dropna().unique()))
    .reset_index()
    .rename(columns={'skill_name': 'skills_list'})
)

# Aggregate job industries similarly
industries_agg = (
    job_industries
    .merge(industries_map, on='industry_id', how='left')
    .groupby('job_id')['industry_name']
    .apply(lambda x: ', '.join(x.dropna().unique()))
    .reset_index()
    .rename(columns={'industry_name': 'industries_list'})
)

This approach keeps the master DataFrame at exactly 123,849 rows × 49 columns — one row per posting, no duplication.

Step 3 — Text Normalization

Eight free-text and categorical columns are normalized to prevent the same value being counted as multiple distinct categories due to case or whitespace differences:

TEXT_COLS = [
    'title', 'company_name', 'location',
    'formatted_work_type', 'formatted_experience_level',
    'compensation_type', 'pay_period', 'industries_list'
]

for col in TEXT_COLS:
    master[col] = (
        master[col]
        .astype(str)
        .str.lower()
        .str.strip()
        .str.replace(r'\s+', ' ', regex=True)
    )

Without this step, “Data Engineer”, “data engineer”, and “Data Engineer ” are treated as three distinct job titles in value-count operations and pivot tables.

Step 4 — Role Filtering

The full 123,849-row master dataset is filtered down to data and tech roles using a keyword match against the normalized title column. A posting is included if its title contains any of the following terms:

DATA_KEYWORDS = [
    'data', 'analyst', 'scientist', 'engineer',
    'machine learning', 'analytics', 'intelligence',
    'statistician', 'quantitative', 'etl', 'warehouse',
    'big data', 'nlp', 'sql developer', 'python developer',
    'ai engineer', 'ml engineer'
]

pattern = '|'.join(DATA_KEYWORDS)
data_roles = master[master['title'].str.contains(pattern, na=False)].copy()

This yields 19,725 postings — 15.9% of the total corpus. The filtered DataFrame gets one additional column (salary_annual) added in the next step, bringing it to 50 columns.

Step 5 — Salary Normalization with annual_salary()

Salary figures in the raw data are quoted in four different pay periods (HOURLY, MONTHLY, WEEKLY, YEARLY). To make them comparable, every salary value is converted to annual USD using the following function, applied row-wise:

HOURS_YEAR = 2080  # 40 h/week × 52 weeks

def annual_salary(row):
    """Normalize salary to annual USD based on pay period."""
    # Priority: sal_med > sal_max > sal_min > fields from postings
    for field in ['sal_med', 'sal_max', 'sal_min',
                  'med_salary', 'max_salary', 'min_salary']:
        val = row.get(field)
        if not pd.isna(val):
            break
    else:
        return np.nan

    period = str(row.get('sal_period', row.get('pay_period', ''))).upper()
    if 'HOUR'  in period: return val * HOURS_YEAR   # 40h × 52w
    if 'MONTH' in period: return val * 12
    if 'WEEK'  in period: return val * 52
    return val  # YEARLY or other -> assumed annual

data_roles['salary_annual'] = data_roles.apply(annual_salary, axis=1)

The function checks multiple salary columns in priority order (preferring sal_med from the salaries table, then falling back to the postings columns) and applies the appropriate multiplier. Rows with no valid salary value in any column return np.nan.

Step 6 — Missing Value Treatment

Imputation is applied selectively. Only columns where filling missing values does not introduce analytical bias receive imputation.

ColumnStrategyReasonformatted_experience_levelImpute with mode ("mid-senior level")Ordinal with clear modal class; losing 23.7% of rows would distort role distribution chartsviewsImpute with median (5)Highly right-skewed; median is robust to influencer postings with 10K+ viewsappliesImpute with median (4)Same distribution profile as views; Easy Apply onlycompany_nameFill with "Unknown"Prevents valid postings from being dropped in company-level aggregationssalary_annualNo imputationInventing salary figures introduces fabricated economic signal — see warning below

salary_annual is never imputed. Any analysis that involves salary — distributions, role comparisons, regression models — must be scoped exclusively to the 6,108-row subset (data_roles_salario.csv) that contains real, validated salary records. Applying mean or median imputation to the remaining ~76% of missing salary rows would manufacture an artificial salary distribution and invalidate every downstream finding.

Step 7 — Salary Outlier Removal (IQR Method)

Even after normalization, the salary distribution contains extreme values that skew visualizations and distort summary statistics. Outliers are removed using a fenced IQR approach applied to the data roles with salary subset:

Q1 = data_roles_sal['salary_annual'].quantile(0.25)   # $90,000
Q3 = data_roles_sal['salary_annual'].quantile(0.75)   # $166,400
IQR = Q3 - Q1                                         # $76,400

lower_bound = max(Q1 - 1.5 * IQR, 10_000)            # $10,000  (floor applied)
upper_bound = Q3 + 1.5 * IQR                          # $281,000

data_roles_sal_clean = data_roles_sal[
    data_roles_sal['salary_annual'].between(lower_bound, upper_bound)
]

Outlier removal summary:

MetricValueQ1$90,000Q3$166,400IQR$76,400Lower bound$10,000 (floored — raw Q1−1.5×IQR was negative)Upper bound$281,000Rows removed247 (3.9% of salary-complete rows)Rows remaining6,108

The

10,000 floor is applied deliberately — purely mathematical IQR fencing would produce a negative lower bound, which is nonsensical for an annual salary. A

10K minimum is a conservative economic sanity check.

Step 8 — Write Output Files

Three output files are written at different scopes and levels of completeness:

master.to_csv('data_maestro_completo.csv', index=False)
data_roles.to_csv('data_roles_completo.csv', index=False)
data_roles_sal_clean.to_csv('data_roles_salario.csv', index=False)

Output Files

File	Rows	Columns	Description
`data_maestro_completo.csv`	123,849	49	Full master dataset — all roles, all postings, no role filter applied
`data_roles_completo.csv`	19,725	50	Data roles only (keyword-filtered), includes `salary_annual` column, salary not required
`data_roles_salario.csv`	6,108	50	Data roles with clean salary only — outliers removed, ready for compensation analysis

The three files represent a progressive narrowing of scope:

123,849 postings  (all roles)
    └─ 19,725 postings  (data/tech roles only)
            └─ 6,355 postings  (data roles with salary)
                    └─ 6,108 postings  (salary outliers removed — analysis-ready)

Each downstream analysis phase in HRIA begins by loading whichever file matches the analysis scope. Phases that examine skill demand or company distribution use data_roles_completo.csv. Phases that examine salary use data_roles_salario.csv exclusively.

Column counts differ by one between data_maestro_completo.csv (49 columns) and the two data-roles files (50 columns) because the salary_annual engineered feature is only added during the role-filtering step in Step 5. The master file retains the raw salary columns (min_salary, max_salary, med_salary) without the normalized annual composite.

Overview

Dataset

Analysis Phases

Bias Analysis

Findings & Recommendations

Preprocessing Pipeline

Output Files

Build docs developers (and LLMs) love

Overview

Dataset

Analysis Phases

Bias Analysis

Findings & Recommendations

Documentation Index

​Preprocessing Pipeline

​Output Files

Build docs developers (and LLMs) love

Preprocessing Pipeline

Output Files