Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/MajoRodri/HRIA/llms.txt

Use this file to discover all available pages before exploring further.

The raw LinkedIn Job Postings dataset arrives as 11 separate CSV files spanning four directories. Before any analysis can run, those files must be joined into a coherent master DataFrame, stripped of structural noise (duplicate columns from joins, redundant index columns), and enriched with derived features — most importantly a single salary_annual column that places every salary figure on the same annual-USD basis regardless of the pay period it was originally quoted in. The pipeline then filters the full 123,849-row corpus down to the 19,725 data-related roles that are the focus of HRIA’s core analyses, applies targeted imputation to columns where missingness is safe to fill, removes salary outliers using a fenced IQR approach, and writes three publication-ready output files at different levels of completeness. Every step is documented below.

Preprocessing Pipeline

1
Step 1 — Load All 11 CSV Files
2
Each file is read into a dedicated pandas DataFrame with explicit dtype coercion on key identifier columns (job_id, company_id) to prevent silent type mismatches during joins. The mappings/ lookup tables are loaded first so they are available as reference objects throughout the rest of the pipeline.
3
Step 2 — Multi-Table JOIN into a Master DataFrame
4
All 11 files are merged into a single flat DataFrame using pandas.merge(). The merge strategy preserves every row in postings.csv (the fact table) while enriching each row with columns from the satellite tables.
5
Join architecture:
6
  • postings.csvLEFT JOINcompanies/companies.csv on company_id
  • result ← LEFT JOINjobs/salaries.csv on job_id
  • result ← LEFT JOINcompanies/employee_counts.csv on company_id
  • 7
    N:M tables (one job maps to many skills, industries, or benefits) are not joined row-by-row, which would explode the DataFrame to millions of rows. Instead, they are aggregated prior to joining:
    8
    # Aggregate job skills into a comma-separated string per job_id
    skills_agg = (
        job_skills
        .merge(skills_map, on='skill_abr', how='left')
        .groupby('job_id')['skill_name']
        .apply(lambda x: ', '.join(x.dropna().unique()))
        .reset_index()
        .rename(columns={'skill_name': 'skills_list'})
    )
    
    # Aggregate job industries similarly
    industries_agg = (
        job_industries
        .merge(industries_map, on='industry_id', how='left')
        .groupby('job_id')['industry_name']
        .apply(lambda x: ', '.join(x.dropna().unique()))
        .reset_index()
        .rename(columns={'industry_name': 'industries_list'})
    )
    
    9
    This approach keeps the master DataFrame at exactly 123,849 rows × 49 columns — one row per posting, no duplication.
    10
    Step 3 — Text Normalization
    11
    Eight free-text and categorical columns are normalized to prevent the same value being counted as multiple distinct categories due to case or whitespace differences:
    12
    TEXT_COLS = [
        'title', 'company_name', 'location',
        'formatted_work_type', 'formatted_experience_level',
        'compensation_type', 'pay_period', 'industries_list'
    ]
    
    for col in TEXT_COLS:
        master[col] = (
            master[col]
            .astype(str)
            .str.lower()
            .str.strip()
            .str.replace(r'\s+', ' ', regex=True)
        )
    
    13
    Without this step, “Data Engineer”, “data engineer”, and “Data Engineer ” are treated as three distinct job titles in value-count operations and pivot tables.
    14
    Step 4 — Role Filtering
    15
    The full 123,849-row master dataset is filtered down to data and tech roles using a keyword match against the normalized title column. A posting is included if its title contains any of the following terms:
    16
    DATA_KEYWORDS = [
        'data', 'analyst', 'scientist', 'engineer',
        'machine learning', 'analytics', 'intelligence',
        'statistician', 'quantitative', 'etl', 'warehouse',
        'big data', 'nlp', 'sql developer', 'python developer',
        'ai engineer', 'ml engineer'
    ]
    
    pattern = '|'.join(DATA_KEYWORDS)
    data_roles = master[master['title'].str.contains(pattern, na=False)].copy()
    
    17
    This yields 19,725 postings15.9% of the total corpus. The filtered DataFrame gets one additional column (salary_annual) added in the next step, bringing it to 50 columns.
    18
    Step 5 — Salary Normalization with annual_salary()
    19
    Salary figures in the raw data are quoted in four different pay periods (HOURLY, MONTHLY, WEEKLY, YEARLY). To make them comparable, every salary value is converted to annual USD using the following function, applied row-wise:
    20
    HOURS_YEAR = 2080  # 40 h/week × 52 weeks
    
    def annual_salary(row):
        """Normalize salary to annual USD based on pay period."""
        # Priority: sal_med > sal_max > sal_min > fields from postings
        for field in ['sal_med', 'sal_max', 'sal_min',
                      'med_salary', 'max_salary', 'min_salary']:
            val = row.get(field)
            if not pd.isna(val):
                break
        else:
            return np.nan
    
        period = str(row.get('sal_period', row.get('pay_period', ''))).upper()
        if 'HOUR'  in period: return val * HOURS_YEAR   # 40h × 52w
        if 'MONTH' in period: return val * 12
        if 'WEEK'  in period: return val * 52
        return val  # YEARLY or other -> assumed annual
    
    data_roles['salary_annual'] = data_roles.apply(annual_salary, axis=1)
    
    21
    The function checks multiple salary columns in priority order (preferring sal_med from the salaries table, then falling back to the postings columns) and applies the appropriate multiplier. Rows with no valid salary value in any column return np.nan.
    22
    Step 6 — Missing Value Treatment
    23
    Imputation is applied selectively. Only columns where filling missing values does not introduce analytical bias receive imputation.
    24
    ColumnStrategyReasonformatted_experience_levelImpute with mode ("mid-senior level")Ordinal with clear modal class; losing 23.7% of rows would distort role distribution chartsviewsImpute with median (5)Highly right-skewed; median is robust to influencer postings with 10K+ viewsappliesImpute with median (4)Same distribution profile as views; Easy Apply onlycompany_nameFill with "Unknown"Prevents valid postings from being dropped in company-level aggregationssalary_annualNo imputationInventing salary figures introduces fabricated economic signal — see warning below
    25
    salary_annual is never imputed. Any analysis that involves salary — distributions, role comparisons, regression models — must be scoped exclusively to the 6,108-row subset (data_roles_salario.csv) that contains real, validated salary records. Applying mean or median imputation to the remaining ~76% of missing salary rows would manufacture an artificial salary distribution and invalidate every downstream finding.
    26
    Step 7 — Salary Outlier Removal (IQR Method)
    27
    Even after normalization, the salary distribution contains extreme values that skew visualizations and distort summary statistics. Outliers are removed using a fenced IQR approach applied to the data roles with salary subset:
    28
    Q1 = data_roles_sal['salary_annual'].quantile(0.25)   # $90,000
    Q3 = data_roles_sal['salary_annual'].quantile(0.75)   # $166,400
    IQR = Q3 - Q1                                         # $76,400
    
    lower_bound = max(Q1 - 1.5 * IQR, 10_000)            # $10,000  (floor applied)
    upper_bound = Q3 + 1.5 * IQR                          # $281,000
    
    data_roles_sal_clean = data_roles_sal[
        data_roles_sal['salary_annual'].between(lower_bound, upper_bound)
    ]
    
    29
    Outlier removal summary:
    30
    MetricValueQ1$90,000Q3$166,400IQR$76,400Lower bound$10,000 (floored — raw Q1−1.5×IQR was negative)Upper bound$281,000Rows removed247 (3.9% of salary-complete rows)Rows remaining6,108
    31
    The 10,000floorisapplieddeliberatelypurelymathematicalIQRfencingwouldproduceanegativelowerbound,whichisnonsensicalforanannualsalary.A10,000 floor is applied deliberately — purely mathematical IQR fencing would produce a negative lower bound, which is nonsensical for an annual salary. A 10K minimum is a conservative economic sanity check.
    32
    Step 8 — Write Output Files
    33
    Three output files are written at different scopes and levels of completeness:
    34
    master.to_csv('data_maestro_completo.csv', index=False)
    data_roles.to_csv('data_roles_completo.csv', index=False)
    data_roles_sal_clean.to_csv('data_roles_salario.csv', index=False)
    

    Output Files

    FileRowsColumnsDescription
    data_maestro_completo.csv123,84949Full master dataset — all roles, all postings, no role filter applied
    data_roles_completo.csv19,72550Data roles only (keyword-filtered), includes salary_annual column, salary not required
    data_roles_salario.csv6,10850Data roles with clean salary only — outliers removed, ready for compensation analysis
    The three files represent a progressive narrowing of scope:
    123,849 postings  (all roles)
        └─ 19,725 postings  (data/tech roles only)
                └─ 6,355 postings  (data roles with salary)
                        └─ 6,108 postings  (salary outliers removed — analysis-ready)
    
    Each downstream analysis phase in HRIA begins by loading whichever file matches the analysis scope. Phases that examine skill demand or company distribution use data_roles_completo.csv. Phases that examine salary use data_roles_salario.csv exclusively.
    Column counts differ by one between data_maestro_completo.csv (49 columns) and the two data-roles files (50 columns) because the salary_annual engineered feature is only added during the role-filtering step in Step 5. The master file retains the raw salary columns (min_salary, max_salary, med_salary) without the normalized annual composite.

    Build docs developers (and LLMs) love