Phase 2: Data Cleaning and Master Dataset Preparation

Phase 2 transforms the 11 raw, profiled CSV files from Phase 1 into a single, analysis-ready master DataFrame. By the end of this phase, every salary figure is expressed in annual USD, every data-related role has been isolated from the broader 123,849-row dataset, and statistical outliers have been removed via IQR fencing. The three output CSVs produced here are the sole inputs for all Phase 3 and Phase 4 analyses, making this phase the most critical data-quality checkpoint in the entire HRIA pipeline.

Notebook

Fase2_Limpieza_Preparacion.ipynb

Libraries

Library	Version	Purpose
`pandas`	2.2.2	DataFrame merges, transformations, and CSV export
`NumPy`	2.0.2	Vectorized math and `np.nan` handling
`scipy.stats`	—	Z-score computation for outlier validation
`warnings`	stdlib	Suppress non-critical warnings
`sys` / `os`	stdlib	File path resolution and environment checks

Cleaning Pipeline

Load 11 CSV Files

All files from archive/ are loaded into individual DataFrames using the same pattern established in Phase 1. Dtypes are validated and object columns are kept as strings to prevent silent coercion during the merge step.

Multi-Table JOIN — 49 Columns, 123,849 Rows

The fact table (postings.csv) is joined to all 10 dimension tables using left joins on job_id and company_id. Because several relationships are N:M (one posting maps to multiple industries, skills, or specialities), those tables are aggregated before joining — their values are collapsed into comma-separated strings within a single cell. This preserves all information without inflating row counts.The resulting master DataFrame contains 123,849 rows × 49 columns.

Drop Redundant and Irrelevant Columns (49 → 36)

Thirteen columns are removed after auditing for analytical value. See the Columns Dropped table below for the full breakdown by category.

Role Filtering — 19,725 Data-Related Rows (15.9 %)

A keyword filter on title and description isolates data-related postings (e.g., data engineer, data analyst, machine learning, data scientist). This yields 19,725 rows, representing 15.9 % of the full corpus and forming the primary analytical population for all downstream work.

Text Normalization

All free-text and categorical string columns are normalized with .lower() followed by .strip() to eliminate casing inconsistencies and leading/trailing whitespace. This is especially important for formatted_experience_level, formatted_work_type, and the aggregated skills column before any frequency counting.

Salary Normalization to Annual USD

Raw salary figures exist across six columns (sal_med, sal_max, sal_min, med_salary, max_salary, min_salary) and carry mixed pay-period labels (HOURLY, MONTHLY, WEEKLY, YEARLY). The annual_salary() function (see below) unifies all records into a single salary_annual column expressed in annual USD.

Missing Value Imputation

Non-salary missing values are handled contextually: high-missingness categorical columns receive an explicit 'unknown' sentinel rather than being dropped, preserving row counts for the exploratory analyses that do not require salary data.

IQR Outlier Removal — 247 Rows Removed

After normalization, IQR fencing is applied to salary_annual. The fence is computed over the 6,355 rows that carry a non-null salary figure. 247 rows fall outside the fence, leaving 6,108 clean salary records for statistical analysis. See the code block in the Outlier Removal section below.

Save Three Output CSVs

Three files are written to output/ for use by Phases 3 and 4. See Output Files for details.

Salary Normalization Function

def annual_salary(row):
    period = str(row.get('sal_pay_period', '')).upper()
    for col in ['sal_med', 'sal_max', 'sal_min',
                'med_salary', 'max_salary', 'min_salary']:
        val = row.get(col)
        if pd.notna(val) and val > 0:
            if period == 'HOURLY':  return val * 2080
            if period == 'MONTHLY': return val * 12
            if period == 'WEEKLY':  return val * 52
            return val
    return np.nan

df['salary_annual'] = df.apply(annual_salary, axis=1)

The function iterates through salary source columns in priority order (median → max → min) and applies the appropriate multiplier based on the pay-period label. Hourly rates use the standard 2,080-hour work year (52 weeks × 40 hours). Rows with no valid salary value in any of the six columns return np.nan.

Columns Dropped

Category	Columns Removed
Duplicate identifiers	`job_id` (retained as index), `company_id` (merged into master)
Redundant salary fields	`sal_min`, `sal_max`, `sal_med` (superseded by `salary_annual`)
Platform URLs	`job_posting_url`, `application_url`, `company_linkedin_url`
High-cardinality geo	`zip_code`, `address`, `city`, `state` (raw strings — geo analysis uses `location` only)
Low-signal metadata	`sponsored`, `listed_time`, `expiry_time`, `posting_domain`

IQR Outlier Removal

from scipy import stats

Q1 = df_sal['salary_annual'].quantile(0.25)  # $90,000
Q3 = df_sal['salary_annual'].quantile(0.75)  # $166,400
IQR = Q3 - Q1                                # $76,400
lower = max(Q1 - 1.5 * IQR, 10_000)         # $10,000
upper = Q3 + 1.5 * IQR                       # $281,000

df_clean = df_sal[(df_sal['salary_annual'] >= lower) &
                  (df_sal['salary_annual'] <= upper)]
print(f"Removed: {len(df_sal) - len(df_clean)} rows")
# Removed: 247 rows

The lower fence is clamped to a business-logic floor of $10,000/year** to catch unrealistically low conversions from hourly rates. The upper fence of **$ 281,000 removes the extreme outliers (including the erroneous $535M+ entry identified in Phase 1) while retaining legitimate executive compensation.

Output Files

File	Rows	Columns	Contents
`data_roles_completo.csv`	19,725	50	All data-related postings, all columns, including nulls in salary
`data_roles_salario.csv`	6,108	50	Data-related postings with clean, IQR-filtered `salary_annual`
`data_maestro_completo.csv`	123,849	49	Full cleaned dataset (all roles) for market-wide comparisons

Always use data_roles_salario.csv as the input for any salary-related analysis. It is the only file that has been through the full normalization and IQR-filtering pipeline. Using data_roles_completo.csv or data_maestro_completo.csv directly for salary statistics will include nulls and extreme outliers that will skew every metric.

Overview

Dataset

Analysis Phases

Bias Analysis

Findings & Recommendations

Notebook

Libraries

Cleaning Pipeline

Salary Normalization Function

Columns Dropped

IQR Outlier Removal

Output Files

Build docs developers (and LLMs) love

Overview

Dataset

Analysis Phases

Bias Analysis

Findings & Recommendations

Documentation Index

​Notebook

​Libraries

​Cleaning Pipeline

​Salary Normalization Function

​Columns Dropped

​IQR Outlier Removal

​Output Files

Build docs developers (and LLMs) love

Notebook

Libraries

Cleaning Pipeline

Salary Normalization Function

Columns Dropped

IQR Outlier Removal

Output Files