Phase 2 transforms the 11 raw, profiled CSV files from Phase 1 into a single, analysis-ready master DataFrame. By the end of this phase, every salary figure is expressed in annual USD, every data-related role has been isolated from the broader 123,849-row dataset, and statistical outliers have been removed via IQR fencing. The three output CSVs produced here are the sole inputs for all Phase 3 and Phase 4 analyses, making this phase the most critical data-quality checkpoint in the entire HRIA pipeline.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/MajoRodri/HRIA/llms.txt
Use this file to discover all available pages before exploring further.
Notebook
Fase2_Limpieza_Preparacion.ipynb
Libraries
| Library | Version | Purpose |
|---|---|---|
pandas | 2.2.2 | DataFrame merges, transformations, and CSV export |
NumPy | 2.0.2 | Vectorized math and np.nan handling |
scipy.stats | — | Z-score computation for outlier validation |
warnings | stdlib | Suppress non-critical warnings |
sys / os | stdlib | File path resolution and environment checks |
Cleaning Pipeline
Load 11 CSV Files
All files from
archive/ are loaded into individual DataFrames using the same pattern
established in Phase 1. Dtypes are validated and object columns are kept as strings to
prevent silent coercion during the merge step.Multi-Table JOIN — 49 Columns, 123,849 Rows
The fact table (
postings.csv) is joined to all 10 dimension tables using left joins on
job_id and company_id. Because several relationships are N:M (one posting maps to
multiple industries, skills, or specialities), those tables are aggregated before joining —
their values are collapsed into comma-separated strings within a single cell. This preserves
all information without inflating row counts.The resulting master DataFrame contains 123,849 rows × 49 columns.Drop Redundant and Irrelevant Columns (49 → 36)
Thirteen columns are removed after auditing for analytical value. See the
Columns Dropped table below for the full breakdown by category.
Role Filtering — 19,725 Data-Related Rows (15.9 %)
A keyword filter on
title and description isolates data-related postings
(e.g., data engineer, data analyst, machine learning, data scientist).
This yields 19,725 rows, representing 15.9 % of the full corpus and forming
the primary analytical population for all downstream work.Text Normalization
All free-text and categorical string columns are normalized with
.lower() followed by
.strip() to eliminate casing inconsistencies and leading/trailing whitespace. This is
especially important for formatted_experience_level, formatted_work_type, and the
aggregated skills column before any frequency counting.Salary Normalization to Annual USD
Raw salary figures exist across six columns (
sal_med, sal_max, sal_min,
med_salary, max_salary, min_salary) and carry mixed pay-period labels
(HOURLY, MONTHLY, WEEKLY, YEARLY). The annual_salary() function (see below)
unifies all records into a single salary_annual column expressed in annual USD.Missing Value Imputation
Non-salary missing values are handled contextually: high-missingness categorical columns
receive an explicit
'unknown' sentinel rather than being dropped, preserving row counts
for the exploratory analyses that do not require salary data.IQR Outlier Removal — 247 Rows Removed
After normalization, IQR fencing is applied to
salary_annual. The fence is computed over
the 6,355 rows that carry a non-null salary figure. 247 rows fall outside the fence,
leaving 6,108 clean salary records for statistical analysis. See the code block
in the Outlier Removal section below.Save Three Output CSVs
Three files are written to
output/ for use by Phases 3 and 4. See
Output Files for details.Salary Normalization Function
np.nan.
Columns Dropped
| Category | Columns Removed |
|---|---|
| Duplicate identifiers | job_id (retained as index), company_id (merged into master) |
| Redundant salary fields | sal_min, sal_max, sal_med (superseded by salary_annual) |
| Platform URLs | job_posting_url, application_url, company_linkedin_url |
| High-cardinality geo | zip_code, address, city, state (raw strings — geo analysis uses location only) |
| Low-signal metadata | sponsored, listed_time, expiry_time, posting_domain |
IQR Outlier Removal
Output Files
| File | Rows | Columns | Contents |
|---|---|---|---|
data_roles_completo.csv | 19,725 | 50 | All data-related postings, all columns, including nulls in salary |
data_roles_salario.csv | 6,108 | 50 | Data-related postings with clean, IQR-filtered salary_annual |
data_maestro_completo.csv | 123,849 | 49 | Full cleaned dataset (all roles) for market-wide comparisons |