Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/MajoRodri/HRIA/llms.txt

Use this file to discover all available pages before exploring further.

Phase 2 transforms the 11 raw, profiled CSV files from Phase 1 into a single, analysis-ready master DataFrame. By the end of this phase, every salary figure is expressed in annual USD, every data-related role has been isolated from the broader 123,849-row dataset, and statistical outliers have been removed via IQR fencing. The three output CSVs produced here are the sole inputs for all Phase 3 and Phase 4 analyses, making this phase the most critical data-quality checkpoint in the entire HRIA pipeline.

Notebook

Fase2_Limpieza_Preparacion.ipynb

Libraries

LibraryVersionPurpose
pandas2.2.2DataFrame merges, transformations, and CSV export
NumPy2.0.2Vectorized math and np.nan handling
scipy.statsZ-score computation for outlier validation
warningsstdlibSuppress non-critical warnings
sys / osstdlibFile path resolution and environment checks

Cleaning Pipeline

1

Load 11 CSV Files

All files from archive/ are loaded into individual DataFrames using the same pattern established in Phase 1. Dtypes are validated and object columns are kept as strings to prevent silent coercion during the merge step.
2

Multi-Table JOIN — 49 Columns, 123,849 Rows

The fact table (postings.csv) is joined to all 10 dimension tables using left joins on job_id and company_id. Because several relationships are N:M (one posting maps to multiple industries, skills, or specialities), those tables are aggregated before joining — their values are collapsed into comma-separated strings within a single cell. This preserves all information without inflating row counts.The resulting master DataFrame contains 123,849 rows × 49 columns.
3

Drop Redundant and Irrelevant Columns (49 → 36)

Thirteen columns are removed after auditing for analytical value. See the Columns Dropped table below for the full breakdown by category.
4

Role Filtering — 19,725 Data-Related Rows (15.9 %)

A keyword filter on title and description isolates data-related postings (e.g., data engineer, data analyst, machine learning, data scientist). This yields 19,725 rows, representing 15.9 % of the full corpus and forming the primary analytical population for all downstream work.
5

Text Normalization

All free-text and categorical string columns are normalized with .lower() followed by .strip() to eliminate casing inconsistencies and leading/trailing whitespace. This is especially important for formatted_experience_level, formatted_work_type, and the aggregated skills column before any frequency counting.
6

Salary Normalization to Annual USD

Raw salary figures exist across six columns (sal_med, sal_max, sal_min, med_salary, max_salary, min_salary) and carry mixed pay-period labels (HOURLY, MONTHLY, WEEKLY, YEARLY). The annual_salary() function (see below) unifies all records into a single salary_annual column expressed in annual USD.
7

Missing Value Imputation

Non-salary missing values are handled contextually: high-missingness categorical columns receive an explicit 'unknown' sentinel rather than being dropped, preserving row counts for the exploratory analyses that do not require salary data.
8

IQR Outlier Removal — 247 Rows Removed

After normalization, IQR fencing is applied to salary_annual. The fence is computed over the 6,355 rows that carry a non-null salary figure. 247 rows fall outside the fence, leaving 6,108 clean salary records for statistical analysis. See the code block in the Outlier Removal section below.
9

Save Three Output CSVs

Three files are written to output/ for use by Phases 3 and 4. See Output Files for details.

Salary Normalization Function

def annual_salary(row):
    period = str(row.get('sal_pay_period', '')).upper()
    for col in ['sal_med', 'sal_max', 'sal_min',
                'med_salary', 'max_salary', 'min_salary']:
        val = row.get(col)
        if pd.notna(val) and val > 0:
            if period == 'HOURLY':  return val * 2080
            if period == 'MONTHLY': return val * 12
            if period == 'WEEKLY':  return val * 52
            return val
    return np.nan

df['salary_annual'] = df.apply(annual_salary, axis=1)
The function iterates through salary source columns in priority order (median → max → min) and applies the appropriate multiplier based on the pay-period label. Hourly rates use the standard 2,080-hour work year (52 weeks × 40 hours). Rows with no valid salary value in any of the six columns return np.nan.

Columns Dropped

CategoryColumns Removed
Duplicate identifiersjob_id (retained as index), company_id (merged into master)
Redundant salary fieldssal_min, sal_max, sal_med (superseded by salary_annual)
Platform URLsjob_posting_url, application_url, company_linkedin_url
High-cardinality geozip_code, address, city, state (raw strings — geo analysis uses location only)
Low-signal metadatasponsored, listed_time, expiry_time, posting_domain

IQR Outlier Removal

from scipy import stats

Q1 = df_sal['salary_annual'].quantile(0.25)  # $90,000
Q3 = df_sal['salary_annual'].quantile(0.75)  # $166,400
IQR = Q3 - Q1                                # $76,400
lower = max(Q1 - 1.5 * IQR, 10_000)         # $10,000
upper = Q3 + 1.5 * IQR                       # $281,000

df_clean = df_sal[(df_sal['salary_annual'] >= lower) &
                  (df_sal['salary_annual'] <= upper)]
print(f"Removed: {len(df_sal) - len(df_clean)} rows")
# Removed: 247 rows
The lower fence is clamped to a business-logic floor of 10,000/yeartocatchunrealisticallylowconversionsfromhourlyrates.Theupperfenceof10,000/year** to catch unrealistically low conversions from hourly rates. The upper fence of **281,000 removes the extreme outliers (including the erroneous $535M+ entry identified in Phase 1) while retaining legitimate executive compensation.

Output Files

FileRowsColumnsContents
data_roles_completo.csv19,72550All data-related postings, all columns, including nulls in salary
data_roles_salario.csv6,10850Data-related postings with clean, IQR-filtered salary_annual
data_maestro_completo.csv123,84949Full cleaned dataset (all roles) for market-wide comparisons
Always use data_roles_salario.csv as the input for any salary-related analysis. It is the only file that has been through the full normalization and IQR-filtering pipeline. Using data_roles_completo.csv or data_maestro_completo.csv directly for salary statistics will include nulls and extreme outliers that will skew every metric.

Build docs developers (and LLMs) love