Phase 3 moves beyond data preparation and into rigorous statistical interrogation. Using the clean outputs from Phase 2, this phase computes descriptive statistics for all key variables, builds a correlation matrix, and performs a series of GroupBy aggregations that expose how salary, demand, and opportunity are distributed across experience levels, contract types, industries, and skills. Crucially, Phase 3 is also where the dataset’s structural biases are formally classified — transforming Phase 1’s qualitative observations into quantified, named findings that directly inform the recommendations delivered in Phase 4.

Notebooks

  • Fase3_Analisis_Estadistico_Sesgos.ipynb — primary analysis notebook covering Sections 1–6
  • Fase3_1_Informe_de_Sesgos.ipynb — deep-dive bias report with expanded commentary and embedded visualizations for each identified bias

Libraries

| Library | Purpose |
| --- | --- |
| pandas | GroupBy aggregations, pivot tables, and correlation matrices |
| NumPy | Statistical helpers, array operations |
| scipy.stats | Skewness, kurtosis, and conditional probability utilities |
| warnings | Suppress non-critical output |

Section 2 — Descriptive Statistics

Full distributional summaries are computed for all numerical variables. The key statistics for salary_annual (n = 6,108, sourced from data_roles_salario.csv) are:

| Statistic | Value |
| --- | --- |
| Count | 6,108 |
| Mean | $128,596 |
| Median | $124,800 |
| Std Dev | $53,775 |
| Min | $13,333 |
| Q1 (25th pct) | $90,000 |
| Q3 (75th pct) | $162,240 |
| Max | $281,000 |
| Skewness | 0.442 |
| Kurtosis | −0.248 |

The skewness of 0.442 indicates a mildly right-skewed distribution: the salary data is approximately normal with a modest tail toward higher values, which agrees with the mean ($128,596) sitting above the median ($124,800). The excess kurtosis of −0.248 (platykurtic) indicates slightly lighter tails than a normal distribution, which is consistent with the IQR outlier removal applied in Phase 2. The views and applies columns are also profiled in this section: the median posting receives 165 views and 11 applications, and both distributions are strongly right-skewed, driven by viral or high-prestige listings.
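
As a rough illustration of how such a summary can be produced, the sketch below runs `describe()`, `scipy.stats.skew`, and `scipy.stats.kurtosis` on a small toy Series. The salary values are invented for the example and are not taken from data_roles_salario.csv. Whether the notebook uses the SciPy functions or the pandas `.skew()` / `.kurt()` methods (which apply a bias correction) is not stated here, so treat the choice as an assumption.

```python
import pandas as pd
from scipy import stats

# Toy salaries for illustration only; NOT values from data_roles_salario.csv.
salary = pd.Series(
    [60_000, 85_000, 90_000, 110_000, 125_000, 140_000, 160_000, 230_000],
    name="salary_annual",
)

summary = salary.describe()               # count, mean, std, min, quartiles, max
skewness = stats.skew(salary)             # > 0 means a longer right tail
excess_kurtosis = stats.kurtosis(salary)  # Fisher definition: a normal distribution gives 0

print(summary.round(0).to_string())
print(f"skewness={skewness:.3f}, excess kurtosis={excess_kurtosis:.3f}")
```

A positive skewness together with a mean above the median is the same pattern reported for salary_annual above.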

Section 3 — Correlation Matrix

A Pearson correlation matrix is computed across all numerical columns using .corr(). Key findings:
  • salary_annual shows weak positive correlation with views (r ≈ 0.12) and near-zero correlation with applies (r ≈ 0.04), meaning higher-paying jobs do not systematically attract more applications in this dataset
  • Experience level (encoded ordinally) shows the strongest association with salary_annual among the available features
  • views and applies are moderately correlated with each other (r ≈ 0.61), as expected — more-viewed postings receive more applications
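
A minimal sketch of the correlation step, assuming a DataFrame with the three numeric columns named above. The rows are toy values chosen for illustration, so the resulting coefficients will not match the r values reported in this section.

```python
import pandas as pd

# Toy postings for illustration only; the real analysis uses the Phase 2 outputs.
df = pd.DataFrame({
    "salary_annual": [90_000, 120_000, 150_000, 110_000, 135_000, 100_000],
    "views": [120, 300, 150, 800, 90, 400],
    "applies": [8, 25, 10, 60, 5, 30],
})

# Keep numeric columns only, then compute pairwise Pearson coefficients.
corr = df.select_dtypes(include="number").corr(method="pearson")

print(corr.round(2).to_string())
```

The matrix is symmetric with ones on the diagonal, so only the upper or lower triangle needs to be read.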

Section 4 — GroupBy Analysis

4.0 — Gini Index for Salary Inequality

A Gini index is computed over the salary_annual distribution to quantify compensation inequality across data roles. A value approaching 0 indicates perfect equality; a value approaching 1 indicates maximum concentration.
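
The exact formula used in the notebook is not shown on this page. One common way to compute a Gini coefficient is the rank-based formula on sorted values, G = 2·Σ(i·x_i) / (n·Σx_i) − (n + 1)/n, sketched below. The `gini` helper is a hypothetical name introduced for this example.

```python
import numpy as np

def gini(values):
    """Gini coefficient of non-negative values (0 = perfect equality)."""
    x = np.sort(np.asarray(values, dtype=float))  # ascending order
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    ranks = np.arange(1, n + 1)                   # 1-based rank of each sorted value
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)

equal = gini([50_000, 50_000, 50_000, 50_000])    # everyone earns the same
concentrated = gini([0, 0, 0, 10])                # one observation holds everything
print(equal, concentrated)
```

With this formula a finite sample tops out at (n − 1)/n rather than exactly 1, which is why the fully concentrated four-value example yields 0.75.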

4.1 — Salary by Experience Level

Median and mean salary are aggregated by formatted_experience_level to produce a compensation ladder from entry to executive:
# Mean, median, and posting count per experience level, highest median first.
salary_by_exp = (
    df_sal.groupby('formatted_experience_level')['salary_annual']
    .agg(['mean', 'median', 'count'])
    .sort_values('median', ascending=False)
)
print(salary_by_exp.to_string())
The output confirms a monotonic increase in median salary from entry level through executive, with the largest step-change occurring between mid-senior and director levels.

4.2 — Offers and Salary by Contract Type

Postings and salary figures are segmented by formatted_work_type (Full-time, Contract, Part-time). Full-time roles dominate in volume (≈80 %) but contract roles show competitive or higher median salaries, likely reflecting premium market rates for specialist contractors.
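
The segmentation described above could look roughly like the following sketch, which counts postings and takes the median salary per work type, then derives each type's share of the total. The rows are toy data, and the output column names (`postings`, `median_salary`, `share_pct`) are assumptions made for the example.

```python
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "formatted_work_type": ["Full-time"] * 4 + ["Contract"],
    "salary_annual": [100_000, 120_000, 140_000, 110_000, 150_000],
})

# Named aggregation: one row per work type with a count and a median.
by_type = (
    df.groupby("formatted_work_type")["salary_annual"]
    .agg(postings="count", median_salary="median")
)
# Share of all postings held by each work type, in percent.
by_type["share_pct"] = 100 * by_type["postings"] / by_type["postings"].sum()

print(by_type.to_string())
```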

4.3 — Top Industries by Offer Count and Median Salary

Industries are ranked by both posting volume and median salary_annual. IT and Software Development sectors lead in offer count; Finance and Investment Management show the highest median salaries among data roles.

4.4 — Top Skills by Demand Frequency

The aggregated skills column (comma-separated skill lists) is exploded and counted to rank skills by raw demand frequency. Python, SQL, and Machine Learning consistently occupy the top positions, followed by cloud platform skills (AWS, Azure, GCP).
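
A minimal sketch of the explode-and-count step, assuming the skill lists live in a column named `skills` (the actual column name is not given on this page) and using toy rows:

```python
import pandas as pd

# Toy postings; the column name "skills" is an assumption for this example.
df = pd.DataFrame({
    "skills": ["Python, SQL, AWS", "SQL, Machine Learning", "Python, SQL", None],
})

skill_counts = (
    df["skills"]
    .dropna()          # postings without a skill list contribute nothing
    .str.split(",")    # "Python, SQL" -> ["Python", " SQL"]
    .explode()         # one row per (posting, skill) pair
    .str.strip()       # remove the whitespace left by the split
    .value_counts()    # frequency per skill, most common first
)

print(skill_counts.to_string())
```

Stripping whitespace after the split matters: without it, "SQL" and " SQL" would be counted as different skills.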

4.5 — Pivot Table: Median Salary by Experience × Contract Type

A two-dimensional pivot table crosses formatted_experience_level (rows) against formatted_work_type (columns), with salary_annual median as the aggregated value. This reveals, for example, that contract-based entry-level data roles can pay comparably to full-time associate roles — a finding directly relevant to DataTalent Solutions S.L.’s hiring strategy.
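
The cross-tabulation can be sketched with `DataFrame.pivot_table` as below. The rows are toy values, and the category labels ("Entry level", "Associate", "Contract", "Full-time") are assumed spellings used only for illustration.

```python
import pandas as pd

# Toy data for illustration only; labels are assumed spellings.
df = pd.DataFrame({
    "formatted_experience_level": [
        "Entry level", "Entry level", "Entry level", "Associate", "Associate",
    ],
    "formatted_work_type": [
        "Contract", "Contract", "Full-time", "Full-time", "Full-time",
    ],
    "salary_annual": [80_000, 90_000, 70_000, 84_000, 86_000],
})

# Rows: experience level; columns: work type; cells: median salary.
pivot = df.pivot_table(
    index="formatted_experience_level",
    columns="formatted_work_type",
    values="salary_annual",
    aggfunc="median",
)

print(pivot.to_string())
```

Combinations with no postings (here, Associate contract roles) appear as NaN, so sparse cells should be checked for sample size before being interpreted.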

Section 5 — Conditional Probability P(A | B)

Conditional probabilities are computed to quantify the relationship between categorical variables. Representative examples:
  • P(high salary | senior level): probability that a mid-senior or above posting falls in the top salary quartile
  • P(data role | IT industry): probability that an IT-industry posting is a data-specific role
  • P(salary disclosed | large company): probability that a posting includes salary data given that the employer exceeds a headcount threshold — a direct empirical test of the MNAR hypothesis
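
Each of these quantities is an empirical estimate of P(A | B) = P(A ∩ B) / P(B), that is, the share of rows satisfying B that also satisfy A. The sketch below works through the first example on toy data. The `conditional_probability` helper and the experience-level labels are assumptions introduced for this illustration, not names confirmed by the notebook.

```python
import pandas as pd

def conditional_probability(a: pd.Series, b: pd.Series) -> float:
    """Estimate P(A | B) as count(A and B) / count(B) from boolean Series."""
    if not b.any():
        raise ValueError("conditioning event B never occurs")
    return float((a & b).sum() / b.sum())

# Toy data for illustration only; level labels are assumed spellings.
df = pd.DataFrame({
    "formatted_experience_level": [
        "Mid-Senior level", "Director", "Mid-Senior level", "Executive",
        "Entry level", "Associate", "Associate", "Entry level",
    ],
    "salary_annual": [
        180_000, 175_000, 120_000, 160_000,
        90_000, 100_000, 170_000, 80_000,
    ],
})

# Event B: the posting is mid-senior level or above.
senior = df["formatted_experience_level"].isin(
    ["Mid-Senior level", "Director", "Executive"]
)
# Event A: the salary falls in the top quartile of the sample.
high_salary = df["salary_annual"] >= df["salary_annual"].quantile(0.75)

p_high_given_senior = conditional_probability(high_salary, senior)
p_high = float(high_salary.mean())  # unconditional baseline for comparison
print(p_high_given_senior, p_high)
```

Comparing the conditional estimate with the unconditional baseline shows whether the conditioning event carries information: if the two are close, A and B are roughly independent in the sample.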

Section 6 — Formally Detected Biases

Phase 3 classifies four structural biases within the dataset (a full eight-bias taxonomy including visualization-level biases is elaborated in Fase3_1_Informe_de_Sesgos.ipynb):

| # | Bias | Description |
| --- | --- | --- |
| 1 | MNAR: Salary Missingness Not At Random | Companies with below-market compensation systematically omit salary data. The conditional probability analysis in Section 5 confirms that large, established companies disclose salary at a significantly higher rate than small or unknown employers, introducing a systematic upward bias into any salary statistic computed on the disclosed subset. |
| 2 | Geographic Bias | Over 95 % of postings are US-based. Salary benchmarks, skill demand rankings, and experience-level distributions are not representative of non-US markets, including Spain, the primary market for DataTalent Solutions S.L. Any direct application of these figures to European hiring strategy requires explicit market adjustment. |
| 3 | Selection Bias (Platform) | The dataset contains only postings published on LinkedIn. Companies that recruit primarily through other channels (direct applications, recruiters, job boards, internal mobility) are entirely absent. LinkedIn’s own algorithmic promotion of certain postings further distorts demand signals. |
| 4 | Absence of Gender Data | No gender information is present in any field. Because this protected attribute is absent from the data altogether, gender-based salary gaps cannot be detected or quantified within the dataset. Any fairness claims derived from this data are therefore incomplete. |

The four biases above are the ones formally detected through statistical testing in Phase 3. The companion notebook Fase3_1_Informe_de_Sesgos.ipynb expands this to a full eight-bias taxonomy that includes visualization-level distortions, temporal bias (posting dates skewed toward recent months), and company-size representation bias. See the Bias Analysis page for the complete treatment of each bias.

Fase3_1_Informe_de_Sesgos.ipynb is approximately 8.5 MB due to embedded plot outputs and expanded statistical commentary. Opening it in resource-constrained environments (e.g., GitHub’s notebook viewer or low-memory JupyterHub instances) may cause rendering delays or timeouts. Use jupyter nbconvert --clear-output to strip outputs before committing, or view the exported HTML report instead.