Phase 3 moves beyond data preparation and into rigorous statistical interrogation. Using the clean outputs from Phase 2, this phase computes descriptive statistics for all key variables, builds a correlation matrix, and performs a series of GroupBy aggregations that expose how salary, demand, and opportunity are distributed across experience levels, contract types, industries, and skills. Crucially, Phase 3 is also where the dataset’s structural biases are formally classified, transforming Phase 1’s qualitative observations into quantified, named findings that directly inform the recommendations delivered in Phase 4.
Notebooks
- `Fase3_Analisis_Estadistico_Sesgos.ipynb`: primary analysis notebook covering Sections 1–6
- `Fase3_1_Informe_de_Sesgos.ipynb`: deep-dive bias report with expanded commentary and embedded visualizations for each identified bias
Libraries
| Library | Purpose |
|---|---|
| pandas | GroupBy aggregations, pivot tables, and correlation matrices |
| NumPy | Statistical helpers, array operations |
| scipy.stats | Skewness, kurtosis, and conditional probability utilities |
| warnings | Suppress non-critical output |
Section 2 — Descriptive Statistics
Full distributional summaries are computed for all numerical variables. The key statistics for `salary_annual` (n = 6,108, sourced from `data_roles_salario.csv`) are:
| Statistic | Value |
|---|---|
| Count | 6,108 |
| Mean | $128,596 |
| Median | $124,800 |
| Std Dev | $53,775 |
| Min | $13,333 |
| Q1 (25th pct) | $90,000 |
| Q3 (75th pct) | $162,240 |
| Max | $281,000 |
| Skewness | 0.442 |
| Kurtosis | −0.248 |
`views` and `applies` are also profiled in this section: the median posting receives 165 views and 11 applications, and both distributions are strongly right-skewed, driven by viral or high-prestige listings.
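The summary above can be reproduced in outline with pandas and `scipy.stats`. The following is a minimal sketch on toy values, not the notebook's actual code; the helper name `describe_salary` is hypothetical, and the choice of bias-corrected estimators (`bias=False`) is an assumption, since the page does not say which variant was used.

```python
import pandas as pd
from scipy import stats


def describe_salary(salary: pd.Series) -> pd.Series:
    """Return describe() output plus skewness and excess kurtosis."""
    s = salary.dropna()
    summary = s.describe()  # count, mean, std, min, 25%, 50%, 75%, max
    # bias=False requests the sample (bias-corrected) estimators: an assumption.
    summary["skewness"] = stats.skew(s, bias=False)
    # scipy's default is Fisher (excess) kurtosis, where a normal distribution scores 0.
    summary["kurtosis"] = stats.kurtosis(s, bias=False)
    return summary


# Toy values for illustration only; not taken from data_roles_salario.csv.
toy = pd.Series([60_000, 90_000, 120_000, 150_000, 240_000, None], name="salary_annual")
print(describe_salary(toy))
```

A positive skewness with a mean above the median, as reported in the table, is the pattern expected when a minority of high salaries stretches the right tail.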
Section 3 — Correlation Matrix
A Pearson correlation matrix is computed across all numerical columns using `.corr()`. Key findings:
- `salary_annual` shows weak positive correlation with `views` (r ≈ 0.12) and near-zero correlation with `applies` (r ≈ 0.04), meaning higher-paying jobs do not systematically attract more applications in this dataset
- Experience level (encoded ordinally) has the strongest predictive signal for salary among available features
- `views` and `applies` are moderately correlated with each other (r ≈ 0.61), as expected: more-viewed postings receive more applications
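As a rough illustration of this step, the sketch below encodes experience level ordinally and computes a Pearson matrix over the numeric columns. The data is synthetic, and both the level labels and the `EXPERIENCE_RANK` mapping are assumptions; the notebook's actual encoding may differ.

```python
import pandas as pd

# Assumed ordinal encoding of experience level; labels and ranks are illustrative.
EXPERIENCE_RANK = {
    "Entry level": 1,
    "Associate": 2,
    "Mid-Senior level": 3,
    "Director": 4,
    "Executive": 5,
}

# Synthetic rows; values are not from the project dataset.
df = pd.DataFrame({
    "salary_annual": [70_000, 95_000, 130_000, 180_000, 220_000],
    "views": [100, 250, 80, 400, 150],
    "applies": [5, 20, 4, 35, 9],
    "formatted_experience_level": list(EXPERIENCE_RANK),
})
df["experience_rank"] = df["formatted_experience_level"].map(EXPERIENCE_RANK)

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr(method="pearson")
print(corr.round(2))
```

Treating an ordinal rank as a numeric variable assumes equal spacing between levels, so the resulting coefficient is best read as a rough indicator rather than a precise effect size.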
Section 4 — GroupBy Analysis
4.0 — Gini Index for Salary Inequality
A Gini index is computed over the `salary_annual` distribution to quantify compensation inequality across data roles. A value of 0 indicates perfect equality; values approaching 1 indicate maximum concentration.
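The page does not state which formula the notebook uses. One common closed form, for values sorted in ascending order, is G = 2·Σ(i·x_i) / (n·Σx_i) − (n + 1)/n. The hypothetical `gini` helper below is a sketch of that form and assumes non-negative salaries.

```python
import numpy as np


def gini(values) -> float:
    """Gini index of non-negative values using the sorted-rank formula."""
    x = np.asarray(values, dtype=float)
    x = np.sort(x[~np.isnan(x)])  # drop missing values, then sort ascending
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return float("nan")  # undefined for empty or all-zero input
    ranks = np.arange(1, n + 1)
    return float(2 * np.sum(ranks * x) / (n * total) - (n + 1) / n)


print(gini([5, 5, 5, 5]))   # identical values: no inequality
print(gini([0, 0, 0, 10]))  # one holder of everything: (n - 1) / n
```

With a finite sample the maximum attainable value is (n − 1)/n rather than exactly 1, which is negligible for a sample of several thousand postings.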
4.1 — Salary by Experience Level
Median and mean salary are aggregated by `formatted_experience_level` to produce a compensation ladder from entry to executive.
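A minimal sketch of such a ladder follows, using synthetic rows. The level labels and their ordering in `LEVELS` are assumptions and should be replaced with the values actually present in the column.

```python
import pandas as pd

# Assumed labels and ordering; adjust to the values found in the data.
LEVELS = ["Entry level", "Associate", "Mid-Senior level", "Director", "Executive"]

# Synthetic rows for illustration only.
df = pd.DataFrame({
    "formatted_experience_level": [
        "Entry level", "Entry level", "Associate",
        "Mid-Senior level", "Mid-Senior level", "Executive",
    ],
    "salary_annual": [60_000, 70_000, 95_000, 130_000, 150_000, 220_000],
})

# An ordered categorical makes the groupby output follow seniority, not the alphabet.
df["formatted_experience_level"] = pd.Categorical(
    df["formatted_experience_level"], categories=LEVELS, ordered=True
)

ladder = (
    df.groupby("formatted_experience_level", observed=True)["salary_annual"]
      .agg(median="median", mean="mean", n="count")
)
print(ladder)
```

Reporting the group size `n` next to each median helps flag levels where the estimate rests on only a handful of postings.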
4.2 — Offers and Salary by Contract Type
Postings and salary figures are segmented by `formatted_work_type` (Full-time, Contract, Part-time). Full-time roles dominate in volume (≈80 %), but contract roles show competitive or higher median salaries, likely reflecting premium market rates for specialist contractors.
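The volume share and median salary per contract type can be obtained in one pass with named aggregation. The sketch below uses synthetic rows and is only an illustration of the pattern, not the notebook's code.

```python
import pandas as pd

# Synthetic rows for illustration only; one salary is deliberately missing.
df = pd.DataFrame({
    "formatted_work_type": ["Full-time"] * 4 + ["Contract"],
    "salary_annual": [100_000, 120_000, None, 140_000, 150_000],
})

by_type = df.groupby("formatted_work_type").agg(
    postings=("salary_annual", "size"),          # counts rows, including missing salaries
    median_salary=("salary_annual", "median"),   # median skips missing salaries
)
by_type["share_pct"] = 100 * by_type["postings"] / by_type["postings"].sum()
print(by_type.sort_values("postings", ascending=False))
```

Using `size` for volume and `median` for pay keeps the two questions separate: postings without a disclosed salary still count toward volume but do not affect the salary figure.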
4.3 — Top Industries by Offer Count and Median Salary
Industries are ranked by both posting volume and median `salary_annual`. IT and Software Development sectors lead in offer count; Finance and Investment Management show the highest median salaries among data roles.
4.4 — Top Skills by Demand Frequency
The aggregated `skills` column (comma-separated skill lists) is exploded and counted to rank skills by raw demand frequency. Python, SQL, and Machine Learning consistently occupy the top positions, followed by cloud platform skills (AWS, Azure, GCP).
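The explode-and-count step might look like the following sketch. The rows are synthetic, and the exact delimiter and whitespace handling in the real column are assumptions.

```python
import pandas as pd

# Synthetic rows for illustration only.
df = pd.DataFrame({
    "skills": ["Python, SQL", "SQL, AWS", "Python, SQL, Machine Learning", None],
})

skill_counts = (
    df["skills"]
      .dropna()                     # postings without skills contribute nothing
      .str.split(",")               # "Python, SQL" -> ["Python", " SQL"]
      .explode()                    # one row per (posting, skill) pair
      .str.strip()                  # remove padding around each skill name
      .loc[lambda s: s != ""]       # guard against trailing commas
      .value_counts()
)
print(skill_counts)
```

Because each posting contributes once per listed skill, these counts measure how often a skill is mentioned, not how many distinct postings require it exclusively.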
4.5 — Pivot Table: Median Salary by Experience × Contract Type
A two-dimensional pivot table crosses `formatted_experience_level` (rows) against `formatted_work_type` (columns), with `salary_annual` median as the aggregated value. This reveals, for example, that contract-based entry-level data roles can pay comparably to full-time associate roles, a finding directly relevant to DataTalent Solutions S.L.’s hiring strategy.
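A sketch of this cross-tabulation with `DataFrame.pivot_table` is shown below. The rows are synthetic and chosen only to show the shape of the output, not to reproduce the reported finding.

```python
import pandas as pd

# Synthetic rows for illustration only.
df = pd.DataFrame({
    "formatted_experience_level": [
        "Entry level", "Entry level", "Entry level", "Associate", "Associate",
    ],
    "formatted_work_type": [
        "Full-time", "Contract", "Contract", "Full-time", "Full-time",
    ],
    "salary_annual": [60_000, 80_000, 90_000, 85_000, 95_000],
})

pivot = df.pivot_table(
    index="formatted_experience_level",   # rows
    columns="formatted_work_type",        # columns
    values="salary_annual",
    aggfunc="median",
)
print(pivot)
```

Cells with no matching postings come back as `NaN`, which is worth distinguishing from a genuinely low median when reading the table.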
Section 5 — Conditional Probability P(A | B)
Conditional probabilities are computed to quantify the relationship between categorical variables. Representative examples:
- P(high salary | senior level): probability that a mid-senior or above posting falls in the top salary quartile
- P(data role | IT industry): probability that an IT-industry posting is a data-specific role
- P(salary disclosed | large company): probability that a posting includes salary data given that the employer exceeds a headcount threshold, a direct empirical test of the MNAR hypothesis
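Each of these quantities can be estimated as P(A | B) = count(A and B) / count(B) over boolean masks. The sketch below applies that to the salary-disclosure example. The `conditional_probability` helper, the `employee_count` column, and the 1,000-employee threshold are all hypothetical, since the page does not name the column or the threshold used.

```python
import pandas as pd


def conditional_probability(a: pd.Series, b: pd.Series) -> float:
    """Estimate P(A | B) from two boolean Series aligned on the same rows."""
    a = a.astype(bool)
    b = b.astype(bool)
    if not b.any():
        return float("nan")  # undefined when the condition never occurs
    return float((a & b).sum() / b.sum())


# Synthetic rows; column name and threshold are assumptions for illustration.
df = pd.DataFrame({
    "employee_count": [50, 5_000, 12_000, 200, 3_000],
    "salary_annual": [None, 120_000, None, None, 90_000],
})

disclosed = df["salary_annual"].notna()
is_large = df["employee_count"] >= 1_000  # hypothetical headcount threshold

p_large = conditional_probability(disclosed, is_large)
p_small = conditional_probability(disclosed, ~is_large)
print(f"P(disclosed | large) = {p_large:.2f}")
print(f"P(disclosed | not large) = {p_small:.2f}")
```

Comparing the two conditional rates side by side is what makes the check informative: a clear gap between them suggests that missingness depends on employer characteristics rather than occurring at random.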
Section 6 — Formally Detected Biases
Phase 3 classifies four structural biases within the dataset (a full eight-bias taxonomy including visualization-level biases is elaborated in `Fase3_1_Informe_de_Sesgos.ipynb`):
| # | Bias | Description |
|---|---|---|
| 1 | MNAR — Salary Missingness Not At Random | Companies with below-market compensation systematically omit salary data. The conditional probability analysis in Section 5 confirms that large, established companies disclose salary at a significantly higher rate than small or unknown employers, introducing a systematic upward bias into any salary statistic computed on the disclosed subset. |
| 2 | Geographic Bias | Over 95 % of postings are US-based. Salary benchmarks, skill demand rankings, and experience-level distributions are not representative of non-US markets, including Spain — the primary market for DataTalent Solutions S.L. Any direct application of these figures to European hiring strategy requires explicit market adjustment. |
| 3 | Selection Bias (Platform) | The dataset contains only postings published on LinkedIn. Companies that recruit primarily through other channels (direct applications, recruiters, job boards, internal mobility) are entirely absent. LinkedIn’s own algorithmic promotion of certain postings further distorts demand signals. |
| 4 | Absence of Gender Data | No gender information is present in any field. This is a structurally undisclosed protected attribute, making it impossible to detect or quantify gender-based salary gaps within the dataset. Any fairness claims derived from this data are therefore incomplete. |
The four biases above are the ones formally detected through statistical testing in Phase 3. The companion notebook `Fase3_1_Informe_de_Sesgos.ipynb` expands this to a full 8-bias taxonomy that includes visualization-level distortions, temporal bias (posting dates skewed toward recent months), and company-size representation bias. See the Bias Analysis page for the complete treatment of each bias.