Exploratory Analysis: Data Quality and Salary Stats

This is the first notebook in a series of three (analisis_01_exploracion_y_descriptiva.ipynb). Before any modeling or probabilistic analysis can take place, the data foundation must be solid — and that means rigorously understanding what the datasets contain, what they are missing, and whether their distributions meet the assumptions required for standard statistical tests. This notebook inspects both the Tecnoempleo scraped dataset and the DS Salaries reference dataset for data quality, computes descriptive statistics across key salary variables, and applies the Shapiro-Wilk normality test to determine whether parametric or non-parametric methods are appropriate for the rest of the analysis pipeline.

Diagnostic and Data Quality

The first step in any honest data analysis is confronting the raw state of your data before applying transformations or drawing conclusions. Dimensional inspection reveals the scope of each dataset, while type auditing and null quantification expose structural problems that can silently invalidate downstream results.

Dimensional Inspection

Use .shape, .dtypes, and .describe() to understand the size, column types, and summary statistics of each dataset. This gives an immediate picture of how many records and features are available and whether numeric columns contain sensible ranges.

Null Quantification

Call .isnull().sum() on the DataFrame to count missing values per column. This is more than a data-cleaning step, because the pattern of missingness carries analytical meaning. In the Tecnoempleo dataset, 80.7% of salary fields are null. The analysis classifies this pattern as MNAR (Missing Not At Random): the absence of a salary value is systematically related to the type of job posting rather than to chance. A high null rate does not by itself show that values are missing non-randomly; the supporting evidence is that missingness varies with posting type.
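The step from raw null counts to a percentage, and the check on whether missingness varies by posting type, can be sketched as follows. This is a minimal illustration on an invented toy frame: only the column names salario and modalidad come from this page, and the values are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Tecnoempleo data.
# Only the column names come from this page; the values are invented.
toy = pd.DataFrame({
    "modalidad": ["remoto", "remoto", "presencial", "presencial", "hibrido", "hibrido"],
    "salario": [30000.0, None, None, None, 42000.0, None],
})

# Share of missing values per column, as a percentage.
null_pct = toy.isnull().mean().mul(100).round(1)
print(null_pct)

# Missingness rate of `salario` within each posting type.
null_by_mode = toy["salario"].isnull().groupby(toy["modalidad"]).mean()
print(null_by_mode)
```

Large differences between groups are consistent with non-random missingness, but they do not prove MNAR on their own; treat the output as a diagnostic rather than a test.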

Load and Inspect

The following code loads the cleaned Tecnoempleo dataset and runs the full diagnostic suite:

import pandas as pd

df = pd.read_csv('data/processed/clean_tecnoempleo_jobs.csv')
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
print(df.describe())

Run this block first and review the output before proceeding. Pay particular attention to the null count in the salario column: divided by the row count from .shape, it should reproduce the missingness rate behind the MNAR finding discussed in Notebook 3.

The .describe() method only summarises numeric columns by default. To include categorical columns, use df.describe(include='all'). For Tecnoempleo data, categorical columns such as modalidad and busqueda are equally important to inspect.
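The difference between the two .describe() calls can be seen on a small example. The frame below is hypothetical: the column names follow this page, while the rows are invented purely for illustration.

```python
import pandas as pd

# Hypothetical rows; column names follow this page, values are invented.
toy = pd.DataFrame({
    "busqueda": ["python", "python", "java", "data engineer"],
    "modalidad": ["remoto", "hibrido", "remoto", None],
    "salario": [35000.0, None, None, 40000.0],
})

numeric_only = toy.describe()             # numeric columns only
everything = toy.describe(include="all")  # adds count/unique/top/freq for text columns

print(everything)

# Frequency table for one categorical column, keeping missing values visible.
print(toy["modalidad"].value_counts(dropna=False))
```

For categorical columns, the unique, top and freq rows are usually the most informative, and value_counts(dropna=False) shows how many postings have no value at all.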

Descriptive Statistics

Once data quality has been assessed, the next step is characterising the central tendency and dispersion of the salary variables. For skewed distributions — which salary data almost always exhibits — the choice of summary statistic has real-world consequences. The analysis of the DS Salaries dataset reveals a right-skewed salary distribution:

Mean Salary

€103,314 — pulled upward by high-earning outliers in senior and leadership roles.

Median Salary

€93,444 — the midpoint of the actual distribution, unaffected by extreme values.

Skew Direction

Right-skewed — mean exceeds median, confirming a long upper tail of high salaries.

Because the mean is inflated by a relatively small number of very high salaries, it is a misleading benchmark for most candidates. The median sits €9,870 below the mean (€103,314 − €93,444), so a typical developer earns noticeably less per year than the mean suggests. Always report the median salary to candidates: the mean creates false expectations and can lead to poor negotiation outcomes. For dispersion, prefer the interquartile range (IQR), which is robust to extreme values; the standard deviation is inflated by the same outliers that distort the mean, so report it alongside the IQR rather than on its own. .describe() reports the standard deviation and the 25th and 75th percentiles, and the IQR is the difference between those two percentiles. Highlight these metrics in any public-facing salary summary.
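The effect of a long upper tail on each statistic can be checked on a tiny invented sample. The figures below are assumptions chosen for illustration and are not taken from the DS Salaries data.

```python
import pandas as pd

# Invented salaries with one extreme value, to mimic a long upper tail.
salaries = pd.Series(
    [40_000, 45_000, 50_000, 55_000, 60_000, 250_000], name="salary_in_eur"
)

mean = salaries.mean()
median = salaries.median()
q1, q3 = salaries.quantile([0.25, 0.75])
iqr = q3 - q1               # derived from the 25% and 75% rows of .describe()
skewness = salaries.skew()  # positive for a right-skewed sample

print(f"mean={mean:.0f} median={median:.0f} IQR={iqr:.0f} skew={skewness:.2f}")
```

A single extreme value moves the mean well above the median, while the median and the IQR barely react. This is the same mechanism that separates the mean and median reported above.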

Salary Distribution Analysis

Visualising the salary distribution provides intuition that summary statistics alone cannot convey. Histograms show the frequency of salary values across bins, while KDE (Kernel Density Estimation) curves overlay a smooth, continuous probability density estimate — effectively a smoothed histogram that reveals the overall shape of the distribution without bin-size artefacts.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Fixed USD→EUR rate used throughout the analysis (1 USD = 0.92 EUR)
USD_TO_EUR = 0.92

df_sal = pd.read_csv('data/raw/ds_salaries.csv')
df_sal['salary_in_eur'] = df_sal['salary_in_usd'] * USD_TO_EUR

# KDE + histogram, with the median and mean marked as vertical lines
sns.histplot(df_sal['salary_in_eur'], kde=True)
plt.axvline(df_sal['salary_in_eur'].median(), color='green', linestyle='--', label='Median')
plt.axvline(df_sal['salary_in_eur'].mean(), color='red', linestyle='--', label='Mean')
plt.xlabel('Salary (EUR)')
plt.legend()
plt.title('Salary Distribution (DS Salaries)')
plt.show()

# Shapiro-Wilk test (H0: the sample comes from a normal distribution)
stat, p = stats.shapiro(df_sal['salary_in_eur'].dropna())
print(f'Shapiro-Wilk: W={stat:.4f}, p={p:.6f}')
# p < 0.05 → reject normality hypothesis

The KDE curve will visibly show a long right tail — the signature of a right-skewed distribution. The gap between the green (median) and red (mean) vertical lines makes the skew immediately legible to non-technical stakeholders.

Shapiro-Wilk Normality Test

The Shapiro-Wilk test formally evaluates the null hypothesis that the data was drawn from a normal distribution. The result for DS Salaries salary data is unambiguous: p < 0.05, which means the null hypothesis of normality is rejected. Salary does not follow a normal distribution. This finding has direct methodological consequences: any statistical technique that assumes normality — such as Pearson correlation with inferential claims, or t-tests for group comparisons — must be replaced with non-parametric equivalents in subsequent notebooks.
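As an illustration of what "non-parametric equivalents" can look like, the sketch below uses two common rank-based tests from SciPy on synthetic data. Which tests the later notebooks actually use is not stated here; the groups, variable names and distribution parameters are all assumptions made up for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic, right-skewed "salaries" for two hypothetical groups.
group_a = rng.lognormal(mean=11.0, sigma=0.4, size=200)
group_b = rng.lognormal(mean=11.3, sigma=0.4, size=200)

# Mann-Whitney U: rank-based alternative to the two-sample t-test.
u_stat, u_p = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"Mann-Whitney U={u_stat:.0f}, p={u_p:.4g}")

# Spearman: rank-based alternative to Pearson correlation.
experience = rng.uniform(0, 20, size=200)  # hypothetical years of experience
salary = 30_000 * np.exp(0.05 * experience) * rng.lognormal(0, 0.2, size=200)
rho, rho_p = stats.spearmanr(experience, salary)
print(f"Spearman rho={rho:.2f}, p={rho_p:.4g}")
```

Both tests work on ranks rather than raw values, so a handful of very high salaries cannot dominate the result the way they can with a t-test or Pearson correlation.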

The Shapiro-Wilk test is most reliable for small-to-medium samples (typically n < 5,000). Above that size, SciPy's documentation notes that the W statistic remains accurate but the p-value may not be. The Kolmogorov-Smirnov and Anderson-Darling tests are common alternatives for larger datasets. With very large samples, however, any formal normality test can flag deviations too small to matter in practice, so read the p-value alongside the histogram and KDE plot.
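For a sample above the n < 5,000 guideline, two possible workarounds are sketched below: running Shapiro-Wilk on a random subsample, and running a Kolmogorov-Smirnov test on the full sample. The data is synthetic and the subsample size of 2,000 is an arbitrary assumption, not a value used in the notebooks.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic right-skewed sample, larger than the n < 5,000 guideline.
x = rng.lognormal(mean=11.0, sigma=0.5, size=20_000)

# Option 1: Shapiro-Wilk on a random subsample within the guideline.
sub = rng.choice(x, size=2_000, replace=False)
w_stat, w_p = stats.shapiro(sub)
print(f"Shapiro-Wilk (n=2000): W={w_stat:.4f}, p={w_p:.3g}")

# Option 2: Kolmogorov-Smirnov against a standard normal after standardising.
# Estimating the mean and std from the same data makes this p-value approximate.
z = (x - x.mean()) / x.std(ddof=1)
ks_stat, ks_p = stats.kstest(z, "norm")
print(f"Kolmogorov-Smirnov (n=20000): D={ks_stat:.4f}, p={ks_p:.3g}")
```

On clearly skewed data both approaches reject normality. On near-normal data they can disagree, which is a reason to inspect the distribution plot as well.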

Key Findings from Notebook 1

The following findings from this notebook directly shape the methodology of Notebooks 2 and 3:

Finding	Detail	Consequence
Right-skewed salary distribution	Mean (€103,314) > Median (€93,444)	Report median to candidates; mean misleads
80.7% of Tecnoempleo salary fields are null	MNAR pattern — not random missingness	Tecnoempleo cannot be used as a salary source
Normal distribution hypothesis rejected	Shapiro-Wilk p < 0.05	Use non-parametric statistics throughout
DS Salaries USD→EUR conversion required	Rate: 1 USD = 0.92 EUR	All monetary comparisons use `salary_in_eur`

Run Notebook 2 (analisis_02_correlaciones_agrupaciones_probabilidad.ipynb) only after completing this one — the analysis pipeline is sequential. Notebook 2 relies on the cleaned column salary_in_eur and the confirmed non-parametric assumption established here. Skipping this notebook will lead to incorrectly specified models downstream.
