This is the first notebook in a series of three (analisis_01_exploracion_y_descriptiva.ipynb). Before any modeling or probabilistic analysis can take place, the data foundation must be solid, which means rigorously understanding what the datasets contain, what they are missing, and whether their distributions meet the assumptions required for standard statistical tests. This notebook inspects both the Tecnoempleo scraped dataset and the DS Salaries reference dataset for data quality, computes descriptive statistics across key salary variables, and applies the Shapiro-Wilk normality test to determine whether parametric or non-parametric methods are appropriate for the rest of the analysis pipeline.
Diagnostic and Data Quality
The first step in any honest data analysis is confronting the raw state of your data before applying transformations or drawing conclusions. Dimensional inspection reveals the scope of each dataset, while type auditing and null quantification expose structural problems that can silently invalidate downstream results.
Dimensional Inspection
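As a minimal sketch of dimensional inspection, the snippet below runs the three calls on a tiny hypothetical DataFrame standing in for the scraped Tecnoempleo data. The column names `busqueda` and `modalidad` appear on this page; the salary column name `salario` and every value are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Tecnoempleo data; all values are invented.
df = pd.DataFrame({
    "busqueda": ["data analyst", "data engineer", "data scientist", "data analyst"],
    "modalidad": ["remoto", "híbrido", "presencial", "remoto"],
    "salario": [32000.0, None, None, 41000.0],  # assumed column name
})

n_rows, n_cols = df.shape        # number of records and number of features
print(f"{n_rows} rows x {n_cols} columns")

print(df.dtypes)                 # one dtype per column

numeric_summary = df.describe()  # numeric columns only by default
print(numeric_summary)
```

On this toy frame only `salario` appears in the summary, because it is the only numeric column.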
Use `.shape`, `.dtypes`, and `.describe()` to understand the size, column types, and summary statistics of each dataset. This gives an immediate picture of how many records and features are available and whether numeric columns contain sensible ranges.
Null Quantification
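A minimal sketch of null quantification on hypothetical data. The column name `salario` and all values are assumptions. The group-wise breakdown at the end is one simple way to check whether missingness varies with the type of posting; it is not necessarily the notebook's own method.

```python
import pandas as pd

# Invented sample; `salario` is an assumed name for the salary column.
df = pd.DataFrame({
    "modalidad": ["remoto", "remoto", "presencial", "presencial", "híbrido"],
    "salario": [35000.0, 42000.0, None, None, None],
})

null_counts = df.isnull().sum()         # missing values per column
null_share = df.isnull().mean() * 100   # the same, as a percentage of rows
print(null_counts)
print(null_share.round(1))

# Share of missing salaries within each posting type. Large differences
# between groups suggest the missingness is not purely random.
missing_by_group = df["salario"].isnull().groupby(df["modalidad"]).mean()
print(missing_by_group)
```

If the per-group shares were roughly equal, the missingness would look less structured; here the toy data is built so that they differ sharply.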
Apply `.isnull().sum()` to the DataFrame to count the missing values in every column. This is not merely a data-cleaning step: the pattern of missingness carries analytical meaning. In the Tecnoempleo dataset, 80.7% of salary fields are null, which is far too high to be treated as random noise. The notebook classifies this pattern as MNAR (Missing Not At Random): the absence of a salary value is systematically related to the type of job posting, not to chance.
The `.describe()` method only summarises numeric columns by default. To include categorical columns, use `df.describe(include='all')`. For Tecnoempleo data, categorical columns such as `modalidad` and `busqueda` are equally important to inspect.
Descriptive Statistics
Once data quality has been assessed, the next step is characterising the central tendency and dispersion of the salary variables. For skewed distributions, which salary data almost always exhibits, the choice of summary statistic has real-world consequences. The analysis of the DS Salaries dataset reveals a right-skewed salary distribution:
Mean Salary
€103,314 — pulled upward by high-earning outliers in senior and leadership roles.
Median Salary
€93,444 — the midpoint of the actual distribution, unaffected by extreme values.
Skew Direction
Right-skewed — mean exceeds median, confirming a long upper tail of high salaries.
The median appears as the 50% row in `.describe()` and should be highlighted in any public-facing salary summary.
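To illustrate why the gap between mean and median matters for right-skewed data, the sketch below uses a small invented salary sample, not the DS Salaries data. The name `salary_in_eur` is taken from the findings table on this page; the values are made up.

```python
import pandas as pd

# Invented right-skewed sample: most values are moderate, two are very high.
salaries = pd.Series(
    [48000, 55000, 62000, 70000, 78000, 85000, 93000, 110000, 160000, 250000],
    name="salary_in_eur",
)

mean_salary = salaries.mean()
median_salary = salaries.median()
skewness = salaries.skew()  # sample skewness; positive means a longer upper tail

print(f"mean:   {mean_salary:,.0f}")
print(f"median: {median_salary:,.0f}")
print(f"skew:   {skewness:.2f}")

# The 50% row of describe() is the median.
print(salaries.describe())
```

The two largest values pull the mean well above the median, which is the same pattern the notebook reports for DS Salaries.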
Salary Distribution Analysis
Visualising the salary distribution provides intuition that summary statistics alone cannot convey. Histograms show the frequency of salary values across bins, while KDE (Kernel Density Estimation) curves overlay a smooth, continuous probability density estimate: effectively a smoothed histogram that reveals the overall shape of the distribution without bin-size artefacts.
Shapiro-Wilk Normality Test
The Shapiro-Wilk test formally evaluates the null hypothesis that the data were drawn from a normal distribution. The result for the DS Salaries salary data is unambiguous: p < 0.05, so the null hypothesis of normality is rejected at the 5% significance level. Salary does not follow a normal distribution. This finding has direct methodological consequences: any statistical technique that assumes normality, such as Pearson correlation with inferential claims or t-tests for group comparisons, must be replaced with non-parametric equivalents in subsequent notebooks.
The Shapiro-Wilk test is most reliable for small-to-medium samples (typically n < 5,000). For larger datasets, consider the Kolmogorov-Smirnov test or the Anderson-Darling test, and keep in mind that with very large samples any normality test can become oversensitive to minor deviations from normality.
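The two checks in this section, a smoothed density estimate and the Shapiro-Wilk test, can be sketched together as follows. The data are synthetic log-normal draws with arbitrary parameters standing in for the salary column, so the printed numbers say nothing about the real DS Salaries result. Plotting is omitted to keep the example light.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic right-skewed "salaries"; the log-normal parameters are arbitrary.
salaries = rng.lognormal(mean=11.4, sigma=0.5, size=500)

# Histogram normalised to a density, so it is comparable with the KDE curve.
density, bin_edges = np.histogram(salaries, bins=30, density=True)

# Gaussian KDE evaluated on an evenly spaced grid over the observed range.
kde = stats.gaussian_kde(salaries)
grid = np.linspace(salaries.min(), salaries.max(), 200)
kde_values = kde(grid)

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution.
statistic, p_value = stats.shapiro(salaries)
alpha = 0.05
print(f"W = {statistic:.4f}, p = {p_value:.3g}")
if p_value < alpha:
    print("Reject normality: prefer non-parametric methods.")
else:
    print("No evidence against normality at this significance level.")
```

To draw the figure, `density` and `kde_values` could be passed to a plotting library of choice; the decision rule itself only needs `p_value` and `alpha`.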
Key Findings from Notebook 1
The following findings from this notebook directly shape the methodology of Notebooks 2 and 3:

| Finding | Detail | Consequence |
|---|---|---|
| Right-skewed salary distribution | Mean (€103,314) > Median (€93,444) | Report median to candidates; mean misleads |
| 80.7% of Tecnoempleo salary fields are null | MNAR pattern — not random missingness | Tecnoempleo cannot be used as a salary source |
| Normal distribution hypothesis rejected | Shapiro-Wilk p < 0.05 | Use non-parametric statistics throughout |
| DS Salaries USD→EUR conversion required | Rate: 1 USD = 0.92 EUR | All monetary comparisons use salary_in_eur |
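The conversion in the last row of the table can be sketched as follows. The fixed rate of 0.92 and the target column `salary_in_eur` come from the table above; the source column name `salary_in_usd` and the sample values are assumptions.

```python
import pandas as pd

USD_TO_EUR = 0.92  # fixed rate stated in the findings table

# Invented values; `salary_in_usd` is an assumed source column name.
ds = pd.DataFrame({"salary_in_usd": [80000, 100000, 150000]})

# Derive the euro column once, so every later comparison uses the same unit.
ds["salary_in_eur"] = (ds["salary_in_usd"] * USD_TO_EUR).round(2)
print(ds)
```

A single fixed rate keeps the comparison reproducible, at the cost of ignoring exchange-rate movements across the years covered by the data.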