This is the second notebook in the series (analisis_02_correlaciones_agrupaciones_probabilidad.ipynb), building directly on the descriptive foundation established in Notebook 1. Where the first notebook characterised the shape and quality of the data, this one uses it to answer specific business questions. Through multivariate correlation analysis, ranked skill-demand groupings, and conditional probability modeling, it translates raw job market data into actionable intelligence — revealing which variables genuinely predict salary, which company types offer the most remote flexibility, and how dramatically the probability of earning above the median shifts with experience level.

Multivariate Analysis

Understanding which numeric variables move together — and how strongly — is the starting point for any salary prediction or segmentation model. The correlation matrix captures pairwise linear relationships across the three key continuous variables in the DS Salaries dataset.
import seaborn as sns
import matplotlib.pyplot as plt

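# df_sal is assumed to be the DS Salaries DataFrame loaded earlier in the notebook
cols = ['salary_in_eur', 'remote_ratio', 'work_year']
# Pairwise Pearson correlations, rounded for display
corr_matrix = df_sal[cols].corr().round(3)
# Pin the colour scale to [-1, 1] so weak correlations are not visually exaggerated
sns.heatmap(corr_matrix, annot=True, cmap='RdPu', vmin=-1, vmax=1)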
plt.title('Correlation Matrix — DS Salaries')
plt.show()
The heatmap makes the relationships immediately readable: all three correlations are positive, but the magnitudes are telling.

Salary vs. Remote Ratio

Correlation: 0.13 — effectively negligible. Remote work status has almost no linear relationship with salary level.

Salary vs. Work Year

Correlation: ~0.17 — the strongest of the three, but still very weak. Salaries have drifted upward slightly over the years covered.

Remote Ratio vs. Work Year

Correlation: positive but weak — remote work availability increased over time, but not dramatically within this dataset.
The maximum correlation observed across all three pairs is 0.17, so none of these numeric variables is a strong linear predictor of salary in isolation. Experience level, a categorical variable, is a far more powerful predictor, as the conditional probability analysis below demonstrates.
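To read off the strongest pair without inspecting the heatmap by eye, the matrix can be flattened into a ranked list. The following is a minimal sketch on invented toy data (the values and the `toy` frame are assumptions, not the notebook's data); it keeps only the upper triangle so that each pair appears once.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_sal; the values are invented for illustration only.
toy = pd.DataFrame({
    "salary_in_eur": [40000, 52000, 61000, 75000, 90000, 58000],
    "remote_ratio":  [0, 100, 50, 100, 0, 50],
    "work_year":     [2020, 2020, 2021, 2021, 2022, 2022],
})

corr = toy.corr()

# Upper triangle only (k=1 excludes the diagonal), so each pair is listed once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)   # everything outside the upper triangle becomes NaN
    .stack()           # long format: (var_a, var_b) -> r
    .dropna()          # drop the masked cells
    .sort_values(key=abs, ascending=False)
)
print(pairs)
```

Sorting by absolute value matters because a strong negative correlation is as informative as a strong positive one.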

Business Groupings

Aggregation by category reveals the structure of the job market in ways that correlations cannot. Two groupings are particularly important for TinderJob’s product: skill demand from Tecnoempleo, and salary by experience level from DS Salaries.

Top 20 Skills by Demand

Skills are stored as comma-separated strings in the skills column. Exploding this column and counting occurrences gives an unambiguous demand ranking:
skill_counts = (
    df['skills'].dropna()
    .str.split(',')
    .explode()
    .str.strip()
    .value_counts()
    .head(20)
)
The top five most demanded skills across Tecnoempleo listings are:
Rank | Skill | Job Postings
1 | Python | 168
2 | Java | 159
3 | SQL | 96
4 | Angular | 61
5 | Azure | 58
Python and Java dominate the ranking: together they account for 327 skill mentions, more than SQL, Angular, and Azure combined (215). SQL’s third-place ranking, ahead of cloud and frontend frameworks, suggests that data literacy remains a baseline expectation across tech roles in the Spanish market.
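To make the split, explode, and strip steps concrete, here is a small sketch of the same pipeline on a three-row sample. The sample rows are invented for illustration; only the `skills` column name comes from the notebook.

```python
import pandas as pd

# Hypothetical three-row sample; real listings come from the Tecnoempleo data.
toy = pd.DataFrame({
    "skills": ["Python, SQL", "Java,Python ", None],
})

skill_counts = (
    toy["skills"].dropna()   # skip listings with no skills recorded
    .str.split(",")          # "Python, SQL" -> ["Python", " SQL"]
    .explode()               # one row per (listing, skill) pair
    .str.strip()             # " SQL" -> "SQL"
    .value_counts()
)
print(skill_counts)
```

Without the `.str.strip()` step, `" SQL"` and `"SQL"` would be counted as different skills. Differences in casing (for example `python` vs. `Python`) are still counted separately in this sketch.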

Salary Median by Experience Level

Grouping DS Salaries records by experience label reveals how salary scales with seniority:
salary_by_exp = (
    df_sal.groupby('experience_label')['salary_in_eur']
    .agg(['median', 'mean', 'count'])
    .round(0)
)
print(salary_by_exp)
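
The `experience_label` column is a derived field whose construction is not shown here. A minimal sketch of one way it might be built, assuming the raw column is `experience_level` with the codes EN, MI, SE, and EX used in the public DS Salaries dataset (the label names and the toy salaries are assumptions):

```python
import pandas as pd

# Assumed mapping from raw codes to readable labels (not shown in the notebook).
LABELS = {"EN": "Junior", "MI": "Mid-level", "SE": "Senior", "EX": "Executive"}

# Invented rows for illustration only.
toy = pd.DataFrame({
    "experience_level": ["EN", "EN", "MI", "MI", "SE", "SE", "SE"],
    "salary_in_eur":    [30000, 34000, 45000, 55000, 70000, 80000, 120000],
})
toy["experience_label"] = toy["experience_level"].map(LABELS)

salary_by_exp = (
    toy.groupby("experience_label")["salary_in_eur"]
    .agg(["median", "mean", "count"])
    .round(0)
)
print(salary_by_exp)
```

In this toy sample the Senior mean (90,000) sits above the Senior median (80,000) because of a single high salary, which illustrates why the median is the more robust figure to report alongside the mean.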

Work Modality Distribution by City

A pivot table cross-referencing city and work modality (presencial, híbrido, remoto) reveals geographic patterns in how companies offer flexible arrangements:
modality_pivot = df.pivot_table(
    index='ciudad',
    columns='modalidad',
    values='oferta_id',
    aggfunc='count',
    fill_value=0
)
modality_pivot['total'] = modality_pivot.sum(axis=1)
modality_pivot = modality_pivot[modality_pivot['total'] >= 5].sort_values('total', ascending=False)
print(modality_pivot)

Conditional Probability Modeling

Conditional probability, written P(A | B), is the probability of event A given that condition B is known. It turns the grouping analysis into predictive statements. Three business scenarios are modeled below: the first two use the DS Salaries dataset, and the third uses the Tecnoempleo listings.
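For reference, the standard definition and the frequency estimate it reduces to on a finite dataset can be written as follows. Here A is the binary outcome (for example, earning above the median) and B is the grouping condition (for example, a given experience level). The row-count form is the usual empirical estimate; that the notebook computes it exactly this way is an assumption, though it is consistent with a group-wise ratio of counts.

```latex
P(A \mid B) \;=\; \frac{P(A \cap B)}{P(B)}
\;\approx\;
\frac{\#\{\text{rows where both } A \text{ and } B \text{ hold}\}}
     {\#\{\text{rows where } B \text{ holds}\}}
```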
1. P(High Salary | Experience Level)

Business question: What is the probability that a candidate earns above the global salary median, given their experience level?
# The global median salary is the threshold for a "high" salary
mediana_global = df_sal['salary_in_eur'].median()
df_sal['salario_alto'] = df_sal['salary_in_eur'] > mediana_global

# Share of above-median salaries within each experience level, with group size n
prob_nivel = (
    df_sal.groupby('experience_label')['salario_alto']
    .agg(prob=lambda x: x.sum() / len(x), n='count')
    .reset_index()
)
prob_nivel['prob_pct'] = (prob_nivel['prob'] * 100).round(1)
Results by experience level:
Experience Level | P(High Salary)
Junior | 11.4%
Mid-level | ~40%
Senior | 73.2%
The most significant inflection point in salary probability is the Mid-level → Senior transition — not Junior → Mid-level. This insight should inform TinderJob’s career progression guidance: the most valuable investment a candidate can make is acquiring the credentials and experience to cross the Senior threshold.
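The comparison between the two transitions is simple arithmetic on the reported percentages. A short sketch, using the table values and treating the approximate Mid-level figure as exactly 40% (an assumption made only for this calculation):

```python
# Percentages from the results table; the Mid-level value is approximate (~40%).
p = {"Junior": 11.4, "Mid-level": 40.0, "Senior": 73.2}

# Absolute jumps, in percentage points
jump_junior_mid = round(p["Mid-level"] - p["Junior"], 1)
jump_mid_senior = round(p["Senior"] - p["Mid-level"], 1)

# Relative comparison between the two ends of the scale
senior_vs_junior = round(p["Senior"] / p["Junior"], 1)

print(jump_junior_mid, jump_mid_senior, senior_vs_junior)
```

Under that assumption the Mid → Senior jump is about 33 percentage points against about 29 for Junior → Mid, and a Senior is roughly 6.4 times as likely as a Junior to earn above the median.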
2. P(Remote Work | Company Size)

Business question: What is the probability of 100% remote work, given company size?

Company size is encoded in DS Salaries as S (small, <50 employees), M (medium, 50–250), and L (large, >250).
Company Size | P(100% Remote)
Small (<50) | Moderate
Medium (50–250) | 69.3% (highest)
Large (>250) | 53.5%
Medium-sized companies offer the highest probability of full remote work, possibly because they have the resources to support distributed teams without the rigid office-attendance policies that large enterprises tend to enforce. Candidates prioritising remote work should specifically target the medium-size segment.
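The computation for this scenario follows the same pattern as the first one. A minimal sketch, assuming the columns are named `company_size` and `remote_ratio` as in the public DS Salaries schema, with invented rows standing in for the real data:

```python
import pandas as pd

# Invented rows; column names follow the public DS Salaries schema (assumed).
toy = pd.DataFrame({
    "company_size": ["S", "S", "M", "M", "M", "M", "L", "L"],
    "remote_ratio": [100, 0, 100, 100, 100, 50, 100, 0],
})

# Event A: the role is fully remote
toy["full_remote"] = toy["remote_ratio"] == 100

prob_remote = (
    toy.groupby("company_size")["full_remote"]
    .agg(prob="mean", n="count")   # mean of a boolean = share of True values
    .reset_index()
)
prob_remote["prob_pct"] = (prob_remote["prob"] * 100).round(1)
print(prob_remote)
```

Reporting `n` next to each probability makes it easier to judge how much weight a group's estimate deserves, which is particularly relevant for the small-company segment.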
3. P(Flexible Work | City)

Business question: What is the probability of hybrid or remote work given the city of the job listing, for cities with at least 5 postings?
City | P(Hybrid or Remote)
Alcobendas | 86%
Almería | 76%
Barcelona | 45%
Madrid | 44%
Alcobendas — home to many multinational tech firms’ Spanish headquarters — leads the ranking with 86% flexible modality probability. Madrid and Barcelona, despite their volume of listings, offer roughly equal and comparatively lower flexibility rates, suggesting that density of competition may correlate with stricter in-office requirements.
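A sketch of how this probability might be computed, reusing the `ciudad` and `modalidad` column names from the pivot table and the minimum of 5 postings per city. The listings are invented, and the exact modality spellings in the real data are an assumption:

```python
import pandas as pd

# Invented listings; `ciudad` and `modalidad` mirror the pivot table's columns.
toy = pd.DataFrame({
    "ciudad": ["Madrid"] * 6 + ["Alcobendas"] * 5 + ["Soria"] * 2,
    "modalidad": (
        ["presencial", "presencial", "presencial", "híbrido", "remoto", "presencial"]
        + ["híbrido", "híbrido", "remoto", "remoto", "presencial"]
        + ["remoto", "remoto"]
    ),
})

# Event A: the listing offers hybrid or remote work
toy["flexible"] = toy["modalidad"].isin(["híbrido", "remoto"])

prob_city = toy.groupby("ciudad")["flexible"].agg(prob="mean", n="count")

# Keep only cities with at least 5 postings; smaller samples are too noisy.
prob_city = prob_city[prob_city["n"] >= 5].sort_values("prob", ascending=False)
print(prob_city)
```

In the toy data, Soria has a 100% flexible share but only two postings, so the threshold removes it. This is the kind of small-sample artefact the minimum-postings filter is meant to guard against.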

Key Findings

Remote ≠ Salary Predictor

Remote ratio correlates with salary at only 0.13 — it is not a meaningful predictor. Experience level is the dominant variable.

Target Medium Companies

Candidates seeking remote work should focus on medium-sized companies (50–250 employees), where P(100% remote) reaches 69.3%.

Senior Threshold Effect

Reaching Senior level raises the probability of a high salary more than sixfold compared to Junior level (73.2% vs. 11.4%). The Mid → Senior jump is the most financially impactful career move.
The conditional probability framework used here can be extended to any binary outcome. Future iterations of TinderJob’s analytics could apply the same pattern to model P(Job Offer | Skill Set) or P(Salary Negotiation Success | Market Context) using richer datasets.
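One way to make the pattern reusable is a small helper that takes any boolean outcome column and any grouping column. The function below is a hypothetical sketch (the name `conditional_probability` and its `min_n` parameter are not part of the notebook), shown on invented rows:

```python
import pandas as pd

def conditional_probability(df, outcome, given, min_n=1):
    """Estimate P(outcome | given) as the share of True values per group.

    `outcome` must be a boolean column; groups with fewer than `min_n`
    rows are dropped. Hypothetical helper, not part of the notebook.
    """
    result = df.groupby(given)[outcome].agg(prob="mean", n="count")
    return result[result["n"] >= min_n].sort_values("prob", ascending=False)

# Invented rows for illustration only.
toy = pd.DataFrame({
    "experience_label": ["Junior", "Junior", "Senior", "Senior", "Senior", "Senior"],
    "salario_alto":     [False, False, True, True, True, False],
})
probs = conditional_probability(toy, "salario_alto", "experience_label")
print(probs)
```

The same call would cover the other scenarios by swapping in a different boolean column and grouping column, with `min_n=5` reproducing the minimum-postings rule used for cities.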
