TinderJob Setup: Install, Scrape, Clean, and Launch

This page walks you through a complete local setup of TinderJob from scratch. By the end you will have cloned the repository, installed all Python dependencies, executed the Tecnoempleo scraper to collect live job listings, run the cleaning pipeline to produce a normalised dataset, and launched the Streamlit dashboard — including the TinderMatch CV engine — in your browser. The full process takes under 10 minutes on a standard internet connection.

Prerequisites

Before you begin, make sure you have the following installed:

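Python 3.8 or higher (verify with python --version). Note that the pandas ≥ 2.3.3 requirement itself needs Python 3.9 or newer, so use 3.9+ in practice.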
pip — comes bundled with Python 3.8+
Git — for cloning the repository
An active internet connection for the scraper step

Setup Steps

Clone the Repository

Clone the TinderJob repository from GitHub and navigate into the project directory:

git clone https://github.com/HelenDiMo/TinderJob.git
cd TinderJob

This creates a TinderJob/ folder containing the full project structure: app/, data/, notebooks/, src/, and requirements.txt.

Create and Activate a Virtual Environment

Create an isolated Python virtual environment named .venv to avoid dependency conflicts with other projects on your system:

python -m venv .venv

Then activate it according to your operating system:

Windows

.\.venv\Scripts\activate

Linux / macOS

source .venv/bin/activate

You will see (.venv) prepended to your terminal prompt when the environment is active.

To deactivate the environment at any time, simply run deactivate.
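
If you are unsure whether the environment is really active, a quick check from Python can confirm it. The sketch below relies on the standard-library behaviour that sys.prefix differs from sys.base_prefix inside a venv; the helper name in_virtualenv is illustrative and not part of TinderJob.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```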

Install Dependencies

With the virtual environment active, install all required packages from requirements.txt:

pip install -r requirements.txt

This installs the full dependency stack. Key packages and their roles in TinderJob:

Package	Version	Purpose
`streamlit`	latest	Main dashboard UI framework
`plotly`	5.22.0	Interactive charts, histograms, scatter plots, and KDE curves
`pdfplumber`	0.11.9	Extracts structured text from candidate CVs (PDF format) for TinderMatch
`pandas`	≥ 2.3.3	Core data manipulation throughout the pipeline
`beautifulsoup4`	4.12.3	HTML parsing for the Tecnoempleo scraper
`requests`	2.32.3	HTTP client with custom User-Agent headers for the scraper
`scipy`	latest	Statistical tests (Shapiro-Wilk normality test) in the EDA notebooks
`numpy`	latest	Numerical operations and array handling
`seaborn`	0.13.2	Statistical visualisation in the analysis notebooks
`matplotlib`	latest	Base plotting library used by seaborn
`statsmodels`	latest	Advanced statistical modelling in the notebooks
`notebook`	7.2.0	Jupyter Notebook interface for running the EDA notebooks
`openpyxl`	3.1.4	Excel export support
`selenium`	4.21.0	Browser automation (available for future dynamic-page scraping)
`webdriver-manager`	4.0.2	Automatic driver management for Selenium

Installation typically takes 1–3 minutes depending on your connection speed.
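
To confirm that the installation succeeded, you can query the installed versions from Python. This is a minimal sketch using the standard-library importlib.metadata module; the helper name installed_versions and the choice of packages to check are assumptions for illustration.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # A subset of the packages listed in requirements.txt.
    wanted = ["streamlit", "plotly", "pdfplumber", "pandas", "beautifulsoup4", "requests"]
    for name, version in installed_versions(wanted).items():
        print(f"{name:16} {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED suggests that pip ran outside the virtual environment or that the install step was interrupted.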

Run the Scraper

Execute the Tecnoempleo scraper to collect live job listings from Spain’s leading tech job portal:

python src/scraper/extract_tecnoempleo.py

The scraper iterates through 24 predefined tech search profiles (including data-scientist, data-analyst, data-engineer, machine-learning, devops, ciberseguridad, full-stack, cloud, big-data, and more), fetching up to 3 pages per profile from tecnoempleo.com. For each job listing card it extracts:

Title and company name from the card header
Location and contract type by following each offer’s detail page
Salary (when explicitly published, in € notation)
Technical skills from badge tags on the listing card
Direct URL to the original offer on Tecnoempleo (used by TinderMatch for one-click access)

Expected output:

data/
└── raw/
    ├── tecnoempleo_jobs.csv     ← main output (all deduplicated offers)
    └── debug_page.html          ← HTML snapshot of the first page (for debugging)

A deduplicated CSV is saved to data/raw/tecnoempleo_jobs.csv with columns: titulo, empresa, ubicacion, salario, tipo_contrato, skills, busqueda, url. The project run documented in the original README yielded 1,148 unique offers across the 24 profiles.
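
A quick way to sanity-check the scraper output is to compare the CSV header against the column list above. The following is a sketch only: the helper name missing_columns is hypothetical, and the utf-8-sig encoding is an assumption about how the file was written.

```python
import csv
from pathlib import Path
from typing import List

# Column names as listed in this guide.
EXPECTED_COLUMNS = ["titulo", "empresa", "ubicacion", "salario",
                    "tipo_contrato", "skills", "busqueda", "url"]


def missing_columns(csv_path: Path, expected: List[str] = EXPECTED_COLUMNS) -> List[str]:
    """Return the expected column names that are absent from the CSV header."""
    # utf-8-sig (an assumption) reads plain UTF-8 and also tolerates a BOM.
    with csv_path.open(newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in expected if name not in header]


if __name__ == "__main__":
    path = Path("data/raw/tecnoempleo_jobs.csv")
    if path.exists():
        print("missing columns:", missing_columns(path) or "none")
    else:
        print(f"{path} not found; run the scraper first")
```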

The scraper applies ethical rate-limiting — a 1-second delay between listing pages and a 0.4-second delay between detail page requests — using Python’s time.sleep() to avoid overwhelming Tecnoempleo’s servers and getting the IP blocked. Do not increase max_paginas beyond 3–5 on your first test run. If you need to re-run the scraper frequently during development, consider caching the raw HTML output in data/raw/debug_page.html and working from that file instead.
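
The rate-limiting pattern described above can be sketched as follows. This is not the project's scraper: it uses the standard-library urllib instead of requests so that it runs without extra dependencies, and the names polite_crawl, fetch_html, and the User-Agent string are hypothetical.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable

# Hypothetical User-Agent value; the real scraper's header is not shown in this guide.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; TinderJob-example)"}


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download one page with a custom User-Agent header."""
    request = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def polite_crawl(
    urls: Iterable[str],
    delay: float,
    fetch: Callable[[str], str] = fetch_html,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in order, pausing `delay` seconds between requests."""
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # pause before every request except the first
        pages[url] = fetch(url)
    return pages
```

Under these assumptions, listing pages would be crawled with delay=1.0 and detail pages with delay=0.4, matching the delays stated above. Passing fetch and sleep as parameters keeps the loop testable without touching the network.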

Run the Cleaning Pipeline

Once the raw CSV exists, run the data cleaning and normalisation script:

python src/data_processing/clean_tecnoempleo_data.py

The pipeline executes the following transformations in sequence:

Lowercasing — standardises case for titulo, empresa, ubicacion, tipo_contrato, skills, and busqueda via normalizar_texto(). The url column is intentionally not lowercased, as URLs are case-sensitive.
Deduplication — drops duplicate rows based on the exact combination of titulo, empresa, ubicacion, salario, and tipo_contrato, keeping the first occurrence.
Skills cleaning — splits comma-separated skill strings, strips whitespace from each token, lowercases, and removes exact duplicates within a single offer’s skill list.
Modalidad extraction — derives a new modalidad column (En Remoto, Híbrido, Presencial, No especificado) by parsing keywords in the ubicacion field.
Ciudad extraction — derives a clean ciudad column by stripping modality qualifiers (e.g. (híbrido), - España, 100% remoto) from the location string.
Salary parsing — parses raw salary strings (which may be annual ranges, monthly figures, or band expressions) into three normalised numeric columns:
- salario_min: lower bound of the salary range
- salario_max: upper bound of the salary range
- salario_medio: arithmetic mean of min and max
Monthly figures (mes, b/m) are automatically annualised by multiplying by 12.
Outlier detection via IQR — computes Q₁ (25th percentile) and Q₃ (75th percentile) on salario_medio, derives the Interquartile Range (IQR = Q₃ − Q₁), and flags records as es_outlier = True if they fall below Q₁ − 1.5×IQR or above Q₃ + 3×IQR. Records with salario_min < 10,000 (non-quantifiable salary strings) are removed.
URL preservation — the script checks whether the url column exists in the raw CSV and logs a warning if it is missing (which would mean the scraper needs to be re-run). When present, the column is carried through unchanged to the processed output so TinderMatch can display direct offer links.
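
Two of the text-normalisation steps listed above, skills cleaning and modalidad extraction, can be sketched in plain Python as follows. The function names and the keyword lists are assumptions for illustration; the actual script may use different rules, in particular for deciding when a location counts as Presencial.

```python
from typing import List


def clean_skills(raw: str) -> List[str]:
    """Split on commas, strip, lowercase, and drop duplicates (order preserved)."""
    seen = set()
    cleaned = []
    for token in raw.split(","):
        skill = token.strip().lower()
        if skill and skill not in seen:
            seen.add(skill)
            cleaned.append(skill)
    return cleaned


def extract_modalidad(ubicacion: str) -> str:
    """Classify a location string into one of the four modalidad labels."""
    text = ubicacion.lower()
    # Hypothetical keywords. "híbrido" is checked before "remoto" so that a
    # string mentioning both is treated as hybrid.
    if "híbrido" in text or "hibrido" in text:
        return "Híbrido"
    if "remoto" in text or "teletrabajo" in text:
        return "En Remoto"
    if text.strip():
        # Assumption: any other non-empty location is treated as on-site.
        return "Presencial"
    return "No especificado"
```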

Expected output:

data/
└── processed/
    └── clean_tecnoempleo_jobs.csv   ← cleaned, normalised, salary-parsed dataset

The script prints a summary to stdout showing rows before/after deduplication, salary IQR limits, and the count of valid URLs preserved.
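
To make the salary parsing and the IQR limits reported in that summary more concrete, here is a minimal standard-library sketch. It assumes Spanish thousands separators (for example 30.000€) and linear-interpolation quartiles; the function names are hypothetical, and the real script may parse more formats. The default multipliers mirror the thresholds stated in the outlier step (1.5×IQR below Q₁, 3×IQR above Q₃).

```python
import re
from typing import List, Optional, Tuple


def parse_salary(raw: str) -> Optional[Tuple[float, float, float]]:
    """Return (salario_min, salario_max, salario_medio) in annual euros, or None."""
    text = raw.lower()
    # Assumed number format: digits with optional ".000" thousands groups.
    numbers = [float(n.replace(".", "")) for n in re.findall(r"\d+(?:\.\d{3})*", text)]
    if not numbers:
        return None  # non-quantifiable salary string
    low, high = min(numbers), max(numbers)
    if "mes" in text or "b/m" in text:  # monthly figure: annualise
        low, high = low * 12, high * 12
    return low, high, (low + high) / 2


def percentile(sorted_values: List[float], q: float) -> float:
    """Linear-interpolation percentile (q in [0, 1]) of a pre-sorted list."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def iqr_limits(values: List[float], lower_k: float = 1.5, upper_k: float = 3.0) -> Tuple[float, float]:
    """Return (lower_limit, upper_limit) derived from Q1, Q3, and the IQR."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    q1, q3 = percentile(ordered, 0.25), percentile(ordered, 0.75)
    iqr = q3 - q1
    return q1 - lower_k * iqr, q3 + upper_k * iqr


def flag_outliers(values: List[float]) -> List[bool]:
    """Return one es_outlier-style flag per value."""
    low, high = iqr_limits(values)
    return [value < low or value > high for value in values]
```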

The DS Salaries dataset (data/raw/ds_salaries.csv) is not generated by the scraper; it must be downloaded separately from its public source and placed at exactly that path before you launch the dashboard. Without it, the 💵 Análisis Salarial (Salary Analysis) tab in Streamlit will fail to load. See the Introduction page for details on the dataset’s provenance.
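
A small pre-launch check can catch a missing input file before Streamlit does. This is a sketch under the paths named in this guide; the helper name missing_files is hypothetical and the check is not part of the repository.

```python
from pathlib import Path
from typing import List

# Input files the dashboard is described as needing.
REQUIRED_FILES = [
    "data/processed/clean_tecnoempleo_jobs.csv",
    "data/raw/ds_salaries.csv",
]


def missing_files(project_root: Path, required: List[str] = REQUIRED_FILES) -> List[str]:
    """Return the required relative paths that do not exist under project_root."""
    return [rel for rel in required if not (project_root / rel).is_file()]


if __name__ == "__main__":
    # Run from the TinderJob/ project directory.
    absent = missing_files(Path.cwd())
    if absent:
        print("Missing before launch:", ", ".join(absent))
    else:
        print("All dashboard inputs are in place.")
```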

Launch the Dashboard

With the processed data in place, start the Streamlit application:

streamlit run app/streamlit_app.py

Streamlit starts a local server and opens the app automatically in your default browser at:

http://localhost:8501

The dashboard is divided into two main blocks:

📊 Statistical Analysis Modules, with four analytical tabs:

Tab	Content
📍 Mercado España	Demand radiograph for IT profiles, Top 20 skills ranking, and distribution by work modality
💵 Análisis Salarial	Salary band exploration with histograms, KDE curves, salary-by-experience charts, boxplots, pivot tables, and scatter plots
🎲 Probabilidad Condicional	Conditional probability heatmaps: P(High Salary | Level), P(Remote | Company Size), P(Flexible | City)
⚖️ Sesgos	Interactive visualisation of MNAR phenomena, selection bias, and strategic recommendations for ethical hiring algorithms
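
The conditional probabilities shown in the 🎲 Probabilidad Condicional tab are estimates of the form P(A | B) = count(A and B) / count(B). The sketch below shows that calculation on plain Python records; the function name, the field names, and the 50,000 threshold for "high salary" are all assumptions, and the toy rows are made up rather than taken from the dataset.

```python
from typing import Callable, Dict, Iterable, Optional

Record = Dict[str, object]


def conditional_probability(
    records: Iterable[Record],
    event: Callable[[Record], bool],
    given: Callable[[Record], bool],
) -> Optional[float]:
    """Estimate P(event | given) as count(event and given) / count(given)."""
    given_rows = [row for row in records if given(row)]
    if not given_rows:
        return None  # the condition never occurs, so the estimate is undefined
    return sum(1 for row in given_rows if event(row)) / len(given_rows)


if __name__ == "__main__":
    # Made-up toy rows for illustration only.
    toy = [
        {"level": "senior", "salary": 60000},
        {"level": "senior", "salary": 40000},
        {"level": "junior", "salary": 25000},
    ]
    p = conditional_probability(
        toy,
        event=lambda row: row["salary"] >= 50000,   # hypothetical "high salary" cut-off
        given=lambda row: row["level"] == "senior",
    )
    print("P(High Salary | senior) =", p)
```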

🔥 TinderMatch — Find your ideal offer: Upload a PDF or plain-text CV, and the engine extracts your tech skills from a dictionary of 80+ recognised technologies, compares them against all Tecnoempleo offers in the processed dataset, and returns a ranked list of matches showing: skills you already have that match, missing skills (reskilling opportunities), direct links to each original offer, and advanced filters by city, modality, and minimum match percentage. Results can be exported to CSV.
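
The matching idea behind TinderMatch can be illustrated with a small sketch. It is not the engine's actual code: the six-entry dictionary stands in for the 80+ technologies mentioned above, the score (share of an offer's skills found in the CV) is an assumed formula, and the names extract_cv_skills and rank_offers are hypothetical. PDF text extraction is left out; the input is assumed to be plain text already.

```python
import re
from typing import Dict, List, Set

# Tiny illustrative dictionary, not the real 80+ entry list.
KNOWN_SKILLS = {"python", "sql", "docker", "aws", "pandas", "spark"}


def extract_cv_skills(cv_text: str, known: Set[str] = KNOWN_SKILLS) -> Set[str]:
    """Return the known skills that appear as whole tokens in the CV text."""
    tokens = set(re.findall(r"[a-z0-9+#]+", cv_text.lower()))
    return {skill for skill in known if skill in tokens}


def rank_offers(cv_skills: Set[str], offers: List[Dict], min_match: float = 0.0) -> List[Dict]:
    """Score each offer by the percentage of its skills present in the CV."""
    results = []
    for offer in offers:
        required = set(offer["skills"])
        if not required:
            continue  # offers without skills cannot be scored
        matched = sorted(required & cv_skills)
        missing = sorted(required - cv_skills)  # reskilling opportunities
        score = round(100 * len(matched) / len(required), 1)
        if score >= min_match:
            results.append({
                "titulo": offer["titulo"],
                "url": offer["url"],
                "match_pct": score,
                "matched": matched,
                "missing": missing,
            })
    return sorted(results, key=lambda item: item["match_pct"], reverse=True)
```

Single-token matching keeps the sketch short; multi-word skills would need phrase matching against the full text instead.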

Expected Output Structure

After completing all six steps, your data/ directory should look like this:

data/
├── metadata/
│   └── (schema and field dictionaries)
├── processed/
│   └── clean_tecnoempleo_jobs.csv    ← normalised, salary-parsed, outlier-flagged
└── raw/
    ├── debug_page.html               ← HTML snapshot of first scraper page
    ├── ds_salaries.csv               ← manually downloaded DS Salaries dataset
    └── tecnoempleo_jobs.csv          ← raw scraper output (1,148 deduplicated offers)

Full Requirements Reference

TinderJob requires Python 3.8 or higher. The complete requirements.txt as shipped in the repository:

requests==2.32.3
beautifulsoup4==4.12.3
selenium==4.21.0
webdriver-manager==4.0.2
pandas>=2.3.3
numpy
scipy
statsmodels
streamlit
matplotlib
seaborn==0.13.2
plotly==5.22.0
notebook==7.2.0
openpyxl==3.1.4
pdfplumber==0.11.9

If you are on an Apple Silicon Mac (M1/M2/M3), some packages — particularly scipy and statsmodels — may require Rosetta or a native ARM build. Using a Conda environment with conda-forge channels is recommended in that case over plain pip.
