This page walks you through a complete local setup of TinderJob from scratch. By the end you will have cloned the repository, installed all Python dependencies, executed the Tecnoempleo scraper to collect live job listings, run the cleaning pipeline to produce a normalised dataset, and launched the Streamlit dashboard — including the TinderMatch CV engine — in your browser. The full process takes under 10 minutes on a standard internet connection.

Prerequisites

Before you begin, make sure you have the following installed:
  • Python 3.8 or higher — verify with python --version
  • pip — comes bundled with Python 3.8+
  • Git — for cloning the repository
  • An active internet connection for the scraper step

Setup Steps

1. Clone the Repository

Clone the TinderJob repository from GitHub and navigate into the project directory:
git clone https://github.com/HelenDiMo/TinderJob.git
cd TinderJob
This creates a TinderJob/ folder containing the full project structure: app/, data/, notebooks/, src/, and requirements.txt.
2. Create and Activate a Virtual Environment

Create an isolated Python virtual environment named .venv to avoid dependency conflicts with other projects on your system:
python -m venv .venv
Then activate it according to your operating system. On Windows:
.\.venv\Scripts\activate
On macOS or Linux:
source .venv/bin/activate
You will see (.venv) prepended to your terminal prompt when the environment is active.
To deactivate the environment at any time, simply run deactivate.
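If the prompt prefix is hidden by your shell theme, the interpreter itself can confirm whether the environment is active. The following is a minimal sketch; the helper name in_virtualenv is hypothetical and not part of the TinderJob codebase.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

Run it with the same python command you plan to use for the later steps; if it reports False, activate .venv again before installing anything.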
3. Install Dependencies

With the virtual environment active, install all required packages from requirements.txt:
pip install -r requirements.txt
This installs the full dependency stack. Key packages and their roles in TinderJob:
| Package | Version | Purpose |
| --- | --- | --- |
| streamlit | latest | Main dashboard UI framework |
| plotly | 5.22.0 | Interactive charts, histograms, scatter plots, and KDE curves |
| pdfplumber | 0.11.9 | Extracts structured text from candidate CVs (PDF format) for TinderMatch |
| pandas | ≥ 2.3.3 | Core data manipulation throughout the pipeline |
| beautifulsoup4 | 4.12.3 | HTML parsing for the Tecnoempleo scraper |
| requests | 2.32.3 | HTTP client with custom User-Agent headers for the scraper |
| scipy | latest | Statistical tests (Shapiro-Wilk normality test) in the EDA notebooks |
| numpy | latest | Numerical operations and array handling |
| seaborn | 0.13.2 | Statistical visualisation in the analysis notebooks |
| matplotlib | latest | Base plotting library used by seaborn |
| statsmodels | latest | Advanced statistical modelling in the notebooks |
| notebook | 7.2.0 | Jupyter Notebook interface for running the EDA notebooks |
| openpyxl | 3.1.4 | Excel export support |
| selenium | 4.21.0 | Browser automation (available for future dynamic-page scraping) |
| webdriver-manager | 4.0.2 | Automatic driver management for Selenium |
Installation typically takes 1–3 minutes depending on your connection speed.
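To confirm that the pinned versions listed above were actually installed, you can query package metadata from inside the environment. This is a sketch only: the PINNED mapping and the check_pins helper are illustrative names, and unpinned packages are left out because any version satisfies them.

```python
from importlib.metadata import PackageNotFoundError, version

# Exact pins copied from the package list above (illustrative subset).
PINNED = {
    "plotly": "5.22.0",
    "pdfplumber": "0.11.9",
    "beautifulsoup4": "4.12.3",
    "requests": "2.32.3",
    "seaborn": "0.13.2",
    "notebook": "7.2.0",
    "openpyxl": "3.1.4",
    "selenium": "4.21.0",
    "webdriver-manager": "4.0.2",
}


def check_pins(pins):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (found, found == wanted)
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_pins(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:20} installed={found} {status}")
```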
4. Run the Scraper

Execute the Tecnoempleo scraper to collect live job listings from Spain’s leading tech job portal:
python src/scraper/extract_tecnoempleo.py
The scraper iterates through 24 predefined tech search profiles (including data-scientist, data-analyst, data-engineer, machine-learning, devops, ciberseguridad, full-stack, cloud, big-data, and more), fetching up to 3 pages per profile from tecnoempleo.com.
For each job listing card it extracts:
  • Title and company name from the card header
  • Location and contract type by following each offer’s detail page
  • Salary (when explicitly published, in notation)
  • Technical skills from badge tags on the listing card
  • Direct URL to the original offer on Tecnoempleo (used by TinderMatch for one-click access)
Expected output:
data/
└── raw/
    ├── tecnoempleo_jobs.csv     ← main output (all deduplicated offers)
    └── debug_page.html          ← HTML snapshot of the first page (for debugging)
A deduplicated CSV is saved to data/raw/tecnoempleo_jobs.csv with columns: titulo, empresa, ubicacion, salario, tipo_contrato, skills, busqueda, url.
The project run documented in the original README yielded 1,148 unique offers across the 24 profiles.
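Before moving on, it can help to check that the raw file really has the eight columns named above. The sketch below uses only the standard library; summarise_raw_csv is a hypothetical helper, and the sample rows are made up so the example runs without the real file.

```python
import csv
import io

# Column names as listed for data/raw/tecnoempleo_jobs.csv.
EXPECTED_COLUMNS = [
    "titulo", "empresa", "ubicacion", "salario",
    "tipo_contrato", "skills", "busqueda", "url",
]


def summarise_raw_csv(handle):
    """Return (missing_columns, row_count, unique_url_count) for an open CSV."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in EXPECTED_COLUMNS if col not in header]
    rows = list(reader)
    unique_urls = {row["url"] for row in rows if row.get("url")}
    return missing, len(rows), len(unique_urls)


if __name__ == "__main__":
    # Tiny in-memory sample with invented rows, standing in for the real file.
    sample = io.StringIO(
        "titulo,empresa,ubicacion,salario,tipo_contrato,skills,busqueda,url\n"
        "Data Analyst,Acme,Madrid,,Indefinido,\"sql, python\",data-analyst,https://example.com/1\n"
        "Data Engineer,Acme,Madrid,,Indefinido,spark,data-engineer,https://example.com/2\n"
    )
    print(summarise_raw_csv(sample))
```

To run it against the scraper output, pass an open file handle instead, for example open("data/raw/tecnoempleo_jobs.csv", newline="", encoding="utf-8"); the UTF-8 encoding is an assumption about how the scraper writes the file.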
The scraper applies ethical rate-limiting — a 1-second delay between listing pages and a 0.4-second delay between detail page requests — using Python’s time.sleep() to avoid overwhelming Tecnoempleo’s servers and getting the IP blocked. Do not increase max_paginas beyond 3–5 on your first test run. If you need to re-run the scraper frequently during development, consider caching the raw HTML output in data/raw/debug_page.html and working from that file instead.
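The rate-limiting idea can be sketched independently of the real scraper. In the example below, fetch_politely and its injectable fetch and sleep parameters are hypothetical names chosen for illustration; the two delay constants are the values quoted above, and whether the actual script pauses before or after each request is not stated here.

```python
import time
from typing import Callable, Iterable, List

LISTING_DELAY_S = 1.0  # pause between listing pages, as quoted above
DETAIL_DELAY_S = 0.4   # pause between detail-page requests, as quoted above


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Fetch each URL in order, sleeping delay_s between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_s)  # no pause is needed before the very first request
        pages.append(fetch(url))
    return pages


if __name__ == "__main__":
    # Stand-in fetcher so the sketch runs offline; a real one would issue HTTP requests.
    def fake_fetch(url: str) -> str:
        return f"<html>{url}</html>"

    print(fetch_politely(["page-1", "page-2"], fake_fetch, delay_s=0.01))
```

Passing the fetch and sleep functions as arguments keeps the loop testable without touching the network, which also fits the advice to work from cached HTML during development.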
5. Run the Cleaning Pipeline

Once the raw CSV exists, run the data cleaning and normalisation script:
python src/data_processing/clean_tecnoempleo_data.py
The pipeline executes the following transformations in sequence:
  1. Lowercasing — standardises case for titulo, empresa, ubicacion, tipo_contrato, skills, and busqueda via normalizar_texto(). The url column is intentionally not lowercased, as URLs are case-sensitive.
  2. Deduplication — drops duplicate rows based on the exact combination of titulo, empresa, ubicacion, salario, and tipo_contrato, keeping the first occurrence.
  3. Skills cleaning — splits comma-separated skill strings, strips whitespace from each token, lowercases, and removes exact duplicates within a single offer’s skill list.
  4. Modalidad extraction — derives a new modalidad column (En Remoto, Híbrido, Presencial, No especificado) by parsing keywords in the ubicacion field.
  5. Ciudad extraction — derives a clean ciudad column by stripping modality qualifiers (e.g. (híbrido), - España, 100% remoto) from the location string.
  6. Salary parsing — parses raw salary strings (which may be annual ranges, monthly figures, or band expressions) into three normalised numeric columns:
    • salario_min: lower bound of the salary range
    • salario_max: upper bound of the salary range
    • salario_medio: arithmetic mean of min and max
     Monthly figures (mes, b/m) are automatically annualised by multiplying by 12.
  7. Outlier detection via IQR — computes Q₁ (25th percentile) and Q₃ (75th percentile) on salario_medio, derives the Interquartile Range (IQR = Q₃ − Q₁), and flags records as es_outlier = True if they fall below Q₁ − 1.5×IQR or above Q₃ + 3×IQR. Records with salario_min < 10,000 (non-quantifiable salary strings) are removed.
  8. URL preservation — the script checks whether the url column exists in the raw CSV and logs a warning if it is missing (which would mean the scraper needs to be re-run). When present, the column is carried through unchanged to the processed output so TinderMatch can display direct offer links.
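The salary parsing in step 6 can be illustrated with a small standalone function. This is a sketch under stated assumptions, not the code in clean_tecnoempleo_data.py: the parse_salary name, the regular expression, and the sample strings are all hypothetical, and only whole-euro figures with Spanish thousands separators are handled.

```python
import re
from typing import Optional, Tuple

# Matches whole figures such as "30.000" or "2500" (dot as thousands separator).
_NUMBER = re.compile(r"\d{1,3}(?:\.\d{3})+|\d+")


def parse_salary(raw: str) -> Optional[Tuple[float, float, float]]:
    """Return (salario_min, salario_max, salario_medio) in annual terms, or None."""
    text = raw.lower()
    figures = [float(m.replace(".", "")) for m in _NUMBER.findall(text)]
    if not figures:
        return None  # nothing numeric to parse
    low, high = min(figures), max(figures)
    # Monthly markers named in step 6: "mes" and "b/m".
    if "mes" in text or "b/m" in text:
        low, high = low * 12, high * 12
    return low, high, (low + high) / 2


if __name__ == "__main__":
    # Invented examples of an annual range and a monthly figure.
    print(parse_salary("30.000 - 40.000 b/a"))
    print(parse_salary("2.500 b/m"))
```

A single figure yields identical minimum, maximum, and mean, so salario_medio is always defined whenever at least one number is found.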
Expected output:
data/
└── processed/
    └── clean_tecnoempleo_jobs.csv   ← cleaned, normalised, salary-parsed dataset
The script prints a summary to stdout showing rows before/after deduplication, salary IQR limits, and the count of valid URLs preserved.
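The IQR limits reported in that summary follow the rule from step 7, which can be sketched with the standard library alone. The function names below are hypothetical, and the default multipliers simply mirror the 1.5 (lower) and 3 (upper) values quoted in step 7; pass upper_k=1.5 for the symmetric Tukey fences.

```python
import statistics
from typing import List, Tuple


def iqr_fences(
    values: List[float], lower_k: float = 1.5, upper_k: float = 3.0
) -> Tuple[float, float]:
    """Return (lower_fence, upper_fence) computed from Q1, Q3 and the IQR."""
    # method="inclusive" interpolates linearly between data points, which
    # matches the default quantile behaviour of NumPy and pandas.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - lower_k * iqr, q3 + upper_k * iqr


def flag_outliers(
    values: List[float], lower_k: float = 1.5, upper_k: float = 3.0
) -> List[bool]:
    """Return one boolean per value: True when it falls outside the fences."""
    low, high = iqr_fences(values, lower_k, upper_k)
    return [v < low or v > high for v in values]


if __name__ == "__main__":
    # Invented salario_medio values for illustration only.
    salaries = [20000, 30000, 40000, 50000, 60000, 200000]
    print(iqr_fences(salaries))
    print(flag_outliers(salaries))
```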
The DS Salaries dataset (data/raw/ds_salaries.csv) is not generated by the scraper. It must be downloaded separately from its public source and placed at exactly that path before you launch the dashboard. Without it, the 💵 Análisis Salarial (Salary Analysis) tab in Streamlit will fail to load. See the Introduction page for details on the dataset’s provenance.
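A quick pre-launch check can catch a missing file before Streamlit does. The sketch below assumes it is run from the repository root; missing_files and REQUIRED_FILES are illustrative names, and the two paths are the ones named on this page.

```python
from pathlib import Path
from typing import Iterable, List

# Paths named on this page, relative to the repository root.
REQUIRED_FILES = [
    "data/raw/ds_salaries.csv",
    "data/processed/clean_tecnoempleo_jobs.csv",
]


def missing_files(
    root: Path, relative_paths: Iterable[str] = REQUIRED_FILES
) -> List[str]:
    """Return the required paths that do not exist as files under root."""
    return [rel for rel in relative_paths if not (root / rel).is_file()]


if __name__ == "__main__":
    absent = missing_files(Path.cwd())
    if absent:
        print("Missing before launch:", ", ".join(absent))
    else:
        print("All required data files are present.")
```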
6. Launch the Dashboard

With the processed data in place, start the Streamlit application:
streamlit run app/streamlit_app.py
Streamlit will start a local server and open the app automatically in your default browser at:
http://localhost:8501
The dashboard is divided into two main blocks:
📊 Statistical Analysis Modules, with four analytical tabs:
| Tab | Content |
| --- | --- |
| 📍 Mercado España | Demand radiograph for IT profiles, Top 20 skills ranking, and distribution by work modality |
| 💵 Análisis Salarial | Salary band exploration with histograms, KDE curves, salary-by-experience charts, boxplots, pivot tables, and scatter plots |
| 🎲 Probabilidad Condicional | Conditional probability heatmaps: P(High Salary \| Level), P(Remote \| Company Size), P(Flexible \| City) |
| ⚖️ Sesgos | Interactive visualisation of MNAR phenomena, selection bias, and strategic recommendations for ethical hiring algorithms |
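Each cell of a conditional probability heatmap such as P(High Salary | Level) is a ratio of counts: the records where the event holds, divided by all records with that condition. The sketch below shows that calculation on invented data; conditional_probability is a hypothetical helper and not necessarily how the dashboard computes its values.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def conditional_probability(
    pairs: Iterable[Tuple[Hashable, bool]], given: Hashable
) -> float:
    """Estimate P(event | condition == given) from (condition, event) pairs."""
    totals, hits = Counter(), Counter()
    for condition, event in pairs:
        totals[condition] += 1
        if event:
            hits[condition] += 1
    if totals[given] == 0:
        raise ValueError(f"no records with condition {given!r}")
    return hits[given] / totals[given]


if __name__ == "__main__":
    # Made-up (experience level, is_high_salary) records for illustration only.
    records = [
        ("senior", True), ("senior", True), ("senior", False),
        ("junior", False), ("junior", True),
    ]
    print(conditional_probability(records, "senior"))
```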
🔥 TinderMatch (Find your ideal offer): upload a PDF or plain-text CV. The engine extracts your tech skills using a dictionary of 80+ recognised technologies, compares them against all Tecnoempleo offers in the processed dataset, and returns a ranked list of matches. Each match shows the skills you already have, the missing skills (reskilling opportunities), and a direct link to the original offer. Advanced filters cover city, modality, and minimum match percentage, and results can be exported to CSV.
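The matching idea can be sketched in a few lines. Everything here is an assumption made for illustration: the six-entry KNOWN_SKILLS set stands in for the real dictionary, the function names are hypothetical, and the score (share of an offer's skills found in the CV) is one plausible definition of match percentage rather than the engine's confirmed formula.

```python
import re
from typing import Dict, List, Set

# Tiny illustrative vocabulary; the real dictionary is described as having 80+ entries.
KNOWN_SKILLS = {"python", "sql", "docker", "aws", "spark", "pandas"}


def extract_skills(cv_text: str, vocabulary: Set[str] = KNOWN_SKILLS) -> Set[str]:
    """Return the vocabulary entries that appear as whole tokens in the CV text."""
    raw_tokens = re.findall(r"[a-z0-9+#.]+", cv_text.lower())
    tokens = {token.strip(".") for token in raw_tokens}  # drop sentence-final dots
    return vocabulary & tokens


def rank_offers(cv_skills: Set[str], offers: List[Dict]) -> List[Dict]:
    """Score each offer by the share of its skills found in the CV, best first."""
    ranked = []
    for offer in offers:
        required = {s.strip().lower() for s in offer["skills"].split(",") if s.strip()}
        if not required:
            continue  # offers without listed skills cannot be scored
        matched = required & cv_skills
        ranked.append({
            "titulo": offer["titulo"],
            "url": offer["url"],
            "match_pct": round(100 * len(matched) / len(required), 1),
            "matched": sorted(matched),
            "missing": sorted(required - cv_skills),
        })
    return sorted(ranked, key=lambda item: item["match_pct"], reverse=True)


if __name__ == "__main__":
    # Invented CV text and offers, shaped like rows of the processed dataset.
    skills = extract_skills("Experienced with Python, SQL and Docker.")
    offers = [
        {"titulo": "data engineer", "url": "https://example.com/1", "skills": "python, spark"},
        {"titulo": "backend developer", "url": "https://example.com/2", "skills": "sql, docker"},
    ]
    for match in rank_offers(skills, offers):
        print(match)
```

The missing list is what the dashboard presents as reskilling opportunities: skills an offer asks for that the CV does not mention.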

Expected Output Structure

After completing all six steps, your data/ directory should look like this:
data/
├── metadata/
│   └── (schema and field dictionaries)
├── processed/
│   └── clean_tecnoempleo_jobs.csv    ← normalised, salary-parsed, outlier-flagged
└── raw/
    ├── debug_page.html               ← HTML snapshot of first scraper page
    ├── ds_salaries.csv               ← manually downloaded DS Salaries dataset
    └── tecnoempleo_jobs.csv          ← raw scraper output (1,148 deduplicated offers)

Full Requirements Reference

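TinderJob is documented as requiring Python 3.8 or higher; note that the pandas>=2.3.3 pin below can only be satisfied on Python 3.9 or newer, because pandas dropped Python 3.8 support in version 2.1. The complete requirements.txt as shipped in the repository: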
requests==2.32.3
beautifulsoup4==4.12.3
selenium==4.21.0
webdriver-manager==4.0.2
pandas>=2.3.3
numpy
scipy
statsmodels
streamlit
matplotlib
seaborn==0.13.2
plotly==5.22.0
notebook==7.2.0
openpyxl==3.1.4
pdfplumber==0.11.9
If you are on an Apple Silicon Mac (M1/M2/M3), some packages — particularly scipy and statsmodels — may require Rosetta or a native ARM build. Using a Conda environment with conda-forge channels is recommended in that case over plain pip.
