This page walks you through a complete local setup of TinderJob from scratch. By the end you will have cloned the repository, installed all Python dependencies, executed the Tecnoempleo scraper to collect live job listings, run the cleaning pipeline to produce a normalised dataset, and launched the Streamlit dashboard, including the TinderMatch CV engine, in your browser. The full process takes under 10 minutes on a standard internet connection.
Prerequisites
Before you begin, make sure you have the following installed:
- Python 3.8 or higher (verify with `python --version`)
- pip, which comes bundled with Python 3.8+
- Git, for cloning the repository
- An active internet connection for the scraper step
Setup Steps
Clone the Repository
Clone the TinderJob repository from GitHub and navigate into the project directory. This creates a `TinderJob/` folder containing the full project structure: `app/`, `data/`, `notebooks/`, `src/`, and `requirements.txt`.
Create and Activate a Virtual Environment
Create an isolated Python virtual environment named `.venv` to avoid dependency conflicts with other projects on your system (the standard command is `python -m venv .venv`). Then activate it according to your operating system:
- Windows: `.venv\Scripts\activate`
- Linux / macOS: `source .venv/bin/activate`
You will see `(.venv)` prepended to your terminal prompt when the environment is active. To deactivate the environment at any time, run `deactivate`.
Install Dependencies
With the virtual environment active, install all required packages from `requirements.txt`, for example with `pip install -r requirements.txt`. This installs the full dependency stack; installation typically takes 1–3 minutes depending on your connection speed. Key packages and their roles in TinderJob:
| Package | Version | Purpose |
|---|---|---|
| streamlit | latest | Main dashboard UI framework |
| plotly | 5.22.0 | Interactive charts, histograms, scatter plots, and KDE curves |
| pdfplumber | 0.11.9 | Extracts structured text from candidate CVs (PDF format) for TinderMatch |
| pandas | ≥ 2.3.3 | Core data manipulation throughout the pipeline |
| beautifulsoup4 | 4.12.3 | HTML parsing for the Tecnoempleo scraper |
| requests | 2.32.3 | HTTP client with custom User-Agent headers for the scraper |
| scipy | latest | Statistical tests (Shapiro-Wilk normality test) in the EDA notebooks |
| numpy | latest | Numerical operations and array handling |
| seaborn | 0.13.2 | Statistical visualisation in the analysis notebooks |
| matplotlib | latest | Base plotting library used by seaborn |
| statsmodels | latest | Advanced statistical modelling in the notebooks |
| notebook | 7.2.0 | Jupyter Notebook interface for running the EDA notebooks |
| openpyxl | 3.1.4 | Excel export support |
| selenium | 4.21.0 | Browser automation (available for future dynamic-page scraping) |
| webdriver-manager | 4.0.2 | Automatic driver management for Selenium |
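Once installation finishes, the exact pins from the table can be spot-checked from Python. The sketch below relies only on the standard-library `importlib.metadata`; the `PINNED` dictionary copies the exact versions listed above, and `check_environment` is an illustrative helper, not part of TinderJob.

```python
import sys
from importlib import metadata

# Exact pins copied from the table above; packages listed as "latest"
# (and the pandas lower bound) are deliberately left out of this check.
PINNED = {
    "plotly": "5.22.0",
    "pdfplumber": "0.11.9",
    "beautifulsoup4": "4.12.3",
    "requests": "2.32.3",
    "seaborn": "0.13.2",
    "notebook": "7.2.0",
    "openpyxl": "3.1.4",
    "selenium": "4.21.0",
    "webdriver-manager": "4.0.2",
}


def check_environment(pins=PINNED, get_version=metadata.version):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python 3.8 or higher is required")
    for package, expected in pins.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package}: not installed")
            continue
        if installed != expected:
            problems.append(f"{package}: found {installed}, expected {expected}")
    return problems
```

Calling `check_environment()` inside the activated `.venv` should return an empty list when every pinned package matches. The `get_version` parameter exists only so the lookup can be replaced in tests.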
Run the Scraper
Execute the Tecnoempleo scraper to collect live job listings from Spain’s leading tech job portal. The scraper iterates through 24 predefined tech search profiles (including `data-scientist`, `data-analyst`, `data-engineer`, `machine-learning`, `devops`, `ciberseguridad`, `full-stack`, `cloud`, `big-data`, and more), fetching up to 3 pages per profile from tecnoempleo.com. For each job listing card it extracts:
- Title and company name from the card header
- Location and contract type by following each offer’s detail page
- Salary (when explicitly published, in `€` notation)
- Technical skills from badge tags on the listing card
- Direct URL to the original offer on Tecnoempleo (used by TinderMatch for one-click access)
A deduplicated CSV is saved to `data/raw/tecnoempleo_jobs.csv` with columns: `titulo`, `empresa`, `ubicacion`, `salario`, `tipo_contrato`, `skills`, `busqueda`, `url`. The project run documented in the README yielded 1,148 unique offers across the 24 profiles.
Run the Cleaning Pipeline
Once the raw CSV exists, run the data cleaning and normalisation script. The pipeline executes the following transformations in sequence:
- Lowercasing: standardises case for `titulo`, `empresa`, `ubicacion`, `tipo_contrato`, `skills`, and `busqueda` via `normalizar_texto()`. The `url` column is intentionally not lowercased, as URLs are case-sensitive.
- Deduplication: drops duplicate rows based on the exact combination of `titulo`, `empresa`, `ubicacion`, `salario`, and `tipo_contrato`, keeping the first occurrence.
- Skills cleaning: splits comma-separated skill strings, strips whitespace from each token, lowercases, and removes exact duplicates within a single offer’s skill list.
- Modalidad extraction: derives a new `modalidad` column (`En Remoto`, `Híbrido`, `Presencial`, `No especificado`) by parsing keywords in the `ubicacion` field.
- Ciudad extraction: derives a clean `ciudad` column by stripping modality qualifiers (e.g. `(híbrido)`, `- España`, `100% remoto`) from the location string.
- Salary parsing: parses raw salary strings (which may be annual ranges, monthly figures, or band expressions) into three normalised numeric columns:
  - `salario_min`: lower bound of the salary range
  - `salario_max`: upper bound of the salary range
  - `salario_medio`: arithmetic mean of min and max
  Monthly figures (`mes`, `b/m`) are automatically annualised by multiplying by 12.
- Outlier detection via IQR: computes Q₁ (25th percentile) and Q₃ (75th percentile) on `salario_medio`, derives the Interquartile Range (IQR = Q₃ − Q₁), and flags records as `es_outlier = True` if they fall below Q₁ − 1.5×IQR or above Q₃ + 3×IQR. Records with `salario_min < 10,000` (non-quantifiable salary strings) are removed.
- URL preservation: the script checks whether the `url` column exists in the raw CSV and logs a warning if it is missing (which would mean the scraper needs to be re-run). When present, the column is carried through unchanged to the processed output so TinderMatch can display direct offer links.
The script prints a summary to stdout showing rows before/after deduplication, salary IQR limits, and the count of valid URLs preserved.
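Several of the transformations above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: the function names are hypothetical, the modality keywords and the raw salary format (dots as thousands separators, e.g. `30.000€ - 40.000€ b/a`) are guesses, and quartiles use the `inclusive` method of `statistics.quantiles`, which may differ from the convention used by the real script. The fence multipliers (1.5 below, 3 above) are the ones stated in the list.

```python
import re
import statistics

DEDUP_KEYS = ("titulo", "empresa", "ubicacion", "salario", "tipo_contrato")


def drop_duplicates(rows):
    """Keep the first row (a dict) for each exact combination of DEDUP_KEYS."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row.get(k) for k in DEDUP_KEYS)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def clean_skills(raw):
    """Split on commas, strip and lowercase each token, drop repeats (order kept)."""
    cleaned = []
    for token in raw.split(","):
        skill = token.strip().lower()
        if skill and skill not in cleaned:
            cleaned.append(skill)
    return cleaned


def extract_modalidad(ubicacion):
    """Map a location string to a modalidad label (keyword list is an assumption)."""
    text = ubicacion.lower()
    if "híbrido" in text or "hibrido" in text:
        return "Híbrido"
    if "remoto" in text:
        return "En Remoto"
    if "presencial" in text:
        return "Presencial"
    return "No especificado"


def parse_salary(raw):
    """Return (salario_min, salario_max, salario_medio) in annual euros, or None."""
    text = raw.lower()
    # Assumed format: dots as thousands separators, e.g. "30.000€ - 40.000€ b/a".
    numbers = [int(n.replace(".", "")) for n in re.findall(r"\d+(?:\.\d{3})*", text)]
    if not numbers:
        return None
    factor = 12 if ("mes" in text or "b/m" in text) else 1  # annualise monthly figures
    low, high = min(numbers) * factor, max(numbers) * factor
    return low, high, (low + high) / 2


def iqr_limits(values, lower_k=1.5, upper_k=3.0):
    """Return the fences (Q1 - lower_k*IQR, Q3 + upper_k*IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - lower_k * iqr, q3 + upper_k * iqr


def flag_outliers(values):
    """Return one es_outlier boolean per salario_medio value."""
    low, high = iqr_limits(values)
    return [v < low or v > high for v in values]
```

For example, `parse_salary("2.000€ - 2.500€ b/m")` annualises the monthly range to `(24000, 30000, 27000.0)`, and `clean_skills("Python, SQL , python,Docker")` yields `["python", "sql", "docker"]`.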
The DS Salaries dataset (`data/raw/ds_salaries.csv`) is not generated by the scraper. It must be downloaded separately from its public source and placed at exactly that path before you launch the dashboard. Without it, the 💵 Análisis Salarial (Salary Analysis) tab in Streamlit will fail to load. See the Introduction page for details on the dataset’s provenance.
Launch the Dashboard
With the processed data in place, start the Streamlit application. Streamlit will start the app and open it automatically in your default browser (by default at `http://localhost:8501`). The dashboard is divided into two main blocks:
📊 Statistical Analysis Modules: four analytical tabs.
| Tab | Content |
|---|---|
| 📍 Mercado España | Demand radiograph for IT profiles, Top 20 skills ranking, and distribution by work modality |
| 💵 Análisis Salarial | Salary band exploration with histograms, KDE curves, salary-by-experience charts, boxplots, pivot tables, and scatter plots |
| 🎲 Probabilidad Condicional | Conditional probability heatmaps: P(High Salary \| Level), P(Remote \| Company Size), P(Flexible \| City) |
| ⚖️ Sesgos | Interactive visualisation of MNAR phenomena, selection bias, and strategic recommendations for ethical hiring algorithms |
🔥 TinderMatch: Find your ideal offer.
Upload a PDF or plain-text CV. The engine extracts your tech skills using a dictionary of 80+ recognised technologies, compares them against all Tecnoempleo offers in the processed dataset, and returns a ranked list of matches. Each match shows the skills you already have, the missing skills (reskilling opportunities), and a direct link to the original offer. Advanced filters are available by city, modality, and minimum match percentage, and results can be exported to CSV.
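The TinderMatch flow (extract skills from the CV, compare them with each offer, rank the results) can be sketched as follows. Everything here is illustrative: `TECH_DICTIONARY` is a tiny stand-in for the real 80+ entry dictionary, the function names are hypothetical, and scoring a match as the share of an offer's skills found in the CV is an assumption, since this page does not state the actual formula.

```python
import re

# Hypothetical subset standing in for the 80+ technology dictionary.
TECH_DICTIONARY = {"python", "sql", "docker", "aws", "pandas", "kubernetes"}


def extract_cv_skills(cv_text, dictionary=TECH_DICTIONARY):
    """Return the dictionary technologies that appear as whole tokens in the CV text."""
    tokens = set(re.findall(r"[a-z0-9+#]+", cv_text.lower()))
    return {tech for tech in dictionary if tech in tokens}


def score_offer(cv_skills, offer_skills):
    """Return (match percentage, matching skills, missing skills) for one offer."""
    required = set(offer_skills)
    if not required:
        return 0.0, set(), set()
    matching = cv_skills & required
    missing = required - cv_skills
    return 100 * len(matching) / len(required), matching, missing


def rank_offers(cv_skills, offers, min_match=0.0):
    """Score every offer, drop those below min_match, and sort best match first."""
    results = []
    for offer in offers:
        pct, matching, missing = score_offer(cv_skills, offer["skills"])
        if pct >= min_match:
            results.append({
                "titulo": offer["titulo"],
                "url": offer["url"],  # direct link to the original offer
                "match": pct,
                "matching": sorted(matching),
                "missing": sorted(missing),  # reskilling opportunities
            })
    return sorted(results, key=lambda r: r["match"], reverse=True)
```

With a CV that mentions Python, SQL, and Docker, an offer listing `python` and `sql` scores 100%, while one listing `python`, `aws`, `kubernetes`, and `docker` scores 50% and reports `aws` and `kubernetes` as missing.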
Expected Output Structure
After completing all six steps, your `data/` directory should look like this:
Full Requirements Reference
TinderJob requires Python 3.8 or higher. The complete `requirements.txt` as shipped in the repository: