The TinderJob scraper is a purpose-built Python module that extracts live technology job listings from Tecnoempleo, Spain’s leading tech job portal. It iterates over 24 predefined search terms (spanning data science, software engineering, DevOps, cloud, and more) and consolidates the results into a single deduplicated dataset. The final output is written to data/raw/tecnoempleo_jobs.csv and serves as the raw input for the entire TinderJob analytics pipeline.
Architecture
The scraper uses a two-step extraction strategy to capture both listing-level and detail-level data for every job offer.
Step 1: Paginated Listing Scrape (scrape_busqueda)
For each of the 24 search terms, the scraper fetches up to 3 pages of search results from https://www.tecnoempleo.com/ofertas-trabajo/{termino}. Each result page contains a grid of job cards (div.p-3.border.rounded.mb-3.bg-white). The scraper collects all cards per page and passes each to extraer_oferta().
Step 2: Detail Page Scrape (extraer_datos_detalle)
For every individual job card, the scraper follows the offer’s URL to its detail page. It parses that page's li.list-item elements to extract the precise ubicacion (location) and tipo_contrato (contract type), fields that are not reliably structured in the listing cards. A 0.4 s sleep follows each detail page request to avoid overloading the server.
After all 24 searches complete, the main() function deduplicates the aggregated results on titulo + empresa and saves the final CSV.
Search Terms
The scraper covers the 24 search terms defined in BUSQUEDAS, chosen to represent the full breadth of tech roles relevant to the Spanish job market.
Each term is inserted directly into the search URL path (for example /ofertas-trabajo/data-scientist), so no additional URL encoding is required.
Function Reference
scrape_busqueda
Scrapes up to max_paginas result pages for a single search term. Constructs paginated URLs automatically (?pagina=N for pages 2 and beyond). Stops early if a page returns a non-200 status or yields no job cards. Returns a flat list of job offer dicts.
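The pagination and early-stop behaviour described above can be sketched as follows. This is a minimal illustration, not the module's actual code: build_page_url, collect_cards, and the fetch_cards callable are hypothetical names, and the default of 3 pages is assumed from the architecture description. The HTTP request and HTML parsing are left to the injected callable so the control flow can be read on its own.

```python
from typing import Callable, List, Optional

BASE_URL = "https://www.tecnoempleo.com/ofertas-trabajo"


def build_page_url(termino: str, pagina: int) -> str:
    """Return the listing URL for a search term and a 1-based page number."""
    url = f"{BASE_URL}/{termino}"
    # Page 1 is the bare search URL; pages 2 and beyond add ?pagina=N.
    return url if pagina <= 1 else f"{url}?pagina={pagina}"


def collect_cards(
    termino: str,
    fetch_cards: Callable[[str], Optional[List[str]]],
    max_paginas: int = 3,  # assumed default, taken from "up to 3 pages"
) -> List[str]:
    """Gather cards page by page, stopping on a failed or empty page.

    `fetch_cards` is a hypothetical callable: it returns the cards found at
    a URL, or None when the response status is not 200.
    """
    cards: List[str] = []
    for pagina in range(1, max_paginas + 1):
        page_cards = fetch_cards(build_page_url(termino, pagina))
        if not page_cards:  # None (non-200) or [] (no job cards)
            break
        cards.extend(page_cards)
    return cards
```

Calling build_page_url("data-scientist", 2) yields https://www.tecnoempleo.com/ofertas-trabajo/data-scientist?pagina=2, matching the URL pattern described above.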
extraer_oferta
Parses a single job card, then calls extraer_datos_detalle() to enrich the result with location and contract type from the offer’s detail page. Returns a dict with the following keys:
| Key | Description |
|---|---|
| titulo | Job title |
| empresa | Company name |
| ubicacion | Location (from detail page) |
| salario | Salary range string (detected by € symbol) |
| tipo_contrato | Contract type (from detail page) |
| skills | Deduplicated badge skills as a comma-separated string |
| busqueda | The search term that surfaced this offer |
| url | Full absolute URL to the offer on Tecnoempleo |
Offer links in the listing cards are relative paths (for example /oferta/123); the function prepends https://www.tecnoempleo.com automatically.
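The three field-level rules above (absolute URL, deduplicated skills, salary detected by the € symbol) might look roughly like this. The helper names are hypothetical and not taken from the module. Using urljoin instead of plain string concatenation, the ", " separator, and case-sensitive deduplication are all assumptions made for the sketch.

```python
from typing import Iterable, Optional
from urllib.parse import urljoin

BASE = "https://www.tecnoempleo.com"


def absolute_url(href: str) -> str:
    """Turn a relative offer path into a full URL (absolute URLs pass through)."""
    return urljoin(BASE, href)


def join_skills(badges: Iterable[str]) -> str:
    """Deduplicate badge texts, keeping first-seen order, and join with commas."""
    cleaned = (badge.strip() for badge in badges)
    # dict.fromkeys preserves insertion order, so it acts as an ordered set.
    unique = dict.fromkeys(badge for badge in cleaned if badge)
    return ", ".join(unique)


def find_salary(lines: Iterable[str]) -> Optional[str]:
    """Return the first text line containing a euro sign, or None."""
    return next((line.strip() for line in lines if "€" in line), None)
```

For example, absolute_url("/oferta/123") returns https://www.tecnoempleo.com/oferta/123.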
extraer_datos_detalle
Fetches the offer’s detail page and parses its li.list-item elements. Identifies the location and contract type rows by scanning each item’s text for the keywords "ubicación" / "ubicacion" and "tipo contrato" / "tipo de contrato", then extracts the value from the corresponding span.float-end. Returns a dict with keys ubicacion and tipo_contrato (both default to None on request failure).
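The keyword scan over li.list-item elements can be illustrated with a dependency-free sketch. It uses the standard library's html.parser rather than whatever HTML library the module actually uses, omits the HTTP request, and assumes the value span is not itself nested inside other spans. parse_detail and _ListItemParser are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class _ListItemParser(HTMLParser):
    """Collect the full text and the span.float-end text of each li.list-item."""

    def __init__(self) -> None:
        super().__init__()
        self.items: List[Dict[str, str]] = []
        self._in_item = False
        self._in_value = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "list-item" in classes:
            self._in_item = True
            self.items.append({"text": "", "value": ""})
        elif self._in_item and tag == "span" and "float-end" in classes:
            self._in_value = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = self._in_value = False
        elif tag == "span":
            self._in_value = False

    def handle_data(self, data):
        if self._in_item:
            self.items[-1]["text"] += data
            if self._in_value:
                self.items[-1]["value"] += data


def parse_detail(html: str) -> Dict[str, Optional[str]]:
    """Extract ubicacion and tipo_contrato; both stay None when not found."""
    datos: Dict[str, Optional[str]] = {"ubicacion": None, "tipo_contrato": None}
    parser = _ListItemParser()
    parser.feed(html)
    parser.close()
    for item in parser.items:
        text = item["text"].lower()
        value = item["value"].strip() or None
        if "ubicación" in text or "ubicacion" in text:
            datos["ubicacion"] = value
        elif "tipo contrato" in text or "tipo de contrato" in text:
            datos["tipo_contrato"] = value
    return datos
```

Returning the same two-key dict on every path keeps the caller free of special cases: a failed request and a page without the expected rows both surface as None values.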
limpiar_lineas
main
Iterates over BUSQUEDAS, calls scrape_busqueda() for each term, concatenates the results, deduplicates on titulo + empresa, enforces the canonical column order, and saves the output to data/raw/tecnoempleo_jobs.csv (UTF-8 with BOM).
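The final deduplicate-and-save step could be sketched with the standard library alone. The module may well use a dataframe library instead; this version uses the csv module only to stay self-contained. deduplicate and save_csv are hypothetical names, and keeping the first occurrence of each titulo + empresa pair is an assumption. The utf-8-sig codec is what produces the BOM.

```python
import csv
from pathlib import Path
from typing import Dict, Iterable, List

# Canonical column order, as listed in the output schema.
COLUMNS = ["titulo", "empresa", "ubicacion", "salario",
           "tipo_contrato", "skills", "busqueda", "url"]


def deduplicate(ofertas: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first offer seen for each (titulo, empresa) pair."""
    seen = set()
    unique = []
    for oferta in ofertas:
        key = (oferta.get("titulo"), oferta.get("empresa"))
        if key not in seen:
            seen.add(key)
            unique.append(oferta)
    return unique


def save_csv(ofertas: Iterable[Dict[str, str]], path: str) -> None:
    """Write offers in canonical column order as UTF-8 with a BOM."""
    destino = Path(path)
    destino.parent.mkdir(parents=True, exist_ok=True)
    # "utf-8-sig" prepends the byte order mark; newline="" is required by csv.
    with destino.open("w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(ofertas)
```

Because the same offer can surface under several search terms, keeping the first occurrence means the busqueda column records whichever term found it first.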
Running the Scraper
Execute the scraper from the project root.
Output Schema
The scraper writes an 8-column CSV to data/raw/tecnoempleo_jobs.csv:
| Column | Description |
|---|---|
| titulo | Job offer title |
| empresa | Hiring company name |
| ubicacion | Location with work modality, e.g. "Madrid (Híbrido)" |
| salario | Raw salary range text, e.g. "30.000€ - 45.000€" |
| tipo_contrato | Contract type, e.g. "Indefinido" |
| skills | Comma-separated list of required tech skills |
| busqueda | The search term that generated this offer |
| url | Full URL to the offer detail page on Tecnoempleo |
Rate Limiting
The scraper applies two layers of deliberate rate limiting to remain a respectful client:
- Between pages: time.sleep(1), a 1-second pause after each paginated listing page is fetched, giving the server breathing room between bulk result requests.
- Between detail requests: time.sleep(0.4), a 400 ms pause after each individual offer detail page is fetched inside extraer_oferta().
A requests.Session object is reused across all pages within a single search term, which reduces TCP handshake overhead and keeps connection behaviour more browser-like.
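To get a feel for what these pauses cost, the total sleep time of a full run can be estimated with simple arithmetic. The function name is hypothetical, and the figure of 20 offers per page is purely illustrative: the actual number of cards per page is not stated here. Network latency is ignored.

```python
PAGE_PAUSE_MS = 1000    # time.sleep(1) after each listing page
DETAIL_PAUSE_MS = 400   # time.sleep(0.4) after each detail page


def total_sleep_seconds(n_terms: int, pages_per_term: int, offers_per_page: int) -> float:
    """Time spent sleeping when every term yields all of its pages."""
    pages = n_terms * pages_per_term
    offers = pages * offers_per_page
    # Work in integer milliseconds to avoid floating-point rounding.
    return (pages * PAGE_PAUSE_MS + offers * DETAIL_PAUSE_MS) / 1000


# 24 terms x 3 pages, with a hypothetical 20 offers per page:
# 72 page pauses (72 s) plus 1440 detail pauses (576 s).
print(total_sleep_seconds(24, 3, 20))  # 648.0
```

Under that assumption the detail-page pauses dominate, so the 0.4 s value has far more influence on total run time than the 1 s page pause.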