The TinderJob scraper is a purpose-built Python module that extracts live technology job listings from Tecnoempleo, Spain’s leading tech job portal. It iterates over 24 predefined search terms — spanning data science, software engineering, DevOps, cloud, and more — and consolidates everything into a single deduplicated dataset. The final output is written to data/raw/tecnoempleo_jobs.csv and serves as the raw input for the entire TinderJob analytics pipeline.

Architecture

The scraper uses a two-step extraction strategy to capture both listing-level and detail-level data for every job offer:

Step 1: Paginated Listing Scrape (scrape_busqueda)
For each of the 24 search terms, the scraper fetches up to 3 pages of search results from https://www.tecnoempleo.com/ofertas-trabajo/{termino}. Each result page contains a grid of job cards (div.p-3.border.rounded.mb-3.bg-white). The scraper collects all cards per page and passes each to extraer_oferta().

Step 2: Detail Page Scrape (extraer_datos_detalle)
For every individual job card, the scraper follows the offer’s URL to its detail page. This secondary request parses li.list-item elements to extract the precise ubicacion (location) and tipo_contrato (contract type), fields that are not reliably structured in the listing cards. A 0.4 s sleep is introduced between each detail page request to avoid hammering the server.

After all 24 searches complete, the main() function deduplicates the aggregated results on titulo + empresa and saves the final CSV.

Search Terms

The scraper covers the following 24 BUSQUEDAS, chosen to represent the full breadth of tech roles relevant to the Spanish job market:
BUSQUEDAS = [
    "data-scientist",
    "data-analyst",
    "data-engineer",
    "machine-learning",
    "business-intelligence",
    "programador",
    "analista-programador",
    "arquitecto-tic",
    "desarrollador-web",
    "full-stack",
    "devops",
    "ciberseguridad",
    "tester",
    "junior",
    "soporte-tecnico",
    "administrador-sistemas",
    "redes",
    "dba",
    "base-datos",
    "big-data",
    "cloud",
    "mobile",
    "software",
    "web",
]
These terms map directly to Tecnoempleo’s URL slug system (e.g., /ofertas-trabajo/data-scientist), so no additional URL encoding is required.

Function Reference

scrape_busqueda

scrape_busqueda(termino: str, max_paginas: int = 3) -> list
Scrapes up to max_paginas result pages for a single search term. Constructs paginated URLs automatically (?pagina=N for pages 2 and beyond). Stops early if a page returns a non-200 status or yields no job cards. Returns a flat list of job offer dicts.
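The pagination rule described above can be sketched as follows. This is a minimal illustration, not the scraper's actual code: the helper names construir_url and urls_de_busqueda and the constant BASE_URL are hypothetical, and only the URL pattern and the ?pagina=N convention come from this page.

```python
# Hypothetical constant: the listing URL prefix documented on this page.
BASE_URL = "https://www.tecnoempleo.com/ofertas-trabajo"


def construir_url(termino: str, pagina: int) -> str:
    """Hypothetical helper: listing URL for a search slug and a 1-based page."""
    url = f"{BASE_URL}/{termino}"
    # Page 1 uses the bare slug URL; pages 2 and beyond append ?pagina=N.
    if pagina > 1:
        url += f"?pagina={pagina}"
    return url


def urls_de_busqueda(termino: str, max_paginas: int = 3) -> list:
    """Hypothetical helper: every listing URL a single search term could visit."""
    return [construir_url(termino, p) for p in range(1, max_paginas + 1)]


if __name__ == "__main__":
    for url in urls_de_busqueda("data-scientist"):
        print(url)
```

The real function also stops early on a non-200 status or an empty page, so this list is an upper bound on the pages actually fetched.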

extraer_oferta

extraer_oferta(card, termino: str, session: requests.Session) -> dict
Extracts structured data from a single job listing card. Internally calls extraer_datos_detalle() to enrich the result with location and contract type from the offer’s detail page. Returns a dict with the following keys:
| Key | Description |
| --- | --- |
| titulo | Job title |
| empresa | Company name |
| ubicacion | Location (from detail page) |
| salario | Salary range string (detected by symbol) |
| tipo_contrato | Contract type (from detail page) |
| skills | Deduplicated badge skills as a comma-separated string |
| busqueda | The search term that surfaced this offer |
| url | Full absolute URL to the offer on Tecnoempleo |
Tecnoempleo listing cards use relative URLs (e.g., /oferta/123); the function prepends https://www.tecnoempleo.com automatically.
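One way to turn a relative card link into an absolute URL is shown below. Note that it uses urllib.parse.urljoin from the standard library rather than the plain string prepending described above; the helper name url_absoluta and the constant BASE are hypothetical.

```python
from urllib.parse import urljoin

# Site root, as documented on this page.
BASE = "https://www.tecnoempleo.com"


def url_absoluta(href: str) -> str:
    """Hypothetical helper: resolve a card's href against the site root."""
    # urljoin leaves already-absolute URLs untouched and resolves
    # root-relative paths such as /oferta/123 against BASE.
    return urljoin(BASE, href)


if __name__ == "__main__":
    print(url_absoluta("/oferta/123"))
```

Compared with simple concatenation, urljoin avoids doubled prefixes if a card ever carries an absolute link.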

extraer_datos_detalle

extraer_datos_detalle(url: str, session: requests.Session) -> dict
Fetches the offer’s detail page and parses its li.list-item elements. Identifies the location and contract type rows by scanning each item’s text for the keywords "ubicación" / "ubicacion" and "tipo contrato" / "tipo de contrato", then extracts the value from the corresponding span.float-end. Returns a dict with keys ubicacion and tipo_contrato (both default to None on request failure).
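The keyword scan can be illustrated without any HTML parsing. The sketch below assumes each li.list-item has already been reduced to a pair of (full item text, text of its span.float-end); the function name clasificar_items and the two keyword tuples are hypothetical, and the real function may differ in detail.

```python
# Keywords documented on this page for the two rows of interest.
CLAVES_UBICACION = ("ubicación", "ubicacion")
CLAVES_CONTRATO = ("tipo contrato", "tipo de contrato")


def clasificar_items(items) -> dict:
    """Hypothetical helper: pick location and contract type from (text, value) pairs."""
    # Both keys start as None, matching the documented default on failure.
    datos = {"ubicacion": None, "tipo_contrato": None}
    for texto, valor in items:
        texto = texto.lower()
        if any(clave in texto for clave in CLAVES_UBICACION):
            datos["ubicacion"] = valor.strip()
        elif any(clave in texto for clave in CLAVES_CONTRATO):
            datos["tipo_contrato"] = valor.strip()
    return datos


if __name__ == "__main__":
    ejemplo = [
        ("Ubicacion Madrid (Híbrido)", "Madrid (Híbrido)"),
        ("Tipo de contrato Indefinido", "Indefinido"),
    ]
    print(clasificar_items(ejemplo))
```

Rows that match neither keyword set are ignored, so unrelated detail items do not affect the result.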

limpiar_lineas

limpiar_lineas(texto: str) -> list
Splits a multi-line text string on newlines, strips whitespace from each line, and filters out blank lines. Used internally to parse the salary and info block from listing cards.
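A minimal sketch consistent with that description, assuming nothing beyond the stated behaviour (the actual implementation may differ):

```python
def limpiar_lineas(texto: str) -> list:
    """Split text into lines, strip each one, and drop the blank ones."""
    return [linea.strip() for linea in texto.splitlines() if linea.strip()]


if __name__ == "__main__":
    bloque = "  Madrid \n\n 30.000€ - 45.000€\n   \n"
    print(limpiar_lineas(bloque))
```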

main

main() -> None
Orchestrates the full scraping run: iterates over all 24 BUSQUEDAS, calls scrape_busqueda() for each, concatenates results, deduplicates on titulo + empresa, enforces the canonical column order, and saves the output to data/raw/tecnoempleo_jobs.csv (UTF-8 with BOM).
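The deduplication and export steps could look like the sketch below, written with the standard library only. The helper names deduplicar and guardar_csv are hypothetical, keeping the first occurrence of each titulo + empresa pair is an assumption (this page does not say which duplicate survives), and the column order is assumed to match the output schema listed on this page.

```python
import csv

# Assumed canonical order, taken from the output schema on this page.
COLUMNAS = ["titulo", "empresa", "ubicacion", "salario",
            "tipo_contrato", "skills", "busqueda", "url"]


def deduplicar(ofertas: list) -> list:
    """Hypothetical helper: keep the first offer seen per (titulo, empresa)."""
    vistas = set()
    unicas = []
    for oferta in ofertas:
        clave = (oferta.get("titulo"), oferta.get("empresa"))
        if clave not in vistas:
            vistas.add(clave)
            unicas.append(oferta)
    return unicas


def guardar_csv(ofertas: list, destino) -> None:
    """Hypothetical helper: write offers to an open text file in COLUMNAS order."""
    # Missing keys are written as empty cells; unexpected keys are ignored.
    writer = csv.DictWriter(destino, fieldnames=COLUMNAS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(ofertas)


# Usage sketch: "utf-8-sig" is Python's codec name for UTF-8 with BOM.
# with open("data/raw/tecnoempleo_jobs.csv", "w", newline="",
#           encoding="utf-8-sig") as f:
#     guardar_csv(deduplicar(todas_las_ofertas), f)
```

Because the key is titulo + empresa, the same offer surfaced by two different search terms is stored once, and its busqueda value reflects whichever term is kept.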

Running the Scraper

Execute the scraper from the project root:
python src/scraper/extract_tecnoempleo.py
The script prints progress for each search term and page as it runs:
Buscando: 'data-scientist'...
    Página 1: https://www.tecnoempleo.com/ofertas-trabajo/data-scientist
    Página 2: https://www.tecnoempleo.com/ofertas-trabajo/data-scientist?pagina=2
    ...
  → 47 ofertas
Buscando: 'data-analyst'...
...
CSV generado con 369 ofertas → data/raw/tecnoempleo_jobs.csv
The scraper makes real HTTP requests to Tecnoempleo on every run. Execute it responsibly and avoid scheduling it at high frequency. Do not increase max_paginas beyond 5 without testing: for every one of the 24 search terms, each additional page adds one listing request plus one detail-page request per job card on that page, so the total request count grows quickly.
On the very first run, the scraper writes the raw HTML of the first result page for the "data-scientist" search to data/raw/debug_page.html. Open this file in a browser whenever Tecnoempleo updates its markup to quickly identify which CSS selectors have changed.

Output Schema

The scraper writes an 8-column CSV to data/raw/tecnoempleo_jobs.csv:
| Column | Description |
| --- | --- |
| titulo | Job offer title |
| empresa | Hiring company name |
| ubicacion | Location with work modality, e.g. "Madrid (Híbrido)" |
| salario | Raw salary range text, e.g. "30.000€ - 45.000€" |
| tipo_contrato | Contract type, e.g. "Indefinido" |
| skills | Comma-separated list of required tech skills |
| busqueda | The search term that generated this offer |
| url | Full URL to the offer detail page on Tecnoempleo |

Rate Limiting

The scraper applies two layers of deliberate rate limiting to remain a respectful client:
  • Between pages: time.sleep(1) — a 1-second pause after each paginated listing page is fetched, giving the server breathing room between bulk result requests.
  • Between detail requests: time.sleep(0.4) — a 400 ms pause after each individual offer detail page is fetched inside extraer_oferta().
All requests are sent with a realistic browser User-Agent header to avoid being blocked by basic bot-detection filters:
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
}
A requests.Session object is reused across all pages within a single search term, which reduces TCP handshake overhead and keeps connection behaviour more browser-like.
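To get a feel for how these pauses and the page limit add up, the rough estimate below counts requests and enforced sleep time for a full run. The function estimar_carga is hypothetical, and ofertas_por_pagina (cards per result page) is an assumption you must supply, since this page does not state it. The result is an upper bound, because the scraper stops early on empty pages.

```python
def estimar_carga(ofertas_por_pagina: int, terminos: int = 24,
                  max_paginas: int = 3, pausa_pagina: float = 1.0,
                  pausa_detalle: float = 0.4) -> dict:
    """Hypothetical helper: upper bound on requests and sleep time for one run."""
    paginas = terminos * max_paginas          # listing requests
    detalles = paginas * ofertas_por_pagina   # one detail request per card
    return {
        "peticiones": paginas + detalles,
        # Sleep time only; network latency comes on top of this.
        "pausa_total_s": paginas * pausa_pagina + detalles * pausa_detalle,
    }


if __name__ == "__main__":
    # 30 cards per page is an illustrative guess, not a documented figure.
    print(estimar_carga(ofertas_por_pagina=30))
    print(estimar_carga(ofertas_por_pagina=30, max_paginas=5))
```

Comparing the two calls shows why raising max_paginas deserves a test run first: detail requests dominate the total, and they scale with both the page limit and the number of cards per page.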
