Tecnoempleo Scraper: Automated Job Listing Extraction

The TinderJob scraper is a purpose-built Python module that extracts live technology job listings from Tecnoempleo, Spain’s leading tech job portal. It iterates over 24 predefined search terms (spanning data science, software engineering, DevOps, cloud, and more) and consolidates the results into a single deduplicated dataset. The final output is written to data/raw/tecnoempleo_jobs.csv and serves as the raw input for the entire TinderJob analytics pipeline.

Architecture

The scraper uses a two-step extraction strategy to capture both listing-level and detail-level data for every job offer:

Step 1: Paginated Listing Scrape (scrape_busqueda)
For each of the 24 search terms, the scraper fetches up to 3 pages of search results from https://www.tecnoempleo.com/ofertas-trabajo/{termino}. Each result page contains a grid of job cards (div.p-3.border.rounded.mb-3.bg-white). The scraper collects all cards per page and passes each to extraer_oferta().

Step 2: Detail Page Scrape (extraer_datos_detalle)
For every individual job card, the scraper follows the offer’s URL to its detail page. This secondary request parses li.list-item elements to extract the precise ubicacion (location) and tipo_contrato (contract type), fields that are not reliably structured in the listing cards. A 0.4 s sleep is introduced between detail page requests to avoid hammering the server.

After all 24 searches complete, the main() function deduplicates the aggregated results on titulo + empresa and saves the final CSV.

Search Terms

The BUSQUEDAS list holds the following 24 search terms, chosen to cover the breadth of tech roles relevant to the Spanish job market:

BUSQUEDAS = [
    "data-scientist",
    "data-analyst",
    "data-engineer",
    "machine-learning",
    "business-intelligence",
    "programador",
    "analista-programador",
    "arquitecto-tic",
    "desarrollador-web",
    "full-stack",
    "devops",
    "ciberseguridad",
    "tester",
    "junior",
    "soporte-tecnico",
    "administrador-sistemas",
    "redes",
    "dba",
    "base-datos",
    "big-data",
    "cloud",
    "mobile",
    "software",
    "web",
]

These terms map directly to Tecnoempleo’s URL slug system (e.g., /ofertas-trabajo/data-scientist), so no additional URL encoding is required.

Function Reference

`scrape_busqueda`

scrape_busqueda(termino: str, max_paginas: int = 3) -> list

Scrapes up to max_paginas result pages for a single search term. Constructs paginated URLs automatically (?pagina=N for pages 2 and beyond). Stops early if a page returns a non-200 status or yields no job cards. Returns a flat list of job offer dicts.
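The pagination rule can be sketched as a small helper. This is a minimal illustration rather than the module's actual code: the names BASE_URL, construir_url, and urls_busqueda are hypothetical, and only the URL pattern and the ?pagina=N convention come from the description above.

```python
BASE_URL = "https://www.tecnoempleo.com/ofertas-trabajo"  # pattern taken from the docs


def construir_url(termino: str, pagina: int = 1) -> str:
    """Build the listing URL for a search slug and a 1-based page number."""
    if pagina < 1:
        raise ValueError("pagina must be >= 1")
    url = f"{BASE_URL}/{termino}"
    # Page 1 has no query string; pages 2 and beyond append ?pagina=N.
    if pagina > 1:
        url += f"?pagina={pagina}"
    return url


def urls_busqueda(termino: str, max_paginas: int = 3) -> list[str]:
    """List every listing URL that a full run for one term could request."""
    return [construir_url(termino, p) for p in range(1, max_paginas + 1)]
```

Because the scraper stops early on a non-200 status or an empty page, this list is an upper bound on the pages actually fetched for a term.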

`extraer_oferta`

extraer_oferta(card, termino: str, session: requests.Session) -> dict

Extracts structured data from a single job listing card. Internally calls extraer_datos_detalle() to enrich the result with location and contract type from the offer’s detail page. Returns a dict with the following keys:

Key	Description
`titulo`	Job title
`empresa`	Company name
`ubicacion`	Location (from detail page)
`salario`	Salary range string (detected by `€` symbol)
`tipo_contrato`	Contract type (from detail page)
`skills`	Deduplicated badge skills as a comma-separated string
`busqueda`	The search term that surfaced this offer
`url`	Full absolute URL to the offer on Tecnoempleo

Tecnoempleo listing cards use relative URLs (e.g., /oferta/123); the function prepends https://www.tecnoempleo.com automatically.
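One way to perform this conversion is shown below. It is a sketch under assumptions: the source only says the base URL is prepended, whereas this version swaps in the standard library's urljoin, which also leaves already-absolute links untouched. The names SITE and url_absoluta are hypothetical.

```python
from urllib.parse import urljoin

SITE = "https://www.tecnoempleo.com"


def url_absoluta(href: str) -> str:
    """Resolve a card's href against the site root.

    Relative paths such as /oferta/123 gain the scheme and host;
    absolute URLs are returned unchanged.
    """
    return urljoin(SITE, href)
```

Plain string concatenation would work for the relative case, but it would produce a malformed URL if a card ever carried an absolute link.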

`extraer_datos_detalle`

extraer_datos_detalle(url: str, session: requests.Session) -> dict

Fetches the offer’s detail page and parses its li.list-item elements. Identifies the location and contract type rows by scanning each item’s text for the keywords "ubicación" / "ubicacion" and "tipo contrato" / "tipo de contrato", then extracts the value from the corresponding span.float-end. Returns a dict with keys ubicacion and tipo_contrato (both default to None on request failure).
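The keyword scan described above can be illustrated in isolation from the HTML parsing. The sketch below assumes each li.list-item has already been reduced to a (row text, span.float-end value) pair; the function name clasificar_filas and the sample rows in the usage note are hypothetical.

```python
def clasificar_filas(filas):
    """Pick location and contract type out of (row_text, value) pairs.

    Both keys default to None, mirroring the documented behaviour when
    a field cannot be found.
    """
    datos = {"ubicacion": None, "tipo_contrato": None}
    for texto, valor in filas:
        texto = texto.lower()
        # Accept the label with and without the accent.
        if "ubicación" in texto or "ubicacion" in texto:
            datos["ubicacion"] = valor.strip()
        elif "tipo contrato" in texto or "tipo de contrato" in texto:
            datos["tipo_contrato"] = valor.strip()
    return datos
```

For example, passing [("Ubicación Madrid (Híbrido)", "Madrid (Híbrido)"), ("Tipo de contrato Indefinido", "Indefinido")] fills both keys, while an empty list leaves both as None.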

`limpiar_lineas`

limpiar_lineas(texto: str) -> list

Splits a multi-line text string on newlines, strips whitespace from each line, and filters out blank lines. Used internally to parse the salary and info block from listing cards.
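A function matching that description fits in a few lines. This is one plausible implementation written from the prose above, not a copy of the module's code.

```python
def limpiar_lineas(texto: str) -> list:
    """Split on newlines, strip each line, and drop the blank ones."""
    lineas = (linea.strip() for linea in texto.split("\n"))
    return [linea for linea in lineas if linea]
```

Stripping before filtering matters: a line containing only spaces or a stray carriage return becomes empty and is discarded.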

`main`

main() -> None

Orchestrates the full scraping run: iterates over all 24 BUSQUEDAS, calls scrape_busqueda() for each, concatenates results, deduplicates on titulo + empresa, enforces the canonical column order, and saves the output to data/raw/tecnoempleo_jobs.csv (UTF-8 with BOM).
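The deduplication and export steps might look like the following. The source does not say which library main() uses, so this sketch sticks to the standard library; the helper names (deduplicar, a_csv, guardar) are hypothetical, and keeping the first occurrence of each duplicate is an assumption.

```python
import csv
import io

# Canonical column order, as listed in the output schema.
COLUMNAS = ["titulo", "empresa", "ubicacion", "salario",
            "tipo_contrato", "skills", "busqueda", "url"]


def deduplicar(ofertas):
    """Keep the first offer seen for each (titulo, empresa) pair."""
    vistos = set()
    unicas = []
    for oferta in ofertas:
        clave = (oferta.get("titulo"), oferta.get("empresa"))
        if clave not in vistos:
            vistos.add(clave)
            unicas.append(oferta)
    return unicas


def a_csv(ofertas) -> str:
    """Render offers as CSV text with the canonical column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNAS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(ofertas)  # missing keys are written as empty cells
    return buffer.getvalue()


def guardar(ofertas, ruta):
    """Write deduplicated offers to ruta as UTF-8 with BOM."""
    # utf-8-sig prepends the BOM; newline="" keeps the csv line endings as-is.
    with open(ruta, "w", encoding="utf-8-sig", newline="") as fh:
        fh.write(a_csv(deduplicar(ofertas)))
```

Since the same offer can surface under several search terms, the busqueda value that survives depends on which duplicate is kept.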

Running the Scraper

Execute the scraper from the project root:

python src/scraper/extract_tecnoempleo.py

The script will print progress for each search term and page as it runs:

Buscando: 'data-scientist'...
    Página 1: https://www.tecnoempleo.com/ofertas-trabajo/data-scientist
    Página 2: https://www.tecnoempleo.com/ofertas-trabajo/data-scientist?pagina=2
    ...
  → 47 ofertas
Buscando: 'data-analyst'...
...
CSV generado con 369 ofertas → data/raw/tecnoempleo_jobs.csv

The scraper makes real HTTP requests to Tecnoempleo on every run. Execute it responsibly and avoid scheduling it at high frequency. Do not increase max_paginas beyond 5 without testing: for every search term, each additional page adds one listing request plus one detail-page request per card on that page, so the total request count grows quickly.
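A rough upper bound on the request volume can be worked out before changing max_paginas. The sketch below is illustrative arithmetic only: the number of cards per page is not stated in this document, so tarjetas_por_pagina=20 is a placeholder assumption, and estimar_peticiones is a hypothetical helper. The pause values (1 s per listing page, 400 ms per detail page) are the ones documented under Rate Limiting.

```python
def estimar_peticiones(terminos=24, max_paginas=3, tarjetas_por_pagina=20):
    """Upper-bound estimate of requests and sleep time for one full run.

    tarjetas_por_pagina is an assumed figure, not a measured one.
    """
    listados = terminos * max_paginas
    detalles = listados * tarjetas_por_pagina
    # Work in integer milliseconds to avoid float rounding noise.
    pausa_ms = listados * 1000 + detalles * 400
    return {
        "listados": listados,
        "detalles": detalles,
        "total": listados + detalles,
        "pausa_minima_s": pausa_ms / 1000,
    }
```

Under that assumption, the default 3 pages give at most 1,512 requests and 648 seconds of deliberate sleeping, while 5 pages give at most 2,520 requests. Actual numbers will be lower whenever a search runs out of results early.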

On the very first run, the scraper writes the raw HTML of the first result page for the "data-scientist" search to data/raw/debug_page.html. Open this file in a browser whenever Tecnoempleo updates its markup to quickly identify which CSS selectors have changed.

Output Schema

The scraper writes an 8-column CSV to data/raw/tecnoempleo_jobs.csv:

Column	Description
`titulo`	Job offer title
`empresa`	Hiring company name
`ubicacion`	Location with work modality, e.g. `"Madrid (Híbrido)"`
`salario`	Raw salary range text, e.g. `"30.000€ - 45.000€"`
`tipo_contrato`	Contract type, e.g. `"Indefinido"`
`skills`	Comma-separated list of required tech skills
`busqueda`	The search term that generated this offer
`url`	Full URL to the offer detail page on Tecnoempleo
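Because salario is stored as raw text, downstream code has to parse it before doing arithmetic. The sketch below assumes the format shown in the example ("30.000€ - 45.000€", with a dot as thousands separator); other formats may appear in real data, and parsear_salario is a hypothetical helper, not part of the scraper.

```python
import re

# One or more digits (optionally dot-separated) directly before a euro sign.
_IMPORTE = re.compile(r"(\d[\d.]*)\s*€")


def parsear_salario(texto):
    """Return the euro amounts found in a salary string as integers."""
    if not texto:
        return []
    return [int(m.replace(".", "")) for m in _IMPORTE.findall(texto)]
```

An empty list signals that no amount was recognised, which keeps missing salaries distinguishable from a parsed value of zero.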

Rate Limiting

The scraper applies two layers of deliberate rate limiting to remain a respectful client:

Between pages: time.sleep(1) — a 1-second pause after each paginated listing page is fetched, giving the server breathing room between bulk result requests.
Between detail requests: time.sleep(0.4) — a 400 ms pause after each individual offer detail page is fetched inside extraer_oferta().
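Both layers follow the same pattern: do one fetch, then pause. A minimal sketch of that pattern is shown below; recorrer_con_pausa and its parameters are hypothetical, and the injectable dormir argument exists only so the pause can be replaced in tests.

```python
import time


def recorrer_con_pausa(elementos, obtener, pausa_s, dormir=time.sleep):
    """Call obtener on each element, sleeping pausa_s after every call."""
    resultados = []
    for elemento in elementos:
        resultados.append(obtener(elemento))
        dormir(pausa_s)  # pause after the fetch, as described above
    return resultados
```

With this shape, the listing loop would use pausa_s=1 and the detail loop pausa_s=0.4.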

All requests are sent with a realistic browser User-Agent header to avoid being blocked by basic bot-detection filters:

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
}

A requests.Session object is reused across all pages within a single search term, which reduces TCP handshake overhead and keeps connection behaviour more browser-like.
