Collecting fresh Spanish job-market data turned out to be the most technically challenging phase of the project. Three static datasets came pre-loaded from the bootcamp, but they either skewed international or lacked enough recent Spanish listings. To close that gap, the team (led by David, Data Analyst, May 2026) attempted automated scraping of the main Spanish job portals before landing on a reliable, fully ethical alternative: the public Adzuna API.

Pre-loaded datasets

Before any collection work began, the project already had three sources to work with:

tecnoempleo_spain_2026.csv

600 Spanish job offers with location data. Salary field present but 78 % null.

data_science_job_posts_2025.csv

944 international listings, of which 143 are in Spain. Skills and salary columns have no nulls.

stackoverflow_2025_results.csv

Stack Overflow annual survey — used for demographic and technology-preference analysis.
These datasets form the analytical backbone of the EDA, but the team needed a larger volume of recent Spanish offers with full descriptions to make the analysis meaningful.

Scraping portals — what was tried and why it failed

The team systematically evaluated the two dominant Spanish job portals using several automation techniques. Every approach was blocked:
| Portal | Method | Result |
| --- | --- | --- |
| InfoJobs | requests + BeautifulSoup | CAPTCHA |
| InfoJobs | Playwright (real browser) | Distil Networks block |
| Indeed | requests + BeautifulSoup | 403 Security Check |
| Indeed | Playwright visible panel | Cloudflare Turnstile (not bypassed) |
| Indeed | Playwright + clicks + pause | Cloudflare + context close |
| InfoJobs API | App registration | Registration closed temporarily |
Attempting to bypass CAPTCHA or Cloudflare Turnstile systems violates portal terms of service. All scraping experiments were stopped as soon as blocks were detected — no circumvention techniques were used.
The official InfoJobs API would have been the correct path, but new application registrations were closed at the time of the project. This left the team without a sanctioned route to either of the two largest Spanish portals.
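
The stop-on-block policy described above can be sketched as a small check that inspects a response and reports whether it looks like an anti-bot challenge. This is an illustration, not code from the project notebooks: the names `looks_blocked` and `crawl`, the marker strings, and the status codes are all assumptions.

```python
# Hypothetical helper: decide whether a response looks like an anti-bot block.
# The status codes and marker strings below are assumptions for illustration.
BLOCK_STATUS_CODES = {403, 429}
BLOCK_MARKERS = ("captcha", "security check", "cf-turnstile")


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True if the response should be treated as a block."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def crawl(pages):
    """Collect page bodies, stopping at the first block without retrying.

    `pages` is an iterable of (status_code, body) pairs.
    """
    collected = []
    for status_code, body in pages:
        if looks_blocked(status_code, body):
            break  # stop immediately; no circumvention is attempted
        collected.append(body)
    return collected
```

The point of the design is that a detected block ends the run rather than triggering retries, proxies, or header spoofing.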

Solution: the Adzuna public API

Adzuna is an international job aggregator whose well-documented public API exposes listings collected from multiple Spanish sources.

Free tier

1,000 requests per month — more than sufficient for this project’s 56-combination search matrix (7 roles × 8 cities).

Simple auth

app_id + app_key passed as URL query parameters. No OAuth flow required.

Rich payload

Each result returns: title, company, location, description, salary_min, salary_max, contract_type, category, and redirect_url.

Native pagination

Up to 5 pages of 50 results each per search query, accessed via a page number in the URL path.
No CAPTCHAs or rate-limit errors were encountered during testing. The 1.1-second sleep between pages is a precautionary courtesy delay, not a hard requirement.

Credential setup

API keys are never written into the notebook code. They are stored in a .env file (git-ignored) and loaded at runtime with python-dotenv.
1. Copy the example file

A .env.example file with placeholder values is included in the project root. Copy and rename it:
cp .env.example .env
2. Register on Adzuna Developer

Create a free account at developer.adzuna.com/signup. Your Application ID and Application Key appear in the dashboard immediately after registration.
3. Fill in your credentials

Open .env in any text editor and replace the placeholder values:
ADZUNA_APP_ID=your_real_app_id
ADZUNA_APP_KEY=your_real_app_key
The .env file is already listed in .gitignore — it will never be committed.
4. Install dependencies

From the project root, install the required packages:
pip install -r requirements.txt
This installs requests, pandas, python-dotenv, and the rest of the project dependencies.

Notebook walkthrough

1. Imports and path setup

The notebook loads credentials, sets the output path, and raises a descriptive ValueError immediately if either key is missing:
import os
import requests
import time
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

APP_ID = os.getenv("ADZUNA_APP_ID")
APP_KEY = os.getenv("ADZUNA_APP_KEY")

# Fail fast with a clear message if either credential is missing.
if not APP_ID or not APP_KEY:
    raise ValueError(
        "ADZUNA_APP_ID or ADZUNA_APP_KEY is missing from the .env file. "
        "See .env.example for more information."
    )

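PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), ".."))
# Assumption: write under data/raw/ in the project root, the location this page reports for the output file.
DEFAULT_OUTPUT = os.path.join(PROJECT_ROOT, "data", "raw", "scraping_jobs_raw.csv")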

2. Define the scrape_adzuna() function

The core extraction function paginates through Adzuna results, accumulates offer dicts, builds a unified salary column, deduplicates by enlace (URL), and appends to the output CSV on disk:
def scrape_adzuna(keyword, location, output_file=DEFAULT_OUTPUT, max_pages=5):
    """Fetch up to max_pages pages of Adzuna results for one keyword/location pair."""
    base_url = "https://api.adzuna.com/v1/api/jobs/es/search"
    all_offers = []

    for page in range(1, max_pages + 1):
        params = {
            "app_id": APP_ID,
            "app_key": APP_KEY,
            "what": keyword,
            "where": location,
            "content-type": "application/json",
            "results_per_page": 50,
        }
        # The page number goes in the URL path, not in the query string.
        # A timeout keeps a stalled connection from hanging the notebook.
        resp = requests.get(f"{base_url}/{page}", params=params, timeout=30)

        if resp.status_code != 200:
            print(f"  [!] Error on page {page}: {resp.status_code}")
            break

        offers = resp.json().get("results", [])
        if not offers:
            print(f"  [i] No more results on page {page}. Stopping.")
            break

        for job in offers:
            # "or {}" guards against nested fields that are present but null.
            company = job.get("company") or {}
            place = job.get("location") or {}
            category = job.get("category") or {}
            all_offers.append({
                "titulo":      job.get("title"),
                "empresa":     company.get("display_name"),
                "ubicacion":   place.get("display_name"),
                "enlace":      job.get("redirect_url"),
                "metadatos":   f"{job.get('contract_type', '')} | "
                               f"{category.get('label', '')}",
                "descripcion": job.get("description"),
                "salario_min": job.get("salary_min"),
                "salario_max": job.get("salary_max"),
                "salario_moneda": job.get("salary_currency", "EUR"),
            })

        time.sleep(1.1)  # courtesy delay between pages

    # ... build salary column, deduplicate, save to CSV
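
The steps elided at the end of the function (unified salary column, deduplication by enlace, append to CSV) might look roughly like the sketch below. It is not the notebook's code: the helper names `build_salary` and `merge_and_save`, the `salario` column name, and the salary string format are assumptions. It also swaps in the standard-library csv module instead of pandas so that it runs without extra dependencies.

```python
import csv
import os

# Column order for the output file; "salario" is an assumed name for the
# unified salary column.
FIELDS = ["titulo", "empresa", "ubicacion", "enlace", "metadatos",
          "descripcion", "salario_min", "salario_max", "salario_moneda",
          "salario"]


def build_salary(offer):
    """Combine salario_min/salario_max into one string (format is assumed)."""
    low, high = offer.get("salario_min"), offer.get("salario_max")
    currency = offer.get("salario_moneda") or "EUR"
    low_missing, high_missing = low in (None, ""), high in (None, "")
    if low_missing and high_missing:
        return ""  # salary not published
    if low_missing or high_missing or low == high:
        single = high if low_missing else low
        return f"{single} {currency}"
    return f"{low} - {high} {currency}"


def merge_and_save(new_offers, output_file):
    """Append new offers to output_file, keeping one row per enlace."""
    rows = []
    if os.path.exists(output_file):
        with open(output_file, newline="", encoding="utf-8") as fh:
            rows = list(csv.DictReader(fh))

    for offer in new_offers:
        rows.append({**offer, "salario": build_salary(offer)})

    seen, unique = set(), []
    for row in rows:
        link = row.get("enlace")
        if link not in seen:  # first occurrence wins
            seen.add(link)
            unique.append(row)

    with open(output_file, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(unique)
    return unique
```

Re-reading the existing file before each write is what makes repeated runs idempotent: an offer already on disk is kept and its duplicate from a later search is dropped. Offers without any enlace collapse into a single row under this rule, which a real pipeline may want to handle separately.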

3. Quick test

A single two-page search verifies that the credentials and network connection work before launching the full batch:
df_test = scrape_adzuna("data scientist", "Madrid", max_pages=2)
df_test[["titulo", "empresa", "descripcion"]].head()

4. Bulk searches

The full extraction iterates over 7 roles × 8 cities = 56 combinations. Each combination runs with max_pages=1 to stay well within the 1,000-request monthly free tier:
roles = [
    "data scientist", "data engineer", "data analyst",
    "machine learning engineer", "business intelligence",
    "big data", "analista de datos",
]

ciudades = [
    "Madrid", "Barcelona", "Valencia", "Bilbao",
    "Sevilla", "Zaragoza", "Malaga", "remoto",
]

for rol in roles:
    for ciudad in ciudades:
        scrape_adzuna(rol, ciudad, max_pages=1)
All results accumulate in data/raw/scraping_jobs_raw.csv, deduplicated by URL on every write.
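
The quota reasoning above can be checked with a few lines of arithmetic. The sketch below counts the worst-case number of requests for the search matrix and compares it with the monthly allowance quoted on this page; the function name `request_budget` and the constants are illustrative, not part of the notebook.

```python
from itertools import product

MONTHLY_QUOTA = 1000      # free-tier figure quoted on this page
COURTESY_DELAY_S = 1.1    # sleep after each page request in the notebook


def request_budget(n_roles, n_cities, max_pages):
    """Worst-case request count and total sleep time for a full batch."""
    requests_needed = n_roles * n_cities * max_pages
    sleep_seconds = requests_needed * COURTESY_DELAY_S
    return requests_needed, sleep_seconds


roles = ["data scientist", "data engineer", "data analyst",
         "machine learning engineer", "business intelligence",
         "big data", "analista de datos"]
ciudades = ["Madrid", "Barcelona", "Valencia", "Bilbao",
            "Sevilla", "Zaragoza", "Malaga", "remoto"]

combos = list(product(roles, ciudades))  # every (role, city) pair
for pages in (1, 5):
    needed, sleep_s = request_budget(len(roles), len(ciudades), pages)
    print(f"max_pages={pages}: {needed} requests, "
          f"{needed / MONTHLY_QUOTA:.0%} of quota, ~{sleep_s:.0f}s of sleep")
```

With max_pages=1 a full batch costs 56 requests, and even the 5-page maximum would cost 280, so the choice of a single page leaves room for many reruns within the same month.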

Adzuna API reference

| Parameter | Type | Description |
| --- | --- | --- |
| app_id | string | Application ID from the Adzuna developer dashboard |
| app_key | string | Application Key from the Adzuna developer dashboard |
| what | string | Search keywords (e.g. "data engineer") |
| where | string | Location string (e.g. "Madrid", "remoto") |
| results_per_page | integer | Number of results per page (max 50) |
| content-type | string | Set to "application/json" |
The page number is appended to the base URL path: https://api.adzuna.com/v1/api/jobs/es/search/{page}.
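
To make the split between path and query string concrete, the sketch below assembles a full request URL from the parameters listed above using only the standard library. The helper name `build_search_url` is hypothetical, and the credentials are placeholders.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.adzuna.com/v1/api/jobs/es/search"


def build_search_url(page, what, where, app_id, app_key, results_per_page=50):
    """Return the full request URL for one page of results."""
    query = urlencode({
        "app_id": app_id,
        "app_key": app_key,
        "what": what,
        "where": where,
        "results_per_page": results_per_page,
        "content-type": "application/json",
    })
    # The page number is part of the path; everything else is a query parameter.
    return f"{BASE_URL}/{page}?{query}"


# Placeholder credentials only; real keys belong in .env.
print(build_search_url(2, "data engineer", "Madrid", "APP_ID", "APP_KEY"))
```

Passing a `params` dict to `requests.get`, as the notebook does, produces an equivalent URL; building it by hand is mainly useful for logging or debugging a single request.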

Lessons learned

Both InfoJobs and Indeed employ enterprise-grade bot-detection systems (Distil Networks and Cloudflare Turnstile respectively). These cannot be bypassed ethically or reliably in a professional context without explicit authorization from the portal owner.
The salary_min and salary_max fields are null for a significant portion of offers — a pattern consistent with the 78 % null rate seen in tecnoempleo_spain_2026.csv. Salary analysis downstream must account for this structural absence.
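Using max_pages=1 to conserve API quota means only the first page (at most 50 offers) of each search is captured. Because the request sets no explicit sort order, which offers land on that page depends on the API's default ranking, so older listings that have not been refreshed may be systematically underrepresented.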
The eight cities in the bulk search cover Spain’s main urban areas, but smaller provinces and rural areas are entirely absent. The scraped dataset should not be used to draw conclusions about the national job market outside major cities.
The scraping dataset (scraping_jobs_raw.csv) feeds directly into the cleaning pipeline in 02_cleaning.ipynb, where it is normalised to the same column schema as the other sources and integrated into jobs_all_clean.csv.
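
The salary caveat in the lessons above can be quantified before any downstream analysis. The following is a minimal sketch, assuming offers are held as dicts with the salario_min and salario_max keys produced by scrape_adzuna(); the function name `salary_null_rate` and the sample rows are made up for illustration.

```python
def salary_null_rate(offers):
    """Fraction of offers that publish neither a minimum nor a maximum salary."""
    if not offers:
        return 0.0
    missing = sum(
        1 for offer in offers
        if offer.get("salario_min") is None and offer.get("salario_max") is None
    )
    return missing / len(offers)


# Made-up sample rows, not project data.
sample = [
    {"titulo": "Data Analyst", "salario_min": 28000, "salario_max": 35000},
    {"titulo": "Data Engineer", "salario_min": None, "salario_max": None},
    {"titulo": "Data Scientist", "salario_min": None, "salario_max": 45000},
    {"titulo": "BI Developer", "salario_min": None, "salario_max": None},
]

print(f"Offers without salary: {salary_null_rate(sample):.0%}")
```

Once the CSV has been loaded with pandas, missing values appear as NaN rather than None, so the equivalent check there would rely on `isna()` instead of an identity comparison.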
