Collecting fresh Spanish job-market data turned out to be the most technically challenging phase of the project. Three static datasets came pre-loaded from the bootcamp, but they either skewed international or lacked enough recent Spanish listings. To close that gap, the team (led by David, Data Analyst, May 2026) attempted automated scraping of the main Spanish job portals before landing on a reliable, fully ethical alternative: the public Adzuna API.
Pre-loaded datasets
Before any collection work began, the project already had three sources to work with:
tecnoempleo_spain_2026.csv
600 Spanish job offers with location data. Salary field present but 78 % null.
data_science_job_posts_2025.csv
944 international listings, of which 143 are in Spain. Skills and salary columns have no nulls.
stackoverflow_2025_results.csv
Stack Overflow annual survey — used for demographic and technology-preference analysis.
Scraping portals — what was tried and why it failed
The team systematically evaluated the two dominant Spanish job portals using several automation techniques. Every approach was blocked:

| Portal | Method | Result |
|---|---|---|
| InfoJobs | requests + BeautifulSoup | CAPTCHA |
| InfoJobs | Playwright (real browser) | Distil Networks block |
| Indeed | requests + BeautifulSoup | 403 Security Check |
| Indeed | Playwright (visible browser window) | Cloudflare Turnstile (not bypassed) |
| Indeed | Playwright + clicks + pause | Cloudflare + context close |
| InfoJobs API | App registration | Registration closed temporarily |
Solution: the Adzuna public API
Adzuna is an international job aggregator with a well-documented public API that aggregates listings from multiple Spanish sources.
Free tier
1,000 requests per month, more than sufficient for this project’s 56-combination search matrix (7 roles × 8 cities).
Simple auth
app_id + app_key passed as URL query parameters. No OAuth flow required.
Rich payload
Each result returns: title, company, location, description, salary_min, salary_max, contract_type, category, and redirect_url.
Native pagination
Up to 5 pages of 50 results each per search query, accessed via a page number in the URL path.
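The quota claim can be checked with a little arithmetic. The sketch below assumes one HTTP request per result page and uses placeholder role and city labels, since the source gives only the counts (7 and 8). It compares the cheapest and the most expensive run against the monthly allowance.

```python
from itertools import product

MONTHLY_QUOTA = 1_000   # free-tier requests per month
MAX_PAGES = 5           # pages available per search query
RESULTS_PER_PAGE = 50   # results returned per page

# Placeholder labels: the source states only that there are 7 roles and 8 cities.
roles = [f"role_{i}" for i in range(1, 8)]
cities = [f"city_{i}" for i in range(1, 9)]

search_matrix = list(product(roles, cities))


def requests_needed(combinations: int, pages_per_search: int) -> int:
    """Assuming one request per page, the cost is combinations x pages."""
    return combinations * pages_per_search


for pages in (1, MAX_PAGES):
    cost = requests_needed(len(search_matrix), pages)
    ceiling = cost * RESULTS_PER_PAGE  # upper bound on offers before deduplication
    print(f"{pages} page(s) per search: {cost} requests, "
          f"at most {ceiling} offers, {MONTHLY_QUOTA - cost} requests left")
```

Under these assumptions a one-page run costs 56 requests and a full five-page run costs 280, so both fit inside the 1,000-request allowance with room for test calls.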
Credential setup
API keys are never written into the notebook code. They are stored in a .env file (git-ignored) and loaded at runtime with python-dotenv.
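A minimal sketch of the runtime loading step follows. The variable names ADZUNA_APP_ID and ADZUNA_APP_KEY and the helper load_adzuna_credentials are assumptions made for illustration; the source does not give them. load_dotenv() is python-dotenv's standard entry point, and the guarded import only lets the sketch run where the package is not installed.

```python
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
except ImportError:
    load_dotenv = None  # fall back to variables already set in the environment


def load_adzuna_credentials(env=None):
    """Return (app_id, app_key), raising ValueError if either is missing.

    ADZUNA_APP_ID and ADZUNA_APP_KEY are assumed variable names.
    Pass a dict as `env` to bypass .env loading, for example in tests.
    """
    if env is None:
        if load_dotenv is not None:
            load_dotenv()  # copies key=value pairs from .env into os.environ
        env = os.environ
    app_id = env.get("ADZUNA_APP_ID")
    app_key = env.get("ADZUNA_APP_KEY")
    missing = [
        name
        for name, value in (("ADZUNA_APP_ID", app_id), ("ADZUNA_APP_KEY", app_key))
        if not value
    ]
    if missing:
        raise ValueError(f"Missing credentials in .env: {', '.join(missing)}")
    return app_id, app_key
```

Failing early with the name of the missing variable keeps a misconfigured .env from surfacing later as an opaque authentication error from the API.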
Copy the example file
A .env.example file with placeholder values is included in the project root. Copy it and rename the copy to .env.
Register on Adzuna Developer
Create a free account at developer.adzuna.com/signup. Your Application ID and Application Key appear in the dashboard immediately after registration.
Fill in your credentials
Open .env in any text editor and replace the placeholder values. The .env file is already listed in .gitignore, so it will never be committed.
Notebook walkthrough
Imports and path setup
The notebook loads credentials, sets the output path, and raises a descriptive ValueError immediately if either key is missing.
Define the scrape_adzuna() function
The core extraction function paginates through Adzuna results, accumulates offer dicts, builds a unified salary column, deduplicates by enlace (URL), and appends to the output CSV on disk.
Quick test
A single two-page search verifies that the credentials and network connection work before launching the full batch.
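The extraction step described above might look roughly like the sketch below. It is not the notebook's code. It uses only the standard library and takes the HTTP call as an injected fetch_page callable. The column names other than enlace (titulo, empresa, ubicacion, salary) are hypothetical, the midpoint rule for the unified salary is an assumption, and company and location are assumed to arrive as nested objects with a display_name field.

```python
import csv
from pathlib import Path

# Column names other than "enlace" are assumptions made for this sketch.
COLUMNS = ["titulo", "empresa", "ubicacion", "salary", "enlace"]


def unified_salary(salary_min, salary_max):
    """Average whichever salary bounds are present; None when both are absent."""
    present = [s for s in (salary_min, salary_max) if s is not None]
    return sum(present) / len(present) if present else None


def scrape_adzuna(what, where, fetch_page, output_csv, max_pages=1):
    """Paginate one search, deduplicate by URL, append new rows to output_csv.

    fetch_page(what, where, page) must return the decoded JSON of one page.
    Returns the number of rows appended.
    """
    output_csv = Path(output_csv)
    seen = set()
    if output_csv.exists():
        # URLs already on disk count as duplicates too.
        with output_csv.open(newline="", encoding="utf-8") as fh:
            seen = {row["enlace"] for row in csv.DictReader(fh)}

    new_rows = []
    for page in range(1, max_pages + 1):
        results = fetch_page(what, where, page).get("results", [])
        if not results:
            break  # the search ran out of pages early
        for item in results:
            url = item.get("redirect_url")
            if not url or url in seen:
                continue
            seen.add(url)
            new_rows.append({
                "titulo": item.get("title"),
                "empresa": (item.get("company") or {}).get("display_name"),
                "ubicacion": (item.get("location") or {}).get("display_name"),
                "salary": unified_salary(item.get("salary_min"),
                                         item.get("salary_max")),
                "enlace": url,
            })

    write_header = not output_csv.exists() or output_csv.stat().st_size == 0
    with output_csv.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS)
        if write_header:
            writer.writeheader()
        writer.writerows(new_rows)
    return len(new_rows)
```

A quick test in the spirit of the one described would call scrape_adzuna("data analyst", "Madrid", fetch_page=..., output_csv="scraping_jobs_raw.csv", max_pages=2) with a fetch_page that performs the real HTTP request. Injecting the fetcher also lets the pagination and deduplication logic be exercised offline without spending API quota.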
Adzuna API reference
| Parameter | Type | Description |
|---|---|---|
| app_id | string | Application ID from the Adzuna developer dashboard |
| app_key | string | Application Key from the Adzuna developer dashboard |
| what | string | Search keywords (e.g. "data engineer") |
| where | string | Location string (e.g. "Madrid", "remoto") |
| results_per_page | integer | Number of results per page (max 50) |
| content-type | string | Set to "application/json" |

Endpoint: https://api.adzuna.com/v1/api/jobs/es/search/{page} (where {page} is the page number)
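The parameters above combine into a request URL as sketched below, with the page number in the path and everything else in the query string. The helper names build_search_url and get_json are hypothetical, and the sketch uses the standard library's urllib rather than any particular HTTP client.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.adzuna.com/v1/api/jobs/es/search/{page}"


def build_search_url(page, app_id, app_key, what, where, results_per_page=50):
    """Assemble the request URL: page in the path, the rest as query parameters."""
    params = {
        "app_id": app_id,
        "app_key": app_key,
        "what": what,
        "where": where,
        "results_per_page": results_per_page,
        "content-type": "application/json",
    }
    # urlencode escapes spaces and other reserved characters in the values.
    return BASE_URL.format(page=page) + "?" + urlencode(params)


def get_json(url, timeout=30):
    """GET the URL and decode the JSON body (performs a real network call)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For example, build_search_url(1, app_id, app_key, "data engineer", "Madrid") yields the first page of that search. Because the credentials travel in the query string, the resulting URL should not be logged or committed.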
Lessons learned
Anti-bot protection is now standard on major job portals
Both InfoJobs and Indeed employ enterprise-grade bot-detection systems (Distil Networks and Cloudflare Turnstile respectively). These cannot be bypassed ethically or reliably in a professional context without explicit authorization from the portal owner.
Adzuna does not publish salary for every listing
The salary_min and salary_max fields are null for a significant portion of offers, a pattern consistent with the 78 % null rate seen in tecnoempleo_spain_2026.csv. Salary analysis downstream must account for this structural absence.
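One way to account for missing salaries downstream is to measure the gap explicitly and to keep nulls out of any summary statistic rather than treating them as zero. The sketch below uses plain Python and made-up sample rows. The function names are hypothetical and this is not the project's cleaning code.

```python
from statistics import median


def salary_null_rate(offers):
    """Share of offers with neither salary_min nor salary_max."""
    if not offers:
        return 0.0
    missing = sum(
        1 for o in offers
        if o.get("salary_min") is None and o.get("salary_max") is None
    )
    return missing / len(offers)


def median_salary_min(offers):
    """Median of salary_min over offers that publish it; None if none do."""
    values = [o["salary_min"] for o in offers if o.get("salary_min") is not None]
    return median(values) if values else None


# Made-up rows for illustration only, not real Adzuna data.
sample = [
    {"salary_min": 30000, "salary_max": 36000},
    {"salary_min": None, "salary_max": None},
    {"salary_min": 42000, "salary_max": 50000},
    {"salary_min": None, "salary_max": None},
]
print(f"null rate: {salary_null_rate(sample):.0%}")
print(f"median salary_min (published only): {median_salary_min(sample)}")
```

Reporting the null rate next to every salary figure makes it clear how small the subset behind that figure is.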
Freshness bias from single-page extraction
Using max_pages=1 to conserve API quota means only the most recently posted offers per search are captured. Older listings that have not been refreshed may be systematically underrepresented.
Geographic coverage is uneven
The eight cities in the bulk search cover Spain’s main urban areas, but smaller provinces and rural areas are entirely absent. The scraped dataset should not be used to draw conclusions about the national job market outside major cities.
The scraping dataset (scraping_jobs_raw.csv) feeds directly into the cleaning pipeline in 02_cleaning.ipynb, where it is normalised to the same column schema as the other sources and integrated into jobs_all_clean.csv.