This guide walks you through setting up the project locally and running the full five-notebook analysis pipeline. You can complete most of the analysis without API credentials: the Adzuna API is only required if you want to re-collect raw data from scratch.
Prerequisites
Before you begin, make sure you have the following installed:
- Python 3.10+
- pip (or a virtual environment manager such as venv or conda)
- Jupyter (included via ipykernel in requirements.txt)
- Git
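
A quick way to confirm these prerequisites is a short script like the sketch below. It only checks that the interpreter is recent enough and that git and pip can be found on PATH. The function name check_prerequisites is illustrative and not part of the project.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the prerequisites


def check_prerequisites(tools=("git", "pip")):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = ".".join(map(str, sys.version_info[:3]))
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {found}"
        )
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing:", problem)
```

Note that pip can be usable through `python -m pip` even when no pip executable is on PATH, so treat a pip warning here as a hint rather than a hard failure.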
Clone the Repository and Install Dependencies
Clone the project from GitHub and install all required Python packages into your environment.

Installing from requirements.txt pulls in the full stack used across all notebooks:
| Category | Key Packages |
|---|---|
| Data manipulation | pandas 3.0.3, numpy 2.4.6 |
| Visualization | matplotlib 3.10.9, seaborn 0.13.2, plotly 6.7.0 |
| Statistics | scipy 1.17.1, statsmodels 0.14.6 |
| API / env | requests 2.34.2, python-dotenv 1.2.2 |
| Notebooks | ipykernel 7.2.0, ipython 9.14.0, nbformat 5.10.4 |
| Reporting | python-docx 1.2.0, kaleido 1.3.0, squarify 0.4.4 |
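
To confirm that your environment matches the versions in the table, a minimal sketch using the standard-library importlib.metadata module could look like this. The EXPECTED dictionary copies only a few pins from the table above and is not the full requirements.txt; the helper name version_mismatches is illustrative.

```python
from importlib import metadata

# A few pins copied from the table above (extend as needed).
EXPECTED = {
    "pandas": "3.0.3",
    "numpy": "2.4.6",
    "matplotlib": "3.10.9",
    "scipy": "1.17.1",
    "python-dotenv": "1.2.2",
}


def version_mismatches(expected=EXPECTED):
    """Map each package to (expected, installed) where the two differ.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in version_mismatches().items():
        print(f"{name}: expected {wanted}, found {installed or 'not installed'}")
```

An empty result means every checked package is installed at the listed version.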
Configure Environment Variables
The data collection notebook uses the Adzuna API, which requires credentials. Create a .env file at the project root by copying the provided example, then open .env and fill in your credentials (ADZUNA_APP_ID and ADZUNA_APP_KEY).

To obtain credentials, register for a free account at developer.adzuna.com. The free tier provides up to 1,000 requests per month, which is sufficient to reproduce the data collection step.

Adzuna credentials are only required for notebook 01 (01_data_collection.ipynb). Notebooks 02 through 05 read from the pre-cleaned CSVs in data/clean/ and will run without any API key configured.

Run the Notebooks in Order
Launch Jupyter and execute the notebooks sequentially. Each notebook depends on the outputs of the previous one. The per-notebook summaries below describe what each notebook does and what it produces.

- Full Pipeline (01 → 05): run all five notebooks if you want to reproduce every step from raw API calls to bias analysis.
- Skip Data Collection (02 → 05): start at notebook 02 if you have no Adzuna credentials; notebooks 02 through 05 run without an API key.

Use Kernel → Restart & Run All in each notebook to ensure a clean execution state before proceeding to the next.
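
If you would rather run the sequence from a script than click through each notebook, one option is to drive the jupyter nbconvert command from Python. The sketch below is an assumption, not part of the repository: it needs nbconvert installed (it is not among the key packages listed earlier), expects the notebooks to be in the directory passed as notebook_dir, and skips notebook 01 when the Adzuna variables are missing from the environment. The function names are illustrative.

```python
import os
import subprocess
from pathlib import Path

# Notebook filenames as listed in this guide, in execution order.
NOTEBOOKS = [
    "01_data_collection.ipynb",
    "02_cleaning.ipynb",
    "03_eda.ipynb",
    "04-visualizations.ipynb",
    "05_bias_analysis.ipynb",
]
CREDENTIAL_VARS = ("ADZUNA_APP_ID", "ADZUNA_APP_KEY")


def has_credentials(env=None):
    """True when both Adzuna variables are set to non-empty values."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in CREDENTIAL_VARS)


def plan_commands(notebook_dir=".", env=None):
    """Build one nbconvert command per notebook, skipping 01 without credentials."""
    notebooks = NOTEBOOKS if has_credentials(env) else NOTEBOOKS[1:]
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace",
         str(Path(notebook_dir) / name)]
        for name in notebooks
    ]


def run_pipeline(notebook_dir=".", env=None):
    """Execute the planned notebooks in order, one process per notebook."""
    for command in plan_commands(notebook_dir, env):
        # check=True stops the sequence at the first failing notebook
        subprocess.run(command, check=True)
```

Calling run_pipeline() with the directory that holds the notebooks executes them in place. The credential check looks at the process environment only, so if your keys live solely in .env, load them first (for example with load_dotenv from python-dotenv).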
01_data_collection.ipynb — Adzuna API Ingestion
Authenticates with the Adzuna API using your .env credentials and fetches job listings for data roles in Spain. Saves raw JSON responses and outputs initial CSV files to data/raw/.

Requires: ADZUNA_APP_ID, ADZUNA_APP_KEY in .env
Produces: Raw CSVs in data/raw/
02_cleaning.ipynb — Data Cleaning & Unification
Merges the three source datasets (Tecnoempleo, Data Science Job Posts, Stack Overflow survey) with the Adzuna data. Standardises column names, normalises job family labels, handles null values, and exports the unified dataset.

Requires: Raw files in data/raw/ (or pre-existing files)
Produces: data/clean/jobs_all_clean.csv, data/clean/jobs_clean.csv, data/clean/tecno_jobs_clean.csv, and supplementary CSVs
03_eda.ipynb — Exploratory Data Analysis
Computes summary statistics, skill frequency distributions, salary breakdowns, and geographic analysis. Identifies Python as the most demanded skill and Madrid as the dominant city.

Requires: data/clean/jobs_all_clean.csv
Produces: EDA outputs in data/eda/
04-visualizations.ipynb — Chart Generation
Produces charts using matplotlib, seaborn, and plotly (with kaleido for static export). Includes treemaps via squarify. All figures are saved to images/.

Requires: data/clean/ CSVs, data/eda/ outputs
Produces: Chart images in images/
05_bias_analysis.ipynb — Bias Examination
Analyses potential sources of bias: the 78% salary null rate in Tecnoempleo, Madrid’s geographic over-representation, and structural choices in dataset construction that could skew results.

Requires: data/clean/jobs_all_clean.csv
Produces: Bias analysis outputs and supplementary figures

Explore the Outputs
Once the notebooks have run, results are available in two locations, data/clean/ and images/. Key files to look at after a full run:
Cleaned Datasets
All processed CSV files are saved to data/clean/. The primary unified dataset is jobs_all_clean.csv, which contains 1,542 offers with 17 columns, including job family, location, salary, and skills.

Generated Charts
Visualizations produced by notebook 04 are saved to images/. These include skill frequency charts, geographic distribution maps, salary box plots, and treemaps by job family.

The data/raw/ directory is listed in .gitignore and is not included in the repository. It will be populated only if you run notebook 01 with valid Adzuna credentials.

What’s Next?
With the pipeline running, explore the individual notebook documentation pages for a deeper explanation of the methods, findings, and code used in each stage.

- Data Collection: Adzuna API setup and ingestion logic
- Cleaning: merge strategy and normalisation decisions
- EDA: key findings on skills, salaries, and geography
- Visualizations: chart gallery and plotting techniques
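
Before moving on, you may want to confirm that the primary dataset has the shape quoted in this guide (1,542 offers, 17 columns). The sketch below uses only the standard-library csv module so it runs without pandas. The helper name csv_shape is illustrative, the path assumes you run it from the project root, and UTF-8 encoding is an assumption about the file.

```python
import csv
from pathlib import Path

EXPECTED_SHAPE = (1542, 17)  # offers and columns quoted in this guide


def csv_shape(path):
    """Return (data_rows, columns) for a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        # Skip completely blank lines so a trailing newline is not counted
        rows = sum(1 for row in reader if row)
    return rows, len(header)


if __name__ == "__main__":
    dataset = Path("data/clean/jobs_all_clean.csv")
    if dataset.exists():
        shape = csv_shape(dataset)
        status = "matches" if shape == EXPECTED_SHAPE else "differs from"
        print(f"{dataset}: {shape} {status} the documented {EXPECTED_SHAPE}")
    else:
        print(f"{dataset} not found; run the notebooks first")
```

A mismatch is not necessarily an error: re-collecting data with notebook 01 can change the number of offers.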