This guide walks you through setting up the project locally and running the full five-notebook analysis pipeline. You can complete most of the analysis without API credentials: Adzuna API access is only required if you want to re-collect the raw data from scratch.

Prerequisites

Before you begin, make sure you have the following installed:
  • Python 3.10+
  • pip (or a virtual environment manager such as venv or conda)
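  • Jupyter (requirements.txt installs ipykernel, which provides only the kernel; if the jupyter notebook command is unavailable after installation, also install the notebook or jupyterlab package)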
  • Git
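As a quick way to confirm the first prerequisite, the following minimal sketch checks the running interpreter against the 3.10 minimum stated above. The helper name python_is_supported is illustrative and not part of the project.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the prerequisites


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if python_is_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is supported")
else:
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required")
```

Run it with the same interpreter you plan to use for the notebooks, since a system Python and a virtual environment Python can differ.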
1. Clone the Repository and Install Dependencies

Clone the project from GitHub and install all required Python packages into your environment.
git clone https://github.com/Gema-Villanueva/proyecto-eda-roles-datos.git
cd proyecto-eda-roles-datos
pip install -r requirements.txt
This installs the full stack used across all notebooks:
Category           Key Packages
Data manipulation  pandas 3.0.3, numpy 2.4.6
Visualization      matplotlib 3.10.9, seaborn 0.13.2, plotly 6.7.0
Statistics         scipy 1.17.1, statsmodels 0.14.6
API / env          requests 2.34.2, python-dotenv 1.2.2
Notebooks          ipykernel 7.2.0, ipython 9.14.0, nbformat 5.10.4
Reporting          python-docx 1.2.0, kaleido 1.3.0, squarify 0.4.4
It’s good practice to install inside a virtual environment to avoid dependency conflicts with other projects:
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
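After installation, you may want to confirm that the active environment matches the versions listed above. The sketch below uses only the standard library's importlib.metadata. The EXPECTED dictionary copies a few pins from this guide, and the check_packages helper is a hypothetical name used for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# A few of the pins listed in this guide; extend as needed.
EXPECTED = {
    "pandas": "3.0.3",
    "numpy": "2.4.6",
    "matplotlib": "3.10.9",
    "requests": "2.34.2",
    "python-dotenv": "1.2.2",
}


def check_packages(expected, get_version=version):
    """Map each package name to 'ok', 'missing', or 'mismatch (found X)'."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if found == wanted else f"mismatch (found {found})"
    return report


for name, status in check_packages(EXPECTED).items():
    print(f"{name}: {status}")
```

A mismatch is not necessarily a problem, but it is worth knowing about before comparing your results with the published ones.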
2. Configure Environment Variables

The data collection notebook uses the Adzuna API, which requires credentials. Create a .env file at the project root by copying the provided example:
cp .env.example .env
Then open .env and fill in your credentials:
ADZUNA_APP_ID=your_app_id
ADZUNA_APP_KEY=your_app_key
To obtain credentials, register for a free account at developer.adzuna.com. The free tier provides up to 1,000 requests per month, which is sufficient to reproduce the data collection step.
Adzuna credentials are only required for notebook 01 (01_data_collection.ipynb). Notebooks 02 through 05 read from the pre-cleaned CSVs in data/clean/ and will run without any API key configured.
Never commit your .env file to version control. It is already listed in .gitignore, but double-check before pushing to a shared repository.
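To verify that the credentials are visible to Python before opening notebook 01, a check along these lines may help. It is a sketch that assumes you run it from the project root, where .env lives. The missing_credentials helper is hypothetical, and the placeholder values are the ones shown in the example above. The python-dotenv import is guarded so the check still runs if the package is not installed.

```python
import os

try:
    # python-dotenv is listed in requirements.txt; load_dotenv(".env") loads
    # the file from the current directory into os.environ if it exists.
    from dotenv import load_dotenv
    load_dotenv(".env")
except ImportError:
    pass  # fall back to variables already set in the environment

REQUIRED_VARS = ("ADZUNA_APP_ID", "ADZUNA_APP_KEY")
PLACEHOLDERS = {"your_app_id", "your_app_key"}  # example values, not real keys


def missing_credentials(env=os.environ):
    """Return required variables that are unset, empty, or still placeholders."""
    missing = []
    for name in REQUIRED_VARS:
        value = (env.get(name) or "").strip()
        if not value or value in PLACEHOLDERS:
            missing.append(name)
    return missing


missing = missing_credentials()
if missing:
    print("Missing Adzuna credentials:", ", ".join(missing))
else:
    print("Adzuna credentials found")
```

The check deliberately prints only variable names, never their values, so its output is safe to paste into an issue or chat.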
3. Run the Notebooks in Order

Launch Jupyter and execute the notebooks sequentially. Each notebook depends on the outputs of the previous one.
jupyter notebook
Then open and run the notebooks in the order below. Run all five if you want to reproduce every step, from raw API calls to bias analysis:
notebooks/01_data_collection.ipynb   ← requires Adzuna API key
notebooks/02_cleaning.ipynb
notebooks/03_eda.ipynb
notebooks/04-visualizations.ipynb
notebooks/05_bias_analysis.ipynb
Use Kernel → Restart & Run All in each notebook to ensure a clean execution state before proceeding to the next.
Notebooks 02–05 work with the existing datasets in data/clean/ and do not require an Adzuna API key. If you just want to explore the analysis and visualizations, you can skip notebook 01 entirely and jump straight to cleaning or EDA.
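The rule above (skip notebook 01 when no API key is configured) can be expressed as a small helper. This is an illustrative sketch only: the notebook paths are the ones listed in this guide, and notebooks_to_run is a hypothetical function, not part of the repository.

```python
NOTEBOOKS = [
    "notebooks/01_data_collection.ipynb",  # needs Adzuna credentials
    "notebooks/02_cleaning.ipynb",
    "notebooks/03_eda.ipynb",
    "notebooks/04-visualizations.ipynb",
    "notebooks/05_bias_analysis.ipynb",
]


def notebooks_to_run(has_adzuna_credentials):
    """Return the notebooks to execute, in order, skipping 01 without credentials."""
    if has_adzuna_credentials:
        return list(NOTEBOOKS)
    return NOTEBOOKS[1:]


for path in notebooks_to_run(has_adzuna_credentials=False):
    print(path)
```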
Here is a summary of what each notebook does and what it produces:
01_data_collection.ipynb
Authenticates with the Adzuna API using your .env credentials and fetches job listings for data roles in Spain. Saves raw JSON responses and outputs initial CSV files to data/raw/.
Requires: ADZUNA_APP_ID, ADZUNA_APP_KEY in .env
Produces: Raw CSVs in data/raw/

02_cleaning.ipynb
Merges the three source datasets (Tecnoempleo, Data Science Job Posts, Stack Overflow survey) with the Adzuna data. Standardises column names, normalises job family labels, handles null values, and exports the unified dataset.
Requires: Raw files in data/raw/ (or pre-existing files)
Produces: data/clean/jobs_all_clean.csv, data/clean/jobs_clean.csv, data/clean/tecno_jobs_clean.csv, and supplementary CSVs

03_eda.ipynb
Computes summary statistics, skill frequency distributions, salary breakdowns, and geographic analysis. Identifies Python as the most demanded skill and Madrid as the dominant city.
Requires: data/clean/jobs_all_clean.csv
Produces: EDA outputs in data/eda/

04-visualizations.ipynb
Produces charts using matplotlib, seaborn, and plotly (with kaleido for static export). Includes treemaps via squarify. All figures are saved to images/.
Requires: data/clean/ CSVs, data/eda/ outputs
Produces: Chart images in images/

05_bias_analysis.ipynb
Analyses potential sources of bias: the 78% salary null rate in Tecnoempleo, Madrid’s geographic over-representation, and structural choices in dataset construction that could skew results.
Requires: data/clean/jobs_all_clean.csv
Produces: Bias analysis outputs and supplementary figures
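To give a sense of what the data collection step talks to, here is a hedged sketch of how a search URL for Adzuna's public job-search endpoint might be assembled. The endpoint shape and the app_id, app_key, what, and results_per_page parameters follow Adzuna's published API documentation as best understood here. The search term, the es country code, and the build_search_url helper are assumptions for illustration and are not taken from the notebook itself. No request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.adzuna.com/v1/api/jobs"


def build_search_url(app_id, app_key, what, country="es", page=1, results_per_page=50):
    """Build an Adzuna job-search URL without sending any request."""
    query = urlencode({
        "app_id": app_id,
        "app_key": app_key,
        "what": what,  # free-text search term (illustrative value below)
        "results_per_page": results_per_page,
    })
    return f"{BASE_URL}/{country}/search/{page}?{query}"


# Placeholder credentials; real values come from your .env file.
print(build_search_url("your_app_id", "your_app_key", "data analyst"))
```

Because each page of results is one request, the page count multiplied by the number of search terms gives a rough estimate of how much of the monthly quota a full collection run would use.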
4. Explore the Outputs

Once the notebooks have run, results are available in two locations:

Cleaned Datasets

All processed CSV files are saved to data/clean/. The primary unified dataset is jobs_all_clean.csv, which contains 1,542 offers with 17 columns, including job family, location, salary, and skills.
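A quick sanity check on any of these CSVs is to count rows and columns and see how many cells are empty per column. The sketch below uses only the standard library csv module so it makes no assumptions about the project's own loading code. The summarize_csv helper, the sample column names, and the sample values are all made up for illustration.

```python
import csv
import io


def summarize_csv(handle):
    """Return (row_count, column_names, share of empty cells per column)."""
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    empty = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            # Treat missing and whitespace-only cells as empty.
            if not (row.get(name) or "").strip():
                empty[name] += 1
    rates = {name: (empty[name] / rows if rows else 0.0) for name in columns}
    return rows, columns, rates


# Tiny in-memory sample with hypothetical columns and made-up values.
sample = io.StringIO("title,city,salary\nData Analyst,Madrid,\nData Engineer,,40000\n")
rows, columns, rates = summarize_csv(sample)
print(rows, columns, rates)
```

To run it against the real file, open data/clean/jobs_all_clean.csv with open(path, newline="", encoding="utf-8") and pass the handle to summarize_csv. The UTF-8 encoding is an assumption. According to this guide, the result should show 1,542 rows and 17 columns.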

Generated Charts

Visualizations produced by notebook 04 are saved to images/. These include skill frequency charts, geographic distribution maps, salary box plots, and treemaps by job family.
Key files to look at after a full run:
data/clean/
├── jobs_all_clean.csv              ← unified dataset (1,542 offers)
├── jobs_clean.csv                  ← cleaned Adzuna + international offers
├── tecno_jobs_clean.csv            ← cleaned Tecnoempleo offers
├── technology_rankings.csv         ← skill frequency rankings
└── technologies_clean_long_format.csv

images/                             ← all generated chart files
data/eda/                           ← EDA intermediate outputs
The data/raw/ directory is listed in .gitignore and is not included in the repository. It will be populated only if you run notebook 01 with valid Adzuna credentials.
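To confirm that a run produced the files listed above, a short existence check can be useful. This sketch assumes it is executed from the project root. The file list is copied from the tree in this guide, and missing_outputs is a hypothetical helper name.

```python
from pathlib import Path

EXPECTED_FILES = [
    "data/clean/jobs_all_clean.csv",
    "data/clean/jobs_clean.csv",
    "data/clean/tecno_jobs_clean.csv",
    "data/clean/technology_rankings.csv",
    "data/clean/technologies_clean_long_format.csv",
]


def missing_outputs(project_root=".", expected=EXPECTED_FILES):
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).is_file()]


missing_files = missing_outputs()
if missing_files:
    for rel in missing_files:
        print("missing:", rel)
else:
    print("all expected files are present")
```

The list does not include data/raw/, since that directory is only populated when notebook 01 is run.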

What’s Next?

With the pipeline running, explore the individual notebook documentation pages for a deeper explanation of the methods, findings, and code used in each stage.

  • Data Collection: Adzuna API setup and ingestion logic
  • Cleaning: Merge strategy and normalisation decisions
  • EDA: Key findings on skills, salaries, and geography
  • Visualizations: Chart gallery and plotting techniques
