Installation and Setup Guide for EDA Roles de Datos

Getting the project running locally takes only a few minutes. You will need Python 3.10 or later, a terminal, and either Jupyter Lab or VS Code with the Jupyter extension installed. Plan for roughly 500 MB of free disk space to accommodate the cleaned CSV outputs and generated chart images — the raw data directory starts empty and is populated by notebook 01.

Prerequisites

Before you begin, confirm your environment meets the following requirements:

Python 3.10+

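Any 3.10, 3.11, or 3.12 release is supported. Python 3.11 is recommended for broadest library compatibility: if pip reports "No matching distribution found" for a pinned package on 3.10, switch to 3.11, since recent releases of some scientific libraries no longer support 3.10.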

Jupyter Environment

Jupyter Lab, classic Jupyter Notebook, or VS Code with the official Jupyter extension.

~500 MB Disk Space

Required for cleaned CSVs in data/clean/, EDA outputs in data/eda/, and chart PNGs in images/.
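
The interpreter version and free disk space listed above can be confirmed with a short standard-library script. This is a minimal sketch and not part of the repository: the function name check_prerequisites is hypothetical, and the thresholds are simply the figures quoted in this guide.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)          # minimum version stated in this guide
MIN_FREE_BYTES = 500 * 10**6  # roughly 500 MB, as stated in this guide


def check_prerequisites(path=".", version=None):
    """Return a list of human-readable problems; an empty list means OK."""
    version = tuple(version or sys.version_info[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required"
        )
    # disk_usage reports the free space on the volume that contains `path`
    free = shutil.disk_usage(path).free
    if free < MIN_FREE_BYTES:
        problems.append(f"only {free // 10**6} MB free, about 500 MB needed")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("OK" if not issues else "\n".join(issues))
```

Run it from the folder where you intend to clone the project so that the disk check looks at the right volume.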

An Adzuna API account is only required if you want to run notebook 01 (live data collection). Notebooks 02 through 05 work entirely from CSV files already included in the repository, so they need no API credentials.

Installation

Clone the repository

Download the project source to your machine. Replace the URL with your fork if you have one.

git clone https://github.com/Gema-Villanueva/proyecto-eda-roles-datos.git
cd proyecto-eda-roles-datos

Create a virtual environment

Isolate the project dependencies from your system Python. Choose the tool you prefer:

venv (standard library)

# Create the environment
python -m venv .venv

# Activate — Linux / macOS
source .venv/bin/activate

# Activate — Windows (Command Prompt)
.venv\Scripts\activate

# Activate — Windows (PowerShell)
.venv\Scripts\Activate.ps1

VS Code detects .venv automatically. Open the Command Palette → Python: Select Interpreter and choose the .venv entry if it isn’t already selected.

conda

# Create a named environment with Python 3.11
conda create -n eda-roles python=3.11

# Activate the environment
conda activate eda-roles

If you use conda but prefer pip for package management (recommended here because the project pins exact versions in requirements.txt), run pip install -r requirements.txt inside the activated conda environment.

Install dependencies

With your environment activated, install all pinned packages from requirements.txt:

pip install -r requirements.txt

This installs the full stack — data handling, visualization, statistics, notebook tooling, and document generation. See the Environment reference for a package-by-package breakdown.

Expected output (summary)

You should see pip resolving and downloading packages across these groups:

Group	Key packages
Data handling	`numpy 2.4.6`, `pandas 3.0.3`
Data collection	`requests 2.34.2`, `python-dotenv 1.2.2`
Notebooks	`ipykernel 7.2.0`, `ipython 9.14.0`, `nbformat 5.10.4`
Visualization	`matplotlib 3.10.9`, `seaborn 0.13.2`, `plotly 6.7.0`, `kaleido 1.3.0`, `squarify 0.4.4`
Statistics	`scipy 1.17.1`, `statsmodels 0.14.6`
Document generation	`python-docx 1.2.0`
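
Because the project pins exact versions, it can be useful to compare what pip actually installed against requirements.txt. The sketch below uses only importlib.metadata from the standard library. The helper names parse_pins and find_mismatches are hypothetical, and it assumes the file uses plain name==version lines (extras and other specifiers are ignored).

```python
from importlib import metadata
from pathlib import Path


def parse_pins(lines):
    """Extract {package: version} from 'name==version' requirement lines."""
    pins = {}
    for raw in lines:
        # Drop trailing comments and environment markers, then whitespace
        line = raw.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    req = Path("requirements.txt")
    if req.exists():
        pins = parse_pins(req.read_text(encoding="utf-8").splitlines())
        for pkg, (want, have) in find_mismatches(pins).items():
            print(f"{pkg}: pinned {want}, installed {have or 'missing'}")
```

No output means every pinned package is installed at the pinned version in the active environment.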

Configure environment variables

Copy the provided example file and fill in your Adzuna credentials. This step is only required for notebook 01.

cp .env.example .env

Open .env in your editor and replace the placeholder values:

# .env
ADZUNA_APP_ID=your_real_app_id_here
ADZUNA_APP_KEY=your_real_app_key_here

Never commit .env to version control. It is already listed in .gitignore, but double-check before pushing to a public fork.

If you are only running notebooks 02–05, you can leave .env with the example placeholders — those notebooks never read API credentials. See the Environment reference for full details on every variable.
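
To check quickly whether .env holds usable credentials, something like the following may help. It is a standard-library sketch: the project lists python-dotenv, which would normally do the parsing, and the helper names read_env_file and credentials_ready are hypothetical. The placeholder test assumes that unfilled values start with "your_", as in the example above; adjust it if .env.example uses different text.

```python
from pathlib import Path

REQUIRED_KEYS = ("ADZUNA_APP_ID", "ADZUNA_APP_KEY")


def read_env_file(path=".env"):
    """Parse simple KEY=value lines; ignores blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip().strip("'\"")
    return values


def credentials_ready(values):
    """True when both Adzuna keys are set and are not obvious placeholders."""
    return all(
        bool(values.get(key)) and not values[key].lower().startswith("your_")
        for key in REQUIRED_KEYS
    )


if __name__ == "__main__":
    ready = credentials_ready(read_env_file())
    print("Notebook 01 can run" if ready else "No credentials: skip notebook 01")
```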

Verify the installation

Launch Jupyter to confirm everything is wired up correctly:

# Option A — Jupyter Lab (recommended)
jupyter lab

# Option B — Classic notebook interface
jupyter notebook

Your browser should open automatically. If it does not, copy the localhost URL printed in the terminal output.

In VS Code, open any .ipynb file directly. VS Code will prompt you to select a kernel; choose the interpreter from the virtual environment you created earlier (.venv or eda-roles).

Run notebooks in order

Open the notebooks/ folder and execute the notebooks sequentially. Later notebooks read files written by earlier ones, so the order matters; only notebook 01 is optional, as explained in the note after this list.

01_data_collection.ipynb

Fetches job listings from the Adzuna API. Requires credentials. Skip if you are working from existing CSVs.

02_cleaning.ipynb

Normalises, deduplicates, and unifies the three source datasets into cleaned CSVs in data/clean/.

03_eda.ipynb

Performs structure analysis, null analysis, distribution analysis, and ranking generation.

04-visualizations.ipynb

Produces 8 visualization blocks using matplotlib, seaborn, and Plotly, exported as PNGs to images/.

05_bias_analysis.ipynb

Identifies and quantifies representation, location, seniority, and salary-data biases.

Skipping notebook 01: If you do not have Adzuna credentials, notebooks 02–05 still run in full using the static datasets (data_science_job_posts_2025.csv, tecnoempleo_spain_2026.csv, stackoverflow_2025_results.csv) that are included in the repository. The data/raw/ directory ships with a .gitkeep placeholder — raw outputs from notebook 01 are gitignored and must be generated or obtained separately.
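
The execution order described above follows the two-digit prefix of each filename. As a rough illustration, the helper below (a hypothetical name, not part of the project) lists the notebooks in that order and can leave out notebook 01 when no credentials are available. It only sorts paths; it does not execute anything.

```python
from pathlib import Path


def notebooks_in_order(folder="notebooks", skip_collection=False):
    """Return notebook paths sorted by their two-digit numeric prefix."""
    paths = sorted(
        (p for p in Path(folder).glob("*.ipynb") if p.name[:2].isdigit()),
        key=lambda p: int(p.name[:2]),
    )
    if skip_collection:
        # Notebook 01 needs Adzuna credentials; 02 to 05 run without it
        paths = [p for p in paths if not p.name.startswith("01")]
    return paths


if __name__ == "__main__":
    for path in notebooks_in_order(skip_collection=True):
        print(path.name)
```

Sorting on the numeric prefix rather than the full name keeps 04-visualizations.ipynb in place even though it uses a hyphen where the other files use an underscore.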

Project directory structure

After setup your workspace should look like this:

proyecto-eda-roles-datos/
├── .venv/                     # Virtual environment (gitignored)
├── .env                       # Your real credentials (gitignored)
├── .env.example               # Committed template
├── notebooks/
│   ├── 01_data_collection.ipynb
│   ├── 02_cleaning.ipynb
│   ├── 03_eda.ipynb
│   ├── 04-visualizations.ipynb
│   └── 05_bias_analysis.ipynb
├── data/
│   ├── raw/                   # Populated by notebook 01 (gitignored)
│   ├── clean/                 # Populated by notebook 02
│   └── eda/                   # Populated by notebook 03
├── images/                    # Populated by notebook 04
├── scripts/
│   └── build_notebook_docx_guide.py
├── docs/
│   └── guia_presentacion_notebooks_cleaning_eda.docx
└── requirements.txt
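
A quick way to compare your checkout with the tree above is to test for the expected folders and files. The sketch below is illustrative only: the function name missing_entries is hypothetical, and it assumes the output folders exist in a fresh clone, which may not hold if the repository does not track empty directories.

```python
from pathlib import Path

# Paths taken from the directory tree in this guide
EXPECTED_DIRS = ("notebooks", "data/raw", "data/clean", "data/eda", "images")
EXPECTED_FILES = ("requirements.txt", ".env.example")


def missing_entries(root="."):
    """Return the expected directories and files that are absent under root."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    gaps = missing_entries()
    print("Layout looks complete" if not gaps else f"Missing: {', '.join(gaps)}")
```

If an output folder such as images/ is reported missing, creating it by hand before running the corresponding notebook is a reasonable first step.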

Troubleshooting

pip install fails with dependency conflicts

Make sure you are running pip install inside your activated virtual environment (your prompt should show .venv or eda-roles). If conflicts persist, try upgrading pip first:

pip install --upgrade pip
pip install -r requirements.txt
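
If the prompt is ambiguous, Python itself can report whether an isolated environment is active. This sketch relies on two standard signals: sys.prefix differs from sys.base_prefix inside a venv, and conda sets the CONDA_DEFAULT_ENV variable on activation. The function name describe_environment is hypothetical.

```python
import os
import sys


def describe_environment(prefix, base_prefix, conda_env):
    """Return a label for the active venv or conda env, or None if neither."""
    if prefix != base_prefix:
        return f"venv at {prefix}"
    if conda_env and conda_env != "base":
        return f"conda env {conda_env}"
    return None


if __name__ == "__main__":
    label = describe_environment(
        sys.prefix, sys.base_prefix, os.environ.get("CONDA_DEFAULT_ENV")
    )
    print(label or "No virtual environment detected: activate one before pip install")
```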

Jupyter does not see the virtual environment kernel

From inside the activated environment, register it as a Jupyter kernel:

python -m ipykernel install --user --name=eda-roles --display-name "EDA Roles (Python 3.11)"

Then restart Jupyter and select EDA Roles (Python 3.11) from the kernel picker.

Plotly charts do not display in Jupyter Lab

Plotly requires the jupyterlab renderer for interactive output. If charts appear blank, add this cell at the top of the notebook:

import plotly.io as pio
pio.renderers.default = "jupyterlab"

kaleido fails to export static PNGs

kaleido is the Plotly static image engine. If PNG export fails, confirm kaleido installed correctly:

pip show kaleido

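Kaleido 1.x does not bundle a browser and relies on a local Chrome or Chromium installation, so make sure one is available. On some Linux systems you may also need to install libgobject or run inside a headed display (use xvfb-run on headless servers).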
