Getting the project running locally takes only a few minutes. You will need Python 3.10 or later, a terminal, and a Jupyter environment (Jupyter Lab, classic Jupyter Notebook, or VS Code with the Jupyter extension). Plan for roughly 500 MB of free disk space to accommodate the cleaned CSV outputs and generated chart images. The raw data directory starts empty and is populated by notebook 01.

Prerequisites

Before you begin, confirm your environment meets the following requirements:

Python 3.10+

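Any 3.10, 3.11, or 3.12 release is expected to work. Python 3.11 is recommended for the broadest compatibility with the pinned libraries; if pip cannot resolve a pinned package on 3.10, switch to 3.11 or later.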

Jupyter Environment

Jupyter Lab, classic Jupyter Notebook, or VS Code with the official Jupyter extension.

~500 MB Disk Space

Required for cleaned CSVs in data/clean/, EDA outputs in data/eda/, and chart PNGs in images/.

An Adzuna API account is only required if you want to run notebook 01 (live data collection). Notebooks 02 through 05 work entirely from CSV files already present in data/clean/ and data/raw/, so no API credentials are needed.
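
The prerequisites above can be checked from Python before going further. The following is a minimal sketch using only the standard library. The helper names python_ok and disk_ok are illustrative and not part of the project, and the 500 MB threshold simply mirrors the estimate given above.

```python
import shutil
import sys
from pathlib import Path

MIN_PYTHON = (3, 10)             # minimum version stated in the prerequisites
MIN_FREE_BYTES = 500 * 1024**2   # roughly 500 MB, mirroring the estimate above


def python_ok(version=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version[:2]) >= minimum


def disk_ok(path=".", minimum=MIN_FREE_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(Path(path)).free >= minimum


# Report on the interpreter and the current working directory.
print(f"Python {sys.version_info.major}.{sys.version_info.minor}:",
      "OK" if python_ok() else "too old")
print("Free disk space:", "OK" if disk_ok() else "below roughly 500 MB")
```

Run it from the directory where you plan to clone the project so that the disk check looks at the right filesystem.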

Installation

1. Clone the repository

Download the project source to your machine. Replace the URL with your fork if you have one.
git clone https://github.com/Gema-Villanueva/proyecto-eda-roles-datos.git
cd proyecto-eda-roles-datos

2. Create a virtual environment

Isolate the project dependencies from your system Python. The commands below use Python's built-in venv module:
# Create the environment
python -m venv .venv

# Activate — Linux / macOS
source .venv/bin/activate

# Activate — Windows (Command Prompt)
.venv\Scripts\activate

# Activate — Windows (PowerShell)
.venv\Scripts\Activate.ps1
VS Code detects .venv automatically. Open the Command Palette → Python: Select Interpreter and choose the .venv entry if it isn’t already selected.
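
To confirm that the interpreter you are running actually belongs to the virtual environment, you can compare sys.prefix with sys.base_prefix. This is a small sketch, and the function name in_virtualenv is an illustrative choice. It applies to environments created with venv or virtualenv; other environment managers may not be detected this way.

```python
import sys


def in_virtualenv(prefix=None, base_prefix=None):
    """Return True when the interpreter prefix differs from the base prefix.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the Python installation it was created from.
    """
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    return prefix != base_prefix


print("interpreter:", sys.executable)
print("virtual environment active:", in_virtualenv())
```

If the second line prints False after activation, the shell is most likely still resolving python to the system interpreter.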

3. Install dependencies

With your environment activated, install all pinned packages from requirements.txt:
pip install -r requirements.txt
This installs the full stack — data handling, visualization, statistics, notebook tooling, and document generation. See the Environment reference for a package-by-package breakdown.
You should see pip resolving and downloading packages across these groups:
| Group | Key packages |
| --- | --- |
| Data handling | numpy 2.4.6, pandas 3.0.3 |
| Data collection | requests 2.34.2, python-dotenv 1.2.2 |
| Notebooks | ipykernel 7.2.0, ipython 9.14.0, nbformat 5.10.4 |
| Visualization | matplotlib 3.10.9, seaborn 0.13.2, plotly 6.7.0, kaleido 1.3.0, squarify 0.4.4 |
| Statistics | scipy 1.17.1, statsmodels 0.14.6 |
| Document generation | python-docx 1.2.0 |
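
Once pip finishes, one way to confirm what was actually installed is to query package metadata from the activated environment. The sketch below uses importlib.metadata from the standard library. The KEY_PACKAGES list is copied from the groups listed above, and installed_versions is an illustrative helper, not part of the project.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the package groups listed above.
KEY_PACKAGES = [
    "numpy", "pandas", "requests", "python-dotenv",
    "ipykernel", "ipython", "nbformat",
    "matplotlib", "seaborn", "plotly", "kaleido", "squarify",
    "scipy", "statsmodels", "python-docx",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(KEY_PACKAGES).items():
    print(f"{name:15} {ver or 'NOT INSTALLED'}")
```

Any line that prints NOT INSTALLED suggests the install ran in a different environment or stopped early.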

4. Configure environment variables

Copy the provided example file and fill in your Adzuna credentials. This step is only required for notebook 01.
cp .env.example .env
Open .env in your editor and replace the placeholder values:
# .env
ADZUNA_APP_ID=your_real_app_id_here
ADZUNA_APP_KEY=your_real_app_key_here

Never commit .env to version control. It is already listed in .gitignore, but double-check before pushing to a public fork.

If you are only running notebooks 02–05, you can leave .env with the example placeholders, because those notebooks never read API credentials. See the Environment reference for full details on every variable.
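
Before running notebook 01, it can help to check that .env no longer contains the placeholder values. How the notebook itself loads the file is not shown on this page (python-dotenv is among the dependencies), so the sketch below deliberately uses only the standard library. It assumes placeholder values start with your_, as in the example above; parse_env_text and missing_credentials are illustrative names.

```python
from pathlib import Path

REQUIRED_KEYS = ("ADZUNA_APP_ID", "ADZUNA_APP_KEY")  # variable names from .env


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_credentials(values, required=REQUIRED_KEYS):
    """Return required keys that are absent, empty, or still placeholders."""
    # Assumption: placeholder values start with "your_", as in the example.
    return [k for k in required
            if not values.get(k) or values[k].startswith("your_")]


env_file = Path(".env")
if env_file.is_file():
    problems = missing_credentials(
        parse_env_text(env_file.read_text(encoding="utf-8")))
    print("credentials look set" if not problems
          else f"missing or placeholder: {problems}")
else:
    print(".env not found (only needed for notebook 01)")
```

The check prints variable names only, never their values, so the output is safe to paste into a bug report.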

5. Verify the installation

Launch Jupyter to confirm everything is wired up correctly:
# Option A — Jupyter Lab (recommended)
jupyter lab

# Option B — Classic notebook interface
jupyter notebook
Your browser should open automatically. If it does not, copy the localhost URL printed in the terminal output.
In VS Code, open any .ipynb file directly. VS Code will prompt you to select a kernel — choose the interpreter from the virtual environment you created in step 2.

6. Run notebooks in order

Open the notebooks/ folder and execute the notebooks sequentially. Each notebook depends on the outputs of the one before it, with one exception: notebook 01 can be skipped if you are working from the existing CSVs.

01_data_collection.ipynb

Fetches job listings from the Adzuna API. Requires credentials. Skip if you are working from existing CSVs.

02_cleaning.ipynb

Normalises, deduplicates, and unifies the three source datasets into cleaned CSVs in data/clean/.

03_eda.ipynb

Performs structure analysis, null analysis, distribution analysis, and ranking generation.

04-visualizations.ipynb

Produces 8 visualization blocks using matplotlib, seaborn, and Plotly, exported as PNGs to images/.

05_bias_analysis.ipynb

Identifies and quantifies representation, location, seniority, and salary-data biases.

Skipping notebook 01: If you do not have Adzuna credentials, notebooks 02–05 still run in full using the static datasets (data_science_job_posts_2025.csv, tecnoempleo_spain_2026.csv, stackoverflow_2025_results.csv) that are included in the repository. The data/raw/ directory ships with a .gitkeep placeholder. Raw outputs from notebook 01 are gitignored and must be generated or obtained separately.
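
The run order and the skip rule for notebook 01 can be written down as a small planning helper. This is only a sketch: run_plan and missing_notebooks are hypothetical names, and the file names are the ones listed above. It does not execute any notebook; it only reports which ones to open and in which order.

```python
from pathlib import Path

# Notebook file names in execution order, as listed above.
NOTEBOOK_ORDER = [
    "01_data_collection.ipynb",
    "02_cleaning.ipynb",
    "03_eda.ipynb",
    "04-visualizations.ipynb",
    "05_bias_analysis.ipynb",
]


def run_plan(have_credentials, order=NOTEBOOK_ORDER):
    """Return the notebooks to run, in order; drop 01 without credentials."""
    if have_credentials:
        return list(order)
    return [name for name in order if not name.startswith("01_")]


def missing_notebooks(existing_names, plan):
    """Return planned notebooks that are not among the existing file names."""
    existing = set(existing_names)
    return [name for name in plan if name not in existing]


notebooks_dir = Path("notebooks")
existing = ([p.name for p in notebooks_dir.glob("*.ipynb")]
            if notebooks_dir.is_dir() else [])
plan = run_plan(have_credentials=False)
print("run in this order:", plan)
print("not found in notebooks/:", missing_notebooks(existing, plan))
```

Set have_credentials=True once your Adzuna keys are configured to include notebook 01 in the plan.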

Project directory structure

After setup your workspace should look like this:
proyecto-eda-roles-datos/
├── .venv/                     # Virtual environment (gitignored)
├── .env                       # Your real credentials (gitignored)
├── .env.example               # Committed template
├── notebooks/
│   ├── 01_data_collection.ipynb
│   ├── 02_cleaning.ipynb
│   ├── 03_eda.ipynb
│   ├── 04-visualizations.ipynb
│   └── 05_bias_analysis.ipynb
├── data/
│   ├── raw/                   # Populated by notebook 01 (gitignored)
│   ├── clean/                 # Populated by notebook 02
│   └── eda/                   # Populated by notebook 03
├── images/                    # Populated by notebook 04
├── scripts/
│   └── build_notebook_docx_guide.py
├── docs/
│   └── guia_presentacion_notebooks_cleaning_eda.docx
└── requirements.txt
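
A quick way to compare your workspace against the tree above is to test for the expected paths with pathlib. The sketch below is illustrative: the path lists are copied from the tree, .venv/ and .env are left out because they are optional or local, and missing_paths is a hypothetical helper. Some output directories may only appear after the corresponding notebook has run.

```python
from pathlib import Path

# Directories and files taken from the tree above (relative to the repo root).
EXPECTED_DIRS = ["notebooks", "data/raw", "data/clean", "data/eda",
                 "images", "scripts", "docs"]
EXPECTED_FILES = ["requirements.txt", ".env.example"]


def missing_paths(root, dirs=EXPECTED_DIRS, files=EXPECTED_FILES):
    """Return expected directories (with trailing /) and files absent under root."""
    root = Path(root)
    missing = [d + "/" for d in dirs if not (root / d).is_dir()]
    missing += [f for f in files if not (root / f).is_file()]
    return missing


# Run from the repository root.
print("missing:", missing_paths(".") or "nothing")
```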

Troubleshooting

If pip reports dependency conflicts, make sure you are running pip install inside your activated virtual environment (your prompt should show .venv or eda-roles). If conflicts persist, try upgrading pip first:
pip install --upgrade pip
pip install -r requirements.txt

If the project environment does not appear in the Jupyter kernel picker, register it as a kernel manually:
python -m ipykernel install --user --name=eda-roles --display-name "EDA Roles (Python 3.11)"
Then restart Jupyter and select EDA Roles (Python 3.11) from the kernel picker.

Plotly needs a renderer that matches your notebook frontend to display interactive output. If charts appear blank in Jupyter Lab, set the renderer explicitly by adding this cell at the top of the notebook:
import plotly.io as pio
pio.renderers.default = "jupyterlab"

kaleido is the engine Plotly uses for static image export. If PNG export fails, confirm that kaleido installed correctly:
pip show kaleido
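On some Linux systems you may also need to install libgobject or run inside a virtual display (for example with xvfb-run on headless servers). Kaleido 1.x also relies on a Chrome or Chromium installation rather than bundling its own browser, so confirm one is available if export still fails.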
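
To separate a kaleido problem from a problem in a specific notebook, you can try exporting a trivial figure on its own. The sketch below is an assumption-laden smoke test, not the project's export code: check_png_export is a hypothetical helper, the figure is a throwaway bar chart, and the output goes to a temporary directory that is deleted afterwards.

```python
import tempfile
from pathlib import Path


def check_png_export():
    """Try to export a tiny figure to PNG and report what happened."""
    try:
        import plotly.graph_objects as go
    except ImportError:
        return "plotly-missing"

    fig = go.Figure(go.Bar(x=["a", "b"], y=[1, 2]))
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "check.png"
        try:
            # write_image delegates static export to kaleido.
            fig.write_image(str(target))
        except Exception as exc:  # kaleido or browser problems surface here
            return f"export-failed: {exc}"
        return "ok" if target.stat().st_size > 0 else "export-failed: empty file"


status = check_png_export()
print(status)
```

A result of ok means static export works in this environment, so any remaining failure is more likely specific to a notebook cell or output path.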
