Getting the project running locally takes only a few minutes. You will need Python 3.10 or later, a terminal, and either Jupyter Lab or VS Code with the Jupyter extension installed. Plan for roughly 500 MB of free disk space to accommodate the cleaned CSV outputs and generated chart images. The raw data directory starts empty and is populated by notebook 01.
Prerequisites
Before you begin, confirm your environment meets the following requirements:
Python 3.10+
Any 3.10, 3.11, or 3.12 release works. Python 3.11 is recommended for broadest library compatibility.
Jupyter Environment
Jupyter Lab, classic Jupyter Notebook, or VS Code with the official Jupyter extension.
~500 MB Disk Space
Required for cleaned CSVs in data/clean/, EDA outputs in data/eda/, and chart PNGs in images/.
An Adzuna API account is only required if you want to run notebook 01 (live data collection). Notebooks 02 through 05 work entirely from CSV files already present in data/clean/ and data/raw/; no API credentials are needed.
Installation
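Before cloning, the prerequisites listed above can be checked from any Python interpreter. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` and the choice to measure free space at the current directory are illustrative assumptions, not part of the project.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)   # minimum version stated in the prerequisites
MIN_FREE_MB = 500      # approximate disk space the guide asks for


def check_prerequisites(path=".", version=None):
    """Return a list of human-readable problems; an empty list means OK."""
    version = tuple(version or sys.version_info[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required"
        )
    # disk_usage reports bytes; convert the free figure to whole megabytes.
    free_mb = shutil.disk_usage(path).free // (1024 * 1024)
    if free_mb < MIN_FREE_MB:
        problems.append(
            f"only {free_mb} MB free at {path!r}, ~{MIN_FREE_MB} MB needed"
        )
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    print("OK" if not issues else "\n".join(issues))
```

Run it from the directory where you intend to clone the project so the disk figure reflects the right volume.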
Clone the repository
Download the project source to your machine. Replace the URL with your fork if you have one.
Create a virtual environment
Isolate the project dependencies from your system Python. Choose the tool you prefer:
- venv (standard library)
- conda
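Whichever tool you choose, it helps to confirm the environment is actually active before installing anything. The sketch below is one way to do that from Python; `active_environment` is a hypothetical helper name, and the conda branch relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation.

```python
import os
import sys


def active_environment(prefix=None, base_prefix=None, environ=None):
    """Describe the isolated environment this interpreter runs in, or None."""
    prefix = sys.prefix if prefix is None else prefix
    base_prefix = sys.base_prefix if base_prefix is None else base_prefix
    environ = os.environ if environ is None else environ

    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base installation.
    if prefix != base_prefix:
        return f"venv at {prefix}"
    # conda exports the active environment's name on activation
    # (this is also set for conda's own "base" environment).
    conda_env = environ.get("CONDA_DEFAULT_ENV")
    if conda_env:
        return f"conda env {conda_env}"
    return None


if __name__ == "__main__":
    print(active_environment() or "no virtual environment detected")
```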
Install dependencies
With your environment activated, install all pinned packages from requirements.txt by running pip install -r requirements.txt. This installs the full stack: data handling, visualization, statistics, notebook tooling, and document generation. See the Environment reference for a package-by-package breakdown.
Expected output (summary)
You should see pip resolving and downloading packages across these groups:
| Group | Key packages |
|---|---|
| Data handling | numpy 2.4.6, pandas 3.0.3 |
| Data collection | requests 2.34.2, python-dotenv 1.2.2 |
| Notebooks | ipykernel 7.2.0, ipython 9.14.0, nbformat 5.10.4 |
| Visualization | matplotlib 3.10.9, seaborn 0.13.2, plotly 6.7.0, kaleido 1.3.0, squarify 0.4.4 |
| Statistics | scipy 1.17.1, statsmodels 0.14.6 |
| Document generation | python-docx 1.2.0 |
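Once pip finishes, the installed versions can be compared against the pins in the table above. This is a rough sketch built on the standard library's `importlib.metadata`; the `PINS` dictionary copies only a handful of rows, and `compare_pins` is an illustrative helper rather than anything shipped with the project.

```python
from importlib import metadata

# A few pins copied from the table; extend with the remaining rows as needed.
PINS = {
    "numpy": "2.4.6",
    "pandas": "3.0.3",
    "matplotlib": "3.10.9",
    "plotly": "6.7.0",
    "kaleido": "1.3.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def compare_pins(pins, lookup=installed_version):
    """Map each package that differs from its pin to (pinned, found)."""
    mismatches = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            mismatches[name] = (pinned, found)
    return mismatches


if __name__ == "__main__":
    for name, (pinned, found) in compare_pins(PINS).items():
        print(f"{name}: expected {pinned}, found {found or 'not installed'}")
```

An empty result means every listed package matches its pin.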
Configure environment variables
Copy the provided example file and fill in your Adzuna credentials. This step is only required for notebook 01.
Open .env in your editor and replace the placeholder values.
If you are only running notebooks 02–05, you can leave .env with the example placeholders; those notebooks never read API credentials. See the Environment reference for full details on every variable.
Verify the installation
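If you plan to run notebook 01, a useful first check is whether .env still holds placeholder values. The sketch below parses simple KEY=VALUE lines with the standard library only. The variable names `ADZUNA_APP_ID` and `ADZUNA_APP_KEY` and the `your_` placeholder prefix are hypothetical stand-ins; substitute the names that appear in the project's example file.

```python
from pathlib import Path

# Hypothetical variable names; replace with those in the project's example file.
REQUIRED_KEYS = ("ADZUNA_APP_ID", "ADZUNA_APP_KEY")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and a single layer of quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_credentials(values, placeholder_prefix="your_"):
    """Return required keys that are absent, empty, or look like placeholders."""
    missing = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        if not value or value.lower().startswith(placeholder_prefix):
            missing.append(key)
    return missing


if __name__ == "__main__":
    env_file = Path(".env")
    text = env_file.read_text(encoding="utf-8") if env_file.exists() else ""
    print(missing_credentials(parse_env(text)) or "credentials look filled in")
```

Inside the notebooks themselves, python-dotenv (already among the pinned packages) handles quoting and edge cases more robustly than this hand-written parser.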
Launch Jupyter (for example, with jupyter lab) to confirm everything is wired up correctly. Your browser should open automatically. If it does not, copy the localhost URL printed in the terminal output.
Run notebooks in order
Open the notebooks/ folder and execute the notebooks sequentially. Each notebook depends on the outputs of the one before it.
01_data_collection.ipynb
Fetches job listings from the Adzuna API. Requires credentials. Skip if you are working from existing CSVs.
02_cleaning.ipynb
Normalises, deduplicates, and unifies the three source datasets into cleaned CSVs in data/clean/.
03_eda.ipynb
Performs structure analysis, null analysis, distribution analysis, and ranking generation.
04-visualizations.ipynb
Produces 8 visualization blocks using matplotlib, seaborn, and Plotly, exported as PNGs to images/.
05_bias_analysis.ipynb
Identifies and quantifies representation, location, seniority, and salary-data biases.
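The sequence above can also be scripted instead of run by hand. The sketch below sorts notebook names by their numeric prefix (tolerating both `_` and `-` separators, since 04 uses a hyphen) and builds a `jupyter nbconvert --execute` command for each. It assumes nbconvert is available in the environment, which the package table does not list explicitly; the helper names are illustrative.

```python
import re
from pathlib import Path


def ordered_notebooks(names, skip_collection=False):
    """Sort notebook file names by their leading number (01, 02, ...)."""
    numbered = []
    for name in names:
        match = re.match(r"(\d+)[-_]", Path(name).name)
        if match:  # ignore files without a numeric prefix
            numbered.append((int(match.group(1)), name))
    numbered.sort()
    # Notebook 01 needs Adzuna credentials, so allow leaving it out.
    return [
        name for number, name in numbered
        if not (skip_collection and number == 1)
    ]


def execute_command(notebook):
    """Build an nbconvert command that runs one notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook", "--execute", "--inplace",
        str(notebook),
    ]
```

Passing each command to `subprocess.run(..., check=True)` in order stops the sequence at the first failing notebook, which mirrors the dependency chain described above.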
Skipping notebook 01: If you do not have Adzuna credentials, notebooks 02–05 still run in full using the static datasets (data_science_job_posts_2025.csv, tecnoempleo_spain_2026.csv, stackoverflow_2025_results.csv) that are included in the repository. The data/raw/ directory ships with a .gitkeep placeholder; raw outputs from notebook 01 are gitignored and must be generated or obtained separately.
Project directory structure
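After setup, your workspace should contain at least the paths referenced in this guide: notebooks/, data/raw/, data/clean/, data/eda/, images/, requirements.txt, and .env.
Troubleshooting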
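When a notebook cannot find its inputs or outputs, a quick first check is whether the workspace has the directories this guide refers to. The following sketch lists only paths mentioned on this page; the real repository may contain more, and `missing_paths` is an illustrative helper name.

```python
from pathlib import Path

# Paths named in this guide; the actual repository may include others.
EXPECTED_DIRS = ("notebooks", "data/raw", "data/clean", "data/eda", "images")
EXPECTED_FILES = ("requirements.txt",)


def missing_paths(root="."):
    """Return the expected directories and files that do not exist under root."""
    root = Path(root)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    gaps = missing_paths()
    print("layout looks complete" if not gaps else "missing: " + ", ".join(gaps))
```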
pip install fails with dependency conflicts
Make sure you are running pip install inside your activated virtual environment (your prompt should show .venv or eda-roles). If conflicts persist, try upgrading pip first with python -m pip install --upgrade pip.
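To see which packages actually conflict, pip's own `check` subcommand can be run through the active interpreter, which guarantees the command targets the environment you think it does. This is a small sketch; `pip_command` and `report_conflicts` are hypothetical helper names.

```python
import subprocess
import sys


def pip_command(*args):
    """Build a pip invocation bound to the current interpreter."""
    return [sys.executable, "-m", "pip", *args]


def report_conflicts():
    """Run `pip check` and return (ok, combined output)."""
    result = subprocess.run(pip_command("check"), capture_output=True, text=True)
    return result.returncode == 0, (result.stdout + result.stderr).strip()


if __name__ == "__main__":
    ok, output = report_conflicts()
    print(output or ("no conflicts reported" if ok else "pip check failed"))
```

The same helper builds the upgrade step: `pip_command("install", "--upgrade", "pip")`.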
Jupyter does not see the virtual environment kernel
Register the environment as a Jupyter kernel manually, then restart Jupyter and select EDA Roles (Python 3.11) from the kernel picker.
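The registration step can be sketched with ipykernel's `install` command, invoked through the environment's own interpreter. The display name comes from the text above; the internal kernel name `eda-roles` is an assumption borrowed from the environment name mentioned earlier, so adjust it to match yours.

```python
import subprocess
import sys


def kernel_install_command(name="eda-roles",
                           display_name="EDA Roles (Python 3.11)"):
    """Build the command that registers this interpreter as a Jupyter kernel."""
    return [
        sys.executable, "-m", "ipykernel", "install",
        "--user",                        # install for the current user only
        "--name", name,                  # internal kernel identifier (assumed)
        "--display-name", display_name,  # label shown in the kernel picker
    ]


def register_kernel():
    """Run the registration; raises CalledProcessError if it fails."""
    subprocess.run(kernel_install_command(), check=True)
```

Call `register_kernel()` from inside the activated environment so that `sys.executable` points at the right interpreter.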
Plotly charts do not display in Jupyter Lab
Plotly requires the jupyterlab renderer for interactive output. If charts appear blank, set Plotly's default renderer to jupyterlab in a cell at the top of the notebook.
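A minimal sketch of a first notebook cell that selects the jupyterlab renderer through `plotly.io`. The `ImportError` guard and the `RENDERER` constant are illustrative additions so the cell degrades quietly when Plotly is missing; they are not taken from the project's notebooks.

```python
# Intended as the first cell of a notebook that uses Plotly.
try:
    import plotly.io as pio
except ImportError:  # Plotly is not installed in this kernel
    pio = None

RENDERER = "jupyterlab"

if pio is not None:
    # Every later fig.show() call uses this renderer unless overridden.
    pio.renderers.default = RENDERER
```

In VS Code the `vscode` renderer is the usual equivalent.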
kaleido fails to export static PNGs
kaleido is the Plotly static image engine. If PNG export fails, confirm that kaleido installed correctly. On Linux, you may also need to install libgobject or run inside a headed display (use xvfb-run on headless servers).
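One way to confirm the kaleido installation from the active kernel is to check both that the module can be located and which version is recorded. This is a sketch using the standard library only; `kaleido_status` is a hypothetical helper name.

```python
from importlib import metadata, util


def kaleido_status():
    """Return (importable, version) for the kaleido package."""
    # find_spec locates the module without importing it.
    importable = util.find_spec("kaleido") is not None
    try:
        version = metadata.version("kaleido")
    except metadata.PackageNotFoundError:
        version = None
    return importable, version


if __name__ == "__main__":
    importable, version = kaleido_status()
    print(f"importable={importable}, version={version or 'not installed'}")
```

If both values look right and export still fails, the cause is more likely a missing system dependency than the Python package itself.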