The scripts/ directory contains utility code that lives outside the notebooks but supports the project workflow. Currently it holds one script: build_notebook_docx_guide.py, which generates a formatted Word document intended as a speaking guide for presenting the cleaning and EDA notebooks to an audience.
build_notebook_docx_guide.py
Purpose: Programmatically builds a .docx Word document that summarises notebooks 02 (cleaning) and 03 (EDA) in a presenter-friendly format. The document includes section-by-section talking points, dataset summaries, library descriptions, methodology notes, and suggested presentation phrases — everything needed to walk through the notebooks live without reading from the code.
Output file: docs/guia_presentacion_notebooks_cleaning_eda.docx
Depends on: python-docx 1.2.0 (installed via requirements.txt)
Running the script
Execute from the project root (for example, python scripts/build_notebook_docx_guide.py); no arguments are required. The script writes the document to docs/ and prints a confirmation message. If docs/ does not exist, it is created automatically.
The script does not read any environment variables or external data files. It constructs the document entirely from hardcoded content and formatting logic, so it runs without Adzuna credentials or the CSV outputs from the notebooks.
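The automatic creation of docs/ can be sketched with pathlib. This is a minimal illustration under stated assumptions, not the script's actual code: the helper name resolve_output_path and its project_root parameter are hypothetical, and only the output file name is taken from this page.

```python
from pathlib import Path

# Output file name as documented above
OUTPUT_NAME = "guia_presentacion_notebooks_cleaning_eda.docx"


def resolve_output_path(project_root="."):
    """Return the output path, creating docs/ if needed (hypothetical helper)."""
    docs_dir = Path(project_root) / "docs"
    # exist_ok=True keeps repeated runs safe; parents=True creates missing parents
    docs_dir.mkdir(parents=True, exist_ok=True)
    return docs_dir / OUTPUT_NAME
```

Calling the helper twice returns the same path and does not fail when docs/ already exists.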
Document structure
The generated Word document is divided into nine logical sections:
Title and subtitle
Document header with the project name and a descriptive subtitle, styled with the project’s primary blue (#2E74B5).
General flow overview
A narrative summary of the end-to-end pipeline: raw data → cleaning → EDA → visualizations, with transition phrases for each handoff.
Notebook 02 — Cleaning
Block-by-block breakdown of the cleaning notebook: which datasets are loaded, what transformations are applied, and which output files are produced.
Datasets generated
A labeled table listing every CSV written to data/clean/ with column counts and a short description.
Column unification
Explanation of the schema harmonisation step — how three heterogeneous sources are mapped to a single English snake_case schema.
Notebook 03 — EDA
Section-by-section walkthrough of the EDA notebook: structure analysis, null analysis, distribution analysis, and ranking generation.
Libraries used
A reference table of every library used in notebooks 02 and 03 with a one-line description of its role in the analysis.
Methodology, phrases, and limitations
Methodology explanation suitable for a non-technical audience, suggested presentation phrases, and a checklist of limitations to be transparent about during the presentation.
Function reference
build_document()
The main entry point. Instantiates a new python-docx Document object, calls every section builder in order, applies post-processing, and saves the result to docs/guia_presentacion_notebooks_cleaning_eda.docx.
configure_document(document)
Applies global page setup to the Word document: page size, margins, default font, heading styles, and a footer with the project name.
Configuration details
| Setting | Value |
|---|---|
| Page size | 8.5 × 11 inches (US Letter) |
| Margins | 1 inch on all sides |
| Default font | Calibri |
| Heading 1 color | #2E74B5 (blue) |
| Heading 2 color | #2E74B5 (blue) |
| Heading 3 color | #1F4D78 (dark blue) |
| Footer | Project name, right-aligned |
accent_document(document)
Performs a post-processing pass over all paragraph runs in the document, replacing unaccented Spanish words with their correctly accented forms (e.g., "Guia" → "Guía", "analisis" → "análisis"). The replacement dictionary covers common words used throughout the document.
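The replacement idea can be sketched on plain strings. This is a hedged illustration: the two-entry dictionary is only the pair of examples quoted above, the name accent_text is hypothetical, and whole-word matching is an assumption (the real function may use plain substring replacement and works on each run's text rather than on a bare string).

```python
import re

# Hypothetical subset of the replacement dictionary (only the documented examples)
ACCENT_REPLACEMENTS = {"Guia": "Guía", "analisis": "análisis"}


def accent_text(text, replacements=ACCENT_REPLACEMENTS):
    """Replace unaccented words with their accented forms (illustrative sketch)."""
    # \b anchors restrict matches to whole words, so "Guiado" stays untouched
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, replacements)) + r")\b")
    return pattern.sub(lambda match: replacements[match.group(0)], text)
```

In a python-docx document, the same function could be applied as run.text = accent_text(run.text) for each run.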
This function is called once at the end of build_document(), after all content has been added. It iterates over every paragraph and table cell in the document body and applies string replacements to each run’s text.
add_callout(document, title, body)
Inserts a single-cell table styled as a callout box with a shaded background. Used for highlighted notes, warnings, or key takeaways within a section.
Parameters
Background color: #F4F6F9 (off-white blue). Border color: #D9E2EC. Cell padding: 6 pt on all sides.
add_label_detail_table(document, rows, col_widths, header)
Inserts a two-column key-value table — the left column holds a label (bold) and the right column holds its detail. Suitable for dataset summaries, column listings, and parameter descriptions.
Parameters
document: The active python-docx Document instance.
rows: List of (label, detail) tuples. Each tuple becomes one row in the table.
col_widths: Two-element tuple specifying the width in dxa units (twentieths of a point) of the label column and the detail column respectively. Default: (2700, 6660).
header: Optional header string. When provided, a shaded header row spanning both columns is prepended to the table.
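The default widths can be checked against the page setup with a little arithmetic. The sketch below assumes the US Letter page and 1-inch margins listed under configure_document; the helper name dxa_to_inches is hypothetical.

```python
DXA_PER_POINT = 20    # one dxa is a twentieth of a point
POINTS_PER_INCH = 72


def dxa_to_inches(dxa):
    """Convert a width in dxa to inches."""
    return dxa / (DXA_PER_POINT * POINTS_PER_INCH)


default_col_widths = (2700, 6660)                  # label column, detail column
table_width_in = dxa_to_inches(sum(default_col_widths))
usable_width_in = 8.5 - 2 * 1.0                    # page width minus left and right margins
```

Both values come out to 6.5 inches, so the default two-column table spans the full text width.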
add_matrix_table(document, headers, rows, widths)
Inserts a multi-column matrix table with a styled header row. Used for comparisons, library listings, and structured data that requires more than two columns.
Parameters
document: The active python-docx Document instance.
headers: Column header labels. Length must match the length of each inner list in rows and the length of widths.
rows: Data rows. Each inner list is one table row; values map positionally to headers.
widths: Column widths in dxa units (twentieths of a point). Must have the same length as headers.
The header row is shaded #E8EEF5 (light blue) with dark-blue (#1F4D78) bold text.
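The length constraints on headers, rows, and widths can be verified before calling the function. The helper below is a hypothetical pre-check, not part of the script; whether add_matrix_table performs any validation itself is not stated here.

```python
def check_matrix_shape(headers, rows, widths):
    """Raise ValueError if the arguments disagree on column count (hypothetical helper)."""
    column_count = len(headers)
    if len(widths) != column_count:
        raise ValueError(f"expected {column_count} widths, got {len(widths)}")
    for index, row in enumerate(rows):
        if len(row) != column_count:
            raise ValueError(
                f"row {index} has {len(row)} cells, expected {column_count}"
            )
    return column_count
```

A caller could run check_matrix_shape(headers, rows, widths) immediately before add_matrix_table(document, headers, rows, widths) to surface mismatches with a clear message.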
add_paragraph(document, text, style, bold_prefix)
Inserts a single paragraph with a specified Word paragraph style and an optional bold prefix string.
Parameters
document: The active python-docx Document instance.
text: Main paragraph text, rendered in the specified style.
style: Word paragraph style name (default: None, which Word renders as Normal). Common values used in this script: "Heading 1", "Heading 2", "Normal".
bold_prefix: When provided, this string is inserted as a bold run before the main text. Useful for inline labels such as "Output:" or "Nota:".
add_bullets(document, items)
Inserts a bulleted list using Word’s built-in "List Bullet" paragraph style.
Parameters
add_numbered(document, items)
Inserts a numbered list using Word’s built-in "List Number" paragraph style.
Parameters
Colour palette
The script defines six named colours used consistently throughout the document:
| Name | Hex | Usage |
|---|---|---|
| Blue | #2E74B5 | Heading 1, Heading 2, label-detail table header text, title text |
| Dark blue | #1F4D78 | Heading 3, matrix table header text, label column bold text, title run |
| Light blue | #E8EEF5 | Label-detail table header row background, matrix table header row background |
| Light gray | #F2F4F7 | Label column (left cell) background in data rows of label-detail tables |
| Border | #B7C7D9 | Default table border colour |
| Muted | #595959 | Footer text |
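The palette can be expressed as named constants in one place. This is a sketch only: the dictionary keys and the hex_to_rgb helper are hypothetical names, and the hex values are copied from the table above.

```python
# Hex values from the palette table; key names are illustrative
PALETTE = {
    "blue": "#2E74B5",
    "dark_blue": "#1F4D78",
    "light_blue": "#E8EEF5",
    "light_gray": "#F2F4F7",
    "border": "#B7C7D9",
    "muted": "#595959",
}


def hex_to_rgb(hex_code):
    """Split a '#RRGGBB' string into an (r, g, b) tuple of integers."""
    value = hex_code.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))
```

Keeping the values in a single mapping means a colour change touches one line rather than every helper that uses it.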
Extending the script
To add a new section to the generated document, create a helper function following the established pattern and call it inside build_document().
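A minimal sketch of that pattern follows. The section function add_next_steps_section and its content are hypothetical. The two helpers are simplified stand-ins that only mirror the documented signatures so the example is self-contained; in the real script you would reuse the existing add_paragraph and add_bullets instead of redefining them.

```python
# Stand-ins with the documented signatures; the script's own helpers may differ internally.
def add_paragraph(document, text, style=None, bold_prefix=None):
    paragraph = document.add_paragraph(style=style)
    if bold_prefix:
        # The prefix goes in its own run so only that part is bold
        paragraph.add_run(bold_prefix).bold = True
    paragraph.add_run(text)
    return paragraph


def add_bullets(document, items):
    for item in items:
        document.add_paragraph(item, style="List Bullet")


def add_next_steps_section(document):
    """Hypothetical new section: a heading, a labelled note, and a bullet list."""
    add_paragraph(document, "Next steps", style="Heading 1")
    add_paragraph(document, "Reuse the cleaned CSVs for modelling.", bold_prefix="Note: ")
    add_bullets(document, ["Review limitations", "Plan the follow-up analysis"])
```

Inside build_document(), the call add_next_steps_section(document) would go before accent_document(document) runs, so the accent pass also covers the new text.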