The scripts/ directory contains utility code that lives outside the notebooks but supports the project workflow. Currently it holds one script: build_notebook_docx_guide.py, which generates a formatted Word document intended as a speaking guide for presenting the cleaning and EDA notebooks to an audience.

build_notebook_docx_guide.py

Purpose: Programmatically builds a .docx Word document that summarises notebooks 02 (cleaning) and 03 (EDA) in a presenter-friendly format. The document includes section-by-section talking points, dataset summaries, library descriptions, methodology notes, and suggested presentation phrases: everything needed to walk through the notebooks live without reading from the code.
Output file: docs/guia_presentacion_notebooks_cleaning_eda.docx
Depends on: python-docx 1.2.0 (installed via requirements.txt)

Running the script

Execute from the project root — no arguments required:
# From the project root directory
python scripts/build_notebook_docx_guide.py
The script writes the output file to docs/ and prints a confirmation message. If docs/ does not exist it is created automatically.
The script does not read any environment variables or external data files. It constructs the document entirely from hardcoded content and formatting logic, so it runs without Adzuna credentials or the CSV outputs from the notebooks.
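The output-directory behaviour described above can be sketched with pathlib. This is a minimal illustration, not the script's actual code: the helper name resolve_output_path is hypothetical and the project root is assumed to be passed in by the caller. Only the docs/ folder and the output filename come from this page.

```python
import tempfile
from pathlib import Path

OUTPUT_NAME = "guia_presentacion_notebooks_cleaning_eda.docx"


def resolve_output_path(project_root: Path) -> Path:
    """Return the .docx path under docs/, creating docs/ when it is missing.

    Hypothetical helper: the real script may organise this differently.
    """
    docs_dir = project_root / "docs"
    # parents=True creates missing ancestors; exist_ok=True makes reruns safe.
    docs_dir.mkdir(parents=True, exist_ok=True)
    return docs_dir / OUTPUT_NAME


if __name__ == "__main__":
    # A temporary directory stands in for the project root in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        output_path = resolve_output_path(Path(tmp))
        print(output_path.name)
```

Because mkdir is called with exist_ok=True, running the helper a second time against the same root is harmless, which matches the "created automatically" behaviour described for docs/.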

Document structure

The generated Word document is divided into nine logical sections:
1. Title and subtitle
   Document header with the project name and a descriptive subtitle, styled with the project’s primary blue (#2E74B5).
2. General flow overview
   A narrative summary of the end-to-end pipeline: raw data → cleaning → EDA → visualizations, with transition phrases for each handoff.
3. Notebook 02: Cleaning
   Block-by-block breakdown of the cleaning notebook: which datasets are loaded, what transformations are applied, and which output files are produced.
4. Datasets generated
   A labeled table listing every CSV written to data/clean/ with column counts and a short description.
5. Column unification
   Explanation of the schema harmonisation step: how three heterogeneous sources are mapped to a single English snake_case schema.
6. Notebook 03: EDA
   Section-by-section walkthrough of the EDA notebook: structure analysis, null analysis, distribution analysis, and ranking generation.
7. Libraries used
   A reference table of every library used in notebooks 02 and 03 with a one-line description of its role in the analysis.
8. Methodology, phrases, and limitations
   Methodology explanation suitable for a non-technical audience, suggested presentation phrases, and a checklist of limitations to be transparent about during the presentation.
9. Closing statement
   A prepared closing paragraph summarising the project’s contribution and next steps.

Function reference

build_document()

The main entry point. Instantiates a new python-docx Document object, calls every section builder in order, applies post-processing, and saves the result to docs/guia_presentacion_notebooks_cleaning_eda.docx.
def build_document():
    """Orchestrates document creation and writes the output file."""
Call chain:
build_document()
  ├── configure_document(document)
  ├── [section builder calls...]
  ├── accent_document(document)
  └── document.save(OUTPUT_PATH)  → docs/guia_presentacion_notebooks_cleaning_eda.docx

configure_document(document)

Applies global page setup to the Word document: page size, margins, default font, heading styles, and a footer with the project name.
def configure_document(document):
    """Set page margins, font (Calibri), heading styles, and footer."""
| Setting | Value |
| --- | --- |
| Page size | 8.5 × 11 inches (US Letter) |
| Margins | 1 inch on all sides |
| Default font | Calibri |
| Heading 1 color | #2E74B5 (blue) |
| Heading 2 color | #2E74B5 (blue) |
| Heading 3 color | #1F4D78 (dark blue) |
| Footer | Project name, right-aligned |

accent_document(document)

Performs a post-processing pass over all paragraph runs in the document, replacing unaccented Spanish words with their correctly accented forms (e.g., "Guia" → "Guía", "analisis" → "análisis"). The replacement dictionary covers common words used throughout the document.
def accent_document(document):
    """Apply Spanish accent corrections across all runs in the document."""
This function is called once at the end of build_document(), after all content has been added. It iterates over every paragraph and table cell in the document body and applies string replacements to each run’s text.
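The replacement pass can be illustrated on plain strings, without python-docx. The sketch below is an assumption about how such a pass might look, not the script's code: ACCENT_REPLACEMENTS holds only the two examples given on this page, and apply_accents and accent_runs are hypothetical names. Plain substring replacement is assumed, so a key that appears inside a longer word is replaced there too.

```python
# Illustrative subset only; the script's real dictionary is not shown on this page.
ACCENT_REPLACEMENTS = {
    "Guia": "Guía",
    "analisis": "análisis",
}


def apply_accents(text: str, replacements: dict[str, str] = ACCENT_REPLACEMENTS) -> str:
    """Replace each unaccented key found in text with its accented form."""
    for plain, accented in replacements.items():
        text = text.replace(plain, accented)
    return text


def accent_runs(run_texts: list[str]) -> list[str]:
    """Correct each run's text independently, as a run-level pass would."""
    return [apply_accents(text) for text in run_texts]


if __name__ == "__main__":
    print(apply_accents("Guia de analisis"))  # Guía de análisis
    # A word split across two runs is not matched by a run-level pass.
    print(accent_runs(["Gu", "ia"]))          # ['Gu', 'ia']
```

The second call shows a general limitation of run-level replacement: if formatting splits a word across two runs, neither run contains the full key, so the word is left unchanged.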

add_callout(document, title, body)

Inserts a single-cell table styled as a callout box with a shaded background. Used for highlighted notes, warnings, or key takeaways within a section.
def add_callout(document, title, body):
    """Add a shaded single-cell callout table with a bold title and body text."""
document (Document, required): The active python-docx Document instance to which the callout is appended.
title (str, required): Bold label displayed at the top of the callout cell (e.g., "Nota", "Importante").
body (str, required): Main body text of the callout, displayed below the title in regular weight.
Background color: #F4F6F9 (off-white blue). Border color: #D9E2EC. Cell padding: 6 pt on all sides.

add_label_detail_table(document, rows, col_widths, header)

Inserts a two-column key-value table — the left column holds a label (bold) and the right column holds its detail. Suitable for dataset summaries, column listings, and parameter descriptions.
def add_label_detail_table(document, rows, col_widths=(2700, 6660), header=None):
    """Add a two-column label–detail table with an optional header row."""
document (Document, required): The active python-docx Document instance.
rows (list[tuple[str, str]], required): List of (label, detail) tuples. Each tuple becomes one row in the table.
col_widths (tuple[int, int], optional): Two-element tuple specifying the width in dxa units (twentieths of a point) of the label column and the detail column respectively. Default: (2700, 6660).
header (str, optional): Header string. When provided, a shaded header row spanning both columns is prepended to the table. Default: None.
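The dxa arithmetic behind the default column widths can be checked with a small conversion helper. This is only a worked example of the unit definition (one dxa is a twentieth of a point, and there are 72 points per inch); the function names are hypothetical and not part of the script.

```python
DXA_PER_POINT = 20    # one dxa is a twentieth of a point
POINTS_PER_INCH = 72
DXA_PER_INCH = DXA_PER_POINT * POINTS_PER_INCH  # 1440


def dxa_to_inches(dxa: int) -> float:
    """Convert a width in dxa to inches."""
    return dxa / DXA_PER_INCH


def inches_to_dxa(inches: float) -> int:
    """Convert a width in inches to the nearest whole dxa."""
    return round(inches * DXA_PER_INCH)


if __name__ == "__main__":
    label_width, detail_width = 2700, 6660  # default col_widths
    print(dxa_to_inches(label_width))                 # 1.875
    print(dxa_to_inches(detail_width))                # 4.625
    print(dxa_to_inches(label_width + detail_width))  # 6.5
```

The two defaults add up to 9360 dxa, or 6.5 inches, which is exactly the text width left on an 8.5-inch page with 1-inch margins on each side. Custom col_widths values that should fill the page can be chosen to sum to the same total.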

add_matrix_table(document, headers, rows, widths)

Inserts a multi-column matrix table with a styled header row. Used for comparisons, library listings, and structured data that requires more than two columns.
def add_matrix_table(document, headers, rows, widths):
    """Add a full matrix table with a shaded header row and data rows."""
document (Document, required): The active python-docx Document instance.
headers (list[str], required): Column header labels. Length must match the length of each inner list in rows and the length of widths.
rows (list[list[str]], required): Data rows. Each inner list is one table row; values map positionally to headers.
widths (list[int], required): Column widths in dxa units (twentieths of a point). Must have the same length as headers.
Header row background: #E8EEF5 (light blue) with dark-blue (#1F4D78) bold text.
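The length constraints on headers, rows, and widths can be checked before calling add_matrix_table. The validator below is a hypothetical helper written for illustration; this page does not say whether the script performs such a check itself, and the sample values are placeholders.

```python
def validate_matrix_args(headers: list[str], rows: list[list[str]], widths: list[int]) -> None:
    """Raise ValueError if widths or any row does not match the header count."""
    expected = len(headers)
    if len(widths) != expected:
        raise ValueError(
            f"widths has {len(widths)} entries but there are {expected} headers"
        )
    for index, row in enumerate(rows):
        if len(row) != expected:
            raise ValueError(
                f"row {index} has {len(row)} cells but there are {expected} headers"
            )


if __name__ == "__main__":
    headers = ["Column A", "Column B", "Column C"]  # placeholder labels
    rows = [["a1", "b1", "c1"], ["a2", "b2", "c2"]]
    widths = [3120, 3120, 3120]                     # dxa, summing to 9360
    validate_matrix_args(headers, rows, widths)     # passes silently
    print("arguments are consistent")
```

Failing early with a message that names the offending row is usually easier to debug than a misaligned table discovered only after opening the generated document.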

add_paragraph(document, text, style, bold_prefix)

Inserts a single paragraph with a specified Word paragraph style and an optional bold prefix string.
def add_paragraph(document, text="", style=None, bold_prefix=None):
    """Add a styled paragraph, optionally prefixed with a bold run."""
document (Document, required): The active python-docx Document instance.
text (str, optional): Main paragraph text, rendered in the regular weight of the chosen style. Default: "" (empty string).
style (str, optional): Word paragraph style name (default: None, which Word renders as Normal). Common values used in this script: "Heading 1", "Heading 2", "Normal".
bold_prefix (str, optional): When provided, this string is inserted as a bold run before the main text. Useful for inline labels such as "Output:" or "Nota:". Default: None.

add_bullets(document, items)

Inserts a bulleted list using Word’s built-in "List Bullet" paragraph style.
def add_bullets(document, items):
    """Add a bulleted list. Each string in items becomes one bullet point."""
document (Document, required): The active python-docx Document instance.
items (list[str], required): List of strings. Each string is added as a separate "List Bullet" paragraph.

add_numbered(document, items)

Inserts a numbered list using Word’s built-in "List Number" paragraph style.
def add_numbered(document, items):
    """Add a numbered list. Each string in items becomes one numbered item."""
document (Document, required): The active python-docx Document instance.
items (list[str], required): List of strings. Each string is added as a separate "List Number" paragraph; Word handles auto-incrementing the numbers.

Colour palette

The script defines six named colours used consistently throughout the document:
| Name | Hex | Usage |
| --- | --- | --- |
| Blue | #2E74B5 | Heading 1, Heading 2, label-detail table header text, title text |
| Dark blue | #1F4D78 | Heading 3, matrix table header text, label column bold text, title run |
| Light blue | #E8EEF5 | Label-detail table header row background, matrix table header row background |
| Light gray | #F2F4F7 | Label column (left cell) background in data rows of label-detail tables |
| Border | #B7C7D9 | Default table border colour |
| Muted | #595959 | Footer text |
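The palette can be kept in one mapping and converted to integer components when a colour needs to be applied. The sketch below is an assumption about how this might be organised: the dictionary keys and the hex_to_rgb helper are hypothetical names, and only the hex values come from the table above.

```python
# Hex values are taken from the palette table; the key names are illustrative.
PALETTE = {
    "blue": "#2E74B5",
    "dark_blue": "#1F4D78",
    "light_blue": "#E8EEF5",
    "light_gray": "#F2F4F7",
    "border": "#B7C7D9",
    "muted": "#595959",
}


def hex_to_rgb(hex_colour: str) -> tuple[int, int, int]:
    """Convert a '#RRGGBB' string to an (r, g, b) tuple of 0-255 integers."""
    value = hex_colour.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected a 6-digit hex colour, got {hex_colour!r}")
    red, green, blue = (int(value[i:i + 2], 16) for i in (0, 2, 4))
    return red, green, blue


if __name__ == "__main__":
    print(hex_to_rgb(PALETTE["blue"]))       # (46, 116, 181)
    print(hex_to_rgb(PALETTE["dark_blue"]))  # (31, 77, 120)
```

The three integers are in the form that python-docx's RGBColor constructor expects, so a tuple from hex_to_rgb can be unpacked into it when setting a font colour.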

Extending the script

To add a new section to the generated document, create a helper function following the established pattern and call it inside build_document():
def add_my_new_section(document):
    add_paragraph(document, "My New Section", style="Heading 1")
    add_paragraph(document, "Introductory text for the section.")
    add_bullets(document, [
        "First point",
        "Second point",
        "Third point",
    ])

def build_document():
    document = Document()
    configure_document(document)
    # ... existing section calls ...
    add_my_new_section(document)   # <-- add your call here
    accent_document(document)
    document.save(OUTPUT_PATH)  # OUTPUT_PATH = PROJECT_ROOT / "docs" / "guia_presentacion_notebooks_cleaning_eda.docx"
Always call accent_document(document) as the last step before document.save(). Calling it earlier means any text added after the call will not have its accents corrected.
