Data Cleaning Pipeline for Tecnoempleo Jobs Dataset

The TinderJob cleaning pipeline transforms the raw output from the Tecnoempleo scraper into a structured, analytics-ready dataset. It reads data/raw/tecnoempleo_jobs.csv, applies a deterministic sequence of nine transformation steps — covering text normalization, deduplication, feature engineering, salary parsing, and outlier detection — and writes the result to data/processed/clean_tecnoempleo_jobs.csv. Every function is pure and idempotent: given the same raw input, the pipeline always produces identical output.

Running the Pipeline

Execute the cleaning script from the project root after the scraper has produced the raw CSV:

python src/data_processing/clean_tecnoempleo_data.py

The script prints a diagnostic summary at each step, including row counts before and after deduplication and the IQR salary bounds used for outlier detection.

Pipeline Steps

Load data — carga_datos()

Reads data/raw/tecnoempleo_jobs.csv into a Pandas DataFrame using pd.read_csv(). Prints the loaded shape (rows × columns) as a quick sanity check before any transformations are applied.

Text normalization — normalizar_texto()

Lowercases the content of the following columns: titulo, empresa, ubicacion, tipo_contrato, skills, and busqueda. The url column is deliberately excluded from normalization — URLs are case-sensitive and lowercasing them would corrupt links to offer detail pages.

Duplicate removal — eliminar_dupl()

Drops duplicate rows using a five-column composite key: titulo, empresa, ubicacion, salario, and tipo_contrato. The first occurrence of each unique combination is kept. Row counts before and after are printed so duplicate volume is auditable.

Skills cleaning — limpiar_skills()

Applied row-by-row to the skills column. Each value is split on commas, each token is stripped of whitespace and lowercased, duplicates are removed while preserving order, and the cleaned list is re-joined into a comma-separated string. NaN values are passed through as None.

Work modality derivation — crear_modalidad()

Creates a new modalidad column by scanning the ubicacion text for keyword signals:

Keyword detected	Assigned `modalidad`
`"remoto"`	`"En Remoto"`
`"híbrido"` or `"hibrido"`	`"Híbrido"`
`"presencial"`	`"Presencial"`
(none / null)	`"No especificado"`

Detection is case-insensitive because normalizar_texto() has already lowercased the column.

City extraction — crear_ciudad()

Creates a new ciudad column by stripping known modal suffixes and regional qualifiers from the ubicacion value. The following patterns are removed: " - españa", "(híbrido)", "(hibrido)", "(presencial)", "(remoto)". The string "100% remoto" is normalized to "remoto". The result is trimmed and stored as the clean city name.

Salary parsing — limpiar_salarios()

Parses the free-text salario column into three new numeric columns using regex:

salario_min — lower bound of the salary range (annual EUR)
salario_max — upper bound of the salary range (annual EUR)
salario_medio — arithmetic mean of min and max

If the raw text contains monthly indicators ("mes", "b/m", "monthly", "/month"), both min and max are multiplied by 12 before computing the mean. Rows where the salary text cannot yield at least two numeric values receive None in all three columns.

Outlier detection — outlier_salario()

Computes Q1, Q3, and IQR from the non-null salario_medio values. Marks each row with a boolean es_outlier column:

lower_bound = Q1 − 1.5 × IQR
upper_bound = Q3 + 3.0 × IQR

Additionally, rows where salario_min < 10,000 (implausibly low annual figures — likely parsing artefacts) are removed from the dataset entirely. The number of removed rows is printed to stdout.

The IQR upper-bound multiplier is 3× (not the conventional 1.5×). This asymmetric formula intentionally preserves high-salary senior and specialist roles, which are legitimate data points for the TinderJob salary benchmarking analysis rather than erroneous outliers.

Save — guardar_datos_limpios()

Creates the data/processed/ directory if it does not exist, then writes the cleaned DataFrame to data/processed/clean_tecnoempleo_jobs.csv with UTF-8 BOM encoding (utf-8-sig) so the file opens correctly in Excel and other Windows-native tools without encoding errors.

Function Reference

Function	Signature	Returns	Description
`carga_datos`	`carga_datos(ruta: str)`	`pd.DataFrame`	Reads raw CSV and prints shape
`limpiar_texto`	`limpiar_texto(valor)`	`str \| None`	Strips leading/trailing whitespace and collapses internal multiple spaces to a single space; returns `None` for `NaN` input
`limpiar_columnas`	`limpiar_columnas(df: pd.DataFrame)`	`pd.DataFrame`	Applies `limpiar_texto` element-wise to all text columns: `titulo`, `empresa`, `ubicacion`, `salario`, `tipo_contrato`, `skills`, `busqueda`
`normalizar_texto`	`normalizar_texto(df: pd.DataFrame)`	`pd.DataFrame`	Lowercases the following columns: `titulo`, `empresa`, `ubicacion`, `tipo_contrato`, `skills`, `busqueda`. Skips both `salario` (to preserve raw formatting for regex parsing) and `url` (case-sensitive)
`eliminar_dupl`	`eliminar_dupl(df: pd.DataFrame)`	`pd.DataFrame`	Deduplicates on the five-column composite key; prints before/after counts
`limpiar_skills`	`limpiar_skills(valor)`	`str \| None`	Splits, strips, lowercases, deduplicates, and rejoins the skills string
`crear_modalidad`	`crear_modalidad(df: pd.DataFrame)`	`pd.DataFrame`	Adds `modalidad` column by parsing `ubicacion` keywords
`crear_ciudad`	`crear_ciudad(df: pd.DataFrame)`	`pd.DataFrame`	Adds `ciudad` column by stripping modal/regional suffixes from `ubicacion`
`limpiar_salarios`	`limpiar_salarios(df: pd.DataFrame)`	`pd.DataFrame`	Adds `salario_min`, `salario_max`, `salario_medio` via regex parsing
`outlier_salario`	`outlier_salario(df: pd.DataFrame)`	`pd.DataFrame`	Adds `es_outlier` boolean; removes rows where `salario_min < 10000`
`guardar_datos_limpios`	`guardar_datos_limpios(df: pd.DataFrame, ruta: str)`	`None`	Saves cleaned DataFrame as UTF-8 BOM CSV

Output Schema

The processed CSV at data/processed/clean_tecnoempleo_jobs.csv contains all eight raw columns plus six derived columns added by the pipeline:

Column	Origin	Description
`titulo`	Raw	Normalized job title (lowercase)
`empresa`	Raw	Normalized company name (lowercase)
`ubicacion`	Raw	Normalized location string (lowercase)
`salario`	Raw	Original salary range text (unchanged)
`tipo_contrato`	Raw	Normalized contract type (lowercase)
`skills`	Raw → cleaned	Deduplicated, lowercase, comma-separated skill list
`busqueda`	Raw	Normalized search term (lowercase)
`url`	Raw	Full offer URL (case preserved)
`modalidad`	Derived	Work modality: `'En Remoto'`, `'Híbrido'`, `'Presencial'`, or `'No especificado'`
`ciudad`	Derived	Clean city name, stripped of modality suffixes
`salario_min`	Derived	Minimum annual salary (EUR float, or `NaN`)
`salario_max`	Derived	Maximum annual salary (EUR float, or `NaN`)
`salario_medio`	Derived	Mean of `salario_min` and `salario_max` (EUR float, or `NaN`)
`es_outlier`	Derived	`True` if `salario_medio` falls outside IQR-based bounds

Overview

Data Pipeline

Analysis Notebooks

Streamlit Dashboard

Key Findings

Data Cleaning Pipeline for Tecnoempleo Jobs Dataset

Running the Pipeline

Pipeline Steps

Function Reference

Output Schema

Build docs developers (and LLMs) love

Overview

Data Pipeline

Analysis Notebooks

Streamlit Dashboard

Key Findings

Documentation Index

​Running the Pipeline

​Pipeline Steps

​Function Reference

​Output Schema

Build docs developers (and LLMs) love

Running the Pipeline

Pipeline Steps

Function Reference

Output Schema