The TinderJob cleaning pipeline transforms the raw output from the Tecnoempleo scraper into a structured, analytics-ready dataset. It readsDocumentation Index
Fetch the complete documentation index at: https://mintlify.com/HelenDiMo/TinderJob/llms.txt
Use this file to discover all available pages before exploring further.
data/raw/tecnoempleo_jobs.csv, applies a deterministic sequence of nine transformation steps — covering text normalization, deduplication, feature engineering, salary parsing, and outlier detection — and writes the result to data/processed/clean_tecnoempleo_jobs.csv. Every function is pure and idempotent: given the same raw input, the pipeline always produces identical output.
Running the Pipeline
Execute the cleaning script from the project root after the scraper has produced the raw CSV:Pipeline Steps
Load data — carga_datos()
Reads
data/raw/tecnoempleo_jobs.csv into a Pandas DataFrame using pd.read_csv(). Prints the loaded shape (rows × columns) as a quick sanity check before any transformations are applied.Text normalization — normalizar_texto()
Lowercases the content of the following columns:
titulo, empresa, ubicacion, tipo_contrato, skills, and busqueda. The url column is deliberately excluded from normalization — URLs are case-sensitive and lowercasing them would corrupt links to offer detail pages.Duplicate removal — eliminar_dupl()
Drops duplicate rows using a five-column composite key:
titulo, empresa, ubicacion, salario, and tipo_contrato. The first occurrence of each unique combination is kept. Row counts before and after are printed so duplicate volume is auditable.Skills cleaning — limpiar_skills()
Applied row-by-row to the
skills column. Each value is split on commas, each token is stripped of whitespace and lowercased, duplicates are removed while preserving order, and the cleaned list is re-joined into a comma-separated string. NaN values are passed through as None.Work modality derivation — crear_modalidad()
Creates a new
Detection is case-insensitive because
modalidad column by scanning the ubicacion text for keyword signals:| Keyword detected | Assigned modalidad |
|---|---|
"remoto" | "En Remoto" |
"híbrido" or "hibrido" | "Híbrido" |
"presencial" | "Presencial" |
| (none / null) | "No especificado" |
normalizar_texto() has already lowercased the column.City extraction — crear_ciudad()
Creates a new
ciudad column by stripping known modal suffixes and regional qualifiers from the ubicacion value. The following patterns are removed: " - españa", "(híbrido)", "(hibrido)", "(presencial)", "(remoto)". The string "100% remoto" is normalized to "remoto". The result is trimmed and stored as the clean city name.Salary parsing — limpiar_salarios()
Parses the free-text
salario column into three new numeric columns using regex:salario_min— lower bound of the salary range (annual EUR)salario_max— upper bound of the salary range (annual EUR)salario_medio— arithmetic mean of min and max
"mes", "b/m", "monthly", "/month"), both min and max are multiplied by 12 before computing the mean. Rows where the salary text cannot yield at least two numeric values receive None in all three columns.Outlier detection — outlier_salario()
Computes Q1, Q3, and IQR from the non-null Additionally, rows where
salario_medio values. Marks each row with a boolean es_outlier column:salario_min < 10,000 (implausibly low annual figures — likely parsing artefacts) are removed from the dataset entirely. The number of removed rows is printed to stdout.The IQR upper-bound multiplier is 3× (not the conventional 1.5×). This asymmetric formula intentionally preserves high-salary senior and specialist roles, which are legitimate data points for the TinderJob salary benchmarking analysis rather than erroneous outliers.
Function Reference
| Function | Signature | Returns | Description |
|---|---|---|---|
carga_datos | carga_datos(ruta: str) | pd.DataFrame | Reads raw CSV and prints shape |
limpiar_texto | limpiar_texto(valor) | str | None | Strips leading/trailing whitespace and collapses internal multiple spaces to a single space; returns None for NaN input |
limpiar_columnas | limpiar_columnas(df: pd.DataFrame) | pd.DataFrame | Applies limpiar_texto element-wise to all text columns: titulo, empresa, ubicacion, salario, tipo_contrato, skills, busqueda |
normalizar_texto | normalizar_texto(df: pd.DataFrame) | pd.DataFrame | Lowercases the following columns: titulo, empresa, ubicacion, tipo_contrato, skills, busqueda. Skips both salario (to preserve raw formatting for regex parsing) and url (case-sensitive) |
eliminar_dupl | eliminar_dupl(df: pd.DataFrame) | pd.DataFrame | Deduplicates on the five-column composite key; prints before/after counts |
limpiar_skills | limpiar_skills(valor) | str | None | Splits, strips, lowercases, deduplicates, and rejoins the skills string |
crear_modalidad | crear_modalidad(df: pd.DataFrame) | pd.DataFrame | Adds modalidad column by parsing ubicacion keywords |
crear_ciudad | crear_ciudad(df: pd.DataFrame) | pd.DataFrame | Adds ciudad column by stripping modal/regional suffixes from ubicacion |
limpiar_salarios | limpiar_salarios(df: pd.DataFrame) | pd.DataFrame | Adds salario_min, salario_max, salario_medio via regex parsing |
outlier_salario | outlier_salario(df: pd.DataFrame) | pd.DataFrame | Adds es_outlier boolean; removes rows where salario_min < 10000 |
guardar_datos_limpios | guardar_datos_limpios(df: pd.DataFrame, ruta: str) | None | Saves cleaned DataFrame as UTF-8 BOM CSV |
Output Schema
The processed CSV atdata/processed/clean_tecnoempleo_jobs.csv contains all eight raw columns plus six derived columns added by the pipeline:
| Column | Origin | Description |
|---|---|---|
titulo | Raw | Normalized job title (lowercase) |
empresa | Raw | Normalized company name (lowercase) |
ubicacion | Raw | Normalized location string (lowercase) |
salario | Raw | Original salary range text (unchanged) |
tipo_contrato | Raw | Normalized contract type (lowercase) |
skills | Raw → cleaned | Deduplicated, lowercase, comma-separated skill list |
busqueda | Raw | Normalized search term (lowercase) |
url | Raw | Full offer URL (case preserved) |
modalidad | Derived | Work modality: 'En Remoto', 'Híbrido', 'Presencial', or 'No especificado' |
ciudad | Derived | Clean city name, stripped of modality suffixes |
salario_min | Derived | Minimum annual salary (EUR float, or NaN) |
salario_max | Derived | Maximum annual salary (EUR float, or NaN) |
salario_medio | Derived | Mean of salario_min and salario_max (EUR float, or NaN) |
es_outlier | Derived | True if salario_medio falls outside IQR-based bounds |