The TinderJob cleaning pipeline transforms the raw output from the Tecnoempleo scraper into a structured, analytics-ready dataset. It reads data/raw/tecnoempleo_jobs.csv, applies a deterministic sequence of nine transformation steps — covering text normalization, deduplication, feature engineering, salary parsing, and outlier detection — and writes the result to data/processed/clean_tecnoempleo_jobs.csv. Every function is pure and idempotent: given the same raw input, the pipeline always produces identical output.

Running the Pipeline

Execute the cleaning script from the project root after the scraper has produced the raw CSV:

    python src/data_processing/clean_tecnoempleo_data.py

The script prints a diagnostic summary at each step, including row counts before and after deduplication and the IQR salary bounds used for outlier detection.

Pipeline Steps

1. Load data: carga_datos()

Reads data/raw/tecnoempleo_jobs.csv into a Pandas DataFrame using pd.read_csv(). Prints the loaded shape (rows × columns) as a quick sanity check before any transformations are applied.

2. Text normalization: normalizar_texto()

Lowercases the content of the following columns: titulo, empresa, ubicacion, tipo_contrato, skills, and busqueda. The url column is deliberately excluded from normalization — URLs are case-sensitive and lowercasing them would corrupt links to offer detail pages.
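
A minimal sketch of this step, assuming the column names listed above. The function body is illustrative and is not taken from the project source.

```python
import pandas as pd

# Columns lowercased by this step; url and salario are deliberately left out.
COLUMNAS_TEXTO = ["titulo", "empresa", "ubicacion", "tipo_contrato", "skills", "busqueda"]

def normalizar_texto(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative sketch: return a copy with the text columns lowercased."""
    df = df.copy()
    for col in COLUMNAS_TEXTO:
        if col in df.columns:
            # .str.lower() leaves missing values as NaN
            df[col] = df[col].str.lower()
    return df

ejemplo = pd.DataFrame({
    "titulo": ["Data Engineer"],
    "url": ["https://example.com/Oferta/ABC"],  # placeholder URL for illustration
})
resultado = normalizar_texto(ejemplo)
```

In this sketch the titulo value is lowercased while the mixed-case url value passes through unchanged.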

3. Duplicate removal: eliminar_dupl()

Drops duplicate rows using a five-column composite key: titulo, empresa, ubicacion, salario, and tipo_contrato. The first occurrence of each unique combination is kept. Row counts before and after are printed so duplicate volume is auditable.
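
The deduplication described above maps naturally onto pandas' drop_duplicates. The sketch below assumes that approach; the sample rows and the message wording are made up for illustration.

```python
import pandas as pd

# Five-column composite key described above
CLAVE = ["titulo", "empresa", "ubicacion", "salario", "tipo_contrato"]

def eliminar_dupl(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative sketch: keep the first row for each unique key combination."""
    antes = len(df)
    df = df.drop_duplicates(subset=CLAVE, keep="first")
    print(f"Rows before: {antes}, after: {len(df)}")
    return df

ofertas = pd.DataFrame({
    "titulo": ["data engineer", "data engineer", "data analyst"],
    "empresa": ["acme", "acme", "acme"],
    "ubicacion": ["madrid", "madrid", "madrid"],
    "salario": ["30.000€ - 40.000€", "30.000€ - 40.000€", None],
    "tipo_contrato": ["indefinido", "indefinido", "indefinido"],
    "url": ["https://example.com/1", "https://example.com/2", "https://example.com/3"],
})
sin_duplicados = eliminar_dupl(ofertas)
```

Because url and busqueda are not part of the key, the same offer collected under two URLs or search terms collapses to a single row.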

4. Skills cleaning: limpiar_skills()

Applied row-by-row to the skills column. Each value is split on commas, each token is stripped of whitespace and lowercased, duplicates are removed while preserving order, and the cleaned list is re-joined into a comma-separated string. NaN values are passed through as None.
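
A minimal sketch of the per-value logic described above. Two details are assumptions not stated in the source: empty tokens are dropped, and tokens are re-joined with ", ".

```python
import math

def limpiar_skills(valor):
    """Illustrative sketch of the skills cleaning described above."""
    # Pass missing values (None or float NaN) through as None
    if valor is None or (isinstance(valor, float) and math.isnan(valor)):
        return None
    vistos = []
    for token in str(valor).split(","):
        token = token.strip().lower()
        # Assumption: empty tokens are skipped; first occurrence wins, order is kept
        if token and token not in vistos:
            vistos.append(token)
    # Assumption: the separator used when re-joining is ", "
    return ", ".join(vistos)

ejemplo_skills = limpiar_skills("Python, SQL ,python,  Docker")
```

Applied to a DataFrame, this would typically be used as df["skills"].apply(limpiar_skills).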

5. Work modality derivation: crear_modalidad()

Creates a new modalidad column by scanning the ubicacion text for keyword signals:

| Keyword detected | Assigned modalidad |
| --- | --- |
| "remoto" | "En Remoto" |
| "híbrido" or "hibrido" | "Híbrido" |
| "presencial" | "Presencial" |
| (none / null) | "No especificado" |

Plain lowercase substring matching is sufficient here because normalizar_texto() has already lowercased the ubicacion column.
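
The keyword mapping can be sketched as a per-value helper. The name detectar_modalidad is hypothetical (the documented crear_modalidad() operates on a whole DataFrame), and the precedence when several keywords appear in the same string is an assumption that follows the order of the mapping above.

```python
def detectar_modalidad(ubicacion):
    """Hypothetical helper: map one lowercased ubicacion value to a modalidad label."""
    # Null or non-string values carry no keyword signal
    if not isinstance(ubicacion, str):
        return "No especificado"
    # Assumption: checks run in the documented order, so "remoto" wins on ties
    if "remoto" in ubicacion:
        return "En Remoto"
    if "híbrido" in ubicacion or "hibrido" in ubicacion:
        return "Híbrido"
    if "presencial" in ubicacion:
        return "Presencial"
    return "No especificado"

etiquetas = [detectar_modalidad(u) for u in ["madrid (híbrido)", "100% remoto", "barcelona", None]]
```

With a DataFrame, the helper could be applied as df["modalidad"] = df["ubicacion"].apply(detectar_modalidad).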

6. City extraction: crear_ciudad()

Creates a new ciudad column by stripping known modality suffixes and regional qualifiers from the ubicacion value. The following patterns are removed: " - españa", "(híbrido)", "(hibrido)", "(presencial)", "(remoto)". The string "100% remoto" is normalized to "remoto". The result is trimmed of surrounding whitespace and stored as the clean city name.
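
A minimal per-value sketch of the stripping rules above. The helper name extraer_ciudad is hypothetical, and returning None for missing values is an assumption.

```python
# Patterns removed from ubicacion, as listed above
PATRONES = (" - españa", "(híbrido)", "(hibrido)", "(presencial)", "(remoto)")

def extraer_ciudad(ubicacion):
    """Hypothetical helper: derive a clean city name from one ubicacion value."""
    # Assumption: missing values stay missing
    if not isinstance(ubicacion, str):
        return None
    ciudad = ubicacion.replace("100% remoto", "remoto")
    for patron in PATRONES:
        ciudad = ciudad.replace(patron, "")
    # Trim whitespace left behind by the removed suffixes
    return ciudad.strip()

ciudades = [extraer_ciudad(u) for u in
            ["madrid (híbrido)", "barcelona - españa (presencial)", "100% remoto"]]
```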

7. Salary parsing: limpiar_salarios()

Parses the free-text salario column into three new numeric columns using regex:
  • salario_min — lower bound of the salary range (annual EUR)
  • salario_max — upper bound of the salary range (annual EUR)
  • salario_medio — arithmetic mean of min and max
If the raw text contains monthly indicators ("mes", "b/m", "monthly", "/month"), both min and max are multiplied by 12 before computing the mean. Rows where the salary text cannot yield at least two numeric values receive None in all three columns.
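
The parsing rules above can be sketched per value as follows. The helper name parsear_salario, the regular expression, and the assumed number format (a dot as thousands separator, as in "30.000") are illustrative assumptions; the project's actual regex is not shown in this page.

```python
import re

# Monthly indicators listed above
INDICADORES_MENSUALES = ("mes", "b/m", "monthly", "/month")

def parsear_salario(texto):
    """Hypothetical helper returning (salario_min, salario_max, salario_medio)."""
    if not isinstance(texto, str):
        return (None, None, None)
    # Assumed format: digits with optional '.' thousands separators, e.g. "30.000"
    numeros = [float(n.replace(".", "")) for n in re.findall(r"\d+(?:\.\d{3})*", texto)]
    # At least two numeric values are required to form a range
    if len(numeros) < 2:
        return (None, None, None)
    minimo, maximo = numeros[0], numeros[1]
    # Convert monthly figures to annual before computing the mean
    if any(ind in texto.lower() for ind in INDICADORES_MENSUALES):
        minimo, maximo = minimo * 12, maximo * 12
    return (minimo, maximo, (minimo + maximo) / 2)

anual = parsear_salario("30.000€ - 40.000€ b/a")
mensual = parsear_salario("2.000€ - 2.500€ b/m")
```

Here the annual range yields (30000.0, 40000.0, 35000.0) and the monthly range is scaled to (24000.0, 30000.0, 27000.0).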

8. Outlier detection: outlier_salario()

Computes Q1, Q3, and IQR from the non-null salario_medio values. Adds a boolean es_outlier column that is True when a row's salario_medio falls outside the following bounds:
lower_bound = Q1 − 1.5 × IQR
upper_bound = Q3 + 3.0 × IQR
Additionally, rows where salario_min < 10,000 (implausibly low annual figures — likely parsing artefacts) are removed from the dataset entirely. The number of removed rows is printed to stdout.
The IQR upper-bound multiplier is 3.0× (not the conventional 1.5×). This asymmetric formula intentionally preserves high-salary senior and specialist roles, which are legitimate data points for the TinderJob salary benchmarking analysis rather than erroneous outliers.
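
A sketch of the asymmetric bounds, using pandas' default linear-interpolation quantiles. Several details are assumptions: flagging happens before the low-salary rows are removed, rows with a missing salary are kept and flagged False, and the sample figures are invented.

```python
import pandas as pd

def outlier_salario(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative sketch: flag IQR outliers, then drop implausibly low salaries."""
    salarios = df["salario_medio"].dropna()
    q1, q3 = salarios.quantile(0.25), salarios.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 3.0 * iqr  # wider upper fence, as documented above
    print(f"IQR bounds: [{lower_bound}, {upper_bound}]")

    df = df.copy()
    # Comparisons with NaN are False, so rows without a salary are not flagged
    df["es_outlier"] = (df["salario_medio"] < lower_bound) | (df["salario_medio"] > upper_bound)

    # Assumption: rows with a missing salario_min are kept
    eliminar = df["salario_min"] < 10_000
    print(f"Removed {int(eliminar.sum())} rows with salario_min < 10000")
    return df[~eliminar]

muestra = pd.DataFrame({
    "salario_min": [25000, 30000, 35000, 40000, 180000, 5000],
    "salario_medio": [30000, 35000, 40000, 45000, 200000, 6000],
})
marcado = outlier_salario(muestra)
```

For this sample, Q1 is 31250 and Q3 is 43750, giving bounds of 12500 and 81250. The 200000 row is flagged, and the 5000 row is removed.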

9. Save: guardar_datos_limpios()

Creates the data/processed/ directory if it does not exist, then writes the cleaned DataFrame to data/processed/clean_tecnoempleo_jobs.csv with UTF-8 BOM encoding (utf-8-sig) so the file opens correctly in Excel and other Windows-native tools without encoding errors.
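
A minimal sketch of the save step, written against a temporary directory so it can be run without touching the project tree. Omitting the index column (index=False) is an assumption.

```python
from pathlib import Path
import tempfile

import pandas as pd

def guardar_datos_limpios(df: pd.DataFrame, ruta: str) -> None:
    """Illustrative sketch: write the DataFrame as a UTF-8 BOM CSV."""
    destino = Path(ruta)
    # Create the parent directory if it does not exist yet
    destino.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: the DataFrame index is not written to the file
    df.to_csv(destino, index=False, encoding="utf-8-sig")

with tempfile.TemporaryDirectory() as tmp:
    ruta = Path(tmp) / "processed" / "clean_tecnoempleo_jobs.csv"
    guardar_datos_limpios(pd.DataFrame({"ciudad": ["málaga"]}), str(ruta))
    primeros_bytes = ruta.read_bytes()[:3]                # BOM written by utf-8-sig
    releido = pd.read_csv(ruta, encoding="utf-8-sig")     # round-trip check
```

Reading the file back with encoding="utf-8-sig" strips the BOM, so the first column name is not polluted by it.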

Function Reference

| Function | Signature | Returns | Description |
| --- | --- | --- | --- |
| carga_datos | carga_datos(ruta: str) | pd.DataFrame | Reads raw CSV and prints shape |
| limpiar_texto | limpiar_texto(valor) | str or None | Strips leading/trailing whitespace and collapses internal multiple spaces to a single space; returns None for NaN input |
| limpiar_columnas | limpiar_columnas(df: pd.DataFrame) | pd.DataFrame | Applies limpiar_texto element-wise to all text columns: titulo, empresa, ubicacion, salario, tipo_contrato, skills, busqueda |
| normalizar_texto | normalizar_texto(df: pd.DataFrame) | pd.DataFrame | Lowercases the following columns: titulo, empresa, ubicacion, tipo_contrato, skills, busqueda. Skips both salario (to preserve raw formatting for regex parsing) and url (case-sensitive) |
| eliminar_dupl | eliminar_dupl(df: pd.DataFrame) | pd.DataFrame | Deduplicates on the five-column composite key; prints before/after counts |
| limpiar_skills | limpiar_skills(valor) | str or None | Splits, strips, lowercases, deduplicates, and rejoins the skills string |
| crear_modalidad | crear_modalidad(df: pd.DataFrame) | pd.DataFrame | Adds modalidad column by parsing ubicacion keywords |
| crear_ciudad | crear_ciudad(df: pd.DataFrame) | pd.DataFrame | Adds ciudad column by stripping modality/regional suffixes from ubicacion |
| limpiar_salarios | limpiar_salarios(df: pd.DataFrame) | pd.DataFrame | Adds salario_min, salario_max, salario_medio via regex parsing |
| outlier_salario | outlier_salario(df: pd.DataFrame) | pd.DataFrame | Adds es_outlier boolean; removes rows where salario_min < 10000 |
| guardar_datos_limpios | guardar_datos_limpios(df: pd.DataFrame, ruta: str) | None | Saves cleaned DataFrame as UTF-8 BOM CSV |
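
The limpiar_texto helper is only described in the reference above, so here is a minimal sketch of that description. Collapsing every run of whitespace (tabs and newlines as well as spaces) is an assumption; the source only mentions multiple spaces.

```python
import math
import re

def limpiar_texto(valor):
    """Illustrative sketch: trim a value and collapse internal whitespace runs."""
    # Missing values (None or float NaN) become None
    if valor is None or (isinstance(valor, float) and math.isnan(valor)):
        return None
    # Assumption: any whitespace run collapses to a single space
    return re.sub(r"\s+", " ", str(valor).strip())

titulo_limpio = limpiar_texto("  Data   Engineer ")
```

Note that this helper does not change case; lowercasing is handled separately by normalizar_texto.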

Output Schema

The processed CSV at data/processed/clean_tecnoempleo_jobs.csv contains all eight raw columns plus six derived columns added by the pipeline:

| Column | Origin | Description |
| --- | --- | --- |
| titulo | Raw | Normalized job title (lowercase) |
| empresa | Raw | Normalized company name (lowercase) |
| ubicacion | Raw | Normalized location string (lowercase) |
| salario | Raw | Original salary range text (unchanged) |
| tipo_contrato | Raw | Normalized contract type (lowercase) |
| skills | Raw → cleaned | Deduplicated, lowercase, comma-separated skill list |
| busqueda | Raw | Normalized search term (lowercase) |
| url | Raw | Full offer URL (case preserved) |
| modalidad | Derived | Work modality: 'En Remoto', 'Híbrido', 'Presencial', or 'No especificado' |
| ciudad | Derived | Clean city name, stripped of modality suffixes |
| salario_min | Derived | Minimum annual salary (EUR float, or NaN) |
| salario_max | Derived | Maximum annual salary (EUR float, or NaN) |
| salario_medio | Derived | Mean of salario_min and salario_max (EUR float, or NaN) |
| es_outlier | Derived | True if salario_medio falls outside IQR-based bounds |
