Preprocessing: Missing Values and Outlier Treatment

Clean column types are not enough on their own: missing values and extreme outliers can distort the statistics and visualisations produced downstream. This section covers two distinct preprocessing steps: locating and filling NA entries in the children column, and capping extreme values in the adr (Average Daily Rate) column without discarding the affected records.

3.4.1 Missing values
3.4.2 Outlier treatment

Identifying missing values

colSums(is.na(df)) first builds a logical matrix with is.na(), marking every cell of df as TRUE (missing) or FALSE, and then sums each column. Because TRUE counts as 1, the result is a named numeric vector giving the number of NAs in each column.

upc-grupo5-tb1.R

# 3.4.1 Identification and treatment of missing data
colSums(is.na(df)) # Count NAs per column

Run this line and check the output before any imputation. Only columns with a non-zero count require treatment.
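As a minimal sketch of how to read that output, the snippet below runs the same call on a small toy data frame and keeps only the columns that need treatment. The `toy` data frame and its values are hypothetical and are not taken from the hotel dataset.

```r
# Toy data frame with hypothetical values (not the hotel dataset)
toy <- data.frame(
  adults   = c(2, 1, 2, 3),
  children = c(0, NA, 1, NA),
  adr      = c(80.5, 120, NA, 95)
)

# One count per column; TRUE is summed as 1
na_counts <- colSums(is.na(toy))

# Keep only the columns with at least one NA
needs_treatment <- na_counts[na_counts > 0]
print(needs_treatment)
# children      adr
#        2        1
```

Filtering on `na_counts > 0` is a convenience for wide data frames, where scanning the full output by eye is error-prone.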

Imputing the `children` column

The children column records the number of children included in a booking and contains a small number of NA entries. Rather than deleting those rows, the missing values are replaced with the column’s median.

upc-grupo5-tb1.R

# Median imputation for the 'children' variable
df$children[is.na(df$children)] <- median(df$children, na.rm = TRUE)

The logical index is.na(df$children) selects only the missing positions, so existing values are never overwritten. The na.rm = TRUE argument inside median() ensures the median is computed from the non-missing values only; without it, median() would return NA.

The median is preferred over the mean here because children is a count variable with a right-skewed distribution (most bookings have zero children). The mean would be pulled upward by the few bookings with several children, producing an imputed value that is unrepresentative of the typical case.
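The difference between the two statistics can be illustrated on a small right-skewed vector. This is a sketch with hypothetical counts, not values from the dataset; the names `children`, `mean_value`, `median_value`, and `imputed` are chosen for illustration only.

```r
# Hypothetical right-skewed counts with two missing entries
children <- c(0, 0, 0, 0, 0, 1, 2, 3, NA, NA)

mean_value   <- mean(children, na.rm = TRUE)    # 6 / 8 = 0.75, not a valid count
median_value <- median(children, na.rm = TRUE)  # 0, the typical booking

# Same indexing pattern as in the script: only NA positions are replaced
imputed <- children
imputed[is.na(imputed)] <- median_value
print(imputed)
# 0 0 0 0 0 1 2 3 0 0
```

In this toy case the mean would impute 0.75 children, a value that no real booking can have, while the median imputes a whole number that matches most observations.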

Detecting outliers in ADR

ADR (Average Daily Rate) is the average revenue earned per occupied room per day. It is computed by dividing total room revenue by the number of rooms sold. In this dataset, a small number of bookings have implausibly high ADR values, likely data entry errors or very unusual reservations. Left untreated, they would compress the y-axis of any plot and distort regression estimates.

A boxplot makes these extremes visible before treatment:

upc-grupo5-tb1.R

# 3.4.2 Outlier detection and treatment (ADR)
# Initial boxplot for detection
boxplot(df$adr, main="Detección de Outliers en ADR", col="orange")
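To complement the visual check, the points a boxplot draws beyond its whiskers can also be listed numerically with boxplot.stats(), which applies the same default 1.5 × IQR rule. The sketch below uses a hypothetical `adr` vector rather than df$adr.

```r
# Hypothetical ADR values with one implausible entry
adr <- c(60, 75, 80, 90, 95, 100, 110, 120, 130, 5400)

# boxplot.stats() returns the five boxplot statistics and the flagged points
stats <- boxplot.stats(adr)

outliers      <- stats$out       # values beyond the whiskers
upper_whisker <- stats$stats[5]  # largest value still inside the upper fence

print(outliers)       # 5400
print(upper_whisker)  # 130
```

On the real data, length(boxplot.stats(df$adr)$out) would give the number of flagged bookings, which helps when deciding how aggressive the treatment should be.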

Winsorizing ADR at the 95th percentile

upc-grupo5-tb1.R

# Winsorization at the 95th percentile
limite_superior <- quantile(df$adr, 0.95, na.rm = TRUE)
df$adr[df$adr > limite_superior] <- limite_superior

# Boxplot after treatment
boxplot(df$adr, main="ADR después de Winsorización", col="lightgreen")

quantile(df$adr, 0.95, na.rm = TRUE) computes the value below which 95 % of the non-missing ADR observations fall. Every row whose adr exceeds that threshold is then capped, that is, set equal to limite_superior, rather than deleted. Only the upper tail is treated; low values are left as they are.

The second boxplot shows the result: no value now exceeds limite_superior, and observations at or below the 95th percentile are unchanged.
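The capping step can be traced on a small vector. This is a sketch under assumed data: `adr` here is a hypothetical vector of 100 values with one extreme entry, and `cap` and `capped` are illustrative names.

```r
# Hypothetical ADR values: 1 to 99 plus one extreme entry
adr <- c(1:99, 5400)

# 95th percentile with R's default interpolation (about 95.05 here)
cap <- quantile(adr, 0.95, na.rm = TRUE)

# Cap the upper tail; everything at or below the threshold is untouched
capped <- adr
capped[capped > cap] <- cap

sum(adr > cap)   # number of values that were capped
max(capped)      # now equal to the threshold
length(capped)   # same length as before, no rows lost
```

Five values exceed the threshold in this toy vector (96 to 99 and 5400), and all of them end up equal to `cap`.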

Winsorization was chosen over row deletion because removing outlier bookings would reduce the dataset size and could introduce selection bias — high-rate bookings may correlate with specific hotel types or seasons that matter for the analysis. Capping retains those rows’ other variables while limiting the influence of the extreme ADR value on statistics and plots.
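The trade-off between the two options can be sketched on a toy data frame. The `bookings` data frame, its hotel labels, and its rates are hypothetical and only illustrate the mechanism; they say nothing about the actual distribution in the dataset.

```r
# Hypothetical bookings: the second group has systematically higher rates
bookings <- data.frame(
  hotel = rep(c("City Hotel", "Resort Hotel"), each = 10),
  adr   = c(seq(60, 105, by = 5), seq(100, 280, by = 20))
)

cap <- quantile(bookings$adr, 0.95, na.rm = TRUE)

# Option A: delete rows above the threshold
deleted <- bookings[bookings$adr <= cap, ]

# Option B: cap the value and keep the row
capped <- bookings
capped$adr[capped$adr > cap] <- cap

table(deleted$hotel)  # one group loses a booking
table(capped$hotel)   # both groups keep all their bookings
```

Deletion removes the row from only one group in this example, which is the kind of imbalance that could bias a later comparison between groups. Capping keeps the group sizes intact.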
