Clean column types are not enough on their own — missing values and extreme outliers will skew every statistic and visualisation produced downstream. This section covers two distinct preprocessing concerns: locating and filling NA entries in the children column, and suppressing extreme values in the adr (Average Daily Rate) column without discarding the affected records.
3.4.1 Missing values
3.4.2 Outlier treatment
Identifying missing values
colSums(is.na(df)) applies is.na() to every cell of the data frame, producing a logical matrix, and then sums each column (TRUE counts as 1) to return a named numeric vector with the number of NAs in each column.
# 3.4.1 Identification and treatment of missing data
colSums(is.na(df)) # Count NAs per column
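The same check can be tried on a small data frame first. The sketch below uses a hypothetical `toy` data frame (its values are made up for illustration and are not taken from the bookings dataset) and filters the result so that only the columns needing treatment remain.

```r
# Hypothetical toy data standing in for the bookings data frame
toy <- data.frame(
  adults   = c(2, 1, 2, 2),
  children = c(0, NA, 1, NA),
  adr      = c(80.5, 120, NA, 95)
)

# One NA count per column, returned as a named numeric vector
na_counts <- colSums(is.na(toy))

# Keep only the columns that actually contain missing values
needs_treatment <- na_counts[na_counts > 0]
print(needs_treatment)
```

Filtering on `na_counts > 0` is only a convenience for wide data frames, where scanning the full output by eye is error-prone.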
Run colSums(is.na(df)) and check the output before any imputation. Only columns with a non-zero count require treatment.
Imputing the children column
The children column records the number of children included in a booking and contains a small number of NA entries. Rather than deleting those rows, the missing values are replaced with the column’s median.
# Median imputation for the 'children' variable
df$children[is.na(df$children)] <- median(df$children, na.rm = TRUE)
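To see how the choice of statistic changes the imputed value, the following sketch applies the same assignment to a hypothetical right-skewed count vector. The numbers are invented for illustration only.

```r
# Hypothetical counts: mostly zeros, a few larger values, two NAs
children <- c(0, 0, 0, 0, 1, 0, 2, NA, 0, 3, NA)

mean_val   <- mean(children, na.rm = TRUE)    # pulled upward by the few large counts
median_val <- median(children, na.rm = TRUE)  # stays at the typical value

# Same pattern as the imputation step: only NA positions are assigned
imputed <- children
imputed[is.na(imputed)] <- median_val

observed <- !is.na(children)
print(c(mean = mean_val, median = median_val))
print(imputed)
```

In this toy vector the mean is a fraction, which is not a possible child count, while the median is a whole number that matches most observations.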
The logical index is.na(df$children) selects only the missing positions, so existing values are never overwritten. The na.rm = TRUE argument inside median() ensures the median is computed from the non-missing values only.
The median is preferred over the mean here because children is a count variable with a right-skewed distribution (most bookings have zero children). The mean would be pulled upward by the few bookings with several children, producing an imputed value that is unrepresentative of the typical case.
Detecting outliers in ADR
ADR (Average Daily Rate) is the average revenue earned per occupied room per day, computed by dividing total room revenue by the number of rooms sold. In this dataset, a small number of bookings have implausibly high ADR values, likely data entry errors or very unusual reservations, that would compress the y-axis of any plot and distort regression coefficients.
A boxplot makes these extremes visible before treatment:
# 3.4.2 Outlier detection and treatment (ADR)
# Initial boxplot for detection
boxplot(df$adr, main="Detección de Outliers en ADR", col="orange")
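The points a boxplot draws beyond its whiskers can also be listed numerically. The sketch below uses `grDevices::boxplot.stats()`, which applies the same 1.5 × IQR rule as `boxplot()`, on a hypothetical ADR vector. The values are made up for illustration and this step is not part of the original workflow.

```r
# Hypothetical ADR values with one implausibly high entry
adr <- c(60, 75, 80, 85, 90, 95, 100, 110, 120, 5400)

# boxplot.stats() returns the five-number summary in $stats and
# the values lying beyond the whiskers in $out
stats    <- grDevices::boxplot.stats(adr)
outliers <- stats$out

print(outliers)
print(length(outliers))  # how many values the plot would flag
```

Counting the flagged values this way gives a number that can be reported alongside the plot.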
Winsorizing ADR at the 95th percentile
# Winsorization at the 95th percentile
limite_superior <- quantile(df$adr, 0.95, na.rm = TRUE)
df$adr[df$adr > limite_superior] <- limite_superior
# Boxplot after treatment
boxplot(df$adr, main="ADR después de Winsorización", col="lightgreen")
quantile(df$adr, 0.95, na.rm = TRUE) computes the value at or below which 95 % of the non-missing ADR observations fall. Every row whose adr exceeds that threshold is then capped at limite_superior (set equal to it) rather than deleted. Only the upper tail is treated; low values are left unchanged.
The second boxplot should show that the extreme points in the upper tail have been pulled down to the cap, while every value at or below the 95th percentile is unchanged.
Winsorization was chosen over row deletion because removing outlier bookings would reduce the dataset size and could introduce selection bias: high-rate bookings may correlate with specific hotel types or seasons that matter for the analysis. Capping retains those rows’ other variables while limiting the influence of the extreme ADR value on statistics and plots.
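The difference between capping and deleting can be checked on a small example. The sketch below runs the same upper-tail capping on a hypothetical vector (values invented for illustration) and compares the result with simply dropping the extreme rows. The `names = FALSE` argument is an addition here that returns a plain number instead of one labelled "95%".

```r
# Hypothetical ADR values: 20 ordinary rates plus one extreme entry
adr <- c(seq(50, 240, by = 10), 5400)

# Upper cap at the 95th percentile
cap <- quantile(adr, 0.95, na.rm = TRUE, names = FALSE)

# Capping: values above the cap are set to the cap, nothing is dropped
capped <- adr
capped[capped > cap] <- cap

# Deletion, for comparison: rows above the cap are removed
deleted <- adr[adr <= cap]

print(c(original = length(adr), capped = length(capped), deleted = length(deleted)))
print(c(max_before = max(adr), max_after = max(capped)))
```

The capped vector keeps the original length, so each row's other variables would remain available for analysis, whereas the deleted vector is shorter.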