Clean column types are not enough on their own — missing values and extreme outliers will skew every statistic and visualisation produced downstream. This section covers two distinct preprocessing concerns: locating and filling NA entries in the children column, and suppressing extreme values in the adr (Average Daily Rate) column without discarding the affected records.

Identifying missing values

colSums(is.na(df)) first builds a logical matrix with is.na(df), in which each cell is TRUE where the value is missing, and then sums each column (TRUE counts as 1) to produce a named numeric vector showing how many NAs each column contains.
upc-grupo5-tb1.R
# 3.4.1 Identification and Treatment of Missing Data
colSums(is.na(df)) # Count NAs per column
Run this line and check the output before any imputation. Only columns with a non-zero count require treatment.
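As a minimal sketch of what that check looks like, the snippet below runs the same call on a small made-up data frame. The data frame toy, its values, and the agent column are assumptions for illustration only and are not taken from the project dataset.

```r
# Hypothetical toy data standing in for the bookings data frame
toy <- data.frame(
  children = c(0, NA, 2, 0, NA),
  adr      = c(80.5, 120, 75, 95, 60),
  agent    = c("A1", NA, "A3", "A1", "A2")
)

# NA count per column, as a named numeric vector
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the columns that actually need treatment
print(na_counts[na_counts > 0])
```

In this toy case children reports 2 missing values, agent reports 1, and adr reports 0, so only the first two would appear in the filtered result.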

Imputing the children column

The children column records the number of children included in a booking and contains a small number of NA entries. Rather than deleting those rows, the missing values are replaced with the column’s median.
upc-grupo5-tb1.R
# Median imputation for the 'children' variable
df$children[is.na(df$children)] <- median(df$children, na.rm = TRUE)
The logical index is.na(df$children) selects only the missing positions, so existing values are never overwritten. The na.rm = TRUE argument inside median() ensures the median is computed from the non-missing values only.
The median is preferred over the mean here because children is a count variable with a right-skewed distribution (most bookings have zero children). The mean would be pulled upward by the few bookings with several children, producing an imputed value that is unrepresentative of the typical case.
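The difference between the two statistics can be illustrated with a short sketch on a hypothetical vector. The values below are invented to mimic a right-skewed count and are not drawn from the real children column.

```r
# Hypothetical right-skewed counts with two missing entries
children <- c(0, 0, 0, 0, 1, 0, 2, NA, 0, 3, NA)

# The mean is pulled upward by the few larger counts
mean_children <- mean(children, na.rm = TRUE)      # about 0.67
# The median reflects the typical booking
median_children <- median(children, na.rm = TRUE)  # 0

# Same indexing pattern as above: only the NA positions are assigned
children[is.na(children)] <- median_children
print(children)
```

After the assignment the two formerly missing positions hold 0, and the nine observed values are unchanged. Imputing the mean instead would have inserted a fractional count of roughly 0.67, which no real booking can have.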
