This project applied a structured exploratory data analysis pipeline to a real-world hotel bookings dataset using R and ggplot2. Starting from a raw CSV file, the team cleaned, transformed, preprocessed, and visualised the data across eight charts, producing a set of descriptive findings about booking patterns, guest composition, cancellation behaviour, and pricing. This page synthesises those findings into practical implications and identifies directions for future work.
Analysis pipeline summary
The workflow followed a clear sequence of steps, each building on the previous:
Data loading
The dataset was read into R with read.csv(), explicitly setting header = TRUE and stringsAsFactors = FALSE to maintain control over data types from the outset.
Data inspection
str(), summary(), and sum(duplicated()) were used to understand the structure, value ranges, and data quality of all 32 variables before any modifications were made.
Transformation
Duplicate records were removed with unique(). Key variables (hotel, arrival_date_month, meal, and is_canceled) were cast to factors, and reservation_status_date was parsed as a Date object.
Preprocessing
Missing values in children were imputed using the column median. Extreme values in adr were treated via Winsorization at the 95th percentile, reducing the influence of outliers without record deletion.
Practical implications for hotel management
Revenue management
Winsorizing adr at the 95th percentile produces a cleaner pricing baseline for any downstream rate analysis. Without this treatment, a small number of anomalous rates (likely data entry errors or exceptional events) would distort average calculations and obscure genuine pricing signals. Hotels relying on ADR as a KPI should apply similar outlier-handling practices before reporting.
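To make the effect on averages concrete, here is a minimal base R sketch of upper-tail Winsorization at the 95th percentile. The values are invented for illustration and are not taken from the dataset, and capping only the upper tail is an assumption, since the project's exact code is not reproduced on this page.

```r
# Hypothetical ADR values: 99 ordinary rates plus one anomalous entry (5400).
adr <- c(seq(60, 200, length.out = 99), 5400)

# Upper cap: the 95th percentile of the observed values (R's default quantile type).
cap <- quantile(adr, probs = 0.95, names = FALSE)

# Upper-tail Winsorization: values above the cap are set to the cap.
# No rows are dropped, so the vector keeps its original length.
adr_wins <- pmin(adr, cap)

# Compare the mean before and after capping.
mean(adr)
mean(adr_wins)
```

A two-sided variant would also raise values below a lower percentile with pmax(); which variant suits a given report depends on whether implausibly low rates are a concern as well.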
Cancellation strategy
The most operationally significant finding is the relationship between lead time and cancellation. Bookings made months in advance carry a higher cancellation risk than bookings made close to the arrival date. Hotels may benefit from:
- Tiered deposit policies that increase the non-refundable portion for long-lead bookings
- Targeted pre-arrival communications (e.g., confirmation nudges at 30 and 7 days out)
- Dynamic pricing that adjusts rates as the arrival date approaches and demand crystallises
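One way to inspect the lead time pattern described above is to bucket bookings by lead time and compare the share cancelled in each bucket. The sketch below uses a small invented data frame. The column name lead_time, the 0/1 coding of is_canceled, and the bucket boundaries are assumptions for illustration, not values reported by the analysis.

```r
# Hypothetical bookings: lead_time in days, is_canceled coded 0 (kept) or 1 (cancelled).
bookings <- data.frame(
  lead_time   = c(2, 5, 12, 20, 45, 60, 95, 120, 180, 240, 300, 365),
  is_canceled = c(0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1)
)

# Bucket lead time; the break points are illustrative only.
bookings$lead_bucket <- cut(
  bookings$lead_time,
  breaks = c(-Inf, 30, 90, 180, Inf),
  labels = c("0-30", "31-90", "91-180", "181+")
)

# Cancellation rate per bucket: the mean of a 0/1 indicator is the share cancelled.
cancel_rate <- aggregate(is_canceled ~ lead_bucket, data = bookings, FUN = mean)
print(cancel_rate)
```

If is_canceled has already been cast to a factor, as in the transformation step, it would need converting back to numeric (for example with as.integer(as.character(x))) before taking the mean. A table like this is still descriptive and does not by itself establish that long lead times cause cancellations.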
Seasonal staffing and inventory planning
The monthly booking distribution shows consistent seasonal peaks in summer and troughs in winter. This predictable shape enables hotels to plan staffing rosters, food and beverage inventory, and maintenance windows around known low-demand periods, rather than reacting to occupancy changes after they occur.
Marketing segmentation
The data shows that the overwhelming majority of bookings are adults-only; bookings including children or babies are a small minority. Marketing budgets allocated to family-focused campaigns should be calibrated accordingly. However, the family segment, though small, may warrant dedicated messaging given that family travellers often have higher total spend and longer average stays.
Limitations
This analysis is entirely exploratory. No statistical models were fitted and no hypotheses were formally tested. All findings are descriptive summaries of observed patterns in a single dataset. The following limitations apply:
- No causal inference: Correlations between variables (e.g., lead time and cancellation) do not establish causation without further analysis.
- No model validation: Without a predictive model, it is not possible to quantify how reliably any variable predicts an outcome.
- Dataset scope: The dataset covers a defined time window (2015–2017) from a limited number of properties. Findings may not generalise to other markets or time periods.
- Preprocessing choices: Median imputation and Winsorization introduce assumptions. Alternative treatments may yield different distributional results.
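The last limitation can be checked with a quick sensitivity comparison. The sketch below imputes missing values in an invented children vector with the median and then with the mean, and compares the resulting column means. The data and the impute() helper are hypothetical and only illustrate how such a check might look.

```r
# Hypothetical values; NA marks the missing entries to be imputed.
children <- c(0, 0, 0, 1, 0, 2, NA, 0, 0, 3, NA, 0)

# Illustrative helper: replace NA with a summary statistic of the observed values.
impute <- function(x, fun) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

children_median <- impute(children, median)
children_mean   <- impute(children, mean)

# Column mean under each imputation choice.
c(median_imputed = mean(children_median), mean_imputed = mean(children_mean))
```

With a heavily skewed count variable like this one, the median is typically 0, so median imputation pulls the column mean down relative to mean imputation. The same kind of comparison could be run for different Winsorization thresholds.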
Continue exploring
Visualizations
Review all eight ggplot2 charts that underpin these conclusions, with annotations for each.
Analysis workflow
Trace the full preprocessing pipeline from raw CSV import through outlier treatment.