This project applied a structured exploratory data analysis pipeline to a real-world hotel bookings dataset using R and ggplot2. Starting from a raw CSV file, the team cleaned, transformed, preprocessed, and visualised the data across eight charts, producing a set of descriptive findings about booking patterns, guest composition, cancellation behaviour, and pricing. This page synthesises those findings into practical implications and identifies directions for future work.

Analysis pipeline summary

The workflow followed a clear sequence of steps, each building on the previous:
1. Data loading

The dataset was read into R with read.csv(), explicitly setting header = TRUE and stringsAsFactors = FALSE to maintain control over data types from the outset.
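A minimal sketch of this loading step follows. The file contents are a hypothetical two-row stand-in written to a temporary path, not the project's actual CSV, and only three illustrative columns are included.

```r
# Hypothetical sample standing in for the real bookings file (assumption).
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "hotel,lead_time,adr",
  "City Hotel,45,98.5",
  "Resort Hotel,7,120.0"
), csv_path)

# Same arguments as described in the text.
bookings <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Text columns stay character vectors instead of becoming factors.
class(bookings$hotel)
```

Since R 4.0.0, stringsAsFactors already defaults to FALSE, so setting it explicitly mainly documents intent and protects against older R versions.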
2. Data inspection

str(), summary(), and sum(duplicated()) were used to understand the structure, value ranges, and data quality of all 32 variables before any modifications were made.
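The inspection calls can be sketched on a small made-up data frame. The values below are illustrative assumptions, with one row deliberately repeated so the duplicate count is non-zero.

```r
# Toy data (assumption): row 3 repeats row 1 on purpose.
bookings <- data.frame(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "City Hotel"),
  lead_time = c(45L, 7L, 45L, 120L),
  adr       = c(98.5, 120, 98.5, 75),
  stringsAsFactors = FALSE
)

str(bookings)      # column types and a preview of values
summary(bookings)  # ranges and quartiles for numeric columns

# duplicated() flags rows identical to an earlier row; sum() counts them.
n_dupes <- sum(duplicated(bookings))
n_dupes
```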
3. Transformation

Duplicate records were removed with unique(). Key variables — hotel, arrival_date_month, meal, and is_canceled — were cast to factors, and reservation_status_date was parsed as a Date object.
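The transformation step might look roughly like the sketch below. The sample rows, the calendar-ordered month levels, the cancellation labels, and the date format string are all assumptions made for illustration; the source only states which columns were converted.

```r
# Toy data (assumption): row 3 duplicates row 1.
bookings <- data.frame(
  hotel                   = c("City Hotel", "Resort Hotel", "City Hotel"),
  arrival_date_month      = c("July", "August", "July"),
  meal                    = c("BB", "HB", "BB"),
  is_canceled             = c(0L, 1L, 0L),
  reservation_status_date = c("2015-07-01", "2015-08-15", "2015-07-01"),
  stringsAsFactors = FALSE
)

# Remove exact duplicate rows.
bookings <- unique(bookings)

# Cast categorical variables to factors.
bookings$hotel <- factor(bookings$hotel)
bookings$meal  <- factor(bookings$meal)

# Assumption: calendar order for months, so plots do not sort alphabetically.
bookings$arrival_date_month <- factor(bookings$arrival_date_month,
                                      levels = month.name)

# Assumption: readable labels for the 0/1 cancellation flag.
bookings$is_canceled <- factor(bookings$is_canceled,
                               levels = c(0, 1),
                               labels = c("Not canceled", "Canceled"))

# Assumption: dates are stored as YYYY-MM-DD strings.
bookings$reservation_status_date <- as.Date(bookings$reservation_status_date,
                                            format = "%Y-%m-%d")

str(bookings)
```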
4. Preprocessing

Missing values in children were imputed using the column median. Extreme values in adr were treated via Winsorization at the 95th percentile, reducing the influence of outliers without record deletion.
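Both preprocessing treatments can be sketched with base R. The vectors are made-up values, and capping only the upper tail is an assumption: the source says "at the 95th percentile" without stating whether a lower bound was also applied.

```r
# Toy vectors (assumption) named after the dataset columns.
children <- c(0, 2, NA, 1, 0, 0)
adr      <- c(seq(60, 155, by = 5), 5400)  # 20 ordinary rates plus one extreme

# Median imputation: replace NA with the median of the observed values.
children[is.na(children)] <- median(children, na.rm = TRUE)

# Upper-tail Winsorization: cap values above the 95th percentile at that
# percentile rather than deleting the rows.
upper    <- unname(quantile(adr, probs = 0.95, na.rm = TRUE))
adr_wins <- pmin(adr, upper)

range(adr)       # before: includes the extreme value
range(adr_wins)  # after: maximum is the 95th-percentile cap
```

Using pmin() keeps every record and changes only the values above the cap, which matches the "without record deletion" goal described above.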
5. Visualisation and interpretation

Eight ggplot2 charts were produced, covering hotel type volumes, monthly demand trends, stay duration patterns, guest composition, parking demand, cancellation proportions, and the relationship between lead time and cancellation.
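As one illustration of the chart types listed, the sketch below draws cancellation proportions by hotel type. It is not the project's actual plotting code: the six rows are invented, and the geom, labels, and object names are assumptions.

```r
library(ggplot2)

# Toy data (assumption): values chosen only to make the bars visible.
bookings <- data.frame(
  hotel = factor(c("City Hotel", "City Hotel", "City Hotel",
                   "Resort Hotel", "Resort Hotel", "Resort Hotel")),
  is_canceled = factor(c(1, 1, 0, 0, 0, 1),
                       levels = c(0, 1),
                       labels = c("Not canceled", "Canceled"))
)

# position = "fill" rescales each bar to 1, so bar segments show proportions
# rather than raw counts.
p <- ggplot(bookings, aes(x = hotel, fill = is_canceled)) +
  geom_bar(position = "fill") +
  labs(x = "Hotel type", y = "Proportion of bookings", fill = "Status")

p
```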

Practical implications for hotel management

Revenue management

Winsorizing adr at the 95th percentile produces a cleaner pricing baseline for any downstream rate analysis. Without this treatment, a small number of anomalous rates — likely data entry errors or exceptional events — would distort average calculations and obscure genuine pricing signals. Hotels relying on ADR as a KPI should apply similar outlier-handling practices before reporting.

Cancellation strategy

The most operationally significant finding is the relationship between lead time and cancellation. In this dataset, bookings made months in advance were cancelled more often than bookings made close to the arrival date, which suggests (but does not prove) that long lead times carry higher cancellation risk. Hotels may benefit from:
  • Tiered deposit policies that increase the non-refundable portion for long-lead bookings
  • Targeted pre-arrival communications (e.g., confirmation nudges at 30 and 7 days out)
  • Dynamic pricing that adjusts rates as the arrival date approaches and demand crystallises
City Hotels, which already show higher cancellation proportions than Resort Hotels, are the most immediate beneficiaries of a lead-time–aware cancellation strategy.
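The lead-time pattern discussed here can be examined with a grouped cancellation rate. The sketch below uses invented rows and arbitrary bucket boundaries (both assumptions), so its output illustrates the method only and says nothing about the real dataset.

```r
# Toy data (assumption): is_canceled kept numeric so mean() gives a rate.
bookings <- data.frame(
  lead_time   = c(2, 5, 20, 45, 60, 95, 150, 210, 300),
  is_canceled = c(0, 0, 0, 0, 1, 0, 1, 1, 1)
)

# Hypothetical buckets in days; intervals are closed on the right.
bookings$lead_bucket <- cut(
  bookings$lead_time,
  breaks = c(-Inf, 30, 90, 180, Inf),
  labels = c("0-30", "31-90", "91-180", "181+")
)

# Share of cancelled bookings within each bucket.
cancel_rate <- tapply(bookings$is_canceled, bookings$lead_bucket, mean)
cancel_rate
```

A table like this could inform where a tiered deposit threshold sits, although the cut points would need to be chosen from the real distribution.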

Seasonal staffing and inventory planning

The monthly booking distribution shows consistent seasonal peaks in summer and troughs in winter. This predictable shape enables hotels to plan staffing rosters, food and beverage inventory, and maintenance windows around known low-demand periods, rather than reacting to occupancy changes after they occur.

Marketing segmentation

The data shows that the overwhelming majority of bookings are adults-only; bookings including children or babies are a small minority. Marketing budgets allocated to family-focused campaigns should be calibrated accordingly. However, the family segment, though small, may still warrant dedicated messaging if family bookings turn out to have higher total spend or longer average stays, which this analysis did not test.

Limitations

This analysis is entirely exploratory. No statistical models were fitted and no hypotheses were formally tested. All findings are descriptive summaries of observed patterns in a single dataset. The following limitations apply:
  • No causal inference: Correlations between variables (e.g., lead time and cancellation) do not establish causation without further analysis.
  • No model validation: Without a predictive model, it is not possible to quantify how reliably any variable predicts an outcome.
  • Dataset scope: The dataset covers a defined time window (2015–2017) from a limited number of properties. Findings may not generalise to other markets or time periods.
  • Preprocessing choices: Median imputation and Winsorization introduce assumptions. Alternative treatments may yield different distributional results.

Continue exploring

Visualizations

Review all eight ggplot2 charts that underpin these conclusions, with annotations for each.

Analysis workflow

Trace the full preprocessing pipeline from raw CSV import through outlier treatment.
The natural next step beyond this EDA is predictive modelling. Logistic regression is a straightforward starting point for predicting is_canceled from lead time, hotel type, and deposit type. Ensemble methods such as random forest or gradient boosting (e.g., xgboost in R) can capture non-linear interactions and often yield stronger predictive performance. Cross-validated evaluation with precision, recall, and ROC AUC would show whether the cancellation risk signal is reliable enough to act on operationally.
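A logistic regression starting point could be sketched as follows. The data are simulated, with a positive lead-time effect built in by assumption, so the fitted coefficients are not results from the bookings dataset. Deposit type is left out to keep the example short, and no cross-validation is shown.

```r
set.seed(42)
n <- 500

# Simulated predictors (assumption): uniform lead times, two hotel types.
lead_time <- round(runif(n, min = 0, max = 365))
hotel     <- factor(sample(c("City Hotel", "Resort Hotel"), n, replace = TRUE))

# Simulated outcome: cancellation probability rises with lead time and is
# higher for City Hotels. These coefficients are invented for the sketch.
p_cancel    <- plogis(-1.5 + 0.008 * lead_time + 0.5 * (hotel == "City Hotel"))
is_canceled <- rbinom(n, size = 1, prob = p_cancel)

sim <- data.frame(lead_time, hotel, is_canceled)

# Logistic regression via glm() with a binomial family (logit link).
fit <- glm(is_canceled ~ lead_time + hotel, data = sim, family = binomial)
summary(fit)$coefficients

# Predicted cancellation probability for a short and a long lead time.
new_bookings <- data.frame(
  lead_time = c(7, 180),
  hotel     = factor("City Hotel", levels = levels(sim$hotel))
)
pred <- predict(fit, newdata = new_bookings, type = "response")
pred
```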
