This section presents the main insights derived from the exploratory data analysis performed on the hotel_bookings dataset. Each finding corresponds to one of the eight ggplot2 visualizations produced in section 3.5 of the R script, or to a preprocessing step that shaped the data before visualisation. The analysis covers booking volumes, seasonal patterns, guest behavior, parking demand, cancellation rates, lead time effects, and pricing outliers.
All findings below are descriptive — they reflect patterns observed in the data through exploratory data analysis (EDA). No predictive models were trained as part of this project. Causal interpretations should be made with caution.
Findings by theme
Hotel type distribution (chart 1)
The dataset contains two hotel types: Resort Hotel and City Hotel. The bar chart of reservation counts shows that City Hotels account for a substantially larger share of total bookings than Resort Hotels. This imbalance has implications for any comparative analysis — proportional metrics (such as cancellation rates) should be preferred over raw counts when comparing the two segments.
Seasonal demand patterns (charts 2–3)
Two charts examine monthly booking volume. The line chart (chart 2) plots monthly counts separately for each year (2015, 2016, 2017), revealing year-over-year growth and a consistent seasonal shape. The bar chart (chart 3) aggregates all years to show the overall distribution across months.

Together they indicate that demand peaks in the summer months (roughly June through August) and troughs in winter (November through February). This pattern is consistent with typical European leisure travel demand and has direct implications for revenue management and staffing.
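The two monthly summaries can be sketched in base R as follows. This is a minimal illustration on made-up rows, not the project's script; the column names arrival_date_year and arrival_date_month are assumptions, since the section does not name them.

```r
# Toy rows (hypothetical values, not taken from hotel_bookings).
# Column names are assumed for illustration.
bookings <- data.frame(
  arrival_date_year  = c(2015, 2015, 2016, 2016, 2016, 2017),
  arrival_date_month = c("July", "August", "January", "July", "July", "August")
)

# Order months chronologically; a plain character column would
# otherwise sort alphabetically on the x axis.
bookings$arrival_date_month <- factor(
  bookings$arrival_date_month,
  levels = month.name
)

# Rows are years, columns are months: the input for a per-year line chart.
by_year_month <- table(bookings$arrival_date_year, bookings$arrival_date_month)

# Totals across all years: the input for an aggregated bar chart.
by_month <- colSums(by_year_month)

print(by_month[by_month > 0])
```

Fixing the factor levels up front keeps both charts in calendar order without any extra sorting step.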
Stay duration patterns (chart 4)
The scatter plot of stays_in_weekend_nights against stays_in_week_nights shows that the most common stay pattern is 1–5 weeknights with 0–2 weekend nights. Jittering the points and drawing them with low alpha reveals the density structure: short midweek stays dominate, while longer stays that include multiple weekend nights are comparatively rare.

This suggests the guest mix skews toward business travellers or short leisure trips, rather than extended resort stays.
Guest composition — children and babies (chart 5)
A derived binary variable (tiene_ninos) was created to flag bookings that include at least one child or baby. The bar chart shows that the large majority of bookings are adults-only; bookings that include children or babies represent a clear minority of the dataset.

This finding is relevant for marketing segmentation: family-oriented amenities and promotions may have limited reach across the full customer base, though they may be critical for retaining the family segment that does book.
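A flag like tiene_ninos could be derived along these lines. This is a sketch under assumptions: the source columns are taken to be named children and babies, missing counts are treated as zero, and the flag is stored as a logical; the original script may differ on all three points.

```r
# Toy rows (hypothetical values). Column names children and babies
# are assumed; the section only names the derived variable.
bookings <- data.frame(
  adults   = c(2, 2, 1, 2, 2),
  children = c(0, 1, 0, NA, 0),
  babies   = c(0, 0, 0, 0, 1)
)

# Assumption of this sketch: a missing count is treated as zero.
children <- ifelse(is.na(bookings$children), 0, bookings$children)
babies   <- ifelse(is.na(bookings$babies), 0, bookings$babies)

# TRUE when the booking includes at least one child or baby.
bookings$tiene_ninos <- (children + babies) > 0

print(table(bookings$tiene_ninos))
```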
Parking demand (chart 6)
The bar chart of required_car_parking_spaces (treated as a categorical variable) shows that the vast majority of guests request zero parking spaces. A small but non-trivial proportion requests one space, and requests for two or more are rare.

This distribution suggests that additional parking capacity would serve only a small share of bookings, so further investment in parking infrastructure is likely to yield diminishing returns. Most hotel guests either arrive without a car or make alternative arrangements.
Cancellation rates by hotel type (chart 7)
The proportional bar chart (position = "fill") shows cancellation status split by hotel type. City Hotels exhibit a higher cancellation proportion than Resort Hotels. This gap may reflect the different booking channels and customer profiles typical of each segment: corporate and last-minute bookings at city properties may carry higher cancellation rates than leisure resort bookings made well in advance, although this analysis does not test that explanation.
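To make the proportional comparison concrete, the sketch below computes the per-hotel shares that a position = "fill" bar displays and then builds the chart if ggplot2 is available. The rows are invented for illustration, and the column names hotel and is_canceled are assumptions rather than names confirmed by this section.

```r
# Toy rows (hypothetical values). Column names are assumed.
bookings <- data.frame(
  hotel = c(rep("City Hotel", 6), rep("Resort Hotel", 4)),
  is_canceled = c(1, 1, 1, 0, 0, 0, 1, 0, 0, 0)
)

# Row-wise proportions: each hotel type sums to 1, which is what a
# position = "fill" bar shows on its y axis.
shares <- prop.table(
  table(bookings$hotel, bookings$is_canceled),
  margin = 1
)
print(round(shares, 2))

# Optional plot, built only when ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(
    bookings,
    ggplot2::aes(x = hotel, fill = factor(is_canceled))
  ) +
    ggplot2::geom_bar(position = "fill") +
    ggplot2::labs(x = "Hotel type", y = "Proportion", fill = "Cancelled")
}
```

Because every bar is rescaled to the same height, the unequal booking volumes of the two hotel types no longer dominate the comparison.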
Lead time and cancellation risk (chart 8)
The box plot comparing lead_time across cancelled and non-cancelled bookings reveals that cancelled bookings have a markedly higher median lead time than bookings that were honored. In this dataset, bookings made many months in advance are cancelled more often than those made close to the arrival date.

This is the team’s primary analytical question (section 3.5, chart 8) and represents one of the most actionable findings in the dataset. It suggests that long lead time is a risk signal that hotels can use to prioritise follow-up communications or apply deposit requirements.
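The comparison behind the box plot reduces to a grouped median, which can be checked numerically as in the sketch below. The values are invented, and is_canceled is an assumed name for the cancellation flag.

```r
# Toy rows (hypothetical values). is_canceled is an assumed column name.
bookings <- data.frame(
  is_canceled = c(0, 0, 0, 0, 1, 1, 1, 1),
  lead_time   = c(2, 10, 30, 45, 40, 90, 150, 200)
)

# Median lead time (in days) for each cancellation status.
median_lead <- tapply(bookings$lead_time, bookings$is_canceled, median)
print(median_lead)

# The same five-number summaries a box plot would draw, without plotting.
box_stats <- boxplot(lead_time ~ is_canceled, data = bookings, plot = FALSE)
print(box_stats$stats)
```

Reading the medians alongside the box plot guards against over-interpreting a visual gap that is driven by a few extreme lead times.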
ADR outlier treatment — Winsorization
Prior to visualisation, the adr (Average Daily Rate) variable was inspected with a boxplot and found to contain extreme high values. Winsorization at the 95th percentile was applied: any ADR value above the 95th-percentile threshold was replaced with that threshold value.

This treatment reduces the distorting influence of outliers on distributional summaries and charts without discarding records from the dataset. The before-and-after boxplots confirm that the upper tail was compressed to a defensible ceiling while the bulk of the distribution was preserved.
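Upper-tail Winsorization as described above can be sketched in a few lines of base R. The helper name winsorize_upper and the sample values are hypothetical, and the sketch relies on R's default quantile definition, which may not match the one used in the original script.

```r
# Hypothetical helper: cap values above the chosen upper quantile.
# Uses R's default quantile type; the original script may differ.
winsorize_upper <- function(x, prob = 0.95) {
  cap <- quantile(x, probs = prob, na.rm = TRUE, names = FALSE)
  pmin(x, cap)  # values at or below the cap are unchanged; NA stays NA
}

# Toy ADR values (hypothetical): twenty ordinary rates and one extreme rate.
adr <- c(seq(50, 240, by = 10), 5400)
adr_w <- winsorize_upper(adr)

# No records are dropped; only the upper tail is compressed.
print(length(adr) == length(adr_w))
print(range(adr_w))
```

Capping rather than filtering keeps the row count intact, which matters when the same data frame feeds the other charts.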