Overview
As the meteorological database grows with continuous weather observations, query performance and storage efficiency become critical. This document presents three complementary optimization strategies implemented in the system.

Strategy 1: Strategic Indexing
Purpose
Improve query performance for time-range and location-based searches, which are the most common access patterns in meteorological data analysis.

Implementation
Index Design
Composite Index: (localidad, fechaHora)
This index is ordered by locality first, then by timestamp, enabling efficient queries that:
- Filter by location
- Filter by both location and time range
- Filter by a time range within a location
Because localidad is the leading column, a query that filters only on fechaHora cannot use the leftmost prefix of this index; if pure time-range queries are frequent, a separate index on fechaHora is worth considering.
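A sketch of the index DDL, assuming the fact table is named observaciones (the table name is an assumption, not given in this document):

```sql
-- Composite index: locality first, then timestamp
CREATE INDEX idx_localidad_fecha
    ON observaciones (localidad, fechaHora);
```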
Use Cases
Time Range Queries
Location-Specific Queries
Combined Location and Time Queries
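The three use cases above might look like this (table name and date values are illustrative assumptions):

```sql
-- Time range query (a separate index on fechaHora may serve this pattern better)
SELECT * FROM observaciones
WHERE fechaHora BETWEEN '2025-01-01' AND '2025-01-31';

-- Location-specific query: uses the index's leading column
SELECT * FROM observaciones
WHERE localidad = 'Bogota';

-- Combined query: uses both index columns
SELECT * FROM observaciones
WHERE localidad = 'Bogota'
  AND fechaHora BETWEEN '2025-01-01' AND '2025-01-31';
```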
When to Use
✅ Use this strategy when:
- Queries frequently filter by location and/or time
- Real-time data access is required
- Storage space is not a primary concern
- Write performance is acceptable (indexes add overhead to INSERT operations)

❌ Avoid this strategy when:
- The table is write-heavy with minimal reads
- Queries rarely filter on these columns
- Storage space is severely constrained
Strategy 2: Partitioning by Date
Purpose
Improve query performance and enable efficient data lifecycle management by physically separating data by year.

Implementation
Partitioning Scheme
Type: RANGE partitioning
Partitioning Key: YEAR(fechaHora)
Partitions:
- p2024: All records from year 2024
- p2025: All records from year 2025
- pMax: All records from 2026 onwards
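A sketch of the partitioned table DDL (the column list is illustrative; note that MySQL requires the partitioning column to appear in every unique key, hence the composite primary key):

```sql
CREATE TABLE observaciones (
    id          INT NOT NULL AUTO_INCREMENT,
    localidad   VARCHAR(50) NOT NULL,
    fechaHora   DATETIME NOT NULL,
    temperatura DECIMAL(4,1),
    -- the partitioning column must be part of every unique key
    PRIMARY KEY (id, fechaHora)
)
PARTITION BY RANGE (YEAR(fechaHora)) (
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION p2025 VALUES LESS THAN (2026),
    PARTITION pMax  VALUES LESS THAN MAXVALUE
);
```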
How It Works
- Partition Pruning: When querying data for a specific year, MySQL only scans the relevant partition(s), ignoring others entirely.
- Physical Separation: Each partition is stored separately, reducing I/O operations.
- Maintenance: Old partitions can be archived, backed up, or dropped independently.
Performance Benefits
Year-Specific Queries
Only partition p2025 is scanned, potentially reducing the data scanned by 66% or more.
Multi-Year Queries
Partitions p2024 and p2025 are scanned while pMax is pruned, so the query touches only the years it needs.
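Pruning can be verified with EXPLAIN, whose partitions column lists only the partitions MySQL will actually read (query sketch; table and column names are assumptions):

```sql
EXPLAIN SELECT AVG(temperatura)
FROM observaciones
WHERE fechaHora BETWEEN '2025-01-01' AND '2025-12-31';
-- the partitions column should list only p2025
```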
Data Lifecycle Management
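Old partitions can be archived and dropped, or new ones carved out, with cheap metadata operations, for example:

```sql
-- Purge an entire year after it has been archived
ALTER TABLE observaciones DROP PARTITION p2024;

-- Split next year's partition out of pMax when 2026 begins
ALTER TABLE observaciones REORGANIZE PARTITION pMax INTO (
    PARTITION p2026 VALUES LESS THAN (2027),
    PARTITION pMax  VALUES LESS THAN MAXVALUE
);
```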
When to Use
✅ Use this strategy when:
- Data has a strong temporal component
- Queries typically filter by year, month, or date ranges
- Historical data needs to be archived or purged periodically
- Table size exceeds several million rows
- Different time periods have different access patterns (recent data = hot, old data = cold)

❌ Avoid this strategy when:
- Queries frequently span multiple years without date filters
- Table size is small (less than 100K rows)
- The partitioning key (year) has low cardinality for your use case
- You need to frequently reorganize partition boundaries
Best Practices
- Future-Proof: Always include a MAXVALUE partition to handle future dates
- Combine with Indexing: Partitioning and indexing work together; partition first, then index within partitions
- Monitor Growth: Plan to add new partitions as new years begin
- Align with Business Logic: Partition boundaries should match reporting periods
Strategy 3: Dimensional Modeling
Purpose
Normalize location data to reduce redundancy and improve referential integrity. This follows a star/snowflake schema design pattern common in data warehousing.

Implementation
Schema Transformation
Before (Denormalized): the fact table stores the locality and country as text in every row. After (Normalized): those attributes move into a dim_localidad dimension table, and the fact table keeps only an integer surrogate key.

Adding Foreign Key Relationship
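A sketch of the dimension table and the foreign key (the fact table name observaciones and column names such as pais and id_localidad are assumptions):

```sql
-- Dimension: each locality/country pair stored exactly once
CREATE TABLE dim_localidad (
    id_localidad INT AUTO_INCREMENT PRIMARY KEY,
    localidad    VARCHAR(50) NOT NULL,
    pais         VARCHAR(50) NOT NULL,
    UNIQUE KEY uq_localidad (localidad, pais)
);

-- Fact table now references the dimension by surrogate key
ALTER TABLE observaciones
    ADD COLUMN id_localidad INT,
    ADD CONSTRAINT fk_obs_localidad
        FOREIGN KEY (id_localidad) REFERENCES dim_localidad (id_localidad);
```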
Benefits
1. Storage Efficiency
Denormalized:
- ‘Bogota’ + ‘Colombia’ stored in every row
- 1000 Bogota observations = 1000 × ~20 bytes = ~20KB redundant data

Normalized:
- ‘Bogota’ + ‘Colombia’ stored once in dim_localidad
- 1000 Bogota observations = 1000 × 4 bytes (INT) = 4KB
- Savings: ~80% reduction in location data storage
2. Data Integrity
- Consistency: Location names cannot have spelling variations (‘Bogota’ vs ‘Bogotá’)
- Referential Integrity: Cannot insert weather data for non-existent locations
- Centralized Updates: Changing ‘Colombia’ to ‘República de Colombia’ requires one update
3. Query Flexibility
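With the dimension in place, queries can filter on locality attributes via a JOIN (sketch; the pais column and surrogate key names are assumptions):

```sql
-- All observations for Colombian localities
SELECT o.fechaHora, d.localidad
FROM observaciones o
JOIN dim_localidad d ON d.id_localidad = o.id_localidad
WHERE d.pais = 'Colombia';
```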
4. Extended Dimensions
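Additional attributes can be attached to the dimension without touching the fact table (hypothetical columns):

```sql
ALTER TABLE dim_localidad
    ADD COLUMN latitud      DECIMAL(9,6),
    ADD COLUMN longitud     DECIMAL(9,6),
    ADD COLUMN zona_horaria VARCHAR(40);
```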
When to Use
✅ Use this strategy when:
- Location data is highly repetitive (same cities measured frequently)
- You need to maintain location metadata (coordinates, region, timezone)
- Data quality and consistency are critical
- You are building a data warehouse or analytics platform
- Multiple fact tables reference the same locations

❌ Avoid this strategy when:
- The dataset is small with few repeated locations
- Simplicity is prioritized over normalization
- JOIN overhead would impact real-time query performance
- Location information rarely changes and has no additional attributes
Extended Dimensional Model
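One possible extension is a date dimension alongside dim_localidad (all names here are hypothetical, not from this document):

```sql
-- Date dimension keyed by calendar day
CREATE TABLE dim_fecha (
    id_fecha  INT PRIMARY KEY,   -- e.g. 20250131
    fecha     DATE NOT NULL,
    anio      SMALLINT NOT NULL,
    mes       TINYINT NOT NULL,
    trimestre TINYINT NOT NULL
);
```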
A full data warehouse implementation would extend this model with additional dimension tables.

Combining Strategies
These three optimization strategies are not mutually exclusive. For optimal performance:

Recommended Combination
- Dimensional Modeling: Normalize location data first
- Partitioning: Partition the fact table by year
- Indexing: Create indexes on partition pruning keys and foreign keys
Note that MySQL does not enforce FOREIGN KEY constraints on partitioned InnoDB tables, so when partitioning and dimensional modeling are combined, the dimension reference becomes a logical key maintained by the application.
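Put together, the fact table might look as follows (sketch; all names are assumptions, and because MySQL does not support foreign keys on partitioned InnoDB tables, id_localidad is indexed rather than declared as a FOREIGN KEY):

```sql
CREATE TABLE fact_observaciones (
    id           INT NOT NULL AUTO_INCREMENT,
    id_localidad INT NOT NULL,          -- logical reference to dim_localidad
    fechaHora    DATETIME NOT NULL,
    temperatura  DECIMAL(4,1),
    PRIMARY KEY (id, fechaHora),
    KEY idx_localidad_fecha (id_localidad, fechaHora)
)
PARTITION BY RANGE (YEAR(fechaHora)) (
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION p2025 VALUES LESS THAN (2026),
    PARTITION pMax  VALUES LESS THAN MAXVALUE
);
```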
Performance Impact Matrix
| Query Type | Indexing | Partitioning | Dimensional |
|---|---|---|---|
| Single location, date range | ⭐⭐⭐ | ⭐⭐ | ⭐ |
| All locations, single year | ⭐⭐ | ⭐⭐⭐ | ⭐ |
| Aggregations by region | ⭐ | ⭐ | ⭐⭐⭐ |
| Recent data (hot queries) | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ |
| Historical analysis | ⭐⭐ | ⭐⭐⭐ | ⭐⭐ |