Optimization Strategies - Dataknow Data Engineering Technical Test

Overview

As the meteorological database grows with continuous weather observations, query performance and storage efficiency become critical. This document presents three complementary optimization strategies implemented in the system.

Strategy 1: Strategic Indexing

Purpose

Improve query performance for time-range and location-based searches, which are the most common access patterns in meteorological data analysis.

Implementation

CREATE INDEX idx_localidad_fecha
ON clima(localidad, fechaHora);

Index Design

Composite Index: (localidad, fechaHora) This index is ordered by locality first, then by timestamp, enabling efficient queries that:

Filter by location
Filter by time range
Filter by both location and time range

Use Cases

Time Range Queries

SELECT *
FROM clima
WHERE fechaHora BETWEEN '2025-01-01' AND '2025-02-21';

Performance Impact: The index allows the database to quickly locate all records within the date range without scanning the entire table.

Location-Specific Queries

SELECT *
FROM clima
WHERE localidad = 'Bogota'
ORDER BY fechaHora DESC;

Performance Impact: The index enables direct access to a specific location’s data and maintains chronological ordering.

Combined Location and Time Queries

SELECT *
FROM clima
WHERE localidad = 'Medellin'
AND fechaHora BETWEEN '2025-01-01' AND '2025-12-31';

Performance Impact: Optimal use case - both index columns are utilized for maximum efficiency.

When to Use

✅ Use this strategy when:

Queries frequently filter by location and/or time
Real-time data access is required
Storage space is not a primary concern
Write performance is acceptable (indexes add overhead to INSERT operations)

❌ Avoid when:

The table is write-heavy with minimal reads
Queries rarely filter on these columns
Storage space is severely constrained

Strategy 2: Partitioning by Date

Purpose

Improve query performance and enable efficient data lifecycle management by physically separating data by year.

Implementation

ALTER TABLE clima
PARTITION BY RANGE (YEAR(fechaHora)) (
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION p2025 VALUES LESS THAN (2026),
    PARTITION pMax VALUES LESS THAN MAXVALUE
);

Partitioning Scheme

Type: RANGE partitioning Partitioning Key: YEAR(fechaHora) Partitions:

p2024 - All records from year 2024
p2025 - All records from year 2025
pMax - All records from 2026 onwards

How It Works

Partition Pruning: When querying data for a specific year, MySQL only scans the relevant partition(s), ignoring others entirely.
Physical Separation: Each partition is stored separately, reducing I/O operations.
Maintenance: Old partitions can be archived, backed up, or dropped independently.

Performance Benefits

Year-Specific Queries

SELECT AVG(tempCelsius)
FROM clima
WHERE fechaHora BETWEEN '2025-01-01' AND '2025-12-31';

Optimization: Only partition p2025 is scanned, potentially reducing data scanned by 66% or more.

Multi-Year Queries

SELECT localidad, YEAR(fechaHora), AVG(tempCelsius)
FROM clima
WHERE YEAR(fechaHora) IN (2024, 2025)
GROUP BY localidad, YEAR(fechaHora);

Optimization: Partitions p2024 and p2025 are scanned in parallel, while pMax is pruned.

Data Lifecycle Management

-- Example: Archive 2024 data
ALTER TABLE clima EXCHANGE PARTITION p2024 WITH TABLE clima_archive_2024;

-- Example: Drop old data
ALTER TABLE clima DROP PARTITION p2024;

When to Use

✅ Use this strategy when:

Data has a strong temporal component
Queries typically filter by year, month, or date ranges
Historical data needs to be archived or purged periodically
Table size exceeds several million rows
Different time periods have different access patterns (recent data = hot, old data = cold)

❌ Avoid when:

Queries frequently span multiple years without date filters
Table size is small (less than 100K rows)
Partitioning key (year) has low cardinality for your use case
You need to frequently reorganize partition boundaries

Best Practices

Future-Proof: Always include a MAXVALUE partition to handle future dates
Combine with Indexing: Partitioning and indexing work together - partition first, then index within partitions
Monitor Growth: Plan to add new partitions as new years begin
Align with Business Logic: Partition boundaries should match reporting periods

Strategy 3: Dimensional Modeling

Purpose

Normalize location data to reduce redundancy and improve referential integrity. This follows a star/snowflake schema design pattern common in data warehousing.

Implementation

CREATE TABLE dim_localidad (
    id INT AUTO_INCREMENT PRIMARY KEY,
    localidad VARCHAR(100),
    pais VARCHAR(100)
);

Schema Transformation

Before (Denormalized):

clima: id, localidad, pais, tempCelsius, fechaHora, ...

After (Normalized):

dim_localidad: id, localidad, pais
clima: id, localidad_id (FK), tempCelsius, fechaHora, ...

Adding Foreign Key Relationship

-- Modify the main tabla to reference the dimension
ALTER TABLE clima
ADD COLUMN localidad_id INT,
ADD FOREIGN KEY (localidad_id) REFERENCES dim_localidad(id);

Benefits

1. Storage Efficiency

Denormalized:

‘Bogota’ + ‘Colombia’ stored in every row
1000 Bogota observations = 1000 × ~20 bytes = ~20KB redundant data

Normalized:

‘Bogota’ + ‘Colombia’ stored once in dim_localidad
1000 Bogota observations = 1000 × 4 bytes (INT) = 4KB
Savings: ~80% reduction in location data storage

2. Data Integrity

Consistency: Location names cannot have spelling variations (‘Bogota’ vs ‘Bogotá’)
Referential Integrity: Cannot insert weather data for non-existent locations
Centralized Updates: Changing ‘Colombia’ to ‘República de Colombia’ requires one update

3. Query Flexibility

-- Enriched location queries
SELECT 
    l.localidad,
    l.pais,
    AVG(c.tempCelsius) as avg_temp
FROM clima c
JOIN dim_localidad l ON c.localidad_id = l.id
GROUP BY l.localidad, l.pais;

4. Extended Dimensions

-- Easy to add location metadata
ALTER TABLE dim_localidad
ADD COLUMN region VARCHAR(50),
ADD COLUMN latitud DECIMAL(10,8),
ADD COLUMN longitud DECIMAL(11,8),
ADD COLUMN altitud INT;

When to Use

✅ Use this strategy when:

Location data is highly repetitive (same cities measured frequently)
You need to maintain location metadata (coordinates, region, timezone)
Data quality and consistency are critical
Building a data warehouse or analytics platform
Multiple fact tables reference the same locations

❌ Avoid when:

Dataset is small with few repeated locations
Simplicity is prioritized over normalization
JOIN overhead would impact real-time query performance
Location information rarely changes and has no additional attributes

Extended Dimensional Model

For a full data warehouse implementation:

-- Time dimension
CREATE TABLE dim_tiempo (
    id INT AUTO_INCREMENT PRIMARY KEY,
    fecha DATE,
    anio INT,
    mes INT,
    dia INT,
    trimestre INT,
    nombre_mes VARCHAR(20),
    dia_semana VARCHAR(20)
);

-- Fact table (clima as fact)
ALTER TABLE clima
ADD COLUMN localidad_id INT,
ADD COLUMN tiempo_id INT,
ADD FOREIGN KEY (localidad_id) REFERENCES dim_localidad(id),
ADD FOREIGN KEY (tiempo_id) REFERENCES dim_tiempo(id);

Combining Strategies

These three optimization strategies are not mutually exclusive. For optimal performance:

Recommended Combination

Dimensional Modeling - Normalize location data first
Partitioning - Partition the fact table by year
Indexing - Create indexes on partition pruning keys and foreign keys

-- Step 1: Create dimension
CREATE TABLE dim_localidad (...);

-- Step 2: Modify fact table
ALTER TABLE clima ADD COLUMN localidad_id INT;

-- Step 3: Partition
ALTER TABLE clima PARTITION BY RANGE (YEAR(fechaHora)) (...);

-- Step 4: Index
CREATE INDEX idx_localidad_fecha ON clima(localidad_id, fechaHora);

Performance Impact Matrix

Query Type	Indexing	Partitioning	Dimensional
Single location, date range	⭐⭐⭐	⭐⭐	⭐
All locations, single year	⭐⭐	⭐⭐⭐	⭐
Aggregations by region	⭐	⭐	⭐⭐⭐
Recent data (hot queries)	⭐⭐⭐	⭐⭐	⭐⭐
Historical analysis	⭐⭐	⭐⭐⭐	⭐⭐

Monitoring and Validation

-- Check index usage
SHOW INDEX FROM clima;

-- Verify partitioning
SELECT PARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_NAME = 'clima';

-- Analyze query plans
EXPLAIN SELECT * FROM clima
WHERE localidad = 'Bogota'
AND fechaHora BETWEEN '2025-01-01' AND '2025-12-31';

ETL Pipelines

Database Design

​Overview

​Strategy 1: Strategic Indexing

​Purpose

​Implementation

​Index Design

​Use Cases

​Time Range Queries

​Location-Specific Queries

​Combined Location and Time Queries

​When to Use

​Strategy 2: Partitioning by Date

​Purpose

​Implementation

​Partitioning Scheme

​How It Works

​Performance Benefits

​Year-Specific Queries

​Multi-Year Queries

​Data Lifecycle Management

​When to Use

​Best Practices

​Strategy 3: Dimensional Modeling

​Purpose

​Implementation

​Schema Transformation

​Adding Foreign Key Relationship

​Benefits

​1. Storage Efficiency

​2. Data Integrity

​3. Query Flexibility

​4. Extended Dimensions

​When to Use

​Extended Dimensional Model

​Combining Strategies

​Recommended Combination

​Performance Impact Matrix

​Monitoring and Validation

Build docs developers (and LLMs) love

Overview

Strategy 1: Strategic Indexing

Purpose

Implementation

Index Design

Use Cases

Time Range Queries

Location-Specific Queries

Combined Location and Time Queries

When to Use

Strategy 2: Partitioning by Date

Purpose

Implementation

Partitioning Scheme

How It Works

Performance Benefits

Year-Specific Queries

Multi-Year Queries

Data Lifecycle Management

When to Use

Best Practices

Strategy 3: Dimensional Modeling

Purpose

Implementation

Schema Transformation

Adding Foreign Key Relationship

Benefits

1. Storage Efficiency

2. Data Integrity

3. Query Flexibility

4. Extended Dimensions

When to Use

Extended Dimensional Model

Combining Strategies

Recommended Combination

Performance Impact Matrix

Monitoring and Validation