
Overview

This quickstart guide will help you explore and understand the data engineering technical test project. The project demonstrates proficiency in ETL pipelines, database design, and machine learning.

Exploring the Project

1. Review the Project Structure

Start by understanding the organization of the project files:
  • Prueba_tecnica_Dataknow.ipynb - Main Jupyter notebook with all implementations
  • prueba_SQL_Dataknow.sql - SQL schema and optimization strategies
  • test_evaluado.csv - Fraud detection model predictions
Learn more in the Project Structure guide.
2. Understand the Data Sources

The project works with three types of data:

Power Plant Data (OFEI)
  • Text file with generation data for 305 power plants
  • 24 hourly generation values per plant
  • Agent-based organization
Weather Data
  • Meteorological observations from Colombian cities
  • Temperature, cloud coverage, UV index, wind speed
  • Time-series data requiring aggregation
Fraud Detection Dataset
  • Training and test CSV files
  • Multiple numeric and categorical features
  • Binary classification target
See Data Schemas for detailed structure.
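The weather observations are time-series data that need aggregation before they can be joined with the other sources. A minimal sketch of that aggregation, using made-up sample rows in place of the real workbook:

```python
import pandas as pd

# Hypothetical weather observations standing in for the real data
df_clima = pd.DataFrame({
    "localidad": ["Bogota", "Bogota", "Medellin", "Medellin"],
    "fechaHora": pd.to_datetime([
        "2023-04-12 06:00", "2023-04-12 18:00",
        "2023-04-12 06:00", "2023-04-12 18:00",
    ]),
    "tempCelsius": [9.5, 14.2, 18.0, 22.4],
})

# Aggregate to one row per city per day: mean, min, max temperature
daily = (
    df_clima
    .groupby(["localidad", df_clima["fechaHora"].dt.date])["tempCelsius"]
    .agg(["mean", "min", "max"])
    .reset_index()
)
print(daily)
```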
3. Run the ETL Pipeline

The ETL pipeline demonstrates data loading and transformation:
import pandas as pd

# Parse OFEI text file
data = []
agente_actual = None

with open("OFEI1204.txt", "r", encoding="latin-1") as file:
    for line in file:
        if line.startswith("AGENTE:"):
            agente_actual = line.replace("AGENTE:", "").strip()
        elif ", D," in line:
            partes = line.split(",")
            planta = partes[0].strip()
            tipo = partes[1].strip()
            horas = [float(h.strip()) for h in partes[2:26]]
            registro = [agente_actual, tipo, planta] + horas
            data.append(registro)

# Create DataFrame
columnas = ["Agente", "Tipo", "Planta"] + [f"Hora_{i}" for i in range(1, 25)]
df_final = pd.DataFrame(data, columns=columnas)
Explore the full implementation in Data Loading.
4. Review Database Design

The SQL component showcases production-ready database design:
CREATE TABLE clima (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    localidad VARCHAR(100) NOT NULL,
    pais VARCHAR(50) NOT NULL DEFAULT 'Colombia',
    tempCelsius DECIMAL(5,2) NOT NULL,
    fechaHora DATETIME NOT NULL,
    coberturaNubes VARCHAR(20) CHECK (coberturaNubes IN ('Minima','Parcial','Total')) NOT NULL,
    CONSTRAINT uk_localidad_fecha UNIQUE (localidad, fechaHora)
);
Learn about Optimization Strategies including indexing, partitioning, and dimensional modeling.
5. Examine the ML Model

The fraud detection model performs strongly on the test set:
from sklearn.ensemble import RandomForestClassifier

# Initialize model
model = RandomForestClassifier(
    n_estimators=200,
    random_state=42,
    class_weight='balanced'
)

# Train model
model.fit(X_train, y_train)

# Results:
# - Accuracy: 95%
# - AUC-ROC: 98.8%
# - Precision (fraud): 92%
# - Recall (fraud): 88%
Dive deeper into the Model Overview and Evaluation.
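The metrics above come from scikit-learn's evaluation utilities. A self-contained sketch of how such numbers are produced, on synthetic data (the project's real features are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset standing in for the real fraud features
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=200, random_state=42, class_weight="balanced"
)
model.fit(X_train, y_train)

# AUC-ROC is computed from the predicted probability of the fraud class
proba = model.predict_proba(X_test)[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_test, proba):.3f}")

# Per-class precision and recall, as in the results listed above
print(classification_report(y_test, model.predict(X_test)))
```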

Key Components

ETL Pipelines

Custom parsers for Colombian energy sector data with multi-source integration

Database Design

MySQL implementation with advanced optimization and window functions

ML Model

Random Forest classifier with 98.8% AUC-ROC for fraud detection

Running the Project

The project is designed for Google Colab:
1. Open in Colab

Click the “Open in Colab” badge in the notebook or navigate to:
https://colab.research.google.com/github/Jocesitho/Prueba_tecnica_dataknow_Jose_Arteaga/blob/master/Prueba_tecnica_Dataknow.ipynb
2. Mount Google Drive

from google.colab import drive
drive.mount('/content/drive')
Ensure data files are in the correct Drive path.
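To catch path problems before running the pipeline, it may help to check that the expected files are present. A small sketch (the Drive folder name is an assumption; adjust it to your layout):

```python
from pathlib import Path

def check_files(base, names):
    """Return {filename: exists?} for the expected data files."""
    base = Path(base)
    return {name: (base / name).exists() for name in names}

# "dataknow" is an assumed folder name; use wherever you uploaded the files
status = check_files(
    "/content/drive/MyDrive/dataknow",
    ["OFEI1204.txt", "test_evaluado.csv"],
)
print(status)
```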
3. Run All Cells

Execute cells sequentially to see:
  • Data loading and parsing
  • DataFrame transformations
  • ML model training
  • Evaluation results

Local Environment

To run locally:
pip install pandas scikit-learn jupyter openpyxl
Update file paths to reference your local data directory instead of Google Drive paths.
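One low-friction way to switch between Colab and local runs is to resolve every file path from a single base directory (the directory names here are assumptions):

```python
import os
from pathlib import Path

# Use the Drive path on Colab, a local data folder otherwise
# ("dataknow" and "./data" are assumed names; adjust to your setup)
IN_COLAB = os.path.exists("/content/drive")
BASE_DIR = (
    Path("/content/drive/MyDrive/dataknow") if IN_COLAB else Path("./data")
)

OFEI_PATH = BASE_DIR / "OFEI1204.txt"
print(OFEI_PATH)
```

With this pattern, only `BASE_DIR` changes between environments; the rest of the notebook reads from the same path variables.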

SQL Database Setup

Create the Database

CREATE DATABASE meteorologia;
USE meteorologia;

Run the Schema

Execute the SQL file:
mysql -u your_user -p meteorologia < prueba_SQL_Dataknow.sql
Or run sections individually in your MySQL client.

Project Results

ETL Pipeline Output

  • 305 power plants parsed from OFEI file
  • 16 active plants after filtering (EMGESA, Hydro/Thermal)
  • 24 hourly columns aggregated for analysis
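The filtering and horizontal-aggregation steps above can be sketched on a toy version of the parsed DataFrame (values are made up; the real one has 305 plants and 24 hourly columns):

```python
import pandas as pd

# Toy version of the parsed OFEI DataFrame
columnas = ["Agente", "Tipo", "Planta", "Hora_1", "Hora_2", "Hora_3"]
df = pd.DataFrame([
    ["EMGESA", "D", "GUAVIO",    100.0, 110.0, 120.0],
    ["EMGESA", "D", "TERMOZIPA",  40.0,  45.0,  50.0],
    ["OTRO",   "D", "PLANTA_X",   10.0,  10.0,  10.0],
], columns=columnas)

# Keep only EMGESA plants
df_emgesa = df[df["Agente"] == "EMGESA"].copy()

# Horizontal aggregation: total generation across the hourly columns
hora_cols = [c for c in df.columns if c.startswith("Hora_")]
df_emgesa["Total_Dia"] = df_emgesa[hora_cols].sum(axis=1)
print(df_emgesa[["Planta", "Total_Dia"]])
```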

Database Features

  • Strategic indexing on (localidad, fechaHora)
  • RANGE partitioning by year
  • Dimensional modeling for location data
  • Window functions for temperature delta calculations
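The window-function delta (a LAG over fechaHora, partitioned by localidad, in the SQL file) has a direct pandas analogue, sketched here on toy data for comparison:

```python
import pandas as pd

df = pd.DataFrame({
    "localidad": ["Bogota", "Bogota", "Bogota"],
    "fechaHora": pd.to_datetime(
        ["2023-04-12 06:00", "2023-04-12 12:00", "2023-04-12 18:00"]),
    "tempCelsius": [9.5, 16.0, 13.0],
})

# Equivalent of:
#   tempCelsius - LAG(tempCelsius) OVER (PARTITION BY localidad
#                                        ORDER BY fechaHora)
df = df.sort_values(["localidad", "fechaHora"])
df["delta"] = df.groupby("localidad")["tempCelsius"].diff()
print(df)
```

The first row of each city has no previous observation, so its delta is NaN, matching the NULL that LAG produces in SQL.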

ML Model Performance

Metric                  Value
Accuracy                95%
AUC-ROC                 98.8%
Precision (Non-fraud)   96%
Recall (Non-fraud)      98%
Precision (Fraud)       92%
Recall (Fraud)          88%

Next Steps

Project Structure

Understand the complete file organization and implementation structure

Requirements

Review technical requirements and dependencies

Python Examples

Explore code examples and implementation patterns

SQL Queries

See SQL query examples and optimization techniques

Technical Highlights

  • Custom text file parsing with encoding handling
  • Multi-source data integration (Excel, TXT, CSV)
  • Pandas merge operations with data quality checks
  • Horizontal aggregation across multiple columns
  • Feature engineering and preprocessing pipelines
  • Comprehensive schema design with constraints
  • Three optimization strategies implementation
  • Window functions for analytical queries
  • Data type selection and normalization
  • Foreign key relationships and referential integrity
  • Imbalanced classification problem solving
  • Missing value imputation strategies
  • Categorical encoding techniques
  • Model evaluation with multiple metrics
  • Threshold optimization for business requirements
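The threshold-optimization item above can be illustrated as follows: instead of the default 0.5 cutoff on predicted probabilities, the threshold can be lowered to catch more fraud (higher recall), typically at some cost in precision. A sketch on synthetic data, not the project's real features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset standing in for the real fraud features
X, y = make_classification(
    n_samples=2000, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=200, random_state=42, class_weight="balanced"
).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Compare the default cutoff against a lower, recall-oriented one
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_test, pred):.2f}, "
          f"recall={recall_score(y_test, pred):.2f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases; the business question is how much precision loss is acceptable per extra fraud case caught.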
This project was completed as part of a technical assessment for the Data Engineer position at Dataknow. All implementations are production-ready and follow industry best practices.
