
Overview

This quickstart guide will help you explore and understand the data engineering technical test project. The project demonstrates proficiency in ETL pipelines, database design, and machine learning.

Exploring the Project

1. Review the Project Structure

Start by understanding the organization of the project files:
  • Prueba_tecnica_Dataknow.ipynb - Main Jupyter notebook with all implementations
  • prueba_SQL_Dataknow.sql - SQL schema and optimization strategies
  • test_evaluado.csv - Fraud detection model predictions
Learn more in the Project Structure guide.
2. Understand the Data Sources

The project works with three types of data:

Power Plant Data (OFEI)
  • Text file with generation data for 305 power plants
  • 24 hourly generation values per plant
  • Agent-based organization
Weather Data
  • Meteorological observations from Colombian cities
  • Temperature, cloud coverage, UV index, wind speed
  • Time-series data requiring aggregation
Fraud Detection Dataset
  • Training and test CSV files
  • Multiple numeric and categorical features
  • Binary classification target
See Data Schemas for detailed structure.
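The weather observations are time-series data that need aggregation before they can be joined with the other sources. A minimal sketch of that aggregation, using made-up sample rows in place of the real workbook:

```python
import pandas as pd

# Hypothetical weather observations standing in for the real data
df_clima = pd.DataFrame({
    "localidad": ["Bogota", "Bogota", "Medellin", "Medellin"],
    "fechaHora": pd.to_datetime([
        "2023-04-12 06:00", "2023-04-12 18:00",
        "2023-04-12 06:00", "2023-04-12 18:00",
    ]),
    "tempCelsius": [9.5, 14.2, 18.0, 22.4],
})

# Aggregate to one row per city per day: mean, min, max temperature
daily = (
    df_clima
    .groupby(["localidad", df_clima["fechaHora"].dt.date])["tempCelsius"]
    .agg(["mean", "min", "max"])
    .reset_index()
)
print(daily)
```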
3. Run the ETL Pipeline

The ETL pipeline demonstrates data loading and transformation:
import pandas as pd

# Parse OFEI text file
data = []
agente_actual = None

with open("OFEI1204.txt", "r", encoding="latin-1") as file:
    for line in file:
        if line.startswith("AGENTE:"):
            agente_actual = line.replace("AGENTE:", "").strip()
        elif ", D," in line:
            partes = line.split(",")
            planta = partes[0].strip()
            tipo = partes[1].strip()
            horas = [float(h.strip()) for h in partes[2:26]]
            registro = [agente_actual, tipo, planta] + horas
            data.append(registro)

# Create DataFrame
columnas = ["Agente", "Tipo", "Planta"] + [f"Hora_{i}" for i in range(1, 25)]
df_final = pd.DataFrame(data, columns=columnas)
Explore the full implementation in Data Loading.
4. Review Database Design

The SQL component showcases production-ready database design:
CREATE TABLE clima (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    localidad VARCHAR(100) NOT NULL,
    pais VARCHAR(50) NOT NULL DEFAULT 'Colombia',
    tempCelsius DECIMAL(5,2) NOT NULL,
    fechaHora DATETIME NOT NULL,
    coberturaNubes VARCHAR(20) CHECK (coberturaNubes IN ('Minima','Parcial','Total')) NOT NULL,
    CONSTRAINT uk_localidad_fecha UNIQUE (localidad, fechaHora)
);
Learn about Optimization Strategies including indexing, partitioning, and dimensional modeling.
5. Examine the ML Model

The fraud detection model performs strongly on the test set:
from sklearn.ensemble import RandomForestClassifier

# Initialize model
model = RandomForestClassifier(
    n_estimators=200,
    random_state=42,
    class_weight='balanced'
)

# Train model
model.fit(X_train, y_train)

# Results:
# - Accuracy: 95%
# - AUC-ROC: 98.8%
# - Precision (fraud): 92%
# - Recall (fraud): 88%
Dive deeper into the Model Overview and Evaluation.
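The metrics above come from scikit-learn's evaluation utilities. A self-contained sketch of how such numbers are produced, on synthetic data (the project's real features are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset standing in for the real fraud features
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=200, random_state=42, class_weight="balanced"
)
model.fit(X_train, y_train)

# AUC-ROC is computed from the predicted probability of the fraud class
proba = model.predict_proba(X_test)[:, 1]
print(f"AUC-ROC: {roc_auc_score(y_test, proba):.3f}")

# Per-class precision and recall, as in the results listed above
print(classification_report(y_test, model.predict(X_test)))
```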

Key Components

ETL Pipelines

Custom parsers for Colombian energy sector data with multi-source integration

Database Design

MySQL implementation with advanced optimization and window functions

ML Model

Random Forest classifier with 98.8% AUC-ROC for fraud detection

Running the Project

The project is designed for Google Colab:
1. Open in Colab

Click the “Open in Colab” badge in the notebook or navigate to:
https://colab.research.google.com/github/Jocesitho/Prueba_tecnica_dataknow_Jose_Arteaga/blob/master/Prueba_tecnica_Dataknow.ipynb
2. Mount Google Drive

from google.colab import drive
drive.mount('/content/drive')
Ensure data files are in the correct Drive path.
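To catch path problems before running the pipeline, it may help to check that the expected files are present. A small sketch (the Drive folder name is an assumption; adjust it to your layout):

```python
from pathlib import Path

def check_files(base, names):
    """Return {filename: exists?} for the expected data files."""
    base = Path(base)
    return {name: (base / name).exists() for name in names}

# "dataknow" is an assumed folder name; use wherever you uploaded the files
status = check_files(
    "/content/drive/MyDrive/dataknow",
    ["OFEI1204.txt", "test_evaluado.csv"],
)
print(status)
```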
3. Run All Cells

Execute cells sequentially to see:
  • Data loading and parsing
  • DataFrame transformations
  • ML model training
  • Evaluation results

Local Environment

To run locally:
pip install pandas scikit-learn jupyter openpyxl
Update file paths to reference your local data directory instead of Google Drive paths.
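One low-friction way to switch between Colab and local runs is to resolve every file path from a single base directory (the directory names here are assumptions):

```python
import os
from pathlib import Path

# Use the Drive path on Colab, a local data folder otherwise
# ("dataknow" and "./data" are assumed names; adjust to your setup)
IN_COLAB = os.path.exists("/content/drive")
BASE_DIR = (
    Path("/content/drive/MyDrive/dataknow") if IN_COLAB else Path("./data")
)

OFEI_PATH = BASE_DIR / "OFEI1204.txt"
print(OFEI_PATH)
```

With this pattern, only `BASE_DIR` changes between environments; the rest of the notebook reads from the same path variables.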

SQL Database Setup

Create the Database

CREATE DATABASE meteorologia;
USE meteorologia;

Run the Schema

Execute the SQL file:
mysql -u your_user -p meteorologia < prueba_SQL_Dataknow.sql
Or run sections individually in your MySQL client.

Project Results

ETL Pipeline Output

  • 305 power plants parsed from OFEI file
  • 16 active plants after filtering (EMGESA, Hydro/Thermal)
  • 24 hourly columns aggregated for analysis
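The filtering and horizontal-aggregation steps above can be sketched on a toy version of the parsed DataFrame (values are made up; the real one has 305 plants and 24 hourly columns):

```python
import pandas as pd

# Toy version of the parsed OFEI DataFrame
columnas = ["Agente", "Tipo", "Planta", "Hora_1", "Hora_2", "Hora_3"]
df = pd.DataFrame([
    ["EMGESA", "D", "GUAVIO",    100.0, 110.0, 120.0],
    ["EMGESA", "D", "TERMOZIPA",  40.0,  45.0,  50.0],
    ["OTRO",   "D", "PLANTA_X",   10.0,  10.0,  10.0],
], columns=columnas)

# Keep only EMGESA plants
df_emgesa = df[df["Agente"] == "EMGESA"].copy()

# Horizontal aggregation: total generation across the hourly columns
hora_cols = [c for c in df.columns if c.startswith("Hora_")]
df_emgesa["Total_Dia"] = df_emgesa[hora_cols].sum(axis=1)
print(df_emgesa[["Planta", "Total_Dia"]])
```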

Database Features

  • Strategic indexing on (localidad, fechaHora)
  • RANGE partitioning by year
  • Dimensional modeling for location data
  • Window functions for temperature delta calculations
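The window-function delta (a LAG over fechaHora, partitioned by localidad, in the SQL file) has a direct pandas analogue, sketched here on toy data for comparison:

```python
import pandas as pd

df = pd.DataFrame({
    "localidad": ["Bogota", "Bogota", "Bogota"],
    "fechaHora": pd.to_datetime(
        ["2023-04-12 06:00", "2023-04-12 12:00", "2023-04-12 18:00"]),
    "tempCelsius": [9.5, 16.0, 13.0],
})

# Equivalent of:
#   tempCelsius - LAG(tempCelsius) OVER (PARTITION BY localidad
#                                        ORDER BY fechaHora)
df = df.sort_values(["localidad", "fechaHora"])
df["delta"] = df.groupby("localidad")["tempCelsius"].diff()
print(df)
```

The first row of each city has no previous observation, so its delta is NaN, matching the NULL that LAG produces in SQL.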

ML Model Performance

Metric                  Value
Accuracy                95%
AUC-ROC                 98.8%
Precision (Non-fraud)   96%
Recall (Non-fraud)      98%
Precision (Fraud)       92%
Recall (Fraud)          88%

Next Steps

Project Structure

Understand the complete file organization and implementation structure

Requirements

Review technical requirements and dependencies

Python Examples

Explore code examples and implementation patterns

SQL Queries

See SQL query examples and optimization techniques

Technical Highlights

  • Custom text file parsing with encoding handling
  • Multi-source data integration (Excel, TXT, CSV)
  • Pandas merge operations with data quality checks
  • Horizontal aggregation across multiple columns
  • Feature engineering and preprocessing pipelines
  • Comprehensive schema design with constraints
  • Three optimization strategies implementation
  • Window functions for analytical queries
  • Data type selection and normalization
  • Foreign key relationships and referential integrity
  • Imbalanced classification problem solving
  • Missing value imputation strategies
  • Categorical encoding techniques
  • Model evaluation with multiple metrics
  • Threshold optimization for business requirements
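The threshold-optimization item above can be illustrated as follows: instead of the default 0.5 cutoff on predicted probabilities, the threshold can be lowered to catch more fraud (higher recall), typically at some cost in precision. A sketch on synthetic data, not the project's real features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset standing in for the real fraud features
X, y = make_classification(
    n_samples=2000, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=200, random_state=42, class_weight="balanced"
).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Compare the default cutoff against a lower, recall-oriented one
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_test, pred):.2f}, "
          f"recall={recall_score(y_test, pred):.2f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases; the business question is how much precision loss is acceptable per extra fraud case caught.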
This project was completed as part of a technical assessment for the Data Engineer position at Dataknow. All implementations are production-ready and follow industry best practices.
