Dataknow Technical Assessment

A complete data engineering project demonstrating ETL pipeline development, advanced SQL database design, and machine learning fraud detection with 98.8% AUC-ROC accuracy.

Get Started View ETL Pipeline

Quick Start

Explore the Project Structure

Review the project structure to understand the organization of notebooks, SQL files, and test outputs.

Review ETL Pipelines

Examine the data loading and transformation processes in the ETL section, including parsing OFEI power generation data.

Study Database Design

Analyze the SQL database implementation with advanced features like indexing, partitioning, and window functions in the Database section.

Understand ML Model

Learn about the Random Forest fraud detection model achieving 95% accuracy with balanced classes in the ML section.

Project Areas

ETL Pipelines

Custom parsers for Colombian energy market data (OFEI), multi-source data integration, and Pandas-based transformations.

Database Design

MySQL weather database with constraints, indexing strategies, partitioning, dimensional modeling, and window functions.

ML Fraud Detection

Random Forest classifier with 98.8% AUC-ROC, handling imbalanced classes, feature engineering, and model evaluation metrics.

Key Highlights

Advanced Data Processing

Custom text file parser for OFEI energy generation data
Multi-file merge operations with master data integration
Horizontal aggregation across 24 hourly columns
Data filtering by agent and plant type (Hydro/Thermal)

Production-Ready SQL

Comprehensive schema with constraints and foreign keys
Three optimization strategies: indexing, partitioning, dimensional modeling
Temperature delta calculations using window functions (LAG)
Celsius to Fahrenheit conversions with daily aggregations

High-Performance ML Model

Random Forest with 200 estimators and balanced class weights
95% classification accuracy (96% precision, 98% recall for non-fraud)
98.8% AUC-ROC score demonstrating excellent separation
Proper handling of categorical variables with one-hot encoding
Train/validation split with stratification to maintain class distribution

Technical Stack

Data Processing

Python 3.x with Pandas for data manipulation
NumPy for numerical operations
Custom text parsing with Latin-1 encoding support

Machine Learning

Scikit-learn for modeling and evaluation
Random Forest ensemble method
Confusion matrix, classification report, ROC-AUC metrics

Database

MySQL relational database
Advanced SQL with window functions
Partitioning by date ranges

Development

Jupyter Notebooks for interactive development
Google Colab for cloud execution
CSV output generation for results

Project Deliverables

Prueba_tecnica_Dataknow.ipynb

Main Jupyter notebook containing all code implementations for data loading, manipulation, and ML modeling with detailed comments.

prueba_SQL_Dataknow.sql

Complete SQL script with database creation, table definitions, data insertion, optimization strategies, and analytical queries.

test_evaluado.csv

Fraud detection predictions for 100 test cases with transaction IDs and binary fraud labels (0/1).

Ready to Explore?

Start with the Quick Start Guide or dive directly into the Project Structure to understand how everything fits together.

Project Structure

⌘I

Auto-generate your docs

Dataknow Technical Assessment
Quick Start
Project Areas
Key Highlights
Technical Stack
Project Deliverables
Ready to Explore?

Build docs developers (and LLMs) love

Get started for free Talk to us

Introduction

Getting Started

Data Engineering Technical Test

Dataknow Technical Assessment

Quick Start

Project Areas

ETL Pipelines

Database Design

ML Fraud Detection

Key Highlights

Technical Stack

Data Processing

Machine Learning

Database

Development

Project Deliverables

Ready to Explore?

Build docs developers (and LLMs) love

Introduction

Getting Started

​Dataknow Technical Assessment

​Quick Start

​Project Areas

ETL Pipelines

Database Design

ML Fraud Detection

​Key Highlights

​Technical Stack

Data Processing

Machine Learning

Database

Development

​Project Deliverables

​Ready to Explore?

Build docs developers (and LLMs) love

Dataknow Technical Assessment

Quick Start

Project Areas

Key Highlights

Technical Stack

Project Deliverables

Ready to Explore?