Skip to main content

Dataknow Technical Assessment

A complete data engineering project demonstrating ETL pipeline development, advanced SQL database design, and machine learning fraud detection with 98.8% AUC-ROC accuracy.

Quick Start

1

Explore the Project Structure

Review the project structure to understand the organization of notebooks, SQL files, and test outputs.
2

Review ETL Pipelines

Examine the data loading and transformation processes in the ETL section, including parsing OFEI power generation data.
3

Study Database Design

Analyze the SQL database implementation with advanced features like indexing, partitioning, and window functions in the Database section.
4

Understand ML Model

Learn about the Random Forest fraud detection model achieving 95% accuracy with balanced classes in the ML section.

Project Areas

ETL Pipelines

Custom parsers for Colombian energy market data (OFEI), multi-source data integration, and Pandas-based transformations.

Database Design

MySQL weather database with constraints, indexing strategies, partitioning, dimensional modeling, and window functions.

ML Fraud Detection

Random Forest classifier with 98.8% AUC-ROC, handling imbalanced classes, feature engineering, and model evaluation metrics.

Key Highlights

  • Custom text file parser for OFEI energy generation data
  • Multi-file merge operations with master data integration
  • Horizontal aggregation across 24 hourly columns
  • Data filtering by agent and plant type (Hydro/Thermal)
  • Comprehensive schema with constraints and foreign keys
  • Three optimization strategies: indexing, partitioning, dimensional modeling
  • Temperature delta calculations using window functions (LAG)
  • Celsius to Fahrenheit conversions with daily aggregations
  • Random Forest with 200 estimators and balanced class weights
  • 95% classification accuracy (96% precision, 98% recall for non-fraud)
  • 98.8% AUC-ROC score demonstrating excellent separation
  • Proper handling of categorical variables with one-hot encoding
  • Train/validation split with stratification to maintain class distribution

Technical Stack

Data Processing

  • Python 3.x with Pandas for data manipulation
  • NumPy for numerical operations
  • Custom text parsing with Latin-1 encoding support

Machine Learning

  • Scikit-learn for modeling and evaluation
  • Random Forest ensemble method
  • Confusion matrix, classification report, ROC-AUC metrics

Database

  • MySQL relational database
  • Advanced SQL with window functions
  • Partitioning by date ranges

Development

  • Jupyter Notebooks for interactive development
  • Google Colab for cloud execution
  • CSV output generation for results

Project Deliverables

1

Prueba_tecnica_Dataknow.ipynb

Main Jupyter notebook containing all code implementations for data loading, manipulation, and ML modeling with detailed comments.
2

prueba_SQL_Dataknow.sql

Complete SQL script with database creation, table definitions, data insertion, optimization strategies, and analytical queries.
3

test_evaluado.csv

Fraud detection predictions for 100 test cases with transaction IDs and binary fraud labels (0/1).

Ready to Explore?

Start with the Quick Start Guide or dive directly into the Project Structure to understand how everything fits together.

Build docs developers (and LLMs) love