Dataknow Technical Assessment
A complete data engineering project demonstrating ETL pipeline development, advanced SQL database design, and machine learning fraud detection with 98.8% AUC-ROC accuracy.
Quick Start
Explore the Project Structure
Review the project structure to understand the organization of notebooks, SQL files, and test outputs.
Review ETL Pipelines
Examine the data loading and transformation processes in the ETL section, including parsing OFEI power generation data.
Study Database Design
Analyze the SQL database implementation with advanced features like indexing, partitioning, and window functions in the Database section.
Understand ML Model
Learn about the Random Forest fraud detection model achieving 95% accuracy with balanced classes in the ML section.
Project Areas
ETL Pipelines
Custom parsers for Colombian energy market data (OFEI), multi-source data integration, and Pandas-based transformations.
Database Design
MySQL weather database with constraints, indexing strategies, partitioning, dimensional modeling, and window functions.
ML Fraud Detection
Random Forest classifier with 98.8% AUC-ROC, handling imbalanced classes, feature engineering, and model evaluation metrics.
Key Highlights
Advanced Data Processing
Advanced Data Processing
- Custom text file parser for OFEI energy generation data
- Multi-file merge operations with master data integration
- Horizontal aggregation across 24 hourly columns
- Data filtering by agent and plant type (Hydro/Thermal)
Production-Ready SQL
Production-Ready SQL
- Comprehensive schema with constraints and foreign keys
- Three optimization strategies: indexing, partitioning, dimensional modeling
- Temperature delta calculations using window functions (LAG)
- Celsius to Fahrenheit conversions with daily aggregations
High-Performance ML Model
High-Performance ML Model
- Random Forest with 200 estimators and balanced class weights
- 95% classification accuracy (96% precision, 98% recall for non-fraud)
- 98.8% AUC-ROC score demonstrating excellent separation
- Proper handling of categorical variables with one-hot encoding
- Train/validation split with stratification to maintain class distribution
Technical Stack
Data Processing
- Python 3.x with Pandas for data manipulation
- NumPy for numerical operations
- Custom text parsing with Latin-1 encoding support
Machine Learning
- Scikit-learn for modeling and evaluation
- Random Forest ensemble method
- Confusion matrix, classification report, ROC-AUC metrics
Database
- MySQL relational database
- Advanced SQL with window functions
- Partitioning by date ranges
Development
- Jupyter Notebooks for interactive development
- Google Colab for cloud execution
- CSV output generation for results
Project Deliverables
Prueba_tecnica_Dataknow.ipynb
Main Jupyter notebook containing all code implementations for data loading, manipulation, and ML modeling with detailed comments.
prueba_SQL_Dataknow.sql
Complete SQL script with database creation, table definitions, data insertion, optimization strategies, and analytical queries.
Ready to Explore?
Start with the Quick Start Guide or dive directly into the Project Structure to understand how everything fits together.