Overview
This quickstart guide will help you explore and understand the data engineering technical test project. The project demonstrates proficiency in ETL pipelines, database design, and machine learning.

Exploring the Project
Review the Project Structure
Start by understanding the organization of the project files:
- Prueba_tecnica_Dataknow.ipynb - Main Jupyter notebook with all implementations
- prueba_SQL_Dataknow.sql - SQL schema and optimization strategies
- test_evaluado.csv - Fraud detection model predictions
Understand the Data Sources
The project works with three types of data:

Power Plant Data (OFEI)
- Text file with generation data for 305 power plants
- 24 hourly generation values per plant
- Agent-based organization
Weather Data
- Colombian cities meteorological observations
- Temperature, cloud coverage, UV index, wind speed
- Time-series data requiring aggregation
Fraud Detection Data
- Training and test CSV files
- Multiple numeric and categorical features
- Binary classification target
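Parsing the OFEI text file is the custom-parser part of the pipeline. The sketch below assumes an illustrative layout (an `AGENTE:` header line, then one line per plant with 24 comma-separated hourly values); the real file may differ and may need explicit encoding handling (e.g. `open(path, encoding="latin-1")`) for Spanish-language text.

```python
from io import StringIO

# Illustrative OFEI-like sample: agent header lines group the plants that
# follow them; each plant line carries 24 hourly generation values.
sample = StringIO(
    "AGENTE: EMGESA\n"
    "GUAVIO, " + ", ".join(["10.0"] * 24) + "\n"
    "AGENTE: OTRO\n"
    "PLANTA_X, " + ", ".join(["5.0"] * 24) + "\n"
)

def parse_ofei(handle):
    """Group plants by agent: {plant: (agent, [24 hourly floats])}."""
    plants, agent = {}, None
    for raw in handle:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("AGENTE:"):
            # New agent block: subsequent plants belong to this agent.
            agent = line.split(":", 1)[1].strip()
        else:
            name, *values = (part.strip() for part in line.split(","))
            plants[name] = (agent, [float(v) for v in values])
    return plants

plants = parse_ofei(sample)
```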
Run the ETL Pipeline
The ETL pipeline demonstrates data loading and transformation. Explore the full implementation in Data Loading.
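The horizontal aggregation over the 24 hourly columns can be sketched with pandas; the column names (`Hora_00` … `Hora_23`) and plant names here are assumptions for illustration.

```python
import pandas as pd

# Illustrative frame: one row per plant, one column per hour of the day.
df = pd.DataFrame(
    [[1.0] * 24, [2.0] * 24],
    index=["GUAVIO", "PAGUA"],
    columns=[f"Hora_{h:02d}" for h in range(24)],
)

# Aggregate across columns (axis=1) rather than down rows.
hour_cols = [c for c in df.columns if c.startswith("Hora_")]
df["total_diario"] = df[hour_cols].sum(axis=1)
df["promedio"] = df[hour_cols].mean(axis=1)
```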
Review Database Design
The SQL component showcases production-ready database design. Learn about Optimization Strategies, including indexing, partitioning, and dimensional modeling.
Examine the ML Model
The fraud detection model delivers strong results (98.8% AUC-ROC). Dive deeper into the Model Overview and Evaluation.
Key Components
ETL Pipelines
Custom parsers for Colombian energy sector data with multi-source integration
Database Design
MySQL implementation with advanced optimization and window functions
ML Model
Random Forest classifier with 98.8% AUC-ROC for fraud detection
Running the Project
Google Colab (Recommended)
The project is designed for Google Colab: open Prueba_tecnica_Dataknow.ipynb in Colab and run the cells in order.

Local Environment

To run locally, install the dependencies listed in Requirements, then launch the notebook with Jupyter.

SQL Database Setup
Create the Database
Run the Schema
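As a sketch of executing a schema file from code, the snippet below uses Python's stdlib sqlite3 module as a stand-in for the project's MySQL target; the DDL is illustrative and is not the contents of prueba_SQL_Dataknow.sql.

```python
import sqlite3

# Illustrative schema in the spirit of the project: a location dimension,
# an observations fact table, and a composite index.
schema = """
CREATE TABLE IF NOT EXISTS localidad (
    id INTEGER PRIMARY KEY,
    nombre TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS observacion (
    id INTEGER PRIMARY KEY,
    localidad_id INTEGER NOT NULL REFERENCES localidad(id),
    fechaHora TEXT NOT NULL,
    temperatura REAL
);
CREATE INDEX IF NOT EXISTS idx_obs ON observacion (localidad_id, fechaHora);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)  # runs all statements in the script

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
)]
```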
Execute the SQL file prueba_SQL_Dataknow.sql against the database to create the schema and optimization structures.

Project Results
ETL Pipeline Output
- 305 power plants parsed from OFEI file
- 16 active plants after filtering (EMGESA, Hydro/Thermal)
- 24 hourly columns aggregated for analysis
Database Features
- Strategic indexing on (localidad, fechaHora)
- RANGE partitioning by year
- Dimensional modeling for location data
- Window functions for temperature delta calculations
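The temperature-delta pattern can be sketched with `LAG()`; the example below uses SQLite's window functions (stdlib sqlite3) as a stand-in for MySQL, and the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observacion (localidad TEXT, fechaHora TEXT, temperatura REAL)"
)
conn.executemany(
    "INSERT INTO observacion VALUES (?, ?, ?)",
    [
        ("Bogota", "2023-01-01 00:00", 14.0),
        ("Bogota", "2023-01-01 01:00", 13.5),
        ("Bogota", "2023-01-01 02:00", 13.1),
        ("Medellin", "2023-01-01 00:00", 22.0),
        ("Medellin", "2023-01-01 01:00", 21.4),
    ],
)

# LAG() pairs each reading with the previous one per city, ordered by time,
# so the delta is NULL for the first reading in each partition.
deltas = conn.execute(
    """
    SELECT localidad, fechaHora, temperatura,
           temperatura - LAG(temperatura) OVER (
               PARTITION BY localidad ORDER BY fechaHora
           ) AS delta
    FROM observacion
    ORDER BY localidad, fechaHora
    """
).fetchall()
```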
ML Model Performance
| Metric | Value |
|---|---|
| Accuracy | 95% |
| AUC-ROC | 98.8% |
| Precision (Non-fraud) | 96% |
| Recall (Non-fraud) | 98% |
| Precision (Fraud) | 92% |
| Recall (Fraud) | 88% |
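The per-class metrics above all derive from confusion-matrix counts. A small worked example with hypothetical counts (not the project's actual confusion matrix) shows the arithmetic:

```python
# Hypothetical counts for a binary fraud classifier; fraud = positive class.
tp, fn, fp, tn = 44, 6, 4, 146

precision_fraud = tp / (tp + fp)            # 44/48 ≈ 0.917
recall_fraud = tp / (tp + fn)               # 44/50 = 0.88
precision_ok = tn / (tn + fn)               # of predicted non-fraud, truly non-fraud
recall_ok = tn / (tn + fp)                  # of actual non-fraud, caught as such
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 190/200 = 0.95
```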
Next Steps
Project Structure
Understand the complete file organization and implementation structure
Requirements
Review technical requirements and dependencies
Python Examples
Explore code examples and implementation patterns
SQL Queries
See SQL query examples and optimization techniques
Technical Highlights
Data Engineering Skills
- Custom text file parsing with encoding handling
- Multi-source data integration (Excel, TXT, CSV)
- Pandas merge operations with data quality checks
- Horizontal aggregation across multiple columns
- Feature engineering and preprocessing pipelines
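A merge with a built-in quality check can be sketched with pandas' `indicator` flag; the key and column names below are illustrative, not the project's actual schema.

```python
import pandas as pd

plantas = pd.DataFrame(
    {"planta": ["GUAVIO", "PAGUA"], "agente": ["EMGESA", "EMGESA"]}
)
clima = pd.DataFrame({"planta": ["GUAVIO"], "temperatura": [14.0]})

# indicator=True adds a _merge column flagging where each row came from,
# which makes unmatched keys easy to audit after the join.
merged = plantas.merge(clima, on="planta", how="left", indicator=True)
sin_clima = merged[merged["_merge"] == "left_only"]  # plants missing weather
```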
Database Expertise
- Comprehensive schema design with constraints
- Three optimization strategies implementation
- Window functions for analytical queries
- Data type selection and normalization
- Foreign key relationships and referential integrity
Machine Learning Proficiency
- Imbalanced classification problem solving
- Missing value imputation strategies
- Categorical encoding techniques
- Model evaluation with multiple metrics
- Threshold optimization for business requirements
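Threshold optimization for a business constraint can be sketched in pure Python: sweep candidate cutoffs over predicted fraud probabilities and keep the one that maximizes precision subject to a minimum recall. The probabilities, labels, and thresholds below are hypothetical.

```python
# Hypothetical model outputs: predicted fraud probability and true label.
probs = [0.95, 0.80, 0.70, 0.40, 0.30, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0]  # 1 = fraud

def precision_recall(threshold):
    """Precision and recall for the fraud class at a given cutoff."""
    preds = [int(p >= threshold) for p in probs]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Among cutoffs meeting a minimum recall, pick the most precise one.
best = max(
    (t for t in [0.2, 0.35, 0.5, 0.75] if precision_recall(t)[1] >= 0.66),
    key=lambda t: precision_recall(t)[0],
)
```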
This project was completed as part of a technical assessment for the Data Engineer position at Dataknow. All implementations are production-ready and follow industry best practices.