File Organization
Main Jupyter Notebook
Prueba_tecnica_Dataknow.ipynb
Interactive notebook containing all data processing and machine learning implementations
Notebook Sections
The notebook is organized into three major sections:
1. Data Loading (CARGA DE INFORMACIÓN)
Purpose: Parse and load Colombian energy market data from the OFEI format.
Key Features:
- Custom text file parser for OFEI1204.txt
- Handles Latin-1 encoding for special characters
- Extracts agent information and plant generation data
- Processes 24-hour generation columns
- Creates a structured DataFrame with 305 power plants
Output: DataFrame with agents such as AES CHIVOR, EMGESA, and CELSIA and their hourly power generation in MW
The parser specifically filters for lines containing ", D," to identify power generation records (type D = declared generation).
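The filtering idea can be sketched as follows. This is a minimal illustration, not the notebook's actual parser: the field layout after the ", D," marker is an assumption for demonstration purposes.

```python
# Minimal sketch of the type-D line filter described above.
# The comma-separated field layout is an assumption; the real
# OFEI1204.txt format may differ.

def parse_declared_lines(lines):
    """Keep only type-D (declared generation) records and split their fields."""
    records = []
    for line in lines:
        if ", D," in line:
            # Assumed layout: plant name, record type, then hourly MW values
            fields = [f.strip() for f in line.split(",")]
            plant, record_type, hourly = fields[0], fields[1], fields[2:]
            records.append({"plant": plant, "type": record_type,
                            "hourly_mw": [float(h) for h in hourly]})
    return records

# The real file would be opened with the Latin-1 encoding noted above:
# with open("OFEI1204.txt", encoding="latin-1") as f:
#     records = parse_declared_lines(f)

sample = ["CHIVOR, D, 100.0, 110.5, 120.0", "CHIVOR, P, 90.0"]
print(parse_declared_lines(sample))
```

Lines of other record types (e.g. "P" in the sample) are silently skipped, mirroring the filter described above.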
2. Data Manipulation (MANIPULACIÓN DE DATOS)
Purpose: Integrate multiple data sources and perform complex transformations.
Data Sources:
- Master Data: Datos Maestros VF.xlsx (plant metadata with agent names, plant codes, and types)
- dDEC Data: dDEC1204.TXT (declared generation data with 24 hourly columns)
Processing Steps:
1. Load Master Data
2. Filter by Criteria
   - Agents: EMGESA, EMGESA S.A.
   - Plant Types: H (Hydro), T (Thermal)
3. Merge Datasets
   - Join master data with the dDEC file on plant name
   - Text normalization (strip, uppercase) for accurate matching
4. Calculate Totals
   - Sum all 24 hourly columns horizontally
   - Filter plants with generation > 0
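The steps above can be sketched with pandas on toy data. Column names (AGENTE, PLANTA, TIPO, Hora 1, Hora 2) are assumptions for illustration; the actual files have 24 hourly columns.

```python
import pandas as pd

# Toy stand-ins for the master data and dDEC file (assumed column names)
master = pd.DataFrame({
    "AGENTE": ["EMGESA", "EMGESA S.A.", "CELSIA"],
    "PLANTA": ["GUAVIO ", "betania", "ZONA"],
    "TIPO": ["H", "H", "T"],
})
ddec = pd.DataFrame({
    "PLANTA": ["GUAVIO", "BETANIA", "ZONA"],
    "Hora 1": [100.0, 50.0, 0.0],
    "Hora 2": [120.0, 60.0, 0.0],
})

# Step 2: filter by agent and plant type
mask = master["AGENTE"].isin(["EMGESA", "EMGESA S.A."]) & master["TIPO"].isin(["H", "T"])
filtered = master[mask].copy()

# Step 3: normalize plant names (strip, uppercase) so the join keys match
filtered["PLANTA"] = filtered["PLANTA"].str.strip().str.upper()
ddec["PLANTA"] = ddec["PLANTA"].str.strip().str.upper()
merged = filtered.merge(ddec, on="PLANTA", how="inner")

# Step 4: sum hourly columns horizontally, keep plants with generation > 0
hour_cols = [c for c in merged.columns if c.startswith("Hora")]
merged["TOTAL"] = merged[hour_cols].sum(axis=1)
result = merged[merged["TOTAL"] > 0]
print(result[["PLANTA", "TOTAL"]])
```

The normalization step is what makes "GUAVIO " and "betania" join correctly despite whitespace and case differences.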
6. ML Analytical Modeling (PRUEBA DE MODELACIÓN ANALÍTICA)
Purpose: Build a fraud detection classifier for financial transactions.

Data Preprocessing:
- Feature Selection
  - Use columns common to the train and test sets
  - Exclude 'id' and the target variable 'FRAUDE'
- Missing Value Handling
  - Numeric columns: fill with the mean
  - Categorical columns: fill with "Desconocido" (Unknown)
- Encoding
  - One-hot encoding for categorical variables
  - Drop the first category to avoid multicollinearity
  - Align test columns with the training set

Model Architecture:
- 80/20 train-validation split
- Stratified sampling to maintain the fraud proportion
- Balanced class weights to handle imbalance

Classification Report:
- Class 0 (Non-Fraud): 96% precision, 98% recall, 97% F1-score
- Class 1 (Fraud): 92% precision, 88% recall, 90% F1-score
- Overall Accuracy: 95%
- AUC-ROC: 0.988 (excellent discrimination)

Predictions:
- 100 test cases evaluated
- Binary predictions (0 = legitimate, 1 = fraud)
- Threshold: 0.5 probability cutoff
- Results saved to test_evaluado.csv
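The preprocessing and training flow can be condensed into the sketch below, run on synthetic data. Feature names other than id and FRAUDE (monto, canal) are invented for illustration; the real dataset's columns will differ.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data (invented feature names)
rng = np.random.default_rng(42)
n = 200
train = pd.DataFrame({
    "id": np.arange(n),
    "monto": rng.normal(100, 30, n),
    "canal": rng.choice(["web", "app", None], n),
    "FRAUDE": rng.choice([0, 1], n, p=[0.8, 0.2]),
})

# Exclude 'id' and the target
X = train.drop(columns=["id", "FRAUDE"])
y = train["FRAUDE"]

# Missing values: mean for numeric, "Desconocido" for categorical
num_cols = X.select_dtypes(include="number").columns
cat_cols = X.columns.difference(num_cols)
X[num_cols] = X[num_cols].fillna(X[num_cols].mean())
X[cat_cols] = X[cat_cols].fillna("Desconocido")

# One-hot encode, dropping the first category to avoid multicollinearity
X = pd.get_dummies(X, columns=list(cat_cols), drop_first=True)

# Stratified 80/20 split, balanced class weights, fixed random_state
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X_tr, y_tr)

# Binary predictions with the default 0.5 probability cutoff
preds = (clf.predict_proba(X_val)[:, 1] >= 0.5).astype(int)
print(preds[:10])
```

For the real test set, the same dummy columns would be aligned to the training set (e.g. with `reindex(columns=X.columns, fill_value=0)`) before predicting.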
SQL Script File
prueba_SQL_Dataknow.sql
Complete database implementation for weather data management system
SQL Script Sections
The SQL file demonstrates advanced database design patterns.
Point 1: Database Creation and Schema Design
Database: meteorologia
Main Table: clima
Schema Definition:
- CHECK constraint for cloud coverage validation
- UNIQUE constraint preventing duplicate readings for the same location/time
- DEFAULT value for the country column
- Appropriate data types with precision (DECIMAL for measurements)
Sample Data:
- 24 weather records across Colombian cities
- Cities: Bogota, Medellin, Envigado, Sabaneta, Cali, Barranquilla, Cartagena
- Date range: 2024-2026
- Realistic weather measurements (temperature, UV, pressure, wind speed)
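The constraint ideas above can be illustrated with Python's built-in sqlite3 module. The actual script targets MySQL, so the table below is a simplified sketch, not the original schema; the column set is an assumption based on the description.

```python
import sqlite3

# Simplified sketch of the constraints described above (sqlite3, not MySQL)
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE clima (
        localidad   TEXT NOT NULL,
        fechaHora   TEXT NOT NULL,
        pais        TEXT DEFAULT 'Colombia',                      -- DEFAULT value
        temperatura REAL,
        nubosidad   INTEGER CHECK (nubosidad BETWEEN 0 AND 100),  -- cloud coverage
        UNIQUE (localidad, fechaHora)       -- no duplicate readings per place/time
    )
""")

con.execute("INSERT INTO clima (localidad, fechaHora, temperatura, nubosidad) "
            "VALUES ('Bogota', '2024-01-01 00:00', 14.5, 80)")

# The UNIQUE constraint rejects a second reading for the same place and time
try:
    con.execute("INSERT INTO clima (localidad, fechaHora, temperatura, nubosidad) "
                "VALUES ('Bogota', '2024-01-01 00:00', 15.0, 70)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

pais = con.execute("SELECT pais FROM clima").fetchone()[0]
print(pais)  # the DEFAULT filled in the country column
```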
Point 2: Query Optimization Strategies
Three different approaches for improving query performance:

Option 1: Strategic Indexing
- Composite index on frequently queried columns
- Speeds up date range and location-based queries
- Benefits: fast lookups, minimal storage overhead

Option 2: Table Partitioning
- Year-based partitioning for large datasets
- Benefits: partition pruning, easier data management
- Ideal for time-series data with date-based queries

Option 3: Dimensional Modeling
- Separate dimension table for location attributes
- Reduces redundancy in the fact table
- Benefits: normalized design, efficient joins
Each optimization strategy targets different use cases. Indexing works for most scenarios, partitioning scales for time-series data, and dimensional modeling supports analytical workloads.
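The composite-index idea (Option 1) can be demonstrated with sqlite3; the real script uses MySQL, and the index and column names here are assumptions. EXPLAIN QUERY PLAN confirms the optimizer picks the index for a location-plus-date-range lookup.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clima (localidad TEXT, fechaHora TEXT, temperatura REAL)")

# Composite index on the two most frequently filtered columns
con.execute("CREATE INDEX idx_clima_loc_fecha ON clima (localidad, fechaHora)")

# Ask the planner how it would execute a typical lookup
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT * FROM clima
    WHERE localidad = 'Bogota'
      AND fechaHora BETWEEN '2024-01-01' AND '2024-12-31'
""").fetchall()
print(plan)
```

The plan output should mention the index, showing the equality filter on localidad and the range filter on fechaHora both being served by the composite key.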
Point 3: Data Aggregation and Transformation
New Table: clima_diario_f (daily climate in Fahrenheit)
Transformation Query:
- Group hourly readings into daily summaries
- Convert Celsius to Fahrenheit: (°C × 9/5) + 32
- Aggregate functions: AVG for continuous metrics, MAX for categorical
- UNIQUE constraint on (localidad, fecha) prevents duplicates
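The aggregation and conversion logic can be sketched with pandas on toy data; column names are assumptions based on the description above.

```python
import pandas as pd

# Toy hourly readings (assumed column names)
clima = pd.DataFrame({
    "localidad": ["Bogota", "Bogota", "Medellin"],
    "fechaHora": pd.to_datetime(
        ["2024-01-01 08:00", "2024-01-01 14:00", "2024-01-01 12:00"]),
    "temperatura_c": [10.0, 20.0, 25.0],
})

# Group hourly readings into daily summaries per location
clima["fecha"] = clima["fechaHora"].dt.date
diario = (clima.groupby(["localidad", "fecha"], as_index=False)
               .agg(temp_prom_c=("temperatura_c", "mean")))

# Convert Celsius to Fahrenheit: (°C × 9/5) + 32
diario["temp_prom_f"] = diario["temp_prom_c"] * 9 / 5 + 32
print(diario)
```

Bogota's two readings (10 °C, 20 °C) average to 15 °C, i.e. 59 °F, matching the formula above.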
Point 4: Window Functions for Delta Calculations
Purpose: Calculate temperature changes between consecutive readings.
Window Function Mechanics:
- PARTITION BY localidad: separate calculations per city
- ORDER BY fechaHora: chronological ordering
- LAG(): access the previous row's temperature value
- Subquery pattern required for UPDATE with window functions in MySQL
- First reading per location has a NULL delta (no previous value)
Business Value:
- Identifies sudden temperature changes for weather alerts
- Enables trend analysis and anomaly detection
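The LAG() mechanics can be demonstrated with sqlite3 (SQLite 3.25+ supports window functions); the original targets MySQL, and the table and column names here are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clima (localidad TEXT, fechaHora TEXT, temperatura REAL)")
con.executemany("INSERT INTO clima VALUES (?, ?, ?)", [
    ("Bogota", "2024-01-01 08:00", 10.0),
    ("Bogota", "2024-01-01 14:00", 16.0),
    ("Cali",   "2024-01-01 09:00", 28.0),
])

# Delta vs. the previous reading, computed per city in chronological order
rows = con.execute("""
    SELECT localidad, fechaHora, temperatura,
           temperatura - LAG(temperatura) OVER (
               PARTITION BY localidad ORDER BY fechaHora
           ) AS delta
    FROM clima
    ORDER BY localidad, fechaHora
""").fetchall()
for row in rows:
    print(row)
```

The first reading in each partition (Bogota 08:00, Cali 09:00) has a NULL delta because LAG() finds no previous row, exactly as noted above.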
Output File
test_evaluado.csv
Fraud detection model predictions for submission
File Structure
Format: CSV with header row
Columns:
- id: Transaction identifier (BIGINT)
- FRAUDE: Binary prediction (0 = legitimate, 1 = fraudulent)
- Total records: 100 test cases
- Fraud predictions: Mix of 0s and 1s based on model probability threshold (0.5)
- File size: 1,295 bytes
This file represents the final deliverable for the fraud detection challenge, containing predictions for previously unseen transactions using the trained Random Forest model.
Technical Implementation Details
Development Environment
Platform
Google Colab
- Cloud-based Jupyter environment
- GPU/TPU acceleration available
- Direct GitHub integration
- Google Drive mounting for data access
Python Version
Python 3.x
- Standard Colab runtime
- Pre-installed data science libraries
- Compatibility with scikit-learn 1.x
Key Dependencies
Data Processing:
- pandas: DataFrame operations and CSV handling
- numpy: numerical computations (implicit via pandas)
Machine Learning:
- sklearn.ensemble.RandomForestClassifier: ensemble model
- sklearn.model_selection.train_test_split: data splitting
- sklearn.metrics: model evaluation (confusion_matrix, classification_report, roc_auc_score)
File I/O:
- Standard Python file handling with open() and context managers
- pd.read_csv() and pd.read_excel() for structured data
- pd.to_csv() for output generation
Code Quality Features
Documentation
- Inline comments explaining each code section
- Spanish comments for clarity (matching project context)
- Step-by-step logic description
- Purpose statements for each major operation
Error Handling
- Encoding specification (latin-1) for special characters
- Column alignment between train and test (reindex)
- Missing value handling for both numeric and categorical
- Type validation with .copy() to avoid SettingWithCopyWarning
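A small illustration of why the .copy() pattern matters: assigning into a filtered slice without copying can trigger pandas' SettingWithCopyWarning and leave it ambiguous whether the original frame is modified. The data below is invented for demonstration.

```python
import pandas as pd

df = pd.DataFrame({"tipo": ["H", "T", "H"], "total": [10.0, 0.0, 5.0]})

# An explicit copy makes the filtered frame independent of df,
# so later assignments are unambiguous and warning-free
hidro = df[df["tipo"] == "H"].copy()
hidro["total"] = hidro["total"] * 2

print(hidro["total"].tolist())  # the copy was updated
print(df["total"].tolist())     # the original is untouched
```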
Reproducibility
- random_state=42 for consistent results
- Stratified splitting to maintain class distribution
- Fixed column ordering with drop_first=True
- Version-controlled via GitHub repository
Repository Information
GitHub Repository
View the complete source code and version history
- Owner: Jose Antonio Arteaga Bolaños
- Purpose: Technical assessment for Data Engineer position at Dataknow
- Branch: master
- Colab Integration: Direct notebook opening via colab.research.google.com link
- Phone: 3007586135
- Email: [email protected]
Next Steps
ETL Pipeline
Explore the data loading implementation
Database Design
Review SQL schema and optimization
ML Model
Understand the fraud detection model