
Overview

This project is designed to run in Google Colab, but can also be executed in any Python environment with the required dependencies installed. The system demonstrates ETL pipelines, SQL database design, and machine learning capabilities.

Python Environment

Required Python Version

  • Python 3.7 or higher

Core Dependencies

The project uses the following Python libraries:

pandas

Data manipulation and analysis library for structured data operations

scikit-learn

Machine learning library for the Random Forest classifier and evaluation metrics

Detailed Library Requirements

# Data manipulation
import pandas as pd

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

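These imports typically fit together in a flow like the following — a self-contained sketch on synthetic data (the real notebook trains on the fraud dataset with its own features and hyperparameters):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

# Synthetic stand-in for the fraud dataset: 200 rows, 4 numeric features.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "f4"])
y = (X["f1"] + X["f2"] > 0).astype(int)  # placeholder binary target

# Hold out a stratified test split, train the classifier, then evaluate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
cm = confusion_matrix(y_test, pred)
print(classification_report(y_test, pred))
```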
Installation

pip install pandas scikit-learn

Database Requirements

MySQL Server

The SQL component targets MySQL 5.7 or higher, but several features require newer versions:
  • Partitioning: RANGE partitioning by date (MySQL 5.1+)
  • Window Functions: the LAG function for delta calculations (MySQL 8.0+)
  • Constraints: CHECK constraints (MySQL 8.0.16+)
If using MySQL 5.7, you may need to compute deltas without window functions (e.g., with self-joins or user variables) and enforce CHECK constraints at the application level.

Database Features Used

| Feature | MySQL Version | Usage |
| --- | --- | --- |
| RANGE Partitioning | 5.1+ | Year-based data partitioning |
| Window Functions (LAG) | 8.0+ | Temperature delta calculation |
| CHECK Constraints | 8.0.16+ | Data validation for coberturaNubes |
| Composite Indexes | All | Performance optimization |
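For reference, the LAG-based temperature delta in the table can be reproduced in pandas with `shift()` — a minimal sketch assuming hypothetical `fecha` and `temperatura` columns (the real table schema may differ):

```python
import pandas as pd

# Hypothetical daily readings; real column names and data may differ.
df = pd.DataFrame({
    "fecha": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    "temperatura": [20.0, 22.5, 21.0],
})
df = df.sort_values("fecha")

# Equivalent of SQL: temperatura - LAG(temperatura) OVER (ORDER BY fecha)
df["delta"] = df["temperatura"] - df["temperatura"].shift(1)
```

The first row's delta is NaN, matching LAG's NULL for the first row of a window.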

Data Files

Input Data Files

The project requires the following data files:
Format: Custom text format with agent-based sections
Content: Power plant generation data
  • Agent names (e.g., “AGENTE: AES CHIVOR”)
  • Plant records with 24 hourly generation values
  • 305 plant records total
Size: ~70 KB
Location: /content/drive/MyDrive/Prueba_tecnica/Datos3/
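A minimal parsing sketch for this agent-based format, assuming "AGENTE:" header lines followed by comma-separated plant rows (the actual OFEI record layout may differ, and real rows carry 24 hourly values):

```python
# Hypothetical sample; the real file has 305 plant records with 24 hourly values each.
sample = """AGENTE: AES CHIVOR
CHIVOR, 10, 12, 11, 9
AGENTE: OTRO AGENTE
PLANTA X, 5, 6, 7, 8
"""

records = []
current_agent = None
for line in sample.splitlines():
    line = line.strip()
    if not line:
        continue
    if line.startswith("AGENTE:"):
        # Section header: all following plant rows belong to this agent.
        current_agent = line.split(":", 1)[1].strip()
    else:
        # Plant row: name followed by hourly generation values.
        parts = [p.strip() for p in line.split(",")]
        plant, hourly = parts[0], [float(v) for v in parts[1:]]
        records.append({"agente": current_agent, "planta": plant, "valores": hourly})
```

In the real notebook, the loop would iterate over the file opened with latin-1 encoding rather than an in-memory string.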
Format: Excel workbook
Content: Master data for power plants
  • Agent names and plant identifiers
  • Plant types (Hidro, Termo, Filo, Menor)
  • Central names for matching
Columns Used:
  • Nombre visible Agente
  • AGENTE (OFEI)
  • CENTRAL (dDEC, dSEGDES, dPRU…)
  • Tipo de central (Hidro, Termo, Filo, Menor)
Format: CSV text file
Content: Generation declaration data
  • Central names
  • 24 hourly generation values (H1 to H24)
Encoding: latin-1
Format: CSV files
Content: Fraud detection datasets
  • Training data with FRAUDE target variable
  • Test data without target (for predictions)
  • Multiple numeric and categorical features
Size: the training set contains 2,965 records (used for the validation shown in the results)
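Loading these datasets typically means separating the FRAUDE target from the features — a sketch using an in-memory stand-in for train.csv (the real file has far more rows and its own feature columns):

```python
import pandas as pd
from io import StringIO

# Stand-in for train.csv; feat_a and feat_b are placeholder column names.
train_csv = StringIO("feat_a,feat_b,FRAUDE\n1.0,0.2,0\n0.5,1.1,1\n")
train = pd.read_csv(train_csv)

X = train.drop(columns=["FRAUDE"])  # feature matrix
y = train["FRAUDE"]                 # binary target variable
```

The test file has no FRAUDE column, so only the feature-selection step applies to it.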

Google Colab Setup

Mounting Google Drive

The project is configured for Google Colab with data stored in Google Drive:
from google.colab import drive
drive.mount('/content/drive')

File Paths

All file paths in the notebook reference Google Drive:
# Example paths
ruta = "/content/drive/MyDrive/Prueba_tecnica/Datos3/OFEI1204.txt"
train = pd.read_csv("/content/drive/MyDrive/Prueba_tecnica/Datos3/train.csv")
If running locally, update all file paths to match your local directory structure.
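One way to keep the notebook portable is to derive every path from a single base directory, switching on whether Colab is available — a sketch using the Datos3 directory shown above (the local `./data` fallback is an assumption):

```python
import sys
from pathlib import Path

# Use Google Drive when running in Colab, a local ./data directory otherwise.
if "google.colab" in sys.modules:
    BASE = Path("/content/drive/MyDrive/Prueba_tecnica/Datos3")
else:
    BASE = Path("./data")

ruta = BASE / "OFEI1204.txt"
```

With this pattern, only `BASE` changes between environments instead of every hard-coded path.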

Local Environment Setup

Running Outside Google Colab

To run the project in a local Python environment:

1. Install dependencies

pip install pandas scikit-learn jupyter openpyxl

2. Download data files

Place all required data files in a local directory (e.g., ./data/)

3. Update file paths

Modify all file paths in the notebook to reference your local data directory:
# Change from:
ruta = "/content/drive/MyDrive/Prueba_tecnica/Datos3/OFEI1204.txt"

# To:
ruta = "./data/OFEI1204.txt"

4. Launch Jupyter

jupyter notebook Prueba_tecnica_Dataknow.ipynb

MySQL Database Setup

Creating the Database

CREATE DATABASE meteorologia;
USE meteorologia;

Setting Permissions

Ensure your MySQL user has the following privileges:
GRANT CREATE, ALTER, INSERT, UPDATE, SELECT, DROP
ON meteorologia.* 
TO 'your_user'@'localhost';

FLUSH PRIVILEGES;

Partitioning Prerequisites

RANGE partitioning requires the partition key to be part of the primary key or included in all unique indexes.

System Resources

Minimum Requirements

  • RAM: 4 GB (8 GB recommended for ML training)
  • Storage: 500 MB for data files and outputs
  • CPU: Any modern multi-core processor (Random Forest benefits from multiple cores)

Google Colab Resources

Google Colab's free tier typically provides:
  • ~12 GB of RAM
  • A free GPU (not required for this project)
  • ~100 GB of temporary disk storage
These allocations vary by session and are not guaranteed.

Encoding Considerations

The project uses latin-1 encoding for text files to handle Spanish characters:
with open(ruta, "r", encoding="latin-1") as file:
    # ...
If you encounter encoding errors, verify that your data files use latin-1 or UTF-8 encoding.
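When the encoding is uncertain, one defensive pattern is to try UTF-8 first and fall back to latin-1 — the order matters because latin-1 accepts every byte sequence and so never fails. A sketch (the temporary file stands in for your data file at `ruta`):

```python
import os
import tempfile

# Create a small latin-1 file with Spanish characters for demonstration.
fd, ruta = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="latin-1") as f:
    f.write("GENERACIÓN, AÑO\n")

def read_text(path):
    """Try UTF-8 first; fall back to latin-1, which decodes any byte."""
    for enc in ("utf-8", "latin-1"):
        try:
            with open(path, "r", encoding=enc) as file:
                return file.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path}")

text, used = read_text(ruta)
os.remove(ruta)
```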

Verification

Testing Your Environment

Run this script to verify all dependencies are installed:
import sys

# Check Python version
print(f"Python version: {sys.version}")

# Check pandas
import pandas as pd
print(f"pandas version: {pd.__version__}")

# Check scikit-learn
import sklearn
print(f"scikit-learn version: {sklearn.__version__}")

# Check if all required components are available
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

print("\n✓ All dependencies are installed correctly!")
Expected output:
Python version: 3.x.x
pandas version: 1.x.x or higher
scikit-learn version: 0.24.x or higher

✓ All dependencies are installed correctly!

Next Steps

Once your environment meets all requirements:
  1. Review the Quickstart Guide to run the project
  2. Explore ETL Pipelines documentation
  3. Understand Database Design decisions
  4. Study the Fraud Detection Model
