
Overview

This project is designed to run in Google Colab, but can also be executed in any Python environment with the required dependencies installed. The system demonstrates ETL pipelines, SQL database design, and machine learning capabilities.

Python Environment

Required Python Version

  • Python 3.7 or higher

Core Dependencies

The project uses the following Python libraries:

pandas

Data manipulation and analysis library for structured data operations

scikit-learn

Machine learning library for the Random Forest classifier and evaluation metrics

Detailed Library Requirements

# Data manipulation
import pandas as pd

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

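These imports typically fit together in a flow like the following — a self-contained sketch on synthetic data (the real notebook trains on the fraud dataset with its own features and hyperparameters):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

# Synthetic stand-in for the fraud dataset: 200 rows, 4 numeric features.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "f4"])
y = (X["f1"] + X["f2"] > 0).astype(int)  # placeholder binary target

# Hold out a stratified test split, train the classifier, then evaluate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
cm = confusion_matrix(y_test, pred)
print(classification_report(y_test, pred))
```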
Installation

pip install pandas scikit-learn

Database Requirements

MySQL Server

The SQL component targets MySQL 5.7 or higher, but several features require newer versions:
  • Partitioning: RANGE partitioning by date (MySQL 5.1+)
  • Window Functions: the LAG function for delta calculations (MySQL 8.0+)
  • Constraints: CHECK constraints (MySQL 8.0.16+)
If using MySQL 5.7, you may need to compute deltas without window functions (e.g., with self-joins or user variables) and enforce CHECK constraints at the application level.

Database Features Used

| Feature | MySQL Version | Usage |
| --- | --- | --- |
| RANGE Partitioning | 5.1+ | Year-based data partitioning |
| Window Functions (LAG) | 8.0+ | Temperature delta calculation |
| CHECK Constraints | 8.0.16+ | Data validation for coberturaNubes |
| Composite Indexes | All | Performance optimization |
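For reference, the LAG-based temperature delta in the table can be reproduced in pandas with `shift()` — a minimal sketch assuming hypothetical `fecha` and `temperatura` columns (the real table schema may differ):

```python
import pandas as pd

# Hypothetical daily readings; real column names and data may differ.
df = pd.DataFrame({
    "fecha": pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]),
    "temperatura": [20.0, 22.5, 21.0],
})
df = df.sort_values("fecha")

# Equivalent of SQL: temperatura - LAG(temperatura) OVER (ORDER BY fecha)
df["delta"] = df["temperatura"] - df["temperatura"].shift(1)
```

The first row's delta is NaN, matching LAG's NULL for the first row of a window.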

Data Files

Input Data Files

The project requires the following data files:
Format: Custom text format with agent-based sections
Content: Power plant generation data
  • Agent names (e.g., “AGENTE: AES CHIVOR”)
  • Plant records with 24 hourly generation values
  • 305 plant records total
Size: ~70 KB
Location: /content/drive/MyDrive/Prueba_tecnica/Datos3/
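A minimal parsing sketch for this agent-based format, assuming "AGENTE:" header lines followed by comma-separated plant rows (the actual OFEI record layout may differ, and real rows carry 24 hourly values):

```python
# Hypothetical sample; the real file has 305 plant records with 24 hourly values each.
sample = """AGENTE: AES CHIVOR
CHIVOR, 10, 12, 11, 9
AGENTE: OTRO AGENTE
PLANTA X, 5, 6, 7, 8
"""

records = []
current_agent = None
for line in sample.splitlines():
    line = line.strip()
    if not line:
        continue
    if line.startswith("AGENTE:"):
        # Section header: all following plant rows belong to this agent.
        current_agent = line.split(":", 1)[1].strip()
    else:
        # Plant row: name followed by hourly generation values.
        parts = [p.strip() for p in line.split(",")]
        plant, hourly = parts[0], [float(v) for v in parts[1:]]
        records.append({"agente": current_agent, "planta": plant, "valores": hourly})
```

In the real notebook, the loop would iterate over the file opened with latin-1 encoding rather than an in-memory string.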
Format: Excel workbook
Content: Master data for power plants
  • Agent names and plant identifiers
  • Plant types (Hidro, Termo, Filo, Menor)
  • Central names for matching
Columns Used:
  • Nombre visible Agente
  • AGENTE (OFEI)
  • CENTRAL (dDEC, dSEGDES, dPRU…)
  • Tipo de central (Hidro, Termo, Filo, Menor)
Format: CSV text file
Content: Generation declaration data
  • Central names
  • 24 hourly generation values (H1 to H24)
Encoding: latin-1
Format: CSV files
Content: Fraud detection datasets
  • Training data with FRAUDE target variable
  • Test data without target (for predictions)
  • Multiple numeric and categorical features
Size: the training set contains 2,965 records (used for the validation shown in the results)
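Loading these datasets typically means separating the FRAUDE target from the features — a sketch using an in-memory stand-in for train.csv (the real file has far more rows and its own feature columns):

```python
import pandas as pd
from io import StringIO

# Stand-in for train.csv; feat_a and feat_b are placeholder column names.
train_csv = StringIO("feat_a,feat_b,FRAUDE\n1.0,0.2,0\n0.5,1.1,1\n")
train = pd.read_csv(train_csv)

X = train.drop(columns=["FRAUDE"])  # feature matrix
y = train["FRAUDE"]                 # binary target variable
```

The test file has no FRAUDE column, so only the feature-selection step applies to it.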

Google Colab Setup

Mounting Google Drive

The project is configured for Google Colab with data stored in Google Drive:
from google.colab import drive
drive.mount('/content/drive')

File Paths

All file paths in the notebook reference Google Drive:
# Example paths
ruta = "/content/drive/MyDrive/Prueba_tecnica/Datos3/OFEI1204.txt"
train = pd.read_csv("/content/drive/MyDrive/Prueba_tecnica/Datos3/train.csv")
If running locally, update all file paths to match your local directory structure.
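One way to keep the notebook portable is to derive every path from a single base directory, switching on whether Colab is available — a sketch using the Datos3 directory shown above (the local `./data` fallback is an assumption):

```python
import sys
from pathlib import Path

# Use Google Drive when running in Colab, a local ./data directory otherwise.
if "google.colab" in sys.modules:
    BASE = Path("/content/drive/MyDrive/Prueba_tecnica/Datos3")
else:
    BASE = Path("./data")

ruta = BASE / "OFEI1204.txt"
```

With this pattern, only `BASE` changes between environments instead of every hard-coded path.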

Local Environment Setup

Running Outside Google Colab

To run the project in a local Python environment:

1. Install dependencies

pip install pandas scikit-learn jupyter openpyxl

2. Download data files

Place all required data files in a local directory (e.g., ./data/)

3. Update file paths

Modify all file paths in the notebook to reference your local data directory:
# Change from:
ruta = "/content/drive/MyDrive/Prueba_tecnica/Datos3/OFEI1204.txt"

# To:
ruta = "./data/OFEI1204.txt"

4. Launch Jupyter

jupyter notebook Prueba_tecnica_Dataknow.ipynb

MySQL Database Setup

Creating the Database

CREATE DATABASE meteorologia;
USE meteorologia;

Setting Permissions

Ensure your MySQL user has the following privileges:
GRANT CREATE, ALTER, INSERT, UPDATE, SELECT, DROP
ON meteorologia.* 
TO 'your_user'@'localhost';

FLUSH PRIVILEGES;

Partitioning Prerequisites

RANGE partitioning requires the partition key to be part of the primary key or included in all unique indexes.

System Resources

Minimum Requirements

  • RAM: 4 GB (8 GB recommended for ML training)
  • Storage: 500 MB for data files and outputs
  • CPU: Any modern multi-core processor (Random Forest benefits from multiple cores)

Google Colab Resources

Google Colab's free tier typically provides:
  • ~12 GB of RAM
  • A free GPU (not required for this project)
  • ~100 GB of temporary disk storage
These allocations vary by session and are not guaranteed.

Encoding Considerations

The project uses latin-1 encoding for text files to handle Spanish characters:
with open(ruta, "r", encoding="latin-1") as file:
    # ...
If you encounter encoding errors, verify that your data files use latin-1 or UTF-8 encoding.
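When the encoding is uncertain, one defensive pattern is to try UTF-8 first and fall back to latin-1 — the order matters because latin-1 accepts every byte sequence and so never fails. A sketch (the temporary file stands in for your data file at `ruta`):

```python
import os
import tempfile

# Create a small latin-1 file with Spanish characters for demonstration.
fd, ruta = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="latin-1") as f:
    f.write("GENERACIÓN, AÑO\n")

def read_text(path):
    """Try UTF-8 first; fall back to latin-1, which decodes any byte."""
    for enc in ("utf-8", "latin-1"):
        try:
            with open(path, "r", encoding=enc) as file:
                return file.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path}")

text, used = read_text(ruta)
os.remove(ruta)
```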

Verification

Testing Your Environment

Run this script to verify all dependencies are installed:
import sys

# Check Python version
print(f"Python version: {sys.version}")

# Check pandas
import pandas as pd
print(f"pandas version: {pd.__version__}")

# Check scikit-learn
import sklearn
print(f"scikit-learn version: {sklearn.__version__}")

# Check if all required components are available
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

print("\n✓ All dependencies are installed correctly!")
Expected output:
Python version: 3.x.x
pandas version: 1.x.x or higher
scikit-learn version: 0.24.x or higher

✓ All dependencies are installed correctly!

Next Steps

Once your environment meets all requirements:
  1. Review the Quickstart Guide to run the project
  2. Explore ETL Pipelines documentation
  3. Understand Database Design decisions
  4. Study the Fraud Detection Model
