Overview
This project is designed to run in Google Colab, but it can also be executed in any Python environment with the required dependencies installed. The system demonstrates ETL pipelines, SQL database design, and machine learning capabilities.
Python Environment
Required Python Version
- Python 3.7 or higher
Core Dependencies
The project uses the following Python libraries:
pandas
Data manipulation and analysis library for structured data operations
scikit-learn
Machine learning library for the Random Forest classifier and evaluation metrics
Detailed Library Requirements
Installation
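The original installation snippet is not included here; a typical install for the libraries above would be the following (openpyxl for Excel reading and mysql-connector-python for MySQL access are assumptions, not listed in the core dependencies):

```shell
# Core libraries named above, plus two assumed helpers:
# openpyxl (Excel engine for pandas) and mysql-connector-python (MySQL access).
pip install pandas scikit-learn openpyxl mysql-connector-python
```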
Database Requirements
MySQL Server
The SQL component requires MySQL 5.7 or higher with support for:
- Partitioning: RANGE partitioning by date
- Window Functions: LAG function for delta calculations
- Constraints: CHECK constraints (enforced in MySQL 8.0.16+)
If using MySQL 5.7, window functions are unavailable and CHECK constraints are parsed but not enforced, so you may need to compute deltas and validate data at the application level.
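As an illustration, a minimal application-level stand-in for the coberturaNubes check might look like this; the 0-100 range is an assumption (cloud cover as a percentage), so adjust it to match the constraint defined in the schema:

```python
def validar_cobertura_nubes(valor: float) -> float:
    """Application-level stand-in for the CHECK constraint on coberturaNubes.

    The 0-100 range is an assumption (percentage cloud cover); adjust to
    match the actual constraint in the schema.
    """
    if not 0 <= valor <= 100:
        raise ValueError(f"coberturaNubes fuera de rango: {valor}")
    return valor
```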
Database Features Used
| Feature | MySQL Version | Usage |
|---|---|---|
| RANGE Partitioning | 5.1+ | Year-based data partitioning |
| Window Functions (LAG) | 8.0+ | Temperature delta calculation |
| CHECK Constraints | 8.0.16+ | Data validation for coberturaNubes |
| Composite Indexes | All | Performance optimization |
Data Files
Input Data Files
The project requires the following data files:
OFEI1204.txt
Format: Custom text format with agent-based sections
Content: Power plant generation data
- Agent names (e.g., “AGENTE: AES CHIVOR”)
- Plant records with 24 hourly generation values
- 305 plant records total
Datos Maestros VF.xlsx
Path: /content/drive/MyDrive/Prueba_tecnica/Datos3/Datos Maestros VF.xlsx
Format: Excel workbook
Content: Master data for power plants
- Agent names and plant identifiers
- Plant types (Hidro, Termo, Filo, Menor)
- Central names for matching
Key columns:
- Nombre visible Agente
- AGENTE (OFEI)
- CENTRAL (dDEC, dSEGDES, dPRU…)
- Tipo de central (Hidro, Termo, Filo, Menor)
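Loading the workbook might look like the sketch below; the openpyxl engine requirement and the CENTRAL normalization step are assumptions:

```python
import pandas as pd

def normalize_maestros(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize the CENTRAL key used to match master data to dDEC records.

    The trim/uppercase step is an assumption about how matching is done.
    """
    df = df.copy()
    df["CENTRAL"] = df["CENTRAL"].astype(str).str.strip().str.upper()
    return df

def load_maestros(path: str = "Datos Maestros VF.xlsx") -> pd.DataFrame:
    # Requires the openpyxl engine; the default path is an assumption.
    return normalize_maestros(pd.read_excel(path))
```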
dDEC1204.TXT
Format: CSV text file
Content: Generation declaration data
- Central names
- 24 hourly generation values (H1 to H24)
train.csv & test.csv
Format: CSV files
Content: Fraud detection datasets
- Training data with FRAUDE target variable
- Test data without target (for predictions)
- Multiple numeric and categorical features
Google Colab Setup
Mounting Google Drive
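A minimal mount helper using the standard Colab API (it degrades to a no-op outside Colab):

```python
def mount_drive() -> bool:
    """Mount Google Drive when running inside Colab; return True on success.

    Outside Colab the google.colab package is absent, so this returns False.
    """
    try:
        from google.colab import drive  # available only inside Colab
    except ImportError:
        return False
    drive.mount('/content/drive')
    return True
```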
The project is configured for Google Colab with data stored in Google Drive.
File Paths
All file paths in the notebook reference Google Drive.
Local Environment Setup
Running Outside Google Colab
To run the project in a local Python environment, install the dependencies listed above, download the data files, and update the notebook's Google Drive paths to point at local copies.
MySQL Database Setup
Creating the Database
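A creation sketch using mysql-connector-python; the database name prueba_tecnica, the character set, and the credentials are assumptions:

```python
# The database name, character set, and credentials below are assumptions;
# adjust them to your own setup.
CREATE_DB_SQL = (
    "CREATE DATABASE IF NOT EXISTS prueba_tecnica CHARACTER SET utf8mb4"
)

def create_database(host: str = "localhost",
                    user: str = "root",
                    password: str = "") -> None:
    """Create the project database; requires a running MySQL server."""
    import mysql.connector  # imported lazily so the module loads without it
    conn = mysql.connector.connect(host=host, user=user, password=password)
    try:
        conn.cursor().execute(CREATE_DB_SQL)
    finally:
        conn.close()
```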
Setting Permissions
Ensure your MySQL user has sufficient privileges on the project database (typically CREATE, ALTER, INDEX, SELECT, INSERT, and UPDATE).
Partitioning Prerequisites
RANGE partitioning requires the partition key to be part of the primary key or included in all unique indexes.
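For example, a year-partitioned table satisfying this rule might be declared as follows; the table and column names are illustrative assumptions, not the project's actual schema:

```python
# Illustrates the prerequisite: the partition key (fecha) appears in the
# primary key. Table and column names are assumptions for illustration.
PARTITIONED_DDL = """
CREATE TABLE mediciones (
    central  VARCHAR(64)   NOT NULL,
    fecha    DATE          NOT NULL,
    valor    DECIMAL(10,2),
    PRIMARY KEY (central, fecha)          -- fecha must appear here
)
PARTITION BY RANGE (YEAR(fecha)) (
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
"""
```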
System Resources
Minimum Requirements
- RAM: 4 GB (8 GB recommended for ML training)
- Storage: 500 MB for data files and outputs
- CPU: Any modern multi-core processor (Random Forest benefits from multiple cores)
Google Colab Resources
Google Colab provides:
- 12 GB RAM
- Free GPU (not required for this project)
- 100 GB temporary storage
Encoding Considerations
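A round-trip sketch showing the encoding in use (the sample text and filename are illustrative):

```python
import os
import tempfile

# Write and read Spanish text with latin-1, as the notebook does for its
# input files; the sample string and filename are illustrative only.
sample = "Central Río Cauca, año 2024, señal ñ"
path = os.path.join(tempfile.gettempdir(), "ejemplo_latin1.txt")
with open(path, "w", encoding="latin-1") as fh:
    fh.write(sample)
with open(path, encoding="latin-1") as fh:
    leido = fh.read()
print(leido)
```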
The project uses latin-1 encoding for text files to handle Spanish characters.
Verification
Testing Your Environment
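A minimal check script; the package list covers only the core dependencies and can be extended:

```python
import importlib.util

def check_dependencies(packages=("pandas", "sklearn")):
    """Report which core libraries are importable; return the missing ones."""
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    for p in packages:
        print(f"{p}: {'MISSING' if p in missing else 'ok'}")
    return missing
```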
Verify that all core dependencies are importable before running the notebook.
Next Steps
Once your environment meets all requirements:
- Review the Quickstart Guide to run the project
- Explore ETL Pipelines documentation
- Understand Database Design decisions
- Study the Fraud Detection Model