Overview
Phase 2 provides Docker-based command-line tools for training models and making batch predictions. This phase is ideal for automated pipelines, batch processing, and local development.
Best For: Data engineers, MLOps teams, and automated workflows that need to process CSV files in batch mode.
Location: ~/workspace/source/fase-2/
Architecture
Phase 2 consists of two main Python scripts:
train.py: Trains a RandomForestClassifier from CSV data and saves the model to a pickle file.
predict.py: Loads a trained model and generates predictions for new CSV data.
Both scripts run inside a Docker container with all dependencies pre-installed.
Prerequisites
Docker Installed
Ensure Docker is installed and running on your system: docker --version
# Should show: Docker version XX.XX.XX
Dataset Files
You need CSV files from the Kaggle dataset:
train.csv - Training data with diabetes column
test.csv - Test data (with or without diabetes column)
See Dataset Documentation for download instructions.
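Before copying files into the container, it can help to confirm the CSV has the columns the scripts expect. The snippet below is a sketch: it writes a tiny stand-in train.csv (the required column names come from train.py; the age column is a hypothetical extra feature) and checks the header:

```python
import pandas as pd

# Tiny stand-in for the real Kaggle train.csv ('gender', 'smoking_history',
# and 'diabetes' come from train.py; 'age' is a hypothetical extra feature)
pd.DataFrame({'gender': ['Female'], 'smoking_history': ['never'],
              'age': [42.0], 'diabetes': [0]}).to_csv('train.csv', index=False)

# Read only the header row and verify the columns the scripts rely on
cols = set(pd.read_csv('train.csv', nrows=0).columns)
assert {'gender', 'smoking_history', 'diabetes'} <= cols
print('train.csv columns look OK')
```

If the diabetes column is missing, train.py will fail at the drop step, so this check catches the problem early.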
Source Code
Navigate to the fase-2 directory: cd ~/workspace/source/fase-2
Quick Start
Build Docker Image
Build the container with all dependencies: docker build -t ai-proyecto-sustituto .
This creates an image based on Python 3.12 with:
scikit-learn
pandas
imbalanced-learn
loguru
argparse
Run Container
Start an interactive container: docker run -it --name ai-container ai-proyecto-sustituto /bin/bash
You're now inside the container, in the /app directory.
Copy Data Files
In a new terminal (keep the container terminal open), copy your CSV files: cd ~/workspace/source/resources
docker cp train.csv ai-container:/app
docker cp test.csv ai-container:/app
Train the Model
Back in the container terminal, train the model: python train.py --model_file model.pkl --data_file train.csv --overwrite_model
You'll see logging output:
overwriting existing model file model.pkl
loading train data
encoding train data
scaling train data
fitting model
saving model to model.pkl
Make Predictions
Generate predictions for test data: python predict.py --model_file model.pkl --input_file test.csv --predictions_file predictions.csv
Logging output:
loading input data
encoding data
scaling data
loading model
making predictions
saving predictions to predictions.csv
View Results
Check the predictions in predictions.csv: the file contains a single preds column with one 0/1 prediction per input row.
train.py - Model Training Script
Command-Line Arguments
python train.py [OPTIONS]
Argument | Required | Type | Description
--data_file | Yes | str | Path to CSV file with training data
--model_file | Yes | str | Path where trained model will be saved
--overwrite_model | No | flag | If set, overwrites existing model file
Usage Examples
Basic Usage
python train.py --model_file model.pkl --data_file train.csv --overwrite_model
Without Overwrite (fails if the model file already exists)
python train.py --model_file model.pkl --data_file train.csv
Training Process
The script performs these steps:
Validate Model File
if os.path.isfile(model_file):
    if overwrite:
        logger.info(f"overwriting existing model file {model_file}")
    else:
        logger.info(f"model file {model_file} exists. exiting. use --overwrite_model option")
        exit(-1)
Prevents accidental overwriting of existing models.
Load Training Data
logger.info("loading train data")
z = pd.read_csv(data_file)
Encode Categorical Features
logger.info("encoding train data")
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {
    'No Info': 0, 'current': 1, 'ever': 2,
    'former': 3, 'never': 4, 'not current': 5
}
z = z.replace({'gender': gender_dict, 'smoking_history': smoking_history_dict})
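As a quick illustration with two hypothetical rows, the mapping replaces the string values with their integer codes:

```python
import pandas as pd

# Same mappings as in train.py
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {'No Info': 0, 'current': 1, 'ever': 2,
                        'former': 3, 'never': 4, 'not current': 5}

# Two made-up rows for demonstration
z = pd.DataFrame({'gender': ['Female', 'Male'],
                  'smoking_history': ['never', 'No Info']})
z = z.replace({'gender': gender_dict, 'smoking_history': smoking_history_dict})
print(z.values.tolist())  # [[0, 4], [1, 0]]
```

Note that replace leaves any value not present in the dictionaries unchanged, so an unexpected category would pass through as a string and break the later scaling step.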
Separate Features and Target
Xtr = z.drop('diabetes', axis=1)
ytr = z[['diabetes']]
Scale Features
logger.info("scaling train data")
scaler = StandardScaler()
Xtr = scaler.fit_transform(Xtr)
The scaler is fit on training data but not saved. This means predictions must fit their own scaler, which may cause inconsistencies. For production, consider saving the scaler with the model.
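One way to address this, sketched below rather than taken from the current scripts, is to pickle the fitted scaler together with the model so prediction can reuse it (model_bundle.pkl and the toy data are hypothetical):

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the encoded training feature matrix
Xtr = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
ytr = [0, 0, 1, 1]

scaler = StandardScaler()
Xs = scaler.fit_transform(Xtr)
m = RandomForestClassifier(random_state=42).fit(Xs, ytr)

# Persist both fitted objects in a single pickle
with open('model_bundle.pkl', 'wb') as f:
    pickle.dump({'scaler': scaler, 'model': m}, f)

# At prediction time, load the bundle and reuse the training-time scaler
with open('model_bundle.pkl', 'rb') as f:
    bundle = pickle.load(f)
preds = bundle['model'].predict(bundle['scaler'].transform([[2.5, 25.0]]))
```

A scikit-learn Pipeline wrapping the scaler and classifier would achieve the same thing in a single object.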
Apply SMOTEENN Resampling
smote_enn = SMOTEENN(random_state=42)
Xtr, ytr = smote_enn.fit_resample(Xtr, ytr)
Balances the classes by combining SMOTE oversampling of the minority class with Edited Nearest Neighbours cleaning.
Train RandomForest Model
logger.info("fitting model")
m = RandomForestClassifier()
m.fit(Xtr, ytr)
Uses default hyperparameters:
n_estimators=100
max_depth=None
min_samples_split=2
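These defaults are reasonable but untuned. A hypothetical extension (not part of the current train.py) could expose them as CLI flags:

```python
# Hypothetical extension: expose RandomForest hyperparameters as CLI flags.
# Everything below is a sketch, not code from the current train.py.
import argparse

from sklearn.ensemble import RandomForestClassifier

parser = argparse.ArgumentParser()
parser.add_argument('--n_estimators', type=int, default=100)
parser.add_argument('--max_depth', type=int, default=None)
args = parser.parse_args([])  # empty list: use the defaults for this demo

m = RandomForestClassifier(n_estimators=args.n_estimators, max_depth=args.max_depth)
print(m.get_params()['n_estimators'])  # 100
```

In the real script you would call parser.parse_args() with no arguments so the values come from the command line.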
Save Model
logger.info(f"saving model to {model_file}")
with open(model_file, "wb") as f:
    pickle.dump(m, f)
Saves as pickle file for later use.
Full Source Code
import argparse
import os
import pickle

import pandas as pd
from imblearn.combine import SMOTEENN
from loguru import logger
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument('--data_file', required=True, type=str, help='a csv file with train data')
parser.add_argument('--model_file', required=True, type=str, help='where the trained model will be stored')
parser.add_argument('--overwrite_model', default=False, action='store_true', help='if set, overwrites the model file if it exists')
args = parser.parse_args()

model_file = args.model_file
data_file = args.data_file
overwrite = args.overwrite_model

# Check if model file already exists
if os.path.isfile(model_file):
    if overwrite:
        logger.info(f"overwriting existing model file {model_file}")
    else:
        logger.info(f"model file {model_file} exists. exiting. use --overwrite_model option")
        exit(-1)

# Load training data
logger.info("loading train data")
z = pd.read_csv(data_file)

# Encode categorical columns
logger.info("encoding train data")
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {'No Info': 0, 'current': 1, 'ever': 2, 'former': 3, 'never': 4, 'not current': 5}
z = z.replace({'gender': gender_dict, 'smoking_history': smoking_history_dict})

# Separate features and labels
Xtr = z.drop('diabetes', axis=1)
ytr = z[['diabetes']]

# Scale training data
logger.info("scaling train data")
scaler = StandardScaler()
Xtr = scaler.fit_transform(Xtr)

# Apply oversampling and undersampling with SMOTEENN
smote_enn = SMOTEENN(random_state=42)
Xtr, ytr = smote_enn.fit_resample(Xtr, ytr)

# Train the model
logger.info("fitting model")
m = RandomForestClassifier()
m.fit(Xtr, ytr)

# Save model to file
logger.info(f"saving model to {model_file}")
with open(model_file, "wb") as f:
    pickle.dump(m, f)
predict.py - Prediction Script
Command-Line Arguments
python predict.py [OPTIONS]
Argument | Required | Type | Description
--input_file | Yes | str | CSV file with input data (no target column)
--predictions_file | Yes | str | CSV file where predictions will be saved
--model_file | Yes | str | PKL file with trained model
Usage Examples
Basic Usage
python predict.py --model_file model.pkl --input_file test.csv --predictions_file predictions.csv
Prediction Process
Validate Files
if not os.path.isfile(model_file):
    logger.error(f"model file {model_file} does not exist")
    exit(-1)
if not os.path.isfile(input_file):
    logger.error(f"input file {input_file} does not exist")
    exit(-1)
Load Input Data
logger.info("loading input data")
Xts = pd.read_csv(input_file)
Encode Features
logger.info("encoding data")
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {
    'No Info': 0, 'current': 1, 'ever': 2,
    'former': 3, 'never': 4, 'not current': 5
}
Xts = Xts.replace({'gender': gender_dict, 'smoking_history': smoking_history_dict})
Scale Features
logger.info("scaling data")
scaler = StandardScaler()
Xts = scaler.fit_transform(Xts)
The scaler is fit on test data rather than using the training scaler. This is a limitation of this implementation and may affect prediction accuracy.
Load Model
logger.info("loading model")
with open(model_file, 'rb') as f:
    m = pickle.load(f)
Generate Predictions
logger.info("making predictions")
preds = m.predict(Xts)
Save Results
logger.info(f"saving predictions to {predictions_file}")
pd.DataFrame(preds.reshape(-1, 1), columns=['preds']).to_csv(predictions_file, index=False)
Output format: a CSV file with a single preds column and one 0/1 prediction per input row.
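A minimal reconstruction of the output step (the preds values here are made up):

```python
import numpy as np
import pandas as pd

preds = np.array([0, 1, 0])  # hypothetical model output
# Same reshape + DataFrame construction as in predict.py
df = pd.DataFrame(preds.reshape(-1, 1), columns=['preds'])
print(df)  # a single 'preds' column, one row per input row
```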
Full Source Code
import argparse
import os
import pickle

import numpy as np
import pandas as pd
from loguru import logger
from sklearn.preprocessing import StandardScaler

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument('--input_file', required=True, type=str, help='a csv file with input data (no targets)')
parser.add_argument('--predictions_file', required=True, type=str, help='a csv file where predictions will be saved to')
parser.add_argument('--model_file', required=True, type=str, help='a pkl file with a model already stored (see train.py)')
args = parser.parse_args()

model_file = args.model_file
input_file = args.input_file
predictions_file = args.predictions_file

# Verify model file exists
if not os.path.isfile(model_file):
    logger.error(f"model file {model_file} does not exist")
    exit(-1)

# Verify input file exists
if not os.path.isfile(input_file):
    logger.error(f"input file {input_file} does not exist")
    exit(-1)

# Load input data
logger.info("loading input data")
Xts = pd.read_csv(input_file)

# Encode input data
logger.info("encoding data")
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {'No Info': 0, 'current': 1, 'ever': 2, 'former': 3, 'never': 4, 'not current': 5}
Xts = Xts.replace({'gender': gender_dict, 'smoking_history': smoking_history_dict})

# Scale input data
logger.info("scaling data")
scaler = StandardScaler()
Xts = scaler.fit_transform(Xts)

# Load model
logger.info("loading model")
with open(model_file, 'rb') as f:
    m = pickle.load(f)

# Make predictions
logger.info("making predictions")
preds = m.predict(Xts)

# Save predictions to file
logger.info(f"saving predictions to {predictions_file}")
pd.DataFrame(preds.reshape(-1, 1), columns=['preds']).to_csv(predictions_file, index=False)
Dockerfile
The Docker container is built from this Dockerfile:
# Select Python base image
FROM python:3.12
# Set working directory
WORKDIR /app
# Copy necessary files to application directory
ADD train.py /app
ADD predict.py /app
ADD requirements.txt /app
# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt
Dependencies (requirements.txt)
argparse
scikit-learn
loguru
pandas
imbalanced-learn
Advanced Usage
Copy Files Out of Container
Retrieve predictions from the container:
# Copy predictions to host
docker cp ai-container:/app/predictions.csv ./local_predictions.csv
# Copy trained model to host
docker cp ai-container:/app/model.pkl ./saved_model.pkl
Automated Pipeline
Create a shell script for automated training and prediction:
#!/bin/bash
# Build image
docker build -t ai-proyecto-sustituto .
# Run container
docker run -d --name ai-container ai-proyecto-sustituto sleep infinity
# Copy data files
docker cp train.csv ai-container:/app
docker cp test.csv ai-container:/app
# Train model
docker exec ai-container python train.py \
--model_file model.pkl \
--data_file train.csv \
--overwrite_model
# Make predictions
docker exec ai-container python predict.py \
--model_file model.pkl \
--input_file test.csv \
--predictions_file predictions.csv
# Copy results (create the target directory first so docker cp succeeds)
mkdir -p results
docker cp ai-container:/app/predictions.csv ./results/
# Cleanup
docker stop ai-container
docker rm ai-container
Run the pipeline:
chmod +x pipeline.sh
./pipeline.sh
Volume Mounting
Mount local directories for easier file access:
docker run -it \
  --name ai-container \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/models:/app/models \
  ai-proyecto-sustituto /bin/bash
# Inside container
python train.py --model_file models/model.pkl --data_file data/train.csv --overwrite_model
python predict.py --model_file models/model.pkl --input_file data/test.csv --predictions_file data/predictions.csv
Troubleshooting
Error: model file model.pkl exists. exiting. use --overwrite_model option
Solution: Add the --overwrite_model flag:
python train.py --model_file model.pkl --data_file train.csv --overwrite_model
Or delete the existing model inside the container:
rm model.pkl
Error: model file model.pkl does not exist or input file test.csv does not exist
Solution: Verify the files are in the container:
docker exec ai-container ls -la /app
Copy any missing files:
docker cp train.csv ai-container:/app
docker cp test.csv ai-container:/app
Problem: Container exits immediately
Solution: Run the container with an interactive shell (docker run -it ... /bin/bash) or keep it alive in the background with sleep infinity, as in the pipeline script above.
Problem: Special characters in the CSV cause errors
Solution: Ensure the CSV is UTF-8 encoded:
import pandas as pd
df = pd.read_csv('data.csv', encoding='utf-8')
df.to_csv('data_clean.csv', encoding='utf-8', index=False)
Comparison with Other Phases
Feature | Phase 1 (Notebook) | Phase 2 (CLI) | Phase 3 (API)
Interface | Jupyter cells | Command-line | REST endpoints
Deployment | Google Colab | Docker container | Docker container
Best For | Exploration | Batch processing | Production/Web
Input | Inline code | CSV files | JSON requests
Output | Inline results | CSV files | JSON responses
Automation | Manual | Scriptable | Fully automated
Next Steps
Phase 3: API Deploy a REST API for real-time predictions
Docker Setup Advanced Docker configuration and best practices
CLI Usage Detailed guide for CLI operations and automation
Data Preprocessing Deep dive into encoding and scaling pipeline