

Overview

Phase 2 provides Docker-based command-line tools for training models and making batch predictions. This phase is ideal for automated pipelines, batch processing, and local development.
Best For: Data engineers, MLOps teams, and automated workflows that need to process CSV files in batch mode.

Location: ~/workspace/source/fase-2/

Architecture

Phase 2 consists of two main Python scripts:

train.py

Trains a RandomForestClassifier from CSV data and saves the model to a pickle file

predict.py

Loads a trained model and generates predictions for new CSV data
Both scripts run inside a Docker container with all dependencies pre-installed.

Prerequisites

1. Docker Installed

Ensure Docker is installed and running on your system:
docker --version
# Should show: Docker version XX.XX.XX
2. Dataset Files

You need CSV files from the Kaggle dataset:
  • train.csv - Training data with diabetes column
  • test.csv - Test data (with or without diabetes column)
See Dataset Documentation for download instructions.
3. Source Code

Navigate to the fase-2 directory:
cd ~/workspace/source/fase-2

Quick Start

1. Build Docker Image

Build the container with all dependencies:
docker build -t ai-proyecto-sustituto .
This creates an image based on Python 3.12 with:
  • scikit-learn
  • pandas
  • imbalanced-learn
  • loguru
  • argparse
2. Run Container

Start an interactive container:
docker run -it --name ai-container ai-proyecto-sustituto /bin/bash
You’re now inside the container, in the /app directory.
3. Copy Data Files

In a new terminal (keep the container terminal open), copy your CSV files:
cd ~/workspace/source/resources
docker cp train.csv ai-container:/app
docker cp test.csv ai-container:/app
4. Train the Model

Back in the container terminal, train the model:
python train.py --model_file model.pkl --data_file train.csv --overwrite_model
You’ll see logging output:
overwriting existing model file model.pkl
loading train data
encoding train data
scaling train data
fitting model
saving model to model.pkl
5. Make Predictions

Generate predictions for test data:
python predict.py --model_file model.pkl --input_file test.csv --predictions_file predictions.csv
Logging output:
loading input data
encoding data
scaling data
loading model
making predictions
saving predictions to predictions.csv
6. View Results

Check the predictions:
cat predictions.csv
Output format:
preds
0
1
0
0
1

train.py - Model Training Script

Command-Line Arguments

python train.py [OPTIONS]
| Argument | Required | Type | Description |
|---|---|---|---|
| --data_file | Yes | str | Path to CSV file with training data |
| --model_file | Yes | str | Path where trained model will be saved |
| --overwrite_model | No | flag | If set, overwrites existing model file |

Usage Examples

python train.py --model_file model.pkl --data_file train.csv --overwrite_model

Training Process

The script performs these steps:
1. Validate Model File

if os.path.isfile(model_file):
    if overwrite:
        logger.info(f"overwriting existing model file {model_file}")
    else:
        logger.info(f"model file {model_file} exists. exiting. use --overwrite_model option")
        exit(-1)
Prevents accidental overwriting of existing models.
2. Load Training Data

logger.info("loading train data")
z = pd.read_csv(data_file)
3. Encode Categorical Features

logger.info("encoding train data")
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {
    'No Info': 0, 'current': 1, 'ever': 2, 
    'former': 3, 'never': 4, 'not current': 5
}
z = z.replace({'gender': gender_dict, 'smoking_history': smoking_history_dict})
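The dictionary mapping above can be exercised on a tiny stand-in DataFrame. One caveat worth knowing: values absent from the dictionaries pass through `replace` unchanged, so an unexpected category would survive as a string and later break the scaler.

```python
import pandas as pd

# Same mappings as train.py
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {
    'No Info': 0, 'current': 1, 'ever': 2,
    'former': 3, 'never': 4, 'not current': 5
}

# Tiny illustrative frame (not the Kaggle data)
df = pd.DataFrame({
    'gender': ['Female', 'Male'],
    'smoking_history': ['never', 'No Info'],
})
encoded = df.replace({'gender': gender_dict, 'smoking_history': smoking_history_dict})
print(encoded['gender'].tolist())           # [0, 1]
print(encoded['smoking_history'].tolist())  # [4, 0]
```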
4. Separate Features and Target

Xtr = z.drop('diabetes', axis=1)
ytr = z[['diabetes']]
5. Scale Features

logger.info("scaling train data")
scaler = StandardScaler()
Xtr = scaler.fit_transform(Xtr)
The scaler is fit on training data but not saved. This means predictions must fit their own scaler, which may cause inconsistencies. For production, consider saving the scaler with the model.
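One possible fix for the limitation above, sketched here rather than taken from the project: pickle the fitted scaler together with the model in a single bundle, and call `transform` (not `fit_transform`) at prediction time. The `model_bundle.pkl` filename and `bundle` dict keys are illustrative, not part of train.py.

```python
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the encoded feature matrix
Xtr = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
ytr = np.array([0, 0, 1, 1])

scaler = StandardScaler()
Xtr_scaled = scaler.fit_transform(Xtr)

m = RandomForestClassifier(random_state=42)
m.fit(Xtr_scaled, ytr)

# Persist both objects in one file so prediction reuses the *training* scaler
with open("model_bundle.pkl", "wb") as f:
    pickle.dump({"model": m, "scaler": scaler}, f)

# At prediction time: transform (never re-fit) with the saved scaler
with open("model_bundle.pkl", "rb") as f:
    bundle = pickle.load(f)
Xts_scaled = bundle["scaler"].transform(np.array([[3.5, 35.0]]))
preds = bundle["model"].predict(Xts_scaled)
```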
6. Apply SMOTEENN Resampling

smote_enn = SMOTEENN(random_state=42)
Xtr, ytr = smote_enn.fit_resample(Xtr, ytr)
Balances the imbalanced dataset.
7. Train RandomForest Model

logger.info("fitting model")
m = RandomForestClassifier()
m.fit(Xtr, ytr)
Uses default hyperparameters:
  • n_estimators=100
  • max_depth=None
  • min_samples_split=2
8. Save Model

logger.info(f"saving model to {model_file}")
with open(model_file, "wb") as f:
    pickle.dump(m, f)
Saves as pickle file for later use.

Full Source Code

train.py
import argparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from imblearn.combine import SMOTEENN
from loguru import logger
import os
import pandas as pd
import pickle

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument('--data_file', required=True, type=str, help='a csv file with train data')
parser.add_argument('--model_file', required=True, type=str, help='where the trained model will be stored')
parser.add_argument('--overwrite_model', default=False, action='store_true', help='if set, overwrites the model file if it exists')

args = parser.parse_args()

model_file = args.model_file
data_file = args.data_file
overwrite = args.overwrite_model

# Check if model file already exists
if os.path.isfile(model_file):
    if overwrite:
        logger.info(f"overwriting existing model file {model_file}")
    else:
        logger.info(f"model file {model_file} exists. exiting. use --overwrite_model option")
        exit(-1)

# Load training data
logger.info("loading train data")
z = pd.read_csv(data_file)

# Encode training data
logger.info("encoding train data")
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {'No Info': 0, 'current': 1, 'ever': 2, 'former': 3, 'never': 4, 'not current': 5}
z = z.replace({'gender': gender_dict, 'smoking_history': smoking_history_dict})

# Separate features and labels
Xtr = z.drop('diabetes', axis=1)
ytr = z[['diabetes']]

# Scale training data
logger.info("scaling train data")
scaler = StandardScaler()
Xtr = scaler.fit_transform(Xtr)

# Apply oversampling and undersampling with SMOTEENN
smote_enn = SMOTEENN(random_state=42)
Xtr, ytr = smote_enn.fit_resample(Xtr, ytr)

# Train the model
logger.info("fitting model")
m = RandomForestClassifier()
m.fit(Xtr, ytr)

# Save model to file
logger.info(f"saving model to {model_file}")
with open(model_file, "wb") as f:
    pickle.dump(m, f)

predict.py - Prediction Script

Command-Line Arguments

python predict.py [OPTIONS]
| Argument | Required | Type | Description |
|---|---|---|---|
| --input_file | Yes | str | CSV file with input data (no target column) |
| --predictions_file | Yes | str | CSV file where predictions will be saved |
| --model_file | Yes | str | PKL file with trained model |

Usage Examples

python predict.py --model_file model.pkl --input_file test.csv --predictions_file predictions.csv

Prediction Process

1. Validate Files

if not os.path.isfile(model_file):
    logger.error(f"model file {model_file} does not exist")
    exit(-1)

if not os.path.isfile(input_file):
    logger.error(f"input file {input_file} does not exist")
    exit(-1)
2. Load Input Data

logger.info("loading input data")
Xts = pd.read_csv(input_file)
3. Encode Features

logger.info("encoding data")
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {
    'No Info': 0, 'current': 1, 'ever': 2, 
    'former': 3, 'never': 4, 'not current': 5
}
Xts = Xts.replace({'gender': gender_dict, 'smoking_history': smoking_history_dict})
4. Scale Features

logger.info("scaling data")
scaler = StandardScaler()
Xts = scaler.fit_transform(Xts)
The scaler is fit on test data rather than using the training scaler. This is a limitation of this implementation and may affect prediction accuracy.
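Why this matters can be shown with a one-feature sketch (the numbers are invented for illustration): re-fitting a scaler on the test batch re-centers it around the test mean, hiding any distribution shift that the training scaler would have exposed.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Train and test batches with deliberately different distributions
Xtr = np.array([[0.0], [10.0], [20.0]])
Xts = np.array([[100.0], [110.0]])

train_scaler = StandardScaler().fit(Xtr)

# Correct approach: reuse the training scaler -> test values land far from 0
correct = train_scaler.transform(Xts)

# What predict.py does: fit a fresh scaler on the test batch -> values centered at 0
refit = StandardScaler().fit_transform(Xts)

print(correct.ravel())  # large positive values, revealing the shift
print(refit.ravel())    # centered around 0, shift is hidden
```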
5. Load Model

logger.info("loading model")
with open(model_file, 'rb') as f:
    m = pickle.load(f)
6. Generate Predictions

logger.info("making predictions")
preds = m.predict(Xts)
7. Save Results

logger.info(f"saving predictions to {predictions_file}")
pd.DataFrame(preds.reshape(-1,1), columns=['preds']).to_csv(predictions_file, index=False)
Output format:
preds
0
1
0
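As a quick sanity check (not part of the project scripts), the predictions file can be read back with pandas to inspect the class distribution, using an in-memory stand-in for predictions.csv here:

```python
import io
import pandas as pd

# Stand-in for the predictions.csv produced by predict.py
csv_text = "preds\n0\n1\n0\n"
preds = pd.read_csv(io.StringIO(csv_text))

# Class distribution: how many positive (diabetes) predictions?
counts = preds['preds'].value_counts()
print(counts.to_dict())
```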

Full Source Code

predict.py
import argparse
import numpy as np
from loguru import logger
from sklearn.preprocessing import StandardScaler
import os
import pandas as pd
import pickle

# Parse command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument('--input_file', required=True, type=str, help='a csv file with input data (no targets)')
parser.add_argument('--predictions_file', required=True, type=str, help='a csv file where predictions will be saved to')
parser.add_argument('--model_file', required=True, type=str, help='a pkl file with a model already stored (see train.py)')

args = parser.parse_args()

model_file = args.model_file
input_file = args.input_file
predictions_file = args.predictions_file

# Verify model file exists
if not os.path.isfile(model_file):
    logger.error(f"model file {model_file} does not exist")
    exit(-1)

# Verify input file exists
if not os.path.isfile(input_file):
    logger.error(f"input file {input_file} does not exist")
    exit(-1)

# Load input data
logger.info("loading input data")
Xts = pd.read_csv(input_file)

# Encode input data
logger.info("encoding data")
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {'No Info': 0, 'current': 1, 'ever': 2, 'former': 3, 'never': 4, 'not current': 5}
Xts = Xts.replace({'gender': gender_dict, 'smoking_history': smoking_history_dict})

# Scale input data
logger.info("scaling data")
scaler = StandardScaler()
Xts = scaler.fit_transform(Xts)

# Load model
logger.info("loading model")
with open(model_file, 'rb') as f:
    m = pickle.load(f)

# Make predictions
logger.info("making predictions")
preds = m.predict(Xts)

# Save predictions to file
logger.info(f"saving predictions to {predictions_file}")
pd.DataFrame(preds.reshape(-1,1), columns=['preds']).to_csv(predictions_file, index=False)

Dockerfile

The Docker container is built from this Dockerfile:
Dockerfile
# Select Python base image
FROM python:3.12

# Set working directory
WORKDIR /app

# Copy necessary files to application directory
ADD train.py /app
ADD predict.py /app
ADD requirements.txt /app

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt

Dependencies (requirements.txt)

argparse
scikit-learn
loguru
pandas
imbalanced-learn
Note: argparse ships with the Python standard library, so its entry in requirements.txt is redundant (the standalone PyPI package is a legacy backport).

Advanced Usage

Copy Files Out of Container

Retrieve predictions from the container:
# Copy predictions to host
docker cp ai-container:/app/predictions.csv ./local_predictions.csv

# Copy trained model to host
docker cp ai-container:/app/model.pkl ./saved_model.pkl

Automated Pipeline

Create a shell script for automated training and prediction:
pipeline.sh
#!/bin/bash

# Build image
docker build -t ai-proyecto-sustituto .

# Run container
docker run -d --name ai-container ai-proyecto-sustituto sleep infinity

# Copy data files
docker cp train.csv ai-container:/app
docker cp test.csv ai-container:/app

# Train model
docker exec ai-container python train.py \
  --model_file model.pkl \
  --data_file train.csv \
  --overwrite_model

# Make predictions
docker exec ai-container python predict.py \
  --model_file model.pkl \
  --input_file test.csv \
  --predictions_file predictions.csv

# Copy results (create the output directory first)
mkdir -p results
docker cp ai-container:/app/predictions.csv ./results/

# Cleanup
docker stop ai-container
docker rm ai-container
Run the pipeline:
chmod +x pipeline.sh
./pipeline.sh

Volume Mounting

Mount local directories for easier file access:
docker run -it \
  --name ai-container \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/models:/app/models \
  ai-proyecto-sustituto /bin/bash

# Inside container
python train.py --model_file models/model.pkl --data_file data/train.csv --overwrite_model
python predict.py --model_file models/model.pkl --input_file data/test.csv --predictions_file data/predictions.csv

Troubleshooting

Error: model file model.pkl exists. exiting. use --overwrite_model option
Solution: Add the --overwrite_model flag:
python train.py --model_file model.pkl --data_file train.csv --overwrite_model
Or delete the existing model:
rm model.pkl
Error: model file model.pkl does not exist or input file test.csv does not exist
Solution: Verify files are in the container:
docker exec ai-container ls -la /app
Copy missing files:
docker cp train.csv ai-container:/app
docker cp test.csv ai-container:/app
Problem: Container stops right after starting
Reason: The Dockerfile doesn’t have a CMD instruction, so the container needs a command
Solution: Use interactive mode with bash:
docker run -it --name ai-container ai-proyecto-sustituto /bin/bash
Or run with sleep:
docker run -d --name ai-container ai-proyecto-sustituto sleep infinity
docker exec -it ai-container /bin/bash
Problem: Special characters in CSV cause errors
Solution: Ensure the CSV is UTF-8 encoded:
import pandas as pd
df = pd.read_csv('data.csv', encoding='utf-8')
df.to_csv('data_clean.csv', encoding='utf-8', index=False)

Comparison with Other Phases

| Feature | Phase 1 (Notebook) | Phase 2 (CLI) | Phase 3 (API) |
|---|---|---|---|
| Interface | Jupyter cells | Command-line | REST endpoints |
| Deployment | Google Colab | Docker container | Docker container |
| Best For | Exploration | Batch processing | Production/Web |
| Input | Inline code | CSV files | JSON requests |
| Output | Inline results | CSV files | JSON responses |
| Automation | Manual | Scriptable | Fully automated |

Next Steps

Phase 3: API

Deploy a REST API for real-time predictions

Docker Setup

Advanced Docker configuration and best practices

CLI Usage

Detailed guide for CLI operations and automation

Data Preprocessing

Deep dive into encoding and scaling pipeline
