
Overview

Phase 2 provides command-line interface (CLI) tools for diabetes prediction. This approach is ideal for batch processing, automated pipelines, and integration with existing data workflows.
  • CLI Tools: train.py and predict.py
  • Deployment: Docker containerized
  • Best For: Batch predictions, automated workflows, data pipelines

CLI Architecture

The CLI system consists of two independent scripts:
fase-2/
├── train.py          # Model training script
├── predict.py        # Batch prediction script
├── requirements.txt  # Python dependencies
└── Dockerfile        # Container configuration

train.py

Trains a RandomForestClassifier from CSV data.
  • Input: train.csv
  • Output: model.pkl

predict.py

Generates predictions for new patients.
  • Input: test.csv, model.pkl
  • Output: predictions.csv

Setup

1. Build Docker Image

cd ~/workspace/source/fase-2
docker build -t ai-proyecto-sustituto .
What happens:
  • Downloads Python 3.12 base image
  • Copies train.py and predict.py
  • Installs scikit-learn, pandas, imbalanced-learn, loguru
Time: 5-10 minutes (first build)

2. Start Container

docker run -it --name ai-container ai-proyecto-sustituto /bin/bash
You’ll see a prompt like:
root@abc123def456:/app#

3. Copy Data Files

In a new terminal (keep the container running), copy the CSV files:
docker cp train.csv ai-container:/app
docker cp test.csv ai-container:/app

train.py - Training Script

Command Syntax

python train.py --model_file <MODEL> --data_file <DATA> [--overwrite_model]

Arguments

| Argument | Required | Type | Description |
| --- | --- | --- | --- |
| --model_file | Yes | string | Path to save trained model (e.g., model.pkl) |
| --data_file | Yes | string | Path to training CSV file (e.g., train.csv) |
| --overwrite_model | No | flag | Allow overwriting existing model file |
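
The flags above map naturally onto argparse. A hypothetical sketch of how train.py might wire them up (the actual script's parsing is not shown in these docs and may differ):
import argparse
import os

# Hypothetical argument handling matching the documented flags
parser = argparse.ArgumentParser(description="Train the diabetes prediction model")
parser.add_argument("--model_file", required=True, help="Path to save the trained model")
parser.add_argument("--data_file", required=True, help="Path to the training CSV file")
parser.add_argument("--overwrite_model", action="store_true",
                    help="Allow overwriting an existing model file")
args = parser.parse_args()

# Refuse to overwrite an existing model unless --overwrite_model was passed
if os.path.isfile(args.model_file) and not args.overwrite_model:
    print(f"model file {args.model_file} already exists; pass --overwrite_model to replace it")
    exit(-1)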

Basic Usage

python train.py --model_file model.pkl --data_file train.csv
Output:
loading train data
encoding train data
scaling train data
fitting model
saving model to model.pkl
Result: Creates model.pkl file
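
To sanity-check the artifact, the pickle can be reloaded and inspected (hypothetical check, not part of train.py):
import pickle

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Expect a RandomForestClassifier trained on the 8 feature columns
print(type(model).__name__, model.n_features_in_)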

Training Process Details

What happens during training:
1. Load Data

z = pd.read_csv("train.csv")
# Expected columns: gender, age, hypertension, heart_disease,
#                   smoking_history, bmi, HbA1c_level,
#                   blood_glucose_level, diabetes

2. Encode Categorical Features

# Gender: Female->0, Male->1, Other->2
# Smoking: No Info->0, current->1, ever->2, former->3, never->4, not current->5

3. Separate X and y

Xtr = z.drop('diabetes', axis=1)  # 8 features
ytr = z[['diabetes']]              # Target

4. Scale Features

scaler = StandardScaler()
Xtr = scaler.fit_transform(Xtr)
# All features normalized to mean=0, std=1

5. Balance Classes

smote_enn = SMOTEENN(random_state=42)
Xtr, ytr = smote_enn.fit_resample(Xtr, ytr)
# Creates ~1:1 ratio of diabetic:non-diabetic

6. Train Model

m = RandomForestClassifier()
m.fit(Xtr, ytr)
# Trains 100 decision trees

7. Save Model

with open("model.pkl", "wb") as f:
    pickle.dump(m, f)
# Serializes model to disk

Training Data Format

Expected CSV Structure:
gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
Female,80.0,0,1,never,25.19,6.6,140,0
Male,54.0,0,0,No Info,27.32,6.6,80,0
Female,36.0,0,0,current,23.45,5.0,155,0
Male,76.0,1,1,current,20.14,4.8,155,0
Requirements (see the validation sketch after this list):
  • Header row required
  • All 9 columns must be present
  • diabetes column must contain 0 or 1
  • Categorical values must match encoding dictionaries
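
A pre-flight check of these requirements can catch format problems before training. This is a hypothetical helper (not part of the repository), assuming the file is named train.csv:
import pandas as pd

REQUIRED_COLUMNS = ["gender", "age", "hypertension", "heart_disease", "smoking_history",
                    "bmi", "HbA1c_level", "blood_glucose_level", "diabetes"]
VALID_GENDERS = {"Female", "Male", "Other"}
VALID_SMOKING = {"No Info", "current", "ever", "former", "never", "not current"}

df = pd.read_csv("train.csv")

# All 9 columns present, target is binary, categorical values match the encoding dictionaries
missing = set(REQUIRED_COLUMNS) - set(df.columns)
assert not missing, f"missing columns: {missing}"
assert set(df["diabetes"].unique()) <= {0, 1}, "diabetes must contain only 0 or 1"
assert set(df["gender"].unique()) <= VALID_GENDERS, "unexpected gender values"
assert set(df["smoking_history"].unique()) <= VALID_SMOKING, "unexpected smoking_history values"
print("train.csv looks valid")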

Training Time Estimates

| Dataset Size | Approximate Time |
| --- | --- |
| 1,000 rows | 1-2 seconds |
| 10,000 rows | 5-10 seconds |
| 100,000 rows | 30-60 seconds |
| 1,000,000 rows | 5-10 minutes |
Times vary based on CPU and whether SMOTEENN needs to generate many synthetic samples.

Troubleshooting Training

Error: FileNotFoundError: [Errno 2] No such file or directory: 'train.csv'
Solution: Ensure train.csv is in the container:
# Check if file exists
ls -la /app/train.csv

# If not, copy it
docker cp train.csv ai-container:/app
Error: KeyError: 'diabetes'
Solution: The CSV must have all required columns, including the diabetes target:
# Check CSV structure
head -n 1 train.csv
Error: MemoryError during SMOTEENN
Solution: Allocate more memory to Docker:
docker run -it --memory="4g" --name ai-container ai-proyecto-sustituto /bin/bash

predict.py - Prediction Script

Command Syntax

python predict.py --model_file <MODEL> --input_file <INPUT> --predictions_file <OUTPUT>

Arguments

| Argument | Required | Type | Description |
| --- | --- | --- | --- |
| --model_file | Yes | string | Path to trained model file (e.g., model.pkl) |
| --input_file | Yes | string | Path to input CSV with patient data |
| --predictions_file | Yes | string | Path to save predictions CSV |

Basic Usage

python predict.py \
  --model_file model.pkl \
  --input_file test.csv \
  --predictions_file predictions.csv
Output:
loading input data
encoding data
scaling data
loading model
making predictions
saving predictions to predictions.csv
Result: Creates predictions.csv

Input Data Format

Expected CSV Structure (WITHOUT diabetes column):
gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level
Female,36,0,0,current,32.27,6.2,220
Male,28,0,0,never,27.32,5.7,158
Female,54,0,0,No Info,27.32,6.6,80
Important: Input CSV should NOT include the diabetes column. Only the 8 feature columns.
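
If your input file still carries a diabetes column (for example, a slice of the training data), drop it before running predict.py. A minimal sketch (hypothetical, not part of the repository):
import pandas as pd

# Remove the target column if present; errors="ignore" makes this a no-op otherwise
df = pd.read_csv("test.csv")
df.drop(columns=["diabetes"], errors="ignore").to_csv("test.csv", index=False)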

Output Format

predictions.csv:
preds
1
0
0
  • Single column named preds
  • One row per input patient
  • Values: 0 (no diabetes) or 1 (has diabetes)
  • Row order matches the input file (see the join sketch below)
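
Because row order is preserved, predictions can be joined back to the original patient rows positionally. A minimal sketch (hypothetical helper, not part of the repository):
import pandas as pd

inputs = pd.read_csv("test.csv")
preds = pd.read_csv("predictions.csv")

# Row order matches, so attach the prediction column positionally
results = inputs.assign(preds=preds["preds"].values)
results.to_csv("results_with_predictions.csv", index=False)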

Prediction Process

1. Validate Files

# Check model exists
if not os.path.isfile(model_file):
    logger.error(f"model file {model_file} does not exist")
    exit(-1)

# Check input exists
if not os.path.isfile(input_file):
    logger.error(f"input file {input_file} does not exist")
    exit(-1)

2. Load Input Data

Xts = pd.read_csv(input_file)
# Expected: 8 feature columns, no diabetes column

3. Encode Categorical Features

# Apply same encoding as training
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = {'No Info': 0, 'current': 1, 'ever': 2, 'former': 3, 'never': 4, 'not current': 5}
Xts = Xts.replace({'gender': gender_dict, 'smoking_history': smoking_history_dict})

4. Scale Features

scaler = StandardScaler()
Xts = scaler.fit_transform(Xts)
Issue: This creates a NEW scaler instead of reusing the one fitted during training, which may reduce accuracy (see the sketch after these steps).

5. Load Model

with open(model_file, 'rb') as f:
    m = pickle.load(f)

6. Generate Predictions

preds = m.predict(Xts)
# Array of 0s and 1s

7. Save Results

pd.DataFrame(preds.reshape(-1,1), columns=['preds']).to_csv(predictions_file, index=False)
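
The scaler issue flagged in step 4 could be avoided by persisting the scaler fitted in train.py and reloading it in predict.py. A hedged sketch of one possible fix (not the repository's current behavior):
import pickle
from sklearn.preprocessing import StandardScaler

# In train.py, after preparing the training features Xtr:
scaler = StandardScaler()
Xtr = scaler.fit_transform(Xtr)
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)

# In predict.py, reuse the training scaler on the input features Xts
# instead of fitting a new one:
with open("scaler.pkl", "rb") as f:
    scaler = pickle.load(f)
Xts = scaler.transform(Xts)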

Troubleshooting Predictions

Error: model file model.pkl does not exist
Solution: Train the model first:
python train.py --model_file model.pkl --data_file train.csv
Error: ValueError: X has 7 features, but RandomForestClassifier is expecting 8 features
Solution: Ensure test.csv has all 8 feature columns (excluding diabetes):
# Check column count
head -n 1 test.csv | awk -F',' '{print NF}'
# Should output: 8
Error: Category not in encoding dictionary
Solution: Ensure categorical values match training (a quick check follows this list):
  • gender: Female, Male, or Other
  • smoking_history: No Info, current, ever, former, never, not current
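
A quick way to spot unexpected categories before predicting (hypothetical snippet):
import pandas as pd

df = pd.read_csv("test.csv")

# Any values printed here are not in the encoding dictionaries and will cause errors
print(set(df["gender"].unique()) - {"Female", "Male", "Other"})
print(set(df["smoking_history"].unique()) - {"No Info", "current", "ever", "former", "never", "not current"})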

Advanced CLI Usage

Batch Processing Pipeline

Create a shell script for automated workflow:
#!/bin/bash
# diabetes_pipeline.sh

set -e  # Exit on error

# Configuration
CONTAINER_NAME="ai-container"
IMAGE_NAME="ai-proyecto-sustituto"
DATA_DIR="./data"
MODELS_DIR="./models"
RESULTS_DIR="./results"

# Create directories
mkdir -p $DATA_DIR $MODELS_DIR $RESULTS_DIR

echo "[1/6] Building Docker image..."
docker build -t $IMAGE_NAME .

echo "[2/6] Starting container..."
docker run -d --name $CONTAINER_NAME $IMAGE_NAME sleep infinity

echo "[3/6] Copying data files..."
docker cp $DATA_DIR/train.csv $CONTAINER_NAME:/app/
docker cp $DATA_DIR/test.csv $CONTAINER_NAME:/app/

echo "[4/6] Training model..."
docker exec $CONTAINER_NAME python train.py \
  --model_file model.pkl \
  --data_file train.csv \
  --overwrite_model

echo "[5/6] Generating predictions..."
docker exec $CONTAINER_NAME python predict.py \
  --model_file model.pkl \
  --input_file test.csv \
  --predictions_file predictions.csv

echo "[6/6] Retrieving results..."
docker cp $CONTAINER_NAME:/app/model.pkl $MODELS_DIR/
docker cp $CONTAINER_NAME:/app/predictions.csv $RESULTS_DIR/

echo "Cleaning up..."
docker stop $CONTAINER_NAME
docker rm $CONTAINER_NAME

echo "Pipeline complete!"
echo "Model: $MODELS_DIR/model.pkl"
echo "Predictions: $RESULTS_DIR/predictions.csv"
Usage:
chmod +x diabetes_pipeline.sh
./diabetes_pipeline.sh

Volume-Based Workflow

Use volumes for persistent storage:
# Create directories
mkdir -p ./data ./models ./output

# Place data files
cp train.csv ./data/
cp test.csv ./data/

# Run with volumes
docker run -it \
  --name ai-container \
  -v $(pwd)/data:/app/data \
  -v $(pwd)/models:/app/models \
  -v $(pwd)/output:/app/output \
  ai-proyecto-sustituto /bin/bash

# Inside container
python train.py \
  --model_file /app/models/model.pkl \
  --data_file /app/data/train.csv \
  --overwrite_model

python predict.py \
  --model_file /app/models/model.pkl \
  --input_file /app/data/test.csv \
  --predictions_file /app/output/predictions.csv

# Exit container
exit

# Files are on host
ls models/    # model.pkl
ls output/    # predictions.csv

Cron Job Automation

Schedule daily predictions:
# crontab -e

# Run diabetes predictions every day at 2 AM
0 2 * * * /path/to/diabetes_pipeline.sh >> /var/log/diabetes_predictions.log 2>&1

Python Integration

Call CLI tools from Python:
import subprocess
import pandas as pd

# Train model
subprocess.run([
    "docker", "exec", "ai-container",
    "python", "train.py",
    "--model_file", "model.pkl",
    "--data_file", "train.csv",
    "--overwrite_model"
], check=True)

# Generate predictions
subprocess.run([
    "docker", "exec", "ai-container",
    "python", "predict.py",
    "--model_file", "model.pkl",
    "--input_file", "test.csv",
    "--predictions_file", "predictions.csv"
], check=True)

# Copy and read results
subprocess.run([
    "docker", "cp",
    "ai-container:/app/predictions.csv",
    "./predictions.csv"
], check=True)

predictions = pd.read_csv("predictions.csv")
print(f"Diabetes cases detected: {(predictions['preds'] == 1).sum()}")

Performance Optimization

Parallel Processing

For very large datasets, split and process in parallel:
# Split test.csv into chunks, keeping the header row in every chunk
head -n 1 test.csv > header.csv
tail -n +2 test.csv | split -l 10000 - test_chunk_
for f in test_chunk_*; do cat header.csv "$f" > "$f.csv"; done

# Copy chunks into the container
for f in test_chunk_*.csv; do docker cp "$f" ai-container:/app/; done

# Process in parallel (requires GNU parallel)
parallel -j 4 "docker exec ai-container python predict.py \
  --model_file model.pkl \
  --input_file {} \
  --predictions_file {}.preds" ::: test_chunk_*.csv

# Retrieve and combine results (keep a single preds header)
for f in test_chunk_*.csv; do docker cp "ai-container:/app/$f.preds" .; done
{ echo "preds"; tail -q -n +2 test_chunk_*.preds; } > predictions_all.csv

Resource Allocation

Allocate more resources for faster training:
docker run -it \
  --name ai-container \
  --cpus="4" \
  --memory="8g" \
  ai-proyecto-sustituto /bin/bash

Comparison: CLI vs API

| Feature | CLI (Phase 2) | API (Phase 3) |
| --- | --- | --- |
| Input | CSV files | JSON requests |
| Output | CSV files | JSON responses |
| Best For | Batch processing | Real-time queries |
| Integration | Scripts, cron jobs | Web apps, mobile apps |
| Scalability | Manual parallelization | Built-in concurrency |
| Ease of Use | Command-line knowledge | HTTP requests |
| Debugging | Terminal logs | API logs + Swagger UI |

Next Steps

API Deployment

For real-time predictions, explore the REST API

Docker Setup

Advanced Docker configurations and best practices

Phase 2 Guide

Complete Phase 2 walkthrough

Model Architecture

Understand the underlying model
