Documentation Index: fetch the complete documentation index at https://mintlify.com/jonatan-leal/ia-proyecto-sustituto/llms.txt to discover all available pages before exploring further.
Overview
Phase 2 provides command-line interface (CLI) tools for diabetes prediction. This approach is ideal for batch processing, automated pipelines, and integration with existing data workflows.
- CLI Tools: train.py and predict.py
- Deployment: Docker containerized
- Best For: batch predictions, automated workflows, data pipelines
CLI Architecture
The CLI system consists of two independent scripts:
```
fase-2/
├── train.py          # Model training script
├── predict.py        # Batch prediction script
├── requirements.txt  # Python dependencies
└── Dockerfile        # Container configuration
```
- train.py: trains a RandomForestClassifier from CSV data. Input: train.csv. Output: model.pkl
- predict.py: generates predictions for new patients. Input: test.csv and model.pkl. Output: predictions.csv
Setup
Build Docker Image
```bash
cd ~/workspace/source/fase-2
docker build -t ai-proyecto-sustituto .
```
What happens:
- Downloads the Python 3.12 base image
- Copies train.py and predict.py
- Installs scikit-learn, pandas, imbalanced-learn, and loguru

Time: 5-10 minutes (first build)
Start Container
```bash
docker run -it --name ai-container ai-proyecto-sustituto /bin/bash
```
You'll see a container shell prompt, typically something like `root@<container_id>:/app#` (the exact hostname varies per container).
Copy Data Files
In a new terminal (keep the container running), copy the CSV files:

```bash
docker cp train.csv ai-container:/app
docker cp test.csv ai-container:/app
```
train.py - Training Script
Command Syntax
```bash
python train.py --model_file <MODEL> --data_file <DATA> [--overwrite_model]
```
Arguments
| Argument | Required | Type | Description |
|---|---|---|---|
| --model_file | Yes | string | Path to save trained model (e.g., model.pkl) |
| --data_file | Yes | string | Path to training CSV file (e.g., train.csv) |
| --overwrite_model | No | flag | Allow overwriting an existing model file |
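The argument handling described in the table can be sketched with `argparse`. This is a minimal sketch: the flag names match the table, but the actual `train.py` implementation may differ in details.

```python
import argparse

def parse_train_args(argv=None):
    # Mirror the CLI described in the arguments table above.
    p = argparse.ArgumentParser(description="Train the diabetes model")
    p.add_argument("--model_file", required=True,
                   help="path to save the trained model")
    p.add_argument("--data_file", required=True,
                   help="path to the training CSV file")
    p.add_argument("--overwrite_model", action="store_true",
                   help="allow overwriting an existing model file")
    return p.parse_args(argv)

args = parse_train_args(["--model_file", "model.pkl", "--data_file", "train.csv"])
print(args.model_file, args.overwrite_model)  # model.pkl False
```

Because `--overwrite_model` uses `action="store_true"`, it is a bare flag: present means `True`, absent means `False`.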
Basic Usage
First Training

```bash
python train.py --model_file model.pkl --data_file train.csv
```

Output:

```
loading train data
encoding train data
scaling train data
fitting model
saving model to model.pkl
```

Result: creates the model.pkl file.

Overwrite Existing Model

```bash
python train.py --model_file model.pkl --data_file train.csv --overwrite_model
```

Output:

```
overwriting existing model file model.pkl
loading train data
encoding train data
scaling train data
fitting model
saving model to model.pkl
```

Use Case: retraining with updated data.

Without Overwrite Flag

```bash
# If model.pkl already exists
python train.py --model_file model.pkl --data_file train.csv
```

Output:

```
model file model.pkl exists. exitting. use --overwrite_model option
```

Exit Code: -1 (error; the shell reports this as status 255). Safety: prevents accidental overwriting.

Custom Model Name

```bash
python train.py --model_file diabetes_v2.pkl --data_file train.csv
```

Use Case: version control for models.
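The overwrite guard behind the error above can be sketched as follows. This is an illustrative sketch, not the verbatim script; the log message mirrors the output shown above.

```python
import os
import sys

def check_model_path(model_file: str, overwrite: bool) -> None:
    # Refuse to clobber an existing model unless --overwrite_model was given.
    if os.path.isfile(model_file):
        if not overwrite:
            print(f"model file {model_file} exists. exitting. use --overwrite_model option")
            sys.exit(-1)
        print(f"overwriting existing model file {model_file}")

# A missing file passes the check silently.
check_model_path("definitely_missing.pkl", overwrite=False)
```

Note that `sys.exit(-1)` surfaces as exit status 255 in the shell, since POSIX exit codes are truncated to a single byte.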
Training Process Details
What happens during training:
1. Load Data

```python
z = pd.read_csv("train.csv")
# Expected columns: gender, age, hypertension, heart_disease,
#                   smoking_history, bmi, HbA1c_level,
#                   blood_glucose_level, diabetes
```

2. Encode Categorical Features

```python
# Gender: Female->0, Male->1, Other->2
# Smoking: No Info->0, current->1, ever->2, former->3, never->4, not current->5
```

3. Separate X and y

```python
Xtr = z.drop('diabetes', axis=1)  # 8 features
ytr = z[['diabetes']]             # Target
```

4. Scale Features

```python
scaler = StandardScaler()
Xtr = scaler.fit_transform(Xtr)
# All features normalized to mean=0, std=1
```

5. Balance Classes

```python
smote_enn = SMOTEENN(random_state=42)
Xtr, ytr = smote_enn.fit_resample(Xtr, ytr)
# Creates ~1:1 ratio of diabetic:non-diabetic
```

6. Train Model

```python
m = RandomForestClassifier()
m.fit(Xtr, ytr)
# Trains 100 decision trees (the scikit-learn default for n_estimators)
```

7. Save Model

```python
with open("model.pkl", "wb") as f:
    pickle.dump(m, f)
# Serializes the model to disk
```
Expected CSV Structure:

```csv
gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
Female,80.0,0,1,never,25.19,6.6,140,0
Male,54.0,0,0,No Info,27.32,6.6,80,0
Female,36.0,0,0,current,23.45,5.0,155,0
Male,76.0,1,1,current,20.14,4.8,155,0
```
Requirements:
- Header row required
- All 9 columns must be present
- The diabetes column must contain 0 or 1
- Categorical values must match the encoding dictionaries
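A quick pre-flight check for these requirements could look like this. It is a sketch using only the standard library; the column names follow the table above, and the helper name `validate_training_csv` is illustrative.

```python
import csv
import io

REQUIRED = ["gender", "age", "hypertension", "heart_disease", "smoking_history",
            "bmi", "HbA1c_level", "blood_glucose_level", "diabetes"]

def validate_training_csv(fileobj) -> list[str]:
    # Return a list of problems; an empty list means the file looks usable.
    problems = []
    reader = csv.DictReader(fileobj)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    for i, row in enumerate(reader, start=2):  # line 1 is the header
        if row["diabetes"] not in ("0", "1"):
            problems.append(f"line {i}: diabetes must be 0 or 1, got {row['diabetes']!r}")
    return problems

sample = ("gender,age,hypertension,heart_disease,smoking_history,bmi,"
          "HbA1c_level,blood_glucose_level,diabetes\n"
          "Female,80.0,0,1,never,25.19,6.6,140,0\n")
print(validate_training_csv(io.StringIO(sample)))  # []
```

Running this before training turns opaque `KeyError` or `ValueError` crashes into explicit messages.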
Training Time Estimates
| Dataset Size | Approximate Time |
|---|---|
| 1,000 rows | 1-2 seconds |
| 10,000 rows | 5-10 seconds |
| 100,000 rows | 30-60 seconds |
| 1,000,000 rows | 5-10 minutes |
Times vary based on CPU and whether SMOTEENN needs to generate many synthetic samples.
Troubleshooting Training
Error: FileNotFoundError: [Errno 2] No such file or directory: 'train.csv'

Solution: ensure train.csv is in the container:

```bash
# Check if the file exists
ls -la /app/train.csv

# If not, copy it
docker cp train.csv ai-container:/app
```

Error: KeyError: 'diabetes'

Solution: the CSV must have all required columns, including the diabetes target:

```bash
# Check CSV structure
head -n 1 train.csv
```

Error: MemoryError during SMOTEENN

Solution: allocate more memory to Docker:

```bash
docker run -it --memory="4g" --name ai-container ai-proyecto-sustituto /bin/bash
```
predict.py - Prediction Script
Command Syntax
```bash
python predict.py --model_file <MODEL> --input_file <INPUT> --predictions_file <OUTPUT>
```
Arguments
| Argument | Required | Type | Description |
|---|---|---|---|
| --model_file | Yes | string | Path to trained model file (e.g., model.pkl) |
| --input_file | Yes | string | Path to input CSV with patient data |
| --predictions_file | Yes | string | Path to save predictions CSV |
Basic Usage
Simple Prediction

```bash
python predict.py \
  --model_file model.pkl \
  --input_file test.csv \
  --predictions_file predictions.csv
```

Output:

```
loading input data
encoding data
scaling data
loading model
making predictions
saving predictions to predictions.csv
```

Result: creates predictions.csv.

View Predictions

Interpretation:
- 0 = No diabetes
- 1 = Has diabetes

Count Predictions

```bash
# Count diabetes predictions
grep -c "^1$" predictions.csv

# Or using Python in the container
python -c "import pandas as pd; df = pd.read_csv('predictions.csv'); print(df['preds'].value_counts())"
```

Output:

```
0    8542
1    1458
Name: preds, dtype: int64
```
Expected CSV Structure (WITHOUT diabetes column):

```csv
gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level
Female,36,0,0,current,32.27,6.2,220
Male,28,0,0,never,27.32,5.7,158
Female,54,0,0,No Info,27.32,6.6,80
```
Important : Input CSV should NOT include the diabetes column. Only the 8 feature columns.
predictions.csv:
- Single column named preds
- One row per input patient
- Values: 0 (no diabetes) or 1 (has diabetes)
- Row order matches the input file
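Because row order matches the input file, predictions can be joined back to the patient rows with a simple zip. This is a sketch using only the standard library; the helper name `join_predictions` is illustrative.

```python
import csv
import io

def join_predictions(input_csv, preds_csv):
    # Pair each input row with its prediction: row N of predictions.csv
    # corresponds to row N of the input file.
    patients = list(csv.DictReader(input_csv))
    preds = list(csv.DictReader(preds_csv))
    assert len(patients) == len(preds), "row counts must match"
    for patient, pred in zip(patients, preds):
        patient["preds"] = pred["preds"]
    return patients

inp = io.StringIO("gender,age\nFemale,36\nMale,28\n")
out = io.StringIO("preds\n1\n0\n")
rows = join_predictions(inp, out)
print(rows[0])  # {'gender': 'Female', 'age': '36', 'preds': '1'}
```

The length assertion catches the most common mistake here: pairing a predictions file with the wrong input file.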
Prediction Process
Validate Files

```python
# Check that the model exists
if not os.path.isfile(model_file):
    logger.error(f"model file {model_file} does not exist")
    exit(-1)

# Check that the input exists
if not os.path.isfile(input_file):
    logger.error(f"input file {input_file} does not exist")
    exit(-1)
```

Load Input Data

```python
Xts = pd.read_csv(input_file)
# Expected: 8 feature columns, no diabetes column
```

Encode Categorical Features

```python
# Apply the same encoding as training
gender_dict = {'Female': 0, 'Male': 1, 'Other': 2}
smoking_history_dict = { ... }
Xts = Xts.replace({'gender': gender_dict, 'smoking_history': smoking_history_dict})
```

Scale Features

```python
scaler = StandardScaler()
Xts = scaler.fit_transform(Xts)
```
Issue: predict.py fits a NEW scaler on the test data instead of reusing the scaler fitted during training, so the test features are standardized with different statistics than the model was trained on. This may reduce accuracy.
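One way to fix this is to fit the scaler once at training time and persist it together with the model, then reuse it at prediction time. The sketch below shows the pattern using only the standard library (the real scripts use scikit-learn's StandardScaler, which can be pickled the same way); the file name `model_bundle.pkl` and the helper names are illustrative.

```python
import pickle
from statistics import mean, stdev

def fit_scaler(columns):
    # columns: list of feature columns (each a list of floats).
    # Store per-feature mean/std so prediction reuses the TRAINING statistics.
    return [(mean(col), stdev(col)) for col in columns]

def apply_scaler(params, columns):
    return [[(x - mu) / sd for x in col] for (mu, sd), col in zip(params, columns)]

# At training time: persist the scaler parameters alongside the model.
train_cols = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
scaler_params = fit_scaler(train_cols)
with open("model_bundle.pkl", "wb") as f:
    pickle.dump({"scaler": scaler_params, "model": "trained-model-here"}, f)

# At prediction time: reload and reuse, instead of fitting a new scaler.
with open("model_bundle.pkl", "rb") as f:
    bundle = pickle.load(f)
scaled = apply_scaler(bundle["scaler"], [[2.0], [20.0]])
print(scaled)  # [[0.0], [0.0]]
```

The same idea with scikit-learn would be to pickle a `(scaler, model)` pair, or a `Pipeline` containing both, and call `scaler.transform` (not `fit_transform`) in predict.py.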
Load Model

```python
with open(model_file, 'rb') as f:
    m = pickle.load(f)
```

Generate Predictions

```python
preds = m.predict(Xts)
# Array of 0s and 1s
```

Save Results

```python
pd.DataFrame(preds.reshape(-1, 1), columns=['preds']).to_csv(predictions_file, index=False)
```
Troubleshooting Predictions
Error: model file model.pkl does not exist

Solution: train the model first:

```bash
python train.py --model_file model.pkl --data_file train.csv
```

Error: ValueError: X has 7 features, but RandomForestClassifier is expecting 8 features

Solution: ensure test.csv has all 8 feature columns (excluding diabetes):

```bash
# Check column count
head -n 1 test.csv | awk -F',' '{print NF}'
# Should output: 8
```

Error: category not in encoding dictionary

Solution: ensure categorical values match training:
- gender: Female, Male, or Other
- smoking_history: No Info, current, ever, former, never, not current
Advanced CLI Usage
Batch Processing Pipeline
Create a shell script for automated workflow:
```bash
#!/bin/bash
# diabetes_pipeline.sh
set -e  # Exit on error

# Configuration
CONTAINER_NAME="ai-container"
IMAGE_NAME="ai-proyecto-sustituto"
DATA_DIR="./data"
MODELS_DIR="./models"
RESULTS_DIR="./results"

# Create directories
mkdir -p "$DATA_DIR" "$MODELS_DIR" "$RESULTS_DIR"

echo "[1/6] Building Docker image..."
docker build -t "$IMAGE_NAME" .

echo "[2/6] Starting container..."
docker run -d --name "$CONTAINER_NAME" "$IMAGE_NAME" sleep infinity

echo "[3/6] Copying data files..."
docker cp "$DATA_DIR/train.csv" "$CONTAINER_NAME:/app/"
docker cp "$DATA_DIR/test.csv" "$CONTAINER_NAME:/app/"

echo "[4/6] Training model..."
docker exec "$CONTAINER_NAME" python train.py \
    --model_file model.pkl \
    --data_file train.csv \
    --overwrite_model

echo "[5/6] Generating predictions..."
docker exec "$CONTAINER_NAME" python predict.py \
    --model_file model.pkl \
    --input_file test.csv \
    --predictions_file predictions.csv

echo "[6/6] Retrieving results..."
docker cp "$CONTAINER_NAME:/app/model.pkl" "$MODELS_DIR/"
docker cp "$CONTAINER_NAME:/app/predictions.csv" "$RESULTS_DIR/"

echo "Cleaning up..."
docker stop "$CONTAINER_NAME"
docker rm "$CONTAINER_NAME"

echo "Pipeline complete!"
echo "Model: $MODELS_DIR/model.pkl"
echo "Predictions: $RESULTS_DIR/predictions.csv"
```
Usage:

```bash
chmod +x diabetes_pipeline.sh
./diabetes_pipeline.sh
```
Volume-Based Workflow
Use volumes for persistent storage:
```bash
# Create directories
mkdir -p ./data ./models ./output

# Place data files
cp train.csv ./data/
cp test.csv ./data/

# Run with volumes
docker run -it \
  --name ai-container \
  -v "$(pwd)/data:/app/data" \
  -v "$(pwd)/models:/app/models" \
  -v "$(pwd)/output:/app/output" \
  ai-proyecto-sustituto /bin/bash

# Inside the container
python train.py \
  --model_file /app/models/model.pkl \
  --data_file /app/data/train.csv \
  --overwrite_model

python predict.py \
  --model_file /app/models/model.pkl \
  --input_file /app/data/test.csv \
  --predictions_file /app/output/predictions.csv

# Exit the container
exit

# Files persist on the host
ls models/   # model.pkl
ls output/   # predictions.csv
```
Cron Job Automation
Schedule daily predictions:
```bash
# crontab -e
# Run diabetes predictions every day at 2 AM
0 2 * * * /path/to/diabetes_pipeline.sh >> /var/log/diabetes_predictions.log 2>&1
```
Python Integration
Call CLI tools from Python:
```python
import subprocess
import pandas as pd

# Train model
subprocess.run([
    "docker", "exec", "ai-container",
    "python", "train.py",
    "--model_file", "model.pkl",
    "--data_file", "train.csv",
    "--overwrite_model",
], check=True)

# Generate predictions
subprocess.run([
    "docker", "exec", "ai-container",
    "python", "predict.py",
    "--model_file", "model.pkl",
    "--input_file", "test.csv",
    "--predictions_file", "predictions.csv",
], check=True)

# Copy and read results
subprocess.run([
    "docker", "cp",
    "ai-container:/app/predictions.csv",
    "./predictions.csv",
], check=True)

predictions = pd.read_csv("predictions.csv")
print(f"Diabetes cases detected: {(predictions['preds'] == 1).sum()}")
```
Parallel Processing
For very large datasets, split the input and process chunks in parallel. This assumes the chunk files are visible inside the container, e.g., via a mounted volume:

```bash
# Split test.csv into chunks (note: only the first chunk keeps the header row)
split -l 10000 test.csv test_chunk_

# Process in parallel (requires GNU parallel)
parallel -j 4 "docker exec ai-container python predict.py \
  --model_file model.pkl \
  --input_file {} \
  --predictions_file {}.preds" ::: test_chunk_*

# Combine results (each .preds file carries its own header row)
cat test_chunk_*.preds > predictions_all.csv
```
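Note that `split -l` gives the header row only to the first chunk, so later chunks are not valid inputs for predict.py as-is. A header-preserving split can be sketched in Python using only the standard library; the function names and the `test_chunk_` prefix are illustrative.

```python
import csv

def _write_chunk(prefix, n, header, rows):
    # Each chunk file gets its own copy of the header row.
    with open(f"{prefix}{n:03d}.csv", "w", newline="") as out:
        w = csv.writer(out)
        w.writerow(header)
        w.writerows(rows)

def split_csv(path, chunk_size, prefix):
    # Split a CSV into chunks of at most chunk_size data rows,
    # repeating the header in every chunk. Returns the chunk count.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        chunk, n = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == chunk_size:
                _write_chunk(prefix, n, header, chunk)
                chunk, n = [], n + 1
        if chunk:
            _write_chunk(prefix, n, header, chunk)
            return n + 1
        return n
```

The final `cat` step has the mirror-image problem (one `preds` header per chunk), so a merge step should likewise keep only the first header.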
Resource Allocation
Allocate more resources for faster training:
```bash
docker run -it \
  --name ai-container \
  --cpus="4" \
  --memory="8g" \
  ai-proyecto-sustituto /bin/bash
```
Comparison: CLI vs API
| Feature | CLI (Phase 2) | API (Phase 3) |
|---|---|---|
| Input | CSV files | JSON requests |
| Output | CSV files | JSON responses |
| Best For | Batch processing | Real-time queries |
| Integration | Scripts, cron jobs | Web apps, mobile apps |
| Scalability | Manual parallelization | Built-in concurrency |
| Ease of Use | Command-line knowledge | HTTP requests |
| Debugging | Terminal logs | API logs + Swagger UI |
Next Steps
- API Deployment: for real-time predictions, explore the REST API
- Docker Setup: advanced Docker configurations and best practices
- Phase 2 Guide: complete Phase 2 walkthrough
- Model Architecture: understand the underlying model