Overview
Phase 1 uses a Jupyter notebook (Diabetes_Prediction.ipynb) for interactive data exploration, visualization, model training, and evaluation. This phase is perfect for understanding the dataset, experimenting with different approaches, and developing the initial model.
Best For: Data scientists and ML engineers who want to explore the data, visualize patterns, and iterate on model development.
Platform: Google Colab (free, no local setup required)
Prerequisites
Kaggle Account
Create a free account at kaggle.com
Kaggle API Credentials
- Go to your Kaggle account settings
- Scroll to “API” section
- Click “Create New API Token”
- Download kaggle.json (you’ll upload this to Colab)
Google Colab Access
Navigate to Google Colab - no installation needed
Getting Started
Open Notebook in Colab
- Go to Google Colab
- Click File → Upload notebook
- Navigate to ~/workspace/source/fase-1/Diabetes_Prediction.ipynb
- Upload the file
Upload Kaggle Credentials
When you run the first code cell, the notebook will prompt you to upload your kaggle.json file.
Notebook Structure
The notebook is organized into logical sections:
1. Overview
Dataset Introduction
The notebook begins with a description of the dataset:
“The following data is a collection of medical and demographic data from patients, along with their diabetes status (positive or negative). This dataset includes 100,000 rows and 9 features:”
Features listed:
- gender
- age
- hypertension
- heart_disease
- smoking_history
- bmi (body mass index)
- HbA1c_level
- blood_glucose_level
- diabetes (target variable)
“Healthcare professionals may find this data useful in identifying patients at risk of developing diabetes and in developing personalized treatment plans.”
2. Data Download
3. Import Libraries
The notebook uses:
- pandas for data manipulation
- matplotlib/seaborn for visualization
- sklearn for ML algorithms
- imblearn for handling imbalanced data
4. Load and Explore Data
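A minimal, self-contained illustration of this step, using a few made-up rows with the dataset’s nine columns (in the notebook, the DataFrame comes from pd.read_csv on the downloaded Kaggle file):

```python
import pandas as pd

# Illustrative rows only -- values are invented, not from the real dataset
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "age": [44.0, 67.0, 30.0],
    "hypertension": [0, 1, 0],
    "heart_disease": [0, 1, 0],
    "smoking_history": ["never", "former", "current"],
    "bmi": [27.3, 31.2, 22.5],
    "HbA1c_level": [5.8, 7.1, 5.0],
    "blood_glucose_level": [120, 200, 95],
    "diabetes": [0, 1, 0],
})

print(df.head())   # first rows
print(df.shape)    # (rows, columns)
df.info()          # dtypes and non-null counts
```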
5. Visualize Class Distribution
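A sketch of the class-distribution plot using toy labels (the notebook may use seaborn’s countplot instead of the pandas plotting shown here):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy labels mimicking the dataset's imbalance (9 negatives, 1 positive)
y = pd.Series([0] * 9 + [1], name="diabetes")

counts = y.value_counts()            # class 0 dominates
ax = counts.plot(kind="bar")
ax.set_xlabel("diabetes")
ax.set_ylabel("count")
plt.tight_layout()
```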
6. Encode Categorical Variables
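One common way to encode the categorical columns (gender, smoking_history) is sklearn’s LabelEncoder; a sketch with made-up values (the notebook’s exact encoding strategy may differ):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "gender": ["Female", "Male", "Other", "Female"],
    "smoking_history": ["never", "former", "current", "never"],
})

# LabelEncoder assigns integers to categories in sorted order
for col in ["gender", "smoking_history"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(df)
```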
7. Split Features and Target
8. Train-Test Split
Split Ratio: 70% training, 30% testing
Stratification: The notebook doesn’t explicitly use stratification, but you could add stratify=y to ensure balanced splits.
9. Feature Scaling
StandardScaler normalizes features to have:
- Mean = 0
- Standard deviation = 1
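Steps 8 and 9 together can be sketched as follows on toy data; stratify=y is the optional addition mentioned above, and the scaler is fit on the training split only to avoid leaking test-set statistics:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # toy feature matrix
y = rng.integers(0, 2, size=100)     # toy binary target

# 70/30 split as in the notebook; stratify=y preserves class ratios
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on training data only
X_test_s = scaler.transform(X_test)        # reuse training statistics

print(X_train_s.mean(axis=0).round(6))     # ~0 per feature
print(X_train_s.std(axis=0).round(6))      # ~1 per feature
```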
10. Handle Imbalanced Data
What is SMOTEENN?
SMOTEENN combines two techniques:
- SMOTE (Synthetic Minority Over-sampling Technique):
- Creates synthetic samples of the minority class (diabetes=1)
- Interpolates between existing minority samples
- ENN (Edited Nearest Neighbors):
- Removes noisy samples from both classes
- Cleans up overlap between classes
11. Train Model
Why RandomForestClassifier?
RandomForest is an ensemble learning method that:
- Builds multiple decision trees
- Aggregates their predictions (voting)
- Handles non-linear relationships well
- Is relatively resistant to overfitting
- Works well out-of-the-box with default parameters
Default parameters:
- n_estimators=100 (number of trees)
- max_depth=None (nodes expand until pure)
- min_samples_split=2
- random_state=None (results vary between runs)
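A training sketch on toy data with those defaults written out explicitly (random_state is fixed here for reproducibility, whereas the notebook leaves it unset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=42)

clf = RandomForestClassifier(n_estimators=100, max_depth=None,
                             min_samples_split=2, random_state=42)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy
```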
12. Make Predictions
13. Evaluate Model
- Precision: What % of positive predictions were correct?
- Recall: What % of actual positives were found?
- F1-Score: Harmonic mean of precision and recall
- Support: Number of samples in each class
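How these metrics are produced and read can be shown with a toy example (labels are invented, not model output):

```python
from sklearn.metrics import classification_report

# Toy ground truth and predictions to illustrate the report
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

print(classification_report(y_true, y_pred))
report = classification_report(y_true, y_pred, output_dict=True)
```

Here class 1 has precision 2/3 (two of the three positive predictions were correct) and recall 2/3 (two of the three actual positives were found).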
Key Concepts Demonstrated
Data Exploration
Loading, inspecting, and visualizing the dataset to understand patterns and distributions
Preprocessing
Encoding categorical variables and scaling numeric features for model training
Class Imbalance
Using SMOTEENN to create a balanced training set from imbalanced medical data
Model Training
Training RandomForestClassifier and evaluating performance with classification metrics
Experimentation Ideas
The notebook is perfect for trying different approaches:
- Different Models
- Hyperparameter Tuning
- Feature Importance
- Different Resampling
Try other classifiers:
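For instance, a sketch with toy data; any estimator sharing the fit/score interface can be swapped into the pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Compare alternative classifiers on the same split
scores = {}
for model in (LogisticRegression(max_iter=1000),
              GradientBoostingClassifier(random_state=42)):
    model.fit(X_tr, y_tr)
    scores[type(model).__name__] = model.score(X_te, y_te)
print(scores)
```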
Sample Output
When you run the notebook, you’ll see:
Data Preview
Class Distribution Plot
A bar chart showing the imbalance between diabetes=0 and diabetes=1 classes.
Classification Report
Actual numbers will vary based on the random train-test split.
Advantages of Phase 1
Interactive Exploration
- Run code cells individually
- See immediate visual feedback
- Experiment without affecting production code
- Easy to share results with team
No Local Setup
- Runs entirely in Google Colab
- No need to install Python or dependencies
- Free GPU/TPU access available
- Cloud storage integration
Documentation and Learning
- Markdown cells explain each step
- Code and results in one place
- Perfect for presentations and reports
- Educational for understanding ML workflow
Limitations
Next Steps
Move to CLI
Once satisfied with the model, proceed to Phase 2 for command-line tools
Deploy API
For production use, implement Phase 3 REST API
Phase 2: CLI
Command-line tools for batch predictions
Phase 3: API
REST API for production deployments
Model Architecture
Deep dive into RandomForest and preprocessing
Imbalanced Data
Understanding SMOTEENN technique