Overview
The SmartEat AI ML pipeline processes recipe data from Food.com to create a personalized recommendation system. The pipeline transforms raw recipe data into a trained K-Nearest Neighbors (KNN) model that powers nutritional recommendations.
Data Source
The pipeline uses the Food.com - Recipes and Reviews dataset from Kaggle, containing over 522,000 recipes with comprehensive nutritional information.
Dataset Size
522,517 recipes with 28 columns
Features
Nutritional data, ingredients, categories, and ratings
Pipeline Stages
Data Acquisition
Download the dataset from Google Drive and load it into a pandas DataFrame using the Python engine for error tolerance.
The dataset includes:
- Recipe metadata (name, author, category)
- Time information (prep time, cook time)
- Nutritional content (calories, macronutrients)
- Ingredients and instructions
- User ratings and reviews
Data Exploration
Analyze the dataset structure, distributions, and data quality:
Key Findings:
- Complete nutritional data for all recipes
- Missing values in optional fields (ratings, servings)
- 32,600 duplicate recipe names requiring deduplication
- Average calories: 484.4 per serving
- Average protein: 17.5g
- Average carbohydrates: 49.1g
- Average fat: 24.6g
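Checks like those behind the findings above might look like the sketch below. The toy DataFrame and its column names (`Name`, `Calories`, `AggregatedRating`) are assumptions for illustration, so the numbers it produces are not the dataset's.

```python
import pandas as pd

# Toy stand-in for the recipe DataFrame; column names are assumed.
df = pd.DataFrame({
    "Name": ["Chili", "Chili", "Pancakes", "Hummus"],
    "Calories": [400.0, 420.0, 350.0, 170.0],
    "AggregatedRating": [4.5, None, 5.0, None],
})

# Missing values per column (optional fields such as ratings show gaps).
missing_per_column = df.isna().sum()

# Rows whose name repeats an earlier row's name.
duplicate_names = int(df["Name"].duplicated().sum())

# Average of a nutritional column.
mean_calories = df["Calories"].mean()

print(missing_per_column)
print(duplicate_names, mean_calories)
```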
Data Cleaning
Clean and prepare the dataset for model training:
- Remove duplicate recipes (same name)
- Handle missing values in non-critical fields
- Parse and normalize ingredient lists
- Convert categorical data to appropriate formats
- Filter outliers in nutritional values
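A few of the cleaning steps above can be sketched as one function. The helper `clean_recipes`, the `ReviewCount` column, and the calorie ceiling of 5000 are hypothetical choices made for this example. The source does not state which thresholds or fill values the notebook uses.

```python
import pandas as pd


def clean_recipes(df, limits):
    """Deduplicate by name, fill an optional field, and drop outliers.

    `limits` maps a nutritional column to its largest plausible value.
    """
    # Keep the first recipe for each name.
    out = df.drop_duplicates(subset="Name", keep="first").copy()
    # Non-critical field: treat a missing review count as zero reviews.
    out["ReviewCount"] = out["ReviewCount"].fillna(0)
    # Keep rows whose value lies in [0, upper] for each limited column.
    for col, upper in limits.items():
        out = out[out[col].between(0, upper)]
    return out.reset_index(drop=True)


raw = pd.DataFrame({
    "Name": ["Chili", "Chili", "Pancakes", "Mystery Stew"],
    "Calories": [400.0, 420.0, 350.0, 90000.0],
    "ReviewCount": [12.0, 3.0, None, 1.0],
})
cleaned = clean_recipes(raw, limits={"Calories": 5000})
print(cleaned)
```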
Feature Engineering
Extract and engineer features for the recommendation model:
Primary Features:
Additional Processing:
- Text vectorization of ingredients using TF-IDF
- Category encoding for meal types and diet types
- Normalization of nutritional values
- Creation of derived features (macronutrient ratios)
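Three of the processing steps above (TF-IDF, normalization, and macronutrient ratios) can be illustrated with scikit-learn. This is a sketch under assumptions: the column names, the use of `StandardScaler`, and the definition of a macronutrient ratio as each macro's share of calories (4 kcal/g for protein and carbohydrates, 9 kcal/g for fat) are not confirmed by the source.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy data; column names are assumed.
df = pd.DataFrame({
    "Ingredients": ["chicken rice garlic", "oats milk banana", "chicken garlic lemon"],
    "Calories": [520.0, 310.0, 430.0],
    "ProteinContent": [40.0, 10.0, 38.0],
    "CarbohydrateContent": [45.0, 55.0, 8.0],
    "FatContent": [18.0, 6.0, 26.0],
})

# Derived features: share of macro calories from protein, carbs, and fat.
macro_kcal = pd.DataFrame({
    "protein_ratio": df["ProteinContent"] * 4,
    "carb_ratio": df["CarbohydrateContent"] * 4,
    "fat_ratio": df["FatContent"] * 9,
})
macro_ratios = macro_kcal.div(macro_kcal.sum(axis=1), axis=0)

# Normalization: zero mean and unit variance per nutritional column.
nutrient_cols = ["Calories", "ProteinContent", "CarbohydrateContent", "FatContent"]
scaler = StandardScaler()
nutrition_scaled = scaler.fit_transform(df[nutrient_cols])

# Dense numeric feature block: scaled nutrients plus the ratios.
features = np.hstack([nutrition_scaled, macro_ratios.to_numpy()])

# Text vectorization: one sparse TF-IDF row per recipe.
tfidf = TfidfVectorizer()
ingredient_matrix = tfidf.fit_transform(df["Ingredients"])

print(features.shape, ingredient_matrix.shape)
```

A recipe with zero grams of every macronutrient would produce a division by zero in the ratio step, so real data would need a guard for that case.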
Model Training
Train the K-Nearest Neighbors model. The model uses Euclidean distance to find nutritionally similar recipes.
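A minimal sketch of such a model, assuming scikit-learn's `NearestNeighbors` with four nutritional features and `n_neighbors=2`. The actual estimator class, feature set, and neighbor count used in the notebook are not stated here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: calories, protein (g), carbohydrates (g), fat (g).
X = np.array([
    [520.0, 40.0, 45.0, 18.0],
    [310.0, 10.0, 55.0, 6.0],
    [430.0, 38.0, 8.0, 26.0],
    [500.0, 42.0, 40.0, 17.0],
])

# Scale first so no single nutrient dominates the Euclidean distance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# "Training" a KNN index amounts to storing the scaled points.
knn = NearestNeighbors(n_neighbors=2, metric="euclidean")
knn.fit(X_scaled)

# Query with a target nutritional profile, scaled the same way.
query = scaler.transform([[510.0, 41.0, 42.0, 18.0]])
distances, indices = knn.kneighbors(query)
print(indices[0], distances[0])
```

Because Euclidean distance is sensitive to feature scale, calories (hundreds) would otherwise outweigh grams of protein (tens), which is why the query goes through the same scaler as the training data.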
Notebook Reference
The complete pipeline is documented in:
ML Training Notebook
notebooks/Cuaderno_SmartEatAI.ipynb contains the full implementation with visualizations and analysis.
Key Technologies
Data Quality Metrics
| Metric | Value |
|---|---|
| Total Recipes | 522,517 |
| Complete Nutritional Data | 100% |
| Duplicate Names | 32,600 |
| Average Rating | 4.63/5.0 |
| Recipes with Reviews | 51.5% |
The pipeline emphasizes data quality over quantity, ensuring that only recipes with complete, accurate nutritional information are used for recommendations.
Visualizations
The notebook includes comprehensive visualizations:
- Nutritional distribution histograms
- Missing data heatmaps
- Word clouds of ingredients
- Correlation matrices
- Recipe category distributions
Next Steps
After the pipeline completes:
- The trained model is loaded into the backend (backend/app/core/ml_model.py)
- The KNN model powers recipe similarity searches
- The scaler ensures consistent feature normalization
- The recipe dataframe enables fast lookups
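How the three artifacts cooperate at query time can be sketched as below. The `recommend` function, the `FEATURES` list, and the toy recipes are hypothetical, and the real contents of backend/app/core/ml_model.py may differ.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Assumed feature columns; the source does not list them.
FEATURES = ["Calories", "ProteinContent", "CarbohydrateContent", "FatContent"]

# Stand-ins for the three artifacts: recipe DataFrame, scaler, KNN model.
recipes = pd.DataFrame({
    "Name": ["Chicken Rice Bowl", "Banana Oatmeal", "Lemon Chicken", "Turkey Rice Bowl"],
    "Calories": [520.0, 310.0, 430.0, 500.0],
    "ProteinContent": [40.0, 10.0, 38.0, 42.0],
    "CarbohydrateContent": [45.0, 55.0, 8.0, 40.0],
    "FatContent": [18.0, 6.0, 26.0, 17.0],
})
scaler = StandardScaler().fit(recipes[FEATURES])
knn = NearestNeighbors(n_neighbors=2, metric="euclidean")
knn.fit(scaler.transform(recipes[FEATURES]))


def recommend(target, k=2):
    """Return the names of the k recipes closest to a nutritional target."""
    # Scale the query with the same scaler used at training time.
    query = pd.DataFrame([target], columns=FEATURES)
    _, idx = knn.kneighbors(scaler.transform(query), n_neighbors=k)
    # Positional lookup in the recipe DataFrame maps indices back to rows.
    return recipes.iloc[idx[0]]["Name"].tolist()


names = recommend({
    "Calories": 510.0,
    "ProteinContent": 41.0,
    "CarbohydrateContent": 42.0,
    "FatContent": 18.0,
})
print(names)
```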
Related Documentation
See KNN Recommender for details on how the model is used in production.
