Overview
The `ResultsAnalyzer` analyzes ML experiment outcomes using Gemini 3, comparing metrics against baselines and detecting performance patterns. It computes metric comparisons locally and uses Gemini to generate actionable insights.
Key Features
- Compares metrics against baseline, best, and previous experiments
- Detects performance trends (improving, degrading, plateau, fluctuating)
- Handles lower-is-better vs higher-is-better metrics correctly
- Generates key observations for informing next iterations
- Maintains conversation context for Thought Signature continuity
- Graceful fallback when Gemini is unavailable
Class Definition
Constructor
Shared GeminiClient instance for API calls. Sharing the same client across cognitive components preserves conversation history.
Methods
analyze
Analyze the results of a completed experiment.
Parameters
The experiment result to analyze, containing:
- `experiment_name` (str): Name of the experiment
- `iteration` (int): Iteration number
- `model_type` (str): Model class used
- `metrics` (dict): Dictionary of metric names to values
- `hypothesis` (str): Hypothesis being tested
- `success` (bool): Whether the experiment completed successfully
- `error_message` (Optional[str]): Error details if the experiment failed
Current experiment state containing:
- `experiments` (list[ExperimentResult]): Historical results
- `config` (Config): Configuration including the primary metric
- `best_metric` (Optional[float]): Best metric value so far
- `best_experiment` (Optional[str]): Name of the best experiment
Returns
Analysis result containing:
- `experiment_name` (str): Name of the analyzed experiment
- `iteration` (int): Iteration number
- `success` (bool): Whether the experiment succeeded
- `primary_metric` (Optional[MetricComparison]): Detailed metric comparison
- `trend_pattern` (TrendPattern): Detected performance trend
- `key_observations` (list[str]): Actionable insights (3-5 bullet points)
- `reasoning` (str): Detailed explanation of the analysis
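Concretely, the return value can be pictured as a dataclass along these lines. This is a sketch: the field names come from the list above, but the defaults and enum values are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TrendPattern(Enum):
    # String values are assumptions; see the TrendPattern section for semantics.
    INITIAL = "initial"
    IMPROVING = "improving"
    DEGRADING = "degrading"
    PLATEAU = "plateau"
    FLUCTUATING = "fluctuating"


@dataclass
class AnalysisResult:
    experiment_name: str
    iteration: int
    success: bool
    trend_pattern: TrendPattern
    primary_metric: Optional[object] = None  # MetricComparison in the real code
    key_observations: list[str] = field(default_factory=list)
    reasoning: str = ""
```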
get_analysis_count
Get the number of analyses performed.
Returns
Total number of experiment results analyzed by this instance.
Data Structures
MetricComparison
Detailed comparison of the current metric value against historical benchmarks.
Name of the primary metric being compared (e.g., “rmse”, “f1”).
Current experiment’s metric value.
Baseline metric value (typically from iteration 0).
Best metric value achieved so far across all experiments.
Metric value from the previous iteration.
Percentage change from baseline. Positive indicates improvement (accounting for metric direction).
Percentage change from best. Positive indicates improvement over current best.
Percentage change from previous iteration. Positive indicates improvement.
Whether current result represents an improvement over best (considering improvement threshold).
Whether current result is the new best result.
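A minimal sketch of this structure and its comparison logic, assuming percentage changes are sign-normalized so that positive always means improvement (the real computation may differ in details such as the threshold semantics):

```python
from dataclasses import dataclass
from typing import Optional


def pct_change(reference: float, current: float, higher_is_better: bool) -> float:
    """Percent change from reference, signed so that positive = improvement."""
    raw = (current - reference) / abs(reference) * 100.0
    return raw if higher_is_better else -raw


@dataclass
class MetricComparison:
    name: str
    current_value: float
    baseline_value: Optional[float]
    best_value: Optional[float]
    previous_value: Optional[float]
    pct_change_from_baseline: Optional[float]
    pct_change_from_best: Optional[float]
    pct_change_from_previous: Optional[float]
    is_improvement: bool
    is_new_best: bool


def compare(name, current, baseline, best, previous,
            higher_is_better=True, threshold_pct=1.0):
    """Build a MetricComparison against the three historical benchmarks."""
    d_base = pct_change(baseline, current, higher_is_better)
    d_best = pct_change(best, current, higher_is_better)
    d_prev = pct_change(previous, current, higher_is_better)
    return MetricComparison(name, current, baseline, best, previous,
                            d_base, d_best, d_prev,
                            is_improvement=d_best > threshold_pct,
                            is_new_best=d_best > 0.0)
```

For a lower-is-better metric such as RMSE, a drop from a baseline of 0.50 to 0.40 yields `pct_change_from_baseline = +20.0`, i.e. a 20% improvement.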
TrendPattern
Enum representing detected performance trends across recent iterations.
Metric Direction Handling
The analyzer correctly handles metrics where different directions indicate improvement:
Lower is Better
- RMSE, MSE, MAE, log_loss, error
Higher is Better
- accuracy, f1, r2, precision, recall, AUC
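A small helper capturing this direction handling might look like the following. The metric sets mirror the lists above; treating unknown metric names as higher-is-better is an assumption.

```python
# Direction tables mirroring the lists above (extend as needed).
LOWER_IS_BETTER = {"rmse", "mse", "mae", "log_loss", "error"}
HIGHER_IS_BETTER = {"accuracy", "f1", "r2", "precision", "recall", "auc"}


def is_improvement(metric_name: str, old: float, new: float) -> bool:
    """True if `new` beats `old`, given the metric's direction."""
    if metric_name.lower() in LOWER_IS_BETTER:
        return new < old
    return new > old  # assumed default: higher is better
```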
System Prompt
The analyzer uses a comprehensive system prompt that guides Gemini to:
- Compare results against baseline, best, and previous experiments
- Identify meaningful improvements versus noise
- Detect patterns across iterations
- Consider both primary and secondary metrics
- Note anomalies or unexpected results
- Provide observations that inform next experiment design
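The exact prompt text is internal to the analyzer and is not reproduced here; an illustrative prompt covering the same points might read:

```python
# Illustrative only: the real system prompt is defined inside ResultsAnalyzer.
ANALYZER_SYSTEM_PROMPT = """\
You are an ML experiment analyst. Given an experiment result and its history:
- Compare the primary metric against the baseline, best, and previous runs.
- Distinguish meaningful improvements from noise, using the improvement threshold.
- Detect patterns across iterations (improving, degrading, plateau, fluctuating).
- Consider secondary metrics as well as the primary one.
- Flag anomalies or unexpected results.
- End with 3-5 key observations that inform the next experiment design.
"""
```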
Usage Examples
Basic Analysis
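The original example code did not survive extraction. The sketch below shows the expected call shape using `SimpleNamespace` stand-ins for the real `ExperimentResult` and state objects; all field values are made up, and the commented-out calls assume the constructor and `analyze` signature described above.

```python
from types import SimpleNamespace

# Stand-in for ExperimentResult (hypothetical values).
result = SimpleNamespace(
    experiment_name="exp_003_gradient_boosting",
    iteration=3,
    model_type="GradientBoostingRegressor",
    metrics={"rmse": 0.412, "mae": 0.301},
    hypothesis="Boosting captures nonlinear feature interactions better",
    success=True,
    error_message=None,
)

# Stand-in for the experiment state.
state = SimpleNamespace(
    experiments=[],  # list[ExperimentResult] in the real code
    config=SimpleNamespace(primary_metric="rmse", improvement_threshold=0.01),
    best_metric=0.455,
    best_experiment="exp_001_baseline",
)

# With the real classes this would look like:
#   analyzer = ResultsAnalyzer(gemini_client)
#   analysis = analyzer.analyze(result, state)
#   print(analysis.trend_pattern, analysis.key_observations)
```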
Detailed Metric Comparison
Trend Detection
Handling Failed Experiments
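A sketch of how a failed run could be analyzed, assuming the behavior implied by the parameters above: no metric comparison is performed, and the error message surfaces as a key observation.

```python
from types import SimpleNamespace


def analyze_failure(result):
    """Illustrative handling of a failed experiment (no metrics to compare)."""
    assert not result.success
    return SimpleNamespace(
        experiment_name=result.experiment_name,
        iteration=result.iteration,
        success=False,
        primary_metric=None,  # comparison skipped: the run produced no metrics
        key_observations=[f"Experiment failed: {result.error_message}"],
        reasoning="Skipped metric comparison because the run did not complete.",
    )


failed = SimpleNamespace(
    experiment_name="exp_004_xgboost",
    iteration=4,
    success=False,
    error_message="CUDA out of memory",
)
analysis = analyze_failure(failed)
```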
Integration in Experiment Loop
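Schematically, the analyzer sits between running one experiment and designing the next. The driver below is illustrative only; `run_experiment`, `analyze`, and `design_next` are hypothetical callables standing in for the real components.

```python
def experiment_loop(run_experiment, analyze, design_next, max_iterations=5):
    """Schematic driver: run -> analyze -> design next (illustrative)."""
    design = {"iteration": 0, "model_type": "baseline"}
    analyses = []
    for _ in range(max_iterations):
        result = run_experiment(design)      # train/evaluate one experiment
        analysis = analyze(result, analyses) # analyze against history
        analyses.append(analysis)
        design = design_next(analysis)       # feed insights into the next design
    return analyses
```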
Custom Metric Comparison
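One plausible customization, sketched here as editing the metric-direction table before analysis; the actual extension point, if any, is not documented above, so treat the names as assumptions.

```python
# Assumed customization point: the direction table used for comparisons.
LOWER_IS_BETTER = {"rmse", "mse", "mae", "log_loss", "error"}

# Register a custom lower-is-better metric before running analyses.
LOWER_IS_BETTER.add("latency_ms")


def direction(metric_name: str) -> str:
    """Return which direction counts as improvement for a metric."""
    return "lower" if metric_name.lower() in LOWER_IS_BETTER else "higher"
```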
Fallback Behavior
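When Gemini is unavailable, observations can still be produced from the local metric comparison. A template-based sketch of that fallback (assumed behavior, not the analyzer's actual wording):

```python
def fallback_observations(metric_name, current, best, higher_is_better=True):
    """Template-based observations when the Gemini call fails (illustrative)."""
    if best is None:
        return [f"First recorded value for {metric_name}: {current:.4f}"]
    improved = current > best if higher_is_better else current < best
    if improved:
        return [f"New best {metric_name}: {current:.4f} (previous best {best:.4f})"]
    return [f"{metric_name} = {current:.4f} did not beat best {best:.4f}"]
```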
Trend Detection Algorithm
The analyzer uses the last 3 successful experiments to detect trends:
- Extract recent metrics: Get primary metric values from the last 3 successful experiments
- Calculate changes: Compute percentage changes between consecutive iterations
- Apply threshold: Use `config.improvement_threshold` to determine significance
- Classify pattern:
  - IMPROVING: All changes show improvement above the threshold
  - DEGRADING: All changes show degradation beyond the threshold
  - PLATEAU: All changes within the threshold (near zero)
  - FLUCTUATING: Mixed improvements and degradations
  - INITIAL: Fewer than 3 experiments available
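The steps above can be sketched as a small function. String labels stand in for the `TrendPattern` enum, and the exact threshold semantics are an assumption.

```python
def classify_trend(recent_metrics, higher_is_better=True, threshold_pct=1.0):
    """Classify the trend over the last 3 successful experiments' metric values."""
    if len(recent_metrics) < 3:
        return "INITIAL"
    values = recent_metrics[-3:]
    changes = []
    for prev, curr in zip(values, values[1:]):
        pct = (curr - prev) / abs(prev) * 100.0
        # Sign-normalize so that positive always means improvement.
        changes.append(pct if higher_is_better else -pct)
    if all(c > threshold_pct for c in changes):
        return "IMPROVING"
    if all(c < -threshold_pct for c in changes):
        return "DEGRADING"
    if all(abs(c) <= threshold_pct for c in changes):
        return "PLATEAU"
    return "FLUCTUATING"
```

For example, a lower-is-better metric falling from 0.50 to 0.45 to 0.40 is classified as IMPROVING, while three accuracy values within 1% of each other yield PLATEAU.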
See Also
- GeminiClient - Underlying API client
- ExperimentDesigner - Designs experiments based on analysis
- HypothesisGenerator - Generates hypotheses from analysis
- ReportGenerator - Creates reports from analyses