
Overview

The ResultsAnalyzer analyzes ML experiment outcomes using Gemini 3, comparing metrics against baselines and detecting performance patterns. It computes metric comparisons locally and uses Gemini to generate actionable insights.

Key Features

  • Compares metrics against baseline, best, and previous experiments
  • Detects performance trends (improving, degrading, plateau, fluctuating)
  • Handles lower-is-better vs higher-is-better metrics correctly
  • Generates key observations for informing next iterations
  • Maintains conversation context for Thought Signature continuity
  • Graceful fallback when Gemini is unavailable

Class Definition

class ResultsAnalyzer:
    """Analyzes ML experiment results using Gemini 3.

    Key features:
    - Compares metrics against baseline, best, and previous
    - Detects performance trends (improving, degrading, plateau)
    - Handles lower-is-better vs higher-is-better metrics correctly
    - Generates key observations for next iteration
    - Maintains conversation context for Thought Signature continuity
    """

    def __init__(self, gemini_client: GeminiClient):
        """Initialize the results analyzer.

        Args:
            gemini_client: Shared GeminiClient instance for API calls.
        """

Constructor

gemini_client (GeminiClient, required)
Shared GeminiClient instance for API calls. Sharing the same client across cognitive components preserves conversation history.

Methods

analyze

Analyze the results of a completed experiment.
def analyze(
    self,
    current_result: ExperimentResult,
    state: ExperimentState,
) -> AnalysisResult:
    """Analyze the results of a completed experiment.

    Args:
        current_result: The experiment result to analyze.
        state: Current experiment state with history.

    Returns:
        AnalysisResult with comparisons and observations.
    """

Parameters

current_result (ExperimentResult, required)
The experiment result to analyze, containing:
  • experiment_name (str): Name of the experiment
  • iteration (int): Iteration number
  • model_type (str): Model class used
  • metrics (dict): Dictionary of metric names to values
  • hypothesis (str): Hypothesis being tested
  • success (bool): Whether the experiment completed successfully
  • error_message (Optional[str]): Error details if the experiment failed
state (ExperimentState, required)
Current experiment state containing:
  • experiments (list[ExperimentResult]): Historical results
  • config (Config): Configuration, including the primary metric
  • best_metric (Optional[float]): Best metric value so far
  • best_experiment (Optional[str]): Name of the best experiment
Returns

AnalysisResult
Analysis result containing:
  • experiment_name (str): Name of the analyzed experiment
  • iteration (int): Iteration number
  • success (bool): Whether the experiment succeeded
  • primary_metric (Optional[MetricComparison]): Detailed metric comparison
  • trend_pattern (TrendPattern): Detected performance trend
  • key_observations (list[str]): Actionable insights (3-5 bullet points)
  • reasoning (str): Detailed explanation of the analysis
get_analysis_count

Get the number of analyses performed.
def get_analysis_count(self) -> int:
    """Get the number of analyses performed."""

Returns

count (int)
Total number of experiment results analyzed by this instance.

Data Structures

MetricComparison

Detailed comparison of the current metric value against historical benchmarks.
@dataclass
class MetricComparison:
    metric_name: str
    current_value: float
    baseline_value: Optional[float]
    best_value: Optional[float]
    previous_value: Optional[float]
    change_from_baseline_pct: Optional[float]
    change_from_best_pct: Optional[float]
    change_from_previous_pct: Optional[float]
    is_improvement: bool
    is_new_best: bool
Fields:
  • metric_name (str): Name of the primary metric being compared (e.g., "rmse", "f1")
  • current_value (float): Current experiment's metric value
  • baseline_value (Optional[float]): Baseline metric value (typically from iteration 0)
  • best_value (Optional[float]): Best metric value achieved so far across all experiments
  • previous_value (Optional[float]): Metric value from the previous iteration
  • change_from_baseline_pct (Optional[float]): Percentage change from baseline; positive indicates improvement (accounting for metric direction)
  • change_from_best_pct (Optional[float]): Percentage change from best; positive indicates improvement over the current best
  • change_from_previous_pct (Optional[float]): Percentage change from the previous iteration; positive indicates improvement
  • is_improvement (bool): Whether the current result represents an improvement over the best (considering the improvement threshold)
  • is_new_best (bool): Whether the current result is the new best result

TrendPattern

Enum representing detected performance trends across recent iterations.
class TrendPattern(Enum):
    INITIAL = "initial"          # Fewer than 3 experiments
    IMPROVING = "improving"      # Consistent improvement over 2+ iterations
    DEGRADING = "degrading"      # Consistent degradation over 2+ iterations
    PLATEAU = "plateau"          # Less than 0.5% change for 2+ iterations
    FLUCTUATING = "fluctuating"  # Alternating improvement/degradation

Metric Direction Handling

The analyzer correctly handles metrics where different directions indicate improvement:

Lower is Better

  • RMSE, MSE, MAE, log_loss, error

Higher is Better

  • accuracy, f1, r2, precision, recall, AUC
Percentage changes are always calculated so that positive values indicate improvement, regardless of metric direction.
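The sign convention can be sketched as a small helper. The names below are illustrative, not the module's actual internals:

```python
# Metric names where a smaller value is better (per the lists above).
LOWER_IS_BETTER = {"rmse", "mse", "mae", "log_loss", "error"}

def change_pct(current: float, reference: float, metric_name: str) -> float:
    """Percentage change where positive always means improvement."""
    raw = (current - reference) / abs(reference) * 100
    # For lower-is-better metrics a decrease is an improvement,
    # so flip the sign to keep the "positive = better" convention.
    if metric_name.lower() in LOWER_IS_BETTER:
        return -raw
    return raw
```

For example, an RMSE drop from 1.0 to 0.9 reports as +10%, and an accuracy rise from 0.8 to 0.9 as +12.5%.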

System Prompt

The analyzer uses a comprehensive system prompt that guides Gemini to:
  • Compare results against baseline, best, and previous experiments
  • Identify meaningful improvements versus noise
  • Detect patterns across iterations
  • Consider both primary and secondary metrics
  • Note anomalies or unexpected results
  • Provide observations that inform next experiment design

Usage Examples

Basic Analysis

from src.cognitive.gemini_client import GeminiClient
from src.cognitive.results_analyzer import ResultsAnalyzer

# Initialize
client = GeminiClient()
analyzer = ResultsAnalyzer(gemini_client=client)

# Analyze latest result
analysis = analyzer.analyze(
    current_result=latest_result,
    state=experiment_state
)

print(f"Success: {analysis.success}")
print(f"Trend: {analysis.trend_pattern.value}")
print(f"\nKey Observations:")
for obs in analysis.key_observations:
    print(f"  - {obs}")

Detailed Metric Comparison

analysis = analyzer.analyze(current_result, state)

if analysis.primary_metric:
    mc = analysis.primary_metric
    print(f"Metric: {mc.metric_name}")
    print(f"Current: {mc.current_value:.6f}")
    # Optional fields may be None on early iterations, so guard before formatting
    if mc.baseline_value is not None:
        print(f"Baseline: {mc.baseline_value:.6f}")
    if mc.best_value is not None:
        print(f"Best: {mc.best_value:.6f}")
    print()
    if mc.change_from_baseline_pct is not None:
        print(f"Change from baseline: {mc.change_from_baseline_pct:+.2f}%")
    if mc.change_from_best_pct is not None:
        print(f"Change from best: {mc.change_from_best_pct:+.2f}%")
    if mc.change_from_previous_pct is not None:
        print(f"Change from previous: {mc.change_from_previous_pct:+.2f}%")
    print()
    print(f"Is improvement: {mc.is_improvement}")
    print(f"Is new best: {mc.is_new_best}")

Trend Detection

analysis = analyzer.analyze(current_result, state)

trend = analysis.trend_pattern

if trend == TrendPattern.IMPROVING:
    print("Performance is consistently improving!")
elif trend == TrendPattern.PLATEAU:
    print("Performance has plateaued - consider new approaches")
elif trend == TrendPattern.DEGRADING:
    print("Performance is degrading - revert to previous approach")
elif trend == TrendPattern.FLUCTUATING:
    print("Performance is unstable - need more consistent approach")
else:
    print("Too early to detect a trend")

Handling Failed Experiments

# Failed experiment
failed_result = ExperimentResult(
    experiment_name="test_xgboost",
    iteration=3,
    model_type="XGBRegressor",
    metrics={},
    success=False,
    error_message="Memory error during training"
)

analysis = analyzer.analyze(failed_result, state)

if not analysis.success:
    print(f"Experiment failed: {analysis.key_observations[0]}")
    print(f"Reasoning: {analysis.reasoning}")

Integration in Experiment Loop

from src.orchestration.state import ExperimentState

# Initialize components
client = GeminiClient()
analyzer = ResultsAnalyzer(gemini_client=client)
state = ExperimentState(config=config)

# Execute and analyze experiments
for iteration in range(1, 11):
    # Execute experiment (returns ExperimentResult)
    result = execute_experiment(spec)
    
    # Analyze results
    analysis = analyzer.analyze(
        current_result=result,
        state=state
    )
    
    # Log insights
    print(f"\n=== Iteration {iteration} Analysis ===")
    print(f"Trend: {analysis.trend_pattern.value}")
    
    if analysis.primary_metric and analysis.primary_metric.is_new_best:
        print("NEW BEST RESULT!")
    
    print("\nObservations:")
    for obs in analysis.key_observations:
        print(f"  - {obs}")
    
    # Update state
    state.add_experiment(result)
    
    # Use analysis for early stopping
    if analysis.trend_pattern == TrendPattern.PLATEAU:
        if state.iterations_without_improvement >= 3:
            print("Stopping: Plateau detected for 3 iterations")
            break

print(f"\nTotal analyses: {analyzer.get_analysis_count()}")

Custom Metric Comparison

# Access the internal comparison method for custom metrics
from src.cognitive.results_analyzer import ResultsAnalyzer

analyzer = ResultsAnalyzer(client)

# Compute comparison for a specific metric
comparison = analyzer._compute_metric_comparison(
    current_result=result,
    state=state,
    primary_metric="mae"  # Custom metric
)

if comparison:
    print(f"MAE comparison: {comparison.change_from_baseline_pct:+.2f}%")

Fallback Behavior

# When Gemini fails, analyzer provides basic template-based analysis
try:
    analysis = analyzer.analyze(current_result, state)
    
    # Check if fallback was used
    if "fallback" in analysis.reasoning.lower():
        print("Using fallback analysis (Gemini unavailable)")
        # Fallback still includes metric comparisons and trend detection
        print(f"Trend: {analysis.trend_pattern.value}")
        if analysis.primary_metric:
            print(f"Change: {analysis.primary_metric.change_from_baseline_pct:+.2f}%")
except Exception as e:
    print(f"Analysis failed: {e}")

Trend Detection Algorithm

The analyzer uses the last 3 successful experiments to detect trends:
  1. Extract recent metrics: Get primary metric values from last 3 successful experiments
  2. Calculate changes: Compute percentage changes between consecutive iterations
  3. Apply threshold: Use config.improvement_threshold to determine significance
  4. Classify pattern:
    • IMPROVING: All changes show improvement above threshold
    • DEGRADING: All changes show degradation beyond threshold
    • PLATEAU: All changes within threshold (near zero)
    • FLUCTUATING: Mixed improvements and degradations
    • INITIAL: Fewer than 3 experiments available
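The steps above can be sketched as a classifier over consecutive-iteration percentage changes. This is a hypothetical helper mirroring the described logic, not the module's code; `changes` follows the convention that positive values mean improvement:

```python
def classify_trend(changes: list[float], threshold_pct: float = 0.5) -> str:
    """Classify a trend from consecutive-iteration percentage changes.

    Fewer than 3 experiments yield fewer than 2 changes -> "initial".
    """
    if len(changes) < 2:
        return "initial"
    if all(c > threshold_pct for c in changes):
        return "improving"
    if all(c < -threshold_pct for c in changes):
        return "degrading"
    if all(abs(c) <= threshold_pct for c in changes):
        return "plateau"
    return "fluctuating"
```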
