Architectural Overview
Layer 1: Orchestration Layer
The orchestration layer is the control center of the system, managing the experiment lifecycle from start to finish.
ExperimentController
Location: src/orchestration/controller.py:46
The ExperimentController class implements the main experiment loop as a state machine with phases:
- INITIALIZING → DATA_PROFILING: Analyze dataset characteristics
- DATA_PROFILING → BASELINE_MODELING: Run simple baseline model
- BASELINE_MODELING → EXPERIMENT_DESIGN: Enter main iteration loop
- EXPERIMENT_DESIGN → CODE_GENERATION → EXPERIMENT_EXECUTION → RESULTS_ANALYSIS → HYPOTHESIS_GENERATION → back to EXPERIMENT_DESIGN (repeat)
- COMPLETED or FAILED: Finalize with report and visualizations
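The phase transitions above can be sketched as a small table-driven state machine. This is an illustrative sketch only; the names mirror the phases listed here, but the actual ExperimentController implementation may differ:

```python
from enum import Enum, auto

class Phase(Enum):
    INITIALIZING = auto()
    DATA_PROFILING = auto()
    BASELINE_MODELING = auto()
    EXPERIMENT_DESIGN = auto()
    CODE_GENERATION = auto()
    EXPERIMENT_EXECUTION = auto()
    RESULTS_ANALYSIS = auto()
    HYPOTHESIS_GENERATION = auto()
    COMPLETED = auto()
    FAILED = auto()

# Linear transitions; HYPOTHESIS_GENERATION loops back to EXPERIMENT_DESIGN
# until a termination criterion moves the machine to COMPLETED or FAILED.
TRANSITIONS = {
    Phase.INITIALIZING: Phase.DATA_PROFILING,
    Phase.DATA_PROFILING: Phase.BASELINE_MODELING,
    Phase.BASELINE_MODELING: Phase.EXPERIMENT_DESIGN,
    Phase.EXPERIMENT_DESIGN: Phase.CODE_GENERATION,
    Phase.CODE_GENERATION: Phase.EXPERIMENT_EXECUTION,
    Phase.EXPERIMENT_EXECUTION: Phase.RESULTS_ANALYSIS,
    Phase.RESULTS_ANALYSIS: Phase.HYPOTHESIS_GENERATION,
    Phase.HYPOTHESIS_GENERATION: Phase.EXPERIMENT_DESIGN,
}

def next_phase(current: Phase, should_stop: bool = False) -> Phase:
    """Advance one step, exiting the loop when a termination criterion fires."""
    if should_stop and current == Phase.HYPOTHESIS_GENERATION:
        return Phase.COMPLETED
    return TRANSITIONS[current]
```

Because every transition is data rather than control flow, the current phase can be serialized with the rest of the state, which is what makes interrupted sessions resumable.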
The state machine design allows for resumable sessions. You can interrupt an experiment with Ctrl+C and resume from the saved state file using the --resume flag.
State Management with Pydantic
Location: src/orchestration/state.py
All experiment state is represented using Pydantic models for type safety and JSON serialization:
- ExperimentState: Complete session state (src/orchestration/state.py:194)
- ExperimentConfig: User configuration and constraints (src/orchestration/state.py:95)
- ExperimentResult: Results from a single experiment (src/orchestration/state.py:70)
- DataProfile: Dataset statistics and schema (src/orchestration/state.py:111)
- AnalysisResult: Results analysis with trend detection (src/orchestration/state.py:152)
- HypothesisSet: Generated hypotheses for next iteration (src/orchestration/state.py:177)
The full ExperimentState is serialized to outputs/state_<session_id>.json after each phase.
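A minimal sketch of what this looks like in practice, assuming Pydantic v2; the field names here are illustrative, not the real schema from src/orchestration/state.py:

```python
# Illustrative Pydantic models (field names assumed, not the real schema)
from pydantic import BaseModel

class ExperimentResult(BaseModel):
    iteration: int
    model_type: str
    metrics: dict[str, float]

class ExperimentState(BaseModel):
    session_id: str
    phase: str
    results: list[ExperimentResult] = []

state = ExperimentState(session_id="demo", phase="EXPERIMENT_DESIGN")
state.results.append(
    ExperimentResult(iteration=1, model_type="ridge", metrics={"rmse": 0.42})
)

# Round-trip through JSON, as the controller does after each phase
blob = state.model_dump_json()
restored = ExperimentState.model_validate_json(blob)
```

The round-trip is what enables crash recovery: validation on load means a corrupted or hand-edited state file fails loudly instead of resuming with bad data.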
Termination Criteria
The controller checks multiple termination conditions after each iteration (src/orchestration/state.py:251):
| Criterion | Default | Configurable Via |
|---|---|---|
| Max iterations | 20 | --max-iterations |
| Time budget | 3600 seconds | --time-budget |
| Performance plateau | 3 iterations without improvement | Constraints file |
| Target metric achieved | None | Constraints file |
| Agent recommendation | Enabled | Automatic |
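The criteria in the table can be sketched as a single guard function. This is a hypothetical mirror of the checks at src/orchestration/state.py:251, assuming a higher-is-better metric history; the real signature will differ:

```python
import time

def should_terminate(iteration, start_time, history, *,
                     max_iterations=20, time_budget=3600,
                     plateau_window=3, target_metric=None) -> bool:
    """Illustrative termination guard; `history` holds one metric per
    iteration, where higher is better (an assumption for this sketch)."""
    if iteration >= max_iterations:
        return True
    if time.monotonic() - start_time >= time_budget:
        return True
    if target_metric is not None and history and max(history) >= target_metric:
        return True
    # Plateau: no improvement over the last `plateau_window` iterations
    if len(history) > plateau_window:
        best_before = max(history[:-plateau_window])
        if max(history[-plateau_window:]) <= best_before:
            return True
    return False
```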
Layer 2: Cognitive Core (Gemini 3)
The cognitive layer contains four specialized AI agents that make all high-level decisions using Gemini 3. All four components share a single GeminiClient instance to maintain Thought Signatures.
Four Cognitive Components
ExperimentDesigner
Location: src/cognitive/experiment_designer.py:83
Designs the next experiment based on:
- Data profile (schema, distributions, missing values)
- All previous experiment results
- User constraints from Markdown file
- Top hypothesis from previous iteration
Returns an ExperimentSpec with model type, hyperparameters, preprocessing config, and reasoning.
ResultsAnalyzer
Location: src/cognitive/results_analyzer.py:56
Analyzes experiment outcomes by:
- Comparing metrics against baseline, best, and previous
- Detecting trends (improving, degrading, plateau, fluctuating)
- Generating key observations
Returns an AnalysisResult with metric comparisons, trend pattern, and observations.
HypothesisGenerator
Location: src/cognitive/hypothesis_generator.py:52
Generates 2-3 ranked hypotheses by:
- Synthesizing analysis results
- Balancing exploration vs exploitation
- Providing confidence scores (0-1)
- Suggesting concrete models and parameters
Returns a HypothesisSet with prioritized hypotheses.
ReportGenerator
Location: src/cognitive/report_generator.py:64
Creates publication-ready reports with:
- Executive summary
- Methodology description
- Results table and best model details
- Key insights and recommendations
Shared GeminiClient
Location: src/cognitive/gemini_client.py:52
All four components use the same GeminiClient instance initialized in the controller (src/orchestration/controller.py:109):
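The wiring is roughly this shape. This is an illustrative sketch only; the class names mirror the modules above, but the constructor signatures are assumptions:

```python
# Illustrative sketch: one client instance threaded through the agents so
# they share a single conversation history (constructor args are assumed).
class GeminiClient:
    def __init__(self) -> None:
        self.conversation_history: list[dict] = []

class ExperimentDesigner:
    def __init__(self, client: GeminiClient) -> None:
        self.client = client

class ResultsAnalyzer:
    def __init__(self, client: GeminiClient) -> None:
        self.client = client

client = GeminiClient()
designer = ExperimentDesigner(client)
analyzer = ResultsAnalyzer(client)
# Both agents append to the same history, so Thought Signatures from one
# agent's call are visible to the next agent's call.
```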
The GeminiClient maintains a conversation_history list that grows across all iterations. When you run with --verbose, you’ll see the context size displayed as “Context: X turns”.
Layer 3: Execution Layer
The execution layer handles code generation and experiment execution without calling Gemini.
DataProfiler
Location: src/execution/data_profiler.py
Analyzes the dataset using pandas to extract:
- Schema (column names, types)
- Statistics (mean, std, min, max, quartiles)
- Missing value counts and percentages
- Target distribution (for regression: mean/std/skew, for classification: class counts)
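A condensed sketch of the profiling step, assuming a pandas DataFrame input; the function name and output keys are illustrative, not the real src/execution/data_profiler.py interface:

```python
# Sketch of schema/statistics profiling (output keys are assumptions)
import pandas as pd

def profile_dataset(df: pd.DataFrame, target: str) -> dict:
    numeric = df.select_dtypes("number")
    return {
        # Column names mapped to dtype strings
        "schema": {col: str(dtype) for col, dtype in df.dtypes.items()},
        # Percentage of missing values per column
        "missing_pct": (df.isna().mean() * 100).round(2).to_dict(),
        # Basic numeric statistics per numeric column
        "stats": numeric.describe().loc[["mean", "std", "min", "max"]].to_dict(),
        # Skew of the target, if it is numeric (regression case)
        "target_skew": float(df[target].skew()) if target in numeric else None,
    }
```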
CodeGenerator
Location: src/execution/code_generator.py
Generates executable Python scripts from ExperimentSpec using Jinja2 templates:
Generated scripts are saved to outputs/experiments/<session_id>/ and validated with ast.parse() before execution.
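The generate-then-validate step can be sketched dependency-free. The real generator uses Jinja2 templates; string.Template stands in here, and the template body is invented for illustration:

```python
# Dependency-free sketch: render a script, then validate it with ast.parse()
# before it is ever written to disk or executed. (Template body is invented.)
import ast
from string import Template

SCRIPT = Template("""\
import json

def main():
    metrics = {"rmse": $rmse}
    print(json.dumps(metrics))

main()
""")

def generate_script(rmse) -> str:
    code = SCRIPT.substitute(rmse=rmse)
    ast.parse(code)  # raises SyntaxError on malformed output
    return code
```

Validating with ast.parse() catches template bugs immediately, rather than surfacing them later as a cryptic subprocess failure.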
ExperimentRunner
Location: src/execution/experiment_runner.py
Executes generated Python scripts as subprocesses with:
- Timeout protection (default: 600 seconds)
- stdout/stderr capture
- JSON output parsing (metrics, model path, errors)
- Graceful failure handling (failed experiments don’t stop the loop)
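The guarded execution pattern above can be sketched as follows; the return-dict shape and the convention that metrics arrive as the last JSON line on stdout are assumptions for this sketch:

```python
# Sketch of subprocess execution with timeout and JSON output parsing
import json
import subprocess
import sys

def run_experiment(script_path: str, timeout: int = 600) -> dict:
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"status": "failed", "error": "timeout"}
    if proc.returncode != 0:
        # Failed experiments report an error instead of raising,
        # so the main loop keeps going.
        return {"status": "failed", "error": proc.stderr.strip()}
    # Assumed convention: the script prints its metrics as a JSON line
    return {"status": "ok", "metrics": json.loads(proc.stdout.splitlines()[-1])}
```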
VisualizationGenerator
Location: src/execution/visualization_generator.py
Generates matplotlib charts:
- Metric progression: Line chart showing primary metric across iterations
- Model comparison: Horizontal bar chart comparing all models
- Improvement over baseline: Bar chart showing % improvement
Charts are saved to outputs/plots/.
Layer 4: Persistence Layer
The persistence layer handles logging and artifact storage.
MLflow Tracking
Location: src/persistence/mlflow_tracker.py
All experiments are logged to MLflow with:
- Metrics: RMSE, MAE, R2 (regression) or Accuracy, F1, AUC (classification)
- Parameters: Model hyperparameters, preprocessing config
- Tags: Iteration number, hypothesis, model type
- Artifacts: Generated code, saved models, visualizations
JSON State Files
Experiment state is saved to outputs/state_<session_id>.json after each phase for:
- Crash recovery: Resume from last saved state
- Debugging: Inspect exact state at any point
- Reproducibility: Full record of all decisions and results
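Crash recovery from these snapshots amounts to reading back the last file. A minimal sketch, assuming the file layout described above (the function name is illustrative):

```python
# Sketch of resuming from the last saved snapshot
import json
from pathlib import Path

def resume_session(outputs_dir: str, session_id: str) -> dict:
    state_file = Path(outputs_dir) / f"state_{session_id}.json"
    state = json.loads(state_file.read_text())
    # The controller would rehydrate this dict into an ExperimentState
    # and re-enter the loop at the recorded phase.
    return state
```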
Artifact Storage
All generated outputs are organized under outputs/:
Data Flow
Here’s how data flows through the system during a single iteration:
- ExperimentController calls ExperimentDesigner with data profile + history
- ExperimentDesigner calls Gemini → returns ExperimentSpec (JSON)
- CodeGenerator converts ExperimentSpec → Python script (Jinja2)
- ExperimentRunner executes script → captures ExperimentResult (JSON output)
- ExperimentController updates state, logs to MLflow
- ResultsAnalyzer calls Gemini with results → returns AnalysisResult
- HypothesisGenerator calls Gemini with analysis → returns HypothesisSet
- Loop repeats with top hypothesis fed back into ExperimentDesigner
Why This Architecture?
This layered design provides:
- Separation of concerns: Cognitive decisions (Gemini) vs execution (templates + subprocess)
- Testability: Each component has 10-24 unit tests (160 total)
- Resumability: Pydantic state snapshots enable crash recovery
- Observability: MLflow + rich console output + JSON state files
- Safety: Subprocess isolation, timeout protection, graceful failure handling
The architecture is designed for The Marathon Agent track — Gemini runs autonomously for hours, making 100+ API calls while maintaining reasoning continuity through Thought Signatures.