ML Model Training, Hyperparameter Tuning, and AutoML

The Hedge Fund Backend treats ML models as first-class, versioned research artifacts. A model definition records which algorithm to use and with what parameters, and links back to the MLflow run that produced its weights. The Model Training Engine assembles feature datasets into a training matrix, runs time-series cross-validation to measure generalisation, fits a final model on the full window, and persists the serialised artifact to S3/MinIO. Every parameter and metric is logged to MLflow, so each experiment can be traced back to its exact configuration.

MLModel Entity

The MLModel table stores the identity, configuration, and training outcomes for every model in the system.

{
  "id": "m1000000-0000-0000-0000-000000000001",
  "name": "XGBoost v3 — RSI + Sentiment",
  "model_type": "xgboost",
  "family": "machine_learning",
  "parameters": {
    "max_depth": 6,
    "n_estimators": 300,
    "learning_rate": 0.05,
    "subsample": 0.8
  },
  "mlflow_run_id": "abc123def456",
  "artifact_uri": "s3://mlflow-artifacts/1/abc123def456/artifacts/model",
  "metrics": {
    "mean_mse": 0.0043,
    "std_mse": 0.0007,
    "mean_directional_accuracy": 0.561
  },
  "version": 2,
  "created_at": "2024-01-20T14:00:00Z",
  "updated_at": "2024-01-20T15:30:00Z"
}

Model Fields

Field	Type	Description
`model_type`	`string`	Algorithm identifier, e.g. `"xgboost"`, `"lstm"`, `"arima"`
`family`	`string`	Broad family — `statistical`, `machine_learning`, `deep_learning`, `ensemble`
`parameters`	`object`	Full hyperparameter dict passed to the plugin constructor
`mlflow_run_id`	`string | null`	MLflow run ID that produced this model’s artifact
`artifact_uri`	`string | null`	S3/MinIO URI to the serialised model weights
`metrics`	`object`	CV metrics from training (MSE, directional accuracy, etc.)

Model Families

statistical

Classical econometric models: ARIMA, GARCH, Kalman Filter. Best for univariate forecasting and volatility estimation.

machine_learning

Gradient-boosted trees and other tree ensembles: XGBoost, LightGBM, CatBoost, Random Forest. Typically strong out-of-the-box performance on tabular features.

deep_learning

Sequence models: LSTM, GRU, Transformer. Suited for learning long-range dependencies in high-frequency or multi-asset settings.

ensemble

Meta-learners that stack or blend predictions from multiple base models — stacking, voting, blending.

Dataset Assembly

Before training, the engine assembles a wide feature matrix X and a forward-return target y from the Feature Store. The DatasetSpec in the training request controls what gets assembled:

{
  "feature_ids": [
    "f1000000-0000-0000-0000-000000000001",
    "f1000000-0000-0000-0000-000000000002"
  ],
  "symbol": "AAPL",
  "timeframe": "1d",
  "start_date": "2020-01-01T00:00:00Z",
  "end_date": "2024-01-01T00:00:00Z",
  "target_horizon": 1
}

The engine fetches the latest FeatureDataset for each feature_id, joins them on timestamp into a wide X DataFrame, and derives y as the target_horizon-bar forward return. The join is strictly left-aligned, so no future information leaks into X; only y looks ahead.
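The assembly step can be sketched in pandas as follows. This is a minimal illustration, not the engine's code: the function name `assemble_dataset`, the in-memory feature frames, and the use of a close-price series as the source of the forward return are all assumptions made for the example.

```python
import pandas as pd


def assemble_dataset(feature_frames, close, target_horizon=1):
    """Join per-feature frames on timestamp and derive a forward-return target.

    feature_frames: list of DataFrames indexed by timestamp (hypothetical input;
                    the real engine loads these from the Feature Store).
    close:          Series of close prices indexed by timestamp (assumed price source).
    """
    # Left-join every feature frame onto the first one, aligned on the index.
    X = feature_frames[0]
    for frame in feature_frames[1:]:
        X = X.join(frame, how="left")

    # Forward return over `target_horizon` bars: close[t + h] / close[t] - 1.
    # shift(-h) pulls the future price back to row t; it is used only for y.
    y = (close.shift(-target_horizon) / close - 1.0).rename("y")

    # Align y to X, then drop rows with a missing feature or a missing target
    # (the last `target_horizon` rows never have a target).
    data = X.join(y, how="left").dropna()
    return data.drop(columns="y"), data["y"]


idx = pd.date_range("2024-01-01", periods=5, freq="D")
rsi = pd.DataFrame({"rsi_14": [55.0, 60.0, 58.0, 62.0, 65.0]}, index=idx)
sentiment = pd.DataFrame(
    {"sentiment_score": [0.1, float("nan"), 0.3, 0.2, 0.4]}, index=idx
)
close = pd.Series([100.0, 110.0, 121.0, 121.0, 133.1], index=idx)

X, y = assemble_dataset([rsi, sentiment], close, target_horizon=1)
```

In this toy input, one row is dropped for a missing feature value and the final row is dropped because it has no forward return, which is the same NaN handling the training flow describes.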

Time-Series Cross-Validation

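Standard k-fold CV ignores temporal order: observations are often shuffled, and even without shuffling most folds train on data that postdates the test fold. In financial data this creates severe look-ahead bias. The engine never shuffles, and every CV split trains only on bars that precede its test window.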

The CVConfig controls the splitting strategy:

{
  "method": "rolling",
  "n_splits": 5,
  "test_size": 0.15,
  "min_train_size": 0.2
}

Two window methods are supported:

Rolling Window

The training window slides forward with a fixed size. Each fold trains on the same number of bars, which prevents the model from over-representing early periods.

[===TRAIN===][TEST]
    [===TRAIN===][TEST]
        [===TRAIN===][TEST]

Expanding Window

The training window grows from a fixed anchor at t=0. Each fold uses all available history up to the test fold start — mimicking how a live strategy accumulates data.

[=TRAIN=][TEST]
[==TRAIN==][TEST]
[===TRAIN===][TEST]
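Both window methods can be sketched as a small index generator. The sketch below assumes that test_size and min_train_size are fractions of the total row count and that test blocks are laid out back to back, ending at the last row; the engine's actual layout is not documented here, and the function name `time_series_splits` is hypothetical.

```python
def time_series_splits(n_rows, method="rolling", n_splits=5,
                       test_size=0.15, min_train_size=0.2):
    """Yield (train_indices, test_indices) in time order, never shuffled.

    Assumption: test_size and min_train_size are fractions of n_rows.
    """
    test_len = round(n_rows * test_size)
    min_train = round(n_rows * min_train_size)
    # Rows left for training in the first fold once all test blocks are reserved.
    first_train_len = n_rows - n_splits * test_len
    if test_len < 1 or first_train_len < max(min_train, 1):
        raise ValueError("not enough rows for the requested splits")

    for k in range(n_splits):
        test_start = first_train_len + k * test_len
        test_end = test_start + test_len
        if method == "rolling":
            # Fixed-size window: the start slides forward with each fold.
            train_start = test_start - first_train_len
        elif method == "expanding":
            # Anchored at t=0: the window grows with each fold.
            train_start = 0
        else:
            raise ValueError(f"unknown method: {method}")
        yield range(train_start, test_start), range(test_start, test_end)


rolling_folds = list(time_series_splits(100, method="rolling"))
expanding_folds = list(time_series_splits(100, method="expanding"))
```

In every fold the last training index is strictly smaller than the first test index, which is the property that rules out look-ahead.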

Training Flow

The full training pipeline runs in five stages:

1. Assemble Dataset

The engine fetches FeatureDataset Parquet files from the Feature Store for each feature_id in DatasetSpec, joins them on timestamp, derives the forward-return y, and drops NaN rows. The resulting (X, y) pair is held in memory for CV and final fit.

2. Cross-Validation Evaluation

The engine runs time-series CV over the assembled (X, y) using the configured splits. On each fold it instantiates the plugin, calls plugin.train(X_train, y_train), then plugin.predict(X_test), and computes MSE and directional accuracy on the test fold. The per-fold results are aggregated (mean and standard deviation) and stored in cv_metrics.

3. Final Fit

After CV evaluation, the engine fits the model on the full date range (all of X, all of y) to produce the final weights that will be used for live inference.

4. Artifact Persistence

The fitted model is serialised (joblib for sklearn-compatible models, torch.save for LSTM/GRU) and uploaded to the configured S3/MinIO bucket. The resulting artifact_uri is written back to the MLModel row.

5. MLflow Logging

The engine starts an MLflow run, logs all parameters, cv_metrics, n_train_rows, and feature_columns as run metadata, and records the S3 URI as an artifact pointer. The MLflow run_id is stored in mlflow_run_id for traceability.
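Stage 2 above can be sketched in plain Python. The example is illustrative only: `MeanPlugin` is a hypothetical stand-in for a real plugin, directional accuracy is assumed to mean "predicted and realised return share the same sign", and the population standard deviation is an arbitrary choice because the engine's convention is not stated.

```python
from statistics import mean, pstdev


def mse(y_true, y_pred):
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def directional_accuracy(y_true, y_pred):
    # Share of test rows where predicted and realised returns have the same sign.
    return mean(1.0 if (t > 0) == (p > 0) else 0.0 for t, p in zip(y_true, y_pred))


class MeanPlugin:
    """Hypothetical stand-in plugin: predicts the training-set mean return."""

    def train(self, X, y):
        self.mu = mean(y)

    def predict(self, X):
        return [self.mu] * len(X)


def cross_validate(make_plugin, X, y, folds):
    """folds: iterable of (train_indices, test_indices) in time order."""
    fold_mse, fold_da = [], []
    for train_idx, test_idx in folds:
        plugin = make_plugin()  # fresh instance per fold, so no state carries over
        plugin.train([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = plugin.predict([X[i] for i in test_idx])
        actual = [y[i] for i in test_idx]
        fold_mse.append(mse(actual, preds))
        fold_da.append(directional_accuracy(actual, preds))
    return {
        "mean_mse": mean(fold_mse),
        "std_mse": pstdev(fold_mse),  # population std (assumed convention)
        "mean_directional_accuracy": mean(fold_da),
    }


X = [[float(i)] for i in range(8)]
y = [0.01, 0.02, -0.01, 0.03, 0.02, -0.02, 0.01, 0.04]
folds = [(range(0, 4), range(4, 6)), (range(2, 6), range(6, 8))]
cv_metrics = cross_validate(MeanPlugin, X, y, folds)
```

The returned dictionary has the same keys as the cv_metrics object in the training response.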

The training request and response shapes:

// POST /api/models/{id}/train — request
{
  "dataset": {
    "feature_ids": ["f1000000-0000-0000-0000-000000000001"],
    "symbol": "AAPL",
    "timeframe": "1d",
    "start_date": "2020-01-01T00:00:00Z",
    "end_date": "2024-01-01T00:00:00Z",
    "target_horizon": 1
  },
  "cv": {
    "method": "rolling",
    "n_splits": 5,
    "test_size": 0.15,
    "min_train_size": 0.2
  }
}

// Response
{
  "model_id": "m1000000-0000-0000-0000-000000000001",
  "artifact_uri": "s3://mlflow-artifacts/1/abc123/artifacts/model",
  "cv_metrics": {
    "mean_mse": 0.0043,
    "std_mse": 0.0007,
    "mean_directional_accuracy": 0.561
  },
  "n_train_rows": 1004,
  "feature_columns": ["rsi_14", "atr_14", "sentiment_score"]
}
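A client call to the training endpoint might be prepared like this, using only the standard library. The base URL is an assumption, the helper name `build_train_request` is hypothetical, and authentication headers are omitted because this page does not describe them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: adjust to your deployment


def build_train_request(model_id, dataset, cv):
    """Build (but do not send) the POST /api/models/{id}/train request."""
    body = json.dumps({"dataset": dataset, "cv": cv}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/models/{model_id}/train",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_train_request(
    "m1000000-0000-0000-0000-000000000001",
    dataset={
        "feature_ids": ["f1000000-0000-0000-0000-000000000001"],
        "symbol": "AAPL",
        "timeframe": "1d",
        "start_date": "2020-01-01T00:00:00Z",
        "end_date": "2024-01-01T00:00:00Z",
        "target_horizon": 1,
    },
    cv={"method": "rolling", "n_splits": 5, "test_size": 0.15, "min_train_size": 0.2},
)
```

Passing `request` to `urllib.request.urlopen` would send it; the JSON body of the reply should then have the response shape shown above.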

Hyperparameter Tuning

The platform integrates Optuna for automated hyperparameter search. You declare a param_space using typed spec objects, and the engine runs n_trials of Bayesian optimisation to minimise (or maximise) your chosen CV metric.

// POST /api/models/{id}/tune
{
  "dataset": { "...": "same as train" },
  "plugin_key": "ml.xgboost",
  "param_space": {
    "max_depth":     { "type": "int",   "low": 3, "high": 10 },
    "learning_rate": { "type": "float", "low": 0.01, "high": 0.3, "log": true },
    "subsample":     { "type": "float", "low": 0.5, "high": 1.0 },
    "n_estimators":  { "type": "categorical", "choices": [100, 200, 300, 500] }
  },
  "n_trials": 50,
  "metric": "mean_mse",
  "direction": "minimize",
  "cv": { "method": "rolling", "n_splits": 5 }
}

Param spec types map directly to Optuna suggest functions:

`type`	Optuna call	Required fields
`float`	`suggest_float`	`low`, `high`; optionally `log: true` for log-scale
`int`	`suggest_int`	`low`, `high`
`categorical`	`suggest_categorical`	`choices` array

The tuning response returns the best parameters and their score:

{
  "best_params": {
    "max_depth": 7,
    "learning_rate": 0.038,
    "subsample": 0.82,
    "n_estimators": 300
  },
  "best_score": 0.00391,
  "n_trials": 50
}
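The mapping from param specs to Optuna calls can be sketched as a small dispatcher. The helper names `suggest_from_spec` and `suggest_params` are hypothetical; the `suggest_float`, `suggest_int`, and `suggest_categorical` methods are the ones Optuna's `Trial` object provides. The functions only rely on those three methods, so any object that exposes them will work.

```python
def suggest_from_spec(trial, name, spec):
    """Translate one param_space entry into the matching Optuna suggest call."""
    kind = spec["type"]
    if kind == "float":
        # `log` is optional in the spec and defaults to a linear scale.
        return trial.suggest_float(
            name, spec["low"], spec["high"], log=spec.get("log", False)
        )
    if kind == "int":
        return trial.suggest_int(name, spec["low"], spec["high"])
    if kind == "categorical":
        return trial.suggest_categorical(name, spec["choices"])
    raise ValueError(f"unsupported param spec type: {kind!r}")


def suggest_params(trial, param_space):
    """Sample one full hyperparameter dict for a single trial."""
    return {
        name: suggest_from_spec(trial, name, spec)
        for name, spec in param_space.items()
    }
```

Inside an Optuna objective function, `suggest_params(trial, param_space)` would produce the parameters for one trial, and the objective would return the chosen CV metric for them. A study created with `optuna.create_study(direction="minimize")` and run with `study.optimize(objective, n_trials=50)` then drives the search.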

AutoML

When you want to compare multiple algorithm families without hand-selecting one, AutoML evaluates every candidate plugin in parallel and returns a ranked leaderboard:

// POST /api/models/automl
{
  "dataset": { "...": "same as train" },
  "candidates": {
    "ml.xgboost":       { "max_depth": 6, "n_estimators": 200 },
    "ml.lightgbm":      { "num_leaves": 63, "n_estimators": 200 },
    "ml.catboost":      { "depth": 6, "iterations": 200 },
    "ml.random_forest": { "n_estimators": 200, "max_depth": 8 },
    "ml.lstm":          { "hidden_size": 64, "num_layers": 2, "epochs": 30 }
  },
  "cv": { "method": "rolling", "n_splits": 5 },
  "metric": "mean_mse"
}

The response leaderboard is sorted ascending by score (lower MSE = better):

{
  "leaderboard": [
    {
      "plugin_key": "ml.lightgbm",
      "params": { "num_leaves": 63, "n_estimators": 200 },
      "score": 0.00381,
      "metrics": { "mean_mse": 0.00381, "std_mse": 0.00045, "mean_directional_accuracy": 0.568 }
    },
    {
      "plugin_key": "ml.xgboost",
      "params": { "max_depth": 6, "n_estimators": 200 },
      "score": 0.00412,
      "metrics": { "mean_mse": 0.00412, "std_mse": 0.00062, "mean_directional_accuracy": 0.554 }
    }
  ]
}
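The evaluate-then-rank step can be sketched with a thread pool. This is an assumption-laden illustration: the engine's actual parallelism mechanism is not described here, `run_automl` is a hypothetical name, and `evaluate` stands in for whatever runs cross-validation for one plugin.

```python
from concurrent.futures import ThreadPoolExecutor


def run_automl(candidates, evaluate, metric="mean_mse", lower_is_better=True):
    """Evaluate every candidate concurrently and rank the results.

    candidates: {plugin_key: params}, as in the request body.
    evaluate:   hypothetical callable (plugin_key, params) -> CV metrics dict.
    """
    keys = list(candidates)
    with ThreadPoolExecutor() as pool:
        # map() preserves input order, so results line up with `keys`.
        results = list(pool.map(lambda k: evaluate(k, candidates[k]), keys))

    leaderboard = [
        {"plugin_key": k, "params": candidates[k], "score": m[metric], "metrics": m}
        for k, m in zip(keys, results)
    ]
    # Ascending for error metrics such as MSE, descending otherwise.
    leaderboard.sort(key=lambda row: row["score"], reverse=not lower_is_better)
    return {"leaderboard": leaderboard}


def fake_evaluate(plugin_key, params):
    # Stand-in scores taken from the example response above.
    scores = {"ml.xgboost": 0.00412, "ml.lightgbm": 0.00381}
    return {"mean_mse": scores[plugin_key]}


result = run_automl(
    {
        "ml.xgboost": {"max_depth": 6, "n_estimators": 200},
        "ml.lightgbm": {"num_leaves": 63, "n_estimators": 200},
    },
    fake_evaluate,
)
```

For a metric where higher is better, such as mean_directional_accuracy, the same function would be called with `lower_is_better=False`.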

Available Model Plugins

Plugin Key	Algorithm	Family	Notes
`ml.xgboost`	XGBoost	machine_learning	Gradient-boosted trees; supports GPU training
`ml.lightgbm`	LightGBM	machine_learning	Typically faster to train than XGBoost on large datasets
`ml.catboost`	CatBoost	machine_learning	Native categorical handling; relatively robust to overfitting
`ml.random_forest`	Random Forest	machine_learning	Low variance, useful baseline
`ml.lstm`	LSTM	deep_learning	Long short-term memory for sequential patterns

All model plugins implement the BaseModel interface (app/plugins/base.py). To add a custom algorithm, subclass BaseModel, implement train() and predict(), assign a unique key, and register it in the model plugin registry.
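A custom plugin might look roughly like the sketch below. The `BaseModel` class shown here is a hypothetical stand-in written for the example, as are the `MODEL_REGISTRY` dict and the `register` decorator; the real interface in app/plugins/base.py and the real registry may differ in names and signatures.

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Hypothetical stand-in for app/plugins/base.py; the real interface may differ."""

    key: str

    def __init__(self, **parameters):
        # Hyperparameters arrive as keyword arguments, mirroring `parameters`.
        self.parameters = parameters

    @abstractmethod
    def train(self, X, y): ...

    @abstractmethod
    def predict(self, X): ...


MODEL_REGISTRY = {}  # hypothetical registry: plugin key -> class


def register(cls):
    """Class decorator that rejects duplicate keys before registering."""
    if cls.key in MODEL_REGISTRY:
        raise ValueError(f"duplicate plugin key: {cls.key}")
    MODEL_REGISTRY[cls.key] = cls
    return cls


@register
class ShrunkMeanModel(BaseModel):
    """Toy plugin: predicts the training mean, shrunk toward zero."""

    key = "custom.shrunk_mean"

    def train(self, X, y):
        shrink = self.parameters.get("shrink", 0.5)
        self.mu_ = shrink * sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mu_] * len(X)


model = MODEL_REGISTRY["custom.shrunk_mean"](shrink=0.5)
preds = model.train([[1.0], [2.0]], [0.02, 0.04]).predict([[3.0]])
```

Rejecting duplicate keys at registration time is one way to enforce the uniqueness requirement.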

API Reference

For the complete endpoint reference — including artifact download, model comparison, and MLflow run linking — see the Models API.
