The Hedge Fund Backend treats ML models as first-class, versioned research artifacts. A model definition records which algorithm to use, with what parameters, and links back to the MLflow run that produced its weights. The Model Training Engine assembles feature datasets into a training matrix, runs time-series cross-validation to measure generalisation, fits a final model on the full window, and persists the serialised artifact to S3/MinIO — all while logging every metric to MLflow so experiments are fully reproducible.

MLModel Entity

The MLModel table stores the identity, configuration, and training outcomes for every model in the system.
{
  "id": "m1000000-0000-0000-0000-000000000001",
  "name": "XGBoost v3 — RSI + Sentiment",
  "model_type": "xgboost",
  "family": "machine_learning",
  "parameters": {
    "max_depth": 6,
    "n_estimators": 300,
    "learning_rate": 0.05,
    "subsample": 0.8
  },
  "mlflow_run_id": "abc123def456",
  "artifact_uri": "s3://mlflow-artifacts/1/abc123def456/artifacts/model",
  "metrics": {
    "mean_mse": 0.0043,
    "std_mse": 0.0007,
    "mean_directional_accuracy": 0.561
  },
  "version": 2,
  "created_at": "2024-01-20T14:00:00Z",
  "updated_at": "2024-01-20T15:30:00Z"
}

Model Fields

| Field | Type | Description |
|---|---|---|
| model_type | string | Algorithm identifier, e.g. "xgboost", "lstm", "arima" |
| family | string | Broad family: statistical, machine_learning, deep_learning, or ensemble |
| parameters | object | Full hyperparameter dict passed to the plugin constructor |
| mlflow_run_id | string or null | MLflow run ID that produced this model’s artifact |
| artifact_uri | string or null | S3/MinIO URI to the serialised model weights |
| metrics | object | CV metrics from training (MSE, directional accuracy, etc.) |
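
As a rough illustration of how the fields above fit together, the record could be mirrored client-side as a dataclass. This is only a sketch: `MLModelRecord` and `is_trained` are hypothetical names, not the project's actual ORM class or API.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class MLModelRecord:
    """Hypothetical in-memory mirror of one MLModel row (not the real ORM class)."""

    id: str
    name: str
    model_type: str                      # e.g. "xgboost", "lstm", "arima"
    family: str                          # statistical, machine_learning, deep_learning, ensemble
    parameters: dict[str, Any] = field(default_factory=dict)
    mlflow_run_id: Optional[str] = None  # nullable per the field table
    artifact_uri: Optional[str] = None   # nullable per the field table
    metrics: dict[str, float] = field(default_factory=dict)
    version: int = 1

    @property
    def is_trained(self) -> bool:
        # Assumption: a model counts as trained once both links are populated.
        return self.mlflow_run_id is not None and self.artifact_uri is not None


record = MLModelRecord(
    id="m1000000-0000-0000-0000-000000000001",
    name="XGBoost v3",
    model_type="xgboost",
    family="machine_learning",
    parameters={"max_depth": 6, "n_estimators": 300},
)
```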

Model Families

statistical

Classical econometric models: ARIMA, GARCH, Kalman Filter. Best for univariate forecasting and volatility estimation.

machine_learning

Gradient-boosted trees and other tree ensembles: XGBoost, LightGBM, CatBoost, Random Forest. Typically strong out-of-the-box performance on tabular features.

deep_learning

Sequence models: LSTM, GRU, Transformer. Suited for learning long-range dependencies in high-frequency or multi-asset settings.

ensemble

Meta-learners that stack or blend predictions from multiple base models — stacking, voting, blending.

Dataset Assembly

Before training, the engine assembles a wide feature matrix X and a forward-return target y from the Feature Store. The DatasetSpec in the training request controls what gets assembled:
{
  "feature_ids": [
    "f1000000-0000-0000-0000-000000000001",
    "f1000000-0000-0000-0000-000000000002"
  ],
  "symbol": "AAPL",
  "timeframe": "1d",
  "start_date": "2020-01-01T00:00:00Z",
  "end_date": "2024-01-01T00:00:00Z",
  "target_horizon": 1
}
The engine fetches the latest FeatureDataset for each feature_id, joins them on timestamp into a wide X DataFrame, and derives y as the target_horizon-bar forward return. No future information leaks into X — the join is strictly left-aligned.
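The assembly step described above might look roughly like the following pandas sketch. It assumes each feature arrives as a DataFrame indexed by timestamp and that the forward return is computed from a closing-price series; `assemble_dataset`, the inner join, and the `close` input are illustrative assumptions, not the engine's actual code.

```python
import pandas as pd


def assemble_dataset(feature_frames, close, target_horizon=1):
    """Join per-feature frames on timestamp and derive a forward-return target.

    feature_frames: list of DataFrames indexed by timestamp (hypothetical inputs).
    close: Series of closing prices indexed by timestamp (assumed price source).
    """
    # Join all features on their timestamp index to form the wide matrix X.
    X = pd.concat(feature_frames, axis=1, join="inner").sort_index()

    # Forward return over `target_horizon` bars: close[t+h] / close[t] - 1.
    aligned_close = close.reindex(X.index)
    y = aligned_close.shift(-target_horizon) / aligned_close - 1.0

    # Drop rows with missing features or targets. The last `target_horizon`
    # rows have no future price, so they have no target and are removed.
    mask = X.notna().all(axis=1) & y.notna()
    return X[mask], y[mask].rename("forward_return")


idx = pd.date_range("2024-01-01", periods=5, freq="D")
rsi = pd.DataFrame({"rsi_14": [50.0, 55.0, 60.0, 58.0, 52.0]}, index=idx)
atr = pd.DataFrame({"atr_14": [1.0, 1.1, 1.2, 1.1, 1.0]}, index=idx)
close = pd.Series([100.0, 110.0, 121.0, 121.0, 133.1], index=idx)

X, y = assemble_dataset([rsi, atr], close, target_horizon=1)
```

Only the target uses `shift(-target_horizon)`, so the feature rows themselves never see later bars.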

Time-Series Cross-Validation

Standard k-fold CV shuffles or interleaves observations, so a model can be trained on data that comes after its test period. This destroys temporal order and creates severe look-ahead bias in financial data. The engine never shuffles. All CV splits respect time ordering.
The CVConfig controls the splitting strategy:
{
  "method": "rolling",
  "n_splits": 5,
  "test_size": 0.15,
  "min_train_size": 0.2
}
Two window methods are supported:

Rolling Window

The training window slides forward with a fixed size. Each fold trains on the same number of bars, so older data drops out as the window advances and early periods are not over-represented across folds.
[===TRAIN===][TEST]
    [===TRAIN===][TEST]
        [===TRAIN===][TEST]
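
A rolling split of this shape can be sketched in a few lines of plain Python. The sketch assumes test_size and min_train_size are fractions of the total row count and that test folds are consecutive and non-overlapping; the engine's exact arithmetic may differ, and `rolling_splits` is a hypothetical name.

```python
def rolling_splits(n_rows, n_splits=5, test_size=0.15, min_train_size=0.2):
    """Yield (train_range, test_range) pairs for a fixed-size rolling window.

    Assumption: test_size and min_train_size are fractions of n_rows, and the
    last test fold ends at the final row.
    """
    test_len = int(n_rows * test_size)
    train_len = n_rows - n_splits * test_len  # fixed training window length
    if test_len < 1 or train_len < int(n_rows * min_train_size):
        raise ValueError("not enough rows for the requested splits")

    for k in range(n_splits):
        train_start = k * test_len            # window slides by one test fold
        test_start = train_start + train_len  # test always follows training
        yield range(train_start, test_start), range(test_start, test_start + test_len)


folds = list(rolling_splits(100))
```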

Expanding Window

The training window grows from a fixed anchor at t=0. Each fold uses all available history up to the test fold start — mimicking how a live strategy accumulates data.
[=TRAIN=][TEST]
[==TRAIN==][TEST]
[===TRAIN===][TEST]
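
The expanding variant differs only in where each training window starts. The sketch below again treats test_size and min_train_size as fractions of the row count, which is an assumption about the engine, and `expanding_splits` is a hypothetical name.

```python
def expanding_splits(n_rows, n_splits=5, test_size=0.15, min_train_size=0.2):
    """Yield (train_range, test_range) pairs with training anchored at row 0.

    Assumption: test_size and min_train_size are fractions of n_rows, and the
    last test fold ends at the final row.
    """
    test_len = int(n_rows * test_size)
    first_train_len = n_rows - n_splits * test_len  # shortest training window
    if test_len < 1 or first_train_len < int(n_rows * min_train_size):
        raise ValueError("not enough rows for the requested splits")

    for k in range(n_splits):
        test_start = first_train_len + k * test_len
        # Training always starts at 0 and grows up to the test fold start.
        yield range(0, test_start), range(test_start, test_start + test_len)


folds = list(expanding_splits(100))
```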

Training Flow

The full training pipeline runs in five stages:

1. The engine fetches FeatureDataset Parquet files from the Feature Store for each feature_id in DatasetSpec, joins them on timestamp, derives the forward-return y, and drops NaN rows. The resulting (X, y) pair is held in memory for CV and final fit.
2. The engine runs time-series CV over the assembled (X, y) using the configured splits. On each fold it instantiates the plugin, calls plugin.train(X_train, y_train), then plugin.predict(X_test), and computes MSE and directional accuracy. Mean and std across folds are stored in cv_metrics.
3. After CV evaluation, the engine fits the model on the full date range (all of X, all of y) to produce the final weights that will be used for live inference.
4. The fitted model is serialised (joblib for sklearn-compatible models, torch.save for LSTM/GRU) and uploaded to the configured S3/MinIO bucket. The resulting artifact_uri is written back to the MLModel row.
5. The engine starts an MLflow run, logs all parameters, cv_metrics, n_train_rows, and feature_columns as run metadata, and records the S3 URI as an artifact pointer. The MLflow run_id is stored in mlflow_run_id for traceability.
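
The per-fold evaluation in the CV stage can be sketched with the standard library alone. `MeanReturnPlugin` is a toy stand-in for a real plugin, directional accuracy is taken to be the share of bars where predicted and realised returns share a sign, and the population standard deviation is an assumption; none of this is the engine's actual implementation.

```python
from statistics import mean, pstdev


class MeanReturnPlugin:
    """Toy stand-in for a model plugin: predicts the mean training return."""

    def train(self, X, y):
        self.mu = mean(y)

    def predict(self, X):
        return [self.mu for _ in X]


def mse(y_true, y_pred):
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def directional_accuracy(y_true, y_pred):
    # Share of bars where predicted and realised returns have the same sign.
    hits = [(t > 0) == (p > 0) for t, p in zip(y_true, y_pred)]
    return sum(hits) / len(hits)


def cross_validate(plugin_factory, X, y, splits):
    """Fit a fresh plugin per fold and aggregate the fold metrics."""
    fold_mse, fold_da = [], []
    for train_idx, test_idx in splits:
        plugin = plugin_factory()  # new instance so no state carries across folds
        plugin.train([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = plugin.predict([X[i] for i in test_idx])
        actual = [y[i] for i in test_idx]
        fold_mse.append(mse(actual, preds))
        fold_da.append(directional_accuracy(actual, preds))
    return {
        "mean_mse": mean(fold_mse),
        "std_mse": pstdev(fold_mse),  # assumption: population std across folds
        "mean_directional_accuracy": mean(fold_da),
    }


X = [[float(i)] for i in range(10)]
y = [0.01, -0.02, 0.03, 0.01, -0.01, 0.02, 0.02, -0.01, 0.01, 0.03]
splits = [(range(0, 4), range(4, 6)), (range(2, 6), range(6, 8)), (range(4, 8), range(8, 10))]
cv_metrics = cross_validate(MeanReturnPlugin, X, y, splits)
```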
The training request and response shapes:
// POST /api/models/{id}/train — request
{
  "dataset": {
    "feature_ids": ["f1000000-0000-0000-0000-000000000001"],
    "symbol": "AAPL",
    "timeframe": "1d",
    "start_date": "2020-01-01T00:00:00Z",
    "end_date": "2024-01-01T00:00:00Z",
    "target_horizon": 1
  },
  "cv": {
    "method": "rolling",
    "n_splits": 5,
    "test_size": 0.15,
    "min_train_size": 0.2
  }
}

// Response
{
  "model_id": "m1000000-0000-0000-0000-000000000001",
  "artifact_uri": "s3://mlflow-artifacts/1/abc123/artifacts/model",
  "cv_metrics": {
    "mean_mse": 0.0043,
    "std_mse": 0.0007,
    "mean_directional_accuracy": 0.561
  },
  "n_train_rows": 1004,
  "feature_columns": ["rsi_14", "atr_14", "sentiment_score"]
}
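
A client could build the training request with the standard library as sketched below. The base URL is a placeholder, authentication is omitted because this page does not describe it, and `build_train_request` is a hypothetical helper. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_train_request(base_url, model_id, dataset, cv):
    """Build (but do not send) a POST request for /api/models/{id}/train."""
    body = json.dumps({"dataset": dataset, "cv": cv}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/models/{model_id}/train",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_train_request(
    "http://localhost:8000",  # placeholder host, not taken from the docs
    "m1000000-0000-0000-0000-000000000001",
    dataset={
        "feature_ids": ["f1000000-0000-0000-0000-000000000001"],
        "symbol": "AAPL",
        "timeframe": "1d",
        "start_date": "2020-01-01T00:00:00Z",
        "end_date": "2024-01-01T00:00:00Z",
        "target_horizon": 1,
    },
    cv={"method": "rolling", "n_splits": 5, "test_size": 0.15, "min_train_size": 0.2},
)
# Sending it would be: urllib.request.urlopen(req), then json.load() on the response.
```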

Hyperparameter Tuning

The platform integrates Optuna for automated hyperparameter search. You declare a param_space using typed spec objects, and the engine runs n_trials of Bayesian optimisation to minimise (or maximise) your chosen CV metric.
// POST /api/models/{id}/tune
{
  "dataset": { "...": "same as train" },
  "plugin_key": "ml.xgboost",
  "param_space": {
    "max_depth":     { "type": "int",   "low": 3, "high": 10 },
    "learning_rate": { "type": "float", "low": 0.01, "high": 0.3, "log": true },
    "subsample":     { "type": "float", "low": 0.5, "high": 1.0 },
    "n_estimators":  { "type": "categorical", "choices": ["100", "200", "300", "500"] }
  },
  "n_trials": 50,
  "metric": "mean_mse",
  "direction": "minimize",
  "cv": { "method": "rolling", "n_splits": 5 }
}
Param spec types map directly to Optuna suggest functions:
| type | Optuna call | Required fields |
|---|---|---|
| float | suggest_float | low, high; optionally log: true for log scale |
| int | suggest_int | low, high |
| categorical | suggest_categorical | choices array |
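
The mapping in the table can be sketched as a small dispatch function. `suggest_param` and `suggest_all` are hypothetical helper names; the `trial` argument is expected to be an Optuna `Trial` (or any object exposing the same three suggest methods), and the engine's real translation layer may differ.

```python
def suggest_param(trial, name, spec):
    """Translate one param spec into the matching Optuna suggest call."""
    kind = spec["type"]
    if kind == "float":
        # log=True samples on a log scale, as with learning_rate above.
        return trial.suggest_float(name, spec["low"], spec["high"], log=spec.get("log", False))
    if kind == "int":
        return trial.suggest_int(name, spec["low"], spec["high"])
    if kind == "categorical":
        return trial.suggest_categorical(name, spec["choices"])
    raise ValueError(f"unknown param spec type: {kind!r}")


def suggest_all(trial, param_space):
    """Sample one value per entry in param_space for the given trial."""
    return {name: suggest_param(trial, name, spec) for name, spec in param_space.items()}
```

Inside an Optuna objective, `suggest_all(trial, param_space)` would produce the hyperparameter dict to pass to the plugin before running CV.
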
The tuning response returns the best parameters and their score:
{
  "best_params": {
    "max_depth": 7,
    "learning_rate": 0.038,
    "subsample": 0.82,
    "n_estimators": "300"
  },
  "best_score": 0.00391,
  "n_trials": 50
}

AutoML

When you want to compare multiple algorithm families without hand-selecting one, AutoML evaluates every candidate plugin in parallel and returns a ranked leaderboard:
// POST /api/models/automl
{
  "dataset": { "...": "same as train" },
  "candidates": {
    "ml.xgboost":       { "max_depth": 6, "n_estimators": 200 },
    "ml.lightgbm":      { "num_leaves": 63, "n_estimators": 200 },
    "ml.catboost":      { "depth": 6, "iterations": 200 },
    "ml.random_forest": { "n_estimators": 200, "max_depth": 8 },
    "ml.lstm":          { "hidden_size": 64, "num_layers": 2, "epochs": 30 }
  },
  "cv": { "method": "rolling", "n_splits": 5 },
  "metric": "mean_mse"
}
The response leaderboard is sorted ascending by score (lower MSE = better):
{
  "leaderboard": [
    {
      "plugin_key": "ml.lightgbm",
      "params": { "num_leaves": 63, "n_estimators": 200 },
      "score": 0.00381,
      "metrics": { "mean_mse": 0.00381, "std_mse": 0.00045, "mean_directional_accuracy": 0.568 }
    },
    {
      "plugin_key": "ml.xgboost",
      "params": { "max_depth": 6, "n_estimators": 200 },
      "score": 0.00412,
      "metrics": { "mean_mse": 0.00412, "std_mse": 0.00062, "mean_directional_accuracy": 0.554 }
    }
  ]
}
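
The evaluate-then-rank pattern can be sketched as follows. `run_automl` and the `evaluate` callable are hypothetical, and the use of a thread pool is an assumption; this page does not say how the engine parallelises candidates.

```python
from concurrent.futures import ThreadPoolExecutor


def run_automl(candidates, evaluate, metric="mean_mse"):
    """Score every candidate concurrently and rank ascending by `metric`.

    candidates: {plugin_key: params}. evaluate: hypothetical callable
    (plugin_key, params) -> metrics dict such as {"mean_mse": ...}.
    """
    items = list(candidates.items())
    with ThreadPoolExecutor() as pool:
        # map preserves input order, so results line up with `items`.
        all_metrics = list(pool.map(lambda kv: evaluate(*kv), items))

    leaderboard = [
        {"plugin_key": key, "params": params, "score": m[metric], "metrics": m}
        for (key, params), m in zip(items, all_metrics)
    ]
    # Lower is better for error metrics such as MSE.
    return sorted(leaderboard, key=lambda row: row["score"])
```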

Available Model Plugins

| Plugin Key | Algorithm | Family | Notes |
|---|---|---|---|
| ml.xgboost | XGBoost | machine_learning | Gradient-boosted trees; supports GPU training |
| ml.lightgbm | LightGBM | machine_learning | Faster training than XGBoost on large datasets |
| ml.catboost | CatBoost | machine_learning | Native categorical handling, robust to overfitting |
| ml.random_forest | Random Forest | machine_learning | Low variance, useful baseline |
| ml.lstm | LSTM | deep_learning | Long short-term memory for sequential patterns |
All model plugins implement the BaseModel interface (app/plugins/base.py). To add a custom algorithm, subclass BaseModel, implement train() and predict(), assign a unique key, and register it in the model plugin registry.
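
A custom plugin along these lines might look like the sketch below. The `BaseModel` class here is a stand-in written for illustration; the real interface in app/plugins/base.py may have a different constructor and extra methods. `MODEL_REGISTRY`, `register`, and `LastValueModel` are hypothetical names.

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Stand-in for the interface in app/plugins/base.py (real signature may differ)."""

    key: str = ""

    def __init__(self, **params):
        self.params = params

    @abstractmethod
    def train(self, X, y):
        ...

    @abstractmethod
    def predict(self, X):
        ...


MODEL_REGISTRY = {}  # hypothetical registry: plugin key -> plugin class


def register(cls):
    """Class decorator that adds a plugin to the registry under its unique key."""
    if not cls.key or cls.key in MODEL_REGISTRY:
        raise ValueError(f"plugin key missing or already registered: {cls.key!r}")
    MODEL_REGISTRY[cls.key] = cls
    return cls


@register
class LastValueModel(BaseModel):
    """Naive baseline: predicts the last observed training target for every row."""

    key = "custom.last_value"

    def train(self, X, y):
        self.last_ = list(y)[-1]
        return self

    def predict(self, X):
        return [self.last_ for _ in X]


model = MODEL_REGISTRY["custom.last_value"]()
model.train([[1.0], [2.0]], [0.1, 0.2])
predictions = model.predict([[3.0], [4.0]])
```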

API Reference

For the complete endpoint reference — including artifact download, model comparison, and MLflow run linking — see the Models API.
