The Hedge Fund Backend treats ML models as first-class, versioned research artifacts. A model definition records which algorithm to use, with what parameters, and links back to the MLflow run that produced its weights. The Model Training Engine assembles feature datasets into a training matrix, runs time-series cross-validation to measure generalisation, fits a final model on the full window, and persists the serialised artifact to S3/MinIO, all while logging every metric to MLflow so experiments are fully reproducible.
MLModel Entity
The MLModel table stores the identity, configuration, and training outcomes for every model in the system.
Model Fields
| Field | Type | Description |
|---|---|---|
| model_type | string | Algorithm identifier, e.g. "xgboost", "lstm", "arima" |
| family | string | Broad family: statistical, machine_learning, deep_learning, ensemble |
| parameters | object | Full hyperparameter dict passed to the plugin constructor |
| mlflow_run_id | string \| null | MLflow run ID that produced this model’s artifact |
| artifact_uri | string \| null | S3/MinIO URI to the serialised model weights |
| metrics | object | CV metrics from training (MSE, directional accuracy, etc.) |
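
To show how these fields fit together, the sketch below mirrors one row as a Python dataclass. The class name `MLModelRecord`, the `is_trained` helper, and the example values are hypothetical and only illustrate the table above; they are not the backend's actual table definition.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

# Families named on this page; the tuple itself is an illustrative constant.
MODEL_FAMILIES = ("statistical", "machine_learning", "deep_learning", "ensemble")


@dataclass
class MLModelRecord:
    """Hypothetical in-memory mirror of one MLModel row."""

    model_type: str                      # e.g. "xgboost", "lstm", "arima"
    family: str                          # one of MODEL_FAMILIES
    parameters: dict[str, Any] = field(default_factory=dict)
    mlflow_run_id: Optional[str] = None  # null until a training run is logged
    artifact_uri: Optional[str] = None   # null until the artifact is uploaded
    metrics: dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.family not in MODEL_FAMILIES:
            raise ValueError(f"unknown family: {self.family!r}")

    @property
    def is_trained(self) -> bool:
        # Treated as trained only when both traceability pointers are set.
        return self.mlflow_run_id is not None and self.artifact_uri is not None


# A definition that has been declared but not yet trained.
draft = MLModelRecord(
    model_type="xgboost",
    family="machine_learning",
    parameters={"max_depth": 4, "n_estimators": 200},
)
```

Both nullable fields start as `None`, matching the `string | null` types in the table: a definition can exist before any training run has produced an artifact.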
Model Families
statistical
Classical econometric models: ARIMA, GARCH, Kalman Filter. Best for univariate forecasting and volatility estimation.
machine_learning
Gradient-boosted trees and ensemble methods: XGBoost, LightGBM, CatBoost, Random Forest. Often the strongest out-of-the-box performers on tabular features.
deep_learning
Sequence models: LSTM, GRU, Transformer. Suited for learning long-range dependencies in high-frequency or multi-asset settings.
ensemble
Meta-learners that stack or blend predictions from multiple base models — stacking, voting, blending.
Dataset Assembly
Before training, the engine assembles a wide feature matrix X and a forward-return target y from the Feature Store.
The DatasetSpec in the training request controls what gets assembled.
The engine loads the FeatureDataset for each feature_id, joins them on timestamp into a wide X DataFrame, and derives y as the target_horizon-bar forward return. No future information leaks into X: the join is strictly left-aligned.
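
The assembly step can be sketched with pandas as follows. This is a minimal illustration, not the engine's code: the function name `assemble_dataset`, the use of in-memory Series instead of Feature Store reads, and the `prices` series used to compute the forward return are all assumptions.

```python
import pandas as pd


def assemble_dataset(
    features: dict[str, pd.Series],
    prices: pd.Series,
    target_horizon: int,
) -> tuple[pd.DataFrame, pd.Series]:
    """Join per-feature series into a wide X and derive a forward-return y."""
    # Wide matrix: one column per feature_id, aligned on the timestamp index.
    X = pd.concat(features, axis=1, join="inner").sort_index()

    # Forward return over `target_horizon` bars: p[t+h] / p[t] - 1.
    # shift(-h) moves the future price back to row t, so only y looks ahead.
    y = (prices.shift(-target_horizon) / prices - 1.0).rename("y")

    # Left-join y onto X, then drop rows with a missing feature or no target
    # (the last `target_horizon` bars have no future price yet).
    frame = X.join(y, how="left").dropna()
    return frame.drop(columns="y"), frame["y"]


# Toy data: six daily bars with prices growing 10% per bar.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
prices = pd.Series([100.0, 110.0, 121.0, 133.1, 146.41, 161.051], index=idx)
features = {
    "momentum": pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], index=idx),
    "volatility": pd.Series([1.0, 1.1, 1.2, 1.3, 1.4, 1.5], index=idx),
}
X, y = assemble_dataset(features, prices, target_horizon=2)
```

With a two-bar horizon the last two rows are dropped, because their forward return cannot be computed without prices that do not exist yet.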
Time-Series Cross-Validation
The CVConfig controls the splitting strategy:
Rolling Window
The training window slides forward with a fixed size. Each fold trains on the same number of bars, which prevents the model from over-representing early periods.
Expanding Window
The training window grows from a fixed anchor at t=0. Each fold uses all available history up to the test fold start, mimicking how a live strategy accumulates data.
Training Flow
The full training pipeline runs in five stages:
1. Assemble Dataset
The engine fetches FeatureDataset Parquet files from the Feature Store for each feature_id in DatasetSpec, joins them on timestamp, derives the forward-return y, and drops NaN rows. The resulting (X, y) pair is held in memory for CV and final fit.
2. Cross-Validation Evaluation
The engine runs time-series CV over the assembled (X, y) using the configured splits. On each fold it instantiates the plugin, calls plugin.train(X_train, y_train), then plugin.predict(X_test), and computes MSE and directional accuracy. Mean and std across folds are stored in cv_metrics.
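
The fold loop, together with the rolling and expanding window strategies, can be sketched with NumPy as below. Only the `train`/`predict` plugin interface comes from this page; the function names `time_series_splits` and `evaluate_cv`, their parameters, the metric key names, and the toy `MeanPlugin` are assumptions made for illustration. Plain arrays are used for brevity where the engine would hold DataFrames.

```python
import numpy as np


def time_series_splits(n_rows, n_folds, test_size, mode="expanding", train_size=None):
    """Yield (train_idx, test_idx) arrays; each test block follows its training block."""
    if mode not in ("rolling", "expanding"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "rolling" and train_size is None:
        raise ValueError("rolling mode needs an explicit train_size")
    for k in range(n_folds):
        # Test blocks tile the tail of the series, oldest fold first.
        test_start = n_rows - (n_folds - k) * test_size
        # Rolling: fixed-size window that slides forward. Expanding: anchored at t=0.
        train_start = test_start - train_size if mode == "rolling" else 0
        if train_start < 0 or test_start <= train_start:
            raise ValueError("not enough rows for the requested folds")
        yield (np.arange(train_start, test_start),
               np.arange(test_start, test_start + test_size))


def evaluate_cv(make_plugin, X, y, splits):
    """Train a fresh plugin per fold and aggregate MSE and directional accuracy."""
    mses, accs = [], []
    for train_idx, test_idx in splits:
        plugin = make_plugin()  # new instance so no state carries across folds
        plugin.train(X[train_idx], y[train_idx])
        pred = np.asarray(plugin.predict(X[test_idx]))
        mses.append(float(np.mean((pred - y[test_idx]) ** 2)))
        # Directional accuracy: share of predictions with the same sign as y.
        accs.append(float(np.mean(np.sign(pred) == np.sign(y[test_idx]))))
    return {
        "mse_mean": float(np.mean(mses)),
        "mse_std": float(np.std(mses)),
        "directional_accuracy_mean": float(np.mean(accs)),
        "directional_accuracy_std": float(np.std(accs)),
    }


class MeanPlugin:
    """Toy stand-in for a model plugin: predicts the training-set mean."""

    def train(self, X, y):
        self.mean_ = float(np.mean(y))

    def predict(self, X):
        return np.full(len(X), self.mean_)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

rolling = list(time_series_splits(100, n_folds=3, test_size=10,
                                  mode="rolling", train_size=50))
expanding = list(time_series_splits(100, n_folds=3, test_size=10))
cv_metrics = evaluate_cv(MeanPlugin, X, y, expanding)
```

In this example every rolling fold trains on 50 bars, while the expanding folds train on 70, 80, and 90 bars. In both modes the training indices end strictly before the test block begins, so no fold sees data from its own evaluation period.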
3. Final Fit
After CV evaluation, the engine fits the model on the full date range (all of X, all of y) to produce the final weights that will be used for live inference.
4. Artifact Persistence
The fitted model is serialised (joblib for sklearn-compatible models, torch.save for LSTM/GRU) and uploaded to the configured S3/MinIO bucket. The resulting artifact_uri is written back to the MLModel row.
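
The serialise-upload-record sequence might look like the sketch below. To stay runnable without extra dependencies it swaps in the standard-library `pickle` for joblib and `torch.save`, and an in-memory dict for the bucket. The function `persist_artifact`, the injected `upload` callable, and the key layout are hypothetical.

```python
import io
import pickle
from typing import Any, Callable


def persist_artifact(
    model: Any,
    bucket: str,
    key: str,
    upload: Callable[[str, str, bytes], None],
) -> str:
    """Serialise `model`, hand the bytes to `upload`, and return the artifact URI."""
    buffer = io.BytesIO()
    # pickle stands in for joblib.dump (sklearn-style models) or torch.save (LSTM/GRU).
    pickle.dump(model, buffer)
    upload(bucket, key, buffer.getvalue())
    return f"s3://{bucket}/{key}"


# In-memory stand-in for the bucket so the sketch runs without S3/MinIO.
fake_store: dict[tuple[str, str], bytes] = {}


def upload_to_memory(bucket: str, key: str, payload: bytes) -> None:
    fake_store[(bucket, key)] = payload


fitted_model = {"weights": [0.1, -0.2, 0.3]}  # placeholder for a fitted model
artifact_uri = persist_artifact(
    fitted_model, "models", "xgboost/v1/model.pkl", upload_to_memory
)
```

Against a real bucket, the `upload` callable could wrap a boto3 client call such as `client.put_object(Bucket=bucket, Key=key, Body=payload)`. The returned string is the value that would be written to artifact_uri.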
5. MLflow Logging
The engine starts an MLflow run, logs all parameters, cv_metrics, n_train_rows, and feature_columns as run metadata, and records the S3 URI as an artifact pointer. The MLflow run_id is stored in mlflow_run_id for traceability.
Hyperparameter Tuning
The platform integrates Optuna for automated hyperparameter search. You declare a param_space using typed spec objects, and the engine runs n_trials of Bayesian optimisation to minimise (or maximise) your chosen CV metric.
| type | Optuna call | Required fields |
|---|---|---|
| float | suggest_float | low, high; optionally log: true for log scale |
| int | suggest_int | low, high |
| categorical | suggest_categorical | choices array |
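
The mapping in this table can be sketched as a small dispatch function. The plain-dict spec format and the name `suggest_params` are assumptions; the `suggest_float`, `suggest_int`, and `suggest_categorical` calls are the Optuna trial methods named above. The block does not import Optuna itself, so `trial` is typed loosely and is expected to be an `optuna.trial.Trial` in real use.

```python
from typing import Any


def suggest_params(trial: Any, param_space: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Draw one hyperparameter set from `trial` according to `param_space`."""
    params: dict[str, Any] = {}
    for name, spec in param_space.items():
        kind = spec["type"]
        if kind == "float":
            # `log` is optional and defaults to a linear scale.
            params[name] = trial.suggest_float(
                name, spec["low"], spec["high"], log=spec.get("log", False)
            )
        elif kind == "int":
            params[name] = trial.suggest_int(name, spec["low"], spec["high"])
        elif kind == "categorical":
            params[name] = trial.suggest_categorical(name, spec["choices"])
        else:
            raise ValueError(f"unsupported spec type for {name!r}: {kind!r}")
    return params


# Example search space for a gradient-boosted model (values are illustrative).
param_space = {
    "learning_rate": {"type": "float", "low": 1e-3, "high": 0.3, "log": True},
    "max_depth": {"type": "int", "low": 2, "high": 8},
    "booster": {"type": "categorical", "choices": ["gbtree", "dart"]},
}
```

Inside an Optuna objective, `suggest_params(trial, param_space)` would supply the parameters for one trial. The objective would then run cross-validation and return the chosen metric, with the study driven by `optuna.create_study(direction="minimize").optimize(objective, n_trials=n_trials)`.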
AutoML
When you want to compare multiple algorithm families without hand-selecting one, AutoML evaluates every candidate plugin in parallel and returns a ranked leaderboard.
Available Model Plugins
| Plugin Key | Algorithm | Family | Notes |
|---|---|---|---|
| ml.xgboost | XGBoost | machine_learning | Gradient-boosted trees; supports GPU training |
| ml.lightgbm | LightGBM | machine_learning | Faster training than XGBoost on large datasets |
| ml.catboost | CatBoost | machine_learning | Native categorical handling, robust to overfitting |
| ml.random_forest | Random Forest | machine_learning | Low variance, useful baseline |
| ml.lstm | LSTM | deep_learning | Long short-term memory for sequential patterns |