
Alpha Leak’s models are trained on the system’s own historical signal data — the same signals, wallet features, and outcome labels that the live pipeline continuously produces. This closed-loop design means the models improve as more data accumulates, and ensures that every feature the model sees in production was computed by the same code that produced it during training.

Training data

Training data is assembled by joining three sources:
- Every buy signal ever emitted by the pipeline, with the full 68-feature snapshot captured at the moment the signal fired. Using point-in-time feature values — as they were at signal time, not current values — is essential for preventing lookahead bias.

Avoiding lookahead bias

The most common source of over-optimistic training results in production systems is lookahead bias: using information at training time that would not have been available at the moment the prediction needed to be made.
All features must be assembled using their point-in-time values — the values as they existed when the signal fired, not the values that exist now. Wallet stats are stored as snapshots alongside each signal for exactly this reason. Using current wallet stats to train on historical signals would leak future information into the model.
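The principle can be made concrete with a toy sketch. The record layout below is hypothetical (only wallet_graduation_rate appears in the real feature set documented here); the point is that training rows are built from the snapshot stored with each signal, never from the wallet's current stats:

```python
# Hypothetical records: each signal stores the wallet-stat snapshot
# taken at the moment it fired.
signals = [
    {"signal_id": 1, "fired_at": "2024-01-05", "snapshot": {"wallet_graduation_rate": 0.10}},
    {"signal_id": 2, "fired_at": "2024-03-20", "snapshot": {"wallet_graduation_rate": 0.35}},
]
current_wallet_stats = {"wallet_graduation_rate": 0.52}  # reflects trades AFTER both signals

# Correct: point-in-time values from the stored snapshot.
train_rows = [{"signal_id": s["signal_id"], **s["snapshot"]} for s in signals]

# Wrong (lookahead bias): the current value leaks post-signal information
# into every historical training row.
leaky_rows = [{"signal_id": s["signal_id"], **current_wallet_stats} for s in signals]

assert train_rows[0]["wallet_graduation_rate"] == 0.10
```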

Train / validation / calibration split

The dataset is split chronologically, not randomly.
1. Training set

The oldest 70% of signals. LightGBM is fitted on this partition.
2. Validation set

The next 10–15% of signals (chronologically). Used for early stopping — training halts when validation PR-AUC stops improving, preventing overfit without manual epoch tuning.
3. Calibration set

The most recent 10–15% of signals. Held out entirely from training and used only to fit the Platt scaling parameters after training is complete.
Random splitting would leak future wallet behaviour and token outcomes into the training set. A wallet’s current graduation rate encodes information about future tokens it hasn’t bought yet at the time of a historical signal. Chronological splitting prevents this entirely.
The validation and calibration sets represent the most recent market conditions — which is also the model’s most likely operating environment. This makes evaluation metrics on these sets more meaningful than metrics on a random holdout.
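The split itself is a few lines of code. A minimal sketch, assuming rows carry a fired_at timestamp and using illustrative fractions (the text gives 70% train and 10–15% each for validation and calibration):

```python
def chronological_split(rows, train_frac=0.70, val_frac=0.15):
    """Split time-ordered signal rows into train / validation / calibration.

    Sorts oldest-first so the calibration partition is always the most
    recent data, which is what prevents future leakage.
    """
    rows = sorted(rows, key=lambda r: r["fired_at"])
    n = len(rows)
    i = int(n * train_frac)            # end of training partition
    j = i + int(n * val_frac)          # end of validation partition
    return rows[:i], rows[i:j], rows[j:]

rows = [{"fired_at": t} for t in range(100)]
train, val, cal = chronological_split(rows)
```

Because the partitions are contiguous in time, every calibration row is strictly newer than every training row — the property a random split cannot guarantee.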

Algorithm

LightGBM gradient boosting is used for all models, for four reasons:
- Native missing-value handling. Many signals are missing some features (for example, a signal from a wallet with no trading history yet), and LightGBM's native missing-value handling outperforms imputation on tabular data with structural missingness.
- Interpretable feature importance. LightGBM provides both gain-based and split-count importance rankings. These are used to audit the model and identify which features it actually relies on, surfacing unexpected dependencies before deployment.
- Clean ONNX export. LightGBM models compile cleanly to ONNX format, which is required for in-process inference in the Node.js pipeline without a Python runtime dependency.
- Fast training. Models train in minutes on a few hundred thousand signals, making the full retrain-evaluate-deploy loop fast enough to respond to market regime shifts within hours.

Hyperparameter optimisation

Key LightGBM parameters (num_leaves, learning_rate, min_child_samples, feature_fraction, bagging_fraction, and the regularisation coefficients) are tuned using Optuna-based Bayesian search. The best parameters from each search run are recorded alongside the model's metadata and used as the starting point for future searches. Within every trial, early stopping against validation PR-AUC halts training once the metric stops improving, so the tree count never needs manual tuning.

Handling class imbalance

All targets are heavily imbalanced — relatively few tokens actually reach 3× in 30 minutes. LightGBM’s scale_pos_weight parameter is used to compensate for this imbalance during training. PR-AUC (Precision-Recall Area Under Curve) is the primary evaluation metric, not ROC-AUC.
A model that predicts “no” for every signal achieves 95%+ accuracy on a dataset where only 5% of signals are positive, yet it is useless for ranking and scores near zero on PR-AUC. ROC-AUC is similarly misleading when the positive class is rare, because it is dominated by performance on the abundant negatives. PR-AUC directly measures the precision–recall trade-off across the model's ranking of signals, which is exactly what matters for signal selection.
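Both points can be demonstrated on synthetic labels with scikit-learn's metrics (the 5% positive rate is illustrative): an uninformative scorer collapses to roughly the positive rate in PR-AUC while sitting at chance level in ROC-AUC, and scale_pos_weight is simply the negative-to-positive count ratio.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.05).astype(int)   # ~5% positives

# scale_pos_weight compensates for the imbalance during training.
scale_pos_weight = (y == 0).sum() / (y == 1).sum()   # ~19 here

# Uninformative scorer: PR-AUC ~= positive rate (~0.05), ROC-AUC ~= 0.5.
random_scores = rng.random(10_000)
pr_random = average_precision_score(y, random_scores)
roc_random = roc_auc_score(y, random_scores)

# A scorer with real signal separates far more visibly in PR-AUC.
informative = y * 0.3 + rng.random(10_000) * 0.5
pr_good = average_precision_score(y, informative)
```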

Platt calibration

Raw LightGBM outputs are probability-like but often miscalibrated — the model may output 0.80 for signals that actually hit the target only 55% of the time. This miscalibration makes raw thresholds in strategy configs unreliable. Every model is calibrated post-training using Platt scaling: a sigmoid function σ(a·x + b) fitted on the held-out calibration set. The parameters a and b are found by minimising log loss on the calibration set’s true labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

# raw_scores: model outputs on calibration set (1-D numpy array)
# y_cal: true labels on calibration set
# A large C approximates the unregularised sigmoid fit of classic Platt
# scaling; sklearn's default (C=1.0) would shrink the fitted slope.
cal_model = LogisticRegression(C=1e6)
cal_model.fit(raw_scores.reshape(-1, 1), y_cal)

platt_a = cal_model.coef_[0][0]
platt_b = cal_model.intercept_[0]

# At inference: calibrated = 1 / (1 + np.exp(-(platt_a * raw + platt_b)))
After calibration, the model’s output is interpretable as an approximate hit rate. A calibrated score of 0.85 means roughly 85% of signals at that score level actually hit the target — which is what makes the strategy threshold values meaningful.

ONNX export and validation

After training and calibration, each model goes through a three-step export process before it is eligible for deployment.
1. Export to ONNX

The trained LightGBM model is exported using skl2onnx or onnxmltools' LightGBM converter. The exported file is saved as <target>_v<version>.onnx.
2. Write the metadata sidecar

A _metadata.json file is written alongside the model, containing the model ID, ordered feature list, calibration parameters (platt_a, platt_b), and the validation PR-AUC score. This file is what the inference code reads to assemble feature vectors correctly.
{
  "model_id": "reach_2x_1h_v3",
  "model_type": "classification",
  "target": "reach_2x_1h",
  "version": 3,
  "feature_names": ["alpha_score", "wallet_graduation_rate", ...],
  "feature_count": 68,
  "calibration": {
    "method": "platt",
    "platt_a": 1.42,
    "platt_b": -0.31
  },
  "pr_auc": 0.34
}
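A sketch of how a consumer of this sidecar might assemble a correctly ordered feature vector (the helper is hypothetical; the real consumer is the Node.js inference code, and the JSON here is trimmed to two features for brevity):

```python
import json
import math

metadata_json = """
{
  "model_id": "reach_2x_1h_v3",
  "feature_names": ["alpha_score", "wallet_graduation_rate"],
  "calibration": {"method": "platt", "platt_a": 1.42, "platt_b": -0.31}
}
"""
meta = json.loads(metadata_json)

def assemble_vector(features: dict, meta: dict) -> list:
    # Order strictly by the sidecar's feature_names; emit NaN for anything
    # missing, which the LightGBM/ONNX model handles natively.
    return [features.get(name, math.nan) for name in meta["feature_names"]]

vec = assemble_vector({"wallet_graduation_rate": 0.35}, meta)
```

Ordering by the sidecar rather than by any hard-coded list is what keeps training-time and inference-time feature vectors aligned.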
3. Validate outputs

A sample of the training data is run through both the original LightGBM model and the ONNX model. The outputs must match to within floating-point tolerance. This step catches any feature ordering mismatch before the model reaches production.
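Stripped of the export machinery, the parity check reduces to an element-wise comparison within floating-point tolerance. A sketch (the tolerance value is illustrative):

```python
import numpy as np

def validate_parity(lgbm_scores: np.ndarray, onnx_scores: np.ndarray,
                    atol: float = 1e-5) -> None:
    """Fail the export if any score diverges beyond tolerance.

    A systematic divergence here almost always means the ONNX session was
    fed features in a different order than the one used in training.
    """
    if not np.allclose(lgbm_scores, onnx_scores, atol=atol):
        worst = np.max(np.abs(lgbm_scores - onnx_scores))
        raise ValueError(f"ONNX parity check failed: max divergence {worst:.2e}")

# Scores that agree within tolerance pass silently.
validate_parity(np.array([0.12, 0.87]), np.array([0.12000001, 0.8699999]))
```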

Deploying a new model

Deploying a trained model to the live system requires no code changes and no service restart.
1. Copy the file pair

Place both the .onnx file and its _metadata.json sidecar into the src/ml/models/ directory on the production host.
2. Wait for the scan cycle

MlInference scans the models directory every 5 minutes. When it finds a .onnx file that is not already loaded, it creates a new ONNX inference session for it and adds it to the active model set.
3. Verify the score column

After the scan cycle completes, new signals should start receiving scores from the updated model. Check the relevant score column (ml_score_1h, ml_score_30m, etc.) in the signals table to confirm scores are being written.
When deploying a new version of an existing model target (e.g. reach_2x_1h_v4 replacing reach_2x_1h_v3), remove or rename the old file after the new one is confirmed to be loading. Both will run concurrently if both files are present, and the newer score will overwrite the older one since they write to the same column.

Retraining cadence

There is no fixed retraining schedule. The ModelMonitor service tracks the live performance of each model against observed signal outcomes and detects drift between calibrated probabilities and actual hit rates. When drift is detected — typically caused by a shift in market dynamics, Pump.fun platform changes, or wallet behaviour patterns — a new training run is initiated. In practice this means retraining every few weeks under normal conditions, or sooner when a significant platform event occurs such as fee structure changes or graduation threshold adjustments.
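As an illustration of that drift check (the bucketing scheme, sample threshold, and gap threshold are all hypothetical; ModelMonitor's actual logic is not specified here): compare each score bucket's mean calibrated probability against its observed hit rate and flag when they diverge.

```python
import numpy as np

def calibration_drift(scores: np.ndarray, outcomes: np.ndarray,
                      n_buckets: int = 5, max_gap: float = 0.15) -> bool:
    """Return True when any well-populated score bucket's observed hit rate
    diverges from its mean calibrated score by more than max_gap."""
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (scores >= lo) & (scores < hi)
        if mask.sum() < 50:          # skip thinly populated buckets
            continue
        if abs(scores[mask].mean() - outcomes[mask].mean()) > max_gap:
            return True
    return False

# Synthetic check: outcomes drawn at the calibrated rate vs. at half of it.
rng = np.random.default_rng(1)
scores = rng.random(5000)
healthy = (rng.random(5000) < scores).astype(int)        # hit rate matches score
drifted = (rng.random(5000) < scores * 0.5).astype(int)  # hit rate half of score
```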

Genesis model training

The genesis models follow the same methodology — chronological split, LightGBM, Platt calibration, ONNX export — but use a separate 75-feature dataset. Features are assembled from the first-60-second observation windows stored by GenesisWatcher, and the training labels are the same outcome targets applied to tokens rather than to specific wallet signals.

Model architecture

ONNX deployment, hot reloading, inference pipeline, and composite scoring.

Feature reference

Complete documentation for all 68 features in the standard model vector.
