

Alpha Leak’s models are trained on the system’s own historical signal data — the same signals, wallet features, and outcome labels that the live pipeline produces. This closed-loop design means the models improve continuously as data accumulates, and it ensures that every feature the model sees in production is computed by the same code that computed it during training.

Training data

Training data is built from three sources joined together at signal time (a feature-assembly sketch follows the list):
  • Signals — every buy signal ever emitted, with the full 68-feature snapshot as it existed when the signal fired. Using point-in-time feature values — wallet stats as they were at signal time, not current values — is essential for preventing lookahead bias. The wallet_score_at_entry field records the alpha score at the moment of the buy; training uses this, not the current alpha score.
  • Outcomes — the PeakTracker retrospectively measures the highest price multiple each token reached at 10m, 30m, 1h, 4h, and 24h intervals after the signal. These become the binary training labels: reach_2x_1h = 1 if the token actually reached 2× within one hour of the signal firing.
  • Context — wallet features, creator stats, co-occurrence graph scores, market regime snapshots, and token state at time of signal are all joined in. The full 68-feature vector is assembled identically to how it is assembled during live inference, using the same FEATURE_ORDER and the same default values for missing data.
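As a rough illustration of that train/serve invariant, a point-in-time snapshot can be turned into a model input by walking the shared feature order and falling back to the shared defaults. This is a minimal sketch, not the production code: FEATURE_ORDER is truncated to two names here (only wallet_score_at_entry appears in these docs, the other is illustrative), and the zero default is an assumption.

```python
import numpy as np

# Truncated, illustrative feature list: the real FEATURE_ORDER has 68 entries.
FEATURE_ORDER = ["wallet_score_at_entry", "wallet_win_rate"]
# Assumed default for missing values; the docs only say the defaults are
# shared between training and live inference.
DEFAULTS = {name: 0.0 for name in FEATURE_ORDER}

def assemble_features(snapshot: dict) -> np.ndarray:
    """Build a vector from point-in-time values captured when the signal fired."""
    return np.array(
        [snapshot.get(name, DEFAULTS[name]) for name in FEATURE_ORDER],
        dtype=np.float32,
    )
```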

Target labels

Each model is trained on a separate binary target derived from PeakTracker output:
| Model | Label | Positive condition |
| --- | --- | --- |
| reach_2x_1h | reach_2x_1h | Token reached 2× within 1 hour of signal |
| reach_3x_30m | reach_3x_30m | Token reached 3× within 30 minutes of signal |
| reach_2x_10m | reach_2x_10m | Token reached 2× within 10 minutes of signal |
| is_dead_soon | is_dead_soon | Token died quickly (fast rug, no recovery) |
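A minimal sketch of how these labels could be derived from PeakTracker peak multiples; the peaks mapping and the died_fast flag are assumed shapes, not the confirmed schema:

```python
def make_labels(peaks: dict[str, float], died_fast: bool) -> dict[str, int]:
    # `peaks` maps a window ("10m", "30m", "1h", ...) to the highest price
    # multiple the token reached within that window after the signal.
    return {
        "reach_2x_1h":  int(peaks.get("1h", 0.0) >= 2.0),
        "reach_3x_30m": int(peaks.get("30m", 0.0) >= 3.0),
        "reach_2x_10m": int(peaks.get("10m", 0.0) >= 2.0),
        "is_dead_soon": int(died_fast),  # fast rug, no recovery
    }
```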

Train / validation split

The dataset is split chronologically, not randomly. The most recent 20% of signals are held out as the validation set. Random splitting would leak future wallet behaviour and token outcomes into the training set, producing models that appear to perform well in evaluation but fail in production.
Never use random splits for time-series signal data. A wallet’s future win rate must not be visible to the model during training on that wallet’s past signals.
The chronological split also means the validation set represents the model’s most likely operating conditions — recent market dynamics — rather than an average over all historical conditions.
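A chronological split needs only a sort and a cut. A minimal sketch, assuming one row per signal with a signal_time column:

```python
import pandas as pd

def chrono_split(df: pd.DataFrame, holdout: float = 0.2):
    """Hold out the most recent `holdout` fraction of signals for validation."""
    df = df.sort_values("signal_time")
    cutoff = int(len(df) * (1.0 - holdout))
    return df.iloc[:cutoff], df.iloc[cutoff:]  # train, validation
```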

Algorithm

LightGBM gradient boosting is used for all models. Key reasons for this choice:
  • Native missing value handling — many signals are missing some features (for example, a wallet with no history yet). LightGBM’s built-in handling of missing values outperforms imputation on tabular data with structural missingness.
  • Feature importance — LightGBM provides both gain-based and split-count importance rankings, used to audit the model and identify which signals it actually relies on.
  • ONNX export — LightGBM models compile cleanly to ONNX, which is required for in-process inference in the Node.js pipeline.
  • Training speed — models train in minutes on a few hundred thousand signals, making the full retrain-evaluate-deploy loop fast enough to respond to market regime shifts.
Early stopping is applied against the validation PR-AUC, preventing overfitting without requiring manual tuning of the number of boosting rounds.
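A minimal training sketch consistent with the description above, using LightGBM’s scikit-learn interface; the data variables and the 100-round patience are assumptions:

```python
from lightgbm import LGBMClassifier, early_stopping
from sklearn.metrics import average_precision_score

def pr_auc(y_true, y_score):
    # Custom eval in LightGBM's (name, value, is_higher_better) format.
    return "pr_auc", average_precision_score(y_true, y_score), True

def train_model(X_train, y_train, X_val, y_val):
    model = LGBMClassifier(objective="binary", n_estimators=5000)
    model.fit(
        X_train, y_train,
        eval_set=[(X_val, y_val)],
        eval_metric=pr_auc,  # early stopping is judged on validation PR-AUC
        callbacks=[early_stopping(stopping_rounds=100)],
    )
    return model
```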

Target imbalance and evaluation metric

All targets are heavily imbalanced — relatively few tokens actually reach 3× in 30 minutes. LightGBM’s scale_pos_weight parameter is used to compensate for this imbalance during training. PR-AUC (area under the precision-recall curve) is the primary evaluation metric, not ROC-AUC. With rare positives, ROC-AUC is inflated by the abundance of easy true negatives, and raw accuracy is worse still: a model that predicts “no” for every signal achieves 95%+ accuracy but near-zero PR-AUC. The pr_auc value stored in each model’s _metadata.json is the validation-set PR-AUC at the time of training.
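The contrast is easy to demonstrate with made-up numbers (a 5% positive rate is an illustration, not the system’s actual rate):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

y = np.r_[np.ones(50), np.zeros(950)]  # 1,000 signals, 5% positives
const = np.zeros_like(y)               # degenerate model: always "no"

print((const == y).mean())                # 0.95 accuracy despite learning nothing
print(roc_auc_score(y, const))            # 0.5: ties everywhere, uninformative
print(average_precision_score(y, const))  # ~0.05: collapses to the positive rate

# scale_pos_weight compensates for the same imbalance during training:
scale_pos_weight = (y == 0).sum() / (y == 1).sum()  # 19.0
```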

Calibration

Raw LightGBM outputs are probability-like but often miscalibrated — the model may output 0.80 for signals that actually hit the target only 55% of the time. Every model is calibrated post-training using Platt scaling: a sigmoid function σ(a·x + b) is fitted on the held-out validation set. After calibration, the model’s output is interpretable as an approximate hit rate. A calibrated score of 0.85 means roughly 85% of signals at that score level actually hit the target — which is what makes the strategy threshold values meaningful rather than arbitrary. The fitted platt_a and platt_b constants are stored in the model’s metadata file and applied at inference time.
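A minimal sketch of fitting and applying Platt scaling with scikit-learn; fitting a lightly regularised logistic regression on the raw score recovers exactly the σ(a·x + b) form described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(raw_scores: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    # Fit sigma(a*x + b) on the held-out validation set; C is set high to
    # approximate unregularised maximum likelihood.
    lr = LogisticRegression(C=1e6)
    lr.fit(raw_scores.reshape(-1, 1), labels)
    return float(lr.coef_[0, 0]), float(lr.intercept_[0])  # platt_a, platt_b

def calibrate(raw_score: float, platt_a: float, platt_b: float) -> float:
    # Applied at inference time using the constants from the metadata file.
    return 1.0 / (1.0 + np.exp(-(platt_a * raw_score + platt_b)))
```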

Hyperparameter optimisation

Key LightGBM parameters — num_leaves, learning_rate, min_child_samples, feature_fraction, bagging_fraction, and regularisation coefficients — are tuned using Optuna-based Bayesian search. The best parameters from each search run are recorded alongside the model and used as the starting point for future searches.
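A sketch of what such a search might look like; the parameter ranges, trial count, and data variables are assumptions:

```python
import optuna
from lightgbm import LGBMClassifier, early_stopping
from sklearn.metrics import average_precision_score

def make_objective(X_train, y_train, X_val, y_val):
    def objective(trial):
        params = {
            "num_leaves":        trial.suggest_int("num_leaves", 15, 255),
            "learning_rate":     trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            "min_child_samples": trial.suggest_int("min_child_samples", 5, 200),
            "feature_fraction":  trial.suggest_float("feature_fraction", 0.4, 1.0),
            "bagging_fraction":  trial.suggest_float("bagging_fraction", 0.4, 1.0),
            "bagging_freq":      1,  # required for bagging_fraction to take effect
            "reg_alpha":         trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
            "reg_lambda":        trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
        }
        model = LGBMClassifier(objective="binary", n_estimators=2000, **params)
        model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
                  callbacks=[early_stopping(100, verbose=False)])
        return average_precision_score(y_val, model.predict_proba(X_val)[:, 1])
    return objective

study = optuna.create_study(direction="maximize")  # TPE (Bayesian) sampler by default
# study.enqueue_trial(previous_best_params)        # seed with the prior run's best params
study.optimize(make_objective(X_train, y_train, X_val, y_val), n_trials=100)
```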

ONNX export and validation

After training and calibration, each model goes through a three-step export process before deployment (a combined sketch follows the list).

1. Export to ONNX: the trained LightGBM model is compiled to ONNX format for in-process inference via onnxruntime-node.
2. Write the metadata sidecar: a _metadata.json file is generated containing the model_id, version, ordered feature_names, feature_count, calibration parameters, and pr_auc. The inference code reads this file to assemble the feature vector in the correct order.
3. Validate ONNX output: a sample of the training data is run through both the original LightGBM model and the ONNX model, and the outputs are confirmed to match within floating-point tolerance. This catches any feature-ordering mismatch before the model reaches production.
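The three steps fit in a short script. A hedged end-to-end sketch, assuming onnxmltools’ LightGBM converter and illustrative file names; model, FEATURE_ORDER, platt_a/platt_b, val_pr_auc, and X_val carry over from the earlier sketches:

```python
import json

import numpy as np
import onnxruntime as ort
from onnxmltools import convert_lightgbm
from onnxmltools.convert.common.data_types import FloatTensorType

# 1. Export: compile the trained LightGBM model to ONNX.
onnx_model = convert_lightgbm(
    model, initial_types=[("input", FloatTensorType([None, 68]))]
)
with open("reach_2x_1h.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# 2. Metadata sidecar (field names from the docs; file naming is assumed).
metadata = {
    "model_id": "reach_2x_1h_v3",
    "version": 3,
    "feature_names": FEATURE_ORDER,
    "feature_count": len(FEATURE_ORDER),
    "platt_a": platt_a,
    "platt_b": platt_b,
    "pr_auc": val_pr_auc,
}
with open("reach_2x_1h_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

# 3. Parity check: both runtimes must agree within floating-point tolerance.
sample = np.asarray(X_val[:1000], dtype=np.float32)
sess = ort.InferenceSession("reach_2x_1h.onnx")
onnx_out = sess.run(None, {"input": sample})[1]  # per-row {class: probability} maps
onnx_probs = np.array([row[1] for row in onnx_out])
lgbm_probs = model.predict_proba(sample)[:, 1]
assert np.allclose(onnx_probs, lgbm_probs, atol=1e-5)
```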

Model versioning

Each model has a model_id (e.g. reach_2x_1h_v3) and an integer version field in its metadata. The inference service tracks which version is currently loaded. When the 5-minute hot-reload scan finds a new file, it compares the model_id and version against the loaded set; if the file is new or carries a higher version, the service loads it and releases the superseded ONNX session.
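The production service runs on onnxruntime-node; this Python sketch only mirrors the reload decision it describes, with assumed names and file layout:

```python
import glob
import json

import onnxruntime as ort

loaded: dict[str, dict] = {}  # model_id -> {"version": int, "session": InferenceSession}

def hot_reload_scan(model_dir: str) -> None:
    for meta_path in glob.glob(f"{model_dir}/*_metadata.json"):
        with open(meta_path) as f:
            meta = json.load(f)
        current = loaded.get(meta["model_id"])
        if current is None or meta["version"] > current["version"]:
            onnx_path = meta_path.replace("_metadata.json", ".onnx")
            loaded[meta["model_id"]] = {
                "version": meta["version"],
                # Overwriting the entry drops the last reference to the
                # superseded session, releasing it.
                "session": ort.InferenceSession(onnx_path),
            }
```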

Retraining cadence

There is no fixed retraining schedule. The ModelMonitor service — introduced in Phase 3 — tracks the live performance of each model against observed signal outcomes and detects drift between calibrated probabilities and actual hit rates. When drift is detected — typically caused by a shift in market dynamics, Pump.fun platform changes, or wallet behaviour patterns — a new training run is initiated with the accumulated data. In practice this means retraining every few weeks under normal conditions, or sooner when a significant platform event occurs (fee structure changes, graduation threshold adjustments, and similar).
The ModelMonitor service compares rolling actual hit rates against the model’s calibrated probability bands. A sustained gap between the two indicates that the distribution has shifted and the model’s calibration is no longer reliable.
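One way to quantify that gap, as a sketch only (the real ModelMonitor’s band definitions and drift threshold are not specified here):

```python
import numpy as np

def calibration_gap(calibrated: np.ndarray, hit: np.ndarray, n_bins: int = 10) -> float:
    """Largest gap between mean calibrated probability and actual hit rate per band."""
    bands = np.clip((calibrated * n_bins).astype(int), 0, n_bins - 1)
    gaps = [
        abs(calibrated[bands == b].mean() - hit[bands == b].mean())
        for b in range(n_bins)
        if (bands == b).sum() >= 50  # assumed minimum sample size per band
    ]
    return max(gaps, default=0.0)

# Assumed policy: flag drift when the gap stays large over a rolling window.
```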

Genesis model training

The genesis models follow the same methodology but use a separate 75-feature dataset assembled from the first-60-second observation windows stored by the GenesisWatcher. The targets are the same outcome labels but applied to tokens rather than to specific wallet signals. See Genesis Watcher for the full feature breakdown.
