Alpha Leak's models are trained on the system's own historical signal data: the same signals, wallet features, and outcome labels that the live pipeline continuously produces. This closed-loop design means the models improve as more data accumulates, and ensures that every feature the model sees in production was computed by the same code that produced it during training.
Training data
Training data is assembled by joining three sources:

- Signals: every buy signal ever emitted by the pipeline, with the full 68-feature snapshot captured at the moment the signal fired. Using point-in-time feature values (as they were at signal time, not current values) is essential for preventing lookahead bias.
- Outcome labels
- Context
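As a rough illustration of the join, here is a minimal sketch assuming the three sources land in Parquet files keyed by a signal_id column. The storage layer, file names, and column names are all hypothetical; only the three-way join itself is documented.

```python
import pandas as pd

signals = pd.read_parquet("signals.parquet")    # 68 point-in-time features per signal
outcomes = pd.read_parquet("outcomes.parquet")  # resolved outcome labels
context = pd.read_parquet("context.parquet")    # market/session context

dataset = (
    signals
    .merge(outcomes, on="signal_id", how="inner")  # keep only signals with resolved labels
    .merge(context, on="signal_id", how="left")
    .sort_values("signal_time")                    # chronological order, used by the split below
)
```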
Avoiding lookahead bias
The most common source of over-optimistic training results in production systems is lookahead bias: using information at training time that would not have been available at the moment the prediction needed to be made.

Train / validation / calibration split
The dataset is split chronologically, not randomly.

Validation set
The next 10–15% of signals (chronologically). Used for early stopping — training halts when validation PR-AUC stops improving, preventing overfit without manual epoch tuning.
Random splitting would leak future wallet behaviour and token outcomes into the training set. A wallet’s current graduation rate encodes information about future tokens it hasn’t bought yet at the time of a historical signal. Chronological splitting prevents this entirely.
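A minimal sketch of the split, continuing from the dataset frame above. The exact fractions (beyond the documented 10–15% validation slice) and the train → validation → calibration ordering are assumptions:

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, val_frac: float = 0.15, cal_frac: float = 0.10):
    """Oldest signals train, the next slice validates, the newest calibrate."""
    df = df.sort_values("signal_time")
    cut_val = int(len(df) * (1 - val_frac - cal_frac))
    cut_cal = int(len(df) * (1 - cal_frac))
    return df.iloc[:cut_val], df.iloc[cut_val:cut_cal], df.iloc[cut_cal:]

train_df, val_df, cal_df = chronological_split(dataset)
```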
Algorithm
LightGBM gradient boosting is used for all models.

Handles missing features gracefully
Many signals are missing some features — for example, a signal from a wallet with no trading history yet. LightGBM’s native missing value handling outperforms imputation on tabular data with structural missingness.
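A toy, self-contained illustration of the behaviour (synthetic data; min_data_in_leaf is lowered only so the tiny dataset actually splits):

```python
import numpy as np
import lightgbm as lgb

# Feature 1 is NaN for wallets with no trading history; no imputation needed.
X = np.array([[0.4, np.nan], [0.1, 3.0], [0.9, np.nan], [0.2, 1.0]] * 25)
y = np.array([1, 0, 1, 0] * 25)

params = {"objective": "binary", "min_data_in_leaf": 5, "verbose": -1}
model = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=10)
print(model.predict([[0.5, np.nan]]))  # NaN is routed down a learned default branch
```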
Feature importance
LightGBM provides both gain-based and split-count importance rankings. These are used to audit the model and identify which features it actually relies on — surfacing unexpected dependencies before deployment.
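Both rankings are available directly on any trained Booster (here reusing the model from the sketch above):

```python
import pandas as pd

audit = pd.DataFrame({
    "feature": model.feature_name(),
    "gain": model.feature_importance(importance_type="gain"),
    "splits": model.feature_importance(importance_type="split"),
}).sort_values("gain", ascending=False)
print(audit.head(15))  # sanity-check that the top features are expected ones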
ONNX export
LightGBM models compile cleanly to ONNX format, which is required for in-process inference in the Node.js pipeline without a Python runtime dependency.
Training speed
Models train in minutes on a few hundred thousand signals, making the full retrain-evaluate-deploy loop fast enough to respond to market regime shifts within hours.
Hyperparameter optimisation
Key LightGBM parameters (num_leaves, learning_rate, min_child_samples, feature_fraction, bagging_fraction, and the regularisation coefficients) are tuned using Optuna-based Bayesian search. The best parameters from each search run are recorded alongside the model's metadata and used as the starting point for future searches.
Early stopping is applied against the validation PR-AUC. Training halts when the validation metric stops improving, preventing overfit without requiring manual epoch tuning.
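For reference, a minimal sketch of what such a search might look like, continuing from the split sketch above. The parameter ranges and trial count are illustrative, not the system's actual settings:

```python
import lightgbm as lgb
import optuna
from sklearn.metrics import average_precision_score

# X_train/y_train, X_val/y_val: feature matrices and labels derived from
# the chronological train_df / val_df slices sketched earlier.

def objective(trial: optuna.Trial) -> float:
    params = {
        "objective": "binary",
        "metric": "average_precision",  # PR-AUC, used for early stopping
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "learning_rate": trial.suggest_float("learning_rate", 0.005, 0.2, log=True),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 200),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.5, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.5, 1.0),
        "bagging_freq": 1,
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "verbose": -1,
    }
    train_set = lgb.Dataset(X_train, label=y_train)
    booster = lgb.train(
        params,
        train_set,
        num_boost_round=2000,
        valid_sets=[lgb.Dataset(X_val, label=y_val, reference=train_set)],
        callbacks=[lgb.early_stopping(stopping_rounds=100)],
    )
    return average_precision_score(y_val, booster.predict(X_val))

study = optuna.create_study(direction="maximize")  # Optuna's default TPE sampler is Bayesian
# study.enqueue_trial(previous_best_params)        # one way to seed from the last recorded best
study.optimize(objective, n_trials=100)
```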
Handling class imbalance
All targets are heavily imbalanced: relatively few tokens actually reach 3× in 30 minutes. LightGBM's scale_pos_weight parameter is used to compensate for this imbalance during training.
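The documentation does not state how the value is chosen; a common convention, shown here as an assumption and continuing the sketches above, is the negative-to-positive ratio of the training labels:

```python
# Upweight the rare positive class by the negative/positive ratio.
n_pos = int(y_train.sum())
n_neg = int(len(y_train) - n_pos)
params["scale_pos_weight"] = n_neg / max(n_pos, 1)  # ~19:1 when ~5% of signals are positive
```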
PR-AUC (Precision-Recall Area Under Curve) is the primary evaluation metric, not ROC-AUC.
A model that predicts "no" for every signal achieves 95%+ accuracy on a dataset where only 5% of signals are positive, yet its PR-AUC is no better than the 5% baseline. ROC-AUC is similarly misleading when the positive class is rare, because the abundant negatives keep the false positive rate low. PR-AUC directly measures how well the model ranks true positives against false positives, which is exactly what matters for signal selection.
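A tiny self-contained demonstration of the gap, using synthetic labels and scikit-learn metrics:

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

y_true = np.array([1] * 50 + [0] * 950)            # 5% positive rate
always_no = np.zeros(len(y_true))                  # a model that scores every signal 0.0

print(accuracy_score(y_true, always_no > 0.5))     # 0.95 -- flattering but useless
print(average_precision_score(y_true, always_no))  # 0.05 -- just the baseline positive rate
```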
Platt calibration
Raw LightGBM outputs are probability-like but often miscalibrated; the model may output 0.80 for signals that actually hit the target only 55% of the time. This miscalibration makes raw thresholds in strategy configs unreliable. Every model is calibrated post-training using Platt scaling: a sigmoid function σ(a·x + b) fitted on the held-out calibration set. The parameters a and b are found by minimising log loss on the calibration set's true labels.
After calibration, a score of 0.85 means roughly 85% of signals at that score level actually hit the target, which is what makes the strategy threshold values meaningful.
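Fitting σ(a·x + b) by log-loss minimisation is exactly a one-feature logistic regression, so a sketch can lean on scikit-learn. This continues the earlier sketches (model is the trained booster, X_cal/y_cal the calibration slice); whether the pipeline fits on raw probabilities or logits is not specified here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

raw_cal = model.predict(X_cal)   # uncalibrated scores on the calibration set
lr = LogisticRegression(C=1e6)   # near-zero regularisation: a pure log-loss fit
lr.fit(raw_cal.reshape(-1, 1), y_cal)

platt_a = float(lr.coef_[0, 0])
platt_b = float(lr.intercept_[0])

def calibrate(raw_score: float) -> float:
    """Map a raw model score to a calibrated probability via sigma(a*x + b)."""
    return 1.0 / (1.0 + np.exp(-(platt_a * raw_score + platt_b)))
```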
ONNX export and validation
After training and calibration, each model goes through a three-step export process before it is eligible for deployment.

Export to ONNX
The trained LightGBM model is exported using skl2onnx or the native LightGBM ONNX exporter. The exported file is saved as <target>_v<version>.onnx.

Write the metadata sidecar
A _metadata.json file is written alongside the model, containing the model ID, ordered feature list, calibration parameters (platt_a, platt_b), and the validation PR-AUC score. This file is what the inference code reads to assemble feature vectors correctly.
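A sketch of these two steps using onnxmltools' LightGBM converter, one of several possible export routes. The exact exporter, the target name and version, and the sidecar field names other than platt_a and platt_b are assumptions:

```python
import json
import onnxmltools
from onnxmltools.convert.common.data_types import FloatTensorType

# model, platt_a, platt_b come from the sketches above; feature_names and
# val_pr_auc are assumed to have been recorded during training/validation.
TARGET, VERSION, N_FEATURES = "hits_3x_30m", 7, 68   # hypothetical target and version

# Step 1: export the trained booster to ONNX.
onnx_model = onnxmltools.convert_lightgbm(
    model, initial_types=[("input", FloatTensorType([None, N_FEATURES]))]
)
onnxmltools.utils.save_model(onnx_model, f"{TARGET}_v{VERSION}.onnx")

# Step 2: write the metadata sidecar the inference code reads.
metadata = {
    "model_id": f"{TARGET}_v{VERSION}",
    "features": feature_names,      # ordered list, e.g. booster.feature_name()
    "platt_a": platt_a,
    "platt_b": platt_b,
    "validation_pr_auc": val_pr_auc,
}
with open(f"{TARGET}_v{VERSION}_metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```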
Deploying a new model

Deploying a trained model to the live system requires no code changes and no service restart.

Copy the file pair
Place both the .onnx file and its _metadata.json sidecar into the src/ml/models/ directory on the production host.

Wait for the scan cycle
MlInference scans the models directory every 5 minutes. When it finds a .onnx file that is not already loaded, it creates a new ONNX inference session for it and adds it to the active model set.
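The live pipeline does this in Node.js; purely for illustration, here is the equivalent flow in Python with onnxruntime. The file names, the "input" tensor name, and the output layout are assumptions carried over from the export sketch above:

```python
import json
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("hits_3x_30m_v7.onnx")
with open("hits_3x_30m_v7_metadata.json") as f:
    meta = json.load(f)

def score(features: dict) -> float:
    # The sidecar's ordered feature list fixes the vector layout so it
    # matches training exactly; absent features fall back to NaN.
    vec = np.array([[features.get(n, np.nan) for n in meta["features"]]], dtype=np.float32)
    outputs = session.run(None, {"input": vec})
    raw = float(outputs[1][0][1])  # default classifier conversion: output 1 holds per-class probabilities
    return 1.0 / (1.0 + np.exp(-(meta["platt_a"] * raw + meta["platt_b"])))
```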
Retraining cadence

There is no fixed retraining schedule. The ModelMonitor service tracks the live performance of each model against observed signal outcomes and detects drift between calibrated probabilities and actual hit rates. When drift is detected (typically caused by a shift in market dynamics, Pump.fun platform changes, or wallet behaviour patterns), a new training run is initiated.
In practice this means retraining every few weeks under normal conditions, or sooner when a significant platform event occurs such as fee structure changes or graduation threshold adjustments.
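ModelMonitor's actual drift test is not documented here; as a minimal sketch of the idea, one can bucket live signals by calibrated score and compare predicted against observed hit rates per bucket:

```python
import numpy as np

def calibration_drift(probs, outcomes, n_bins: int = 10) -> float:
    """Return the worst |predicted - observed| gap across populated score buckets."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    worst = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.sum() < 50:  # ignore thinly populated buckets
            continue
        worst = max(worst, abs(probs[mask].mean() - outcomes[mask].mean()))
    return worst  # e.g. trigger a retrain when this exceeds a tolerance
```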
Genesis model training
The genesis models follow the same methodology (chronological split, LightGBM, Platt calibration, ONNX export) but use a separate 75-feature dataset. Features are assembled from the first-60-second observation windows stored by GenesisWatcher, and the training labels are the same outcome targets applied to tokens rather than to specific wallet signals.
Model architecture
ONNX deployment, hot reloading, inference pipeline, and composite scoring.
Feature reference
Complete documentation for all 68 features in the standard model vector.