Overview
Tree-based models are powerful non-linear algorithms that work by learning decision rules from features. Ensemble methods combine multiple trees to achieve better performance and robustness.
Decision Trees Single tree classifiers and regressors
Random Forests Bagging ensemble of decision trees
Gradient Boosting Sequential boosting algorithms
Other Ensembles Bagging, AdaBoost, and more
Decision Trees
Decision trees learn hierarchical decision rules to partition the feature space.
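To make the partitioning idea concrete, here is a minimal sketch of how a tree might score a candidate split with Gini impurity. The helper names are hypothetical and not part of the bun-scikit API:

```typescript
// Gini impurity of a set of class labels: 1 - sum(p_c^2).
function gini(labels: number[]): number {
  const counts = new Map<number, number>();
  for (const label of labels) counts.set(label, (counts.get(label) ?? 0) + 1);
  let impurity = 1;
  for (const count of counts.values()) impurity -= (count / labels.length) ** 2;
  return impurity;
}

// Weighted impurity after splitting a 1-D feature `x` at `threshold`.
function splitImpurity(x: number[], labels: number[], threshold: number): number {
  const left: number[] = [];
  const right: number[] = [];
  x.forEach((v, i) => (v <= threshold ? left : right).push(labels[i]));
  const n = labels.length;
  return (left.length / n) * gini(left) + (right.length / n) * gini(right);
}

console.log(gini([0, 0, 1, 1])); // 0.5 — a maximally mixed two-class node
console.log(splitImpurity([0, 1, 2, 3], [0, 0, 1, 1], 1.5)); // 0 — a perfect split
```

A tree grows by repeatedly choosing the feature/threshold pair that minimizes this weighted impurity.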
Decision Tree Classifier
import { DecisionTreeClassifier } from "bun-scikit";

const X = [
  [0, 0],
  [1, 1],
  [0, 1],
  [1, 0],
];
const y = [0, 0, 1, 1];

const tree = new DecisionTreeClassifier({
  maxDepth: 5,
  minSamplesSplit: 2,
  minSamplesLeaf: 1,
  maxFeatures: "sqrt",
  randomState: 42,
});
tree.fit(X, y);

const predictions = tree.predict([[0.5, 0.5]]);
console.log(predictions);

// Get prediction probabilities
const probabilities = tree.predictProba([[0.5, 0.5]]);
console.log(probabilities); // [[0.5, 0.5]]
Configuration Options
maxDepth
number
default: "undefined"
Maximum depth of the tree. Unlimited if not specified.
minSamplesSplit
number
Minimum number of samples required to split an internal node.
minSamplesLeaf
number
Minimum number of samples required to be at a leaf node.
maxFeatures
'sqrt' | 'log2' | number | null
default: "null"
Number of features to consider when looking for the best split.
randomState
number
default: "undefined"
Random seed for reproducibility.
Decision Tree Regressor
import { DecisionTreeRegressor } from "bun-scikit";

const X = [[0], [1], [2], [3], [4]];
const y = [0, 1, 4, 9, 16]; // y = x²

const regressor = new DecisionTreeRegressor({
  maxDepth: 10,
  minSamplesSplit: 2,
});
regressor.fit(X, y);

const predictions = regressor.predict([[2.5]]);
console.log(predictions);
Feature Importance
tree.fit(X, y);

if (tree.featureImportances_) {
  console.log("Feature importances:", tree.featureImportances_);

  // Find the most important feature
  const maxIdx = tree.featureImportances_.indexOf(
    Math.max(...tree.featureImportances_)
  );
  console.log(`Most important feature: ${maxIdx}`);
}
Decision trees use the native Zig backend when BUN_SCIKIT_TREE_BACKEND=zig is set, which provides significant performance improvements.
Random Forests
Random forests build multiple decision trees and aggregate their predictions through voting (classification) or averaging (regression).
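The aggregation step can be sketched in a few lines. This is illustrative only, not the library's internals: classification takes the mode of the trees' votes, regression the mean of their outputs.

```typescript
// Majority vote over per-tree class predictions (classification).
function majorityVote(treePredictions: number[]): number {
  const counts = new Map<number, number>();
  for (const p of treePredictions) counts.set(p, (counts.get(p) ?? 0) + 1);
  let best = treePredictions[0];
  let bestCount = 0;
  for (const [label, count] of counts) {
    if (count > bestCount) {
      best = label;
      bestCount = count;
    }
  }
  return best;
}

// Mean of per-tree outputs (regression).
const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

console.log(majorityVote([1, 0, 1, 1, 2])); // 1 — three of five trees agree
console.log(mean([10, 20, 30])); // 20
```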
Random Forest Classifier
import { RandomForestClassifier } from "bun-scikit";

const X = [
  [0, 0], [0.1, -0.1],
  [1, 1], [1.1, 0.9],
  [5, 5], [5.1, 4.9],
];
const y = [0, 0, 1, 1, 2, 2];

const forest = new RandomForestClassifier({
  nEstimators: 100,
  maxDepth: 12,
  minSamplesSplit: 2,
  minSamplesLeaf: 1,
  maxFeatures: "sqrt",
  bootstrap: true,
  randomState: 42,
});
forest.fit(X, y);

console.log(forest.classes_); // [0, 1, 2]
console.log(forest.fitBackend_); // "zig" when native backend enabled

const predictions = forest.predict([[0.5, 0.5], [5.2, 5.1]]);
console.log(predictions);

// Get class probabilities
const probabilities = forest.predictProba([[1, 1]]);
console.log(probabilities); // [[0.05, 0.90, 0.05]]
Configuration Options
nEstimators
number
Number of trees in the forest.
maxDepth
number
Maximum depth of each tree.
maxFeatures
'sqrt' | 'log2' | number | null
default: "'sqrt'"
Number of features to consider for each split. 'sqrt' uses √n_features.
bootstrap
boolean
Whether to use bootstrap samples when building trees.
randomState
number
default: "undefined"
Random seed for reproducibility.
Random Forest Regressor
import { RandomForestRegressor } from "bun-scikit";

const X = [
  [1, 2], [2, 3], [3, 4],
  [4, 5], [5, 6], [6, 7],
];
const y = [10, 20, 30, 40, 50, 60];

const rfRegressor = new RandomForestRegressor({
  nEstimators: 100,
  maxDepth: 10,
  randomState: 42,
});
rfRegressor.fit(X, y);

const predictions = rfRegressor.predict([[3.5, 4.5]]);
console.log(predictions);
Feature Importance Analysis
forest.fit(X, y);

if (forest.featureImportances_) {
  const importances = forest.featureImportances_;

  // Create feature ranking
  const ranking = importances
    .map((importance, idx) => ({ feature: idx, importance }))
    .sort((a, b) => b.importance - a.importance);

  console.log("Top 3 features:");
  ranking.slice(0, 3).forEach(({ feature, importance }) => {
    console.log(`  Feature ${feature}: ${importance.toFixed(4)}`);
  });
}
Gradient Boosting
Gradient boosting builds trees sequentially, with each tree correcting errors made by previous trees.
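The core loop can be sketched with a toy example: for squared-error regression, each round fits a learner to the current residuals and adds a shrunken copy of its prediction. Here the "learner" is just the residual mean (a depth-0 tree), purely for illustration:

```typescript
// Conceptual sketch of gradient boosting for squared-error regression.
function boostingRounds(y: number[], nRounds: number, learningRate: number): number[] {
  let pred = y.map(() => 0); // start from a zero prediction
  for (let round = 0; round < nRounds; round++) {
    // Pseudo-residuals: negative gradient of squared error.
    const residuals = y.map((target, i) => target - pred[i]);
    // "Fit" a trivial learner: predict the residual mean everywhere.
    const stump = residuals.reduce((a, b) => a + b, 0) / y.length;
    // Shrink its contribution and add it to the running prediction.
    pred = pred.map((p) => p + learningRate * stump);
  }
  return pred;
}

const preds = boostingRounds([10, 10, 10, 10], 50, 0.1);
console.log(preds[0]); // approaches 10 as rounds accumulate
```

The learning rate trades off per-round progress against the number of rounds needed, which is why `learningRate` and `nEstimators` are usually tuned together.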
Gradient Boosting Classifier
import { GradientBoostingClassifier } from "bun-scikit";

const X = [
  [0, 0], [0, 1],
  [1, 0], [1, 1],
];
const y = [0, 1, 1, 0]; // XOR problem

const gbm = new GradientBoostingClassifier({
  nEstimators: 100,
  learningRate: 0.1,
  maxDepth: 3,
  minSamplesSplit: 2,
  minSamplesLeaf: 1,
  subsample: 1.0,
  randomState: 42,
});
gbm.fit(X, y);

const predictions = gbm.predict([[0, 0], [1, 1]]);
console.log(predictions);

const probabilities = gbm.predictProba([[0.5, 0.5]]);
console.log(probabilities);
Configuration
nEstimators
number
Number of boosting stages to perform.
learningRate
number
Learning rate; shrinks the contribution of each tree.
maxDepth
number
Maximum depth of individual trees.
subsample
number
Fraction of samples used to fit each tree (< 1.0 enables stochastic gradient boosting).
Gradient Boosting Regressor
import { GradientBoostingRegressor } from "bun-scikit";

const X = Array.from({ length: 100 }, (_, i) => [i]);
const y = X.map(([x]) => Math.sin(x / 10) * 50);

const gbr = new GradientBoostingRegressor({
  nEstimators: 100,
  learningRate: 0.1,
  maxDepth: 4,
});
gbr.fit(X, y);

const predictions = gbr.predict([[50]]);
Histogram-Based Gradient Boosting
For large datasets, use histogram-based gradient boosting:
import { HistGradientBoostingClassifier } from "bun-scikit";

const histGbm = new HistGradientBoostingClassifier({
  maxIter: 100,
  maxDepth: 10,
  learningRate: 0.1,
});
histGbm.fit(X, y);
Histogram-based boosting is significantly faster on large datasets (> 10k samples) by binning continuous features.
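The binning idea can be sketched as follows (illustrative only, not the library's internals): each continuous value is mapped to one of a small number of equal-width buckets, so split search scans bin boundaries instead of every sorted raw value.

```typescript
// Map continuous feature values to equal-width bin indices in [0, nBins).
function binFeature(values: number[], nBins: number): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const width = (max - min) / nBins || 1; // guard against a constant feature
  return values.map((v) => Math.min(nBins - 1, Math.floor((v - min) / width)));
}

console.log(binFeature([0, 2.5, 5, 7.5, 10], 4)); // [0, 1, 2, 3, 3]
```

With, say, 256 bins, split search cost per feature drops from O(n log n) sorting to a single O(n) histogram pass plus an O(bins) scan.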
Other Ensembles
Extra Trees
Extremely Randomized Trees select split thresholds at random instead of searching for optimal ones:
import { ExtraTreesClassifier, ExtraTreesRegressor } from "bun-scikit";

const extraTrees = new ExtraTreesClassifier({
  nEstimators: 100,
  maxDepth: 10,
  randomState: 42,
});
extraTrees.fit(X, y);
AdaBoost
Adaptive Boosting weights samples based on classification difficulty:
import { AdaBoostClassifier } from "bun-scikit";

const adaboost = new AdaBoostClassifier({
  nEstimators: 50,
  learningRate: 1.0,
  randomState: 42,
});
adaboost.fit(X, y);
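The reweighting at the heart of AdaBoost can be sketched in a few lines. This is an illustrative, discrete-AdaBoost-style update, not the library's internals: misclassified samples gain weight so the next learner focuses on them.

```typescript
// One AdaBoost reweighting step given which samples the learner got wrong.
function reweight(weights: number[], misclassified: boolean[]): number[] {
  const total = weights.reduce((s, w) => s + w, 0);
  // Weighted error rate of the current learner.
  const err =
    weights.reduce((s, w, i) => s + (misclassified[i] ? w : 0), 0) / total;
  // The learner's "say": large when error is low, zero at error 0.5.
  const alpha = 0.5 * Math.log((1 - err) / err);
  // Up-weight mistakes, down-weight correct samples, then renormalize.
  const updated = weights.map((w, i) =>
    w * Math.exp(misclassified[i] ? alpha : -alpha)
  );
  const newTotal = updated.reduce((s, w) => s + w, 0);
  return updated.map((w) => w / newTotal);
}

const w = reweight([0.25, 0.25, 0.25, 0.25], [true, false, false, false]);
console.log(w); // the misclassified sample's weight grows well above 0.25
```

After this update the misclassified sample carries half the total weight, so the next learner cannot ignore it.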
Bagging
Bootstrap Aggregating with any base estimator:
import { BaggingClassifier, DecisionTreeClassifier } from "bun-scikit";

const bagging = new BaggingClassifier({
  baseEstimator: new DecisionTreeClassifier({ maxDepth: 5 }),
  nEstimators: 10,
  maxSamples: 1.0,
  bootstrap: true,
  randomState: 42,
});
bagging.fit(X, y);
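Bootstrap sampling itself is simple to sketch: each estimator trains on n indices drawn with replacement from the n-sample dataset. A seeded linear congruential generator stands in here for `randomState`-style reproducibility; it is a hypothetical helper, not the library's RNG.

```typescript
// Draw n dataset indices with replacement, deterministically from a seed.
function bootstrapIndices(n: number, seed: number): number[] {
  let state = seed;
  const next = () => {
    // Small LCG: enough for an illustration, not for production use.
    state = (state * 1103515245 + 12345) % 2147483648;
    return state / 2147483648;
  };
  return Array.from({ length: n }, () => Math.floor(next() * n));
}

const idx = bootstrapIndices(6, 42);
console.log(idx.length); // 6 — same size as the dataset
console.log(idx.every((i) => i >= 0 && i < 6)); // true — all valid indices
```

Because sampling is with replacement, some indices repeat and others are left out; the left-out samples are what out-of-bag evaluation uses.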
Voting Ensemble
Combine multiple different estimators:
import {
  VotingClassifier,
  LogisticRegression,
  RandomForestClassifier,
  GradientBoostingClassifier,
} from "bun-scikit";

const voting = new VotingClassifier({
  estimators: [
    ["lr", new LogisticRegression()],
    ["rf", new RandomForestClassifier({ nEstimators: 50 })],
    ["gb", new GradientBoostingClassifier({ nEstimators: 50 })],
  ],
  voting: "soft", // Use predict_proba
});
voting.fit(X, y);
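Soft voting can be sketched in miniature (illustrative, not the library's internals): average each estimator's class-probability vector for a sample, then take the argmax.

```typescript
// Average per-estimator probability vectors and pick the most likely class.
function softVote(probas: number[][]): number {
  const nClasses = probas[0].length;
  const avg = Array.from({ length: nClasses }, (_, c) =>
    probas.reduce((sum, p) => sum + p[c], 0) / probas.length
  );
  return avg.indexOf(Math.max(...avg));
}

// Three estimators, two classes: two lean toward class 1, one toward class 0.
console.log(softVote([[0.4, 0.6], [0.2, 0.8], [0.8, 0.2]])); // 1
```

Unlike hard voting, a confident estimator can outvote two lukewarm ones, which is why soft voting usually works better when the probability estimates are well calibrated.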
Stacking
Stack predictions from multiple models:
import {
  StackingClassifier,
  LogisticRegression,
  RandomForestClassifier,
  GradientBoostingClassifier,
} from "bun-scikit";

const stacking = new StackingClassifier({
  estimators: [
    ["rf", new RandomForestClassifier()],
    ["gb", new GradientBoostingClassifier()],
  ],
  finalEstimator: new LogisticRegression(),
  cv: 5,
});
stacking.fit(X, y);
Enable the Zig backend for a significant speedup:
export BUN_SCIKIT_TREE_BACKEND=zig
bun run native:build
This provides a 5-10x speedup for tree training.
Random forests train trees independently and can benefit from parallel execution. The native backend handles this automatically.
For large datasets:
Reduce maxDepth to limit tree size
Use maxFeatures to limit split candidates
Enable bootstrap=true for random forests
Use histogram-based boosting for 100k+ samples
Hyperparameter Tuning
Grid Search for Random Forest
import { GridSearchCV, RandomForestClassifier } from "bun-scikit";

const search = new GridSearchCV(
  (params) =>
    new RandomForestClassifier({
      nEstimators: params.nEstimators as number,
      maxDepth: params.maxDepth as number,
      maxFeatures: params.maxFeatures as "sqrt" | "log2",
    }),
  {
    nEstimators: [50, 100, 200],
    maxDepth: [10, 20, 30],
    maxFeatures: ["sqrt", "log2"],
  },
  { cv: 5, scoring: "accuracy" }
);
search.fit(X, y);

console.log("Best parameters:", search.bestParams_);
console.log("Best score:", search.bestScore_);
Common Patterns
Early Stopping for Gradient Boosting
const X_train = X.slice(0, 80);
const y_train = y.slice(0, 80);
const X_val = X.slice(80);
const y_val = y.slice(80);

// Train with increasing numbers of estimators and track validation accuracy
for (let n = 50; n <= 1000; n += 50) {
  const partial = new GradientBoostingClassifier({ nEstimators: n });
  partial.fit(X_train, y_train);
  const score = partial.score(X_val, y_val);
  console.log(`n=${n}, validation accuracy=${score.toFixed(4)}`);
}
Out-of-Bag Evaluation
const rf = new RandomForestClassifier({
  nEstimators: 100,
  bootstrap: true,
  oobScore: true, // Enable OOB scoring if available
});
rf.fit(X, y);

// Check rf.oobScore_ for out-of-bag accuracy
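Out-of-bag evaluation works because bootstrap sampling leaves each tree untrained on a predictable fraction of the data, roughly (1 - 1/n)^n ≈ 1/e ≈ 36.8% of samples, and those held-out samples act as a built-in validation set:

```typescript
// Expected fraction of samples a single tree never sees under bootstrapping.
const n = 10000;
const oobFraction = Math.pow(1 - 1 / n, n);
console.log(oobFraction.toFixed(3)); // 0.368 — approaches 1/e as n grows
```

This is why an OOB score can approximate cross-validation accuracy without a separate hold-out split.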
When to Use Each Model
Decision Trees
Interpretable models
Small datasets
Quick prototyping
Random Forests
General-purpose classification/regression
Robust to overfitting
Feature importance needed
Gradient Boosting
Maximum accuracy
Structured/tabular data
Competitions
Extra Trees
Faster training than RF
More randomization
Less prone to overfitting
Next Steps
Model Selection Cross-validation and tuning
Zig Acceleration Enable native backend for 5-10x speedup