
Overview

Tree-based models are powerful non-linear algorithms that work by learning decision rules from features. Ensemble methods combine multiple trees to achieve better performance and robustness.

Decision Trees

Single tree classifiers and regressors

Random Forests

Bagging ensemble of decision trees

Gradient Boosting

Sequential boosting algorithms

Other Ensembles

Bagging, AdaBoost, and more

Decision Trees

Decision trees learn hierarchical decision rules to partition the feature space.
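The split criterion isn't exposed by the library, but the core idea can be sketched in plain TypeScript: try every candidate threshold on every feature and keep the split that minimizes the weighted Gini impurity of the two children. This is an illustrative sketch, not bun-scikit's actual implementation.

```typescript
// Simplified sketch of classification split selection: scan every
// midpoint threshold on every feature and keep the split with the
// lowest weighted Gini impurity of the two child partitions.

function gini(labels: number[]): number {
  const counts = new Map<number, number>();
  for (const l of labels) counts.set(l, (counts.get(l) ?? 0) + 1);
  let sum = 0;
  for (const c of counts.values()) sum += (c / labels.length) ** 2;
  return 1 - sum;
}

function bestSplit(X: number[][], y: number[]) {
  let best = { feature: -1, threshold: NaN, score: Infinity };
  for (let f = 0; f < X[0].length; f++) {
    const values = [...new Set(X.map((row) => row[f]))].sort((a, b) => a - b);
    for (let i = 1; i < values.length; i++) {
      const t = (values[i - 1] + values[i]) / 2; // midpoint candidate
      const left = y.filter((_, j) => X[j][f] <= t);
      const right = y.filter((_, j) => X[j][f] > t);
      const score =
        (left.length / y.length) * gini(left) +
        (right.length / y.length) * gini(right);
      if (score < best.score) best = { feature: f, threshold: t, score };
    }
  }
  return best;
}

// Feature 1 separates the classes perfectly, so it should be chosen.
const split = bestSplit(
  [[0, 0], [1, 0], [0, 5], [1, 5]],
  [0, 0, 1, 1]
);
console.log(split); // { feature: 1, threshold: 2.5, score: 0 }
```

A score of 0 means both children are pure; the tree recurses on each child until a stopping rule (maxDepth, minSamplesSplit, minSamplesLeaf) applies.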

Decision Tree Classifier

import { DecisionTreeClassifier } from "bun-scikit";

const X = [
  [0, 0],
  [1, 1],
  [0, 1],
  [1, 0],
];
const y = [0, 0, 1, 1];

const tree = new DecisionTreeClassifier({
  maxDepth: 5,
  minSamplesSplit: 2,
  minSamplesLeaf: 1,
  maxFeatures: "sqrt",
  randomState: 42,
});

tree.fit(X, y);

const predictions = tree.predict([[0.5, 0.5]]);
console.log(predictions);

// Get prediction probabilities
const probabilities = tree.predictProba([[0.5, 0.5]]);
console.log(probabilities); // [[0.5, 0.5]]

Configuration Options

  • maxDepth (number, default: unlimited): Maximum depth of the tree; grows without limit if not specified.
  • minSamplesSplit (number, default: 2): Minimum number of samples required to split an internal node.
  • minSamplesLeaf (number, default: 1): Minimum number of samples required to be at a leaf node.
  • maxFeatures ('sqrt' | 'log2' | number | null, default: null): Number of features to consider when looking for the best split.
  • randomState (number, default: undefined): Random seed for reproducibility.

Decision Tree Regressor

import { DecisionTreeRegressor } from "bun-scikit";

const X = [[0], [1], [2], [3], [4]];
const y = [0, 1, 4, 9, 16]; // y = x²

const regressor = new DecisionTreeRegressor({
  maxDepth: 10,
  minSamplesSplit: 2,
});

regressor.fit(X, y);
const predictions = regressor.predict([[2.5]]);
console.log(predictions);

Feature Importance

tree.fit(X, y);

if (tree.featureImportances_) {
  console.log("Feature importances:", tree.featureImportances_);
  
  // Find most important feature
  const maxIdx = tree.featureImportances_.indexOf(
    Math.max(...tree.featureImportances_)
  );
  console.log(`Most important feature: ${maxIdx}`);
}
Decision trees use the native Zig backend when BUN_SCIKIT_TREE_BACKEND=zig is set, which provides significant performance improvements.

Random Forests

Random forests build multiple decision trees and aggregate their predictions through voting (classification) or averaging (regression).
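The two ingredients of that description, bootstrap sampling and vote aggregation, can be sketched on their own. The following is illustrative only and says nothing about how bun-scikit implements them internally.

```typescript
// Sketch of a random forest's building blocks: draw a bootstrap
// sample (n rows with replacement) per tree, then combine the
// per-tree class predictions by majority vote.

function bootstrapIndices(n: number, rng: () => number): number[] {
  return Array.from({ length: n }, () => Math.floor(rng() * n));
}

function majorityVote(votes: number[]): number {
  const counts = new Map<number, number>();
  for (const v of votes) counts.set(v, (counts.get(v) ?? 0) + 1);
  let winner = votes[0];
  for (const [label, count] of counts) {
    if (count > (counts.get(winner) ?? 0)) winner = label;
  }
  return winner;
}

// Three hypothetical trees vote on one sample:
console.log(majorityVote([0, 1, 0])); // 0
```

For regression the aggregation step is simply the mean of the per-tree predictions instead of a vote.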

Random Forest Classifier

import { RandomForestClassifier } from "bun-scikit";

const X = [
  [0, 0], [0.1, -0.1],
  [1, 1], [1.1, 0.9],
  [5, 5], [5.1, 4.9],
];
const y = [0, 0, 1, 1, 2, 2];

const forest = new RandomForestClassifier({
  nEstimators: 100,
  maxDepth: 12,
  minSamplesSplit: 2,
  minSamplesLeaf: 1,
  maxFeatures: "sqrt",
  bootstrap: true,
  randomState: 42,
});

forest.fit(X, y);

console.log(forest.classes_); // [0, 1, 2]
console.log(forest.fitBackend_); // "zig" when native backend enabled

const predictions = forest.predict([[0.5, 0.5], [5.2, 5.1]]);
console.log(predictions);

// Get class probabilities
const probabilities = forest.predictProba([[1, 1]]);
console.log(probabilities); // [[0.05, 0.90, 0.05]]

Configuration Options

  • nEstimators (number, default: 50): Number of trees in the forest.
  • maxDepth (number, default: 12): Maximum depth of each tree.
  • maxFeatures ('sqrt' | 'log2' | number | null, default: 'sqrt'): Number of features to consider for each split; 'sqrt' uses √n_features.
  • bootstrap (boolean, default: true): Whether to use bootstrap samples when building trees.
  • randomState (number, default: undefined): Random seed for reproducibility.

Random Forest Regressor

import { RandomForestRegressor } from "bun-scikit";

const X = [
  [1, 2], [2, 3], [3, 4],
  [4, 5], [5, 6], [6, 7],
];
const y = [10, 20, 30, 40, 50, 60];

const rfRegressor = new RandomForestRegressor({
  nEstimators: 100,
  maxDepth: 10,
  randomState: 42,
});

rfRegressor.fit(X, y);
const predictions = rfRegressor.predict([[3.5, 4.5]]);
console.log(predictions);

Feature Importance Analysis

forest.fit(X, y);

if (forest.featureImportances_) {
  const importances = forest.featureImportances_;
  
  // Create feature ranking
  const ranking = importances
    .map((importance, idx) => ({ feature: idx, importance }))
    .sort((a, b) => b.importance - a.importance);
  
  console.log("Top 3 features:");
  ranking.slice(0, 3).forEach(({ feature, importance }) => {
    console.log(`  Feature ${feature}: ${importance.toFixed(4)}`);
  });
}

Gradient Boosting

Gradient boosting builds trees sequentially, with each tree correcting errors made by previous trees.
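That sequential correction can be made concrete with a toy sketch: for squared-error regression, each stage fits a depth-1 tree (a stump) to the residuals of the current model and adds it with a shrinkage factor. This is a conceptual illustration; bun-scikit's internals may differ.

```typescript
// Toy gradient boosting for squared-error regression on 1-D inputs.
// Each round fits a stump to the residuals (the negative gradient of
// squared loss) and adds it scaled by learningRate.

type Stump = { threshold: number; left: number; right: number };

function fitStump(x: number[], r: number[]): Stump {
  const mean = (a: number[]) =>
    a.length ? a.reduce((s, v) => s + v, 0) / a.length : 0;
  let best: Stump = { threshold: x[0], left: 0, right: 0 };
  let bestSse = Infinity;
  for (const t of x) {
    const leftR = r.filter((_, i) => x[i] <= t);
    const rightR = r.filter((_, i) => x[i] > t);
    const l = mean(leftR);
    const rt = mean(rightR);
    const sse =
      leftR.reduce((s, v) => s + (v - l) ** 2, 0) +
      rightR.reduce((s, v) => s + (v - rt) ** 2, 0);
    if (sse < bestSse) {
      bestSse = sse;
      best = { threshold: t, left: l, right: rt };
    }
  }
  return best;
}

function boost(x: number[], y: number[], nEstimators: number, learningRate: number) {
  const f0 = y.reduce((s, v) => s + v, 0) / y.length; // initial constant model
  const stumps: Stump[] = [];
  const pred = x.map(() => f0);
  for (let m = 0; m < nEstimators; m++) {
    const residuals = y.map((yi, i) => yi - pred[i]); // errors of current model
    const s = fitStump(x, residuals);
    stumps.push(s);
    for (let i = 0; i < x.length; i++) {
      pred[i] += learningRate * (x[i] <= s.threshold ? s.left : s.right);
    }
  }
  return (xi: number) =>
    stumps.reduce(
      (acc, s) => acc + learningRate * (xi <= s.threshold ? s.left : s.right),
      f0
    );
}

const predict = boost([0, 1, 2, 3, 4, 5], [0, 0, 0, 10, 10, 10], 50, 0.3);
console.log(predict(1).toFixed(2), predict(4).toFixed(2)); // 0.00 10.00
```

The learningRate here plays exactly the role described in the options below: smaller values shrink each tree's contribution, so more stages are needed but the model generalizes better.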

Gradient Boosting Classifier

import { GradientBoostingClassifier } from "bun-scikit";

const X = [
  [0, 0], [0, 1],
  [1, 0], [1, 1],
];
const y = [0, 1, 1, 0]; // XOR problem

const gbm = new GradientBoostingClassifier({
  nEstimators: 100,
  learningRate: 0.1,
  maxDepth: 3,
  minSamplesSplit: 2,
  minSamplesLeaf: 1,
  subsample: 1.0,
  randomState: 42,
});

gbm.fit(X, y);

const predictions = gbm.predict([[0, 0], [1, 1]]);
console.log(predictions);

const probabilities = gbm.predictProba([[0.5, 0.5]]);
console.log(probabilities);

Configuration

  • nEstimators (number, default: 100): Number of boosting stages to perform.
  • learningRate (number, default: 0.1): Learning rate; shrinks the contribution of each tree.
  • maxDepth (number, default: 3): Maximum depth of individual trees.
  • subsample (number, default: 1.0): Fraction of samples used to fit each tree; values < 1.0 enable stochastic gradient boosting.

Gradient Boosting Regressor

import { GradientBoostingRegressor } from "bun-scikit";

const X = Array.from({ length: 100 }, (_, i) => [i]);
const y = X.map(([x]) => Math.sin(x / 10) * 50);

const gbr = new GradientBoostingRegressor({
  nEstimators: 100,
  learningRate: 0.1,
  maxDepth: 4,
});

gbr.fit(X, y);
const predictions = gbr.predict([[50]]);

Histogram-Based Gradient Boosting

For large datasets, use histogram-based gradient boosting:
import { HistGradientBoostingClassifier } from "bun-scikit";

const histGbm = new HistGradientBoostingClassifier({
  maxIter: 100,
  maxDepth: 10,
  learningRate: 0.1,
});

histGbm.fit(X, y);
Histogram-based boosting is significantly faster on large datasets (> 10k samples) by binning continuous features.
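The binning step behind that speedup can be sketched as follows: map each continuous value to one of a fixed number of bins, so split search only scans bin boundaries instead of every distinct value. Uniform-width bins are shown for simplicity; an actual implementation might use quantile bins.

```typescript
// Sketch of feature binning for histogram-based boosting: reduce a
// continuous feature to small integer bin ids so candidate splits
// are limited to nBins - 1 boundaries.

function binFeature(values: number[], nBins: number): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const width = (max - min) / nBins || 1; // guard against zero range
  return values.map((v) => Math.min(nBins - 1, Math.floor((v - min) / width)));
}

const binned = binFeature([0.1, 0.4, 2.2, 7.9, 9.6, 10.0], 4);
console.log(binned); // [0, 0, 0, 3, 3, 3]
```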

Other Ensembles

Extra Trees

Extremely Randomized Trees use random thresholds instead of optimal thresholds:
import { ExtraTreesClassifier, ExtraTreesRegressor } from "bun-scikit";

const extraTrees = new ExtraTreesClassifier({
  nEstimators: 100,
  maxDepth: 10,
  randomState: 42,
});

extraTrees.fit(X, y);

AdaBoost

Adaptive Boosting weights samples based on classification difficulty:
import { AdaBoostClassifier } from "bun-scikit";

const adaboost = new AdaBoostClassifier({
  nEstimators: 50,
  learningRate: 1.0,
  randomState: 42,
});

adaboost.fit(X, y);
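The sample-reweighting idea can be sketched in isolation: after each round, samples the current weak learner got wrong have their weights multiplied by e^α and all weights are renormalized, so later learners focus on the hard cases. This is a simplified illustration, not the library's actual update rule.

```typescript
// Sketch of AdaBoost's reweighting step: boost the weight of each
// misclassified sample by exp(alpha), then renormalize to sum to 1.

function reweight(weights: number[], misclassified: boolean[], alpha: number): number[] {
  const updated = weights.map((w, i) => (misclassified[i] ? w * Math.exp(alpha) : w));
  const total = updated.reduce((s, w) => s + w, 0);
  return updated.map((w) => w / total);
}

// One of four equally weighted samples was misclassified:
const next = reweight([0.25, 0.25, 0.25, 0.25], [true, false, false, false], Math.log(3));
console.log(next[0]); // 0.5 — the hard sample now carries half the total weight
```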

Bagging

Bootstrap Aggregating with any base estimator:
import { BaggingClassifier, DecisionTreeClassifier } from "bun-scikit";

const bagging = new BaggingClassifier({
  baseEstimator: new DecisionTreeClassifier({ maxDepth: 5 }),
  nEstimators: 10,
  maxSamples: 1.0,
  bootstrap: true,
  randomState: 42,
});

bagging.fit(X, y);

Voting Ensemble

Combine multiple different estimators:
import {
  VotingClassifier,
  LogisticRegression,
  RandomForestClassifier,
  GradientBoostingClassifier,
} from "bun-scikit";

const voting = new VotingClassifier({
  estimators: [
    ["lr", new LogisticRegression()],
    ["rf", new RandomForestClassifier({ nEstimators: 50 })],
    ["gb", new GradientBoostingClassifier({ nEstimators: 50 })],
  ],
  voting: "soft", // Use predict_proba
});

voting.fit(X, y);
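Soft voting averages the per-class probability vectors from each estimator and picks the class with the highest mean, which can give a different answer than a hard majority vote when one estimator is very confident. A self-contained sketch of the aggregation:

```typescript
// Sketch of soft voting: average each estimator's predictProba
// output for a sample, then take the argmax over classes.

function softVote(probas: number[][]): number {
  const nClasses = probas[0].length;
  const mean = Array.from({ length: nClasses }, (_, c) =>
    probas.reduce((s, p) => s + p[c], 0) / probas.length
  );
  return mean.indexOf(Math.max(...mean));
}

// Three estimators' probabilities for one sample over two classes.
// A hard vote would pick class 1 (two of three prefer it), but the
// first estimator's confidence tips the averaged result to class 0.
console.log(softVote([[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]])); // 0
```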

Stacking

Stack predictions from multiple models:
import {
  StackingClassifier,
  LogisticRegression,
  RandomForestClassifier,
  GradientBoostingClassifier,
} from "bun-scikit";

const stacking = new StackingClassifier({
  estimators: [
    ["rf", new RandomForestClassifier()],
    ["gb", new GradientBoostingClassifier()],
  ],
  finalEstimator: new LogisticRegression(),
  cv: 5,
});

stacking.fit(X, y);
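The cv option matters because stacking typically trains each base model on k-1 folds and predicts the held-out fold, so the meta-features fed to finalEstimator are out-of-fold and don't leak training labels. The fold assignment can be sketched as:

```typescript
// Sketch of k-fold index assignment for stacking's out-of-fold
// meta-features: every row lands in exactly one held-out fold.

function kFoldIndices(n: number, k: number): number[][] {
  const folds: number[][] = Array.from({ length: k }, () => []);
  for (let i = 0; i < n; i++) folds[i % k].push(i);
  return folds;
}

const folds = kFoldIndices(10, 5);
console.log(folds); // [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]
```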

Performance Optimization

Enable Zig backend for significant speedup:
export BUN_SCIKIT_TREE_BACKEND=zig
bun run native:build
This provides 5-10x speedup for tree training.
Random forests train trees independently and can benefit from parallel execution. The native backend handles this automatically.
For large datasets:
  • Reduce maxDepth to limit tree size
  • Use maxFeatures to limit split candidates
  • Enable bootstrap=true for random forests
  • Use histogram-based boosting for 100k+ samples

Hyperparameter Tuning

Grid Search for Random Forest

import { GridSearchCV } from "bun-scikit";
import { RandomForestClassifier } from "bun-scikit";

const search = new GridSearchCV(
  (params) => new RandomForestClassifier({
    nEstimators: params.nEstimators as number,
    maxDepth: params.maxDepth as number,
    maxFeatures: params.maxFeatures as "sqrt" | "log2",
  }),
  {
    nEstimators: [50, 100, 200],
    maxDepth: [10, 20, 30],
    maxFeatures: ["sqrt", "log2"],
  },
  { cv: 5, scoring: "accuracy" }
);

search.fit(X, y);

console.log("Best parameters:", search.bestParams_);
console.log("Best score:", search.bestScore_);

Common Patterns

Early Stopping for Gradient Boosting

const X_train = X.slice(0, 80);
const y_train = y.slice(0, 80);
const X_val = X.slice(80);
const y_val = y.slice(80);

// Refit with increasing numbers of estimators and watch the
// validation score to find a good stopping point.
for (let n = 50; n <= 1000; n += 50) {
  const partial = new GradientBoostingClassifier({ nEstimators: n });
  partial.fit(X_train, y_train);
  const score = partial.score(X_val, y_val);
  console.log(`n=${n}, validation accuracy=${score.toFixed(4)}`);
}

Out-of-Bag Evaluation

const rf = new RandomForestClassifier({
  nEstimators: 100,
  bootstrap: true,
  oobScore: true, // Enable OOB scoring if available
});

rf.fit(X, y);
// Check rf.oobScore_ for out-of-bag accuracy
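The out-of-bag idea is simple to sketch: rows that were never drawn into a tree's bootstrap sample are unseen by that tree and can serve as a free validation set for it. An illustrative helper:

```typescript
// Sketch of finding a tree's out-of-bag rows: everything not drawn
// into its bootstrap sample.

function oobIndices(n: number, bootstrap: number[]): number[] {
  const drawn = new Set(bootstrap);
  return Array.from({ length: n }, (_, i) => i).filter((i) => !drawn.has(i));
}

// A bootstrap sample of 6 rows that happened to miss rows 2 and 5:
console.log(oobIndices(6, [0, 1, 1, 3, 4, 0])); // [2, 5]
```

On average a bootstrap sample of size n omits about 1/e ≈ 37% of the rows, which is why OOB scoring gives a usable accuracy estimate without a separate validation split.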

When to Use Each Model

Decision Trees

  • Interpretable models
  • Small datasets
  • Quick prototyping

Random Forests

  • General-purpose classification/regression
  • Robust to overfitting
  • Feature importance needed

Gradient Boosting

  • Maximum accuracy
  • Structured/tabular data
  • Competitions

Extra Trees

  • Faster training than RF
  • More randomization
  • Less prone to overfitting

Next Steps

Model Selection

Cross-validation and tuning

Zig Acceleration

Enable native backend for 10x speedup
