
Overview

Tree-based models are powerful non-linear algorithms that work by learning decision rules from features. Ensemble methods combine multiple trees to achieve better performance and robustness.

Decision Trees

Single tree classifiers and regressors

Random Forests

Bagging ensemble of decision trees

Gradient Boosting

Sequential boosting algorithms

Other Ensembles

Bagging, AdaBoost, and more

Decision Trees

Decision trees learn hierarchical decision rules to partition the feature space.
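The split criterion isn't exposed by the library, but the core idea can be sketched in plain TypeScript: try every candidate threshold on every feature and keep the split that minimizes the weighted Gini impurity of the two children. This is an illustrative sketch, not bun-scikit's actual implementation.

```typescript
// Simplified sketch of classification split selection: scan every
// midpoint threshold on every feature and keep the split with the
// lowest weighted Gini impurity of the two child partitions.

function gini(labels: number[]): number {
  const counts = new Map<number, number>();
  for (const l of labels) counts.set(l, (counts.get(l) ?? 0) + 1);
  let sum = 0;
  for (const c of counts.values()) sum += (c / labels.length) ** 2;
  return 1 - sum;
}

function bestSplit(X: number[][], y: number[]) {
  let best = { feature: -1, threshold: NaN, score: Infinity };
  for (let f = 0; f < X[0].length; f++) {
    const values = [...new Set(X.map((row) => row[f]))].sort((a, b) => a - b);
    for (let i = 1; i < values.length; i++) {
      const t = (values[i - 1] + values[i]) / 2; // midpoint candidate
      const left = y.filter((_, j) => X[j][f] <= t);
      const right = y.filter((_, j) => X[j][f] > t);
      const score =
        (left.length / y.length) * gini(left) +
        (right.length / y.length) * gini(right);
      if (score < best.score) best = { feature: f, threshold: t, score };
    }
  }
  return best;
}

// Feature 1 separates the classes perfectly, so it should be chosen.
const split = bestSplit(
  [[0, 0], [1, 0], [0, 5], [1, 5]],
  [0, 0, 1, 1]
);
console.log(split); // { feature: 1, threshold: 2.5, score: 0 }
```

A score of 0 means both children are pure; the tree recurses on each child until a stopping rule (maxDepth, minSamplesSplit, minSamplesLeaf) applies.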

Decision Tree Classifier

import { DecisionTreeClassifier } from "bun-scikit";

const X = [
  [0, 0],
  [1, 1],
  [0, 1],
  [1, 0],
];
const y = [0, 0, 1, 1];

const tree = new DecisionTreeClassifier({
  maxDepth: 5,
  minSamplesSplit: 2,
  minSamplesLeaf: 1,
  maxFeatures: "sqrt",
  randomState: 42,
});

tree.fit(X, y);

const predictions = tree.predict([[0.5, 0.5]]);
console.log(predictions);

// Get prediction probabilities
const probabilities = tree.predictProba([[0.5, 0.5]]);
console.log(probabilities); // [[0.5, 0.5]]

Configuration Options

  • maxDepth (number, default: unlimited): Maximum depth of the tree; grows without limit if not specified.
  • minSamplesSplit (number, default: 2): Minimum number of samples required to split an internal node.
  • minSamplesLeaf (number, default: 1): Minimum number of samples required to be at a leaf node.
  • maxFeatures ('sqrt' | 'log2' | number | null, default: null): Number of features to consider when looking for the best split.
  • randomState (number, default: undefined): Random seed for reproducibility.

Decision Tree Regressor

import { DecisionTreeRegressor } from "bun-scikit";

const X = [[0], [1], [2], [3], [4]];
const y = [0, 1, 4, 9, 16]; // y = x²

const regressor = new DecisionTreeRegressor({
  maxDepth: 10,
  minSamplesSplit: 2,
});

regressor.fit(X, y);
const predictions = regressor.predict([[2.5]]);
console.log(predictions);

Feature Importance

tree.fit(X, y);

if (tree.featureImportances_) {
  console.log("Feature importances:", tree.featureImportances_);
  
  // Find most important feature
  const maxIdx = tree.featureImportances_.indexOf(
    Math.max(...tree.featureImportances_)
  );
  console.log(`Most important feature: ${maxIdx}`);
}
Decision trees use the native Zig backend when BUN_SCIKIT_TREE_BACKEND=zig is set, which provides significant performance improvements.

Random Forests

Random forests build multiple decision trees and aggregate their predictions through voting (classification) or averaging (regression).
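The two ingredients of that description, bootstrap sampling and vote aggregation, can be sketched on their own. The following is illustrative only and says nothing about how bun-scikit implements them internally.

```typescript
// Sketch of a random forest's building blocks: draw a bootstrap
// sample (n rows with replacement) per tree, then combine the
// per-tree class predictions by majority vote.

function bootstrapIndices(n: number, rng: () => number): number[] {
  return Array.from({ length: n }, () => Math.floor(rng() * n));
}

function majorityVote(votes: number[]): number {
  const counts = new Map<number, number>();
  for (const v of votes) counts.set(v, (counts.get(v) ?? 0) + 1);
  let winner = votes[0];
  for (const [label, count] of counts) {
    if (count > (counts.get(winner) ?? 0)) winner = label;
  }
  return winner;
}

// Three hypothetical trees vote on one sample:
console.log(majorityVote([0, 1, 0])); // 0
```

For regression the aggregation step is simply the mean of the per-tree predictions instead of a vote.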

Random Forest Classifier

import { RandomForestClassifier } from "bun-scikit";

const X = [
  [0, 0], [0.1, -0.1],
  [1, 1], [1.1, 0.9],
  [5, 5], [5.1, 4.9],
];
const y = [0, 0, 1, 1, 2, 2];

const forest = new RandomForestClassifier({
  nEstimators: 100,
  maxDepth: 12,
  minSamplesSplit: 2,
  minSamplesLeaf: 1,
  maxFeatures: "sqrt",
  bootstrap: true,
  randomState: 42,
});

forest.fit(X, y);

console.log(forest.classes_); // [0, 1, 2]
console.log(forest.fitBackend_); // "zig" when native backend enabled

const predictions = forest.predict([[0.5, 0.5], [5.2, 5.1]]);
console.log(predictions);

// Get class probabilities
const probabilities = forest.predictProba([[1, 1]]);
console.log(probabilities); // [[0.05, 0.90, 0.05]]

Configuration Options

  • nEstimators (number, default: 50): Number of trees in the forest.
  • maxDepth (number, default: 12): Maximum depth of each tree.
  • maxFeatures ('sqrt' | 'log2' | number | null, default: 'sqrt'): Number of features to consider for each split; 'sqrt' uses √n_features.
  • bootstrap (boolean, default: true): Whether to use bootstrap samples when building trees.
  • randomState (number, default: undefined): Random seed for reproducibility.

Random Forest Regressor

import { RandomForestRegressor } from "bun-scikit";

const X = [
  [1, 2], [2, 3], [3, 4],
  [4, 5], [5, 6], [6, 7],
];
const y = [10, 20, 30, 40, 50, 60];

const rfRegressor = new RandomForestRegressor({
  nEstimators: 100,
  maxDepth: 10,
  randomState: 42,
});

rfRegressor.fit(X, y);
const predictions = rfRegressor.predict([[3.5, 4.5]]);
console.log(predictions);

Feature Importance Analysis

forest.fit(X, y);

if (forest.featureImportances_) {
  const importances = forest.featureImportances_;
  
  // Create feature ranking
  const ranking = importances
    .map((importance, idx) => ({ feature: idx, importance }))
    .sort((a, b) => b.importance - a.importance);
  
  console.log("Top 3 features:");
  ranking.slice(0, 3).forEach(({ feature, importance }) => {
    console.log(`  Feature ${feature}: ${importance.toFixed(4)}`);
  });
}

Gradient Boosting

Gradient boosting builds trees sequentially, with each tree correcting errors made by previous trees.
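That sequential correction can be made concrete with a toy sketch: for squared-error regression, each stage fits a depth-1 tree (a stump) to the residuals of the current model and adds it with a shrinkage factor. This is a conceptual illustration; bun-scikit's internals may differ.

```typescript
// Toy gradient boosting for squared-error regression on 1-D inputs.
// Each round fits a stump to the residuals (the negative gradient of
// squared loss) and adds it scaled by learningRate.

type Stump = { threshold: number; left: number; right: number };

function fitStump(x: number[], r: number[]): Stump {
  const mean = (a: number[]) =>
    a.length ? a.reduce((s, v) => s + v, 0) / a.length : 0;
  let best: Stump = { threshold: x[0], left: 0, right: 0 };
  let bestSse = Infinity;
  for (const t of x) {
    const leftR = r.filter((_, i) => x[i] <= t);
    const rightR = r.filter((_, i) => x[i] > t);
    const l = mean(leftR);
    const rt = mean(rightR);
    const sse =
      leftR.reduce((s, v) => s + (v - l) ** 2, 0) +
      rightR.reduce((s, v) => s + (v - rt) ** 2, 0);
    if (sse < bestSse) {
      bestSse = sse;
      best = { threshold: t, left: l, right: rt };
    }
  }
  return best;
}

function boost(x: number[], y: number[], nEstimators: number, learningRate: number) {
  const f0 = y.reduce((s, v) => s + v, 0) / y.length; // initial constant model
  const stumps: Stump[] = [];
  const pred = x.map(() => f0);
  for (let m = 0; m < nEstimators; m++) {
    const residuals = y.map((yi, i) => yi - pred[i]); // errors of current model
    const s = fitStump(x, residuals);
    stumps.push(s);
    for (let i = 0; i < x.length; i++) {
      pred[i] += learningRate * (x[i] <= s.threshold ? s.left : s.right);
    }
  }
  return (xi: number) =>
    stumps.reduce(
      (acc, s) => acc + learningRate * (xi <= s.threshold ? s.left : s.right),
      f0
    );
}

const predict = boost([0, 1, 2, 3, 4, 5], [0, 0, 0, 10, 10, 10], 50, 0.3);
console.log(predict(1).toFixed(2), predict(4).toFixed(2)); // 0.00 10.00
```

The learningRate here plays exactly the role described in the options below: smaller values shrink each tree's contribution, so more stages are needed but the model generalizes better.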

Gradient Boosting Classifier

import { GradientBoostingClassifier } from "bun-scikit";

const X = [
  [0, 0], [0, 1],
  [1, 0], [1, 1],
];
const y = [0, 1, 1, 0]; // XOR problem

const gbm = new GradientBoostingClassifier({
  nEstimators: 100,
  learningRate: 0.1,
  maxDepth: 3,
  minSamplesSplit: 2,
  minSamplesLeaf: 1,
  subsample: 1.0,
  randomState: 42,
});

gbm.fit(X, y);

const predictions = gbm.predict([[0, 0], [1, 1]]);
console.log(predictions);

const probabilities = gbm.predictProba([[0.5, 0.5]]);
console.log(probabilities);

Configuration

  • nEstimators (number, default: 100): Number of boosting stages to perform.
  • learningRate (number, default: 0.1): Learning rate; shrinks the contribution of each tree.
  • maxDepth (number, default: 3): Maximum depth of individual trees.
  • subsample (number, default: 1.0): Fraction of samples used to fit each tree; values < 1.0 enable stochastic gradient boosting.

Gradient Boosting Regressor

import { GradientBoostingRegressor } from "bun-scikit";

const X = Array.from({ length: 100 }, (_, i) => [i]);
const y = X.map(([x]) => Math.sin(x / 10) * 50);

const gbr = new GradientBoostingRegressor({
  nEstimators: 100,
  learningRate: 0.1,
  maxDepth: 4,
});

gbr.fit(X, y);
const predictions = gbr.predict([[50]]);

Histogram-Based Gradient Boosting

For large datasets, use histogram-based gradient boosting:
import { HistGradientBoostingClassifier } from "bun-scikit";

const histGbm = new HistGradientBoostingClassifier({
  maxIter: 100,
  maxDepth: 10,
  learningRate: 0.1,
});

histGbm.fit(X, y);
Histogram-based boosting is significantly faster on large datasets (> 10k samples) by binning continuous features.
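The binning step behind that speedup can be sketched as follows: map each continuous value to one of a fixed number of bins, so split search only scans bin boundaries instead of every distinct value. Uniform-width bins are shown for simplicity; an actual implementation might use quantile bins.

```typescript
// Sketch of feature binning for histogram-based boosting: reduce a
// continuous feature to small integer bin ids so candidate splits
// are limited to nBins - 1 boundaries.

function binFeature(values: number[], nBins: number): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const width = (max - min) / nBins || 1; // guard against zero range
  return values.map((v) => Math.min(nBins - 1, Math.floor((v - min) / width)));
}

const binned = binFeature([0.1, 0.4, 2.2, 7.9, 9.6, 10.0], 4);
console.log(binned); // [0, 0, 0, 3, 3, 3]
```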

Other Ensembles

Extra Trees

Extremely Randomized Trees use random thresholds instead of optimal thresholds:
import { ExtraTreesClassifier, ExtraTreesRegressor } from "bun-scikit";

const extraTrees = new ExtraTreesClassifier({
  nEstimators: 100,
  maxDepth: 10,
  randomState: 42,
});

extraTrees.fit(X, y);

AdaBoost

Adaptive Boosting weights samples based on classification difficulty:
import { AdaBoostClassifier } from "bun-scikit";

const adaboost = new AdaBoostClassifier({
  nEstimators: 50,
  learningRate: 1.0,
  randomState: 42,
});

adaboost.fit(X, y);
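The sample-reweighting idea can be sketched in isolation: after each round, samples the current weak learner got wrong have their weights multiplied by e^α and all weights are renormalized, so later learners focus on the hard cases. This is a simplified illustration, not the library's actual update rule.

```typescript
// Sketch of AdaBoost's reweighting step: boost the weight of each
// misclassified sample by exp(alpha), then renormalize to sum to 1.

function reweight(weights: number[], misclassified: boolean[], alpha: number): number[] {
  const updated = weights.map((w, i) => (misclassified[i] ? w * Math.exp(alpha) : w));
  const total = updated.reduce((s, w) => s + w, 0);
  return updated.map((w) => w / total);
}

// One of four equally weighted samples was misclassified:
const next = reweight([0.25, 0.25, 0.25, 0.25], [true, false, false, false], Math.log(3));
console.log(next[0]); // 0.5 — the hard sample now carries half the total weight
```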

Bagging

Bootstrap Aggregating with any base estimator:
import { BaggingClassifier, DecisionTreeClassifier } from "bun-scikit";

const bagging = new BaggingClassifier({
  baseEstimator: new DecisionTreeClassifier({ maxDepth: 5 }),
  nEstimators: 10,
  maxSamples: 1.0,
  bootstrap: true,
  randomState: 42,
});

bagging.fit(X, y);

Voting Ensemble

Combine multiple different estimators:
import {
  VotingClassifier,
  LogisticRegression,
  RandomForestClassifier,
  GradientBoostingClassifier,
} from "bun-scikit";

const voting = new VotingClassifier({
  estimators: [
    ["lr", new LogisticRegression()],
    ["rf", new RandomForestClassifier({ nEstimators: 50 })],
    ["gb", new GradientBoostingClassifier({ nEstimators: 50 })],
  ],
  voting: "soft", // Use predict_proba
});

voting.fit(X, y);
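Soft voting averages the per-class probability vectors from each estimator and picks the class with the highest mean, which can give a different answer than a hard majority vote when one estimator is very confident. A self-contained sketch of the aggregation:

```typescript
// Sketch of soft voting: average each estimator's predictProba
// output for a sample, then take the argmax over classes.

function softVote(probas: number[][]): number {
  const nClasses = probas[0].length;
  const mean = Array.from({ length: nClasses }, (_, c) =>
    probas.reduce((s, p) => s + p[c], 0) / probas.length
  );
  return mean.indexOf(Math.max(...mean));
}

// Three estimators' probabilities for one sample over two classes.
// A hard vote would pick class 1 (two of three prefer it), but the
// first estimator's confidence tips the averaged result to class 0.
console.log(softVote([[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]])); // 0
```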

Stacking

Stack predictions from multiple models:
import {
  StackingClassifier,
  LogisticRegression,
  RandomForestClassifier,
  GradientBoostingClassifier,
} from "bun-scikit";

const stacking = new StackingClassifier({
  estimators: [
    ["rf", new RandomForestClassifier()],
    ["gb", new GradientBoostingClassifier()],
  ],
  finalEstimator: new LogisticRegression(),
  cv: 5,
});

stacking.fit(X, y);
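The cv option matters because stacking typically trains each base model on k-1 folds and predicts the held-out fold, so the meta-features fed to finalEstimator are out-of-fold and don't leak training labels. The fold assignment can be sketched as:

```typescript
// Sketch of k-fold index assignment for stacking's out-of-fold
// meta-features: every row lands in exactly one held-out fold.

function kFoldIndices(n: number, k: number): number[][] {
  const folds: number[][] = Array.from({ length: k }, () => []);
  for (let i = 0; i < n; i++) folds[i % k].push(i);
  return folds;
}

const folds = kFoldIndices(10, 5);
console.log(folds); // [[0, 5], [1, 6], [2, 7], [3, 8], [4, 9]]
```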

Performance Optimization

Enable Zig backend for significant speedup:
export BUN_SCIKIT_TREE_BACKEND=zig
bun run native:build
This provides 5-10x speedup for tree training.
Random forests train trees independently and can benefit from parallel execution. The native backend handles this automatically.
For large datasets:
  • Reduce maxDepth to limit tree size
  • Use maxFeatures to limit split candidates
  • Enable bootstrap=true for random forests
  • Use histogram-based boosting for 100k+ samples

Hyperparameter Tuning

Grid Search for Random Forest

import { GridSearchCV } from "bun-scikit";
import { RandomForestClassifier } from "bun-scikit";

const search = new GridSearchCV(
  (params) => new RandomForestClassifier({
    nEstimators: params.nEstimators as number,
    maxDepth: params.maxDepth as number,
    maxFeatures: params.maxFeatures as "sqrt" | "log2",
  }),
  {
    nEstimators: [50, 100, 200],
    maxDepth: [10, 20, 30],
    maxFeatures: ["sqrt", "log2"],
  },
  { cv: 5, scoring: "accuracy" }
);

search.fit(X, y);

console.log("Best parameters:", search.bestParams_);
console.log("Best score:", search.bestScore_);

Common Patterns

Early Stopping for Gradient Boosting

const X_train = X.slice(0, 80);
const y_train = y.slice(0, 80);
const X_val = X.slice(80);
const y_val = y.slice(80);

// Refit with increasing numbers of estimators and watch the
// validation score to find a good stopping point.
for (let n = 50; n <= 1000; n += 50) {
  const partial = new GradientBoostingClassifier({ nEstimators: n });
  partial.fit(X_train, y_train);
  const score = partial.score(X_val, y_val);
  console.log(`n=${n}, validation accuracy=${score.toFixed(4)}`);
}

Out-of-Bag Evaluation

const rf = new RandomForestClassifier({
  nEstimators: 100,
  bootstrap: true,
  oobScore: true, // Enable OOB scoring if available
});

rf.fit(X, y);
// Check rf.oobScore_ for out-of-bag accuracy
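The out-of-bag idea is simple to sketch: rows that were never drawn into a tree's bootstrap sample are unseen by that tree and can serve as a free validation set for it. An illustrative helper:

```typescript
// Sketch of finding a tree's out-of-bag rows: everything not drawn
// into its bootstrap sample.

function oobIndices(n: number, bootstrap: number[]): number[] {
  const drawn = new Set(bootstrap);
  return Array.from({ length: n }, (_, i) => i).filter((i) => !drawn.has(i));
}

// A bootstrap sample of 6 rows that happened to miss rows 2 and 5:
console.log(oobIndices(6, [0, 1, 1, 3, 4, 0])); // [2, 5]
```

On average a bootstrap sample of size n omits about 1/e ≈ 37% of the rows, which is why OOB scoring gives a usable accuracy estimate without a separate validation split.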

When to Use Each Model

Decision Trees

  • Interpretable models
  • Small datasets
  • Quick prototyping

Random Forests

  • General-purpose classification/regression
  • Robust to overfitting
  • Feature importance needed

Gradient Boosting

  • Maximum accuracy
  • Structured/tabular data
  • Competitions

Extra Trees

  • Faster training than RF
  • More randomization
  • Less prone to overfitting

Next Steps

Model Selection

Cross-validation and tuning

Zig Acceleration

Enable native backend for 10x speedup
