Preprocessing is a critical step in the machine learning pipeline. bun-scikit provides a comprehensive set of transformers to clean, scale, and prepare your data for training.
## Overview

All preprocessing transformers in bun-scikit follow a consistent API pattern:

- `fit(X)`: Learn parameters from training data
- `transform(X)`: Apply the transformation
- `fitTransform(X)`: Fit and transform in one step
- `inverseTransform(X)`: Reverse the transformation (when applicable)

Preprocessing transformers store learned parameters with a trailing underscore (e.g., `mean_`, `scale_`) to indicate they were computed during fitting.
## Standard Scaling

Standardize features by removing the mean and scaling to unit variance. This is the most common preprocessing technique for many machine learning algorithms.

```ts
import { StandardScaler } from "bun-scikit";

const X = [
  [1, 2],
  [2, 4],
  [3, 6],
  [4, 8],
];

const scaler = new StandardScaler();
const XScaled = scaler.fitTransform(X);

console.log("Mean:", scaler.mean_);
console.log("Scale:", scaler.scale_);
```
### How It Works

StandardScaler transforms each feature to have:

- **Mean = 0**: subtracts the mean from each value
- **Standard deviation = 1**: divides by the standard deviation

The transformation formula is:

```
X_scaled = (X - mean) / std
```
### View Source Implementation

```ts
// Simplified from src/preprocessing/StandardScaler.ts
fit(X: Matrix): this {
  const nSamples = X.length;
  const nFeatures = X[0].length;
  const means = new Array(nFeatures).fill(0);
  const variances = new Array(nFeatures).fill(0);

  // Calculate means
  for (let i = 0; i < nSamples; i += 1) {
    for (let j = 0; j < nFeatures; j += 1) {
      means[j] += X[i][j];
    }
  }
  for (let j = 0; j < nFeatures; j += 1) {
    means[j] /= nSamples;
  }

  // Calculate standard deviations
  for (let i = 0; i < nSamples; i += 1) {
    for (let j = 0; j < nFeatures; j += 1) {
      const diff = X[i][j] - means[j];
      variances[j] += diff * diff;
    }
  }
  const scales = variances.map((v) => {
    const std = Math.sqrt(v / nSamples);
    return std === 0 ? 1 : std; // Avoid division by zero
  });

  this.mean_ = means;
  this.scale_ = scales;
  return this;
}
```
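To see the formula in action, here is a standalone sketch (plain TypeScript, not bun-scikit's implementation) that applies standardization to the first feature column of the example above and then inverts it:

```ts
// Standalone sketch (not library code): standardize one feature
// column, then recover the original values.
const column = [1, 2, 3, 4];

const mean = column.reduce((sum, v) => sum + v, 0) / column.length; // 2.5
const variance =
  column.reduce((sum, v) => sum + (v - mean) ** 2, 0) / column.length; // 1.25
const std = Math.sqrt(variance);

// Forward: X_scaled = (X - mean) / std
const scaled = column.map((v) => (v - mean) / std);

// Inverse: X = X_scaled * std + mean
const restored = scaled.map((v) => v * std + mean);
```

After scaling, the column has mean 0 and unit variance, and the inverse transform recovers the original values (up to floating-point rounding).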
## Min-Max Scaling

Scale features to a specific range (default: [0, 1]). Useful when you need bounded values or when the algorithm is sensitive to feature magnitude.

```ts
import { MinMaxScaler } from "bun-scikit";

const scaler = new MinMaxScaler({ featureRange: [0, 1] });
const XScaled = scaler.fitTransform(X);

// Access learned parameters
console.log("Data min:", scaler.dataMin_);
console.log("Data max:", scaler.dataMax_);
```
### Custom Range

```ts
// Scale to [-1, 1] range
const scaler = new MinMaxScaler({ featureRange: [-1, 1] });
const XScaled = scaler.fitTransform(X);
```
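The min-max transform itself is simple arithmetic: shift by the data minimum, divide by the data range, then stretch into the target range. A standalone sketch (plain TypeScript, not bun-scikit's implementation) for one column:

```ts
// Standalone sketch (not library code): min-max scale one column
// into a target [rangeMin, rangeMax].
const column = [1, 2, 3, 4];
const rangeMin = -1;
const rangeMax = 1;

const dataMin = Math.min(...column); // 1
const dataMax = Math.max(...column); // 4

// X_scaled = (X - dataMin) / (dataMax - dataMin) * (rangeMax - rangeMin) + rangeMin
const scaled = column.map(
  (v) =>
    ((v - dataMin) / (dataMax - dataMin)) * (rangeMax - rangeMin) + rangeMin
);
// scaled is approximately [-1, -0.333, 0.333, 1]
```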
## Other Scalers

### RobustScaler

Uses median and IQR instead of mean and standard deviation, making it robust to outliers.

```ts
import { RobustScaler } from "bun-scikit";

const scaler = new RobustScaler();
const XScaled = scaler.fitTransform(X);
```

### MaxAbsScaler

Scales by dividing each feature by its maximum absolute value. Preserves sparsity and sign.

```ts
import { MaxAbsScaler } from "bun-scikit";

const scaler = new MaxAbsScaler();
const XScaled = scaler.fitTransform(X);
```

### Normalizer

Scales individual samples to have unit norm (L1, L2, or max norm).

```ts
import { Normalizer } from "bun-scikit";

const normalizer = new Normalizer({ norm: "l2" });
const XNormalized = normalizer.fitTransform(X);
```
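Unlike the scalers above, which operate column-wise, the Normalizer operates row-wise: each sample is divided by its own norm, so every sample lands on the unit sphere. A standalone sketch of L2 normalization (plain TypeScript, not bun-scikit's implementation):

```ts
// Standalone sketch (not library code): L2-normalize each row
// of a matrix.
const X = [
  [3, 4],
  [1, 1],
];

const XNorm = X.map((row) => {
  const norm = Math.sqrt(row.reduce((sum, v) => sum + v * v, 0));
  return norm === 0 ? row : row.map((v) => v / norm);
});
// XNorm[0] is [0.6, 0.8]: a 3-4-5 triangle scaled to unit length
```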
## Encoding Categorical Data

### Label Encoding

Convert categorical labels to numeric values.

```ts
import { LabelEncoder } from "bun-scikit";

const encoder = new LabelEncoder();
const y = ["cat", "dog", "cat", "bird", "dog"];
const yEncoded = encoder.fitTransform(y);
// [0, 1, 0, 2, 1]

console.log("Classes:", encoder.classes_);
// ["cat", "dog", "bird"]
```
### One-Hot Encoding

Convert categorical features into binary vectors.

```ts
import { OneHotEncoder } from "bun-scikit";

const encoder = new OneHotEncoder();
const X = [["red"], ["blue"], ["green"], ["red"]];
const XEncoded = encoder.fitTransform(X);
// [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```
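One-hot encoding replaces each categorical value with a vector that is all zeros except for a 1 in the position of that value's category. A standalone sketch for a single column, assuming categories in first-seen order as the output above suggests (plain TypeScript, not bun-scikit's implementation):

```ts
// Standalone sketch (not library code): one-hot encode one
// categorical column, categories in first-seen order.
const column = ["red", "blue", "green", "red"];

const categories: string[] = [];
const index = new Map<string, number>();
for (const value of column) {
  if (!index.has(value)) {
    index.set(value, categories.length);
    categories.push(value);
  }
}

const encoded = column.map((value) => {
  const row = new Array<number>(categories.length).fill(0);
  row[index.get(value)!] = 1;
  return row;
});
// encoded: [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```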
## Feature Engineering

### Polynomial Features

Generate polynomial and interaction features.

```ts
import { PolynomialFeatures } from "bun-scikit";

const X = [[1, 2], [3, 4]];
const poly = new PolynomialFeatures({ degree: 2 });
const XPoly = poly.fitTransform(X);
// [[1, 1, 2, 1, 2, 4],   // [1, x1, x2, x1^2, x1*x2, x2^2]
//  [1, 3, 4, 9, 12, 16]]
```
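For the two-feature, degree-2 case, the expansion can be written out by hand. A standalone sketch hard-coded to that case (plain TypeScript, not bun-scikit's general implementation, which handles arbitrary degrees and feature counts):

```ts
// Standalone sketch (not library code): degree-2 expansion of two
// features, in the column order [1, x1, x2, x1^2, x1*x2, x2^2].
const X = [
  [1, 2],
  [3, 4],
];

const XPoly = X.map(([x1, x2]) => [1, x1, x2, x1 * x1, x1 * x2, x2 * x2]);
// XPoly: [[1, 1, 2, 1, 2, 4], [1, 3, 4, 9, 12, 16]]
```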
### Binarizer

Threshold features to binary values.

```ts
import { Binarizer } from "bun-scikit";

const binarizer = new Binarizer({ threshold: 2.5 });
const X = [[1, 2], [3, 4]];
const XBinary = binarizer.transform(X);
// [[0, 0], [1, 1]]
```
## Handling Missing Values

Replace missing values using various strategies.

```ts
import { SimpleImputer } from "bun-scikit";

const X = [
  [1, 2],
  [NaN, 3],
  [7, NaN],
  [4, 5],
];

const imputer = new SimpleImputer({ strategy: "mean" });
const XFilled = imputer.fitTransform(X);
```
### Mean Strategy

Replaces missing values with the mean of each column.

```ts
const imputer = new SimpleImputer({ strategy: "mean" });
```

### Median Strategy

Replaces missing values with the median of each column (robust to outliers).

```ts
const imputer = new SimpleImputer({ strategy: "median" });
```

### Most Frequent

Replaces missing values with the most frequent value in each column.

```ts
const imputer = new SimpleImputer({ strategy: "most_frequent" });
```

### Constant

Replaces missing values with a specified constant.

```ts
const imputer = new SimpleImputer({
  strategy: "constant",
  fillValue: 0,
});
```
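The key detail in mean (and median) imputation is that the statistic is computed over the non-missing entries only, then written into the gaps. A standalone sketch for the "mean" strategy (plain TypeScript, not bun-scikit's implementation):

```ts
// Standalone sketch (not library code): mean imputation, where each
// column mean is computed over the non-missing entries only.
const X = [
  [1, 2],
  [NaN, 3],
  [7, NaN],
  [4, 5],
];

const nFeatures = X[0].length;
const means = Array.from({ length: nFeatures }, (_, j) => {
  const present = X.map((row) => row[j]).filter((v) => !Number.isNaN(v));
  return present.reduce((sum, v) => sum + v, 0) / present.length;
});

const XFilled = X.map((row) =>
  row.map((v, j) => (Number.isNaN(v) ? means[j] : v))
);
// means: [4, 10/3]; the NaN at XFilled[1][0] becomes 4
```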
## Best Practices

### Fit on training data only

Always fit transformers on training data to prevent data leakage:

```ts
scaler.fit(XTrain);
const XTrainScaled = scaler.transform(XTrain);
const XTestScaled = scaler.transform(XTest); // Use same parameters
```

### Use pipelines

Combine preprocessing and models in pipelines to ensure consistency:

```ts
const pipeline = new Pipeline([
  ["imputer", new SimpleImputer()],
  ["scaler", new StandardScaler()],
  ["model", new LinearRegression()],
]);
```

### Check for finite values

bun-scikit validates that inputs are finite (no NaN or Infinity) unless you're using an imputer:

```ts
// This will throw an error if X contains NaN
scaler.fit(X);

// Use SimpleImputer first to handle missing values
```
## Complete Example

Here's a complete preprocessing workflow:

```ts
import {
  SimpleImputer,
  StandardScaler,
  PolynomialFeatures,
  Pipeline,
  LinearRegression,
  trainTestSplit,
} from "bun-scikit";

// Sample data with missing values
const X = [
  [1, 2],
  [NaN, 3],
  [3, NaN],
  [4, 5],
  [5, 6],
  [6, 7],
];
const y = [3, 5, 7, 9, 11, 13];

// Split data
const { XTrain, XTest, yTrain, yTest } = trainTestSplit(X, y, {
  testSize: 0.33,
  randomState: 42,
});

// Create preprocessing pipeline
const pipeline = new Pipeline([
  ["imputer", new SimpleImputer({ strategy: "mean" })],
  ["poly", new PolynomialFeatures({ degree: 2 })],
  ["scaler", new StandardScaler()],
  ["regressor", new LinearRegression()],
]);

// Fit and predict
pipeline.fit(XTrain, yTrain);
const predictions = pipeline.predict(XTest);
const score = pipeline.score(XTest, yTest);
console.log("R² Score:", score);
```