Pipelines are a powerful way to chain together preprocessing steps and models into a single, cohesive workflow. They help prevent data leakage, ensure consistency, and make your code cleaner.

Why Use Pipelines?

1. Prevent data leakage

Ensures that preprocessing is fit only on training data, then applied to test data:
// Without pipeline - WRONG!
const scaler = new StandardScaler();
const XScaled = scaler.fitTransform(X);  // Fit on ALL rows, including future test rows!
const { XTrain, XTest } = trainTestSplit(XScaled, y);  // Too late: test data already leaked

// With pipeline - CORRECT!
const { XTrain, XTest, yTrain, yTest } = trainTestSplit(X, y);
pipeline.fit(XTrain, yTrain);  // Only fits on training data

2. Simplify your workflow

One object handles everything from raw data to predictions:
// Fit once, predict many times
pipeline.fit(XTrain, yTrain);
const predictions = pipeline.predict(XTest);

3. Enable hyperparameter tuning

Use pipelines with GridSearchCV to tune preprocessing and model parameters together:
const grid = new GridSearchCV(pipeline, params);
grid.fit(X, y);

Creating a Pipeline

A pipeline is a sequence of named steps, where each step (except the last) must be a transformer, and the last step can be a transformer or a predictor.

Basic Example

import { Pipeline, StandardScaler, LinearRegression } from "bun-scikit";

const pipeline = new Pipeline([
  ["scaler", new StandardScaler()],
  ["regressor", new LinearRegression()],
]);

// Use it like any other model
pipeline.fit(XTrain, yTrain);
const predictions = pipeline.predict(XTest);
const score = pipeline.score(XTest, yTest);

Step Naming Convention

Each step is a tuple of [name, transformer]:
const pipeline = new Pipeline([
  ["step1", transformer1],  // String name, transformer object
  ["step2", transformer2],
  ["final", model],
]);
Step names must be unique and non-empty strings. They’re used to access steps and set parameters.
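The uniqueness and non-emptiness constraints can be expressed as a small validation check. This is an illustrative sketch of the rule, not bun-scikit's actual code:

```typescript
// Illustrative: enforce the documented constraints on pipeline step names.
function validateStepNames(steps: [string, unknown][]): void {
  const seen = new Set<string>();
  for (const [name] of steps) {
    if (!name) {
      throw new Error("Step names must be non-empty strings");
    }
    if (seen.has(name)) {
      throw new Error(`Duplicate step name: ${name}`);
    }
    seen.add(name);
  }
}

validateStepNames([["scaler", {}], ["model", {}]]);   // OK
// validateStepNames([["scaler", {}], ["scaler", {}]]) would throw
```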

Pipeline Workflow

During Fitting

The pipeline calls fitTransform() on each intermediate step, passing the transformed data on to the next step; the final step is only fit:
// What happens during pipeline.fit(X, y):

// Step 1: Fit and transform
step1.fit(X, y);
X1 = step1.transform(X);

// Step 2: Fit and transform
step2.fit(X1, y);
X2 = step2.transform(X1);

// Final step: Just fit (if it's a predictor)
finalStep.fit(X2, y);

During Prediction

The pipeline calls transform() on each intermediate step, then calls predict() on the final step:
// What happens during pipeline.predict(X):

// Step 1: Transform only
X1 = step1.transform(X);

// Step 2: Transform only
X2 = step2.transform(X1);

// Final step: Predict
predictions = finalStep.predict(X2);
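The two walkthroughs above can be condensed into a self-contained sketch. This simplified `MiniPipeline` is illustrative only, not bun-scikit's internals, but it implements exactly that flow: fit-and-transform through the intermediate steps, then fit or predict on the final step.

```typescript
// Illustrative sketch of the fit/predict chaining described above.
interface Transformer {
  fit(X: number[][], y?: number[]): void;
  transform(X: number[][]): number[][];
}

interface Predictor {
  fit(X: number[][], y: number[]): void;
  predict(X: number[][]): number[];
}

class MiniPipeline {
  constructor(
    private transformers: Transformer[],
    private predictor: Predictor,
  ) {}

  fit(X: number[][], y: number[]): void {
    let current = X;
    for (const t of this.transformers) {
      t.fit(current, y);               // fit on the (already transformed) data
      current = t.transform(current);  // pass transformed output to the next step
    }
    this.predictor.fit(current, y);    // final step: fit only
  }

  predict(X: number[][]): number[] {
    let current = X;
    for (const t of this.transformers) {
      current = t.transform(current);  // transform only, no refitting
    }
    return this.predictor.predict(current);
  }
}

// Toy steps to exercise the flow: center each feature, then predict the mean of y.
class MeanCenterer implements Transformer {
  private means: number[] = [];
  fit(X: number[][]): void {
    const n = X.length;
    this.means = X[0].map((_, j) => X.reduce((s, row) => s + row[j], 0) / n);
  }
  transform(X: number[][]): number[][] {
    return X.map((row) => row.map((v, j) => v - this.means[j]));
  }
}

class MeanPredictor implements Predictor {
  private mean = 0;
  fit(_X: number[][], y: number[]): void {
    this.mean = y.reduce((s, v) => s + v, 0) / y.length;
  }
  predict(X: number[][]): number[] {
    return X.map(() => this.mean);
  }
}

const pipe = new MiniPipeline([new MeanCenterer()], new MeanPredictor());
pipe.fit([[1], [2], [3]], [10, 20, 30]);
console.log(pipe.predict([[4]]));  // [20]
```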

Accessing Pipeline Steps

You can access individual steps in several ways:
const pipeline = new Pipeline([
  ["scaler", new StandardScaler()],
  ["model", new LinearRegression()],
]);

pipeline.fit(X, y);

// Access via namedSteps_
const scaler = pipeline.namedSteps_["scaler"];
console.log("Mean:", scaler.mean_);

// Access all steps
console.log("All steps:", pipeline.steps_);
// [["scaler", StandardScaler], ["model", LinearRegression]]

Classification Pipelines

Pipelines work seamlessly with classification models:
import {
  Pipeline,
  StandardScaler,
  LogisticRegression,
  accuracyScore,
} from "bun-scikit";

const X = [
  [0, 0], [0, 1], [1, 0], [1, 1],
  [2, 2], [2, 3], [3, 2], [3, 3],
];
const y = [0, 0, 0, 0, 1, 1, 1, 1];

const pipeline = new Pipeline([
  ["scaler", new StandardScaler()],
  ["classifier", new LogisticRegression({ maxIter: 1000 })],
]);

pipeline.fit(X, y);

// Classification methods work through the pipeline
const predictions = pipeline.predict(X);
const probabilities = pipeline.predictProba(X);
const accuracy = pipeline.score(X, y);

console.log("Accuracy:", accuracy);

Complex Pipelines

Multiple Preprocessing Steps

import {
  Pipeline,
  SimpleImputer,
  StandardScaler,
  PolynomialFeatures,
  RandomForestRegressor,
} from "bun-scikit";

const pipeline = new Pipeline([
  ["imputer", new SimpleImputer({ strategy: "mean" })],
  ["scaler", new StandardScaler()],
  ["poly", new PolynomialFeatures({ degree: 2 })],
  ["forest", new RandomForestRegressor({ nEstimators: 100 })],
]);

pipeline.fit(XTrain, yTrain);
const predictions = pipeline.predict(XTest);

Pipeline with Feature Selection

import {
  Pipeline,
  StandardScaler,
  SelectKBest,
  f_regression,
  LinearRegression,
} from "bun-scikit";

const pipeline = new Pipeline([
  ["scaler", new StandardScaler()],
  ["selector", new SelectKBest({ scoreFunc: f_regression, k: 5 })],
  ["regressor", new LinearRegression()],
]);

pipeline.fit(X, y);

Setting Pipeline Parameters

During Initialization

const pipeline = new Pipeline([
  ["scaler", new StandardScaler()],
  ["model", new LogisticRegression({ 
    learningRate: 0.1, 
    maxIter: 1000 
  })],
]);

After Creation with setParams

Use double-underscore notation to set nested parameters:
const pipeline = new Pipeline([
  ["scaler", new StandardScaler()],
  ["model", new LogisticRegression()],
]);

// Set parameters on specific steps
pipeline.setParams({
  "model__learningRate": 0.01,
  "model__maxIter": 5000,
});

pipeline.fit(X, y);
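Under the hood, a double-underscore key is typically resolved by splitting on the first `__`: the prefix names the step, the remainder names the parameter. A minimal sketch of that resolution (illustrative, not bun-scikit's actual implementation):

```typescript
// Illustrative: split "model__learningRate" into step name and parameter name.
function resolveParamKey(key: string): { step: string; param: string } {
  const idx = key.indexOf("__");
  if (idx < 0) throw new Error(`Not a nested parameter key: ${key}`);
  return { step: key.slice(0, idx), param: key.slice(idx + 2) };
}

// Apply nested params to a plain map of step name → parameter object.
function applyParams(
  steps: Record<string, Record<string, unknown>>,
  params: Record<string, unknown>,
): void {
  for (const [key, value] of Object.entries(params)) {
    const { step, param } = resolveParamKey(key);
    steps[step][param] = value;
  }
}

const steps: Record<string, Record<string, unknown>> = {
  scaler: {},
  model: { learningRate: 0.1, maxIter: 1000 },
};
applyParams(steps, { model__learningRate: 0.01, model__maxIter: 5000 });
console.log(steps.model);  // { learningRate: 0.01, maxIter: 5000 }
```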

Getting Pipeline Parameters

const params = pipeline.getParams(true);  // true for nested params

console.log(params);
// {
//   "scaler": StandardScaler { ... },
//   "model": LogisticRegression { ... },
//   "model__learningRate": 0.01,
//   "model__maxIter": 5000,
//   ...
// }

Pipeline with Sample Weights

Pass sample weights through the entire pipeline:
import { Pipeline, StandardScaler, LinearRegression } from "bun-scikit";

const pipeline = new Pipeline([
  ["scaler", new StandardScaler()],
  ["regressor", new LinearRegression()],
]);

const sampleWeight = [1, 2, 1, 3, 2];  // Give different weights to samples

pipeline.fit(X, y, sampleWeight);
Sample weights are automatically routed to all steps that support them.
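That routing can be pictured with an explicit capability flag. Here `supportsSampleWeight` is a hypothetical name invented for this sketch, not part of bun-scikit's API:

```typescript
// Illustrative sketch: pass sampleWeight only to steps that declare support for it.
interface Step {
  supportsSampleWeight?: boolean; // hypothetical flag, for illustration only
  fit(X: number[][], y: number[], sampleWeight?: number[]): void;
}

function fitAll(
  steps: Step[],
  X: number[][],
  y: number[],
  sampleWeight?: number[],
): void {
  for (const step of steps) {
    if (sampleWeight && step.supportsSampleWeight) {
      step.fit(X, y, sampleWeight); // weighted fit
    } else {
      step.fit(X, y);               // this step ignores weights
    }
  }
}

// Two toy steps that record what they received.
class Recorder implements Step {
  received: number[] | undefined;
  constructor(public supportsSampleWeight: boolean) {}
  fit(_X: number[][], _y: number[], sampleWeight?: number[]): void {
    this.received = sampleWeight;
  }
}

const weighted = new Recorder(true);
const unweighted = new Recorder(false);
fitAll([weighted, unweighted], [[1], [2]], [0, 1], [0.5, 2]);
console.log(weighted.received);    // [0.5, 2]
console.log(unweighted.received);  // undefined
```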

Pipeline Methods

Pipelines support all standard model methods:

fit(X, y, sampleWeight?) — Trains the entire pipeline: fits each step sequentially, transforming data between steps.
predict(X) — Transforms the input through every intermediate step, then predicts with the final step.
predictProba(X) — Returns class probabilities when the final step is a classifier that supports them.
score(X, y) — Transforms the input and scores the final step's predictions.
getParams(deep?) / setParams(params) — Read or update parameters, including nested step parameters via double-underscore keys.

Hyperparameter Tuning with Pipelines

Pipelines integrate seamlessly with GridSearchCV:
import {
  Pipeline,
  StandardScaler,
  LogisticRegression,
  GridSearchCV,
} from "bun-scikit";

const pipeline = new Pipeline([
  ["scaler", new StandardScaler()],
  ["classifier", new LogisticRegression()],
]);

const paramGrid = {
  "classifier__learningRate": [0.01, 0.1, 1.0],
  "classifier__maxIter": [100, 1000, 5000],
  "classifier__l2": [0, 0.1, 1.0],
};

const grid = new GridSearchCV(pipeline, paramGrid, { cv: 5 });
grid.fit(X, y);

console.log("Best parameters:", grid.bestParams_);
console.log("Best score:", grid.bestScore_);

// Use the best pipeline
const bestPipeline = grid.bestEstimator_;
const predictions = bestPipeline.predict(XTest);

Complete Example

Here’s a real-world pipeline for a classification task:
import {
  Pipeline,
  SimpleImputer,
  StandardScaler,
  PolynomialFeatures,
  LogisticRegression,
  trainTestSplit,
  accuracyScore,
  classificationReport,
} from "bun-scikit";

// Prepare data (with missing values)
const X = [
  [1, 2], [NaN, 3], [3, NaN], [4, 5],
  [5, 6], [6, 7], [7, 8], [8, 9],
];
const y = [0, 0, 0, 0, 1, 1, 1, 1];

// Split data
const { XTrain, XTest, yTrain, yTest } = trainTestSplit(X, y, {
  testSize: 0.25,
  randomState: 42,
});

// Create comprehensive pipeline
const pipeline = new Pipeline([
  ["imputer", new SimpleImputer({ strategy: "mean" })],
  ["poly", new PolynomialFeatures({ degree: 2 })],
  ["scaler", new StandardScaler()],
  ["classifier", new LogisticRegression({
    solver: "gd",
    learningRate: 0.1,
    maxIter: 1000,
    tolerance: 1e-5,
  })],
]);

// Train
pipeline.fit(XTrain, yTrain);

// Evaluate
const yPred = pipeline.predict(XTest);
const accuracy = accuracyScore(yTest, yPred);
const report = classificationReport(yTest, yPred);

console.log("Accuracy:", accuracy);
console.log("\nClassification Report:");
for (const [label, metrics] of Object.entries(report.perLabel)) {
  console.log(`Class ${label}:`);
  console.log(`  Precision: ${metrics.precision.toFixed(3)}`);
  console.log(`  Recall: ${metrics.recall.toFixed(3)}`);
  console.log(`  F1-Score: ${metrics.f1Score.toFixed(3)}`);
}

// Inspect pipeline steps
const imputer = pipeline.namedSteps_["imputer"];
console.log("\nImputer learned means:", imputer.statistics_);

const scaler = pipeline.namedSteps_["scaler"];
console.log("Scaler learned means:", scaler.mean_);

Best Practices

1. Always use pipelines with preprocessing

This prevents accidentally fitting on test data:
// WRONG: Fit scaler on all data
const XScaled = scaler.fitTransform(X);
trainTestSplit(XScaled, y);

// CORRECT: Pipeline fits only on training data
const split = trainTestSplit(X, y);
pipeline.fit(split.XTrain, split.yTrain);

2. Use meaningful step names

Names help with debugging and parameter tuning:
const pipeline = new Pipeline([
  ["imputation", new SimpleImputer()],
  ["scaling", new StandardScaler()],
  ["classification", new LogisticRegression()],
]);

3. Order matters

Put steps in the right order:
  1. Handle missing values (SimpleImputer)
  2. Generate features (PolynomialFeatures)
  3. Scale/normalize (StandardScaler)
  4. Select features (SelectKBest)
  5. Train model

4. Save pipelines for production

A fitted pipeline contains everything needed for inference:
pipeline.fit(XTrain, yTrain);

// Later, in production:
const predictions = pipeline.predict(newData);
// All preprocessing is applied automatically!

Advanced Topics

ColumnTransformer

Apply different transformations to different columns:
import { ColumnTransformer, StandardScaler, OneHotEncoder } from "bun-scikit";

// Apply different preprocessing to different columns
const preprocessor = new ColumnTransformer([
  ["num", new StandardScaler(), [0, 1, 2]],      // Numeric columns
  ["cat", new OneHotEncoder(), [3, 4]],          // Categorical columns
]);

const pipeline = new Pipeline([
  ["preprocessor", preprocessor],
  ["classifier", new LogisticRegression()],
]);

FeatureUnion

Combine multiple transformers in parallel:
import { FeatureUnion, PCA, PolynomialFeatures } from "bun-scikit";

// Combine features from multiple sources
const featureUnion = new FeatureUnion([
  ["pca", new PCA({ nComponents: 2 })],
  ["poly", new PolynomialFeatures({ degree: 2 })],
]);
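Conceptually, a FeatureUnion transforms the same input with every transformer and concatenates the results column-wise. A self-contained sketch of that behavior (illustrative, not the library's implementation):

```typescript
// Illustrative: transform X with each function and join the outputs row by row.
type TransformFn = (X: number[][]) => number[][];

function featureUnionTransform(
  transformers: TransformFn[],
  X: number[][],
): number[][] {
  const outputs = transformers.map((t) => t(X));
  // For each row index, concatenate that row from every transformer's output.
  return X.map((_, i) => outputs.flatMap((out) => out[i]));
}

// Two toy "transformers": squares of each feature, and row sums.
const squares: TransformFn = (X) => X.map((row) => row.map((v) => v * v));
const rowSum: TransformFn = (X) =>
  X.map((row) => [row.reduce((s, v) => s + v, 0)]);

console.log(featureUnionTransform([squares, rowSum], [[1, 2], [3, 4]]));
// [[1, 4, 3], [9, 16, 7]]
```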
