The datatable.models module contains built-in machine learning models and dataset preparation utilities. All models operate directly on datatable Frame objects, with no conversion to pandas or numpy required.
from datatable.models import Ftrl, LinearModel, aggregate, kfold, kfold_random

Ftrl

Ftrl implements the Follow the Regularized Leader (FTRL-Proximal) online learning algorithm. It supports binomial logistic regression, multinomial classification, and regression for continuous targets. Training is fully parallel using the Hogwild approach. Features are hashed with a 64-bit function: integers and booleans via identity, floats via mantissa trimming, strings via Murmur2, and date/time types via their internal integer representation.
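
The hashing trick above can be sketched in plain Python. This is illustrative only: Python's built-in `hash` stands in for datatable's actual 64-bit identity/Murmur2 hashes, and `hashed_bin` is a hypothetical helper, not part of the library.

```python
def hashed_bin(column: str, value, nbins: int = 1_000_000) -> int:
    """Map a (column, value) pair into one of `nbins` weight slots."""
    # Python's hash() stands in for the real per-type hash functions.
    return hash((column, str(value))) % nbins

row = {"age": 42, "city": "Berlin", "subscribed": True}
bins = {col: hashed_bin(col, val) for col, val in row.items()}
# Every feature lands in a slot in [0, nbins). Distinct features may
# collide in the same slot, which is why a larger nbins reduces collisions.
```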

Constructor

from datatable.models import Ftrl

model = Ftrl(
    alpha=0.005,
    beta=1.0,
    lambda1=0.0,
    lambda2=0.0,
    nbins=1_000_000,
    nepochs=1,
    interactions=None,
    model_type="auto",
    double_precision=False,
)
alpha
float
default:"0.005"
Learning rate α in the per-coordinate FTRL-Proximal algorithm. Controls the step size.
beta
float
default:"1.0"
Smoothing parameter β in the per-coordinate FTRL-Proximal algorithm.
lambda1
float
default:"0.0"
L1 regularization parameter λ₁. Encourages sparsity in the model weights.
lambda2
float
default:"0.0"
L2 regularization parameter λ₂. Penalizes large weights.
nbins
int
default:"1000000"
Number of bins used by the hashing trick. Larger values reduce hash collisions.
nepochs
float
default:"1"
Number of training epochs. Fractional values are supported.
interactions
List[List[str]] | None
default:"None"
Explicit feature interaction pairs. Each inner list specifies column names whose hash values are combined.
model_type
"auto" | "regression" | "binomial" | "multinomial"
default:"auto"
Determines the type of model to build. "auto" infers the type from the target column.
double_precision
bool
default:"False"
Use float64 arithmetic internally instead of float32.

Methods

fit(X, y)

Train the model on feature frame X and target frame y. Can be called multiple times to continue training.
model.fit(X_train, y_train)

predict(X)

Return predictions for frame X as a new Frame. Output type depends on model_type.
predictions = model.predict(X_test)

reset()

Reset model weights to their initial state without changing hyperparameters.
model.reset()

Key properties

Property | Type | Description
model_type | str | The model type to build ("auto", "binomial", etc.).
model_type_trained | str | The model type that was actually trained.
feature_importances | Frame | Feature importances computed during training.
labels | Frame | Classification labels (multinomial/binomial).
params | namedtuple | All hyperparameters as a named tuple.
model | Frame | The model's z and n coefficient columns.
colnames | List[str] | Column names of the training frame.
mantissa_nbits | int | Mantissa bits used when hashing float features.

Example

import datatable as dt
from datatable.models import Ftrl

train = dt.fread("train.csv")
test  = dt.fread("test.csv")

X_train = train[:, :-1]
y_train = train[:, -1]

model = Ftrl(alpha=0.01, nbins=500_000, nepochs=3)
model.fit(X_train, y_train)

preds = model.predict(test)

LinearModel

LinearModel implements a linear model with stochastic gradient descent (SGD) learning. It supports linear regression, binomial classification, and multinomial classification. Both fit and predict are fully parallel.

Constructor

from datatable.models import LinearModel

model = LinearModel(
    eta0=0.005,
    eta_decay=0.0,
    eta_drop_rate=100.0,
    eta_schedule="constant",
    lambda1=0.0,
    lambda2=0.0,
    nepochs=1,
    model_type="auto",
    double_precision=False,
    negative_class=False,
    seed=0,
)
eta0
float
default:"0.005"
Initial learning rate.
eta_decay
float
default:"0.0"
Decay coefficient for "time-based" and "step-based" learning rate schedules.
eta_drop_rate
float
default:"100.0"
Drop rate for the "step-based" learning rate schedule.
eta_schedule
"constant" | "time-based" | "step-based" | "exponential"
default:"constant"
Learning rate schedule. Controls how eta0 changes across epochs.
lambda1
float
default:"0.0"
L1 regularization parameter.
lambda2
float
default:"0.0"
L2 regularization parameter.
nepochs
float
default:"1"
Number of training epochs.
model_type
"auto" | "regression" | "binomial" | "multinomial"
default:"\"auto\""
Type of model to build.
double_precision
bool
default:"False"
Use float64 arithmetic instead of float32.
negative_class
bool
default:"False"
If True, an explicit “negative” class is added for multinomial classification.
seed
int
default:"0"
Seed for quasi-random row shuffling during SGD.
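
The four eta_schedule options can be illustrated with the textbook formulas for these schedules. This is a sketch of how eta0, eta_decay, and eta_drop_rate interact per epoch; the exact expressions LinearModel uses internally may differ.

```python
import math

def eta_at(epoch, eta0=0.005, eta_decay=0.0, eta_drop_rate=100.0,
           schedule="constant"):
    """Illustrative per-epoch learning rate under each schedule."""
    if schedule == "constant":
        return eta0
    if schedule == "time-based":
        # Decays hyperbolically as epochs accumulate.
        return eta0 / (1 + eta_decay * epoch)
    if schedule == "step-based":
        # Multiplies by eta_decay once every eta_drop_rate epochs.
        return eta0 * eta_decay ** math.floor(epoch / eta_drop_rate)
    if schedule == "exponential":
        # Decays exponentially with the epoch number.
        return eta0 / math.exp(eta_decay * epoch)
    raise ValueError(f"unknown schedule: {schedule!r}")
```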

Methods

fit(X, y)

Train the model on feature frame X and target frame y.
model.fit(X_train, y_train)

predict(X)

Return predictions for frame X.
preds = model.predict(X_test)

is_fitted()

Return True if the model has been trained, False otherwise.
if model.is_fitted():
    preds = model.predict(X)

reset()

Clear trained weights and return the model to its initial untrained state.
model.reset()

Example

import datatable as dt
from datatable.models import LinearModel

DT = dt.fread("dataset.csv")
X = DT[:, :-1]
y = DT[:, -1]

model = LinearModel(eta0=0.01, nepochs=5, eta_schedule="time-based")
model.fit(X, y)

preds = model.predict(X)

aggregate(frame, ...)

Aggregate a Frame into clusters. Each cluster consists of a set of member rows and is represented by one exemplar row. Useful for summarizing large datasets before visualization or modeling.
from datatable.models import aggregate

exemplars, members = aggregate(
    frame,
    min_rows=500,
    n_bins=500,
    nx_bins=50,
    ny_bins=50,
    nd_max_bins=500,
    max_dimensions=50,
    seed=0,
    double_precision=False,
    fixed_radius=None,
)
Parameters
frame
Frame
Input frame with numeric or string columns. Non-numeric columns are ignored in the ND aggregation algorithm.
min_rows
int
default:"500"
Minimum number of rows required for aggregation to run. Frames smaller than this threshold have all rows treated as exemplars.
n_bins
int
default:"500"
Number of bins for 1D aggregation.
nx_bins
int
default:"50"
Number of bins along the x-axis for 2D aggregation.
ny_bins
int
default:"50"
Number of bins along the y-axis for 2D aggregation.
nd_max_bins
int
default:"500"
Maximum number of exemplars produced by the ND algorithm. The exact count may vary across runs due to parallelization.
max_dimensions
int
default:"50"
Column count at which the projection method is used for ND aggregation.
seed
int
default:"0"
Seed for the projection method’s random number generator.
double_precision
bool
default:"False"
Use float64 arithmetic internally instead of float32.
fixed_radius
float | None
default:"None"
Fixed bubble radius for the ND algorithm. When set, nd_max_bins has no effect. Use with caution on large data — the number of exemplars can equal the number of rows.
Returns a tuple of two frames:
  • Exemplars frame — shape (nexemplars, ncols + 1). Contains the original columns plus a members_count column (int32) indicating how many rows each exemplar represents.
  • Members frame — shape (nrows, 1). The exemplar_id column (int32) maps each input row to its exemplar’s row index.
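
The invariant linking the two returned frames can be checked in plain Python. The values below are hypothetical stand-ins for the members_count and exemplar_id columns, not output from a real aggregation.

```python
# Hypothetical values standing in for the two returned frames:
members_count = [3, 1, 2]            # exemplars frame, members_count column
exemplar_id   = [0, 0, 1, 2, 2, 0]   # members frame, exemplar_id column

# Summing members_count over the exemplars recovers the input row count...
assert sum(members_count) == len(exemplar_id)
# ...and every exemplar_id is a valid row index into the exemplars frame.
assert all(0 <= i < len(members_count) for i in exemplar_id)
```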

Example

import datatable as dt
from datatable.models import aggregate

DT = dt.fread("large_dataset.csv")

exemplars, members = aggregate(DT, nd_max_bins=200, seed=42)
print(f"Reduced {DT.nrows} rows to {exemplars.nrows} exemplars")

kfold(nrows, nsplits)

Split nrows rows into nsplits sequential train/test folds. The i-th fold uses rows [i·nrows/nsplits, (i+1)·nrows/nsplits) as the test set and all remaining rows as training data.
from datatable.models import kfold

splits = kfold(nrows=1000, nsplits=5)

# DT is an existing Frame with 1000 rows; `features` and `target`
# are column selectors defined elsewhere.
for train_rows, test_rows in splits:
    X_train = DT[train_rows, features]
    X_test  = DT[test_rows,  features]
    y_train = DT[train_rows, target]
    y_test  = DT[test_rows,  target]
Parameters
nrows
int
Total number of rows to split. Must match the row count of the frame you apply the selectors to.
nsplits
int
Number of folds. Must be at least 2 and no larger than nrows.
Returns List[Tuple] — a list of nsplits tuples (train_rows, test_rows), where each component is a row selector (a Python range or a single-column Frame).
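
The sequential fold boundaries described above can be sketched in plain Python. This is illustrative only: the real kfold returns ranges and single-column Frames usable as row selectors, not tuples of ints.

```python
def fold_bounds(nrows: int, nsplits: int):
    """Half-open [start, stop) test-set bounds for each sequential fold."""
    return [(i * nrows // nsplits, (i + 1) * nrows // nsplits)
            for i in range(nsplits)]

fold_bounds(10, 3)   # [(0, 3), (3, 6), (6, 10)]
```

Note that integer division makes the folds as even as possible; when nrows is not divisible by nsplits, some folds are one row larger than others.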

kfold_random(nrows, nsplits, seed=None)

Like kfold, but assigns rows to folds randomly so each row has an equal probability of ending up in any fold. Row indices within each fold are sorted.
from datatable.models import kfold_random

splits = kfold_random(nrows=1000, nsplits=5, seed=42)

for train_rows, test_rows in splits:
    X_train = DT[train_rows, :]
    X_test  = DT[test_rows,  :]
Parameters
nrows
int
Total number of rows to split.
nsplits
int
Number of folds. Must be at least 2 and no larger than nrows.
seed
int | None
default:"None"
Random seed. Providing the same seed guarantees reproducible splits across runs.
Returns List[Tuple] — a list of nsplits tuples (train_rows, test_rows).
Use kfold_random when your data has an ordering that could bias sequential splits (e.g. time-sorted data); prefer kfold when you deliberately want sequential, order-preserving folds, as in temporal cross-validation.
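
The random assignment can be sketched in plain Python. This mirrors the documented behaviour (uniform fold assignment, sorted indices within each fold, reproducible under a fixed seed) but is not datatable's actual implementation.

```python
import random

def random_folds(nrows: int, nsplits: int, seed=None):
    """Assign each row to one of nsplits folds uniformly at random."""
    rng = random.Random(seed)
    folds = [[] for _ in range(nsplits)]
    for row in range(nrows):
        folds[rng.randrange(nsplits)].append(row)
    return folds   # row indices within each fold come out sorted
```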
