The datatable.models module contains built-in machine learning models and dataset preparation utilities. All models operate directly on datatable Frame objects, with no conversion to pandas or numpy required.
from datatable.models import Ftrl, LinearModel, aggregate, kfold, kfold_random

Ftrl

Ftrl implements the Follow the Regularized Leader (FTRL-Proximal) online learning algorithm. It supports binomial logistic regression, multinomial classification, and regression for continuous targets. Training is fully parallel using the Hogwild approach. Features are hashed with a 64-bit function: integers and booleans via identity, floats via mantissa trimming, strings via Murmur2, and date/time types via their internal integer representation.
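
The hashing trick above can be sketched in plain Python. This is illustrative only: Python's built-in `hash` stands in for datatable's actual 64-bit identity/Murmur2 hashes, and `hashed_bin` is a hypothetical helper, not part of the library.

```python
def hashed_bin(column: str, value, nbins: int = 1_000_000) -> int:
    """Map a (column, value) pair into one of `nbins` weight slots."""
    # Python's hash() stands in for the real per-type hash functions.
    return hash((column, str(value))) % nbins

row = {"age": 42, "city": "Berlin", "subscribed": True}
bins = {col: hashed_bin(col, val) for col, val in row.items()}
# Every feature lands in a slot in [0, nbins). Distinct features may
# collide in the same slot, which is why a larger nbins reduces collisions.
```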

Constructor

from datatable.models import Ftrl

model = Ftrl(
    alpha=0.005,
    beta=1.0,
    lambda1=0.0,
    lambda2=0.0,
    nbins=1_000_000,
    nepochs=1,
    interactions=None,
    model_type="auto",
    double_precision=False,
)
alpha
float
default:"0.005"
Learning rate α in the per-coordinate FTRL-Proximal algorithm. Controls the step size.
beta
float
default:"1.0"
Smoothing parameter β in the per-coordinate FTRL-Proximal algorithm.
lambda1
float
default:"0.0"
L1 regularization parameter λ₁. Encourages sparsity in the model weights.
lambda2
float
default:"0.0"
L2 regularization parameter λ₂. Penalizes large weights.
nbins
int
default:"1000000"
Number of bins used by the hashing trick. Larger values reduce hash collisions.
nepochs
float
default:"1"
Number of training epochs. Fractional values are supported.
interactions
List[List[str]] | None
default:"None"
Explicit feature interaction pairs. Each inner list specifies column names whose hash values are combined.
model_type
"auto" | "regression" | "binomial" | "multinomial"
default:"auto"
Determines the type of model to build. "auto" infers the type from the target column.
double_precision
bool
default:"False"
Use float64 arithmetic internally instead of float32.

Methods

fit(X, y)

Train the model on feature frame X and target frame y. Can be called multiple times to continue training.
model.fit(X_train, y_train)

predict(X)

Return predictions for frame X as a new Frame. Output type depends on model_type.
predictions = model.predict(X_test)

reset()

Reset model weights to their initial state without changing hyperparameters.
model.reset()

Key properties

Property | Type | Description
model_type | str | The model type to build ("auto", "binomial", etc.).
model_type_trained | str | The model type that was actually trained.
feature_importances | Frame | Feature importances computed during training.
labels | Frame | Classification labels (multinomial/binomial).
params | namedtuple | All hyperparameters as a named tuple.
model | Frame | The model's z and n coefficient columns.
colnames | List[str] | Column names of the training frame.
mantissa_nbits | int | Mantissa bits used when hashing float features.

Example

import datatable as dt
from datatable.models import Ftrl

train = dt.fread("train.csv")
test  = dt.fread("test.csv")

X_train = train[:, :-1]
y_train = train[:, -1]

model = Ftrl(alpha=0.01, nbins=500_000, nepochs=3)
model.fit(X_train, y_train)

preds = model.predict(test)

LinearModel

LinearModel implements a linear model with stochastic gradient descent (SGD) learning. It supports linear regression, binomial classification, and multinomial classification. Both fit and predict are fully parallel.

Constructor

from datatable.models import LinearModel

model = LinearModel(
    eta0=0.005,
    eta_decay=0.0,
    eta_drop_rate=100.0,
    eta_schedule="constant",
    lambda1=0.0,
    lambda2=0.0,
    nepochs=1,
    model_type="auto",
    double_precision=False,
    negative_class=False,
    seed=0,
)
eta0
float
default:"0.005"
Initial learning rate.
eta_decay
float
default:"0.0"
Decay coefficient for "time-based" and "step-based" learning rate schedules.
eta_drop_rate
float
default:"100.0"
Drop rate for the "step-based" learning rate schedule.
eta_schedule
"constant" | "time-based" | "step-based" | "exponential"
default:"constant"
Learning rate schedule. Controls how eta0 changes across epochs.
lambda1
float
default:"0.0"
L1 regularization parameter.
lambda2
float
default:"0.0"
L2 regularization parameter.
nepochs
float
default:"1"
Number of training epochs.
model_type
"auto" | "regression" | "binomial" | "multinomial"
default:"\"auto\""
Type of model to build.
double_precision
bool
default:"False"
Use float64 arithmetic instead of float32.
negative_class
bool
default:"False"
If True, an explicit “negative” class is added for multinomial classification.
seed
int
default:"0"
Seed for quasi-random row shuffling during SGD.
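
The four eta_schedule options can be illustrated with the textbook formulas for these schedules. This is a sketch of how eta0, eta_decay, and eta_drop_rate interact per epoch; the exact expressions LinearModel uses internally may differ.

```python
import math

def eta_at(epoch, eta0=0.005, eta_decay=0.0, eta_drop_rate=100.0,
           schedule="constant"):
    """Illustrative per-epoch learning rate under each schedule."""
    if schedule == "constant":
        return eta0
    if schedule == "time-based":
        # Decays hyperbolically as epochs accumulate.
        return eta0 / (1 + eta_decay * epoch)
    if schedule == "step-based":
        # Multiplies by eta_decay once every eta_drop_rate epochs.
        return eta0 * eta_decay ** math.floor(epoch / eta_drop_rate)
    if schedule == "exponential":
        # Decays exponentially with the epoch number.
        return eta0 / math.exp(eta_decay * epoch)
    raise ValueError(f"unknown schedule: {schedule!r}")
```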

Methods

fit(X, y)

Train the model on feature frame X and target frame y.
model.fit(X_train, y_train)

predict(X)

Return predictions for frame X.
preds = model.predict(X_test)

is_fitted()

Return True if the model has been trained, False otherwise.
if model.is_fitted():
    preds = model.predict(X)

reset()

Clear trained weights and return the model to its initial untrained state.
model.reset()

Example

import datatable as dt
from datatable.models import LinearModel

DT = dt.fread("dataset.csv")
X = DT[:, :-1]
y = DT[:, -1]

model = LinearModel(eta0=0.01, nepochs=5, eta_schedule="time-based")
model.fit(X, y)

preds = model.predict(X)

aggregate(frame, ...)

Aggregate a Frame into clusters. Each cluster consists of a set of member rows and is represented by one exemplar row. Useful for summarizing large datasets before visualization or modeling.
from datatable.models import aggregate

exemplars, members = aggregate(
    frame,
    min_rows=500,
    n_bins=500,
    nx_bins=50,
    ny_bins=50,
    nd_max_bins=500,
    max_dimensions=50,
    seed=0,
    double_precision=False,
    fixed_radius=None,
)
Parameters
frame
Frame
Input frame with numeric or string columns. Non-numeric columns are ignored in the ND aggregation algorithm.
min_rows
int
default:"500"
Minimum number of rows required for aggregation to run. Frames smaller than this threshold have all rows treated as exemplars.
n_bins
int
default:"500"
Number of bins for 1D aggregation.
nx_bins
int
default:"50"
Number of bins along the x-axis for 2D aggregation.
ny_bins
int
default:"50"
Number of bins along the y-axis for 2D aggregation.
nd_max_bins
int
default:"500"
Maximum number of exemplars produced by the ND algorithm. The exact count may vary across runs due to parallelization.
max_dimensions
int
default:"50"
Column count at which the projection method is used for ND aggregation.
seed
int
default:"0"
Seed for the projection method’s random number generator.
double_precision
bool
default:"False"
Use float64 arithmetic internally instead of float32.
fixed_radius
float | None
default:"None"
Fixed bubble radius for the ND algorithm. When set, nd_max_bins has no effect. Use with caution on large data — the number of exemplars can equal the number of rows.
Returns a tuple of two frames:
  • Exemplars frame — shape (nexemplars, ncols + 1). Contains the original columns plus a members_count column (int32) indicating how many rows each exemplar represents.
  • Members frame — shape (nrows, 1). The exemplar_id column (int32) maps each input row to its exemplar’s row index.
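
The invariant linking the two returned frames can be checked in plain Python. The values below are hypothetical stand-ins for the members_count and exemplar_id columns, not output from a real aggregation.

```python
# Hypothetical values standing in for the two returned frames:
members_count = [3, 1, 2]            # exemplars frame, members_count column
exemplar_id   = [0, 0, 1, 2, 2, 0]   # members frame, exemplar_id column

# Summing members_count over the exemplars recovers the input row count...
assert sum(members_count) == len(exemplar_id)
# ...and every exemplar_id is a valid row index into the exemplars frame.
assert all(0 <= i < len(members_count) for i in exemplar_id)
```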

Example

import datatable as dt
from datatable.models import aggregate

DT = dt.fread("large_dataset.csv")

exemplars, members = aggregate(DT, nd_max_bins=200, seed=42)
print(f"Reduced {DT.nrows} rows to {exemplars.nrows} exemplars")

kfold(nrows, nsplits)

Split nrows rows into nsplits sequential train/test folds. The i-th fold uses rows [i·nrows/nsplits, (i+1)·nrows/nsplits) as the test set and all remaining rows as training data.
from datatable.models import kfold

splits = kfold(nrows=1000, nsplits=5)

# DT is an existing Frame with 1000 rows; `features` and `target`
# are column selectors defined elsewhere.
for train_rows, test_rows in splits:
    X_train = DT[train_rows, features]
    X_test  = DT[test_rows,  features]
    y_train = DT[train_rows, target]
    y_test  = DT[test_rows,  target]
Parameters
nrows
int
Total number of rows to split. Must match the row count of the frame you apply the selectors to.
nsplits
int
Number of folds. Must be at least 2 and no larger than nrows.
Returns List[Tuple] — a list of nsplits tuples (train_rows, test_rows), where each component is a row selector (a Python range or a single-column Frame).
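
The sequential fold boundaries described above can be sketched in plain Python. This is illustrative only: the real kfold returns ranges and single-column Frames usable as row selectors, not tuples of ints.

```python
def fold_bounds(nrows: int, nsplits: int):
    """Half-open [start, stop) test-set bounds for each sequential fold."""
    return [(i * nrows // nsplits, (i + 1) * nrows // nsplits)
            for i in range(nsplits)]

fold_bounds(10, 3)   # [(0, 3), (3, 6), (6, 10)]
```

Note that integer division makes the folds as even as possible; when nrows is not divisible by nsplits, some folds are one row larger than others.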

kfold_random(nrows, nsplits, seed=None)

Like kfold, but assigns rows to folds randomly so each row has an equal probability of ending up in any fold. Row indices within each fold are sorted.
from datatable.models import kfold_random

splits = kfold_random(nrows=1000, nsplits=5, seed=42)

for train_rows, test_rows in splits:
    X_train = DT[train_rows, :]
    X_test  = DT[test_rows,  :]
Parameters
nrows
int
Total number of rows to split.
nsplits
int
Number of folds. Must be at least 2 and no larger than nrows.
seed
int | None
default:"None"
Random seed. Providing the same seed guarantees reproducible splits across runs.
Returns List[Tuple] — a list of nsplits tuples (train_rows, test_rows).
Use kfold_random when your data has an ordering that could bias sequential splits (e.g. time-sorted data); prefer kfold when you deliberately want sequential, order-preserving folds, as in temporal cross-validation.
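
The random assignment can be sketched in plain Python. This mirrors the documented behaviour (uniform fold assignment, sorted indices within each fold, reproducible under a fixed seed) but is not datatable's actual implementation.

```python
import random

def random_folds(nrows: int, nsplits: int, seed=None):
    """Assign each row to one of nsplits folds uniformly at random."""
    rng = random.Random(seed)
    folds = [[] for _ in range(nsplits)]
    for row in range(nrows):
        folds[rng.randrange(nsplits)].append(row)
    return folds   # row indices within each fold come out sorted
```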
