String Module

The datatable.str module provides functions that operate on string (str32 / str64) columns. dt.str.len and dt.str.slice accept FExpr arguments and return FExpr results, so they compose naturally with f selectors and other datatable operations. dt.str.split_into_nhot is the exception: it takes a Frame and returns a Frame.

import datatable as dt
from datatable import f

DT = dt.Frame(name=["Alice", "Bob", "Charlie"])

# Compute string lengths
DT[:, dt.str.len(f.name)]

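dt.str.len and dt.str.slice operate lazily. Each returns an FExpr that is evaluated only when it is used inside a DT[rows, cols] expression. dt.str.split_into_nhot is evaluated immediately, because it works on a Frame rather than on an expression.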

Functions

`dt.str.len(column)`

Compute the length (number of characters) of each string value in column.

Parameters

column

FExpr[str]

A string column expression.

Returns FExpr[int64] — an integer column containing the character count for each row. NA strings produce NA output.
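The NA rule can be pictured with a small plain-Python model. This is only a sketch of the per-row behaviour described above, not datatable code: `len_model` is a hypothetical helper, and `None` stands in for an NA string.

```python
def len_model(values):
    # Hypothetical helper: models the documented per-row rule in plain Python.
    # None stands in for an NA string and stays None in the result.
    return [None if v is None else len(v) for v in values]


lengths = len_model(["cat", "elephant", None])
print(lengths)  # [3, 8, None]
```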

`dt.str.slice(col, start, stop, step=1)`

Apply the slice [start:stop:step] to each string value in col. You can also write f.col[start:stop:step] as a shorthand.

Parameters

col

FExpr[str]

The string column to slice.

start

int | None

Start index of the slice (inclusive). Negative indices count from the end. None means the beginning of the string.

stop

int | None

Stop index of the slice (exclusive). Negative indices count from the end. None means the end of the string.

step

int

default: 1

Step size.

Returns FExpr[str] — a string column with the sliced values.
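Since the slice follows Python's [start:stop:step] convention, its effect on each value can be previewed without datatable. The sketch below is a plain-Python model, not the library's implementation: `slice_model` is a hypothetical helper, and passing missing values through as `None` is an assumption that the text above does not state.

```python
def slice_model(values, start=None, stop=None, step=1):
    # Hypothetical helper: applies [start:stop:step] to each string.
    # Assumption: a missing value (None) stays missing.
    return [None if v is None else v[start:stop:step] for v in values]


words = ["apples", "bananas", "cherries", None]

first_three = slice_model(words, None, 3)         # start=None: from the beginning
last_two = slice_model(words, -2, None)           # negative start counts from the end
every_other = slice_model(words, None, None, 2)   # step=2 keeps every second character

print(first_three)  # ['app', 'ban', 'che', None]
print(last_two)     # ['es', 'as', 'es', None]
print(every_other)  # ['ape', 'bnns', 'cere', None]
```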

`dt.str.split_into_nhot(frame, sep=",", sort=False)`

Split and n-hot encode a single-column string frame. Each value is split on sep, whitespace is trimmed, and the resulting labels become boolean columns in the output frame.

Parameters

frame

Frame

A single-column frame with str32 or str64 stype.

sep

str

default: ","

Single-character delimiter to split on.

sort

bool

default: False

If True, output columns are sorted alphabetically by label. Due to parallelization, column order is otherwise not guaranteed.

Returns Frame — one boolean column per unique label, one row per input row.

split_into_nhot operates on a full Frame (not a lazy FExpr). Pass the single-column frame directly to this function, not inside a DT[rows, cols] expression.
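To make the encoding concrete, here is a plain-Python model of the split, trim, and encode steps described above. It is a sketch under stated assumptions, not datatable's implementation: `nhot_model` is a hypothetical helper that returns a dict of boolean lists instead of a Frame, and its handling of empty pieces and missing values is assumed rather than documented.

```python
def nhot_model(values, sep=",", sort=False):
    # Hypothetical helper modelling the documented steps in plain Python.
    labels = []   # unique labels, in order of first appearance
    rows = []     # set of labels found in each input row
    for value in values:
        found = set()
        # Assumption: a missing value (None) yields no labels for that row.
        pieces = [] if value is None else value.split(sep)
        for piece in pieces:
            label = piece.strip()   # whitespace around each piece is trimmed
            if not label:
                continue            # assumption: empty pieces are ignored
            found.add(label)
            if label not in labels:
                labels.append(label)
        rows.append(found)
    if sort:
        labels = sorted(labels)     # alphabetical column order
    # One boolean column per unique label, one entry per input row.
    return {label: [label in found for found in rows] for label in labels}


tags = ["python,data", "data, ml", "python,ml,deep-learning"]
encoded_model = nhot_model(tags, sep=",", sort=True)

print(list(encoded_model))        # ['data', 'deep-learning', 'ml', 'python']
print(encoded_model["python"])    # [True, False, True]
```

The model keeps first-appearance order when sort is False. The real function makes no such promise, since its column order is not guaranteed without sort=True.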

Examples

Compute string lengths and filter

import datatable as dt
from datatable import f

DT = dt.Frame(word=["cat", "elephant", "ox", None, "rhinoceros"])

# Add a length column, then keep only words longer than 3 characters
result = DT[:, {"word": f.word, "length": dt.str.len(f.word)}]
result = result[f.length > 3, :]

Slice the first five characters of each string

import datatable as dt
from datatable import f

DT = dt.Frame(A=["apples", "bananas", "cherries", "dates"])

# Using dt.str.slice
DT[:, dt.str.slice(f.A, None, 5)]

# Equivalent shorthand via f-expression subscript
DT[:, f.A[:5]]

N-hot encode a multi-label column

import datatable as dt

DT = dt.Frame(tags=["python,data", "data,ml", "python,ml,deep-learning"])

# Produce one boolean column per unique tag
encoded = dt.str.split_into_nhot(DT[:, "tags"], sep=",", sort=True)
