Skip to main content
If you know SQL, you already understand the core of datatable’s query model. The DT[i, j, by] syntax maps cleanly onto SQL clauses:
SQLdatatable
SELECTj (second argument)
WHEREi (first argument)
GROUP BYby() (third argument)
ORDER BYsort()
LEFT JOINjoin()
UNION ALLrbind()
Most examples below use the iris dataset.
from datatable import dt, f, g, by, join, sort, update, fread

iris = fread("https://raw.githubusercontent.com/h2oai/datatable/main/docs/_static/iris.csv")

SELECT

# Select specific columns (first 5 rows)
iris[:5, ["sepal_length", "sepal_width", "petal_length"]]

# Select all columns
iris[:5, :]

# Select a single column (no i needed)
iris["sepal_length"].head(5)

# Add a computed column
iris[:5, f[:].extend({"sepal_length_doubled": f.sepal_length * 2})]

# Add column in-place with update()
iris[:, update(sepal_length_doubled=f.sepal_length * 2)]

WHERE

# Filter rows
iris[f.species == "virginica", :].head(5)

# Multiple conditions (AND)
iris[(f.species == "setosa") & (f.sepal_length == 5), :]

# Multiple conditions (OR)
iris[(f.species == "setosa") | (f.sepal_length > 7), :]

# Filter out NULL / NA values
null_data[f.b != None, :]
# or using isna:
null_data[~dt.math.isna(f.b), :]

# Keep only NULL rows
null_data[dt.isna(f.b), :]
SQL uses = for equality; Python uses ==. SQL uses AND/OR; datatable uses &/| with each condition wrapped in parentheses.

ORDER BY

# Sort ascending (first 5)
iris[:5, :, sort("sepal_length")]

# Sort descending — two equivalent forms
iris[:5, :, sort("sepal_length", reverse=True)]
iris[:5, :, sort(-f.sepal_length)]

# Sort by multiple columns
iris[:, :, sort("species", -f.sepal_length)]

GROUP BY

Single aggregation per group

iris[:, dt.count(), by("species")]

Multiple aggregations per group

iris[:,
     {"mean_sepal_length": dt.mean(f.sepal_length), "N": dt.count()},
     by("species")]

Grouping on multiple columns

fruits_data[:,
            {"sum_num": dt.sum(f.Number)},
            by("Fruit", "Name")]

WHERE with GROUP BY

# Chain: filter first, then group
iris[f.sepal_width >= 3, :][:,
     {"avg_sepal_length": dt.mean(f.sepal_length)},
     by("species")]

HAVING with GROUP BY

# Compute aggregation, then filter with i (chaining)
fruits_data[:,
            {"sum_num": dt.sum(f.Number)},
            by("Fruit", "Name")][f.sum_num > 50, :]

Grouping on a condition

iris[:,
     {"avg_sepal_length": dt.mean(f.sepal_length)},
     by(f.sepal_width >= 3)]

LEFT OUTER JOIN

DT = dt.Frame(x=["b"] * 3 + ["a"] * 3 + ["c"] * 3,
              y=[1, 3, 6] * 3,
              v=range(1, 10))
X = dt.Frame({"x": ("c", "b"), "v": (8, 7), "foo": (4, 2)})

# The joined frame must be keyed; key column must be unique
X.key = "x"

DT[:, [f.x, f.y, f.v, g.foo], join(X)]
datatable supports left outer joins only at this time. The frame being joined must be keyed, the key column must contain unique values, and the join column must share the same name in both frames.

UNION ALL

# UNION ALL — stack all rows including duplicates
dt.rbind([DT[:, ("x", "v")], X[:, ("x", "v")]])
SQL’s UNION (without ALL) removes duplicate rows. datatable does not have a built-in deduplication function equivalent yet.

Window Functions

TOP n rows per group (ROW_NUMBER)

# Top 3 rows per species by sepal_length descending
iris[:3, :, by("species"), sort(-f.sepal_length)]

Filter above per-group average

# Add per-group flag, then filter
iris[:, update(temp=f.sepal_length > dt.mean(f.sepal_length)),
     by("species")]
iris[f.temp == 1, f[:-1]].head(5)

LEAD and LAG

# shift() with negative n = LEAD; positive n = LAG
source_data[:,
            f[:].extend({
                "lead1": dt.shift(f.dep_date, -1),
                "lead2": dt.shift(f.dep_date, -2),
                "lag1":  dt.shift(f.dep_date),
                "lag3":  dt.shift(f.dep_date, 3),
            }),
            sort("dep_date", "name")]

Total sum and proportions

proportions = dt.Frame({"t": [1, 2, 3]})

proportions[:,
            f[:].extend({
                "sum": dt.sum(f.t),
                "pct": f.t / dt.sum(f.t),
            })]

Concept Mapping Summary

SQL clausedatatable positionNotes
SELECT coljDT[:, "col"] or DT[:, f.col]
SELECT *j = ":"DT[:, :]
SELECT expr AS namej = {"name": expr}Dict in j
WHERE condiDT[f.col > 0, :]
WHERE x IS NULLiDT[dt.isna(f.x), :]
ORDER BY col ASCsort("col")Third argument
ORDER BY col DESCsort(-f.col)Negate with -
GROUP BY colby("col")Third argument
HAVING agg > nChain [agg_expr, by(...)][f.result > n, :]Filter after aggregation
LEFT JOINjoin(keyed_frame)Frame must be keyed
UNION ALLdt.rbind([DT1, DT2])Row-bind
LAG(col, n)dt.shift(f.col, n)Positive n
LEAD(col, n)dt.shift(f.col, -n)Negative n

Build docs developers (and LLMs) love