Grouping with by()

by() implements a split-apply-combine strategy. It splits the frame into groups defined by one or more columns or expressions, evaluates i and j within each group, and then reassembles the results.

from datatable import (dt, f, by, ifelse, update, sort,
                       count, min, max, mean, sum, rowsum)

The three main use cases are:

Aggregation

Reduce each group to a single row (count, sum, mean, …)

Transformation

Compute group-level values and append them back — frame shape is preserved

Filtration

Select a subset of rows within each group using i

Aggregation

The aggregate function goes in the j position.

Group by one column

df = dt.Frame("""Fruit   Date       Name  Number
                 Apples  10/6/2016  Bob     7
                 Apples  10/6/2016  Bob     8
                 Apples  10/6/2016  Mike    9
                 Apples  10/7/2016  Steve  10
                 Apples  10/7/2016  Bob     1
                 Oranges 10/7/2016  Bob     2
                 Oranges 10/6/2016  Tom    15
                 Oranges 10/6/2016  Mike   57
                 Oranges 10/6/2016  Bob    65
                 Oranges 10/7/2016  Tony    1
                 Grapes  10/7/2016  Bob     1
                 Grapes  10/7/2016  Tom    87
                 Grapes  10/7/2016  Bob    22
                 Grapes  10/7/2016  Bob    12
                 Grapes  10/7/2016  Tony   15""")

df[:, sum(f.Number), by('Fruit')]
# Apples: 35, Grapes: 137, Oranges: 140

Group by multiple columns

df[:, sum(f.Number), by('Fruit', 'Name')]

Multiple aggregate functions

df[:, {"min": min(f.Number), "max": max(f.Number)}, by('Fruit', 'Date')]

Aggregate over a column range

df[:, sum(f["col3":"col4"]), by('col1', 'col2')]

Different functions on different columns

df[:, [max(f.col3), min(f.col4)], by('col1', 'col2')]

Compute between aggregated values

# Range (max - min) per group
df[:, max(f.VALUE) - min(f.VALUE), by('GROUP')]

Exclude grouping columns from output

df[:, sum(f.Number), by('Fruit', add_columns=False)]

The grouping column(s) always appear first in the result.
Grouping columns are sorted in ascending order by default.
To sort descending, negate the f-expression: by(-f.Fruit).

Group by Expressions

By column position

df[:, sum(f.Number), by(f[0])]

By boolean expression

df[:, sum(f.Number), by(f.Fruit == "Apples")]

By computed expression (e.g. bin numeric values)

df[:, sum(f.Amount), by(ifelse(f.Number >= 5, "B", "A"))]

By floor-divided value

DT[:, :, by(dt.math.floor(f.A / 100))]

Transformation (preserving frame shape)

Use update() inside a by() to append group-level values back to the original frame. The frame’s row count is unchanged.

Append group min and max

df[:, update(min_col=min(f.c), max_col=max(f.c)), by('y')]
# min_col and max_col are appended; each row gets its group's value

Fill missing values with group mean

df[:, update(value=ifelse(f.value == None, mean(f.value), f.value)), by('name')]

Compute a group-level ratio

df[:, update(ratio=f.c / sum(f.c * f.d)), by('a', 'b')]

Window-style computation (SQL analogue)

df[:, update(ratio=dt.sum(f.liabilities) * 100 / dt.sum(f.assets)), by(f.exporter)]
# Each row in a group gets the same group-level ratio

Filtration (select rows within groups)

Pass an integer or slice to i to select specific rows within each group. Note that i is applied after grouping.

df = dt.Frame("""A   B
                 1  10
                 1  20
                 2  30
                 2  40
                 3  10""")

# First row per group
df[0, :, by('A')]

# Last row per group
df[-1, :, by('A')]

# Second row per group (groups with fewer rows are omitted)
df[1, :, by('A')]

f-expressions in the i section are not yet implemented for groupby. Only integers and slices are supported.

HAVING — filter groups by aggregate

# Keep only groups where count > 1
df[:, update(filter_col=count()), by('A')]
df[f.filter_col > 1, f[:-1]]

Filter rows by minimum value within group

df[:, update(filter_col=f.diff == min(f.diff)), by('item')]
df[f.filter_col == 1, :-1]

Keep groups that contain both 0 and 1 in a column

df[:, update(filter_col=sum(f.sale)), by('make')]
df[f.filter_col == 1, :-1]

Using `sort()` with `by()`

When sort() and by() are both present, sorting happens within each group.

# Get the latest entry per group (sort date ascending, take last row)
df[-1, :, by('id'), sort('date')]

# Get rows with the highest count per word
df[0, :, by('word'), sort(-f.count)]

# Get min value row per category with renamed columns
df[0, {"value_date": f.date, "value_min": f.value}, by("category"), sort('value')]

Nested and Conditional Aggregations

Row-sum of column groups, then aggregate

df[:,
   {"AB": sum(rowsum(f['A':'B'])),
    "CD": sum(rowsum(f['C':'D']))},
   by('cat')]

Conditional sums

df[:,
   {"sum_up":      sum(f.B == "up"),
    "sum_down":    sum(f.B == "down"),
    "over_200_up": sum((f.B == "up") & (f.C > 200))},
   by('A_id')]

Aggregate on conditional values using `ifelse`

df[:,
   {"TMAX": max(ifelse(f.Element == "TMAX", f.Data_Value, None)),
    "TMIN": min(ifelse(f.Element == "TMIN", f.Data_Value, None))},
   by(f.Day)]

Conditional group sum appended back to frame

expression = ((f.Num == 17) | (f.Num == 12)) & (f.Letter == "D")
df[:, update(Total=sum(expression * f.Count)), by(f.ID)]

Handling Null Values in Groups

By default, None forms its own group:

df[:, sum(f[:]), by('b')]
# null values in 'b' appear as a separate group at the top

To exclude null groups, pre-filter before grouping:

df[f.b != None, :][:, sum(f[:]), by('b')]

Get Started

Core Concepts

Working with Data

Machine Learning

Migration & Comparisons

Grouping with by()

Aggregation

Transformation

Filtration

Aggregation

Group by one column

Group by multiple columns

Multiple aggregate functions

Aggregate over a column range

Different functions on different columns

Compute between aggregated values

Exclude grouping columns from output

Group by Expressions

By column position

By boolean expression

By computed expression (e.g. bin numeric values)

By floor-divided value

Transformation (preserving frame shape)

Append group min and max

Fill missing values with group mean

Compute a group-level ratio

Window-style computation (SQL analogue)

Filtration (select rows within groups)

HAVING — filter groups by aggregate

Filter rows by minimum value within group

Keep groups that contain both 0 and 1 in a column

Using `sort()` with `by()`

Nested and Conditional Aggregations

Row-sum of column groups, then aggregate

Conditional sums

Aggregate on conditional values using `ifelse`

Conditional group sum appended back to frame

Handling Null Values in Groups

Build docs developers (and LLMs) love

Get Started

Core Concepts

Working with Data

Machine Learning

Migration & Comparisons

Aggregation

Transformation

Filtration

​Aggregation

​Group by one column

​Group by multiple columns

​Multiple aggregate functions

​Aggregate over a column range

​Different functions on different columns

​Compute between aggregated values

​Exclude grouping columns from output

​Group by Expressions

​By column position

​By boolean expression

​By computed expression (e.g. bin numeric values)

​By floor-divided value

​Transformation (preserving frame shape)

​Append group min and max

​Fill missing values with group mean

​Compute a group-level ratio

​Window-style computation (SQL analogue)

​Filtration (select rows within groups)

​HAVING — filter groups by aggregate

​Filter rows by minimum value within group

​Keep groups that contain both 0 and 1 in a column

​Using sort() with by()

​Nested and Conditional Aggregations

​Row-sum of column groups, then aggregate

​Conditional sums

​Aggregate on conditional values using ifelse

​Conditional group sum appended back to frame

​Handling Null Values in Groups

Build docs developers (and LLMs) love

Aggregation

Group by one column

Group by multiple columns

Multiple aggregate functions

Aggregate over a column range

Different functions on different columns

Compute between aggregated values

Exclude grouping columns from output

Group by Expressions

By column position

By boolean expression

By computed expression (e.g. bin numeric values)

By floor-divided value

Transformation (preserving frame shape)

Append group min and max

Fill missing values with group mean

Compute a group-level ratio

Window-style computation (SQL analogue)

Filtration (select rows within groups)

HAVING — filter groups by aggregate

Filter rows by minimum value within group

Keep groups that contain both 0 and 1 in a column

Using `sort()` with `by()`

Nested and Conditional Aggregations

Row-sum of column groups, then aggregate

Conditional sums

Aggregate on conditional values using `ifelse`

Conditional group sum appended back to frame

Handling Null Values in Groups