Split a frame into groups, apply aggregations or transformations per group, filter within groups, and combine results — all with datatable’s by() modifier.
by() implements a split-apply-combine strategy. It splits the frame into groups defined by one or more columns or expressions, evaluates i and j within each group, and then reassembles the results.
df = dt.Frame("""Fruit Date Name Number Apples 10/6/2016 Bob 7 Apples 10/6/2016 Bob 8 Apples 10/6/2016 Mike 9 Apples 10/7/2016 Steve 10 Apples 10/7/2016 Bob 1 Oranges 10/7/2016 Bob 2 Oranges 10/6/2016 Tom 15 Oranges 10/6/2016 Mike 57 Oranges 10/6/2016 Bob 65 Oranges 10/7/2016 Tony 1 Grapes 10/7/2016 Bob 1 Grapes 10/7/2016 Tom 87 Grapes 10/7/2016 Bob 22 Grapes 10/7/2016 Bob 12 Grapes 10/7/2016 Tony 15""")df[:, sum(f.Number), by('Fruit')]# Apples: 35, Grapes: 137, Oranges: 140
Pass an integer or slice to i to select specific rows within each group. Note that i is applied after grouping.
df = dt.Frame("""A B 1 10 1 20 2 30 2 40 3 10""")# First row per groupdf[0, :, by('A')]# Last row per groupdf[-1, :, by('A')]# Second row per group (groups with fewer rows are omitted)df[1, :, by('A')]
f-expressions in the i section are not yet implemented for groupby. Only integers and slices are supported.
When sort() and by() are both present, sorting happens within each group.
# Get the latest entry per group (sort date ascending, take last row)df[-1, :, by('id'), sort('date')]# Get rows with the highest count per worddf[0, :, by('word'), sort(-f.count)]# Get min value row per category with renamed columnsdf[0, {"value_date": f.date, "value_min": f.value}, by("category"), sort('value')]