LLM aggregate functions: reduce, rerank, first, last

Flock’s aggregate functions work like any SQL aggregate — they process a group of rows defined by GROUP BY and return one result per group. The difference is that the “aggregation logic” is delegated to a language model: you supply a prompt describing what you want, and Flock batches the rows in the group and sends them to the model. All aggregate functions share the same two-argument signature as the scalar functions:

1. A model configuration struct with model_name and an optional secret_name
2. A prompt configuration struct with a prompt or prompt_name, an optional version, and a context_columns array

The context_columns API is identical to the scalar functions:

{'data': column_ref}                        -- basic text column
{'data': column_ref, 'name': 'alias'}       -- column with a named alias
{'data': image_url_col, 'type': 'image'}    -- image column
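
As a rough sketch of how the two arguments combine with GROUP BY, the query below produces one summary per product using llm_reduce, which is documented in the next section. The reviews table, its columns, the alias, and the prompt text are hypothetical. 'gpt-4o' is assumed to be a registered model, as in the examples on this page.

```sql
-- Assumes a hypothetical table: reviews(product_id INTEGER, review_text VARCHAR).
-- Only the two-struct call shape is taken from this page.
SELECT
    product_id,
    llm_reduce(
        {'model_name': 'gpt-4o'},                 -- model configuration
        {                                         -- prompt configuration
            'prompt': 'Summarize the main points raised in these reviews',
            'context_columns': [{'data': review_text, 'name': 'review'}]
        }
    ) AS review_summary
FROM reviews
GROUP BY product_id;   -- one aggregate result per product_id
```

Each output row pairs a product_id with the result computed over that product's rows only.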

llm_reduce

llm_reduce collapses all rows in a group into a single text output. The model receives every row’s column values together with the prompt describing the aggregation. Typical uses are summarization, consolidation, opinion synthesis, and similar tasks.

Return type: JSON

Parameters

Model configuration (first argument)

model_name (string, required): The registered model name to use for aggregation.

secret_name (string, optional): The DuckDB secret holding the API key for this model.

Prompt configuration (second argument)

prompt (string): An inline prompt instructing the model how to aggregate the rows. Mutually exclusive with prompt_name.

prompt_name (string): The name of a pre-configured prompt in Flock’s prompt registry. Mutually exclusive with prompt.

version (integer, optional): The version of the named prompt to use. Only valid with prompt_name.

context_columns (array, required): Columns whose values are passed to the model for each row in the group.

Examples

SELECT llm_reduce(
    {'model_name': 'gpt-4o'},
    {
        'prompt': 'Summarize the following product descriptions',
        'context_columns': [{'data': product_description}]
    }
) AS product_summary
FROM UNNEST([
    'High-performance laptop with M2 chip and stunning Retina display',
    'Wireless earbuds with active noise cancellation and spatial audio',
    'Lightweight tablet perfect for creativity and productivity on the go'
]) AS t(product_description);
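
If the prompt is kept in Flock’s prompt registry, the same call can reference it by name instead of passing the text inline. This sketch assumes that a prompt named 'product-summary' with a version 1 and a DuckDB secret named 'openai_key' already exist. Both names are hypothetical and would need to match whatever is registered in your environment.

```sql
-- 'product-summary' and 'openai_key' are hypothetical names for illustration.
SELECT llm_reduce(
    {'model_name': 'gpt-4o', 'secret_name': 'openai_key'},
    {
        'prompt_name': 'product-summary',   -- used instead of 'prompt'
        'version': 1,                       -- only valid with 'prompt_name'
        'context_columns': [{'data': product_description}]
    }
) AS product_summary
FROM UNNEST([
    'High-performance laptop with M2 chip and stunning Retina display',
    'Wireless earbuds with active noise cancellation and spatial audio'
]) AS t(product_description);
```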

llm_rerank

llm_rerank reorders the rows in a group by relevance to a query prompt and returns the full set of rows as a JSON array sorted from most to least relevant. It is built on the sliding-window listwise reranking method described by Ma et al. (2023), which handles groups larger than a model’s context window by progressively ranking overlapping subsets.

Return type: JSON (array of row objects, ordered by relevance)

Because llm_rerank always returns every row in the group, use llm_first or llm_last instead when you only need the single best or worst match; they are more efficient for that case.

Sliding window mechanism

When a group contains more rows than a model can rank in one call, llm_rerank uses a sliding window strategy, where m is the window size:

1. Rank the last m documents in a window.
2. Shift the window toward the beginning of the list by m/2.
3. Repeat until the window covers the start of the list.

Because consecutive windows overlap by m/2 documents, the top-ranked documents from one window are compared again in the next. The most relevant documents therefore move toward the top of the list without requiring a single model call over the entire group.
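
To make the window positions concrete, the following plain DuckDB query (no model calls) lists the windows that the steps above would visit. The values n = 10 and m = 4 are hypothetical, and the query is only an illustration of the stepping pattern. How Flock actually chooses the window size is not specified on this page.

```sql
-- Illustration only: n and m are hypothetical values, not Flock settings.
-- Requires m >= 2 so that the step m // 2 is non-zero.
WITH params AS (
    SELECT 10 AS n,   -- rows in the group
           4  AS m    -- rows ranked per model call (window size)
),
starts AS (
    -- Candidate window starts, stepping back by m/2 from the end of the list.
    SELECT n, m, unnest(range(n - m, -(m // 2), -(m // 2))) AS raw_start
    FROM params
)
SELECT
    row_number() OVER (ORDER BY raw_start DESC)  AS pass,
    greatest(raw_start, 0)                       AS window_start,  -- 0-based, inclusive
    least(greatest(raw_start, 0) + m, n)         AS window_end     -- exclusive
FROM starts
ORDER BY pass;
```

With these values the windows are [6, 10), [4, 8), [2, 6), and [0, 4). Each window shares m/2 = 2 positions with the previous one, which is how highly ranked rows get carried toward the front.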

Parameters

Model configuration (first argument)

model_name (string, required): The registered model name to use for reranking.

secret_name (string, optional): The DuckDB secret holding the API key for this model.

Prompt configuration (second argument)

prompt (string): The query or relevance criterion to rank against. Mutually exclusive with prompt_name.

prompt_name (string): The name of a pre-configured ranking prompt. Mutually exclusive with prompt.

version (integer, optional): The version of the named prompt to use. Only valid with prompt_name.

context_columns (array, required): Columns whose values the model uses to assess relevance for each row.

Examples

SELECT llm_rerank(
    {'model_name': 'gpt-4o'},
    {
        'prompt': 'AI and machine learning innovations',
        'context_columns': [{'data': document_title}, {'data': document_content}]
    }
) AS reranked_documents
FROM VALUES
    ('Introduction to AI',           'This document covers the basics of artificial intelligence and its applications'),
    ('Machine Learning Fundamentals','Comprehensive guide to machine learning algorithms and techniques'),
    ('Advanced Neural Networks',     'Deep dive into neural network architectures and optimization'),
    ('Data Science Overview',        'General overview of data science methodologies and tools')
AS t(document_title, document_content);

Output format

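llm_rerank returns a JSON array of objects, ordered from most to least relevant. Each object mirrors the columns provided in context_columns. The sample below shows only two entries; since llm_rerank returns every row in the group, the full result for the query above would contain one object per input row: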

[
  {
    "document_title": "Advanced Neural Networks",
    "document_content": "Deep dive into neural network architectures and optimization"
  },
  {
    "document_title": "Introduction to AI",
    "document_content": "This document covers the basics of artificial intelligence and its applications"
  }
]
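
Because the result is a JSON value, DuckDB's JSON functions can pull out individual entries. The sketch below uses the sample output as a literal so that it runs without a model call. It assumes the value returned by llm_rerank behaves like any other DuckDB JSON value, which this page does not state explicitly.

```sql
-- Stand-in for the llm_rerank result: the sample output above as a JSON literal.
WITH ranked AS (
    SELECT '[
      {"document_title": "Advanced Neural Networks",
       "document_content": "Deep dive into neural network architectures and optimization"},
      {"document_title": "Introduction to AI",
       "document_content": "This document covers the basics of artificial intelligence and its applications"}
    ]'::JSON AS reranked_documents
)
SELECT
    json_array_length(reranked_documents)       AS ranked_rows,        -- entries in the array
    reranked_documents->>'$[0].document_title'  AS most_relevant_title, -- first entry
    reranked_documents->>'$[1].document_title'  AS second_title         -- second entry
FROM ranked;
```

In a real query, the CTE body would be replaced by the llm_rerank call itself.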

llm_first

llm_first reranks the rows in a group by relevance to a prompt and returns only the most relevant row as a JSON object. It is equivalent to running llm_rerank and taking the first element, but avoids materializing the full ranked list.

Return type: JSON (single row object)

Parameters

Model configuration (first argument)

model_name (string, required): The registered model name to use for selection.

secret_name (string, optional): The DuckDB secret holding the API key for this model.

Prompt configuration (second argument)

prompt (string): The relevance criterion. The model selects the row that best matches this prompt. Mutually exclusive with prompt_name.

prompt_name (string): The name of a pre-configured selection prompt. Mutually exclusive with prompt.

version (integer, optional): The version of the named prompt to use. Only valid with prompt_name.

context_columns (array, required): Columns the model uses to assess relevance.

Examples

SELECT llm_first(
    {'model_name': 'gpt-4o'},
    {
        'prompt': 'high-performance computing',
        'context_columns': [{'data': product_name}, {'data': product_description}]
    }
) AS first_product_feature
FROM VALUES
    ('MacBook Pro', 'High-performance laptop with M2 chip and Retina display'),
    ('AirPods Pro', 'Wireless earbuds with active noise cancellation'),
    ('iPad Air',    'Lightweight tablet perfect for creativity and productivity')
AS t(product_name, product_description);

Output format

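llm_first returns a single JSON object containing the column values of the most relevant row. The values below only illustrate the shape of the result; they do not come from the example query above: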

{
  "product_name": "Wireless Headphones",
  "product_description": "High-quality wireless headphones with noise cancellation."
}

llm_last

llm_last is the complement of llm_first: it reranks the rows in a group by relevance to a prompt and returns the least relevant row. Use it to identify outliers, flag low-quality entries, or find the weakest match in a group.

Return type: JSON (single row object)

Parameters

Model configuration (first argument)

model_name (string, required): The registered model name to use for selection.

secret_name (string, optional): The DuckDB secret holding the API key for this model.

Prompt configuration (second argument)

prompt (string): The relevance criterion. The model selects the row that least matches this prompt. Mutually exclusive with prompt_name.

prompt_name (string): The name of a pre-configured selection prompt. Mutually exclusive with prompt.

version (integer, optional): The version of the named prompt to use. Only valid with prompt_name.

context_columns (array, required): Columns the model uses to assess relevance.

Examples

SELECT llm_last(
    {'model_name': 'gpt-4o'},
    {
        'prompt': 'professional work productivity',
        'context_columns': [{'data': product_name}, {'data': product_description}]
    }
) AS last_product_feature
FROM VALUES
    ('MacBook Pro', 'High-performance laptop with M2 chip and Retina display'),
    ('AirPods Pro', 'Wireless earbuds with active noise cancellation'),
    ('iPad Air',    'Lightweight tablet perfect for creativity and productivity')
AS t(product_name, product_description);

Output format

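llm_last returns a single JSON object containing the column values of the least relevant row. The values below only illustrate the shape of the result; they do not come from the example query above: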

{
  "product_name": "Wireless Keyboard",
  "product_description": "Ergonomic wireless keyboard with backlight."
}
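
Since llm_first and llm_last are aggregates, both can appear in one grouped query. The sketch below assumes a hypothetical products(category, product_name, product_description) table. It also assumes that the returned JSON objects are keyed by column name, as in the output samples above, and that they can be read with DuckDB's JSON operators.

```sql
-- Assumes a hypothetical table:
--   products(category VARCHAR, product_name VARCHAR, product_description VARCHAR)
SELECT
    category,
    best->>'product_name'  AS strongest_match,   -- field from the llm_first object
    worst->>'product_name' AS weakest_match      -- field from the llm_last object
FROM (
    SELECT
        category,
        llm_first(
            {'model_name': 'gpt-4o'},
            {
                'prompt': 'professional work productivity',
                'context_columns': [{'data': product_name}, {'data': product_description}]
            }
        ) AS best,
        llm_last(
            {'model_name': 'gpt-4o'},
            {
                'prompt': 'professional work productivity',
                'context_columns': [{'data': product_name}, {'data': product_description}]
            }
        ) AS worst
    FROM products
    GROUP BY category
) AS per_category;
```

Each group is ranked separately, so the strongest and weakest matches are relative to the other rows in the same category.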
