Documentation Index
Fetch the complete documentation index at: https://mintlify.com/timeplus-io/proton/llms.txt
Use this file to discover all available pages before exploring further.
Aggregation functions perform calculations across multiple rows and return a single result. In Timeplus Proton, these functions work in both batch and streaming contexts, with special semantics for continuous stream processing.
Basic Aggregate Functions
count()
Counts the number of rows or non-NULL values.
Syntax:
count() -- Count all rows
count(column) -- Count non-NULL values
count(*) -- Count all rows (same as count())
Examples:
-- Batch: Count all events
SELECT count() FROM events;
-- Streaming: Continuous count by device
SELECT device, count()
FROM devices
GROUP BY device;
-- With window: Count events per 5-second window
SELECT window_start, device, count()
FROM tumble(devices, 5s)
GROUP BY window_start, device;
Properties:
- Returns
0 when input is empty
count(column) excludes NULL values
count(*) and count() include all rows
Implementation: Defined in src/AggregateFunctions/AggregateFunctionCount.cpp
sum()
Calculates the sum of numeric values.
Syntax:
Variants:
sum(x) - Standard sum with type promotion
sum_with_overflow(x) - Sum with potential overflow, same result type
sum_kahan(x) - Kahan compensated summation for better floating-point precision
Examples:
-- Total temperature readings
SELECT sum(temperature) FROM sensor_data;
-- Streaming sum by device
SELECT device, sum(temperature)
FROM devices
GROUP BY device;
-- Windowed sum
SELECT window_start, sum(value)
FROM tumble(events, 10s)
GROUP BY window_start;
Properties:
- Returns NULL for empty input
- Type promotion: integer → Int64/UInt64, float → Float64
- Supports integers, floats, and decimals
Implementation: Defined in src/AggregateFunctions/AggregateFunctionSum.cpp
avg()
Calculates the arithmetic mean of values.
Syntax:
Examples:
-- Average temperature
SELECT avg(temperature) FROM sensor_data;
-- Streaming average per device
SELECT device, avg(temperature)
FROM devices
GROUP BY device;
-- Moving average with tumbling window
SELECT window_start, device, avg(temperature)
FROM tumble(devices, 1m)
GROUP BY window_start, device;
Properties:
- Returns NULL for empty input
- Supports int, uint, float, and decimal types
- Result type is always Float64 for integers, preserves precision for decimals
Implementation: Defined in src/AggregateFunctions/AggregateFunctionAvg.cpp
min() / max()
Return the minimum or maximum value.
Syntax:
Examples:
-- Temperature range
SELECT min(temperature), max(temperature)
FROM sensor_data;
-- Streaming min/max per window
SELECT
window_start,
device,
min(temperature) as min_temp,
max(temperature) as max_temp
FROM tumble(devices, 5m)
GROUP BY window_start, device;
Properties:
- Returns NULL for empty input
- Works with any comparable type (numbers, strings, dates)
- Maintains data type of input column
Implementation: Defined in src/AggregateFunctions/AggregateFunctionMin.cpp and AggregateFunctionMax.cpp
Advanced Aggregate Functions
arg_min() / arg_max()
Returns the value of one column when another column reaches its minimum or maximum.
Syntax:
arg_min(value_column, comparison_column)
arg_max(value_column, comparison_column)
Examples:
-- Device ID with lowest temperature reading
SELECT arg_min(device, temperature) FROM sensor_data;
-- Most recent status for each user
SELECT
user_id,
arg_max(status, timestamp) as latest_status
FROM user_events
GROUP BY user_id;
Properties:
- Order-dependent operation
- Useful for “latest value” patterns in streaming
uniq()
Estimates the number of unique values using HyperLogLog algorithm.
Syntax:
Examples:
-- Unique visitors per window
SELECT window_start, uniq(user_id) as unique_users
FROM tumble(page_views, 1h)
GROUP BY window_start;
Properties:
- Approximate count with memory efficiency
- Accurate for small cardinalities, ~2% error for large sets
group_array()
Collects all values into an array.
Syntax:
group_array(column)
group_array(limit)(column) -- Limit array size
Examples:
-- Collect all temperatures in window
SELECT window_start, device, group_array(temperature)
FROM tumble(sensor_data, 1m)
GROUP BY window_start, device;
Calculate quantiles (percentiles) of numeric data.
Syntax:
quantile(level)(column) -- e.g., quantile(0.5) for median
median(column) -- Alias for quantile(0.5)
quantiles(level1, level2, ...)(column) -- Multiple quantiles
Examples:
-- 95th percentile response time
SELECT quantile(0.95)(response_time) FROM requests;
-- Median per window
SELECT window_start, median(temperature)
FROM tumble(sensor_data, 5m)
GROUP BY window_start;
stddev() / variance()
Calculate standard deviation and variance.
Syntax:
stddev(column) -- Sample standard deviation
stddevPop(column) -- Population standard deviation
variance(column) -- Sample variance
varPop(column) -- Population variance
Streaming Aggregation Semantics
Continuous Computation
In streaming queries, aggregations are computed continuously as new data arrives:
-- This query continuously updates as new data arrives
SELECT device, count(), avg(temperature)
FROM devices
GROUP BY device;
Key behaviors:
- Results update incrementally with each new event
- State is maintained in memory for active groups
- No final “complete” state - always evolving
Window-Based Aggregation
Combine aggregations with window functions for bounded computation:
-- Tumbling window: non-overlapping 5-second windows
SELECT window_start, window_end, device, count(), avg(temperature)
FROM tumble(devices, 5s)
GROUP BY window_start, window_end, device;
Window aggregation properties:
- Bounded state: Each window has finite data
- Window emission: Results emitted when window completes
- Resource management: Completed windows can be freed from memory
State Management
Proton manages aggregation state efficiently:
- Incremental updates: State updated with each event, not recalculated
- Memory optimization: Compact state representations
- Checkpointing: State can be persisted for fault tolerance (in clustering mode)
Example of state evolution:
-- Each new event updates the running sum and count
SELECT device, sum(value), count()
FROM sensor_stream
GROUP BY device;
-- State for device 'sensor1':
-- Event 1: sum=10, count=1
-- Event 2: sum=25, count=2 (incremental update)
-- Event 3: sum=40, count=3 (incremental update)
Combinator Functions
Aggregation functions can be modified with combinators for special behaviors:
-
-If combinator: Conditional aggregation
SELECT countIf(temperature > 100), sumIf(value, status = 'active')
FROM devices;
-
-State combinator: Return intermediate state
SELECT countState() FROM events; -- Returns state object
-
-Merge combinator: Merge states
SELECT countMerge(state) FROM intermediate_results;
Aggregation with Multiple Levels
Group by multiple columns for hierarchical aggregation:
SELECT
window_start,
region,
device_type,
count() as event_count,
avg(temperature) as avg_temp,
max(temperature) as max_temp
FROM tumble(devices, 1m)
GROUP BY window_start, region, device_type;
Memory Usage
- Unbounded groups: Unlimited GROUP BY keys consume memory
- Use windows: Bound memory with time-based windows
- State cleanup: Completed windows release memory
Computational Efficiency
- Incremental computation: O(1) update per event for most aggregates
- Pre-aggregation: Consider pre-aggregating in materialized views
- Parallelization: Aggregations can be parallelized across partitions
Best Practices
- Use appropriate windows for streaming queries to bound state
- Choose efficient aggregates:
count() and sum() are more efficient than median()
- Limit GROUP BY cardinality when possible
- Consider approximate algorithms:
uniq() vs count(distinct)
Aggregation in Materialized Views
Materialized views can maintain pre-computed aggregations:
CREATE MATERIALIZED VIEW device_stats AS
SELECT
window_start,
device,
count() as event_count,
avg(temperature) as avg_temp,
min(temperature) as min_temp,
max(temperature) as max_temp
FROM tumble(devices, 1m)
GROUP BY window_start, device;
-- Query the materialized view (fast!)
SELECT * FROM device_stats WHERE device = 'sensor1';
Benefits:
- Pre-computed results for fast queries
- Reduced computation on raw data
- Automatic maintenance as data arrives
Examples: Common Patterns
Running Totals
SELECT device, sum(value) as running_total
FROM events
GROUP BY device;
Moving Averages
SELECT window_start, device, avg(temperature) as moving_avg
FROM tumble(sensor_data, 1m)
GROUP BY window_start, device;
Count Distinct Users
SELECT window_start, uniq(user_id) as unique_users
FROM tumble(page_views, 1h)
GROUP BY window_start;
Top K Items
SELECT window_start, topK(10)(product_id) as top_products
FROM tumble(purchases, 1d)
GROUP BY window_start;
Statistical Analysis
SELECT
window_start,
avg(response_time) as avg_rt,
quantile(0.5)(response_time) as median_rt,
quantile(0.95)(response_time) as p95_rt,
quantile(0.99)(response_time) as p99_rt,
stddev(response_time) as stddev_rt
FROM tumble(requests, 5m)
GROUP BY window_start;
See Also