Timelines Aggregation Framework

Overview

The Timelines Aggregation Framework is a set of libraries and utilities that allows teams to flexibly compute aggregate (counting) features in both batch and real-time. Aggregate features can capture historical interactions between arbitrary entities (and sets thereof), conditional on provided features and labels.

These types of engineered aggregate features have proven to be highly impactful across different teams at Twitter.

What Are Aggregate Features?

Aggregate features are computed on provided grouping keys with the constraint that these keys must be sparse binary features (or sets thereof).

Common Use Cases

User Engagement History

Calculate a user’s past engagement history with various types of tweets (photo, video, retweets, etc.), specific authors, or specific in-network engagers.

Aggregation keys: userId, (userId, authorId), (userId, engagerId)

Tweet-Level Aggregates

Compute custom aggregate engagement counts on every tweetId for Timelines and MagicRecs.

Aggregation keys: tweetId, (tweetId, userId)

Other Entity Aggregates

Calculate aggregates on other entities like advertisers or media.

Aggregation keys: advertiserId, mediaId
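As a rough illustration of what these aggregation keys mean, the sketch below counts engagement records per key. The record layout and the `count_by` helper are assumptions made for this example, not part of the framework.

```python
from collections import Counter

# Hypothetical engagement records; field names mirror the keys listed above.
records = [
    {"userId": 1, "authorId": 10, "tweetId": 100},
    {"userId": 1, "authorId": 10, "tweetId": 101},
    {"userId": 1, "authorId": 11, "tweetId": 102},
    {"userId": 2, "authorId": 10, "tweetId": 100},
]


def count_by(records, key_fields):
    """Count records per grouping key; each key is a tuple of the named fields."""
    return Counter(tuple(r[f] for f in key_fields) for r in records)


per_user = count_by(records, ["userId"])
per_user_author = count_by(records, ["userId", "authorId"])
per_tweet = count_by(records, ["tweetId"])
```

Each choice of key fields yields a different family of features from the same records: one count per user, one per (user, author) pair, and one per tweet.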

Framework Capabilities

The framework supports computing aggregate features on any sparse binary grouping key:

Batch Processing

Daily batch processing of DataRecords containing all required input features

Real-Time Streaming

Real-time aggregation through Storm with memcache backing

Flexible Keys

Support for arbitrary sparse binary features as grouping keys

Conditional Aggregation

Compute aggregates conditional on provided features and labels
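Conditional aggregation can be pictured as counting only the records that satisfy a feature condition and a label condition. In this minimal sketch the field names `is_photo` and `favorited` and the helper `conditional_count` are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: `is_photo` plays the role of a feature,
# `favorited` plays the role of a label.
records = [
    {"userId": 1, "is_photo": True, "favorited": True},
    {"userId": 1, "is_photo": True, "favorited": False},
    {"userId": 1, "is_photo": False, "favorited": True},
    {"userId": 2, "is_photo": True, "favorited": True},
]


def conditional_count(records, key_field, feature, label):
    """Count records per key where both the feature and the label are true."""
    counts = defaultdict(int)
    for r in records:
        if r[feature] and r[label]:
            counts[r[key_field]] += 1
    return dict(counts)


# "How many photo tweets has each user favorited?"
photo_favorites = conditional_count(records, "userId", "is_photo", "favorited")
```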

Implementation Details

Offline (Batch) Implementation

Data Collection

Daily batch processing of DataRecords containing all required input features to generate aggregate features

Aggregation

Compute aggregate counts based on configured grouping keys and conditions

Storage

Upload computed features to Manhattan for online hydration

Serving

Features are hydrated from Manhattan during inference
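The four batch stages above can be sketched end to end as follows. This is a toy model under stated assumptions: `InMemoryStore` is a plain dictionary standing in for Manhattan and does not reflect its real client API, and `run_daily_batch` is a hypothetical name.

```python
from collections import Counter


def run_daily_batch(daily_records, key_fields):
    """Aggregation stage: count one day's records per grouping key."""
    return Counter(tuple(r[f] for f in key_fields) for r in daily_records)


class InMemoryStore:
    """Stand-in for a key-value store such as Manhattan (not its real API)."""

    def __init__(self):
        self._data = {}

    def upload(self, features):
        # Storage stage: write the computed counts.
        self._data.update(features)

    def hydrate(self, key, default=0):
        # Serving stage: look up a count at inference time.
        return self._data.get(key, default)


# Data collection stage: one day's worth of hypothetical records.
daily_records = [
    {"userId": 1, "authorId": 10},
    {"userId": 1, "authorId": 10},
    {"userId": 2, "authorId": 11},
]

store = InMemoryStore()
store.upload(run_daily_batch(daily_records, ["userId", "authorId"]))
count_seen = store.hydrate((1, 10))
count_unseen = store.hydrate((3, 10))
```

Keys that never appeared in the batch fall back to a default of zero at serving time, which is a design choice of this sketch.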

Online (Real-Time) Implementation

Stream Processing

Real-time aggregation of DataRecords through Storm topology

Cache Layer

A backing memcache stores the real-time aggregate features

Query Interface

Features can be queried in real-time for immediate use

Feature Freshness

Continuously updated as new user actions occur
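The online path differs from the batch path mainly in that counts are updated one event at a time and can be read immediately. A minimal sketch, assuming a dictionary in place of memcache and a plain loop in place of a Storm topology; `RealTimeAggregator` is a hypothetical name.

```python
class RealTimeAggregator:
    """Incrementally counts events; a dict stands in for the backing memcache."""

    def __init__(self, key_fields):
        self.key_fields = key_fields
        self._cache = {}

    def on_event(self, event):
        """Stream processing: update the count for this event's key."""
        key = tuple(event[f] for f in self.key_fields)
        self._cache[key] = self._cache.get(key, 0) + 1

    def query(self, key):
        """Query interface: read the current count for a key."""
        return self._cache.get(key, 0)


agg = RealTimeAggregator(["userId", "authorId"])
stream = [
    {"userId": 1, "authorId": 10},
    {"userId": 2, "authorId": 10},
    {"userId": 1, "authorId": 10},
]

observed = []
for event in stream:
    agg.on_event(event)
    # The count is visible right after each event is processed.
    observed.append(agg.query((1, 10)))
```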

Architecture

Example Features

Here are some examples of aggregate features that can be computed:

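# Illustrative only: `aggregate` is a hypothetical helper, not the framework's actual API.
def aggregate(keys, metric, window):
    """Return a declarative spec describing one aggregate feature."""
    return {"keys": tuple(keys), "metric": metric, "window": window}

# Count of user favorites on tweets from a specific author
user_author_favorite_count = aggregate(
    keys=['userId', 'authorId'],
    metric='favorite',
    window='30d',
)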

Where Is This Used?

The aggregation framework is extensively used across Twitter’s recommendation systems:

Home Timeline Heavy Ranker

Uses a variety of both batch and real-time features generated by this framework. These features are critical for ranking tweets in the main timeline.

Email Recommendations

Aggregate features power personalized email content selection

Other Recommendations

Various recommendation surfaces leverage these features

Feature Types

Counting Features

The framework specializes in counting aggregations:

Engagement counts: How many times a user liked tweets from an author
Interaction frequency: How often two entities interact
Time-windowed counts: Engagements in the last 7/30/90 days
Conditional counts: Engagements filtered by content type, topic, etc.
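Time-windowed counts can be illustrated by counting the events that fall within each lookback window. The sketch below assumes a hard cutoff at each window boundary; how the framework actually applies its windows is not specified here, and `windowed_counts` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def windowed_counts(event_times, now, windows_days=(7, 30, 90)):
    """Count events that occurred within each lookback window ending at `now`."""
    return {
        f"{days}d": sum(
            1 for t in event_times
            if timedelta(0) <= now - t <= timedelta(days=days)
        )
        for days in windows_days
    }


now = datetime(2023, 4, 1)
# Hypothetical engagements that happened 1, 5, 20, 45, and 100 days ago.
event_times = [now - timedelta(days=d) for d in (1, 5, 20, 45, 100)]
counts = windowed_counts(event_times, now)
```

Longer windows always include the events of shorter ones, so the counts are non-decreasing as the window grows.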

Grouping Keys

Any sparse binary feature can serve as a grouping key:

Key Type	Example	Use Case
Single Entity	`userId`, `tweetId`	User-level or tweet-level aggregates
Pair	`(userId, authorId)`	Interaction between two entities
Triple	`(userId, authorId, topicId)`	Multi-dimensional aggregates
Set	`userId` with content filters	Conditional aggregates

Grouping keys must be sparse binary features. Dense features are not supported by the framework.
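One way to picture the key types in the table is as tuples built from a record's fields. The sketch below also expands a set-valued field into one key per member; that expansion is an assumption about how sets of sparse binary features might be handled, and `expand_keys` and `topicIds` are hypothetical names.

```python
from itertools import product


def expand_keys(record, key_fields):
    """Return one grouping key per combination of values in the key fields.

    Scalar fields contribute a single value; set-valued fields contribute
    one value per member (sorted for a deterministic order).
    """
    value_lists = []
    for field in key_fields:
        value = record[field]
        if isinstance(value, (set, frozenset)):
            value_lists.append(sorted(value))
        else:
            value_lists.append([value])
    return list(product(*value_lists))


record = {"userId": 1, "authorId": 10, "topicIds": {7, 9}}

single = expand_keys(record, ["userId"])
pair = expand_keys(record, ["userId", "authorId"])
triple = expand_keys(record, ["userId", "authorId", "topicIds"])
```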

Real-Time vs Batch

When to Use Batch Features

Use batch for historical patterns and stable features

Use batch when low latency is not critical

Use batch for complex aggregations over long time windows

When to Use Real-Time Features

Use real-time for recent user behavior (last few hours/days)

Use real-time when feature freshness impacts model performance

Use real-time to capture trending and viral content

Hybrid Approach

The Home Timeline heavy ranker uses both batch and real-time features:

Batch features: Long-term user preferences (30/90 day windows)
Real-time features: Recent engagement patterns (24h/7d windows)
Combined approach provides both stability and freshness
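A hybrid lookup might merge the two feature sources into one feature map per key before ranking. In this sketch both stores are plain dictionaries and the feature names, prefixes, and `hydrate_features` helper are hypothetical.

```python
def hydrate_features(key, batch_store, realtime_store):
    """Combine batch and real-time counts for one key into a single dict."""
    features = {}
    for name, value in batch_store.get(key, {}).items():
        features[f"batch.{name}"] = value
    for name, value in realtime_store.get(key, {}).items():
        features[f"realtime.{name}"] = value
    return features


# Long windows come from the batch store, short windows from the real-time store.
batch_store = {(1, 10): {"favorite_count_30d": 12, "favorite_count_90d": 40}}
realtime_store = {(1, 10): {"favorite_count_24h": 2}}

features = hydrate_features((1, 10), batch_store, realtime_store)
cold_start = hydrate_features((3, 10), batch_store, realtime_store)
```

Prefixing the names keeps the two sources distinguishable to the model, so a stale batch value and a fresh real-time value for a similar signal never overwrite each other.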

Performance Considerations

Batch Processing Performance

Processes millions of user records daily
Optimized for throughput over latency
Manhattan provides fast online lookups
Features updated once per day

Real-Time Processing Performance

Storm topology handles high-throughput streams
Memcache provides low-latency access (under 10ms)
Features updated within seconds of user action
Trade-off between freshness and computational cost

Storage Efficiency

Sparse binary keys minimize storage requirements
Only store non-zero counts
TTL policies manage storage growth
Compression for batch features in Manhattan
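The "only store non-zero counts" and TTL points can be sketched with a small store that skips zero values and lazily expires old entries. `SparseTTLStore` is a hypothetical class for illustration; timestamps are passed in explicitly so the behavior is deterministic.

```python
class SparseTTLStore:
    """Stores only non-zero counts and drops entries older than ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self._data = {}  # key -> (count, time of last write)

    def put(self, key, count, now):
        if count == 0:
            self._data.pop(key, None)  # zero counts are never stored
        else:
            self._data[key] = (count, now)

    def get(self, key, now):
        entry = self._data.get(key)
        if entry is None:
            return 0
        count, written = entry
        if now - written > self.ttl_seconds:
            del self._data[key]  # lazily expire on read
            return 0
        return count

    def __len__(self):
        return len(self._data)


store = SparseTTLStore(ttl_seconds=60)
store.put("a", 3, now=0)
store.put("b", 0, now=0)
size_after_puts = len(store)
fresh = store.get("a", now=30)
expired = store.get("a", now=100)
size_after_expiry = len(store)
```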

Integration with Data Pipeline

The aggregation framework integrates with other components:

Signal Collection

User actions flow from Unified User Actions (UUA) to the User Signal Service (USS)

Feature Engineering

Aggregation framework processes signals into features

Feature Storage

Features stored in Manhattan (batch) or Memcache (real-time)

Model Serving

Heavy ranker hydrates features during inference

Related Components

User Signal Service

Provides input signals for aggregation

Unified User Actions

Source of raw user action events

Retrieval Signals

Uses aggregate features for candidate scoring
