Overview
The Timelines Aggregation Framework is a set of libraries and utilities that lets teams flexibly compute aggregate (counting) features in both batch and real time. Aggregate features can capture historical interactions between arbitrary entities (and sets thereof), conditional on provided features and labels. These engineered aggregate features have proven highly impactful across different teams at Twitter.
What Are Aggregate Features?
Aggregate features are computed on provided grouping keys with the constraint that these keys must be sparse binary features (or sets thereof).
Common Use Cases
User Engagement History
Calculate a user’s past engagement history with various types of tweets (photo, video, retweets, etc.), specific authors, or specific in-network engagers.
Aggregation keys: userId, (userId, authorId), (userId, engagerId)
Tweet-Level Aggregates
Compute custom aggregate engagement counts on every tweetId for Timelines and MagicRecs.
Aggregation keys: tweetId, (tweetId, userId)
Other Entity Aggregates
Calculate aggregates on other entities like advertisers or media.
Aggregation keys: advertiserId, mediaId
Framework Capabilities
The framework supports computing aggregate features on any sparse binary grouping key:
Batch Processing
Daily batch processing of DataRecords containing all required input features
Real-Time Streaming
Real-time aggregation through Storm with memcache backing
Flexible Keys
Support for arbitrary sparse binary features as grouping keys
Conditional Aggregation
Compute aggregates conditional on provided features and labels
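
To make the idea of conditional aggregation concrete, here is a minimal sketch in plain Python. It is not the framework's API: the record layout (a dict per record), the function name `aggregate_counts`, and the feature and label names `has_photo` and `is_favorited` are all assumptions chosen for illustration. It counts records per grouping key only when both a feature and a label are set.

```python
from collections import Counter
from typing import Iterable, Mapping, Sequence

# Hypothetical record layout: a dict mapping feature names to values.
Record = Mapping[str, object]


def aggregate_counts(
    records: Iterable[Record],
    key_fields: Sequence[str],
    feature: str,
    label: str,
) -> Counter:
    """Count records per grouping key where both `feature` and `label` are truthy."""
    counts: Counter = Counter()
    for record in records:
        # Skip records that are missing any part of the grouping key.
        if any(field not in record for field in key_fields):
            continue
        if record.get(feature) and record.get(label):
            key = tuple(record[field] for field in key_fields)
            counts[key] += 1
    return counts


# Illustrative data only; the field names are assumptions.
records = [
    {"userId": 1, "authorId": 10, "has_photo": True, "is_favorited": True},
    {"userId": 1, "authorId": 10, "has_photo": True, "is_favorited": False},
    {"userId": 1, "authorId": 11, "has_photo": True, "is_favorited": True},
    {"userId": 2, "authorId": 10, "has_photo": False, "is_favorited": True},
]

per_user = aggregate_counts(records, ["userId"], "has_photo", "is_favorited")
per_pair = aggregate_counts(records, ["userId", "authorId"], "has_photo", "is_favorited")
```

The same records produce different aggregates depending on the grouping key: `per_user` groups by `userId` alone, while `per_pair` groups by the `(userId, authorId)` pair.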
Implementation Details
Offline (Batch) Implementation
Data Collection
Daily batch processing of DataRecords containing all required input features to generate aggregate features
Online (Real-Time) Implementation
Architecture
Example Features
Here are some examples of aggregate features that can be computed:
Where Is This Used?
The aggregation framework is extensively used across Twitter’s recommendation systems:
Home Timeline Heavy Ranker
Uses a variety of both batch and real-time features generated by this framework. These features are critical for ranking tweets in the main timeline.
Email Recommendations
Aggregate features power personalized email content selection
Other Recommendations
Various recommendation surfaces leverage these features
Feature Types
Counting Features
The framework specializes in counting aggregations:
- Engagement counts: How many times a user liked tweets from an author
- Interaction frequency: How often two entities interact
- Time-windowed counts: Engagements in the last 7/30/90 days
- Conditional counts: Engagements filtered by content type, topic, etc.
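
The time-windowed counts in the list above can be sketched as follows. This is an illustrative example under stated assumptions, not the framework's implementation: the event shape (a key tuple plus a timestamp), the function name `windowed_counts`, and the use of hard trailing windows are all hypothetical choices made to keep the example small.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Sequence, Tuple

# Hypothetical event shape: (grouping key, timestamp of the engagement).
Event = Tuple[Tuple[int, ...], datetime]


def windowed_counts(
    events: Iterable[Event],
    now: datetime,
    window_days: Sequence[int] = (7, 30, 90),
) -> Dict[Tuple[int, ...], Dict[int, int]]:
    """Return, per key, how many events fall inside each trailing window."""
    result: Dict[Tuple[int, ...], Dict[int, int]] = {}
    for key, timestamp in events:
        age = now - timestamp
        if age < timedelta(0):
            continue  # Ignore events stamped in the future.
        per_key = result.setdefault(key, {days: 0 for days in window_days})
        for days in window_days:
            if age <= timedelta(days=days):
                per_key[days] += 1
    return result


# Illustrative data: four engagements for the (userId, authorId) pair (1, 10).
now = datetime(2023, 4, 1)
events = [
    ((1, 10), now - timedelta(days=2)),
    ((1, 10), now - timedelta(days=20)),
    ((1, 10), now - timedelta(days=60)),
    ((1, 10), now - timedelta(days=120)),
]
counts = windowed_counts(events, now)
```

Each event contributes to every window that contains it, so the 7-day count is always less than or equal to the 30-day count, which in turn is less than or equal to the 90-day count.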
Grouping Keys
Any sparse binary feature can serve as a grouping key:
| Key Type | Example | Use Case |
|---|---|---|
| Single Entity | userId, tweetId | User-level or tweet-level aggregates |
| Pair | (userId, authorId) | Interaction between two entities |
| Triple | (userId, authorId, topicId) | Multi-dimensional aggregates |
| Set | userId with content filters | Conditional aggregates |
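
As a rough illustration of the key types in the table, the sketch below derives several grouping keys from a single record. The names `KEY_SPECS` and `grouping_keys`, and the dict-based record, are assumptions made for this example and are not taken from the framework. A key is emitted only when every field it needs is present, which is one simple way to treat keys as sparse.

```python
from typing import Dict, Mapping, Sequence, Tuple

# Hypothetical key specifications mirroring the single, pair, and triple rows.
KEY_SPECS: Dict[str, Sequence[str]] = {
    "user": ("userId",),
    "user_author": ("userId", "authorId"),
    "user_author_topic": ("userId", "authorId", "topicId"),
}


def grouping_keys(
    record: Mapping[str, int],
    key_specs: Mapping[str, Sequence[str]] = KEY_SPECS,
) -> Dict[str, Tuple[int, ...]]:
    """Build every grouping key whose fields are all present in the record."""
    keys: Dict[str, Tuple[int, ...]] = {}
    for name, fields in key_specs.items():
        if all(field in record for field in fields):
            keys[name] = tuple(record[field] for field in fields)
    return keys


# This record has no topicId, so the triple key is skipped.
record = {"userId": 1, "authorId": 10}
keys = grouping_keys(record)
```

A record with all three fields would yield all three keys, so one input record can update several aggregates at once.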
Real-Time vs Batch
When to Use Batch Features
Use batch for historical patterns and stable features
Use batch when low latency is not critical
Use batch for complex aggregations over long time windows
When to Use Real-Time Features
Use real-time for recent user behavior (last few hours/days)
Use real-time when feature freshness impacts model performance
Use real-time to capture trending and viral content
Hybrid Approach
The Home Timeline heavy ranker uses both batch and real-time features:
- Batch features: Long-term user preferences (30/90 day windows)
- Real-time features: Recent engagement patterns (24h/7d windows)
- Combined approach provides both stability and freshness
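
One way to picture the combined approach is a lookup step that merges both feature sources into a single model input. The sketch below assumes hypothetical feature names and a hypothetical `build_feature_vector` helper. It does not describe how the heavy ranker actually assembles its inputs. Missing features default to zero so the vector keeps a fixed layout.

```python
from typing import Dict, List, Mapping, Sequence


def build_feature_vector(
    batch: Mapping[str, float],
    realtime: Mapping[str, float],
    schema: Sequence[str],
) -> List[float]:
    """Merge batch and real-time features, then order them by a fixed schema."""
    merged: Dict[str, float] = {}
    for name, value in batch.items():
        merged[f"batch.{name}"] = float(value)
    for name, value in realtime.items():
        merged[f"realtime.{name}"] = float(value)
    # Features absent from both sources fall back to 0.0.
    return [merged.get(name, 0.0) for name in schema]


# Hypothetical feature names for a (userId, authorId) pair.
schema = [
    "batch.favorites.30d",
    "batch.favorites.90d",
    "realtime.favorites.24h",
    "realtime.favorites.7d",
]
batch = {"favorites.30d": 12, "favorites.90d": 31}
realtime = {"favorites.24h": 2}  # No 7d value has been written yet.
vector = build_feature_vector(batch, realtime, schema)
```

Prefixing each name with its source keeps a batch window and a real-time window from colliding even when they describe the same engagement type.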
Performance Considerations
Batch Processing Performance
- Processes millions of user records daily
- Optimized for throughput over latency
- Manhattan provides fast online lookups
- Features updated once per day
Real-Time Processing Performance
- Storm topology handles high-throughput streams
- Memcache provides low-latency access (under 10ms)
- Features updated within seconds of user action
- Trade-off between freshness and computational cost
Storage Efficiency
- Sparse binary keys minimize storage requirements
- Only store non-zero counts
- TTL policies manage storage growth
- Compression for batch features in Manhattan
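
The points about storing only non-zero counts and expiring entries with a TTL can be illustrated with a small in-memory stand-in. The class below is a sketch, not a model of Manhattan or memcache: the name `SparseCountStore`, the refresh-on-write TTL policy, and the injectable clock are all assumptions made so the example is self-contained and testable.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class SparseCountStore:
    """In-memory stand-in that keeps only non-zero, unexpired counts."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (count, expiry time); absent keys implicitly have count 0.
        self._entries: Dict[Hashable, Tuple[int, float]] = {}

    def get(self, key: Hashable) -> int:
        entry = self._entries.get(key)
        if entry is None:
            return 0
        count, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # Lazily drop expired entries on read.
            return 0
        return count

    def increment(self, key: Hashable, amount: int = 1) -> None:
        if amount == 0:
            return
        count = self.get(key) + amount
        if count == 0:
            self._entries.pop(key, None)  # Never store a zero count.
        else:
            # Assumed policy: each write refreshes the TTL.
            self._entries[key] = (count, self._clock() + self._ttl)

    def __len__(self) -> int:
        return len(self._entries)


# A fake clock makes expiry deterministic for the example.
now = [0.0]
store = SparseCountStore(ttl_seconds=60, clock=lambda: now[0])
store.increment(("user", 1))
store.increment(("user", 1))
now[0] = 30.0
fresh = store.get(("user", 1))
now[0] = 200.0
expired = store.get(("user", 1))
```

Because absent keys read as zero, the store's size grows with the number of active keys rather than with the full key space.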
Integration with Data Pipeline
The aggregation framework integrates with other components:
Related Components
User Signal Service
Provides input signals for aggregation
Unified User Actions
Source of raw user action events
Retrieval Signals
Uses aggregate features for candidate scoring