Overview

The Timelines Aggregation Framework is a set of libraries and utilities that lets teams flexibly compute aggregate (counting) features in both batch and real time. Aggregate features can capture historical interactions between arbitrary entities (and sets thereof), conditional on provided features and labels.
These types of engineered aggregate features have proven to be highly impactful across different teams at Twitter.

What Are Aggregate Features?

Aggregate features are computed on provided grouping keys with the constraint that these keys must be sparse binary features (or sets thereof).

Common Use Cases

Calculate a user’s past engagement history with various types of tweets (photo, video, retweets, etc.), specific authors, or specific in-network engagers. Aggregation keys: userId, (userId, authorId), (userId, engagerId)
Compute custom aggregate engagement counts on every tweetId for Timelines and MagicRecs. Aggregation keys: tweetId, (tweetId, userId)
Calculate aggregates on other entities like advertisers or media. Aggregation keys: advertiserId, mediaId

Framework Capabilities

The framework supports computing aggregate features on any sparse binary grouping key:

Batch Processing

Daily batch processing of DataRecords containing all required input features

Real-Time Streaming

Real-time aggregation through Storm with memcache backing

Flexible Keys

Support for arbitrary sparse binary features as grouping keys

Conditional Aggregation

Compute aggregates conditional on provided features and labels
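
To make conditional aggregation concrete, here is a minimal sketch in plain Python. It is not the framework's API: the function `conditional_counts`, the record layout, and the field names `has_video` and `is_favorited` are all assumptions made for illustration. It counts records per grouping key only when a feature condition holds and a label is set.

```python
from collections import Counter

def conditional_counts(records, key_fields, condition, label):
    """Count records per grouping key where `condition` holds and `label` is set.

    Hypothetical helper for illustration; not part of the framework.
    """
    counts = Counter()
    for record in records:
        if not condition(record) or not record.get(label, False):
            continue
        key = tuple(record[field] for field in key_fields)
        counts[key] += 1
    return counts

# Toy records; the field names are assumptions.
records = [
    {"userId": 1, "authorId": 10, "has_video": True, "is_favorited": True},
    {"userId": 1, "authorId": 10, "has_video": False, "is_favorited": True},
    {"userId": 1, "authorId": 10, "has_video": True, "is_favorited": False},
    {"userId": 2, "authorId": 10, "has_video": True, "is_favorited": True},
]

# Favorites on video tweets, grouped by (userId, authorId).
video_favorites = conditional_counts(
    records,
    key_fields=("userId", "authorId"),
    condition=lambda r: r["has_video"],
    label="is_favorited",
)
```

Only the first and last records satisfy both the feature condition and the label, so each of the two keys ends up with a count of one.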

Implementation Details

Offline (Batch) Implementation

1. Data Collection: Daily batch processing of DataRecords containing all required input features to generate aggregate features.
2. Aggregation: Compute aggregate counts based on configured grouping keys and conditions.
3. Storage: Upload computed features to Manhattan for online hydration.
4. Serving: Features are hydrated from Manhattan during inference.
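
The batch flow above can be sketched roughly as follows. This is a toy model under stated assumptions: a plain dict stands in for Manhattan, records are Python dicts rather than DataRecords, daily counts are added to running totals, and the helper names (`run_daily_batch`, `upload`, `hydrate`) are hypothetical.

```python
from collections import Counter

def run_daily_batch(day_records, key_fields, label):
    """Aggregation step: count labeled records per grouping key for one day."""
    counts = Counter()
    for record in day_records:
        if record.get(label):
            counts[tuple(record[field] for field in key_fields)] += 1
    return counts

def upload(store, counts):
    """Storage step: add the day's counts to the running totals in the store."""
    for key, count in counts.items():
        store[key] = store.get(key, 0) + count

def hydrate(store, key):
    """Serving step: look up a feature at inference time; unseen keys count as 0."""
    return store.get(key, 0)

feature_store = {}  # hypothetical stand-in for Manhattan

day_1 = [
    {"userId": 1, "authorId": 10, "is_favorited": True},
    {"userId": 1, "authorId": 10, "is_favorited": False},
]
day_2 = [
    {"userId": 1, "authorId": 10, "is_favorited": True},
]

for day_records in (day_1, day_2):
    daily_counts = run_daily_batch(day_records, ("userId", "authorId"), "is_favorited")
    upload(feature_store, daily_counts)

favorites = hydrate(feature_store, (1, 10))
unseen = hydrate(feature_store, (9, 9))
```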

Online (Real-Time) Implementation

1. Stream Processing: Real-time aggregation of DataRecords through a Storm topology.
2. Cache Layer: A backing memcache stores the real-time aggregate features.
3. Query Interface: Features can be queried in real time for immediate use.
4. Feature Freshness: Features are continuously updated as new user actions occur.
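
As a rough illustration of the real-time path, the sketch below updates counts one event at a time and serves them from an in-memory dict. The class `RealTimeAggregator` is hypothetical; the dict only stands in for the memcache layer, and no Storm APIs are used.

```python
class RealTimeAggregator:
    """Keeps running counts in a dict that stands in for the memcache layer."""

    def __init__(self, key_fields, label):
        self.key_fields = key_fields
        self.label = label
        self.cache = {}

    def on_event(self, record):
        """Update the count for the record's grouping key if the label is set."""
        if not record.get(self.label):
            return
        key = tuple(record[field] for field in self.key_fields)
        self.cache[key] = self.cache.get(key, 0) + 1

    def query(self, key):
        """Return the current count; keys never seen count as 0."""
        return self.cache.get(key, 0)

aggregator = RealTimeAggregator(key_fields=("tweetId",), label="is_favorited")

aggregator.on_event({"userId": 1, "tweetId": 100, "is_favorited": True})
count_after_first = aggregator.query((100,))

aggregator.on_event({"userId": 2, "tweetId": 100, "is_favorited": True})
aggregator.on_event({"userId": 3, "tweetId": 100, "is_favorited": False})
count_after_third = aggregator.query((100,))
```

Each query reflects every event processed so far, which is the freshness property the steps describe.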

Example Features

Here is an example of an aggregate feature that can be computed:
# Count of user favorites on tweets from a specific author
user_author_favorite_count = aggregate(
    keys=['userId', 'authorId'],
    metric='favorite',
    window='30d'
)
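
The snippet above does not define `aggregate`. One way such a function could behave is sketched below, assuming events are dicts with an `action` name and a `timestamp`, and that windows are given in whole days (for example `'30d'`). The extra `events` and `now` parameters and the event layout are assumptions for this sketch, not the framework's interface.

```python
from collections import Counter
from datetime import datetime, timedelta

def aggregate(events, keys, metric, window, now):
    """Count `metric` events per grouping key within the trailing window.

    `window` is a string such as '30d'; only day units are handled here.
    """
    cutoff = now - timedelta(days=int(window.rstrip("d")))
    counts = Counter()
    for event in events:
        if event["action"] == metric and event["timestamp"] >= cutoff:
            counts[tuple(event[key] for key in keys)] += 1
    return counts

now = datetime(2023, 4, 1)
events = [
    # Inside the 30-day window and matches the metric: counted.
    {"userId": 1, "authorId": 10, "action": "favorite", "timestamp": now - timedelta(days=2)},
    # Older than 30 days: ignored.
    {"userId": 1, "authorId": 10, "action": "favorite", "timestamp": now - timedelta(days=45)},
    # Different action: ignored.
    {"userId": 1, "authorId": 10, "action": "retweet", "timestamp": now - timedelta(days=1)},
]

user_author_favorite_count = aggregate(
    events,
    keys=["userId", "authorId"],
    metric="favorite",
    window="30d",
    now=now,
)
```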

Where Is This Used?

The aggregation framework is extensively used across Twitter’s recommendation systems:

Home Timeline Heavy Ranker

Uses a variety of both batch and real-time features generated by this framework. These features are critical for ranking tweets in the main timeline.

Email Recommendations

Aggregate features power personalized email content selection

Other Recommendations

Various recommendation surfaces leverage these features

Feature Types

Counting Features

The framework specializes in counting aggregations:
  • Engagement counts: How many times a user liked tweets from an author
  • Interaction frequency: How often two entities interact
  • Time-windowed counts: Engagements in the last 7/30/90 days
  • Conditional counts: Engagements filtered by content type, topic, etc.

Grouping Keys

Any sparse binary feature can serve as a grouping key:
| Key Type | Example | Use Case |
| --- | --- | --- |
| Single Entity | userId, tweetId | User-level or tweet-level aggregates |
| Pair | (userId, authorId) | Interaction between two entities |
| Triple | (userId, authorId, topicId) | Multi-dimensional aggregates |
| Set | userId with content filters | Conditional aggregates |
Grouping keys must be sparse binary features. Dense features are not supported by the framework.
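
The single, pair, and triple key types can be illustrated with one pass over the same records. In this sketch, `KEY_SPECS` and `count_by_key_specs` are hypothetical names. It also assumes that a record missing any field of a key is skipped for that key, since a sparse feature may be absent from a record.

```python
from collections import Counter

# Hypothetical key specifications mirroring the key types listed above.
KEY_SPECS = {
    "user": ("userId",),
    "user_author": ("userId", "authorId"),
    "user_author_topic": ("userId", "authorId", "topicId"),
}

def count_by_key_specs(records, key_specs):
    """Count records once per key spec, skipping specs with missing fields."""
    counts = {name: Counter() for name in key_specs}
    for record in records:
        for name, fields in key_specs.items():
            if all(field in record for field in fields):
                counts[name][tuple(record[field] for field in fields)] += 1
    return counts

records = [
    {"userId": 1, "authorId": 10, "topicId": 7},
    {"userId": 1, "authorId": 10},  # no topicId: skipped for the triple key
    {"userId": 1, "authorId": 11, "topicId": 7},
]

counts = count_by_key_specs(records, KEY_SPECS)
```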

Real-Time vs Batch

When to Use Batch Features

Use batch for historical patterns and stable features
Use batch when low latency is not critical
Use batch for complex aggregations over long time windows

When to Use Real-Time Features

Use real-time for recent user behavior (last few hours/days)
Use real-time when feature freshness impacts model performance
Use real-time to capture trending and viral content

Hybrid Approach

The Home Timeline heavy ranker uses both batch and real-time features:
  • Batch features: Long-term user preferences (30/90 day windows)
  • Real-time features: Recent engagement patterns (24h/7d windows)
  • Combined approach provides both stability and freshness
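
A consumer of both feature families might merge them at hydration time along these lines. The function `hydrate_features`, the store layouts, and the `batch.` / `realtime.` name prefixes are assumptions for illustration, not how the heavy ranker is known to do it.

```python
def hydrate_features(key, batch_store, realtime_cache):
    """Merge long-window batch counts and short-window real-time counts for one key."""
    features = {}
    for name, count in batch_store.get(key, {}).items():
        features[f"batch.{name}"] = count
    for name, count in realtime_cache.get(key, {}).items():
        features[f"realtime.{name}"] = count
    return features

# Toy stores keyed by (userId, authorId); the values are made up.
batch_store = {(1, 10): {"favorites_30d": 12, "favorites_90d": 30}}
realtime_cache = {(1, 10): {"favorites_24h": 2}}

features = hydrate_features((1, 10), batch_store, realtime_cache)
cold_start = hydrate_features((2, 20), batch_store, realtime_cache)
```

Prefixing the names keeps the two sources distinguishable to the model, and a key with no history yields an empty feature dict.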

Performance Considerations

Batch:
  • Processes millions of user records daily
  • Optimized for throughput over latency
  • Manhattan provides fast online lookups
  • Features updated once per day

Real-time:
  • Storm topology handles high-throughput streams
  • Memcache provides low-latency access (under 10 ms)
  • Features updated within seconds of user action
  • Trade-off between freshness and computational cost

Storage:
  • Sparse binary keys minimize storage requirements
  • Only store non-zero counts
  • TTL policies manage storage growth
  • Compression for batch features in Manhattan
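
Two of the storage points, storing only non-zero counts and expiring entries with a TTL, can be sketched with a small in-memory store. `TtlCountStore` is a hypothetical class, and the injectable clock is there only to make the example deterministic. Real memcache or Manhattan TTL handling will differ.

```python
import time

class TtlCountStore:
    """In-memory count store that skips zero counts and expires entries by TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (count, expires_at)

    def put(self, key, count):
        if count == 0:
            # Zero counts are not stored; a missing key already reads as 0.
            self._entries.pop(key, None)
            return
        self._entries[key] = (count, self.clock() + self.ttl_seconds)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return 0
        count, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # lazily drop expired entries on read
            return 0
        return count

    def __len__(self):
        return len(self._entries)

fake_now = [0.0]  # mutable fake clock for a deterministic example
store = TtlCountStore(ttl_seconds=60, clock=lambda: fake_now[0])

store.put(("user-1",), 3)
store.put(("user-2",), 0)
size_after_puts = len(store)
fresh_count = store.get(("user-1",))

fake_now[0] = 61.0  # advance past the TTL
expired_count = store.get(("user-1",))
size_after_expiry = len(store)
```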

Integration with Data Pipeline

The aggregation framework integrates with other components:
1. Signal Collection: User actions flow from Unified User Actions (UUA) to the User Signal Service (USS).
2. Feature Engineering: The aggregation framework processes signals into features.
3. Feature Storage: Features are stored in Manhattan (batch) or Memcache (real-time).
4. Model Serving: The heavy ranker hydrates features during inference.

User Signal Service

Provides input signals for aggregation

Unified User Actions

Source of raw user action events

Retrieval Signals

Uses aggregate features for candidate scoring
