Real Graph

Overview

Real Graph predicts the likelihood of a Twitter user interacting with another user using a gradient boosting tree classifier. It combines batch aggregation of user interactions with ML-based probability scoring to power personalized recommendations.

The BigQuery side of the system is referred to as BQE (BigQuery Engagement). Real Graph processes millions of user interaction pairs daily.

How It Works

1. Graph Representation

Real Graph represents Twitter users as a directed graph where:

Nodes: Users
Edges: Interactions between users (follows, favorites, retweets, profile views, etc.)
Features: Metrics like tweet count, follow count, favorites, and behavioral signals
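The representation above can be sketched as a small in-memory structure. This is only an illustration: the class name `InteractionGraph`, its methods, and the nested-dictionary layout are hypothetical and are not taken from the Real Graph codebase, which stores this data in BigQuery tables.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class InteractionGraph:
    """Directed graph: nodes are user IDs, edges carry per-type interaction counts."""

    # edges[src][dst] maps an interaction type to its count (hypothetical layout).
    edges: dict = field(
        default_factory=lambda: defaultdict(lambda: defaultdict(dict))
    )

    def add_interaction(self, src, dst, kind, count=1):
        """Record `count` interactions of type `kind` from src to dst."""
        counts = self.edges[src][dst]
        counts[kind] = counts.get(kind, 0) + count

    def edge(self, src, dst):
        """Return a copy of the counts on the src -> dst edge (empty if absent)."""
        # Direction matters: A -> B is stored separately from B -> A.
        return dict(self.edges.get(src, {}).get(dst, {}))


graph = InteractionGraph()
graph.add_interaction("user_a", "user_b", "favorites", 15)
graph.add_interaction("user_a", "user_b", "retweets", 3)
print(graph.edge("user_a", "user_b"))
print(graph.edge("user_b", "user_a"))
```

Keeping the two directions separate is what makes the graph directed: User A favoriting User B's tweets says nothing about User B's behavior toward User A.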

2. Interaction Types

The system tracks both public and private engagements:

Public Engagements

Favorites/Likes
Retweets
Follows
Replies
Quote tweets

Private Engagements

Profile views

3. Training Pipeline

Labeled Dataset Creation

Candidate Selection: Identify edges active during a specific time period from BigQuery
Label Generation: Join with interactions occurring one day after the candidate period
- Positive (label = 1): Interaction occurred
- Negative (label = 0): No interaction occurred
Feature Extraction: Include user behavior metrics, graph features, and interaction history
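The label-generation step amounts to a left join of candidate edges against the following day's interactions. A minimal sketch, assuming edges are `(source, destination)` tuples held in memory; the function name `build_labels` and the sample user names are hypothetical, and the real pipeline performs this join in BigQuery.

```python
def build_labels(candidate_edges, next_day_interactions):
    """Label each candidate edge 1 if it interacted on the following day, else 0."""
    positives = set(next_day_interactions)
    return [
        (src, dst, 1 if (src, dst) in positives else 0)
        for src, dst in candidate_edges
    ]


# Hypothetical sample data: edges active during the candidate period...
candidates = [("alice", "bob"), ("alice", "carol"), ("dave", "alice")]
# ...and edges with an interaction one day after that period.
next_day = [("alice", "bob"), ("erin", "bob")]

labeled = build_labels(candidates, next_day)
for row in labeled:
    print(row)
```

Edges that appear only in the next-day data (here `erin -> bob`) are not labeled, because they were never candidates.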

Model Training

Data Split

Split labeled dataset into training and testing sets based on source user ID using a custom data split method

Gradient Boosting

Train a boosted tree classifier with hyperparameters including max iterations and subsample rate

Validation

Evaluate model performance on held-out test set

Deployment

Deploy model to score user pairs in production
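The source only says that the split is keyed on source user ID and uses a custom method. One common way to get such a split is to hash the ID, so every edge from a given source user lands in the same set. The sketch below assumes that approach; `assign_split`, `split_rows`, and the `test_fraction` default are hypothetical and not the pipeline's actual implementation.

```python
import hashlib


def assign_split(source_user_id, test_fraction=0.2):
    """Deterministically place a source user in 'train' or 'test'."""
    digest = hashlib.sha256(str(source_user_id).encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a value in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_rows(rows, test_fraction=0.2):
    """Split (source_user_id, destination_user_id, label) rows by source user."""
    train, test = [], []
    for row in rows:
        target = test if assign_split(row[0], test_fraction) == "test" else train
        target.append(row)
    return train, test


rows = [(src, dst, 0) for src in range(50) for dst in range(3)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Splitting on the source user rather than on individual rows keeps one user's edges from appearing in both sets, which would otherwise let the model be evaluated on users it has already seen.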

4. Daily Aggregation (Scio)

Multiple Dataflow jobs run daily to aggregate interaction counts:

User A → User B: {favorites: 15, retweets: 3, profile_views: 2, ...}
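A record like the one above can be produced by grouping raw interaction events by user pair and counting per type. This is a plain-Python sketch of that counting step, not the Scio job itself; the function name `aggregate_daily` and the `(source, destination, type)` event shape are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def aggregate_daily(events):
    """Count interactions per type for each directed (src, dst) pair.

    events: iterable of (src, dst, interaction_type) tuples for one day.
    """
    counts = defaultdict(Counter)
    for src, dst, kind in events:
        counts[(src, dst)][kind] += 1
    return counts


events = [
    ("user_a", "user_b", "favorites"),
    ("user_a", "user_b", "favorites"),
    ("user_a", "user_b", "retweets"),
    ("user_b", "user_a", "favorites"),
]
daily = aggregate_daily(events)
print(dict(daily[("user_a", "user_b")]))
```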

Aggregation Outputs

Daily counts: Interactions per type between each user pair
Incoming aggregates: Daily incoming interactions per user
Decayed sums: Time-decayed rollup of historical interactions
ML scores: Predicted interaction probability alongside decayed sums

The rollup job combines yesterday’s aggregation with today’s interactions, maintaining both recent and historical signal.
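One simple way to express such a rollup is an exponentially decayed sum: new_sum = decay * previous_sum + today_count. The sketch below assumes that form. The source does not state the decay formula or factor, so the `rollup` function, its dictionary layout, and the `decay` value are all hypothetical.

```python
def rollup(previous, today, decay=0.9):
    """Combine yesterday's decayed sums with today's counts.

    previous, today: {(src, dst): {interaction_type: value}}
    decay: hypothetical per-day factor in (0, 1); not specified by the source.
    """
    result = {}
    for edge in set(previous) | set(today):
        old = previous.get(edge, {})
        new = today.get(edge, {})
        result[edge] = {
            kind: decay * old.get(kind, 0.0) + new.get(kind, 0)
            for kind in set(old) | set(new)
        }
    return result


previous = {("user_a", "user_b"): {"favorites": 10.0}}
today = {
    ("user_a", "user_b"): {"favorites": 2, "retweets": 1},
    ("user_c", "user_d"): {"favorites": 1},
}
print(rollup(previous, today, decay=0.5))
```

Because only yesterday's rollup and today's counts are needed, the job never has to re-read the full interaction history, and a pair that stops interacting sees its sums shrink toward zero over time.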

Output Scores

Once trained, the model generates a probability score estimating:

P(User A will interact with User B)

This score is used to:

Rank candidates in recommendations
Filter low-probability candidates early
Personalize which accounts and content to show users
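The ranking and filtering uses listed above can be sketched as a threshold followed by a sort. The function `rank_candidates`, the `min_score` cutoff, and the sample scores are hypothetical; downstream services apply their own thresholds and ranking logic.

```python
def rank_candidates(scores, min_score=0.01, limit=None):
    """Drop low-probability candidates, then sort the rest from high to low.

    scores: {candidate_user_id: predicted interaction probability}
    """
    kept = [(user, p) for user, p in scores.items() if p >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit] if limit is not None else kept


# Hypothetical scores for candidates shown to one user.
scores = {"bob": 0.42, "carol": 0.005, "dave": 0.87}
print(rank_candidates(scores))
print(rank_candidates(scores, limit=1))
```

Filtering before sorting keeps the expensive later stages of a recommendation pipeline from processing candidates that are unlikely to be shown.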

Architecture

BQE (BigQuery Engagement)

Location: src/scala/com/twitter/interaction_graph/

Components:

BigQuery tables with user interaction graphs
Gradient boosting tree classifier
Labeled dataset generation pipeline
Model training and evaluation jobs

Scio (Dataflow Aggregation)

Location: src/scala/com/twitter/interaction_graph/

Daily Jobs:

Interaction Aggregation: Count interactions by type per user pair
Rollup Job: Combine historical + new interactions with decay
Incoming Aggregation: Sum incoming interactions per user
ML Scoring: Apply trained model to generate prediction scores

The decayed sum approach ensures recent interactions have more weight than older ones, keeping predictions relevant.

Where It’s Used

For You Timeline

Ranks tweet candidates based on likelihood of engagement with tweet authors

Who to Follow

Scores potential follow recommendations based on interaction probability

Graph Feature Service

Provides interaction scores as features for downstream ranking models

Search Results

Personalizes search results using user interaction predictions

Key Features

Multi-Signal Learning

Combines public and private engagement signals for comprehensive interaction modeling

Time Decay

Recent interactions weighted more heavily than historical ones

Incremental Updates

Daily aggregation keeps interaction counts and scores fresh

Scalable Processing

Dataflow jobs handle millions of user pairs efficiently

Related Components

Graph Feature Service: Serves graph features derived from Real Graph scores
Follow Recommendation Service: Uses Real Graph scores for account recommendations
Home Mixer: Incorporates Real Graph features in timeline ranking
