Overview

Real Graph predicts the likelihood of a Twitter user interacting with another user using a gradient boosting tree classifier. It combines batch aggregation of user interactions with ML-based probability scoring to power personalized recommendations.
Real Graph has two parts: BQE (BigQuery Engagement), which builds the labeled dataset and trains the model, and a set of Scio jobs that aggregate and score millions of user interaction pairs daily.

How It Works

1. Graph Representation

Real Graph represents Twitter users as a directed graph where:
  • Nodes: Users
  • Edges: Interactions between users (follows, favorites, retweets, profile views, etc.)
  • Features: Metrics like tweet count, follow count, favorites, and behavioral signals
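
As a rough illustration of this representation, the sketch below stores one record per directed edge, keyed by (source, destination), with a count per interaction type. The class and field names (`Edge`, `InteractionGraph`, `record`) are hypothetical and are not taken from the Real Graph codebase.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """A directed edge from src to dst with per-interaction-type counts."""
    src: int
    dst: int
    counts: dict = field(default_factory=dict)


class InteractionGraph:
    """Hypothetical in-memory stand-in for a table of directed edges."""

    def __init__(self):
        self._edges = {}  # (src, dst) -> Edge

    def record(self, src, dst, interaction, n=1):
        # Create the edge on first use, then bump the counter for this type.
        edge = self._edges.setdefault((src, dst), Edge(src, dst))
        edge.counts[interaction] = edge.counts.get(interaction, 0) + n
        return edge

    def edge(self, src, dst):
        # Direction matters: (1, 2) and (2, 1) are different edges.
        return self._edges.get((src, dst))


graph = InteractionGraph()
graph.record(1, 2, "favorites", 15)
graph.record(1, 2, "retweets", 3)
graph.record(2, 1, "follows")
```

Because the graph is directed, user 1 favoriting user 2 says nothing about the edge from user 2 back to user 1.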

2. Interaction Types

The system tracks both public and private engagements. Public engagements include:
  • Favorites/Likes
  • Retweets
  • Follows
  • Replies
  • Quote tweets
Private engagements include signals such as profile views.

3. Training Pipeline

Labeled Dataset Creation

  1. Candidate Selection: Select from BigQuery the edges that were active during a specific time period
  2. Label Generation: Join with interactions occurring one day after the candidate period
    • Positive (label = 1): Interaction occurred
    • Negative (label = 0): No interaction occurred
  3. Feature Extraction: Include user behavior metrics, graph features, and interaction history
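
The labeling steps above can be sketched in a few lines. This is a simplified in-memory version, not the BigQuery job itself: the function name `build_labeled_dataset`, the dictionary layout, and the example dates are all assumptions made for illustration.

```python
from datetime import date, timedelta


def build_labeled_dataset(candidate_edges, interactions_by_day, period_end):
    """Label each candidate edge by whether it interacted on the following day.

    candidate_edges: {(src, dst): feature_dict} for edges active in the period
    interactions_by_day: {date: set of (src, dst) pairs that interacted}
    period_end: last day of the candidate period
    """
    label_day = period_end + timedelta(days=1)
    positives = interactions_by_day.get(label_day, set())
    return [
        {
            "src": src,
            "dst": dst,
            "features": features,
            "label": 1 if (src, dst) in positives else 0,
        }
        for (src, dst), features in candidate_edges.items()
    ]


candidate_edges = {
    (1, 2): {"favorites": 15, "retweets": 3},
    (1, 3): {"favorites": 1},
}
interactions_by_day = {date(2023, 1, 8): {(1, 2)}}
labeled = build_labeled_dataset(candidate_edges, interactions_by_day, date(2023, 1, 7))
```

Features come only from the candidate period and labels only from the day after, so the model never sees the outcome it is asked to predict.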

Model Training

  1. Data Split: Split the labeled dataset into training and testing sets based on source user ID, using a custom data split method
  2. Gradient Boosting: Train a boosted tree classifier with hyperparameters including max iterations and subsample rate
  3. Validation: Evaluate model performance on the held-out test set
  4. Deployment: Deploy the model to score user pairs in production
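
The source does not describe the custom split method beyond saying it is keyed on the source user ID. One common way to do such a split is to hash the ID into a stable bucket, sketched below. The hash function, the 20% test fraction, and the names `split_bucket` and `split_dataset` are assumptions, not the method Real Graph uses.

```python
import hashlib


def split_bucket(source_id, test_fraction=0.2):
    """Deterministically assign a source user to 'train' or 'test'."""
    digest = hashlib.sha256(str(source_id).encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1].
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_dataset(rows, test_fraction=0.2):
    """Split rows so that every edge from one source user lands in the same set."""
    train, test = [], []
    for row in rows:
        target = test if split_bucket(row["src"], test_fraction) == "test" else train
        target.append(row)
    return train, test


rows = [{"src": s, "dst": d} for s in range(50) for d in range(3)]
train, test = split_dataset(rows)
```

Splitting by source user rather than by row keeps all of a user's outgoing edges on one side of the split, so the test set measures performance on users the model has not seen.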

4. Daily Aggregation (Scio)

Multiple Dataflow jobs run daily to aggregate interaction counts:
User A → User B: {favorites: 15, retweets: 3, profile_views: 2, ...}
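
A minimal sketch of this kind of per-pair counting, using plain Python in place of a Dataflow job. The event tuple layout and the function name `aggregate_daily` are assumptions for illustration.

```python
from collections import Counter, defaultdict


def aggregate_daily(events):
    """Count interactions by type for each directed (src, dst) pair.

    events: iterable of (src, dst, interaction_type) tuples for one day.
    """
    counts = defaultdict(Counter)
    for src, dst, interaction in events:
        counts[(src, dst)][interaction] += 1
    return counts


events = [
    ("A", "B", "favorites"),
    ("A", "B", "favorites"),
    ("A", "B", "retweets"),
    ("B", "A", "profile_views"),
]
daily = aggregate_daily(events)
```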

Aggregation Outputs

  • Daily counts: Interactions per type between each user pair
  • Incoming aggregates: Daily incoming interactions per user
  • Decayed sums: Time-decayed rollup of historical interactions
  • ML scores: Predicted interaction probability alongside decayed sums
The rollup job combines yesterday’s aggregation with today’s interactions, maintaining both recent and historical signal.
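
One way to picture the rollup is as a recurrence: multiply yesterday's decayed sum by a factor below 1, then add today's count. The sketch below assumes this exponential form and a decay factor of 0.5; the source does not state the actual decay function or constant, and the name `rollup` is hypothetical.

```python
def rollup(previous, today, decay=0.5):
    """Combine yesterday's decayed sums with today's raw counts.

    previous: {(src, dst, interaction_type): decayed_sum} from the last rollup
    today:    {(src, dst, interaction_type): count} from today's aggregation
    decay:    assumed multiplicative factor applied to the old sums each day
    """
    keys = set(previous) | set(today)
    return {k: decay * previous.get(k, 0.0) + today.get(k, 0) for k in keys}


key = ("A", "B", "favorites")
state = {}
for day_counts in [{key: 4}, {}, {key: 1}]:
    state = rollup(state, day_counts, decay=0.5)
```

After the three days in this example, the four favorites from day one have been halved twice and contribute 1.0, the same as the single favorite from day three.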

Output Scores

Once trained, the model generates a probability score estimating:
P(User A will interact with User B)
This score is used to:
  • Rank candidates in recommendations
  • Filter low-probability candidates early
  • Personalize which accounts and content to show users
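
As a small illustration of the ranking and filtering uses listed above, the sketch below drops candidates under a score threshold and returns the best remaining ones. The threshold, the `top_k` cutoff, the scores, and the function name `rank_candidates` are made up for the example.

```python
def rank_candidates(scores, min_score=0.05, top_k=3):
    """Filter out low-probability candidates, then rank the rest by score.

    scores: {candidate_user: predicted interaction probability}
    """
    kept = [(user, s) for user, s in scores.items() if s >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


scores = {"B": 0.62, "C": 0.01, "D": 0.30, "E": 0.07}
ranked = rank_candidates(scores, min_score=0.05, top_k=2)
```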

Architecture

BQE (BigQuery Engagement)

Location: src/scala/com/twitter/interaction_graph/
Components:
  • BigQuery tables with user interaction graphs
  • Gradient boosting tree classifier
  • Labeled dataset generation pipeline
  • Model training and evaluation jobs

Scio (Dataflow Aggregation)

Location: src/scala/com/twitter/interaction_graph/
Daily Jobs:
  1. Interaction Aggregation: Count interactions by type per user pair
  2. Rollup Job: Combine historical + new interactions with decay
  3. Incoming Aggregation: Sum incoming interactions per user
  4. ML Scoring: Apply trained model to generate prediction scores
The decayed sum approach ensures recent interactions have more weight than older ones, keeping predictions relevant.
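
The Incoming Aggregation job in the list above sums, for each user, the interactions that other users directed at them. A minimal sketch, assuming per-pair counts are available as a dictionary (the function name `incoming_totals` and the data layout are hypothetical):

```python
from collections import Counter, defaultdict


def incoming_totals(pair_counts):
    """Sum per-type interaction counts over all edges pointing at each user.

    pair_counts: {(src, dst): {interaction_type: count}}
    """
    incoming = defaultdict(Counter)
    for (_src, dst), counts in pair_counts.items():
        incoming[dst].update(counts)  # Counter.update adds counts per key
    return incoming


pair_counts = {
    ("A", "B"): {"favorites": 15, "retweets": 3},
    ("C", "B"): {"favorites": 2},
    ("B", "A"): {"profile_views": 2},
}
incoming = incoming_totals(pair_counts)
```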

Where It’s Used

For You Timeline

Ranks tweet candidates based on likelihood of engagement with tweet authors

Who to Follow

Scores potential follow recommendations based on interaction probability

Graph Feature Service

Provides interaction scores as features for downstream ranking models

Search Results

Personalizes search results using user interaction predictions

Key Features

  • Combines public and private engagement signals for comprehensive interaction modeling
  • Weights recent interactions more heavily than historical ones
  • Keeps interaction counts and scores fresh through daily aggregation
  • Handles millions of user pairs efficiently with Dataflow jobs
