Overview
Real Graph predicts the likelihood of a Twitter user interacting with another user using a gradient boosting tree classifier. It combines batch aggregation of user interactions with ML-based probability scoring to power personalized recommendations.
Also known as BQE (BigQuery Engagement), Real Graph processes millions of user interaction pairs daily.
How It Works
1. Graph Representation
Real Graph represents Twitter users as a directed graph where:
- Nodes: Users
- Edges: Interactions between users (follows, favorites, retweets, profile views, etc.)
- Features: Metrics like tweet count, follow count, favorites, and behavioral signals
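As a rough illustration of the representation above, the sketch below models a directed edge between two users with per-type interaction counts. The class names (`Edge`, `InteractionGraph`), the `record` method, and the interaction type strings are hypothetical and chosen only for this example; they are not taken from the Real Graph codebase.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Edge:
    """Directed edge from a source user to a destination user (hypothetical)."""
    src: int
    dst: int
    # Interaction counts keyed by type, e.g. {"favorite": 3, "retweet": 1}.
    counts: dict = field(default_factory=lambda: defaultdict(int))


class InteractionGraph:
    """Minimal in-memory directed graph keyed by (src, dst) user pairs."""

    def __init__(self):
        self.edges = {}  # (src, dst) -> Edge

    def record(self, src, dst, kind, n=1):
        # Create the edge on first sight, then bump the counter for this type.
        edge = self.edges.setdefault((src, dst), Edge(src, dst))
        edge.counts[kind] += n
        return edge


graph = InteractionGraph()
graph.record(1, 2, "favorite")
graph.record(1, 2, "favorite")
graph.record(1, 2, "retweet")
graph.record(2, 1, "follow")  # the reverse direction is a separate edge
```

Because the graph is directed, the edge from user 1 to user 2 and the edge from user 2 to user 1 carry independent counts.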
2. Interaction Types
The system tracks both public and private engagements. Public engagements include:
- Favorites/Likes
- Retweets
- Follows
- Replies
- Quote tweets
3. Training Pipeline
Labeled Dataset Creation
- Candidate Selection: Identify edges active during a specific time period from BigQuery
- Label Generation: Join with interactions occurring one day after the candidate period
- Positive (label = 1): Interaction occurred
- Negative (label = 0): No interaction occurred
- Feature Extraction: Include user behavior metrics, graph features, and interaction history
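The labeling step above can be sketched as a join between candidate edges and the interactions seen on the following day. This is a minimal in-memory sketch, not the BigQuery pipeline itself; the function name `build_labels`, the tuple layouts, and the reading of "one day after" as the single day following the end of the candidate period are assumptions made for illustration.

```python
from datetime import date, timedelta


def build_labels(candidate_edges, interactions, candidate_end):
    """Label each candidate edge 1 if it interacted on the day after the period.

    candidate_edges: iterable of (src, dst) pairs active in the candidate period.
    interactions: iterable of (src, dst, day) tuples.
    candidate_end: last day of the candidate period (assumed convention).
    """
    label_day = candidate_end + timedelta(days=1)
    interacted = {(s, d) for s, d, day in interactions if day == label_day}
    return {edge: int(edge in interacted) for edge in candidate_edges}


candidates = [(1, 2), (1, 3)]
interactions = [
    (1, 2, date(2023, 1, 8)),  # falls on the label day -> positive
    (1, 3, date(2023, 1, 6)),  # inside the candidate period -> not a label
]
labels = build_labels(candidates, interactions, date(2023, 1, 7))
```

Here the edge (1, 2) receives label 1 and (1, 3) receives label 0, since only the former has an interaction on the label day.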
Model Training
Data Split
Split the labeled dataset into training and testing sets by source user ID, using a custom data split method
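One common way to split by source user ID is to hash the ID into a stable bucket, so every edge from the same source user lands in the same set. The sketch below assumes that approach; the hash choice, the 20% test fraction, and the `src` field name are illustrative assumptions, not details from the source.

```python
import hashlib


def split_bucket(src_user_id, test_fraction=0.2):
    """Deterministically assign a source user to 'train' or 'test'."""
    digest = hashlib.sha256(str(src_user_id).encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_dataset(rows, test_fraction=0.2):
    """Split rows (dicts with a 'src' key) so each source user stays in one set."""
    train, test = [], []
    for row in rows:
        target = test if split_bucket(row["src"], test_fraction) == "test" else train
        target.append(row)
    return train, test


rows = [{"src": s, "dst": d, "label": 0} for s in range(50) for d in range(3)]
train_rows, test_rows = split_dataset(rows)
```

Splitting on the source user rather than on individual edges avoids having the same user's behavior appear in both training and testing data.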
Gradient Boosting
Train a boosted tree classifier with hyperparameters including max iterations and subsample rate
4. Daily Aggregation (Scio)
Multiple Dataflow jobs run daily to aggregate interaction counts.
Aggregation Outputs
- Daily counts: Interactions per type between each user pair
- Incoming aggregates: Daily incoming interactions per user
- Decayed sums: Time-decayed rollup of historical interactions
- ML scores: Predicted interaction probability alongside decayed sums
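The first two outputs can be illustrated with plain counters. This is a small in-memory sketch of the aggregation logic only, not the Scio/Dataflow jobs; the function names and the `(src, dst, interaction_type)` event layout are assumptions for the example.

```python
from collections import Counter


def daily_counts(events):
    """Count interactions per (src, dst, interaction_type) for one day."""
    return Counter(events)


def incoming_aggregates(counts):
    """Sum incoming interactions per (destination user, interaction_type)."""
    incoming = Counter()
    for (src, dst, kind), n in counts.items():
        incoming[(dst, kind)] += n
    return incoming


events = [
    (1, 2, "favorite"),
    (1, 2, "favorite"),
    (3, 2, "favorite"),
    (1, 3, "retweet"),
]
counts = daily_counts(events)
incoming = incoming_aggregates(counts)
```

In this example user 2 receives three favorites in total: two from user 1 and one from user 3.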
The rollup job combines yesterday’s aggregation with today’s interactions, maintaining both recent and historical signal.
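A simple form of such a rollup multiplies yesterday's decayed sum by a decay factor and adds today's count. The sketch below assumes exponential decay with a factor of 0.9 per day; the actual decay function and constant used by Real Graph are not stated in this document, so both are placeholders.

```python
def rollup(yesterday, today, decay=0.9):
    """Combine yesterday's decayed sums with today's counts.

    yesterday: dict mapping a key (e.g. (src, dst, type)) to a decayed sum.
    today: dict mapping the same kind of key to today's raw count.
    decay: per-day multiplier in (0, 1]; 0.9 is an assumed placeholder value.
    """
    keys = set(yesterday) | set(today)
    return {k: decay * yesterday.get(k, 0.0) + today.get(k, 0) for k in keys}


previous = {("a", "b"): 10.0}
current = {("a", "b"): 2, ("a", "c"): 1}
rolled = rollup(previous, current, decay=0.5)
```

With a decay of 0.5, the pair ("a", "b") rolls up to 0.5 * 10.0 + 2 = 7.0, while the new pair ("a", "c") starts at 1.0. Pairs with no activity today keep shrinking by the decay factor, which is what gives recent interactions more weight.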
Output Scores
Once trained, the model generates a probability score estimating how likely one user is to interact with another. These scores are used to:
- Rank candidates in recommendations
- Filter low-probability candidates early
- Personalize which accounts and content to show users
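The uses listed above amount to a filter followed by a sort. The following sketch shows one way a consumer might apply the scores; the function name, the 0.05 threshold, and the `top_k` cutoff are hypothetical values chosen for illustration.

```python
def rank_candidates(scores, min_score=0.05, top_k=10):
    """Drop low-probability candidates, then return the top_k by score.

    scores: dict mapping a candidate ID to its predicted interaction probability.
    """
    kept = [(cand, s) for cand, s in scores.items() if s >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


scores = {"alice": 0.9, "bob": 0.01, "carol": 0.4}
ranked = rank_candidates(scores, min_score=0.05, top_k=2)
```

Filtering before sorting keeps the expensive downstream stages focused on candidates that have a realistic chance of engagement.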
Architecture
BQE (BigQuery Engagement)
Location: src/scala/com/twitter/interaction_graph/
Components:
- BigQuery tables with user interaction graphs
- Gradient boosting tree classifier
- Labeled dataset generation pipeline
- Model training and evaluation jobs
Scio (Dataflow Aggregation)
Location: src/scala/com/twitter/interaction_graph/
Daily Jobs:
- Interaction Aggregation: Count interactions by type per user pair
- Rollup Job: Combine historical + new interactions with decay
- Incoming Aggregation: Sum incoming interactions per user
- ML Scoring: Apply trained model to generate prediction scores
Where It’s Used
For You Timeline
Ranks tweet candidates based on likelihood of engagement with tweet authors
Who to Follow
Scores potential follow recommendations based on interaction probability
Graph Feature Service
Provides interaction scores as features for downstream ranking models
Search Results
Personalizes search results using user interaction predictions
Key Features
Multi-Signal Learning
Combines public and private engagement signals for comprehensive interaction modeling
Time Decay
Recent interactions weighted more heavily than historical ones
Incremental Updates
Daily aggregation keeps interaction counts and scores fresh
Scalable Processing
Dataflow jobs handle millions of user pairs efficiently
Related Components
- Graph Feature Service - Serves graph features derived from Real Graph scores
- Follow Recommendation Service - Uses Real Graph scores for account recommendations
- Home Mixer - Incorporates Real Graph features in timeline ranking