Candidate Generation
Candidate generation is the critical first stage of X’s recommendation pipeline, responsible for narrowing down approximately 1 billion potential tweets to a manageable set of thousands of candidates for downstream ranking. This process leverages diverse candidate sources and user behavior signals.
Overview
The candidate sourcing stage uses X user behavior as the primary input to identify potentially relevant content. Multiple specialized systems work in parallel to retrieve candidates from different perspectives.
Process: Multi-Source Retrieval
Candidate Sources
Home Mixer orchestrates multiple candidate sources to retrieve diverse content:
For You Timeline Sources
In-Network (Earlybird)
UTEG
FRS
Earlybird Search Index: Find and rank tweets from accounts the user follows
Coverage: ~50% of For You timeline candidates
Method: Search index traversal with light ranker scoring
Pipeline: ScoredTweetsInNetworkCandidatePipelineConfig
// In-network candidate retrieval
val inNetworkCandidates = earlybirdClient.search(
  userId = request.userId,
  followedUserIds = socialGraph.getFollowing(request.userId),
  maxResults = 2000,
  rankingMode = LightRanker
)
Earlybird combines candidate retrieval with light ranking for efficient in-network scoring
UTEG (User Tweet Entity Graph): Graph-based candidate discovery
Technology: Built on GraphJet framework
Method: In-memory graph traversal of user-tweet interactions
Pipeline: ScoredTweetsUtegCandidatePipelineConfig
// UTEG graph traversal
val utegCandidates = utegClient.recommend(
  userId = request.userId,
  // Traverse from user -> engaged tweets -> similar users -> their tweets
  traversalDepth = 2,
  engagementTypes = Seq(Like, Retweet, Reply),
  maxResults = 500
)
How it works:
1. Start from the user node
2. Find tweets the user recently engaged with
3. Find other users who engaged with the same tweets
4. Recommend tweets those similar users engaged with
FRS (Follow Recommendation Service): Candidates from recommended accounts
Purpose: Surface content from accounts the user might want to follow
Pipeline: ScoredTweetsFrsCandidatePipelineConfig
// FRS candidate retrieval
val frsCandidates = frsClient.getRecommendations(
  userId = request.userId,
  maxAccounts = 100
).flatMap { recommendedAccounts =>
  // Get recent tweets from recommended accounts
  timelineService.getUserTweets(
    recommendedAccounts,
    maxPerUser = 5
  )
}
User Signals for Candidate Sourcing
Candidate sources use diverse user behavior signals to identify relevant content:
Explicit Signals
Social Graph
Author Follow: Accounts the user follows
Author Unfollow: Recently unfollowed accounts
Author Mute: Muted accounts
Author Block: Blocked accounts
Tweet Engagement
Tweet Favorite: Liked tweets
Tweet Unfavorite: Unliked tweets
Retweet: Retweeted content
Quote Tweet: Retweets with comments
Tweet Reply: Replied to tweets
Tweet Share: Shared tweets
Tweet Bookmark: Bookmarked content
Negative Signals
Tweet Don’t Like: “Not interested” feedback
Tweet Report: Reported tweets
Implicit Signals
Tweet Click: Viewed tweet details
Tweet Video Watch: Video watch time
Notification Open: Opened push notifications
Ntab Click: Clicks from notifications tab
Signal Usage by Component
Different candidate sources use signals as features and/or training labels:
USS = User Signal Service, FRS = Follow Recommendation Service
| Signal            | USS      | SimClusters     | TwHIN           | UTEG     | FRS             | Light Ranking   |
|-------------------|----------|-----------------|-----------------|----------|-----------------|-----------------|
| Author Follow     | Features | Features/Labels | Features/Labels | Features | Features/Labels | N/A             |
| Tweet Favorite    | Features | Features        | Features/Labels | Features | Features/Labels | Features/Labels |
| Retweet           | Features | N/A             | Features/Labels | Features | Features/Labels | Features/Labels |
| Quote Tweet       | Features | N/A             | Features/Labels | Features | Features/Labels | Features/Labels |
| Tweet Reply       | Features | N/A             | Features        | Features | Features/Labels | Features        |
| Tweet Click       | Features | N/A             | N/A             | N/A      | Features        | Labels          |
| Video Watch       | Features | Features        | N/A             | N/A      | N/A             | Labels          |
| Notification Open | Features | Features        | Features        | N/A      | Features        | N/A             |
Candidate Source Algorithms
SimClusters
Community detection and sparse embeddings
Community Detection
Identify communities of users with similar interests:
// SimClusters community detection
val communities = detectCommunities(
  userFollowGraph,
  numCommunities = 145000
)
User Embeddings
Represent users as sparse vectors over communities:
// User representation
val userEmbedding = Map(
  communityId_1234 -> 0.8, // Strong affinity
  communityId_5678 -> 0.6, // Medium affinity
  communityId_9012 -> 0.3  // Weak affinity
)
Tweet Embeddings
Represent tweets based on engagement from community members:
// Tweet representation
val tweetEmbedding = Map(
  communityId_1234 -> 0.7, // Engaged by community 1234
  communityId_5678 -> 0.4  // Engaged by community 5678
)
Candidate Retrieval
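Find tweets from the user’s communities:
// Retrieve candidates via community overlap
val candidates = userEmbedding.keys.toSeq
  .flatMap(communityId => getTweetsFromCommunity(communityId))
  .distinct // A tweet can belong to several of the user's communities
  .sortBy(tweet => -tweetScore(tweet)) // Highest score first
  .take(500)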
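As a rough illustration of how community overlap could become a ranking score, the sketch below takes the dot product of a sparse user embedding with each sparse tweet embedding. The function names (`sparse_dot`, `rank_by_community_overlap`), the tweet IDs, and the dot-product scoring rule are assumptions made for this example and are not the production SimClusters API.

```python
from typing import Dict, List, Tuple

SparseEmbedding = Dict[int, float]  # community id -> affinity weight


def sparse_dot(a: SparseEmbedding, b: SparseEmbedding) -> float:
    """Dot product over the communities two sparse embeddings share."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller map
    return sum(weight * b[cid] for cid, weight in a.items() if cid in b)


def rank_by_community_overlap(
    user: SparseEmbedding,
    tweets: Dict[str, SparseEmbedding],
    limit: int,
) -> List[Tuple[str, float]]:
    """Score each tweet against the user and return the top `limit`, best first."""
    scored = [(tweet_id, sparse_dot(user, emb)) for tweet_id, emb in tweets.items()]
    scored = [pair for pair in scored if pair[1] > 0.0]  # drop tweets with no overlap
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# Hypothetical data mirroring the embeddings shown earlier in this section.
user_embedding = {1234: 0.8, 5678: 0.6, 9012: 0.3}
tweet_embeddings = {
    "tweet_a": {1234: 0.7, 5678: 0.4},
    "tweet_b": {9012: 0.5},
    "tweet_c": {4242: 0.9},  # no shared community with the user
}
ranked = rank_by_community_overlap(user_embedding, tweet_embeddings, limit=2)
```

Because both embeddings are sparse, the cost of scoring a tweet depends on the number of non-zero communities rather than on the total number of communities.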
TwHIN
Dense knowledge graph embeddings for Users and Tweets
Graph Construction
Embedding Training
Candidate Retrieval
Graph Construction. Build a heterogeneous graph with multiple entity types:
# TwHIN graph structure
graph = {
    'users': user_nodes,
    'tweets': tweet_nodes,
    'topics': topic_nodes,
    'edges': [
        ('user', 'follows', 'user'),
        ('user', 'likes', 'tweet'),
        ('user', 'interested_in', 'topic'),
        ('tweet', 'about', 'topic'),
    ]
}
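Embedding Training. Learn a dense embedding for each node; the simplified model below scores a user-tweet pair by the dot product of their embeddings:
# Simplified TwHIN training
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwHIN(nn.Module):
    def __init__(self, num_users, num_tweets, dim):
        super().__init__()
        self.user_encoder = nn.Embedding(num_users, dim)
        self.tweet_encoder = nn.Embedding(num_tweets, dim)

    def forward(self, user_id, tweet_id):
        user_emb = self.user_encoder(user_id)
        tweet_emb = self.tweet_encoder(tweet_id)
        # Engagement logit: dot product of the two embeddings
        return (user_emb * tweet_emb).sum(dim=-1)

# Training objective: label engaged pairs 1 and non-engaged pairs 0
pos_logits = model(user, positive_tweet)  # Engaged tweet
neg_logits = model(user, negative_tweet)  # Non-engaged tweet
loss = (
    F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits))
    + F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits))
)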
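To make the training objective concrete, here is a dependency-free sketch of a binary cross-entropy loss over one engaged and one non-engaged tweet for the same user. The embedding values and the helper name `pairwise_bce` are invented for illustration; the real TwHIN training setup is not described in this document and may differ.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def pairwise_bce(
    user: Sequence[float],
    engaged_tweet: Sequence[float],
    non_engaged_tweet: Sequence[float],
) -> float:
    """Binary cross-entropy with label 1 for the engaged tweet and 0 for the other."""
    p_engaged = sigmoid(dot(user, engaged_tweet))
    p_non_engaged = sigmoid(dot(user, non_engaged_tweet))
    return -(math.log(p_engaged) + math.log(1.0 - p_non_engaged))


# Hypothetical 3-dimensional embeddings.
user = [0.5, -0.2, 0.1]
aligned = [0.9, -0.4, 0.3]    # points the same way as the user vector
opposed = [-0.9, 0.4, -0.3]   # points the opposite way

loss_when_engaged_is_aligned = pairwise_bce(user, aligned, opposed)
loss_when_engaged_is_opposed = pairwise_bce(user, opposed, aligned)
```

The loss is lower when the engaged tweet's embedding points in the same direction as the user's, which is the pressure that pulls users toward the tweets they engage with during training.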
Candidate Retrieval. Use approximate nearest neighbor (ANN) search:
# Retrieve candidates via ANN
user_embedding = twhin_model.get_user_embedding(user_id)
# Find tweets with similar embeddings
candidate_tweets = ann_index.search(
    query=user_embedding,
    k=500,
    metric='cosine'
)
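For intuition about what the ANN index approximates, the sketch below performs the exact version of the same query: a brute-force cosine-similarity top-k over a tiny in-memory index. The index contents and the name `exact_top_k` are hypothetical. A real ANN index trades some accuracy for speed, so that it does not have to scan every tweet.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def exact_top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k entries most similar to `query`, most similar first."""
    scored = ((tweet_id, cosine(query, emb)) for tweet_id, emb in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical 2-dimensional tweet embeddings.
tweet_index = {
    "t1": [1.0, 0.0],
    "t2": [0.7, 0.7],
    "t3": [0.0, 1.0],
    "t4": [-1.0, 0.0],
}
nearest = exact_top_k([1.0, 0.1], tweet_index, k=2)
```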
Real Graph
Predict likelihood of user-to-user interaction
// Real Graph scoring
val realGraphScore = realGraphModel.predict(
  sourceUser = userId,
  destUser = authorId,
  features = Seq(
    mutualFollowCount,
    recentInteractionCount,
    followDuration,
    commonInterests
  )
)
// Use for candidate weighting
val weightedScore = candidateScore * realGraphScore
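The document does not say which model family Real Graph uses. Purely as an illustration of turning the listed features into an interaction probability and then a candidate weight, the sketch below assumes a logistic model with made-up weights. Every constant and feature name here is hypothetical.

```python
import math
from typing import Dict

# Hypothetical weights and bias, chosen for illustration only.
WEIGHTS: Dict[str, float] = {
    "mutual_follow_count": 0.05,
    "recent_interaction_count": 0.30,
    "follow_duration_days": 0.001,
    "common_interests": 0.20,
}
BIAS = -2.0


def interaction_probability(features: Dict[str, float]) -> float:
    """Logistic score in (0, 1); missing features count as 0."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def weighted_score(candidate_score: float, features: Dict[str, float]) -> float:
    """Scale a candidate's score by the predicted user-to-author affinity."""
    return candidate_score * interaction_probability(features)


close_friend = {
    "mutual_follow_count": 20,
    "recent_interaction_count": 8,
    "follow_duration_days": 900,
    "common_interests": 5,
}
stranger: Dict[str, float] = {}

friend_weighted = weighted_score(1.0, close_friend)
stranger_weighted = weighted_score(1.0, stranger)
```

With equal base scores, the tweet from the author the user interacts with often ends up with the larger weighted score.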
Candidate Pipeline Flow
Parallel Retrieval
Query all candidate sources simultaneously:
val candidateFutures = Future.collect(Seq(
  earlybirdSource.get(request),
  utegSource.get(request),
  tweetMixerSource.get(request),
  frsSource.get(request)
))
Candidate Merging
Combine candidates from all sources (duplicates are resolved in the deduplication step):
val allCandidates = candidateFutures.map(_.flatten)
Basic Filtering
Apply lightweight filters in candidate pipeline:
val filtered = allCandidates.map(_.filter { candidate =>
  !isBlocked(candidate.authorId) &&
  !isMuted(candidate.authorId) &&
  !hasSeenRecently(candidate.tweetId) &&
  meetsQualityThreshold(candidate)
})
Deduplication
Remove duplicate candidates:
val deduplicated = filtered.map { candidates =>
  candidates
    .groupBy(_.tweetId)
    .map { case (_, duplicates) =>
      // Keep highest scoring version
      duplicates.maxBy(_.score)
    }
    .toSeq
}
Pass to Ranking
Send candidates to feature hydration and scoring:
// Assumes scoringPipeline returns a Future
val rankedCandidates = deduplicated.flatMap { candidates =>
  scoringPipeline(
    candidates = candidates,
    maxToRank = 2000
  )
}
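The five steps can be tried end to end with the small self-contained sketch below, which uses `asyncio.gather` for the parallel fan-out. The `fake_source` coroutine, the `Candidate` fields, and the sample rows are stand-ins invented for this example. Only the overall shape (retrieve in parallel, merge, filter, deduplicate by best score, cap) follows the flow described here.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Candidate:
    tweet_id: int
    author_id: int
    score: float
    source: str


async def fake_source(name: str, rows: List[Tuple[int, int, float]]) -> List[Candidate]:
    """Stand-in for a remote candidate source (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [Candidate(t, a, s, name) for t, a, s in rows]


async def generate_candidates(blocked_authors: Set[int], max_to_rank: int) -> List[Candidate]:
    # 1. Parallel retrieval: all sources run concurrently.
    batches = await asyncio.gather(
        fake_source("earlybird", [(1, 10, 0.9), (2, 11, 0.4)]),
        fake_source("uteg", [(2, 11, 0.7), (3, 12, 0.6)]),
        fake_source("frs", [(4, 13, 0.8)]),
    )
    # 2. Merging: flatten the per-source batches into one list.
    merged = [c for batch in batches for c in batch]
    # 3. Basic filtering: drop candidates from blocked authors.
    filtered = [c for c in merged if c.author_id not in blocked_authors]
    # 4. Deduplication: keep the highest-scoring copy of each tweet.
    best: Dict[int, Candidate] = {}
    for c in filtered:
        if c.tweet_id not in best or c.score > best[c.tweet_id].score:
            best[c.tweet_id] = c
    # 5. Hand-off: cap how many candidates go on to ranking.
    return sorted(best.values(), key=lambda c: c.score, reverse=True)[:max_to_rank]


result = asyncio.run(generate_candidates(blocked_authors={13}, max_to_rank=10))
```

Tweet 2 is returned by two sources. Only the higher-scoring copy survives deduplication, and tweet 4 is dropped because its author is blocked.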
GraphJet Framework
Many candidate sources (UTEG, Recos-Injector) use the GraphJet framework:
GraphJet: In-memory graph processing for real-time recommendations
Key Features:
Real-time graph updates from user actions
Sub-millisecond graph traversal queries
Bipartite graph representation (users ↔ tweets)
Time-decayed edge weights for recency
// GraphJet graph structure (simplified)
class UserTweetGraph(
  // Bipartite graph: users -> tweets
  userToTweets: Map[UserId, Seq[(TweetId, Timestamp)]],
  // Reverse index: tweets -> users
  tweetToUsers: Map[TweetId, Seq[(UserId, Timestamp)]]
) {
  def recommend(userId: UserId): Seq[TweetId] = {
    // 1. Get tweets user engaged with
    val seedTweetIds = userToTweets.getOrElse(userId, Seq.empty).map(_._1).toSet
    // 2. Find other users who engaged with same tweets
    val similarUsers = seedTweetIds.toSeq
      .flatMap(tweetId => tweetToUsers.getOrElse(tweetId, Seq.empty).map(_._1))
      .distinct
      .filterNot(_ == userId)
    // 3. Get tweets those users engaged with, excluding the seed tweets
    val candidates = similarUsers
      .flatMap(user => userToTweets.getOrElse(user, Seq.empty).map(_._1))
      .filterNot(seedTweetIds.contains)
    // 4. Score by frequency and rank, most frequent first
    candidates.groupBy(identity)
      .map { case (tweetId, occurrences) => (tweetId, occurrences.size) }
      .toSeq
      .sortBy { case (_, count) => -count }
      .map(_._1)
  }
}
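The class above scores candidates by raw frequency and ignores the stored timestamps. The time-decayed edge weights listed among the key features could be folded in roughly as follows. Exponential decay and the six-hour half-life are assumptions for this sketch, since the document does not state which decay function GraphJet uses.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

HALF_LIFE_SECONDS = 6 * 3600.0  # hypothetical half-life


def decayed_weight(age_seconds: float, half_life: float = HALF_LIFE_SECONDS) -> float:
    """Weight that halves every `half_life` seconds; 1.0 for a brand-new edge."""
    return 0.5 ** (max(age_seconds, 0.0) / half_life)


def score_by_decayed_engagement(
    edges: Iterable[Tuple[str, float]],  # (tweet id, engagement timestamp)
    now: float,
) -> List[Tuple[str, float]]:
    """Sum decayed weights per tweet and return tweets, highest total first."""
    totals: Dict[str, float] = defaultdict(float)
    for tweet_id, timestamp in edges:
        totals[tweet_id] += decayed_weight(now - timestamp)
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)


now = 100_000.0
day = 24 * 3600.0
edges = [
    ("old_popular", now - day),
    ("old_popular", now - day),
    ("old_popular", now - day),
    ("fresh", now - 600.0),
]
ranking = score_by_decayed_engagement(edges, now)
```

Here one ten-minute-old engagement outweighs three day-old ones, whereas pure frequency counting would rank the older tweet first.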
Reduction Ratio: ~1,000,000:1 reduction from all tweets to candidates
Latency: 50-200ms total for parallel candidate retrieval
Diversity: Multiple sources ensure diverse content perspectives
Freshness: Real-time graph updates capture latest user behavior
Candidate Quality Signals
Early quality filtering in candidate generation:
// Quality gates for candidates
val qualityCriteria = Seq(
  // Author reputation
  authorTweepCredScore > minReputationThreshold,
  // Early engagement
  earlyLikeCount > minEarlyEngagement,
  // Content safety
  !isFlaggedByTrustAndSafety,
  // Spam detection
  !isLikelySpam,
  // Language match
  tweetLanguage.isCompatibleWith(userLanguages)
)
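The snippet above lists the criteria but does not show how they are combined. A minimal sketch, assuming a candidate must pass every gate, is given below. The field names, default thresholds, and exact-match language test are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet


@dataclass(frozen=True)
class CandidateTweet:
    author_reputation: float
    early_like_count: int
    flagged_by_trust_and_safety: bool
    likely_spam: bool
    language: str


def passes_quality_gates(
    tweet: CandidateTweet,
    user_languages: FrozenSet[str],
    min_reputation: float = 0.5,  # hypothetical threshold
    min_early_likes: int = 5,     # hypothetical threshold
) -> bool:
    """Return True only if every quality criterion holds."""
    checks = (
        tweet.author_reputation > min_reputation,  # author reputation
        tweet.early_like_count > min_early_likes,  # early engagement
        not tweet.flagged_by_trust_and_safety,     # content safety
        not tweet.likely_spam,                     # spam detection
        tweet.language in user_languages,          # language match
    )
    return all(checks)


good = CandidateTweet(0.8, 12, False, False, "en")
spam = replace(good, likely_spam=True)

good_passes = passes_quality_gates(good, frozenset({"en", "es"}))
spam_passes = passes_quality_gates(spam, frozenset({"en", "es"}))
wrong_language_passes = passes_quality_gates(good, frozenset({"ja"}))
```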