Candidate Generation

Candidate generation is the critical first stage of X’s recommendation pipeline, responsible for narrowing down approximately 1 billion potential tweets to a manageable set of thousands of candidates for downstream ranking. This process leverages diverse candidate sources and user behavior signals.

Overview

The candidate sourcing stage uses X user behavior as the primary input to identify potentially relevant content. Multiple specialized systems work in parallel to retrieve candidates from different perspectives.

Input

~1 Billion Tweets

Process

Multi-Source Retrieval

Output

~2-5K Candidates

Candidate Sources

Home Mixer orchestrates multiple candidate sources to retrieve diverse content:

For You Timeline Sources

In-Network (Earlybird)
UTEG
Tweet Mixer
FRS

Earlybird Search Index: Find and rank tweets from accounts the user follows

Coverage: ~50% of For You timeline candidates
Method: Search index traversal with light ranker scoring
Pipeline: ScoredTweetsInNetworkCandidatePipelineConfig

// In-network candidate retrieval
val inNetworkCandidates = earlybirdClient.search(
  userId = request.userId,
  followedUserIds = socialGraph.getFollowing(request.userId),
  maxResults = 2000,
  rankingMode = LightRanker
)

Earlybird combines candidate retrieval with light ranking for efficient in-network scoring
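To make the "retrieve, then light-rank" idea concrete, here is a minimal Python sketch. It is not Earlybird's API or model: the `Tweet` fields, the `light_score` formula, and the function names are all hypothetical, chosen only to show retrieval restricted to followed authors followed by a cheap top-N selection.

```python
import heapq
from dataclasses import dataclass

@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    author_id: int
    like_count: int
    age_hours: float

def light_score(tweet):
    # Hypothetical light ranker: reward engagement, penalize age.
    return tweet.like_count / (1.0 + tweet.age_hours)

def in_network_candidates(tweets, followed_ids, max_results):
    # Retrieval: keep only tweets authored by followed accounts.
    in_network = (t for t in tweets if t.author_id in followed_ids)
    # Light ranking: keep the max_results best tweets by the cheap score.
    return heapq.nlargest(max_results, in_network, key=light_score)

tweets = [
    Tweet(1, author_id=10, like_count=50, age_hours=1.0),
    Tweet(2, author_id=11, like_count=500, age_hours=4.0),  # author not followed
    Tweet(3, author_id=12, like_count=90, age_hours=2.0),
    Tweet(4, author_id=10, like_count=5, age_hours=0.0),
]
top = in_network_candidates(tweets, followed_ids={10, 12}, max_results=2)
top_ids = [t.tweet_id for t in top]
print(top_ids)
```

Using `heapq.nlargest` avoids sorting the full in-network set when only a small number of results is needed.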

User Tweet Entity Graph: Graph-based candidate discovery

Technology: Built on GraphJet framework
Method: In-memory graph traversal of user-tweet interactions
Pipeline: ScoredTweetsUtegCandidatePipelineConfig

// UTEG graph traversal
val utegCandidates = utegClient.recommend(
  userId = request.userId,
  // Traverse from user -> engaged tweets -> similar users -> their tweets
  traversalDepth = 2,
  engagementTypes = Seq(Like, Retweet, Reply),
  maxResults = 500
)

How it works:

Start from the user node
Find tweets the user recently engaged with
Find other users who engaged with the same tweets
Recommend tweets those similar users engaged with

Tweet Mixer: Coordination layer for out-of-network candidates

Purpose: Aggregate candidates from multiple compute services
Pipeline: ScoredTweetsTweetMixerCandidatePipelineConfig
Sources: Cr Mixer, SimClusters, TwHIN, and others

// Tweet Mixer orchestration
// Fetch from every source concurrently, then flatten and drop duplicate tweets
val mixedCandidates = Future.collect(Seq(
  crMixerClient.getCandidates(userId),
  simClustersClient.getCandidates(userId),
  twhinClient.getCandidates(userId)
)).map(_.flatten.distinct)
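The same fan-out pattern can be sketched in Python with `asyncio.gather`. The `fetch` coroutine and the source names below are stand-ins for remote calls, not real client APIs; the sketch also records which source first produced each tweet, which is an assumption about what a mixer might want to keep.

```python
import asyncio

async def fetch(source_name, tweet_ids):
    # Stand-in for a remote call to one candidate service.
    await asyncio.sleep(0)
    return [(tweet_id, source_name) for tweet_id in tweet_ids]

async def mix_candidates():
    # Issue all requests concurrently; results come back in argument order.
    results = await asyncio.gather(
        fetch("cr_mixer", [101, 102]),
        fetch("simclusters", [102, 103]),
        fetch("twhin", [103, 104]),
    )
    # Flatten, keeping the first source seen for each tweet ID.
    merged = {}
    for batch in results:
        for tweet_id, source in batch:
            merged.setdefault(tweet_id, source)
    return merged

mixed = asyncio.run(mix_candidates())
print(mixed)
```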

Follow Recommendation Service: Candidates from recommended accounts

Purpose: Surface content from accounts the user might want to follow
Pipeline: ScoredTweetsFrsCandidatePipelineConfig

// FRS candidate retrieval
val frsCandidates = frsClient.getRecommendations(
  userId = request.userId,
  maxAccounts = 100
).flatMap { recommendedAccounts =>
  // Get recent tweets from recommended accounts
  timelineService.getUserTweets(
    recommendedAccounts,
    maxPerUser = 5
  )
}
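As a rough Python illustration of the two limits in the snippet above (`maxAccounts` and `maxPerUser`), the sketch below takes the first few recommended accounts and keeps only the newest tweets from each. The data layout (`(tweet_id, created_at)` pairs keyed by author) and the function name are assumptions made for the example.

```python
def frs_candidates(recommended_accounts, tweets_by_author, max_accounts, max_per_user):
    candidates = []
    for author_id in recommended_accounts[:max_accounts]:
        # Each tweet is a (tweet_id, created_at) pair; sort newest first.
        recent = sorted(
            tweets_by_author.get(author_id, []),
            key=lambda tweet: tweet[1],
            reverse=True,
        )
        candidates.extend(tweet_id for tweet_id, _ in recent[:max_per_user])
    return candidates

tweets_by_author = {
    "a": [(1, 100), (2, 300), (3, 200)],
    "b": [(4, 50)],
    "c": [(5, 400)],
}
frs_result = frs_candidates(
    ["a", "b", "c"], tweets_by_author, max_accounts=2, max_per_user=2
)
print(frs_result)
```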

User Signals for Candidate Sourcing

Candidate sources use diverse user behavior signals to identify relevant content:

Explicit Signals

Social Graph

Author Follow: Accounts the user follows
Author Unfollow: Recently unfollowed accounts
Author Mute: Muted accounts
Author Block: Blocked accounts

Tweet Engagement

Tweet Favorite: Liked tweets
Tweet Unfavorite: Unliked tweets
Retweet: Retweeted content
Quote Tweet: Retweets with comments
Tweet Reply: Replied to tweets
Tweet Share: Shared tweets
Tweet Bookmark: Bookmarked content

Negative Signals

Tweet Don’t Like: “Not interested” feedback
Tweet Report: Reported tweets

Implicit Signals

Tweet Click: Viewed tweet details
Tweet Video Watch: Video watch time
Notification Open: Opened push notifications
Ntab Click: Clicks from notifications tab

Signal Usage by Component

Different candidate sources use signals as features and/or training labels:

USS = User Signal Service, FRS = Follow Recommendation Service

Signal	USS	SimClusters	TwHIN	UTEG	FRS	Light Ranking
Author Follow	Features	Features/Labels	Features/Labels	Features	Features/Labels	N/A
Tweet Favorite	Features	Features	Features/Labels	Features	Features/Labels	Features/Labels
Retweet	Features	N/A	Features/Labels	Features	Features/Labels	Features/Labels
Quote Tweet	Features	N/A	Features/Labels	Features	Features/Labels	Features/Labels
Tweet Reply	Features	N/A	Features	Features	Features/Labels	Features
Tweet Click	Features	N/A	N/A	N/A	Features	Labels
Video Watch	Features	Features	N/A	N/A	N/A	Labels
Notification Open	Features	Features	Features	N/A	Features	N/A
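One way to read the table is as a lookup from signal to role. The Python sketch below copies only the Light Ranking column from the rows above and lists the signals used as training labels; the dictionary and helper names are illustrative, not part of any service.

```python
# Light Ranking column, copied from the table above.
LIGHT_RANKING_USAGE = {
    "Author Follow": "N/A",
    "Tweet Favorite": "Features/Labels",
    "Retweet": "Features/Labels",
    "Quote Tweet": "Features/Labels",
    "Tweet Reply": "Features",
    "Tweet Click": "Labels",
    "Video Watch": "Labels",
    "Notification Open": "N/A",
}

def signals_used_as(role, usage):
    # A cell such as "Features/Labels" lists every role the signal plays.
    return [signal for signal, cell in usage.items() if role in cell.split("/")]

label_signals = signals_used_as("Labels", LIGHT_RANKING_USAGE)
feature_signals = signals_used_as("Features", LIGHT_RANKING_USAGE)
print(label_signals)
```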

Candidate Source Algorithms

SimClusters

Community detection and sparse embeddings

Community Detection

Identify communities of users with similar interests:

// SimClusters community detection
val communities = detectCommunities(
  userFollowGraph,
  numCommunities = 145000
)

User Embeddings

Represent users as sparse vectors over communities:

// User representation
val userEmbedding = Map(
  communityId_1234 -> 0.8,  // Strong affinity
  communityId_5678 -> 0.6,  // Medium affinity
  communityId_9012 -> 0.3   // Weak affinity
)

Tweet Embeddings

Represent tweets based on engagement from community members:

// Tweet representation  
val tweetEmbedding = Map(
  communityId_1234 -> 0.7,  // Engaged by community 1234
  communityId_5678 -> 0.4   // Engaged by community 5678
)

Candidate Retrieval

Find tweets from the user’s communities:

// Retrieve candidates via community overlap
val candidates = userEmbedding.keys.toSeq.flatMap { communityId =>
  getTweetsFromCommunity(communityId)
}.distinct.sortBy(tweet => -tweetScore(tweet)).take(500)  // highest scores first
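One plausible way to score a tweet against a user in this sparse representation is a dot product over the communities they share. The Python sketch below reuses the example weights from the user and tweet embeddings above; treating the score as a plain sparse dot product is an assumption for illustration, and `tweet_b` and `tweet_c` are made-up entries.

```python
def sparse_dot(user_emb, tweet_emb):
    # Only communities present in both vectors contribute to the score.
    shared = user_emb.keys() & tweet_emb.keys()
    return sum(user_emb[c] * tweet_emb[c] for c in shared)

user_embedding = {1234: 0.8, 5678: 0.6, 9012: 0.3}
tweet_embeddings = {
    "tweet_a": {1234: 0.7, 5678: 0.4},
    "tweet_b": {9012: 0.9},
    "tweet_c": {4321: 1.0},  # shares no community with the user
}

scores = {
    tweet_id: sparse_dot(user_embedding, emb)
    for tweet_id, emb in tweet_embeddings.items()
}
# Keep only tweets with some overlap, best match first.
ranked = sorted((t for t in scores if scores[t] > 0), key=scores.get, reverse=True)
print(ranked)
```

Because both vectors are sparse, the cost of each comparison depends on the number of shared communities rather than on the total number of communities.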

TwHIN

Dense knowledge graph embeddings for Users and Tweets

Graph Construction
Embedding Training
Candidate Retrieval

Graph Construction: Build a heterogeneous graph with multiple entity types:

# TwHIN graph structure
graph = {
    'users': user_nodes,
    'tweets': tweet_nodes,
    'topics': topic_nodes,
    'edges': [
        ('user', 'follows', 'user'),
        ('user', 'likes', 'tweet'),
        ('user', 'interested_in', 'topic'),
        ('tweet', 'about', 'topic'),
    ]
}

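Embedding Training: Learn a dense embedding for each node from the graph's edges:

# Simplified TwHIN training
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwHIN(nn.Module):
    def __init__(self, num_users, num_tweets, dim=64):  # dim is illustrative
        super().__init__()
        # One learned vector per user and per tweet
        self.user_encoder = nn.Embedding(num_users, dim)
        self.tweet_encoder = nn.Embedding(num_tweets, dim)

    def forward(self, user_id, tweet_id):
        user_emb = self.user_encoder(user_id)
        tweet_emb = self.tweet_encoder(tweet_id)

        # Engagement logit: dot product of the two embeddings
        return (user_emb * tweet_emb).sum(dim=-1)

# Training objective: engaged pairs are labeled 1, sampled non-engaged pairs 0
pos_logits = model(user, positive_tweet)  # Engaged tweet
neg_logits = model(user, negative_tweet)  # Non-engaged tweet
loss = (
    F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits))
    + F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits))
)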

Candidate Retrieval: Use approximate nearest neighbor (ANN) search over the learned embeddings:

# Retrieve candidates via ANN
user_embedding = twhin_model.get_user_embedding(user_id)

# Find tweets with similar embeddings
candidate_tweets = ann_index.search(
    query=user_embedding,
    k=500,
    metric='cosine'
)
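An ANN index approximates the result of an exact nearest-neighbor scan while avoiding the cost of comparing the query against every item. For reference, the exact version it approximates can be sketched in a few lines of dependency-free Python. The two-dimensional vectors and the function names are made up for the example; real embeddings have many more dimensions.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def exact_top_k(query, item_vectors, k):
    # Brute-force scan: one similarity computation per item.
    return heapq.nlargest(
        k, item_vectors, key=lambda item: cosine(query, item_vectors[item])
    )

user_vec = [1.0, 0.0]
tweet_vecs = {
    "t1": [0.9, 0.1],   # nearly the same direction as the user
    "t2": [0.0, 1.0],   # orthogonal
    "t3": [0.5, 0.5],
    "t4": [-1.0, 0.0],  # opposite direction
}
nearest = exact_top_k(user_vec, tweet_vecs, k=2)
print(nearest)
```

The brute-force scan is linear in the number of tweets per query, which is why an approximate index is preferred at large scale.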

Real Graph

Predict likelihood of user-to-user interaction

// Real Graph scoring
val realGraphScore = realGraphModel.predict(
  sourceUser = userId,
  destUser = authorId,
  features = Seq(
    mutualFollowCount,
    recentInteractionCount,
    followDuration,
    commonInterests
  )
)

// Use for candidate weighting
val weightedScore = candidateScore * realGraphScore
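To illustrate how a set of user-pair features can be turned into an interaction probability and then used as a weight, here is a small Python sketch using a logistic model. The model form, the feature names, and every weight below are assumptions chosen for the example, not the production Real Graph model.

```python
import math

# Hypothetical weights and bias; a real model would learn these from data.
WEIGHTS = {
    "mutual_follow_count": 0.05,
    "recent_interaction_count": 0.3,
    "follow_duration_days": 0.001,
}
BIAS = -2.0

def interaction_probability(features):
    # Logistic model: squash a weighted feature sum into (0, 1).
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

close_tie = interaction_probability(
    {"mutual_follow_count": 20, "recent_interaction_count": 10, "follow_duration_days": 365}
)
weak_tie = interaction_probability(
    {"mutual_follow_count": 0, "recent_interaction_count": 0, "follow_duration_days": 30}
)

# The same candidate score is worth more when it comes from a close tie.
candidate_score = 0.5
weighted_close = candidate_score * close_tie
weighted_weak = candidate_score * weak_tie
print(round(close_tie, 3), round(weak_tie, 3))
```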

Candidate Pipeline Flow

Parallel Retrieval

Query all candidate sources simultaneously:

val candidateFutures = Future.collect(Seq(
  earlybirdSource.get(request),
  utegSource.get(request),
  tweetMixerSource.get(request),
  frsSource.get(request)
))

Candidate Merging

Combine candidates from all sources:

val allCandidates = candidateFutures.map { sources =>
  // Flatten only; duplicates are resolved in the deduplication step
  sources.flatten
}

Basic Filtering

Apply lightweight filters in the candidate pipeline:

// allCandidates is a Future, so filter the sequence inside it
val filtered = allCandidates.map(_.filter { candidate =>
  !isBlocked(candidate.authorId) &&
  !isMuted(candidate.authorId) &&
  !hasSeenRecently(candidate.tweetId) &&
  meetsQualityThreshold(candidate)
})

Deduplication

Remove duplicate candidates:

val deduplicated = filtered.map { candidates =>
  candidates
    .groupBy(_.tweetId)
    .values
    // Keep highest scoring version
    .map(_.maxBy(_.score))
    .toSeq
}

Pass to Ranking

Send candidates to feature hydration and scoring:

val rankedCandidates = scoringPipeline(
  candidates = deduplicated,
  maxToRank = 2000
)
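The merge, filter, deduplicate, and truncate steps can be traced end to end on a toy input. The Python sketch below is synchronous for brevity and uses a hypothetical `Candidate` record. The block and seen-tweet checks stand in for the fuller filter list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    tweet_id: int
    author_id: int
    score: float
    source: str

def run_pipeline(source_results, blocked_authors, seen_tweets, max_to_rank):
    # Merge: flatten per-source lists without dropping anything yet.
    merged = [c for results in source_results for c in results]
    # Filter: cheap per-candidate checks.
    filtered = [
        c for c in merged
        if c.author_id not in blocked_authors and c.tweet_id not in seen_tweets
    ]
    # Deduplicate: keep the highest-scoring copy of each tweet.
    best = {}
    for c in filtered:
        if c.tweet_id not in best or c.score > best[c.tweet_id].score:
            best[c.tweet_id] = c
    # Truncate: pass only the top max_to_rank candidates on to ranking.
    return sorted(best.values(), key=lambda c: c.score, reverse=True)[:max_to_rank]

earlybird = [Candidate(1, 10, 0.9, "earlybird"), Candidate(2, 11, 0.4, "earlybird")]
uteg = [Candidate(2, 11, 0.7, "uteg"), Candidate(3, 12, 0.8, "uteg")]
frs = [Candidate(4, 13, 0.6, "frs"), Candidate(5, 14, 0.5, "frs")]

result = run_pipeline(
    [earlybird, uteg, frs], blocked_authors={12}, seen_tweets={5}, max_to_rank=3
)
print([(c.tweet_id, c.source) for c in result])
```

Tweet 2 arrives from two sources; the copy with the higher score (from `uteg`) survives, which is why merging must not discard duplicates before the deduplication step runs.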

GraphJet Framework

Several components, including UTEG and Recos-Injector, are built around the GraphJet framework:

GraphJet

In-memory graph processing for real-time recommendations.

Key Features:

Real-time graph updates from user actions
Sub-millisecond graph traversal queries
Bipartite graph representation (users ↔ tweets)
Time-decayed edge weights for recency

// GraphJet graph structure
class UserTweetGraph(
  // Bipartite graph: users -> tweets
  userToTweets: Map[UserId, Seq[(TweetId, Timestamp)]],
  // Reverse index: tweets -> users
  tweetToUsers: Map[TweetId, Seq[(UserId, Timestamp)]]
) {
  def recommend(userId: UserId): Seq[TweetId] = {
    // 1. Get tweets user engaged with (IDs only, so timestamps do not affect matching)
    val seedTweets = userToTweets.getOrElse(userId, Seq.empty).map(_._1).toSet

    // 2. Find other users who engaged with same tweets
    val similarUsers = seedTweets.toSeq.flatMap { tweetId =>
      tweetToUsers.getOrElse(tweetId, Seq.empty).map(_._1)
    }.distinct.filterNot(_ == userId)

    // 3. Get tweets those users engaged with, excluding the seed tweets
    val candidates = similarUsers.flatMap { user =>
      userToTweets.getOrElse(user, Seq.empty).map(_._1)
    }.filterNot(seedTweets.contains)

    // 4. Score and rank
    candidates.groupBy(identity).map { case (tweetId, occurrences) =>
      (tweetId, occurrences.size)  // Score by frequency
    }.toSeq.sortBy(-_._2).map(_._1)
  }
}
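The traversal above scores candidates by raw frequency. The "time-decayed edge weights" feature can be sketched by replacing each count with a weight that shrinks as the engagement ages. The Python below assumes exponential decay with a one-hour half-life. The decay form, the half-life, and all names are illustrative assumptions rather than GraphJet's actual scheme.

```python
from collections import defaultdict

HALF_LIFE_SECONDS = 3600.0  # hypothetical: an engagement's weight halves every hour

def decay(age_seconds):
    return 0.5 ** (age_seconds / HALF_LIFE_SECONDS)

def recommend(user_id, engagements, now):
    # engagements: list of (user_id, tweet_id, timestamp) edges.
    user_to_tweets = defaultdict(list)
    tweet_to_users = defaultdict(set)
    for uid, tid, ts in engagements:
        user_to_tweets[uid].append((tid, ts))
        tweet_to_users[tid].add(uid)

    seed_ids = {tid for tid, _ in user_to_tweets[user_id]}
    neighbors = {u for tid in seed_ids for u in tweet_to_users[tid]} - {user_id}

    # Each neighbor engagement contributes its decayed weight, not a flat 1.
    scores = defaultdict(float)
    for neighbor in neighbors:
        for tid, ts in user_to_tweets[neighbor]:
            if tid not in seed_ids:
                scores[tid] += decay(now - ts)
    return sorted(scores, key=scores.get, reverse=True)

now = 10_000.0
edges = [
    ("alice", "t1", now - 100),
    ("bob", "t1", now - 200),
    ("carol", "t1", now - 300),
    ("bob", "t2", now - 7200),   # two half-lives old
    ("carol", "t3", now),        # fresh
    ("bob", "t3", now - 3600),   # one half-life old
    ("dave", "t4", now),         # dave shares no tweet with alice
]
recs = recommend("alice", edges, now)
print(recs)
```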

Performance Characteristics

Reduction Ratio

Roughly 200,000:1 to 500,000:1 reduction, from ~1 billion tweets down to ~2-5K candidates

Latency

50-200ms total for parallel candidate retrieval

Diversity

Multiple sources ensure diverse content perspectives

Freshness

Real-time graph updates capture latest user behavior

Candidate Quality Signals

Early quality filtering in candidate generation:

// Quality gates for candidates
val qualityCriteria = Seq(
  // Author reputation
  authorTweepCredScore > minReputationThreshold,
  
  // Early engagement
  earlyLikeCount > minEarlyEngagement,
  
  // Content safety
  !isFlaggedByTrustAndSafety,
  
  // Spam detection  
  !isLikelySpam,
  
  // Language match
  tweetLanguage.isCompatibleWith(userLanguages)
)
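A runnable variant of these gates helps show how they combine: a candidate passes only if every gate passes, and it is often useful to know which gate rejected it. The Python sketch below uses hypothetical field names and thresholds picked only for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CandidateInfo:
    author_reputation: float
    early_like_count: int
    flagged: bool
    likely_spam: bool
    language: str

# Hypothetical thresholds chosen only for this example.
MIN_REPUTATION = 0.3
MIN_EARLY_LIKES = 2

def failed_gates(candidate, user_languages):
    gates = {
        "reputation": candidate.author_reputation > MIN_REPUTATION,
        "early_engagement": candidate.early_like_count > MIN_EARLY_LIKES,
        "safety": not candidate.flagged,
        "spam": not candidate.likely_spam,
        "language": candidate.language in user_languages,
    }
    # An empty list means the candidate passed every gate.
    return [name for name, passed in gates.items() if not passed]

good = CandidateInfo(0.8, 12, flagged=False, likely_spam=False, language="en")
bad = CandidateInfo(0.1, 12, flagged=False, likely_spam=True, language="fr")

good_failures = failed_gates(good, {"en"})
bad_failures = failed_gates(bad, {"en"})
print(good_failures, bad_failures)
```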

Learn More

Ranking Systems

Learn how candidates are scored and ranked

Product Mixer

Explore the pipeline framework orchestrating candidate generation

Navi ML Serving

Understand how embedding models are served
