Candidate Generation

Candidate generation is the critical first stage of X’s recommendation pipeline, responsible for narrowing down approximately 1 billion potential tweets to a manageable set of thousands of candidates for downstream ranking. This process leverages diverse candidate sources and user behavior signals.

Overview

The candidate sourcing stage uses X user behavior as the primary input to identify potentially relevant content. Multiple specialized systems work in parallel to retrieve candidates from different perspectives.

Input: ~1 billion tweets

Process: Multi-source retrieval

Output: ~2-5K candidates
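Taking the approximate figures above at face value, the implied reduction ratio can be worked out directly. The constants in this sketch are the rounded estimates quoted on this page, not measured values, and the object name is only illustrative.

```scala
object ReductionRatio {
  // Approximate figures quoted above; treated here as assumptions.
  val inputTweets: Long = 1000000000L
  val minCandidates: Long = 2000L
  val maxCandidates: Long = 5000L

  // Ratio of input tweets to surviving candidates.
  def ratio(candidates: Long): Long = inputTweets / candidates

  def main(args: Array[String]): Unit =
    println(s"Reduction: ${ratio(maxCandidates)}:1 to ${ratio(minCandidates)}:1")
}
```

With these inputs the stage discards all but roughly one tweet in every 200,000 to 500,000.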

Candidate Sources

Home Mixer orchestrates multiple candidate sources to retrieve diverse content:

For You Timeline Sources

Earlybird Search Index: Find and rank tweets from accounts the user follows
  • Coverage: ~50% of For You timeline candidates
  • Method: Search index traversal with light ranker scoring
  • Pipeline: ScoredTweetsInNetworkCandidatePipelineConfig
// In-network candidate retrieval
val inNetworkCandidates = earlybirdClient.search(
  userId = request.userId,
  followedUserIds = socialGraph.getFollowing(request.userId),
  maxResults = 2000,
  rankingMode = LightRanker
)
Earlybird combines candidate retrieval with light ranking for efficient in-network scoring
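To make "retrieval plus light ranking" concrete, here is a minimal in-memory sketch. It is not the Earlybird API: the `Tweet` record, the `lightScore` formula, and the `retrieve` helper are hypothetical stand-ins chosen only to show the shape of the operation (restrict to followed authors, score cheaply, keep the top results).

```scala
object InNetworkSketch {
  // Hypothetical tweet record; a real search index stores far more fields.
  final case class Tweet(id: Long, authorId: Long, likes: Int, ageMinutes: Int)

  // Assumed stand-in for a light ranker: rewards engagement, penalizes age.
  def lightScore(t: Tweet): Double =
    t.likes.toDouble / (1.0 + t.ageMinutes / 60.0)

  // Keep only tweets from followed authors, then take the top-scoring ones.
  def retrieve(index: Seq[Tweet], following: Set[Long], maxResults: Int): Seq[Tweet] =
    index
      .filter(t => following.contains(t.authorId))
      .sortBy(t => -lightScore(t))
      .take(maxResults)
}
```

Scoring inside the retrieval step means only the best in-network tweets are ever handed to the more expensive downstream ranker.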

User Signals for Candidate Sourcing

Candidate sources use diverse user behavior signals to identify relevant content:

Explicit Signals

Social Graph

  • Author Follow: Accounts the user follows
  • Author Unfollow: Recently unfollowed accounts
  • Author Mute: Muted accounts
  • Author Block: Blocked accounts

Tweet Engagement

  • Tweet Favorite: Liked tweets
  • Tweet Unfavorite: Unliked tweets
  • Retweet: Retweeted content
  • Quote Tweet: Retweets with comments
  • Tweet Reply: Replied to tweets
  • Tweet Share: Shared tweets
  • Tweet Bookmark: Bookmarked content

Negative Signals

  • Tweet Don’t Like: “Not interested” feedback
  • Tweet Report: Reported tweets

Implicit Signals

  • Tweet Click: Viewed tweet details
  • Tweet Video Watch: Video watch time
  • Notification Open: Opened push notifications
  • Ntab Click: Clicks from notifications tab

Signal Usage by Component

Different candidate sources use signals as features and/or training labels:
USS = User Signal Service, FRS = Follow Recommendation Service
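| Signal            | USS      | SimClusters     | TwHIN           | UTEG     | FRS             | Light Ranking   |
|-------------------|----------|-----------------|-----------------|----------|-----------------|-----------------|
| Author Follow     | Features | Features/Labels | Features/Labels | Features | Features/Labels | N/A             |
| Tweet Favorite    | Features | Features        | Features/Labels | Features | Features/Labels | Features/Labels |
| Retweet           | Features | N/A             | Features/Labels | Features | Features/Labels | Features/Labels |
| Quote Tweet       | Features | N/A             | Features/Labels | Features | Features/Labels | Features/Labels |
| Tweet Reply       | Features | N/A             | Features        | Features | Features/Labels | Features        |
| Tweet Click       | Features | N/A             | N/A             | N/A      | Features        | Labels          |
| Video Watch       | Features | Features        | N/A             | N/A      | N/A             | Labels          |
| Notification Open | Features | Features        | Features        | N/A      | Features        | N/A             |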
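The signal matrix above can also be held as a small lookup structure, which makes questions such as "which signals does a component train on?" easy to answer. This is only a transcription of the table into Scala; the `SignalUsage` object and its helper names are illustrative and do not correspond to any real service API.

```scala
object SignalUsage {
  sealed trait Usage
  case object Features extends Usage
  case object Labels extends Usage
  case object FeaturesAndLabels extends Usage
  case object NotUsed extends Usage

  // Column order of the table.
  val components: Seq[String] =
    Seq("USS", "SimClusters", "TwHIN", "UTEG", "FRS", "Light Ranking")

  private val F: Usage = Features
  private val L: Usage = Labels
  private val FL: Usage = FeaturesAndLabels
  private val NA: Usage = NotUsed

  // Rows transcribed from the table, one entry per component.
  val table: Map[String, Seq[Usage]] = Map(
    "Author Follow"     -> Seq(F, FL, FL, F, FL, NA),
    "Tweet Favorite"    -> Seq(F, F, FL, F, FL, FL),
    "Retweet"           -> Seq(F, NA, FL, F, FL, FL),
    "Quote Tweet"       -> Seq(F, NA, FL, F, FL, FL),
    "Tweet Reply"       -> Seq(F, NA, F, F, FL, F),
    "Tweet Click"       -> Seq(F, NA, NA, NA, F, L),
    "Video Watch"       -> Seq(F, F, NA, NA, NA, L),
    "Notification Open" -> Seq(F, F, F, NA, F, NA)
  )

  // How a given component uses a given signal, if both are known.
  def usage(signal: String, component: String): Option[Usage] = {
    val idx = components.indexOf(component)
    if (idx < 0) None else table.get(signal).map(row => row(idx))
  }

  // Signals that a component uses as training labels, sorted by name.
  def signalsUsedAsLabels(component: String): Seq[String] = {
    val idx = components.indexOf(component)
    if (idx < 0) Seq.empty
    else table.toSeq.collect {
      case (signal, row) if row(idx) == Labels || row(idx) == FeaturesAndLabels => signal
    }.sorted
  }
}
```

For example, `SignalUsage.signalsUsedAsLabels("Light Ranking")` lists the five engagement signals the light ranker is trained against.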

Candidate Source Algorithms

SimClusters

Community detection and sparse embeddings

1. Community Detection

Identify communities of users with similar interests:
// SimClusters community detection
val communities = detectCommunities(
  userFollowGraph,
  numCommunities = 145000
)

2. User Embeddings

Represent users as sparse vectors over communities:
// User representation
val userEmbedding = Map(
  communityId_1234 -> 0.8,  // Strong affinity
  communityId_5678 -> 0.6,  // Medium affinity
  communityId_9012 -> 0.3   // Weak affinity
)

3. Tweet Embeddings

Represent tweets based on engagement from community members:
// Tweet representation  
val tweetEmbedding = Map(
  communityId_1234 -> 0.7,  // Engaged by community 1234
  communityId_5678 -> 0.4   // Engaged by community 5678
)

4. Candidate Retrieval

Find tweets from the user's communities:
// Retrieve candidates via community overlap
val candidates = userEmbedding.keys.toSeq
  .flatMap(communityId => getTweetsFromCommunity(communityId))
  .distinct                          // a tweet can belong to several communities
  .sortBy(tweet => -tweetScore(tweet)) // highest score first
  .take(500)
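One natural way to score a tweet against a user in this representation is a dot product over the communities the two sparse vectors share. The sketch below assumes that scoring rule; the page does not state the exact similarity function SimClusters uses, and the `SparseEmbeddingSketch` names are illustrative.

```scala
object SparseEmbeddingSketch {
  // Sparse embedding: community id -> affinity weight.
  type Embedding = Map[Int, Double]

  // Dot product over shared communities; iterates the smaller map.
  def dot(a: Embedding, b: Embedding): Double = {
    val (small, large) = if (a.size <= b.size) (a, b) else (b, a)
    small.iterator.map { case (id, w) => w * large.getOrElse(id, 0.0) }.sum
  }

  // Rank tweets by similarity to the user, dropping tweets with no overlap.
  def topTweets(user: Embedding, tweets: Map[Long, Embedding], k: Int): Seq[(Long, Double)] =
    tweets.toSeq
      .map { case (tweetId, emb) => (tweetId, dot(user, emb)) }
      .filter(_._2 > 0.0)
      .sortBy(-_._2)
      .take(k)
}
```

Using the example vectors from steps 2 and 3, the user and tweet overlap on communities 1234 and 5678, giving a score of about 0.8 × 0.7 + 0.6 × 0.4 = 0.80. Because only shared communities contribute, the cost scales with the number of non-zero entries rather than with the total number of communities.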

TwHIN

Dense knowledge graph embeddings for users and tweets.
Build a heterogeneous graph with multiple entity types:
# TwHIN graph structure
graph = {
    'users': user_nodes,
    'tweets': tweet_nodes,
    'topics': topic_nodes,
    'edges': [
        ('user', 'follows', 'user'),
        ('user', 'likes', 'tweet'),
        ('user', 'interested_in', 'topic'),
        ('tweet', 'about', 'topic'),
    ]
}
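Once every node and relation type in such a graph has a dense vector, an edge can be scored with a translation-style function: the source vector is shifted by the relation vector and compared with the target. The sketch below assumes that form, (source + relation) · target, which is common for knowledge graph embeddings; this page does not specify TwHIN's exact scoring function, and the helper names and toy vectors are hypothetical.

```scala
object DenseEmbeddingSketch {
  type Vec = Array[Double]

  // Element-wise sum of two vectors of equal dimension.
  def add(a: Vec, b: Vec): Vec = {
    require(a.length == b.length, "vectors must have the same dimension")
    a.zip(b).map { case (x, y) => x + y }
  }

  // Dot product of two vectors of equal dimension.
  def dot(a: Vec, b: Vec): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Assumed translation-style edge score: (source + relation) . target
  def edgeScore(source: Vec, relation: Vec, target: Vec): Double =
    dot(add(source, relation), target)
}
```

A higher `edgeScore` for (user, likes, tweet) would indicate that the tweet is a more plausible candidate for that user.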

Real Graph

Predict likelihood of user-to-user interaction
// Real Graph scoring
val realGraphScore = realGraphModel.predict(
  sourceUser = userId,
  destUser = authorId,
  features = Seq(
    mutualFollowCount,
    recentInteractionCount,
    followDuration,
    commonInterests
  )
)

// Use for candidate weighting
val weightedScore = candidateScore * realGraphScore
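The snippet above treats `realGraphModel` as a black box. As a minimal illustration of what such a predictor could look like, the sketch below assumes a logistic model over the same four features. The model family, the weights, and the bias are all made up for the example and say nothing about the production Real Graph model.

```scala
object RealGraphSketch {
  final case class EdgeFeatures(
      mutualFollowCount: Double,
      recentInteractionCount: Double,
      followDurationDays: Double,
      commonInterests: Double
  )

  // Hypothetical weights and bias, chosen only for illustration.
  private val weights = Seq(0.02, 0.3, 0.001, 0.1)
  private val bias = -2.0

  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability-like interaction score in (0, 1).
  def predict(f: EdgeFeatures): Double = {
    val xs = Seq(f.mutualFollowCount, f.recentInteractionCount,
      f.followDurationDays, f.commonInterests)
    sigmoid(bias + xs.zip(weights).map { case (x, w) => x * w }.sum)
  }

  // Scale a candidate's score by the predicted user-to-author affinity.
  def weightedScore(candidateScore: Double, f: EdgeFeatures): Double =
    candidateScore * predict(f)
}
```

Because the prediction lies between 0 and 1, multiplying by it can only shrink a candidate's score, so tweets from authors the user rarely interacts with are pushed down rather than removed.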

Candidate Pipeline Flow

1. Parallel Retrieval

Query all candidate sources simultaneously:
// Future.collect turns the Seq of Futures into one Future of per-source results
val candidateFutures = Future.collect(Seq(
  earlybirdSource.get(request),
  utegSource.get(request),
  tweetMixerSource.get(request),
  frsSource.get(request)
))

2. Candidate Merging

Combine candidates from all sources:
// Duplicates are kept here so that step 4 can choose the best-scoring copy
val allCandidates = candidateFutures.map { sources =>
  sources.flatten
}

3. Basic Filtering

Apply lightweight filters in the candidate pipeline:
val filtered = allCandidates.map { candidates =>
  candidates.filter { candidate =>
    !isBlocked(candidate.authorId) &&
    !isMuted(candidate.authorId) &&
    !hasSeenRecently(candidate.tweetId) &&
    meetsQualityThreshold(candidate)
  }
}

4. Deduplication

Remove duplicate candidates:
val deduplicated = filtered.map { candidates =>
  candidates
    .groupBy(_.tweetId)
    .map { case (_, duplicates) =>
      // Keep highest scoring version
      duplicates.maxBy(_.score)
    }
    .toSeq
}

5. Pass to Ranking

Send candidates to feature hydration and scoring:
val rankedCandidates = deduplicated.map { candidates =>
  scoringPipeline(
    candidates = candidates,
    maxToRank = 2000
  )
}
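The merge, filter, and deduplicate steps above can be tried end to end on plain collections, without futures or real services. In this sketch the `Candidate` fields, the exclusion sets, and the `run` helper are illustrative assumptions; block and mute checks are collapsed into a single set of excluded authors.

```scala
object CandidatePipelineSketch {
  final case class Candidate(tweetId: Long, authorId: Long, score: Double, source: String)

  // Merge per-source results, drop excluded authors and already-seen tweets,
  // keep the highest-scoring copy of each tweet, and cap the output size.
  def run(
      sources: Seq[Seq[Candidate]],
      excludedAuthors: Set[Long],
      seenTweets: Set[Long],
      maxToRank: Int
  ): Seq[Candidate] =
    sources.flatten
      .filterNot(c => excludedAuthors.contains(c.authorId))
      .filterNot(c => seenTweets.contains(c.tweetId))
      .groupBy(_.tweetId)
      .values
      .map(_.maxBy(_.score))
      .toSeq
      .sortBy(c => -c.score)
      .take(maxToRank)
}
```

Filtering before grouping keeps the deduplication step small, and grouping by `tweetId` is what lets the same tweet arrive from several sources while only its best-scoring copy moves on to ranking.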

GraphJet Framework

Many candidate sources (UTEG, Recos-Injector) use the GraphJet framework:

GraphJet

In-memory graph processing for real-time recommendations

Key Features:
  • Real-time graph updates from user actions
  • Sub-millisecond graph traversal queries
  • Bipartite graph representation (users ↔ tweets)
  • Time-decayed edge weights for recency
// GraphJet-style graph structure (simplified illustration, not the GraphJet API)
class UserTweetGraph(
  // Bipartite graph: users -> tweets
  userToTweets: Map[UserId, Seq[(TweetId, Timestamp)]],
  // Reverse index: tweets -> users
  tweetToUsers: Map[TweetId, Seq[(UserId, Timestamp)]]
) {
  def recommend(userId: UserId): Seq[TweetId] = {
    // 1. Get the IDs of tweets the user engaged with
    val seedTweets = userToTweets.getOrElse(userId, Seq.empty).map(_._1).toSet

    // 2. Find other users who engaged with the same tweets
    val similarUsers = seedTweets.toSeq
      .flatMap(tweetId => tweetToUsers.getOrElse(tweetId, Seq.empty).map(_._1))
      .distinct
      .filterNot(_ == userId)

    // 3. Get tweets those users engaged with, excluding the seed tweets
    //    (compare by tweet ID; timestamps differ from user to user)
    val candidates = similarUsers
      .flatMap(user => userToTweets.getOrElse(user, Seq.empty).map(_._1))
      .filterNot(seedTweets.contains)

    // 4. Score by frequency and rank, most frequent first
    candidates.groupBy(identity).map { case (tweetId, occurrences) =>
      (tweetId, occurrences.size)
    }.toSeq.sortBy(-_._2).map(_._1)
  }
}
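The feature list mentions time-decayed edge weights, while the traversal above scores purely by frequency. One simple way to add recency is exponential decay with a half-life, sketched below. The decay form, the half-life parameter, and all names here are assumptions for illustration; this page does not describe how GraphJet actually ages its edges.

```scala
object TimeDecaySketch {
  // Assumed exponential decay: an edge loses half its weight every halfLifeMs.
  def decayedWeight(ageMs: Long, halfLifeMs: Long): Double = {
    require(ageMs >= 0 && halfLifeMs > 0, "age must be >= 0 and half-life > 0")
    math.pow(0.5, ageMs.toDouble / halfLifeMs.toDouble)
  }

  // Score tweets by summing the decayed weights of their engagement edges.
  // Each engagement is (tweetId, timestampMs); result is sorted best first.
  def score(engagements: Seq[(Long, Long)], nowMs: Long, halfLifeMs: Long): Seq[(Long, Double)] =
    engagements
      .groupBy(_._1)
      .map { case (tweetId, edges) =>
        val total = edges.map { case (_, ts) =>
          decayedWeight(math.max(0L, nowMs - ts), halfLifeMs)
        }.sum
        tweetId -> total
      }
      .toSeq
      .sortBy(-_._2)
}
```

With this weighting, a single fresh engagement can outrank an older one on a different tweet, which is the behavior the recency feature is meant to capture.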

Performance Characteristics

Reduction Ratio

Roughly 200,000:1 to 500,000:1 reduction from all tweets (~1 billion) to candidates (~2-5K)

Latency

50-200ms total for parallel candidate retrieval

Diversity

Multiple sources ensure diverse content perspectives

Freshness

Real-time graph updates capture latest user behavior

Candidate Quality Signals

Early quality filtering in candidate generation:
// Quality gates for candidates
val qualityCriteria = Seq(
  // Author reputation
  authorTweepCredScore > minReputationThreshold,
  
  // Early engagement
  earlyLikeCount > minEarlyEngagement,
  
  // Content safety
  !isFlaggedByTrustAndSafety,
  
  // Spam detection  
  !isLikelySpam,
  
  // Language match
  tweetLanguage.isCompatibleWith(userLanguages)
)
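The criteria above are written as bare boolean expressions. A runnable variant can name each gate so that a rejected candidate can be explained. In this sketch the `Candidate` fields, the `Thresholds` values, and the exact-match language check are hypothetical simplifications; the real thresholds and classifiers are not given on this page.

```scala
object QualityGateSketch {
  final case class Candidate(
      authorReputation: Double,
      earlyLikeCount: Int,
      flaggedUnsafe: Boolean,
      likelySpam: Boolean,
      language: String
  )

  // Hypothetical thresholds; real values are not given on this page.
  final case class Thresholds(minReputation: Double, minEarlyLikes: Int)

  // Each gate is a named predicate that returns true when the candidate passes.
  def gates(t: Thresholds, userLanguages: Set[String]): Seq[(String, Candidate => Boolean)] = Seq(
    "reputation" -> ((c: Candidate) => c.authorReputation > t.minReputation),
    "engagement" -> ((c: Candidate) => c.earlyLikeCount > t.minEarlyLikes),
    "safety"     -> ((c: Candidate) => !c.flaggedUnsafe),
    "spam"       -> ((c: Candidate) => !c.likelySpam),
    "language"   -> ((c: Candidate) => userLanguages.contains(c.language))
  )

  // Names of the gates a candidate fails, in declaration order.
  def failedGates(c: Candidate, t: Thresholds, userLanguages: Set[String]): Seq[String] =
    gates(t, userLanguages).collect { case (name, ok) if !ok(c) => name }

  def passes(c: Candidate, t: Thresholds, userLanguages: Set[String]): Boolean =
    failedGates(c, t, userLanguages).isEmpty
}
```

Keeping the gates as data rather than one long boolean expression makes it straightforward to log which check removed a candidate.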

Learn More

Ranking Systems

Learn how candidates are scored and ranked

Product Mixer

Explore the pipeline framework orchestrating candidate generation

Navi ML Serving

Understand how embedding models are served
