Overview

SimClusters is a general-purpose representation layer that uses overlapping communities to create sparse, interpretable vectors for users and heterogeneous content. It powers personalized tweet recommendations across X’s recommendation surfaces.

How It Works

1. Follow Graph as Bipartite Graph

SimClusters represents Twitter’s follow relationships as a bipartite graph with two node sets:
  • Producers: Users who are followed (~20M top followed users)
  • Consumers: Users who follow others
This bipartite graph can be represented as an m × n matrix A, where the m rows are consumers (u) and the n columns are producers (v).
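As a minimal sketch of this representation, assuming toy data and made-up user names, the follow graph can be held as a sparse consumer-by-producer structure in which entry (u, v) is 1 when consumer u follows producer v:

```python
# Toy follow edges (consumer, producer); all names are made up for illustration.
follows = [
    ("alice", "nasa"),
    ("alice", "espn"),
    ("bob", "nasa"),
    ("carol", "espn"),
    ("carol", "nba"),
]

# Index consumers (rows, u) and producers (columns, v).
consumers = sorted({u for u, _ in follows})
producers = sorted({v for _, v in follows})

# Sparse form of the m x n matrix A: consumer -> {producer: weight}.
A = {u: {} for u in consumers}
for u, v in follows:
    A[u][v] = 1.0

m, n = len(consumers), len(producers)

# Dense view of A, only sensible at toy sizes.
dense = [[A[u].get(v, 0.0) for v in producers] for u in consumers]
```

At production scale the matrix is far too large to materialize densely, which is why the sketch keeps only the nonzero entries.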

2. Community Detection (Known For)

The algorithm identifies communities of producers with similar followers:
  1. Producer-Producer Similarity: Computed as the cosine similarity between the sets of users who follow each pair of producers
  2. Graph Construction: Similarity scores create a weighted producer-producer graph
  3. Noise Removal: Edges below a threshold are deleted
  4. Community Detection: Metropolis-Hastings sampling identifies k communities
In production, SimClusters discovers approximately 145,000 communities from the top 20 million producers.
The result is an n × k “Known For” matrix (V) where each producer is affiliated with at most one community (maximally sparse).
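Steps 1 to 3 above can be sketched as follows. This is an illustration on toy data, not the production job: the follower sets and the threshold value are assumptions, and the Metropolis-Hastings community detection of step 4 is not shown.

```python
import math
from itertools import combinations

# Toy follower sets per producer (hypothetical data).
followers = {
    "nasa": {"alice", "bob", "dave"},
    "spacex": {"alice", "bob"},
    "espn": {"carol", "dave"},
    "nba": {"carol"},
}

def cosine(a, b):
    """Cosine similarity of two binary follower vectors, given as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

THRESHOLD = 0.5  # assumed value; the production threshold is not stated here

# Score every producer pair and keep only edges at or above the threshold.
edges = {}
for p, q in combinations(sorted(followers), 2):
    score = cosine(followers[p], followers[q])
    if score >= THRESHOLD:
        edges[(p, q)] = score
```

With this toy data only the nasa/spacex and espn/nba edges survive; a community detection step would then run on the weighted graph held in `edges`.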

3. Consumer Embeddings (InterestedIn)

The InterestedIn matrix (U) represents user interests:
U = A × V
Where:
  • A = Follow graph matrix
  • V = Known For matrix
InterestedIn embeddings capture users’ long-term interests and are a major source for consumer-based tweet recommendations.
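The product U = A × V can be sketched as a sparse matrix multiplication. The ids, community names, and the function name `interested_in` below are hypothetical, and any normalization applied in production is not shown:

```python
# Toy follow matrix A (consumer -> producers followed) and Known For matrix V
# (producer -> its single community). All ids are hypothetical.
A = {
    "alice": {"nasa": 1.0, "spacex": 1.0, "espn": 1.0},
    "bob": {"nasa": 1.0},
    "carol": {"espn": 1.0, "nba": 1.0},
}
V = {
    "nasa": {"space": 1.0},
    "spacex": {"space": 1.0},
    "espn": {"sports": 1.0},
    "nba": {"sports": 1.0},
}

def interested_in(A, V):
    """Sparse product U = A x V, stored as consumer -> {community: score}."""
    U = {}
    for consumer, followed in A.items():
        row = {}
        for producer, weight in followed.items():
            for community, affiliation in V.get(producer, {}).items():
                row[community] = row.get(community, 0.0) + weight * affiliation
        U[consumer] = row
    return U

U = interested_in(A, V)
```

Because A is m × n and V is n × k, each row of U is a k-dimensional vector that stays sparse: a consumer only gets a score for communities that contain at least one producer they follow.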

4. Producer Embeddings

Since Known For restricts each producer to a single community, producer embeddings provide a richer representation:
  • Calculated as cosine similarity between each producer’s follow graph and the InterestedIn vector for each community
  • Captures that users tweet about multiple topics and are “known” in multiple communities
  • Used for producer-based recommendations (e.g., suggesting tweets from accounts you just followed)
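One way to read the calculation above, sketched on toy data: treat each producer's followers as a vector over consumers, treat each community's InterestedIn scores as another vector over consumers, and take the cosine similarity between them. The data and variable names are assumptions for illustration.

```python
import math

# Toy inputs (hypothetical): who follows each producer, and InterestedIn scores.
followers = {
    "nasa": {"alice": 1.0, "bob": 1.0},
    "espn": {"alice": 1.0, "carol": 1.0},
}
U = {  # consumer -> {community: score}
    "alice": {"space": 2.0, "sports": 1.0},
    "bob": {"space": 1.0},
    "carol": {"sports": 2.0},
}

def cosine(x, y):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * y.get(k, 0.0) for k, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

# Transpose U into one column per community: community -> {consumer: score}.
columns = {}
for consumer, row in U.items():
    for community, score in row.items():
        columns.setdefault(community, {})[consumer] = score

# Producer embedding: similarity of the follower vector to each community column.
producer_embedding = {
    p: {c: cosine(f, col) for c, col in columns.items()}
    for p, f in followers.items()
}
```

Unlike Known For, the result gives a producer a nonzero score in every community that shares followers with it, so "nasa" here scores highest for "space" but still carries a smaller "sports" component.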

5. Entity Embeddings

Tweet Embeddings

  • Initialization: Empty vector when tweet is created
  • Updates: Each time a tweet is favorited, the InterestedIn vector of the user who favorited it is added
  • Dynamic: Changes over time as engagement occurs
  • Usage: Calculate tweet similarity and recommend similar tweets based on engagement history
A real-time Heron job updates tweet embeddings as favorites occur. See summingbird/README.md for details.
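The update rule above can be sketched in a few lines. This is an in-memory illustration with hypothetical ids and function names; it ignores streaming, storage, and any decay the real job may apply.

```python
import math
from collections import defaultdict

# Hypothetical InterestedIn vectors: consumer -> {community: score}.
interested_in = {
    "alice": {"space": 2.0, "sports": 1.0},
    "bob": {"space": 1.0},
    "carol": {"sports": 2.0},
}

# tweet_id -> embedding; a new tweet starts as an empty vector.
tweet_embeddings = defaultdict(dict)

def on_favorite(tweet_id, user_id):
    """Add the favoriting user's InterestedIn vector to the tweet embedding."""
    embedding = tweet_embeddings[tweet_id]
    for community, score in interested_in.get(user_id, {}).items():
        embedding[community] = embedding.get(community, 0.0) + score

def cosine(x, y):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * y.get(k, 0.0) for k, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

on_favorite("t1", "alice")
on_favorite("t1", "bob")
on_favorite("t2", "bob")
on_favorite("t3", "carol")

sim_12 = cosine(tweet_embeddings["t1"], tweet_embeddings["t2"])
sim_13 = cosine(tweet_embeddings["t1"], tweet_embeddings["t3"])
```

Here "t1" and "t2" end up close because both were favorited by users interested in "space", while "t3" is mostly orthogonal to them.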

Topic Embeddings

Topic embeddings (R) are computed as the cosine similarity between two signals:
  • The consumers interested in a community
  • Each consumer's aggregated favorites on tweets with topic annotations, with time decay applied
They are used for topic-related recommendations such as TopicFollow.
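A rough sketch of one possible reading of the topic embedding calculation follows. The exponential half-life, the data, and all names are assumptions; the source does not specify the decay schedule or the exact aggregation.

```python
import math

HALF_LIFE_DAYS = 30.0  # assumed; the production decay schedule is not given here

# Hypothetical favorites on topic-annotated tweets: (consumer, topic, age_in_days).
favorites = [
    ("alice", "astronomy", 0.0),
    ("bob", "astronomy", 30.0),
    ("carol", "basketball", 0.0),
]
U = {  # InterestedIn: consumer -> {community: score}
    "alice": {"space": 2.0, "sports": 1.0},
    "bob": {"space": 1.0},
    "carol": {"sports": 2.0},
}

def decay(age_days):
    """Exponential time decay: a favorite loses half its weight per half-life."""
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

def cosine(x, y):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * y.get(k, 0.0) for k, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

# topic -> {consumer: time-decayed favorite count}
topic_favs = {}
for consumer, topic, age in favorites:
    per_topic = topic_favs.setdefault(topic, {})
    per_topic[consumer] = per_topic.get(consumer, 0.0) + decay(age)

# community -> {consumer: InterestedIn score}
columns = {}
for consumer, row in U.items():
    for community, score in row.items():
        columns.setdefault(community, {})[consumer] = score

topic_embedding = {
    t: {c: cosine(favs, col) for c, col in columns.items()}
    for t, favs in topic_favs.items()
}
```

In this toy example "astronomy" lands almost entirely in the "space" community, and the older favorite from "bob" contributes half the weight of the fresh one from "alice".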

Architecture

Offline Jobs (Scalding)

Job | Code Location | Description
KnownFor | simclusters_v2/scalding/update_known_for/UpdateKnownFor20M145K2020.scala | Outputs KnownFor dataset storing clusterId ↔ producerUserId relationships for top 20M producers
InterestedIn Embeddings | simclusters_v2/scalding/InterestedInFromKnownFor.scala | Computes users’ InterestedIn embeddings from KnownFor dataset
Producer Embeddings | simclusters_v2/scalding/embedding/ProducerEmbeddingsFromInterestedIn.scala | Computes producer embeddings representing content users produce
Semantic Core Entity Embeddings | simclusters_v2/scalding/embedding/EntityToSimClustersEmbeddingsJob.scala | Computes semantic core entity embeddings (entityId ↔ clusterId mappings)
Topic Embeddings | simclusters_v2/scalding/embedding/tfg/FavTfgBasedTopicEmbeddings.scala | Generates fav-based Topic-Follow-Graph embeddings

GCP Jobs (BigQuery)

A GCP pipeline builds the SimClusters ANN indices via BigQuery, which enables faster iteration:
Job | Code Location | Description
PushOpenBased Index | scio/bq_generation/simclusters_index_generation/EngagementEventBasedClusterToTweetIndexGenerationJob.scala | Builds clusterId → TopTweet index based on user-open engagement for notifications
VideoViewBased Index | scio/bq_generation/simclusters_index_generation/EngagementEventBasedClusterToTweetIndexGenerationJob.scala | Builds clusterId → TopTweet index based on video view history for Home video recommendations
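To make the shape of a clusterId → TopTweet index concrete, here is a small in-memory sketch. It is not how the BigQuery jobs work: the per-cluster tweet scores are assumed to exist already (the real jobs derive them from engagement events), and the function names and ids are hypothetical.

```python
import heapq

# Hypothetical per-tweet cluster scores: tweet_id -> {cluster_id: score}.
tweet_scores = {
    "t1": {101: 3.0, 202: 1.0},
    "t2": {101: 1.0},
    "t3": {202: 2.0},
}

def build_cluster_index(scores, top_n=2):
    """Invert tweet scores into cluster_id -> top-N (score, tweet_id) pairs."""
    by_cluster = {}
    for tweet_id, per_cluster in scores.items():
        for cluster_id, score in per_cluster.items():
            by_cluster.setdefault(cluster_id, []).append((score, tweet_id))
    return {c: heapq.nlargest(top_n, pairs) for c, pairs in by_cluster.items()}

def candidates(user_vector, index, limit=3):
    """Score tweets reachable from the user's clusters with a sparse dot product."""
    scores = {}
    for cluster_id, interest in user_vector.items():
        for tweet_score, tweet_id in index.get(cluster_id, []):
            scores[tweet_id] = scores.get(tweet_id, 0.0) + interest * tweet_score
    # Highest score first; ties broken by tweet id for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:limit]

index = build_cluster_index(tweet_scores)
ranked = candidates({101: 1.0, 202: 0.5}, index)
```

Lookup only touches the clusters present in the user's sparse vector, which is what makes this kind of index cheap to query for candidate generation.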

Real-Time Streaming Jobs

Job | Code Location | Description
Tweet Embedding Job | simclusters_v2/summingbird/storm/TweetJob.scala | Generates real-time tweet embeddings and SimClusters index
Persistent Tweet Embedding Job | simclusters_v2/summingbird/storm/PersistentTweetJob.scala | Persists tweet embeddings from MemCache to Manhattan

Where It’s Used

For You Timeline

Powers candidate generation and ranking for personalized tweet recommendations

Notifications

Generates tweet candidates for push notifications via SimClusters ANN

Video Recommendations

Recommends videos on Home timeline using video view-based indices

Similar Content

Finds similar tweets and accounts based on embedding similarity

Key Benefits

  • Sparse & Interpretable: Community-based vectors are easy to understand and compute
  • Multi-Modal: Supports users, tweets, topics, and other entities in the same space
  • Real-Time: Tweet embeddings update as engagement occurs
  • Scalable: Handles 20M producers and 145K communities in production

Related Components

  • Representation Manager - Service to retrieve SimClusters embeddings (representation-manager/)
  • Representation Scorer - Computes similarity scores using embeddings (representation-scorer/)
  • SimClusters ANN - Fast approximate nearest neighbor search for recommendations (simclusters-ann/)
