SimClusters

Overview

SimClusters is a general-purpose representation layer that uses overlapping communities to create sparse, interpretable vectors for users and heterogeneous content. It powers personalized tweet recommendations across X’s product surfaces, including the For You timeline, notifications, and video recommendations.

The approach was published in the KDD 2020 Applied Data Science Track as “SimClusters: Community-Based Representations for Heterogeneous Recommendations at Twitter”.

How It Works

1. Follow Graph as Bipartite Graph

SimClusters represents Twitter’s follow relationships as a bipartite graph with two node sets:

Producers: Users who are followed (~20M top followed users)
Consumers: Users who follow others

This bipartite graph can be represented as an m × n matrix A with one row per consumer (u) and one column per producer (v). An entry is nonzero when that consumer follows that producer.
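As a minimal illustration, the sketch below builds that m × n matrix, which later sections call A, from a list of follow edges. The user names, the `follows` list, and the `build_follow_matrix` helper are hypothetical, and a graph of this size would be stored sparsely in practice rather than as dense lists.

```python
# Hypothetical follow edges: (consumer, producer) means the consumer follows the producer.
follows = [
    ("alice", "nasa"),
    ("alice", "espn"),
    ("bob", "nasa"),
    ("carol", "espn"),
    ("carol", "nba"),
]


def build_follow_matrix(edges):
    """Return (A, consumers, producers); A[i][j] == 1 iff consumer i follows producer j."""
    consumers = sorted({c for c, _ in edges})
    producers = sorted({p for _, p in edges})
    row = {c: i for i, c in enumerate(consumers)}
    col = {p: j for j, p in enumerate(producers)}
    A = [[0] * len(producers) for _ in consumers]
    for c, p in edges:
        A[row[c]][col[p]] = 1
    return A, consumers, producers


A, consumers, producers = build_follow_matrix(follows)
```

Here the rows are ordered alice, bob, carol and the columns espn, nasa, nba.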

2. Community Detection (Known For)

The algorithm identifies communities of producers with similar followers:

Producer-Producer Similarity: Computed as the cosine similarity between the sets of users who follow each of two producers
Graph Construction: Similarity scores create a weighted producer-producer graph
Noise Removal: Edges below a threshold are deleted
Community Detection: Metropolis-Hastings sampling identifies k communities
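The first three steps can be sketched on toy data as follows. This is only an illustration under stated assumptions: the follower sets, the threshold value of 0.5, and the helper names are hypothetical, and the Metropolis-Hastings community detection step itself is not shown.

```python
import math
from itertools import combinations

# Hypothetical follower sets: producer -> set of consumers who follow it.
followers = {
    "nasa": {"alice", "bob", "dave"},
    "spacex": {"alice", "bob"},
    "espn": {"carol", "dave"},
}


def cosine_similarity(a, b):
    """Cosine similarity of two binary follower sets: shared followers / sqrt(|a| * |b|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def similarity_graph(followers, threshold):
    """Weighted producer-producer edges, keeping only scores at or above the threshold."""
    edges = {}
    for p, q in combinations(sorted(followers), 2):
        score = cosine_similarity(followers[p], followers[q])
        if score >= threshold:
            edges[(p, q)] = score
    return edges


graph = similarity_graph(followers, threshold=0.5)
```

With this data only the nasa-spacex edge survives the threshold, because the two accounts share two of their followers while espn shares at most one with either.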

In production, SimClusters discovers approximately 145,000 communities from the top 20 million producers.

The result is an n × k “Known For” matrix (V) where each producer is affiliated with at most one community (maximally sparse).

3. Consumer Embeddings (InterestedIn)

The InterestedIn matrix (U) represents user interests:

U = A × V

Where:

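A = Follow graph matrix (m × n, consumers by producers)
V = Known For matrix (n × k, producers by communities)
U = InterestedIn matrix (m × k, consumers by communities)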

InterestedIn embeddings capture users’ long-term interests and are a major source for consumer-based tweet recommendations.
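The product U = A × V can be sketched in a few lines. The example assumes that V has 0/1 entries and stores it sparsely as a producer-to-community mapping (the `known_for` dictionary, a hypothetical name); any normalization or weighting applied in production is not shown.

```python
def interested_in(A, known_for, k):
    """Compute U = A x V, with V given sparsely as producer index -> community index.

    A is an m x n 0/1 follow matrix. The result U is m x k, and U[i][c] counts
    how many producers followed by consumer i are Known For community c.
    """
    U = [[0] * k for _ in A]
    for i, row in enumerate(A):
        for j, follows in enumerate(row):
            if follows and j in known_for:
                U[i][known_for[j]] += 1
    return U


# Hypothetical toy data: 3 consumers, 4 producers, 2 communities.
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
known_for = {0: 0, 1: 0, 2: 1}  # producer 3 belongs to no community

U = interested_in(A, known_for, k=2)
```

Because each producer has at most one community, each follow edge contributes to at most one entry of U, which keeps the product cheap to compute even though a consumer's resulting vector can span several communities.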

4. Producer Embeddings

Since Known For restricts each producer to a single community, producer embeddings (Ṽ) provide richer representation:

Calculated as the cosine similarity between the users who follow each producer and the InterestedIn scores for each community
Captures that users tweet about multiple topics and are “known” in multiple communities
Used for producer-based recommendations (e.g., recommending tweets similar to an account you just followed)
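A minimal sketch of this calculation is shown below. It reads the producer side as that producer's column of A (its followers) and the community side as the matching column of U; that reading, the toy matrices, and the helper names are assumptions made for illustration.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def producer_embeddings(A, U):
    """Return an n x k matrix: cosine of each producer's follower column with each community column."""
    n, k = len(A[0]), len(U[0])
    embeddings = []
    for j in range(n):
        follower_col = [row[j] for row in A]
        embeddings.append(
            [cosine(follower_col, [u[c] for u in U]) for c in range(k)]
        )
    return embeddings


# Hypothetical toy data: 3 consumers, 3 producers, 2 communities.
A = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
U = [
    [2, 0],
    [1, 1],
    [0, 1],
]
V_tilde = producer_embeddings(A, U)
```

In this toy data producer 1 is followed by consumers with interests in both communities, so its embedding has a nonzero score in each, which a single Known For assignment could not express.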

5. Entity Embeddings

Tweet Embeddings

Initialization: Empty vector when tweet is created
Updates: Each time a tweet is favorited, the InterestedIn vector of the user who favorited it is added
Dynamic: Changes over time as engagement occurs
Usage: Calculate tweet similarity and recommend similar tweets based on engagement history

A real-time Heron job updates tweet embeddings as favorites occur. See summingbird/README.md for details.
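The update rule described above can be sketched as a small in-memory example. The user names, community ids, scores, and the `on_favorite` handler are hypothetical; this is not the code of the streaming job, which also has to handle storage and scale.

```python
from collections import defaultdict

# Hypothetical sparse InterestedIn vectors: user -> {community id: score}.
interested_in = {
    "alice": {7: 0.8, 12: 0.2},
    "bob": {7: 0.5, 30: 0.5},
}

# Tweet embeddings start empty: tweet id -> {community id: score}.
tweet_embeddings = defaultdict(dict)


def on_favorite(tweet_id, user_id):
    """Add the favoriting user's InterestedIn vector to the tweet's embedding."""
    embedding = tweet_embeddings[tweet_id]
    for community, score in interested_in.get(user_id, {}).items():
        embedding[community] = embedding.get(community, 0.0) + score


on_favorite("tweet-1", "alice")
on_favorite("tweet-1", "bob")
```

After both favorites the tweet has weight in communities 7, 12, and 30, with community 7 dominant because both users are interested in it.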

Topic Embeddings

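Topic embeddings (R) are computed as follows:

Inputs: For each community, the consumers interested in it; for each topic, the aggregated favorites each consumer has given to tweets carrying that topic annotation
Score: Cosine similarity between those two per-consumer vectors
Time decay: Applied to the aggregated favorites
Usage: Topic-related recommendations like TopicFollow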
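One way to read this calculation is sketched below: for a single topic, build a per-consumer vector of time-decayed favorite counts, then take its cosine similarity with each community's column of U. The exponential decay with a 7-day half-life, the event format, and all names are assumptions, since the decay schedule is not specified here.

```python
import math

HALF_LIFE_DAYS = 7.0  # hypothetical value; the actual decay schedule is not stated


def decayed_fav_counts(fav_events, consumers, now_day):
    """Per-consumer favorite counts on one topic's tweets, with exponential time decay."""
    counts = {c: 0.0 for c in consumers}
    for consumer, day in fav_events:
        counts[consumer] += 0.5 ** ((now_day - day) / HALF_LIFE_DAYS)
    return [counts[c] for c in consumers]


def cosine(x, y):
    """Cosine similarity of two equal-length dense vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def topic_embedding(fav_vector, U):
    """Cosine of the topic's favorite vector with each community column of U."""
    k = len(U[0])
    return [cosine(fav_vector, [row[c] for row in U]) for c in range(k)]


# Hypothetical toy data: rows of U follow the order of `consumers`.
consumers = ["alice", "bob", "carol"]
U = [[2, 0], [1, 1], [0, 1]]
fav_events = [("alice", 14), ("alice", 7), ("bob", 14)]  # (consumer, day of favorite)

favs = decayed_fav_counts(fav_events, consumers, now_day=14)
R_topic = topic_embedding(favs, U)
```

The favorite from a week ago counts half as much as one from today, and the topic ends up closest to community 0, whose interested consumers gave most of the favorites.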

Architecture

Offline Jobs (Scalding)

Job	Code Location	Description
KnownFor	`simclusters_v2/scalding/update_known_for/UpdateKnownFor20M145K2020.scala`	Outputs KnownFor dataset storing clusterId ↔ producerUserId relationships for top 20M producers
InterestedIn Embeddings	`simclusters_v2/scalding/InterestedInFromKnownFor.scala`	Computes users’ InterestedIn embeddings from KnownFor dataset
Producer Embeddings	`simclusters_v2/scalding/embedding/ProducerEmbeddingsFromInterestedIn.scala`	Computes producer embeddings representing content users produce
Semantic Core Entity Embeddings	`simclusters_v2/scalding/embedding/EntityToSimClustersEmbeddingsJob.scala`	Computes semantic core entity embeddings (entityId ↔ clusterId mappings)
Topic Embeddings	`simclusters_v2/scalding/embedding/tfg/FavTfgBasedTopicEmbeddings.scala`	Generates fav-based Topic-Follow-Graph embeddings

GCP Jobs (BigQuery)

A GCP pipeline builds the SimClusters ANN indices in BigQuery, which allows faster iteration than the Scalding jobs:

Job	Code Location	Description
PushOpenBased Index	`scio/bq_generation/simclusters_index_generation/EngagementEventBasedClusterToTweetIndexGenerationJob.scala`	Builds clusterId → TopTweet index based on user-open engagement for notifications
VideoViewBased Index	`scio/bq_generation/simclusters_index_generation/EngagementEventBasedClusterToTweetIndexGenerationJob.scala`	Builds clusterId → TopTweet index based on video view history for Home video recommendations
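To make the clusterId → TopTweet idea concrete, the sketch below inverts per-tweet cluster scores into a per-cluster list of top tweets and then uses it to rank candidates for a user. It is a simplified stand-in rather than the BigQuery jobs or the ANN service: the scores, the `top_k` cutoff, the dot-product ranking, and all names are assumptions, and the engagement aggregation that would produce the per-tweet scores is not shown.

```python
import heapq
from collections import defaultdict


def build_cluster_index(tweet_scores, top_k):
    """Invert tweet -> {cluster: score} into cluster -> top_k (score, tweet) pairs."""
    by_cluster = defaultdict(list)
    for tweet_id, scores in tweet_scores.items():
        for cluster_id, score in scores.items():
            by_cluster[cluster_id].append((score, tweet_id))
    return {c: heapq.nlargest(top_k, pairs) for c, pairs in by_cluster.items()}


def candidates(user_embedding, index):
    """Rank tweets reachable from the user's clusters by a sparse dot product."""
    totals = defaultdict(float)
    for cluster_id, user_score in user_embedding.items():
        for tweet_score, tweet_id in index.get(cluster_id, []):
            totals[tweet_id] += user_score * tweet_score
    # Highest score first; ties broken by tweet id for a stable order.
    return sorted(totals, key=lambda t: (-totals[t], t))


# Hypothetical per-tweet cluster scores and a sparse user vector.
tweet_scores = {
    "t1": {7: 3.0, 12: 1.0},
    "t2": {7: 1.0},
    "t3": {30: 2.0},
}
index = build_cluster_index(tweet_scores, top_k=2)
ranked = candidates({7: 0.9, 30: 0.1}, index)
```

Only the clusters present in the user's sparse vector are looked up, so the cost of retrieval depends on the number of nonzero entries rather than on the total number of tweets.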

Real-Time Streaming Jobs

Job	Code Location	Description
Tweet Embedding Job	`simclusters_v2/summingbird/storm/TweetJob.scala`	Generates real-time tweet embeddings and SimClusters index
Persistent Tweet Embedding Job	`simclusters_v2/summingbird/storm/PersistentTweetJob.scala`	Persists tweet embeddings from MemCache to Manhattan

Where It’s Used

For You Timeline

Powers candidate generation and ranking for personalized tweet recommendations

Notifications

Generates tweet candidates for push notifications via SimClusters ANN

Video Recommendations

Recommends videos on Home timeline using video view-based indices

Key Benefits

Sparse & Interpretable: Community-based vectors are easy to understand and compute
Multi-Modal: Supports users, tweets, topics, and other entities in the same space
Real-Time: Tweet embeddings update as engagement occurs
Scalable: Handles 20M producers and 145K communities in production

Related Components

Representation Manager: Service to retrieve SimClusters embeddings (representation-manager/)
Representation Scorer: Computes similarity scores using embeddings (representation-scorer/)
SimClusters ANN: Fast approximate nearest neighbor search for recommendations (simclusters-ann/)
