Overview

TwHIN (Twitter Heterogeneous Information Network) creates dense knowledge graph embeddings for users and tweets. Unlike SimClusters, which produces sparse community-based embeddings, TwHIN generates dense vector representations by learning from the heterogeneous interaction graph on X.
For detailed information about TwHIN, see the TwHIN project documentation in the algorithm-ml repository.

How It Works

Heterogeneous Graph Learning

TwHIN models X as a heterogeneous information network containing:
  • Users: Account entities
  • Tweets: Post content
  • Interactions: Follows, favorites, retweets, replies, and other engagement types
The model learns dense embeddings by capturing the structure and relationships within this heterogeneous graph.
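As an illustration of the graph described above, the following is a minimal sketch of how typed edges between users and tweets could be represented. The `Edge` class, the entity IDs, and the translation-style scoring function are all assumptions made for this example; the source does not state which scoring function TwHIN uses, and translation-style scoring is only one common choice for knowledge graph embeddings.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One typed interaction in the heterogeneous graph (hypothetical schema)."""
    source: str    # e.g. "user:alice"
    relation: str  # e.g. "follows", "favorites", "retweets", "replies"
    target: str    # e.g. "user:bob" or "tweet:101"


# Toy interactions, invented for illustration only.
EDGES = [
    Edge("user:alice", "follows", "user:bob"),
    Edge("user:alice", "favorites", "tweet:101"),
    Edge("user:bob", "retweets", "tweet:101"),
    Edge("user:bob", "replies", "tweet:102"),
]


def group_by_relation(edges):
    """Group (source, target) pairs by relation type."""
    grouped = defaultdict(list)
    for edge in edges:
        grouped[edge.relation].append((edge.source, edge.target))
    return dict(grouped)


def translation_score(source_vec, relation_vec, target_vec):
    """Assumed scoring rule: dot product of (source + relation) with target.

    A higher score means the edge is considered more plausible.
    """
    return sum((s + r) * t for s, r, t in zip(source_vec, relation_vec, target_vec))
```

Keeping the relation type on every edge is what makes the graph heterogeneous: a model trained on it can learn a separate vector per relation in addition to one vector per user or tweet.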

Dense Embeddings

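Unlike SimClusters’ sparse community vectors, TwHIN produces dense embeddings. For contrast, a SimClusters vector looks like this:
User A: [0, 0, 0.8, 0, 0.2, 0, 0, ...] (145K dims, mostly zeros)
A TwHIN vector, by comparison, carries a non-zero value in essentially every dimension.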
Dense embeddings can capture more nuanced relationships but require more computation and storage compared to sparse embeddings.
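To make the storage and computation trade-off concrete, here is a small sketch comparing the two representations. The sparse vector reuses the non-zero entries from the SimClusters example (indices 2 and 4); the dense values and the four-dimensional size are invented for illustration and do not reflect real TwHIN embeddings.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} dicts.

    Only indices present in both vectors contribute, so the cost depends
    on the number of non-zero entries, not on the nominal dimensionality.
    """
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(value * b[index] for index, value in a.items() if index in b)


def dense_dot(a, b):
    """Dot product of two dense vectors; every dimension is visited."""
    return sum(x * y for x, y in zip(a, b))


# Sparse: only the non-zero entries are stored.
sparse_user = {2: 0.8, 4: 0.2}
sparse_tweet = {2: 0.5, 7: 0.5}

# Dense: every dimension is stored (hypothetical values).
dense_user = [0.12, -0.40, 0.33, 0.05]
dense_tweet = [0.10, -0.20, 0.30, -0.50]

sparse_score = sparse_dot(sparse_user, sparse_tweet)
dense_score = dense_dot(dense_user, dense_tweet)
```

The sparse product touches only the shared non-zero index, while the dense product always visits every dimension, which is the extra computation the text refers to.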

Key Characteristics

  • All dimensions have non-zero values, capturing rich latent features from the graph structure.
  • Learns from multiple entity types (users, tweets) and relationship types (follows, favorites, etc.) simultaneously.
  • Treats X as a knowledge graph where embeddings preserve structural and semantic relationships.
  • Works alongside SimClusters to provide both sparse interpretable and dense expressive representations.

Where It’s Used

TwHIN embeddings are used across X’s recommendation systems:

Tweet Recommendations

Powers candidate generation and ranking in the For You timeline

User Recommendations

Suggests accounts to follow based on embedding similarity

Content Understanding

Represents semantic meaning of tweets and user interests

Similar Content Discovery

Finds related tweets and accounts using dense vector similarity
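The similarity lookups described in this section can be sketched as a brute-force nearest-neighbor search. Cosine similarity is assumed here as the metric; the source only says "dense vector similarity". The entity IDs and vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, candidates, k=2):
    """Return the k (entity_id, similarity) pairs closest to the query vector."""
    scored = [(entity_id, cosine_similarity(query, vec))
              for entity_id, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented for illustration only.
EMBEDDINGS = {
    "tweet:101": [0.9, 0.1, 0.0],
    "tweet:102": [0.0, 1.0, 0.0],
    "tweet:103": [0.8, 0.2, 0.1],
}

neighbors = most_similar([1.0, 0.0, 0.0], EMBEDDINGS, k=2)
```

A linear scan like this is only practical for small candidate sets; it is meant to show the idea, not how lookups are served at scale.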

Architecture Integration

Representation Manager

TwHIN embeddings are served via the Representation Manager service, which:
  • Stores pre-computed embeddings
  • Provides fast retrieval APIs
  • Handles both SimClusters and TwHIN embeddings
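The responsibilities listed above can be pictured with a toy in-memory store. This is not the Representation Manager API: the `EmbeddingStore` class, its method names, and the `"twhin"` / `"simclusters"` keys are hypothetical and exist only to show how one service could hold pre-computed vectors of both kinds behind a single lookup interface.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class EmbeddingStore:
    """Hypothetical in-memory stand-in for a pre-computed embedding store."""

    def __init__(self) -> None:
        # Keyed by (embedding kind, entity id) so both kinds can coexist.
        self._data: Dict[Tuple[str, str], List[float]] = {}

    def put(self, kind: str, entity_id: str, vector: List[float]) -> None:
        """Store a pre-computed vector for an entity."""
        self._data[(kind, entity_id)] = list(vector)

    def get(self, kind: str, entity_id: str) -> Optional[List[float]]:
        """Return the stored vector, or None if the entity has no embedding."""
        return self._data.get((kind, entity_id))

    def get_many(self, kind: str, entity_ids: Iterable[str]) -> Dict[str, List[float]]:
        """Batch lookup; entities without an embedding are omitted."""
        found = {}
        for entity_id in entity_ids:
            vector = self.get(kind, entity_id)
            if vector is not None:
                found[entity_id] = vector
        return found


store = EmbeddingStore()
store.put("twhin", "user:alice", [0.12, -0.40, 0.33])
store.put("simclusters", "user:alice", [0.0, 0.0, 0.8])
batch = store.get_many("twhin", ["user:alice", "user:unknown"])
```

Returning only the entities that were found lets a caller decide how to handle missing embeddings, for example by falling back to another signal.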

Representation Scorer

The Representation Scorer uses TwHIN embeddings to:
  • Compute similarity scores between entities
  • Rank candidates based on embedding distance
  • Combine with other signals for final recommendations
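The three steps above can be sketched as a single ranking function. Everything here is an assumption made for illustration: the function names, the use of cosine similarity, and the linear blend with a fixed `embedding_weight` are not taken from the Representation Scorer itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(user_vec, candidates, other_signals, embedding_weight=0.7):
    """Score each candidate and return (candidate_id, score) pairs, best first.

    score = embedding_weight * similarity + (1 - embedding_weight) * other signal
    Candidates missing from other_signals get 0.0 for that term.
    """
    ranked = []
    for candidate_id, vec in candidates.items():
        similarity = cosine_similarity(user_vec, vec)
        extra = other_signals.get(candidate_id, 0.0)
        score = embedding_weight * similarity + (1.0 - embedding_weight) * extra
        ranked.append((candidate_id, score))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Toy inputs, invented for illustration only.
user = [1.0, 0.0]
tweets = {"tweet:a": [1.0, 0.0], "tweet:b": [0.0, 1.0]}
signals = {"tweet:a": 0.0, "tweet:b": 1.0}

by_embedding = rank_candidates(user, tweets, signals, embedding_weight=0.7)
by_signal = rank_candidates(user, tweets, signals, embedding_weight=0.2)
```

Changing `embedding_weight` flips the order in this toy case, which shows how much the final ranking can depend on how embedding similarity is blended with other signals.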

Comparison with SimClusters

Aspect           | SimClusters                         | TwHIN
Vector Type      | Sparse (145K dims, ~5-10 non-zero)  | Dense (all dims non-zero)
Interpretability | High (community-based)              | Lower (latent features)
Computation      | Fast (sparse operations)            | Slower (dense operations)
Expressiveness   | Good for community patterns         | Better for nuanced relationships
Use Case         | Community-based recommendations     | Semantic similarity matching
X uses both SimClusters and TwHIN embeddings in the recommendation pipeline, leveraging the strengths of each approach.

Training and Updates

For information about:
  • Model architecture
  • Training procedures
  • Update frequency
  • Performance characteristics
Refer to the TwHIN project README in the algorithm-ml repository.

Related Components

  • SimClusters - Sparse community-based embeddings
  • Representation Manager - Embedding storage and retrieval service (representation-manager/)
  • Representation Scorer - Similarity scoring using embeddings (representation-scorer/)
