Community detection is a critical component of GraphRAG that organizes the knowledge graph into hierarchical clusters. This structure enables both global reasoning about dataset themes and efficient navigation through related information.
Community detection identifies groups of entities that are densely connected to each other but sparsely connected to entities in other groups. In GraphRAG, this reveals:
Thematic clusters: Groups of entities discussing related topics
Organizational structure: How information in your dataset is naturally organized
Multiple granularities: From broad themes to specific subtopics
Navigation pathways: How to traverse from global to local information
In a knowledge graph visualization, each circle represents an entity sized by its degree (number of connections), and colors represent different community memberships.
The entity-relationship graph is converted to an undirected weighted graph:
Nodes: Entities from the knowledge graph
Edges: Relationships between entities
Weights: Relationship strength (based on frequency and context)
# From cluster_graph.py
# Normalize edge direction (undirected graph)
lo = edge_df[["source", "target"]].min(axis=1)
hi = edge_df[["source", "target"]].max(axis=1)
edge_df["source"] = lo
edge_df["target"] = hi
# drop_duplicates returns a new frame, so assign the result
edge_df = edge_df.drop_duplicates(subset=["source", "target"], keep="last")
2
Optional LCC filtering
If configured, extract only the largest connected component:
if use_lcc:
    edge_df = stable_lcc(edge_df)
This focuses clustering on the main graph, filtering out small disconnected components.
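To make the effect of this filter concrete, here is a minimal pure-Python sketch of largest-connected-component filtering. It is not GraphRAG's `stable_lcc`; the helper name `largest_component_edges` and the tuple-based edge format are assumptions made only for this illustration.

```python
from collections import defaultdict, deque


def largest_component_edges(edges):
    """Keep only edges whose endpoints lie in the largest connected component.

    `edges` is a list of (source, target, weight) tuples. This helper is
    hypothetical and only illustrates the idea behind LCC filtering.
    """
    # Build an undirected adjacency map.
    adjacency = defaultdict(set)
    for source, target, _weight in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)

    # Find connected components with breadth-first search.
    seen = set()
    largest = set()
    for start in sorted(adjacency):  # sorted for deterministic tie-breaking
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(largest):
            largest = component

    return [edge for edge in edges if edge[0] in largest and edge[1] in largest]


edges = [
    ("A", "B", 1.0),
    ("B", "C", 2.0),
    ("C", "A", 1.0),
    ("X", "Y", 1.0),  # small disconnected component, dropped by the filter
]
filtered = largest_component_edges(edges)
```

In this example the `X`-`Y` edge is removed, so the entities `X` and `Y` would not be assigned to any community.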
3
Initial clustering
The Leiden algorithm identifies communities by optimizing modularity, a measure of how well the graph is partitioned into communities.
Communities are groups with:
High edge density within the community
Low edge density between communities
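Modularity can be computed directly for a small graph. The sketch below uses the standard weighted definition (for each community, the fraction of edge weight inside it minus the fraction expected from its total degree). It only scores a given partition and does not implement Leiden itself; the function name and the input format are assumptions for illustration.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Weighted modularity of a partition of an undirected graph.

    `edges` is a list of (source, target, weight) tuples with no self-loops;
    `membership` maps each node to a community id.
    """
    total_weight = sum(weight for _, _, weight in edges)
    if total_weight == 0:
        return 0.0

    internal = defaultdict(float)  # edge weight inside each community
    degree = defaultdict(float)    # summed weighted degree of each community
    for source, target, weight in edges:
        degree[membership[source]] += weight
        degree[membership[target]] += weight
        if membership[source] == membership[target]:
            internal[membership[source]] += weight

    return sum(
        internal[c] / total_weight - (degree[c] / (2 * total_weight)) ** 2
        for c in degree
    )


# Two triangles joined by a single bridge edge (C-D).
edges = [
    ("A", "B", 1.0), ("B", "C", 1.0), ("A", "C", 1.0),
    ("D", "E", 1.0), ("E", "F", 1.0), ("D", "F", 1.0),
    ("C", "D", 1.0),
]
two_groups = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_group = {node: 0 for node in two_groups}

score_two = modularity(edges, two_groups)
score_one = modularity(edges, one_group)
```

Splitting the graph along the bridge scores higher than putting every node in a single community, which is the kind of partition a modularity-optimizing algorithm prefers.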
4
Recursive refinement
The algorithm is applied recursively to create hierarchy:
Apply Leiden to create level 0 (leaf communities)
If any community exceeds max_cluster_size, subdivide it
Repeat until all leaf communities are below threshold
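The subdivision loop can be sketched as follows. The clustering step here is a deliberately trivial stand-in (`split_in_half`), not Leiden, and the record keys (`id`, `parent`, `depth`, `nodes`) are hypothetical. The `depth` field only counts subdivision passes in this sketch and is not meant to mirror GraphRAG's level numbering.

```python
def split_in_half(nodes):
    """Stand-in for a real clustering pass (NOT Leiden): halve a sorted node list."""
    ordered = sorted(nodes)
    middle = len(ordered) // 2
    return [ordered[:middle], ordered[middle:]]


def build_hierarchy(nodes, max_cluster_size, cluster=split_in_half):
    """Subdivide clusters until every leaf has at most max_cluster_size nodes.

    Returns a list of dicts with hypothetical keys: id, parent, depth, nodes.
    """
    records = []
    # Each pending item is (member nodes, parent id, depth).
    pending = [(part, None, 0) for part in cluster(nodes) if part]
    while pending:
        members, parent, depth = pending.pop(0)
        community_id = len(records)
        records.append(
            {"id": community_id, "parent": parent, "depth": depth, "nodes": members}
        )
        # Subdivide oversized communities.
        if len(members) > max_cluster_size:
            parts = [part for part in cluster(members) if part]
            if len(parts) > 1:  # stop if the cluster cannot be split further
                pending.extend((part, community_id, depth + 1) for part in parts)
    return records


records = build_hierarchy(list(range(10)), max_cluster_size=3)
parent_ids = {record["parent"] for record in records}
leaves = [record for record in records if record["id"] not in parent_ids]
```

Every node ends up in exactly one leaf, and each oversized community keeps a link to the children it was split into, which is what makes the result a hierarchy and not a flat partition.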
Each community includes only intra-community relationships, that is, edges where both source and target are in the same community:
# From create_communities.py
# For each hierarchy level, find relationships within communities
for level in communities["level"].unique():
    level_comms = communities[communities["level"] == level]
    # Join relationships with community memberships
    with_source = relationships.merge(level_comms, left_on="source", right_on="title")
    with_both = with_source.merge(level_comms, left_on="target", right_on="title")
    # Keep only intra-community edges
    intra = with_both[with_both["community_x"] == with_both["community_y"]]
This ensures each community has a self-contained subgraph.
How many hierarchy levels were created.
Typical: 2-4 levels
Too many (5+): max_cluster_size may be too small
Too few (1): max_cluster_size may be too large
Communities per level
Distribution of communities across levels.
Level 0: Most communities (hundreds to thousands)
Mid levels: Fewer communities
Root: 1-5 communities
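A quick way to inspect these numbers is to count distinct communities per level. The sketch below assumes membership rows with `level`, `community`, and `title` fields, mirroring the column names used in the snippet above. The in-memory list-of-dicts layout and the function name are assumptions for illustration.

```python
from collections import defaultdict


def communities_per_level(memberships):
    """Count distinct communities at each hierarchy level.

    `memberships` is a list of dicts with "level", "community", and "title"
    keys, one row per (entity, community) assignment.
    """
    by_level = defaultdict(set)
    for row in memberships:
        by_level[row["level"]].add(row["community"])
    return {level: len(ids) for level, ids in sorted(by_level.items())}


memberships = [
    {"level": 0, "community": 0, "title": "A"},
    {"level": 0, "community": 0, "title": "B"},
    {"level": 0, "community": 1, "title": "C"},
    {"level": 0, "community": 2, "title": "D"},
    {"level": 1, "community": 3, "title": "A"},
    {"level": 1, "community": 3, "title": "C"},
]
per_level = communities_per_level(memberships)
hierarchy_depth = len(per_level)  # number of hierarchy levels created
```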
Entity distribution
How entities are distributed across communities.
Check for:
Communities with very few entities
Unbalanced distribution
Isolated entities
Coverage
Percentage of entities in communities.
High coverage (>95%): Good graph connectivity
Low coverage: Many disconnected entities (consider use_lcc setting)
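The entity-distribution and coverage checks can be combined into one small report. This is a sketch under stated assumptions: `coverage_report`, its input format, and the `min_size` threshold are hypothetical and not part of GraphRAG.

```python
from collections import Counter


def coverage_report(all_entities, memberships, min_size=2):
    """Summarize how entities are spread across communities at one level.

    `memberships` is a list of (entity, community) pairs; `min_size` is a
    hypothetical threshold below which a community is flagged as small.
    """
    entities = set(all_entities)
    sizes = Counter(community for _entity, community in memberships)
    covered = {entity for entity, _community in memberships} & entities
    coverage = len(covered) / len(entities) if entities else 0.0
    return {
        "coverage": coverage,                     # fraction of entities in any community
        "isolated": sorted(entities - covered),   # entities in no community
        "small_communities": sorted(c for c, n in sizes.items() if n < min_size),
        "largest": max(sizes.values(), default=0),
        "smallest": min(sizes.values(), default=0),
    }


report = coverage_report(
    all_entities=["A", "B", "C", "D", "E"],
    memberships=[("A", 0), ("B", 0), ("C", 0), ("D", 1)],
)
```

Here entity `E` belongs to no community, so coverage is 80%, and community `1` is flagged as small because it holds a single entity.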
Global search heavy: Optimize for good high-level summaries (larger clusters)
Local search heavy: Optimize for detailed leaf communities (smaller clusters)
Both: Use default balanced approach
Changing community detection parameters requires re-running the indexing pipeline from Phase 4 onward.