Base clustering groups conversation summaries into thematic clusters based on their embeddings. Each cluster receives an LLM-generated name and description that captures what the conversations have in common. This is the first level of organization before meta-clustering creates hierarchies.

    def cluster(self, items: list[dict]) -> dict[int, list[ConversationSummary]]:
        embeddings = [item["embedding"] for item in items]
        data = [item["item"] for item in items]

        # Calculate number of clusters: ceil(total_items / target_size)
        n_clusters = math.ceil(len(data) / self.clusters_per_group)

        # K-means clustering
        kmeans = KMeans(n_clusters=n_clusters)
        cluster_labels = kmeans.fit_predict(embeddings)

        # Group items by cluster
        return {
            i: [data[j] for j in range(len(data)) if cluster_labels[j] == i]
            for i in range(n_clusters)
        }
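To see the grouping step in isolation, here is a minimal sketch that uses hand-written labels in place of K-means output. The helper `group_by_label` and the sample data are hypothetical and not part of Kura; the helper builds the same label-to-items mapping in a single pass instead of rescanning the list once per cluster.

```python
import math


def group_by_label(data: list[str], labels: list[int], n_clusters: int) -> dict[int, list[str]]:
    """Group items by cluster label in one pass (hypothetical helper)."""
    groups: dict[int, list[str]] = {i: [] for i in range(n_clusters)}
    for item, label in zip(data, labels):
        groups[label].append(item)
    return groups


summaries = ["reset password", "refund request", "change password", "late refund", "login loop"]
target_size = 2  # plays the role of clusters_per_group
n_clusters = math.ceil(len(summaries) / target_size)  # ceil(5 / 2) = 3

# Labels as a clustering step might assign them (made up for illustration)
labels = [0, 1, 0, 1, 2]
grouped = group_by_label(summaries, labels, n_clusters)
```

Note that the rounding up means a leftover item still gets a cluster: five items with a target size of two yield three clusters, not two.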
From kura/cluster.py:30-63, the default prompt instructs the LLM to:
1. Analyze positive examples (conversations in the cluster)
2. Compare with contrastive examples (conversations from other clusters)
3. Generate:
   - A 2-sentence summary in past tense
   - A short name (max 10 words) in imperative form
Example Cluster Prompt (abbreviated)

    You are tasked with summarizing a group of related statements into a short, precise, and accurate description and name.

    Summarize all the statements into a clear, precise, two-sentence description in the past tense. Your summary should be specific to this group and distinguish it from the contrastive answers of the other groups.

    Generate a short name for the group of statements. This name should be at most ten words long and be specific but also reflective of most of the statements.

    The cluster name should be a sentence in the imperative that captures the user's request. For example, 'Brainstorm ideas for a birthday party' or 'Help me find a new job.'

    Below are the related statements:
    <positive_examples>
    {% for item in positive_examples %}
    {{ item }}
    {% endfor %}
    </positive_examples>

    For context, here are statements from nearby groups:
    <contrastive_examples>
    {% for item in contrastive_examples %}
    {{ item }}
    {% endfor %}
    </contrastive_examples>
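The two loops in the template expand to one entry per statement inside each tagged section. A rough sketch of that expansion in plain Python, assuming Jinja-style rendering; `render_examples` is a hypothetical helper rather than a Kura function, and it separates items with newlines for readability:

```python
def render_examples(tag: str, items: list[str]) -> str:
    """Wrap items in <tag>...</tag>, one item per line (hypothetical helper)."""
    body = "\n".join(items)
    return f"<{tag}>\n{body}\n</{tag}>"


positive = ["User asked for help planning a birthday party"]
contrastive = ["User asked for help debugging a SQL query"]

# Only the tail of the prompt is built here; the instructions above it are static text
prompt_tail = (
    "Below are the related statements:\n"
    + render_examples("positive_examples", positive)
    + "\n\nFor context, here are statements from nearby groups:\n"
    + render_examples("contrastive_examples", contrastive)
)
```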
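The LLM receives examples from OTHER clusters to help it be specific:

    import numpy as np
    from typing import Dict, List


    def get_contrastive_examples(
        cluster_id: int,
        cluster_id_to_summaries: Dict[int, List[ConversationSummary]],
        max_contrastive_examples: int = 10,
    ) -> List[ConversationSummary]:
        # Pool the summaries from every cluster except the target one
        other_clusters = [c for c in cluster_id_to_summaries.keys() if c != cluster_id]
        all_examples = []
        for cluster in other_clusters:
            all_examples.extend(cluster_id_to_summaries[cluster])

        # Sample up to max_contrastive_examples; capping the size avoids the
        # ValueError that replace=False raises when the pool is too small
        size = min(max_contrastive_examples, len(all_examples))
        indices = np.random.choice(len(all_examples), size=size, replace=False)
        return [all_examples[i] for i in indices]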
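A small usage sketch of the same idea, with plain strings standing in for `ConversationSummary` objects and the standard library's `random.sample` swapped in for NumPy. The name `sample_contrastive` and the sample clusters are hypothetical; the sketch shows that the sample never includes the target cluster and is capped at the number of examples available.

```python
import random


def sample_contrastive(
    cluster_id: int,
    clusters: dict[int, list[str]],
    max_examples: int = 10,
) -> list[str]:
    """Sample up to max_examples items from every cluster except cluster_id."""
    pool = [s for cid, items in clusters.items() if cid != cluster_id for s in items]
    k = min(max_examples, len(pool))  # never ask for more than the pool holds
    return random.sample(pool, k)


clusters = {
    0: ["reset password", "change password"],
    1: ["refund request", "late refund"],
    2: ["login loop"],
}

contrastive = sample_contrastive(0, clusters, max_examples=2)
everything_else = sample_contrastive(0, clusters, max_examples=10)
```

With the default of 10 and only three statements outside cluster 0, the second call returns all three rather than raising an error.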
    custom_prompt = """You are analyzing customer support conversations. Create a cluster name and description that helps support teams quickly understand the issue type.

    Focus on:
    1. The specific product feature or area
    2. The type of problem (bug, how-to, feature request)
    3. Common user pain points

    Conversations in this cluster:
    <positive_examples>
    {% for item in positive_examples %}
    {{ item }}
    {% endfor %}
    </positive_examples>

    Conversations from other clusters:
    <contrastive_examples>
    {% for item in contrastive_examples %}
    {{ item }}
    {% endfor %}
    </contrastive_examples>

    Generate:
    1. A 2-sentence description focusing on the problem pattern
    2. A short name like "Billing: Failed payment method updates"
    """

    clusters = await generate_base_clusters_from_conversation_summaries(
        summaries=summaries,
        prompt=custom_prompt,
        ...
    )

    # More clusters = more granular, but more LLM calls
    clustering_method = KmeansClusteringModel(clusters_per_group=5)  # 200 clusters for 1000 conversations

    # Fewer clusters = faster, but less specific
    clustering_method = KmeansClusteringModel(clusters_per_group=50)  # 20 clusters for 1000 conversations
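The counts in those comments follow from the ceil(total_items / clusters_per_group) rule used by the clustering step. A quick check of the arithmetic; `cluster_count` is a hypothetical helper, and treating each cluster as one naming call is an assumption about cost rather than a measured figure:

```python
import math


def cluster_count(total_items: int, clusters_per_group: int) -> int:
    """Number of base clusters for a given target cluster size."""
    return math.ceil(total_items / clusters_per_group)


for size in (5, 10, 50):
    # Assumption: one LLM naming call per cluster
    print(f"clusters_per_group={size}: {cluster_count(1000, size)} clusters")
```

Despite its name, `clusters_per_group` acts as the target number of conversations per cluster, so a smaller value produces more clusters.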

    # Test on a subset first
    test_clusters = await generate_base_clusters_from_conversation_summaries(
        summaries=summaries[:100],
        clustering_method=KmeansClusteringModel(clusters_per_group=5)
    )

    # Review cluster quality
    for cluster in test_clusters:
        print(cluster.name)
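Reading names by eye works for a small subset; a simple automated check can complement it. The sketch below flags names longer than the ten-word limit stated in the default prompt. `overlong_names` is a hypothetical helper, and `SimpleNamespace` objects stand in for real cluster objects, of which only the `name` attribute is assumed.

```python
from types import SimpleNamespace


def overlong_names(clusters, max_words: int = 10) -> list[str]:
    """Return cluster names that exceed max_words words."""
    return [c.name for c in clusters if len(c.name.split()) > max_words]


# Stand-ins for generated clusters (made up for illustration)
sample_clusters = [
    SimpleNamespace(name="Help me find a new job"),
    SimpleNamespace(name="Explain how to reset a password and also update billing details"),
]

flagged = overlong_names(sample_clusters)
```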