Overview
This quickstart guides you through analyzing a sample dataset of programming conversations. You’ll learn how to:
Load conversation data
Run the analysis pipeline
Visualize the results
By the end, you’ll have a complete working example that you can adapt for your own data.
This tutorial uses a small public dataset (190 conversations) from HuggingFace. The entire process takes about 2 minutes on a modern laptop.
Prerequisites
Install Kura
If you haven’t already, install Kura with visualization support:
uv pip install "kura[visualization]"
Set up API keys
You’ll need an API key from a supported LLM provider, for example OpenAI:
export OPENAI_API_KEY="your-key-here"
Complete example
Create a file called analyze.py with the following code:
import asyncio

from rich.console import Console

from kura.cache import DiskCacheStrategy
from kura.summarisation import summarise_conversations, SummaryModel
from kura.cluster import generate_base_clusters_from_conversation_summaries, ClusterDescriptionModel
from kura.meta_cluster import reduce_clusters_from_base_clusters, MetaClusterModel
from kura.dimensionality import reduce_dimensionality_from_clusters, HDBUMAP
from kura.visualization import visualise_pipeline_results
from kura.types import Conversation
from kura.checkpoints import JSONLCheckpointManager

async def main():
    console = Console()

    # Define models
    summary_model = SummaryModel(
        console=console,
        cache=DiskCacheStrategy(cache_dir="./.summary"),  # Disk-based caching
    )
    cluster_model = ClusterDescriptionModel(console=console)
    meta_cluster_model = MetaClusterModel(console=console)
    dimensionality_model = HDBUMAP()

    # Define checkpoints
    checkpoint_manager = JSONLCheckpointManager("./checkpoints", enabled=True)

    # Load conversations from a HuggingFace dataset
    conversations = Conversation.from_hf_dataset(
        "ivanleomk/synthetic-gemini-conversations", split="train"
    )

    # Process through the pipeline step by step
    summaries = await summarise_conversations(
        conversations, model=summary_model, checkpoint_manager=checkpoint_manager
    )
    clusters = await generate_base_clusters_from_conversation_summaries(
        summaries, model=cluster_model, checkpoint_manager=checkpoint_manager
    )
    reduced_clusters = await reduce_clusters_from_base_clusters(
        clusters, model=meta_cluster_model, checkpoint_manager=checkpoint_manager
    )
    projected_clusters = await reduce_dimensionality_from_clusters(
        reduced_clusters,
        model=dimensionality_model,
        checkpoint_manager=checkpoint_manager,
    )

    # Visualize results
    visualise_pipeline_results(projected_clusters, style="rich")

if __name__ == "__main__":
    asyncio.run(main())
Run the analysis
Execute the script:
python analyze.py
You’ll see progress output as Kura processes the conversations:
Loading Conversations: 100%|██████████| 190/190 [00:03<00:00, 52.11it/s]
Summarizing conversations: 100%|██████████| 190/190 [00:12<00:00, 15.83it/s]
Generating clusters: 100%|██████████| 1/1 [00:03<00:00, 3.21s/it]
Reducing to meta-clusters: 100%|██████████| 1/1 [00:02<00:00, 2.45s/it]
Reducing dimensionality: 100%|██████████| 1/1 [00:01<00:00, 1.12s/it]
Understanding the output
Kura will display a hierarchical visualization of discovered patterns:
📚 All Clusters (190 conversations)
╠══ 🔸 Data Analysis & Visualization
║ 📊 38 conversations (20.0%) [████░░░░░░░░░░░░░░░░]
║ 💭 Python and R programming for statistical analysis
║
║ ╠══ 🔸 R Programming
║ ║ 📊 12 conversations (6.3%) [█░░░░░░░░░░░░░░░░░░░]
║ ║
║ ╠══ 🔸 Tableau Dashboards
║ ║ 📊 10 conversations (5.3%) [█░░░░░░░░░░░░░░░░░░░]
║ ║
║ ╚══ 🔸 Pandas Data Manipulation
║ 📊 16 conversations (8.4%) [██░░░░░░░░░░░░░░░░░░]
║
╠══ 🔸 Web Development
║ 📊 45 conversations (23.7%) [█████░░░░░░░░░░░░░░░]
...
The exact output will vary depending on the dataset and the LLM’s clustering decisions.
What just happened?
Let’s break down each step:
Load conversations
conversations = Conversation.from_hf_dataset(
    "ivanleomk/synthetic-gemini-conversations", split="train"
)
Loads 190 programming conversations from a HuggingFace dataset. You can also load from:
Claude conversation exports: Conversation.from_claude_conversation_dump("file.json")
Custom JSON: Conversation.from_conversation_dump("file.json")
Your own format: Create Conversation objects manually
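The file-based loaders are one-liners; the file names below are placeholders for your own exports:
from kura.types import Conversation

# From a Claude conversation export
conversations = Conversation.from_claude_conversation_dump("file.json")

# From a custom JSON dump
conversations = Conversation.from_conversation_dump("file.json")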
Summarize conversations
summaries = await summarise_conversations(
    conversations, model=summary_model, checkpoint_manager=checkpoint_manager
)
Each conversation is condensed into a concise task description using an LLM. Results are cached to disk (./.summary/) and checkpointed (./checkpoints/summaries.jsonl). Run the script again and this step will be nearly instant thanks to caching!
Generate clusters
clusters = await generate_base_clusters_from_conversation_summaries(
    summaries, model=cluster_model, checkpoint_manager=checkpoint_manager
)
Similar conversations are grouped together using K-means clustering on embedding vectors. The LLM generates descriptive names for each cluster.
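For intuition, here is a standalone sketch of the same idea using scikit-learn directly; the embeddings are random placeholders and this is not Kura’s internal code:
import numpy as np
from sklearn.cluster import KMeans

# Placeholder embeddings: one vector per conversation summary.
embeddings = np.random.rand(190, 384)

# K-means groups similar vectors; each summary gets a cluster id.
labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(embeddings)
print(labels[:10])  # cluster assignment for the first ten summaries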
Build hierarchy
reduced_clusters = await reduce_clusters_from_base_clusters(
    clusters, model=meta_cluster_model, checkpoint_manager=checkpoint_manager
)
Clusters are organized into a hierarchical structure (meta-clusters) for easier navigation. This helps you see both high-level patterns and specific sub-patterns.
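In Kura the merging and naming is LLM-driven, but the structural idea resembles ordinary hierarchical clustering over base-cluster centroids; a rough standalone sketch with made-up data:
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Placeholder centroids: one vector per base cluster.
centroids = np.random.rand(10, 384)

# Group base clusters into a smaller set of parent meta-clusters.
parents = AgglomerativeClustering(n_clusters=3).fit_predict(centroids)
print(parents)  # parent meta-cluster index for each base cluster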
Reduce dimensionality
projected_clusters = await reduce_dimensionality_from_clusters(
    reduced_clusters,
    model=dimensionality_model,
    checkpoint_manager=checkpoint_manager,
)
High-dimensional embeddings are projected to 2D using UMAP and HDBSCAN for visualization (useful for the web UI).
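For intuition, this is what a plain 2D UMAP projection looks like with the umap-learn package directly (random placeholder vectors; Kura’s HDBUMAP model wraps this kind of step for you):
import numpy as np
import umap

embeddings = np.random.rand(190, 384)  # placeholder high-dimensional vectors
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
print(coords.shape)  # (190, 2): an x/y position per item, ready to plot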
Checkpoints and caching
Kura automatically saves progress at each step:
./checkpoints/
├── summaries.jsonl
├── clusters.jsonl
├── reduced_clusters.jsonl
└── projected_clusters.jsonl
./.summary/
└── [cache files]
Benefits:
Resume from failure: if the script crashes, it picks up where it left off
Fast iteration: change clustering parameters without re-summarizing
Inspect intermediate results: load and analyze any checkpoint manually (see the sketch below)
Checkpoints are tied to the input data. If you change the conversations, delete the checkpoint directory to regenerate.
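Because checkpoints are plain JSONL, you can peek at any intermediate result with the standard library. The record schema is Kura’s own; this sketch just prints the keys of the first summary record:
import json

with open("./checkpoints/summaries.jsonl") as f:
    first_record = json.loads(f.readline())
print(first_record.keys())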
Visualizing results
Kura offers three visualization styles:
Basic
Enhanced
Rich (recommended)
visualise_pipeline_results(projected_clusters, style="basic")
Basic shows a simple tree structure. Enhanced adds statistics and progress bars. Rich provides full color formatting and interactive-style output.
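To compare them, you can render the same projected clusters in each style:
from kura.visualization import visualise_pipeline_results

# Render the same results in all three styles, one after another.
for style in ("basic", "enhanced", "rich"):
    visualise_pipeline_results(projected_clusters, style=style)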
Loading your own data
To analyze your own conversations, create Conversation objects:
from kura.types import Conversation, Message
from datetime import datetime

conversations = [
    Conversation(
        chat_id="conv-1",
        created_at=datetime.now(),
        messages=[
            Message(
                created_at=datetime.now(),
                role="user",
                content="How do I reset my password?",
            ),
            Message(
                created_at=datetime.now(),
                role="assistant",
                content="You can reset your password by...",
            ),
        ],
        metadata={
            "source": "support_chat",
            "customer_tier": "premium",
        },
    ),
    # Add more conversations...
]
The metadata field is optional but useful for filtering and analysis. You can include any relevant information like user segments, conversation sources, or custom tags.
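For example, metadata makes it easy to slice your data before running the pipeline; a minimal sketch assuming the conversations list above:
# Keep only conversations from premium customers
# (the "customer_tier" key comes from the example above).
premium_conversations = [
    conv for conv in conversations
    if conv.metadata.get("customer_tier") == "premium"
]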
Customizing the pipeline
Using a different clustering method
Kura supports multiple clustering algorithms:
from kura.k_means import MiniBatchKmeansClusteringMethod

minibatch_kmeans = MiniBatchKmeansClusteringMethod(
    clusters_per_group=10,  # Target items per cluster
    batch_size=1000,        # Mini-batch size
    max_iter=100,           # Maximum iterations
    random_state=42,        # Reproducibility
)

clusters = await generate_base_clusters_from_conversation_summaries(
    summaries,
    model=cluster_model,
    clustering_method=minibatch_kmeans,
    checkpoint_manager=checkpoint_manager,
)
Using a different checkpoint format
Kura supports multiple checkpoint formats:
JSONL (default)
Parquet
HuggingFace Datasets
from kura.checkpoints import JSONLCheckpointManager

checkpoint_manager = JSONLCheckpointManager("./checkpoints", enabled=True)
HuggingFace dataset format is recommended for very large datasets (100k+ conversations) due to better compression and lazy loading.
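Switching formats is a one-line change. For example, the HuggingFace-backed manager (used again in Troubleshooting below) takes the same arguments as the JSONL manager:
from kura.checkpoints import HFDatasetCheckpointManager

checkpoint_manager = HFDatasetCheckpointManager("./checkpoints", enabled=True)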
Web UI
Kura includes a web interface for interactive exploration:
kura start-app --dir ./checkpoints
Open http://localhost:8000 in your browser to:
Explore clusters visually
Drill down into individual conversations
Filter by metadata
Export insights
Next steps
Core concepts: learn about the analysis pipeline in depth
Checkpoints: master checkpoint strategies for large datasets
Loading data: learn all the ways to load conversation data
API reference: explore the complete API documentation
Troubleshooting
Out of memory errors
For large datasets, use mini-batch clustering and HuggingFace checkpoints:
from kura.k_means import MiniBatchKmeansClusteringMethod
from kura.checkpoints import HFDatasetCheckpointManager

clustering_method = MiniBatchKmeansClusteringMethod(
    clusters_per_group=10,
    batch_size=1000,
)
checkpoint_manager = HFDatasetCheckpointManager("./checkpoints", enabled=True)
Rate limit errors
Reduce concurrent requests:
summary_model = SummaryModel(
    console=console,
    max_concurrent_requests=5,  # Default is 100
)
Poor clustering quality
Try adjusting the number of clusters:
meta_cluster_model = MetaClusterModel(
    console=console,
    max_clusters=15,  # Default is 10
)
For more troubleshooting tips, see the GitHub Issues or open a new one.