
Overview

This quickstart guides you through analyzing a sample dataset of programming conversations. You’ll learn how to:
  1. Load conversation data
  2. Run the analysis pipeline
  3. Visualize the results
By the end, you’ll have a complete working example that you can adapt for your own data.
This tutorial uses a small public dataset (190 conversations) from HuggingFace. The entire process takes about 2 minutes on a modern laptop.

Prerequisites

1. Install Kura

If you haven’t already, install Kura with visualization support:
uv pip install "kura[visualization]"
2. Set up API keys

You’ll need an API key for an LLM provider. For example, with OpenAI:
export OPENAI_API_KEY="your-key-here"
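You can also set the key from Python before the pipeline runs, for example in a notebook. A minimal sketch using the same OPENAI_API_KEY variable:
import os

# Applies to the current process only; equivalent to the shell export above.
os.environ["OPENAI_API_KEY"] = "your-key-here"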

Complete example

Create a file called analyze.py with the following code:
analyze.py
import asyncio
from rich.console import Console
from kura.cache import DiskCacheStrategy
from kura.summarisation import summarise_conversations, SummaryModel
from kura.cluster import generate_base_clusters_from_conversation_summaries, ClusterDescriptionModel
from kura.meta_cluster import reduce_clusters_from_base_clusters, MetaClusterModel
from kura.dimensionality import reduce_dimensionality_from_clusters, HDBUMAP
from kura.visualization import visualise_pipeline_results
from kura.types import Conversation
from kura.checkpoints import JSONLCheckpointManager


async def main():
    console = Console()

    # Define models
    summary_model = SummaryModel(
        console=console,
        cache=DiskCacheStrategy(cache_dir="./.summary"),  # Disk-based caching
    )
    cluster_model = ClusterDescriptionModel(console=console)
    meta_cluster_model = MetaClusterModel(console=console)
    dimensionality_model = HDBUMAP()

    # Define checkpoints
    checkpoint_manager = JSONLCheckpointManager("./checkpoints", enabled=True)

    # Load conversations from HuggingFace dataset
    conversations = Conversation.from_hf_dataset(
        "ivanleomk/synthetic-gemini-conversations", split="train"
    )

    # Process through the pipeline step by step
    summaries = await summarise_conversations(
        conversations, model=summary_model, checkpoint_manager=checkpoint_manager
    )

    clusters = await generate_base_clusters_from_conversation_summaries(
        summaries, model=cluster_model, checkpoint_manager=checkpoint_manager
    )

    reduced_clusters = await reduce_clusters_from_base_clusters(
        clusters, model=meta_cluster_model, checkpoint_manager=checkpoint_manager
    )

    projected_clusters = await reduce_dimensionality_from_clusters(
        reduced_clusters,
        model=dimensionality_model,
        checkpoint_manager=checkpoint_manager,
    )

    # Visualize results
    visualise_pipeline_results(projected_clusters, style="rich")


if __name__ == "__main__":
    asyncio.run(main())

Run the analysis

Execute the script:
python analyze.py
You’ll see progress output as Kura processes the conversations:
Loading Conversations: 100%|██████████| 190/190 [00:03<00:00, 52.11it/s]
Summarizing conversations: 100%|██████████| 190/190 [00:12<00:00, 15.83it/s]
Generating clusters: 100%|██████████| 1/1 [00:03<00:00, 3.21s/it]
Reducing to meta-clusters: 100%|██████████| 1/1 [00:02<00:00, 2.45s/it]
Reducing dimensionality: 100%|██████████| 1/1 [00:01<00:00, 1.12s/it]

Understanding the output

Kura will display a hierarchical visualization of discovered patterns:
📚 All Clusters (190 conversations)

╠══ 🔸 Data Analysis & Visualization
║   📊 38 conversations (20.0%) [████░░░░░░░░░░░░░░░░]
║   💭 Python and R programming for statistical analysis

║   ╠══ 🔸 R Programming
║   ║   📊 12 conversations (6.3%) [█░░░░░░░░░░░░░░░░░░░]
║   ║
║   ╠══ 🔸 Tableau Dashboards
║   ║   📊 10 conversations (5.3%) [█░░░░░░░░░░░░░░░░░░░]
║   ║
║   ╚══ 🔸 Pandas Data Manipulation
║       📊 16 conversations (8.4%) [██░░░░░░░░░░░░░░░░░░]

╠══ 🔸 Web Development
║   📊 45 conversations (23.7%) [█████░░░░░░░░░░░░░░░]
...
The exact output will vary depending on the dataset and the LLM’s clustering decisions.

What just happened?

Let’s break down each step:
1. Load conversations

conversations = Conversation.from_hf_dataset(
    "ivanleomk/synthetic-gemini-conversations", split="train"
)
Loads 190 programming conversations from a HuggingFace dataset. You can also load from:
  • Claude conversation exports: Conversation.from_claude_conversation_dump("file.json")
  • Custom JSON: Conversation.from_conversation_dump("file.json")
  • Your own format: Create Conversation objects manually
2. Summarize conversations

summaries = await summarise_conversations(
    conversations, model=summary_model, checkpoint_manager=checkpoint_manager
)
Each conversation is condensed into a concise task description using an LLM. Results are cached to disk (./.summary/) and checkpointed (./checkpoints/summaries.jsonl).
Run the script again and this step will be nearly instant thanks to caching!
3. Generate clusters

clusters = await generate_base_clusters_from_conversation_summaries(
    summaries, model=cluster_model, checkpoint_manager=checkpoint_manager
)
Similar conversations are grouped together using K-means clustering on embedding vectors. The LLM generates descriptive names for each cluster.
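Conceptually, this step resembles the following sketch with scikit-learn (illustrative only, not Kura's internal code; the random array stands in for real summary embeddings):
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for real summary embeddings: 190 summaries, 384-dimensional vectors.
embeddings = np.random.rand(190, 384)

# Group summaries into 10 clusters by embedding similarity.
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(embeddings)  # labels[i] is the cluster index of summary i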
4. Build hierarchy

reduced_clusters = await reduce_clusters_from_base_clusters(
    clusters, model=meta_cluster_model, checkpoint_manager=checkpoint_manager
)
Clusters are organized into a hierarchical structure (meta-clusters) for easier navigation. This helps you see both high-level patterns and specific sub-patterns.
5. Reduce dimensionality

projected_clusters = await reduce_dimensionality_from_clusters(
    reduced_clusters,
    model=dimensionality_model,
    checkpoint_manager=checkpoint_manager,
)
High-dimensional embeddings are projected to 2D using UMAP paired with HDBSCAN (the HDBUMAP model), producing coordinates suitable for visualization in the web UI.
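Conceptually, the projection resembles this sketch with the umap-learn library (illustrative only; Kura's HDBUMAP model manages its own configuration):
import numpy as np
import umap

# Stand-in for real high-dimensional embeddings.
embeddings = np.random.rand(190, 384)

# Project to 2D; each row of coords is an (x, y) point for plotting.
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(embeddings)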

Checkpoints and caching

Kura automatically saves progress at each step:
./checkpoints/
  ├── summaries.jsonl
  ├── clusters.jsonl
  ├── reduced_clusters.jsonl
  └── projected_clusters.jsonl

./.summary/
  └── [cache files]
Benefits:
  • Resume from failure: If the script crashes, it picks up where it left off
  • Fast iteration: Change clustering parameters without re-summarizing
  • Inspect intermediate results: Load and analyze any checkpoint manually (see the sketch below)
Checkpoints are tied to the input data. If you change the conversations, delete the checkpoint directory to regenerate.
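Because checkpoints are plain JSONL, you can inspect them with just the standard library. A minimal sketch (record fields vary by checkpoint type, so this only prints the raw keys):
import json

# Each line of a JSONL checkpoint is one JSON record.
with open("./checkpoints/summaries.jsonl") as f:
    records = [json.loads(line) for line in f]

print(len(records), "records")
print(records[0].keys())  # field names depend on the checkpoint type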

Visualizing results

Kura offers three visualization styles:
visualise_pipeline_results(projected_clusters, style="basic")
Basic shows a simple tree structure. Enhanced adds statistics and progress bars. Rich provides full color formatting and interactive-style output.
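The other styles are selected the same way:
visualise_pipeline_results(projected_clusters, style="enhanced")
visualise_pipeline_results(projected_clusters, style="rich")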

Loading your own data

To analyze your own conversations, create Conversation objects:
from kura.types import Conversation, Message
from datetime import datetime

conversations = [
    Conversation(
        chat_id="conv-1",
        created_at=datetime.now(),
        messages=[
            Message(
                created_at=datetime.now(),
                role="user",
                content="How do I reset my password?"
            ),
            Message(
                created_at=datetime.now(),
                role="assistant",
                content="You can reset your password by..."
            ),
        ],
        metadata={
            "source": "support_chat",
            "customer_tier": "premium",
        },
    ),
    # Add more conversations...
]
The metadata field is optional but useful for filtering and analysis. You can include any relevant information like user segments, conversation sources, or custom tags.
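If your data lives in another format (CSV, a database export, etc.), you can build the same objects in a loop. A short sketch, where records and its field names are hypothetical stand-ins for however you load your data:
from datetime import datetime
from kura.types import Conversation, Message

# Hypothetical raw data; replace with your own loading logic.
records = [
    {"id": "conv-1",
     "user_message": "How do I reset my password?",
     "assistant_reply": "You can reset your password by..."},
]

conversations = [
    Conversation(
        chat_id=r["id"],
        created_at=datetime.now(),
        messages=[
            Message(created_at=datetime.now(), role="user", content=r["user_message"]),
            Message(created_at=datetime.now(), role="assistant", content=r["assistant_reply"]),
        ],
    )
    for r in records
]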

Customizing the pipeline

Using a different clustering method

Kura supports multiple clustering algorithms:
from kura.k_means import MiniBatchKmeansClusteringMethod

minibatch_kmeans = MiniBatchKmeansClusteringMethod(
    clusters_per_group=10,  # Target items per cluster
    batch_size=1000,        # Mini-batch size
    max_iter=100,           # Maximum iterations
    random_state=42,        # Reproducibility
)

clusters = await generate_base_clusters_from_conversation_summaries(
    summaries,
    model=cluster_model,
    clustering_method=minibatch_kmeans,
    checkpoint_manager=checkpoint_manager,
)

Using different checkpoint formats

Kura supports multiple checkpoint formats. The default JSONL manager works well for most datasets:
from kura.checkpoints import JSONLCheckpointManager
checkpoint_manager = JSONLCheckpointManager("./checkpoints", enabled=True)
The HuggingFace dataset format is recommended for very large datasets (100k+ conversations) because of its better compression and lazy loading.
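For those larger datasets, the HuggingFace-backed manager is a drop-in replacement with the same constructor arguments (it also appears in the troubleshooting section below):
from kura.checkpoints import HFDatasetCheckpointManager

checkpoint_manager = HFDatasetCheckpointManager("./checkpoints", enabled=True)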

Web UI

Kura includes a web interface for interactive exploration:
kura start-app --dir ./checkpoints
Open http://localhost:8000 in your browser to:
  • Explore clusters visually
  • Drill down into individual conversations
  • Filter by metadata
  • Export insights

Next steps

Core concepts

Learn about the analysis pipeline in depth

Checkpoints

Master checkpoint strategies for large datasets

Loading data

Learn all the ways to load conversation data

API reference

Explore the complete API documentation

Troubleshooting

Out of memory errors

For large datasets, use mini-batch clustering and HuggingFace checkpoints:
from kura.k_means import MiniBatchKmeansClusteringMethod
from kura.checkpoints import HFDatasetCheckpointManager

clustering_method = MiniBatchKmeansClusteringMethod(
    clusters_per_group=10,
    batch_size=1000,
)
checkpoint_manager = HFDatasetCheckpointManager("./checkpoints", enabled=True)

Rate limit errors

Reduce concurrent requests:
summary_model = SummaryModel(
    console=console,
    max_concurrent_requests=5,  # Default is 100
)

Poor clustering quality

Try adjusting the number of clusters:
meta_cluster_model = MetaClusterModel(
    console=console,
    max_clusters=15,  # Default is 10
)
For more troubleshooting tips, check the project's GitHub Issues or open a new one.
