
Overview

This quickstart guides you through analyzing a sample dataset of programming conversations. You’ll learn how to:
  1. Load conversation data
  2. Run the analysis pipeline
  3. Visualize the results
By the end, you’ll have a complete working example that you can adapt for your own data.
This tutorial uses a small public dataset (190 conversations) from HuggingFace. The entire process takes about 2 minutes on a modern laptop.

Prerequisites

1. Install Kura

If you haven’t already, install Kura with visualization support:
uv pip install "kura[visualization]"
2. Set up API keys

You’ll need an API key for an LLM provider. For example, with OpenAI:
export OPENAI_API_KEY="your-key-here"
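You can also set the key from Python before the pipeline runs, for example in a notebook. A minimal sketch using the same OPENAI_API_KEY variable:
import os

# Applies to the current process only; equivalent to the shell export above.
os.environ["OPENAI_API_KEY"] = "your-key-here"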

Complete example

Create a file called analyze.py with the following code:
analyze.py
import asyncio
from rich.console import Console
from kura.cache import DiskCacheStrategy
from kura.summarisation import summarise_conversations, SummaryModel
from kura.cluster import generate_base_clusters_from_conversation_summaries, ClusterDescriptionModel
from kura.meta_cluster import reduce_clusters_from_base_clusters, MetaClusterModel
from kura.dimensionality import reduce_dimensionality_from_clusters, HDBUMAP
from kura.visualization import visualise_pipeline_results
from kura.types import Conversation
from kura.checkpoints import JSONLCheckpointManager


async def main():
    console = Console()

    # Define models
    summary_model = SummaryModel(
        console=console,
        cache=DiskCacheStrategy(cache_dir="./.summary"),  # Disk-based caching
    )
    cluster_model = ClusterDescriptionModel(console=console)
    meta_cluster_model = MetaClusterModel(console=console)
    dimensionality_model = HDBUMAP()

    # Define checkpoints
    checkpoint_manager = JSONLCheckpointManager("./checkpoints", enabled=True)

    # Load conversations from HuggingFace dataset
    conversations = Conversation.from_hf_dataset(
        "ivanleomk/synthetic-gemini-conversations", split="train"
    )

    # Process through the pipeline step by step
    summaries = await summarise_conversations(
        conversations, model=summary_model, checkpoint_manager=checkpoint_manager
    )

    clusters = await generate_base_clusters_from_conversation_summaries(
        summaries, model=cluster_model, checkpoint_manager=checkpoint_manager
    )

    reduced_clusters = await reduce_clusters_from_base_clusters(
        clusters, model=meta_cluster_model, checkpoint_manager=checkpoint_manager
    )

    projected_clusters = await reduce_dimensionality_from_clusters(
        reduced_clusters,
        model=dimensionality_model,
        checkpoint_manager=checkpoint_manager,
    )

    # Visualize results
    visualise_pipeline_results(projected_clusters, style="rich")


if __name__ == "__main__":
    asyncio.run(main())

Run the analysis

Execute the script:
python analyze.py
You’ll see progress output as Kura processes the conversations:
Loading Conversations: 100%|██████████| 190/190 [00:03<00:00, 52.11it/s]
Summarizing conversations: 100%|██████████| 190/190 [00:12<00:00, 15.83it/s]
Generating clusters: 100%|██████████| 1/1 [00:03<00:00, 3.21s/it]
Reducing to meta-clusters: 100%|██████████| 1/1 [00:02<00:00, 2.45s/it]
Reducing dimensionality: 100%|██████████| 1/1 [00:01<00:00, 1.12s/it]

Understanding the output

Kura will display a hierarchical visualization of discovered patterns:
📚 All Clusters (190 conversations)

╠══ 🔸 Data Analysis & Visualization
║   📊 38 conversations (20.0%) [████░░░░░░░░░░░░░░░░]
║   💭 Python and R programming for statistical analysis

║   ╠══ 🔸 R Programming
║   ║   📊 12 conversations (6.3%) [█░░░░░░░░░░░░░░░░░░░]
║   ║
║   ╠══ 🔸 Tableau Dashboards
║   ║   📊 10 conversations (5.3%) [█░░░░░░░░░░░░░░░░░░░]
║   ║
║   ╚══ 🔸 Pandas Data Manipulation
║       📊 16 conversations (8.4%) [██░░░░░░░░░░░░░░░░░░]

╠══ 🔸 Web Development
║   📊 45 conversations (23.7%) [█████░░░░░░░░░░░░░░░]
...
The exact output will vary depending on the dataset and the LLM’s clustering decisions.

What just happened?

Let’s break down each step:
1. Load conversations

conversations = Conversation.from_hf_dataset(
    "ivanleomk/synthetic-gemini-conversations", split="train"
)
Loads 190 programming conversations from a HuggingFace dataset. You can also load from:
  • Claude conversation exports: Conversation.from_claude_conversation_dump("file.json")
  • Custom JSON: Conversation.from_conversation_dump("file.json")
  • Your own format: Create Conversation objects manually
2. Summarize conversations

summaries = await summarise_conversations(
    conversations, model=summary_model, checkpoint_manager=checkpoint_manager
)
Each conversation is condensed into a concise task description using an LLM. Results are cached to disk (./.summary/) and checkpointed (./checkpoints/summaries.jsonl).
Run the script again and this step will be nearly instant thanks to caching!
3. Generate clusters

clusters = await generate_base_clusters_from_conversation_summaries(
    summaries, model=cluster_model, checkpoint_manager=checkpoint_manager
)
Similar conversations are grouped together using K-means clustering on embedding vectors. The LLM generates descriptive names for each cluster.
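Conceptually, this step resembles the following sketch with scikit-learn (illustrative only, not Kura's internal code; the random array stands in for real summary embeddings):
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for real summary embeddings: 190 summaries, 384-dimensional vectors.
embeddings = np.random.rand(190, 384)

# Group summaries into 10 clusters by embedding similarity.
kmeans = KMeans(n_clusters=10, random_state=42)
labels = kmeans.fit_predict(embeddings)  # labels[i] is the cluster index of summary i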
4. Build hierarchy

reduced_clusters = await reduce_clusters_from_base_clusters(
    clusters, model=meta_cluster_model, checkpoint_manager=checkpoint_manager
)
Clusters are organized into a hierarchical structure (meta-clusters) for easier navigation. This helps you see both high-level patterns and specific sub-patterns.
5. Reduce dimensionality

projected_clusters = await reduce_dimensionality_from_clusters(
    reduced_clusters,
    model=dimensionality_model,
    checkpoint_manager=checkpoint_manager,
)
High-dimensional embeddings are projected to 2D using UMAP paired with HDBSCAN (the HDBUMAP model), producing coordinates suitable for visualization in the web UI.
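Conceptually, the projection resembles this sketch with the umap-learn library (illustrative only; Kura's HDBUMAP model manages its own configuration):
import numpy as np
import umap

# Stand-in for real high-dimensional embeddings.
embeddings = np.random.rand(190, 384)

# Project to 2D; each row of coords is an (x, y) point for plotting.
reducer = umap.UMAP(n_components=2, random_state=42)
coords = reducer.fit_transform(embeddings)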

Checkpoints and caching

Kura automatically saves progress at each step:
./checkpoints/
  ├── summaries.jsonl
  ├── clusters.jsonl
  ├── reduced_clusters.jsonl
  └── projected_clusters.jsonl

./.summary/
  └── [cache files]
Benefits:
  • Resume from failure: If the script crashes, it picks up where it left off
  • Fast iteration: Change clustering parameters without re-summarizing
  • Inspect intermediate results: Load and analyze any checkpoint manually (see the sketch below)
Checkpoints are tied to the input data. If you change the conversations, delete the checkpoint directory to regenerate.
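Because checkpoints are plain JSONL, you can inspect them with just the standard library. A minimal sketch (record fields vary by checkpoint type, so this only prints the raw keys):
import json

# Each line of a JSONL checkpoint is one JSON record.
with open("./checkpoints/summaries.jsonl") as f:
    records = [json.loads(line) for line in f]

print(len(records), "records")
print(records[0].keys())  # field names depend on the checkpoint type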

Visualizing results

Kura offers three visualization styles:
visualise_pipeline_results(projected_clusters, style="basic")
Basic shows a simple tree structure. Enhanced adds statistics and progress bars. Rich provides full color formatting and interactive-style output.
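The other styles are selected the same way:
visualise_pipeline_results(projected_clusters, style="enhanced")
visualise_pipeline_results(projected_clusters, style="rich")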

Loading your own data

To analyze your own conversations, create Conversation objects:
from kura.types import Conversation, Message
from datetime import datetime

conversations = [
    Conversation(
        chat_id="conv-1",
        created_at=datetime.now(),
        messages=[
            Message(
                created_at=datetime.now(),
                role="user",
                content="How do I reset my password?"
            ),
            Message(
                created_at=datetime.now(),
                role="assistant",
                content="You can reset your password by..."
            ),
        ],
        metadata={
            "source": "support_chat",
            "customer_tier": "premium",
        },
    ),
    # Add more conversations...
]
The metadata field is optional but useful for filtering and analysis. You can include any relevant information like user segments, conversation sources, or custom tags.
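If your data lives in another format (CSV, a database export, etc.), you can build the same objects in a loop. A short sketch, where records and its field names are hypothetical stand-ins for however you load your data:
from datetime import datetime
from kura.types import Conversation, Message

# Hypothetical raw data; replace with your own loading logic.
records = [
    {"id": "conv-1",
     "user_message": "How do I reset my password?",
     "assistant_reply": "You can reset your password by..."},
]

conversations = [
    Conversation(
        chat_id=r["id"],
        created_at=datetime.now(),
        messages=[
            Message(created_at=datetime.now(), role="user", content=r["user_message"]),
            Message(created_at=datetime.now(), role="assistant", content=r["assistant_reply"]),
        ],
    )
    for r in records
]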

Customizing the pipeline

Using a different clustering method

Kura supports multiple clustering algorithms:
from kura.k_means import MiniBatchKmeansClusteringMethod

minibatch_kmeans = MiniBatchKmeansClusteringMethod(
    clusters_per_group=10,  # Target items per cluster
    batch_size=1000,        # Mini-batch size
    max_iter=100,           # Maximum iterations
    random_state=42,        # Reproducibility
)

clusters = await generate_base_clusters_from_conversation_summaries(
    summaries,
    model=cluster_model,
    clustering_method=minibatch_kmeans,
    checkpoint_manager=checkpoint_manager,
)

Using different checkpoint formats

Kura supports multiple checkpoint formats. The default JSONL manager works well for most datasets:
from kura.checkpoints import JSONLCheckpointManager
checkpoint_manager = JSONLCheckpointManager("./checkpoints", enabled=True)
The HuggingFace dataset format is recommended for very large datasets (100k+ conversations) because of its better compression and lazy loading.
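For those larger datasets, the HuggingFace-backed manager is a drop-in replacement with the same constructor arguments (it also appears in the troubleshooting section below):
from kura.checkpoints import HFDatasetCheckpointManager

checkpoint_manager = HFDatasetCheckpointManager("./checkpoints", enabled=True)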

Web UI

Kura includes a web interface for interactive exploration:
kura start-app --dir ./checkpoints
Open http://localhost:8000 in your browser to:
  • Explore clusters visually
  • Drill down into individual conversations
  • Filter by metadata
  • Export insights

Next steps

Core concepts

Learn about the analysis pipeline in depth

Checkpoints

Master checkpoint strategies for large datasets

Loading data

Learn all the ways to load conversation data

API reference

Explore the complete API documentation

Troubleshooting

Out of memory errors

For large datasets, use mini-batch clustering and HuggingFace checkpoints:
from kura.k_means import MiniBatchKmeansClusteringMethod
from kura.checkpoints import HFDatasetCheckpointManager

clustering_method = MiniBatchKmeansClusteringMethod(
    clusters_per_group=10,
    batch_size=1000,
)
checkpoint_manager = HFDatasetCheckpointManager("./checkpoints", enabled=True)

Rate limit errors

Reduce concurrent requests:
summary_model = SummaryModel(
    console=console,
    max_concurrent_requests=5,  # Default is 100
)

Poor clustering quality

Try adjusting the number of clusters:
meta_cluster_model = MetaClusterModel(
    console=console,
    max_clusters=15,  # Default is 10
)
For more troubleshooting tips, check the project's GitHub Issues or open a new one.
