## The Conversation Model

At the core of Kura is the `Conversation` Pydantic model, defined in `kura/types/conversation.py`.
### Fields

- `chat_id`: Unique identifier for the conversation (must be unique across your dataset)
- `created_at`: Timestamp when the conversation started
- `messages`: Ordered list of user and assistant messages
- `metadata`: Flexible dictionary for custom properties (e.g., model name, language, user type)
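The fields above can be sketched as follows. This is a minimal stand-in using stdlib dataclasses, not Kura's actual Pydantic model; the `Message` shape (`role`, `content`) follows the Message Structure section below, and the example values are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str   # text only

@dataclass
class Conversation:
    chat_id: str                 # unique across the dataset
    created_at: datetime         # when the conversation started
    messages: list               # ordered Message objects
    metadata: dict = field(default_factory=dict)  # custom properties

conv = Conversation(
    chat_id="conv-001",
    created_at=datetime(2024, 1, 1, 12, 0),
    messages=[
        Message("user", "How do I sort a list in Python?"),
        Message("assistant", "Use the built-in sorted() function."),
    ],
    metadata={"model": "gpt-4o", "language": "en"},
)
```

With the real model, Pydantic validates the field types for you; the dataclass version performs no validation.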
The `metadata` field supports strings, numbers, booleans, and lists of these types. This is validated at the type level.

## Loading Conversations
### From HuggingFace Datasets

The most common method for loading large conversation datasets.

Parameters:

- `dataset_name` (str): HuggingFace dataset identifier
- `split` (str): Dataset split to load (default: "train")
- `max_conversations` (int | None): Limit the number of conversations loaded (useful for testing)
- `chat_id_fn` (callable): Function to extract the chat ID from a dataset row
- `created_at_fn` (callable): Function to extract the timestamp from a dataset row
- `messages_fn` (callable): Function to extract the messages from a dataset row
- `metadata_fn` (callable): Function to extract custom metadata from a dataset row
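The per-row extractor functions might look like the sketch below. The row schema here (`conversation_id`, `timestamp`, `turns`) is hypothetical; adapt the field names to your dataset's actual columns before passing these to the loader.

```python
from datetime import datetime

def chat_id_fn(row):
    # Assumed column name; use whatever uniquely identifies a row in your dataset.
    return row["conversation_id"]

def created_at_fn(row):
    # Assumes ISO-8601 timestamps; adjust parsing to your dataset's format.
    return datetime.fromisoformat(row["timestamp"])

def messages_fn(row):
    # Map each raw turn to a {"role", "content"} dict.
    return [{"role": t["role"], "content": t["text"]} for t in row["turns"]]

def metadata_fn(row):
    # Attach any static per-row properties you want to filter on later.
    return {"model": row.get("model", "unknown")}

# A sample row in the assumed schema, to show the functions in action:
sample_row = {
    "conversation_id": "abc-123",
    "timestamp": "2024-05-01T09:30:00",
    "turns": [
        {"role": "user", "text": "Hello"},
        {"role": "assistant", "text": "Hi! How can I help?"},
    ],
}

print(chat_id_fn(sample_row))              # abc-123
print(messages_fn(sample_row)[1]["role"])  # assistant
```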
### From Claude Conversation Dumps

If you've exported your Claude conversation history, loading involves:

- Parsing Claude's JSON format
- Converting message timestamps
- Mapping "human"/"assistant" roles to "user"/"assistant"
- Extracting text content from Claude's content blocks
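The steps above can be sketched as one conversion function. The exact export structure (`sender`, `content`, `created_at` keys) is an assumption; check the shape of your actual dump.

```python
from datetime import datetime

# Map Claude's role names onto the roles Kura expects.
ROLE_MAP = {"human": "user", "assistant": "assistant"}

def extract_text(content):
    """Claude content may be a plain string or a list of content blocks."""
    if isinstance(content, str):
        return content
    # Keep only text blocks; tool calls, images, etc. are dropped.
    return "".join(b.get("text", "") for b in content if b.get("type") == "text")

def convert_message(msg):
    return {
        "role": ROLE_MAP[msg["sender"]],
        "content": extract_text(msg["content"]),
        "created_at": datetime.fromisoformat(msg["created_at"]),
    }

raw = {
    "sender": "human",
    "content": [{"type": "text", "text": "Explain recursion."}],
    "created_at": "2024-03-10T14:22:00",
}
print(convert_message(raw)["role"])  # user
```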
### From JSONL Files

For conversations previously saved by Kura.
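Loading JSONL amounts to parsing one JSON object per line. A minimal sketch, using an in-memory buffer in place of a file and the field layout of the `Conversation` model:

```python
import json
from io import StringIO

# Stand-in for an open JSONL file: one serialized conversation per line.
jsonl = StringIO(
    '{"chat_id": "c1", "created_at": "2024-01-01T00:00:00", "messages": [], "metadata": {}}\n'
    '{"chat_id": "c2", "created_at": "2024-01-02T00:00:00", "messages": [], "metadata": {}}\n'
)

conversations = [json.loads(line) for line in jsonl if line.strip()]
print(len(conversations))           # 2
print(conversations[0]["chat_id"])  # c1
```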
### Custom Data Sources

For any other format, create `Conversation` objects directly.
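As an illustration, here is how a hypothetical support-ticket format could be adapted to the `Conversation` field layout. Plain dicts stand in for Kura's actual `Conversation` class; the ticket schema is invented for the example.

```python
from datetime import datetime

# Hypothetical source format: support tickets with (speaker, text) exchanges.
tickets = [
    {
        "id": "T-42",
        "opened": "2024-06-01T08:00:00",
        "exchange": [
            ("customer", "My build fails."),
            ("agent", "Can you share the log?"),
        ],
    }
]

role_map = {"customer": "user", "agent": "assistant"}

conversations = [
    {
        "chat_id": t["id"],
        "created_at": datetime.fromisoformat(t["opened"]),
        "messages": [{"role": role_map[r], "content": c} for r, c in t["exchange"]],
        "metadata": {"source": "support"},
    }
    for t in tickets
]
print(conversations[0]["chat_id"])  # T-42
```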
## Saving Conversations

Export conversations to JSONL for later use.
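A sketch of the export, writing one JSON object per line; with the real Pydantic model you would serialize each conversation (e.g., via Pydantic's `model_dump_json()`) instead of `json.dumps` on a dict.

```python
import json
import os
import tempfile

conversations = [
    {
        "chat_id": "c1",
        "created_at": "2024-01-01T00:00:00",
        "messages": [{"role": "user", "content": "Hi"}],
        "metadata": {},
    },
]

path = os.path.join(tempfile.mkdtemp(), "conversations.jsonl")
with open(path, "w") as f:
    for conv in conversations:
        f.write(json.dumps(conv) + "\n")  # one conversation per line

with open(path) as f:
    print(json.loads(f.readline())["chat_id"])  # c1
```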
## Working with Metadata

Metadata enriches conversations with custom properties for filtering and analysis.

### Attaching Metadata at Load Time
### LLM-Powered Metadata Extraction

For complex properties that require analysis, use LLM extractors during the summarization stage (see Summarization).

#### Example: Language Detection Extractor
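A hypothetical sketch of the shape such an extractor might take. Kura's real extractors run an LLM during summarization; this stand-in uses a trivial keyword heuristic so it runs offline, and the function signature is an assumption, not Kura's API.

```python
def language_extractor(conversation):
    """Return a metadata dict for one conversation (hypothetical signature)."""
    text = " ".join(m["content"] for m in conversation["messages"])
    # Toy heuristic as a placeholder: a real extractor would prompt an LLM.
    language = "es" if any(w in text.lower() for w in ("hola", "gracias")) else "en"
    return {"language": language}

conv = {"messages": [{"role": "user", "content": "Hola, ¿cómo estás?"}]}
print(language_extractor(conv))  # {'language': 'es'}
```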
## Message Structure

Each `Message` represents a single turn in the conversation.
### Important Notes

- Messages must alternate between "user" and "assistant" roles for most LLM analysis
- The `content` field should contain only text (no tool calls, images, etc.)
- Messages are ordered chronologically by their position in the list
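The alternation rule above is easy to check before running analysis. A small helper, assuming messages are plain dicts with a `role` key:

```python
def roles_alternate(messages):
    """True if no two consecutive messages share the same role."""
    roles = [m["role"] for m in messages]
    return all(a != b for a, b in zip(roles, roles[1:]))

messages = [
    {"role": "user", "content": "What is a monad?"},
    {"role": "assistant", "content": "A monad is a design pattern..."},
    {"role": "user", "content": "Can you give an example?"},
]
print(roles_alternate(messages))  # True
```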
## Best Practices

### Unique Chat IDs

Ensure each conversation has a globally unique `chat_id`.
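Two common ways to guarantee uniqueness, sketched below under the assumption that your raw records may lack stable IDs of their own:

```python
import hashlib
import uuid

# Option 1: random UUIDs -- always unique, but change on every run.
chat_id = str(uuid.uuid4())

# Option 2: deterministic IDs derived from stable record content,
# so re-running the pipeline assigns the same ID to the same record.
record = {"source": "support", "ticket": "T-42"}  # illustrative record
stable_id = hashlib.sha256(
    f"{record['source']}:{record['ticket']}".encode()
).hexdigest()[:16]

print(len(stable_id))  # 16
```

Deterministic IDs are usually preferable when you expect to reload the same dataset, since they keep checkpoints and analysis results aligned across runs.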
### Handling Large Datasets

For datasets with millions of conversations:

- Use `max_conversations` to test your pipeline on a subset first
- Use the HuggingFace Datasets checkpoint format for efficient storage
- Consider batching conversations into multiple analysis runs
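The batching point can be sketched with a small stdlib helper that splits any iterable of conversations into fixed-size chunks for separate analysis runs:

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

conversations = list(range(10))  # stand-in for Conversation objects
batches = list(batched(conversations, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```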
### Metadata Strategy

Decide what to include in metadata:

- Static properties: model name, language, user type → attach at load time
- Computed properties: sentiment, complexity, topics → use LLM extractors
- Analysis results: cluster IDs, scores → added automatically by Kura
## Next Steps

Summarization: learn how conversations are analyzed and summarized.