Overview
Each pipeline function corresponds to a CLI command but takes explicit parameters instead of reading from config files. Use these functions from Jupyter notebooks, web apps, or anywhere else you want to use sift-kg as a library.
run_pipeline
Signature
Parameters
Directory containing documents (PDF, text, HTML, 75+ formats)
LLM model string (e.g. "openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet-20241022")
Domain configuration object loaded via load_domain()
Output directory for all artifacts (extractions, graph, narratives)
Budget cap in USD. Pipeline stops if cost exceeds this limit.
Whether to generate narrative summary at the end
Returns
Path to output directory containing all pipeline artifacts
Example
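The budget cap described in the parameters can be pictured with a small guard object. This is a minimal sketch of the idea only, assuming cost is accumulated per step and checked after each charge; `BudgetTracker`, `BudgetExceeded`, and `run_steps` are hypothetical names for illustration, not part of sift-kg.

```python
class BudgetExceeded(RuntimeError):
    """Raised when accumulated spend goes over the cap."""


class BudgetTracker:
    """Hypothetical helper: tracks spend against a cap in USD."""

    def __init__(self, max_cost_usd: float) -> None:
        self.max_cost_usd = max_cost_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        # Record the cost first, then stop if the running total exceeds the cap.
        self.spent_usd += cost_usd
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f}, cap is ${self.max_cost_usd:.2f}"
            )


def run_steps(step_costs, max_cost_usd):
    """Charge each (made-up) step cost until done or until the cap is exceeded.

    Returns the number of steps that completed and the total spend.
    """
    tracker = BudgetTracker(max_cost_usd)
    completed = 0
    try:
        for cost in step_costs:
            tracker.charge(cost)
            completed += 1
    except BudgetExceeded:
        pass  # stop processing; earlier steps keep their results
    return completed, tracker.spent_usd


if __name__ == "__main__":
    print(run_steps([0.25, 0.25, 0.25, 0.5], max_cost_usd=1.0))
```

In this sketch the step that crosses the cap is still billed, so the final spend can slightly overshoot the limit. Whether sift-kg checks before or after a call is not stated in this reference.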
run_extract
Signature
Parameters
Directory containing documents to extract from
LLM model string (e.g. "openai/gpt-4o-mini")
Domain configuration
Where to save extraction JSON files
Budget cap in USD
Concurrent LLM calls per document
Characters per text chunk. Larger = fewer API calls but longer context.
Re-extract all documents, ignoring cached results
Extraction backend: "kreuzberg" (default) or "pdfplumber"
Enable OCR for scanned documents
OCR engine: "tesseract", "easyocr", "paddleocr", or "gcv"
OCR language code (ISO 639-3, e.g. "eng", "spa", "fra")
Max requests per minute for rate limiting
Returns
List of extraction results, one per document
Example
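To make the chunk-size trade-off from the parameters concrete, here is a rough sketch that splits text into fixed-size character chunks and estimates how many LLM calls a document would need. It assumes one call per chunk and plain character slicing with no overlap; the real chunker may respect sentence or page boundaries. `chunk_text` and `estimated_calls` are hypothetical helpers, not sift-kg functions.

```python
import math


def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def estimated_calls(doc_length: int, chunk_size: int) -> int:
    """Estimate LLM calls for a document, assuming one call per chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return math.ceil(doc_length / chunk_size)


if __name__ == "__main__":
    # Made-up document length, used only to compare two chunk sizes.
    for size in (2_500, 10_000):
        print(size, estimated_calls(40_000, size))
```

Under these assumptions, quadrupling the chunk size cuts the call count to a quarter, at the price of each call carrying four times as much text.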
run_build
Signature
Parameters
Directory with extraction JSON files (from run_extract)
Domain configuration (used for review_required types)
Flag relations below this confidence for human review
Whether to remove redundant edges during graph construction
Returns
Populated knowledge graph saved to output_dir/graph_data.json
Example
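The confidence threshold from the parameters amounts to a partition of relations into accepted and flagged sets. The sketch below illustrates that idea on plain dictionaries. The relation shape (`source`, `target`, `type`, `confidence`) and the sample values are assumptions made for illustration, not the schema of sift-kg's extraction or graph files.

```python
def split_by_confidence(relations, threshold):
    """Separate relations into (accepted, flagged_for_review).

    A relation is flagged when its confidence is strictly below the threshold.
    """
    accepted, flagged = [], []
    for relation in relations:
        if relation["confidence"] < threshold:
            flagged.append(relation)
        else:
            accepted.append(relation)
    return accepted, flagged


# Made-up relations, only to exercise the function.
SAMPLE_RELATIONS = [
    {"source": "person:alice", "target": "org:acme", "type": "WORKS_FOR", "confidence": 0.92},
    {"source": "person:bob", "target": "org:acme", "type": "OWNS", "confidence": 0.41},
    {"source": "org:acme", "target": "place:paris", "type": "LOCATED_IN", "confidence": 0.70},
]

if __name__ == "__main__":
    accepted, flagged = split_by_confidence(SAMPLE_RELATIONS, threshold=0.7)
    print(len(accepted), "accepted,", len(flagged), "flagged for review")
```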
run_resolve
Signature
Parameters
Directory with graph_data.json
LLM model string for entity comparison
Domain configuration (provides system context for smarter resolution)
Use semantic clustering for batching candidates (requires sift-kg[embeddings])
Concurrent LLM calls
Max requests per minute
Returns
Merge file with DRAFT proposals saved to output_dir/merge_proposals.yaml
Example
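A cap on concurrent LLM calls is commonly implemented with a semaphore. The following sketch shows that general pattern with `asyncio.Semaphore` and a stand-in for the model call; it is an illustration of the technique, not a description of how sift-kg schedules its requests, and it does not model the requests-per-minute limit. `run_limited` and `FakeLLM` are hypothetical names.

```python
import asyncio


async def run_limited(items, worker, concurrency):
    """Run worker(item) for every item with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(item) for item in items))


class FakeLLM:
    """Stand-in for a model client; records peak simultaneous calls."""

    def __init__(self):
        self.in_flight = 0
        self.peak = 0

    async def compare(self, pair):
        self.in_flight += 1
        self.peak = max(self.peak, self.in_flight)
        await asyncio.sleep(0)  # yield control, as a network call would
        self.in_flight -= 1
        # Toy comparison: names match if they are equal ignoring case.
        return pair[0].lower() == pair[1].lower()


if __name__ == "__main__":
    llm = FakeLLM()
    pairs = [("Alice", "alice"), ("Bob", "Robert"), ("ACME", "Acme"), ("X", "Y")]
    print(asyncio.run(run_limited(pairs, llm.compare, concurrency=2)))
    print("peak concurrent calls:", llm.peak)
```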
run_apply_merges
Signature
Parameters
Directory with graph_data.json and review files (merge_proposals.yaml, relation_review.yaml)
Returns
Stats dict with keys:
merges_applied (int): Number of entity merges applied
rejected_count (int): Number of relations rejected
Example
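As a small illustration of consuming the returned stats dict, the sketch below formats the two documented keys into a one-line summary. The values in `example_stats` are made up, and `summarize_merge_stats` is a hypothetical helper rather than part of the library.

```python
def summarize_merge_stats(stats: dict) -> str:
    """Build a short human-readable summary from the stats dict."""
    merges = stats["merges_applied"]
    rejected = stats["rejected_count"]
    return f"{merges} entity merges applied, {rejected} relations rejected"


# Made-up values standing in for a real return value.
example_stats = {"merges_applied": 12, "rejected_count": 3}

if __name__ == "__main__":
    print(summarize_merge_stats(example_stats))
```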
run_narrate
Signature
Parameters
Directory with graph_data.json
LLM model string
Optional domain context injected into LLM prompts
Generate per-entity descriptions (more expensive)
Budget cap in USD
Only regenerate community labels (~$0.01 cost)
Returns
Path to generated narrative.md or communities.json
Example
run_view
Signature
Parameters
Directory with graph_data.json
Output HTML path (default: output_dir/graph.html)
Whether to open the visualization in a browser automatically
Show only top N entities by degree (useful for large graphs)
Hide nodes/edges below this confidence threshold
Show only entities from this document
Center visualization on entity ID (e.g. "person:alice")
Number of hops for neighborhood filter (used with neighborhood)
Focus on a specific community label
Returns
Path to generated interactive HTML file
Example
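Two of the filters listed in the parameters, top N by degree and an N-hop neighborhood around an entity, are standard graph operations. The sketch below shows them on a plain edge list so the effect of each filter is easy to see. It assumes an undirected view of the graph and uses made-up entity IDs in the "type:name" style of the "person:alice" example; it does not read graph_data.json and is not sift-kg's implementation.

```python
from collections import deque


def top_n_by_degree(edges, n):
    """Return the n node IDs with the most incident edges."""
    counts = {}
    for source, target in edges:
        counts[source] = counts.get(source, 0) + 1
        counts[target] = counts.get(target, 0) + 1
    # Sort by degree descending, then by ID so ties are deterministic.
    ranked = sorted(counts, key=lambda node: (-counts[node], node))
    return ranked[:n]


def neighborhood(edges, center, depth):
    """Return all nodes within `depth` hops of `center` (breadth-first search)."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set()).add(source)
    seen = {center}
    queue = deque([(center, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == depth:
            continue  # do not expand past the requested depth
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return seen


# Made-up edges for illustration.
EDGES = [
    ("person:alice", "org:acme"),
    ("person:bob", "org:acme"),
    ("org:acme", "place:paris"),
    ("place:paris", "person:carol"),
]

if __name__ == "__main__":
    print(top_n_by_degree(EDGES, 2))
    print(sorted(neighborhood(EDGES, "person:alice", 2)))
```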
run_export
Signature
Parameters
Directory with graph_data.json
Export format: "json", "graphml", "gexf", "csv", or "sqlite"
Where to write output (default: output_dir/graph.{fmt})
Returns
Path to the exported file or directory (for CSV format)
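The default output location follows the pattern output_dir/graph.{fmt}. As a minimal sketch of that naming rule, the hypothetical helper below validates the format against the five listed values and builds the path with `pathlib`. It only computes a path; it does not export anything, and it does not model the fact that the CSV export yields a directory.

```python
from pathlib import Path

# The five formats listed in the parameter description.
SUPPORTED_FORMATS = ("json", "graphml", "gexf", "csv", "sqlite")


def default_export_path(output_dir, fmt):
    """Return output_dir/graph.{fmt}, rejecting unlisted formats."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(
            f"unsupported format {fmt!r}; expected one of {SUPPORTED_FORMATS}"
        )
    return Path(output_dir) / f"graph.{fmt}"


if __name__ == "__main__":
    print(default_export_path("output", "graphml"))
```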