When extracting entities from multiple documents, the same real-world entity often appears under different names:
Mr. Edwards
Bradley Edwards
Detective Edwards
Brad Edwards
edwards

Without deduplication, your knowledge graph contains duplicate nodes for the same entity, fragmenting relationships and making analysis difficult.

sift-kg solves this with a 4-layer deduplication approach that combines deterministic rules, semantic similarity, LLM reasoning, and human review.
See src/sift_kg/graph/prededup.py:30 for the complete list.

Canonical Selection

When multiple variants normalize to the same form, the canonical is chosen by:

1. Most frequent variant
2. Longest name, if several variants tie on frequency
3. Alphabetical order, if a tie still remains
Implementation at src/sift_kg/graph/prededup.py:177:
    from collections import Counter

    def _pick_canonical(names: list[str]) -> str:
        counts = Counter(names)
        max_count = max(counts.values())
        most_frequent = [n for n, c in counts.items() if c == max_count]
        if len(most_frequent) == 1:
            return most_frequent[0]
        # Tiebreak: longest name
        max_len = max(len(n) for n in most_frequent)
        longest = [n for n in most_frequent if len(n) == max_len]
        return sorted(longest)[0]  # Alphabetical tiebreaker
Entities are processed in type-specific batches to improve accuracy.

Sorting

Entities are sorted before batching:
PERSON entities: Sorted by surname so name variants cluster
    "Detective Joe Recarey" → sort key: "recarey detective joe recarey"
    "Joseph Recarey" → sort key: "recarey joseph recarey"
    # These appear adjacent in sorted list
Other types: Alphabetically by name
Implementation at src/sift_kg/resolve/resolver.py:40:
    from unidecode import unidecode

    # _TITLE_PREFIXES is defined elsewhere in resolver.py

    def _person_sort_key(name: str) -> str:
        """Sort PERSON entities by surname.

        'Mr. Edwards', 'Bradley Edwards', 'Edwards' all sort under 'edwards'.
        """
        normalized = unidecode(name).lower().strip()
        # Strip title prefixes
        changed = True
        while changed:
            changed = False
            for prefix in _TITLE_PREFIXES:
                if normalized.startswith(prefix + " "):
                    normalized = normalized[len(prefix) + 1:].strip()
                    changed = True
                    break
        # Sort by last word (surname), then full name
        parts = normalized.split()
        surname = parts[-1] if parts else normalized
        return f"{surname} {normalized}"
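To see how a surname-first sort key clusters name variants, here is a minimal runnable sketch. It is not the sift-kg implementation: `HYPOTHETICAL_TITLE_PREFIXES` is an assumed stand-in for the real `_TITLE_PREFIXES` list, and the standard-library `unicodedata` module replaces `unidecode` so the example has no third-party dependency.

```python
import unicodedata

# Assumption: an illustrative prefix list, not the list used by sift-kg.
HYPOTHETICAL_TITLE_PREFIXES = ("mr.", "mrs.", "ms.", "dr.")


def person_sort_key(name: str) -> str:
    """Build a 'surname full-name' key so variants of one person sort together."""
    # Stand-in for unidecode: decompose accents, then drop non-ASCII marks.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    normalized = ascii_name.lower().strip()
    # Strip leading titles repeatedly (handles stacked titles such as "Dr. Mr.").
    changed = True
    while changed:
        changed = False
        for prefix in HYPOTHETICAL_TITLE_PREFIXES:
            if normalized.startswith(prefix + " "):
                normalized = normalized[len(prefix) + 1:].strip()
                changed = True
                break
    parts = normalized.split()
    surname = parts[-1] if parts else normalized
    return f"{surname} {normalized}"


names = [
    "Joseph Recarey",
    "Bradley Edwards",
    "Detective Joe Recarey",
    "Mr. Edwards",
    "Edwards",
]
sorted_names = sorted(names, key=person_sort_key)
```

With this prefix list, the three Edwards variants end up next to each other, followed by the two Recarey variants, so each cluster is likely to land in the same batch.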
Overlapping Windows

Large entity sets are split into batches with overlap:
Batch size: 100 entities max
Overlap: 20 entities between consecutive batches
This prevents duplicates from being missed at batch boundaries.
    Batch 1: entities[0:100]
    Batch 2: entities[80:180]   # 20-entity overlap with batch 1
    Batch 3: entities[160:260]  # 20-entity overlap with batch 2
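The batch layout above can be reproduced with a small generator. This is a sketch under the stated batch size and overlap, not the code from sift-kg; the function name `overlapping_batches` is hypothetical.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def overlapping_batches(
    items: Sequence[T], batch_size: int = 100, overlap: int = 20
) -> Iterator[Sequence[T]]:
    """Yield consecutive windows that share `overlap` items with their predecessor."""
    if not 0 <= overlap < batch_size:
        raise ValueError("overlap must be >= 0 and smaller than batch_size")
    step = batch_size - overlap  # each window starts 80 items after the previous one
    start = 0
    while start < len(items):
        yield items[start:start + batch_size]
        if start + batch_size >= len(items):
            break  # the last window already reached the end
        start += step


entities = list(range(260))  # integers stand in for sorted entities
batches = list(overlapping_batches(entities))
bounds = [(batch[0], batch[-1] + 1) for batch in batches]
```

For 260 entities, `bounds` holds the same three slices listed above. Two entities that sort within 20 positions of each other always share at least one batch, which is why sorting by surname first matters.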
From src/sift_kg/resolve/resolver.py:393:
    Analyze these PERSON entities and identify:
    1. Duplicates — entities that refer to the exact same thing (merge them)
    2. Variants — entities that are a subtype, version, or specific implementation of a parent entity (link them with EXTENDS)

    Look for:
    - Name variations (abbreviations, nicknames, full vs common names, misspellings)
    - Title/honorific prefixes that don't change identity (Dr., Mr., Detective, etc.)
    - First name vs nickname variants
    - Aliases — if an entity's aliases list contains a name matching another entity, they are very likely the same
    - Same person referenced differently across documents
    - DO NOT merge genuinely different people (father and son, unrelated people sharing a surname)

    IMPORTANT: If entity B is a variant/subtype/version of entity A (not the same
    thing, but derived from it), put it in "variants" NOT "groups". Only true
    duplicates go in "groups".

    Return valid JSON only:
    {
      "groups": [
        {
          "canonical_id": "id of the best/most complete entity",
          "canonical_name": "the preferred name",
          "member_ids": ["id1", "id2"],
          "confidence": 0.0-1.0,
          "reason": "brief explanation"
        }
      ],
      "variants": [
        {
          "parent_id": "id of the parent/base entity",
          "child_id": "id of the variant/subtype",
          "confidence": 0.0-1.0,
          "reason": "brief explanation"
        }
      ]
    }
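A response in this shape can be parsed into typed records before any merge is proposed. The sketch below is illustrative only: the class names, the function `parse_resolution_response`, and the entity IDs are hypothetical, and dropping the canonical ID from `member_ids` is an assumption about how one might avoid merging an entity into itself.

```python
import json
from dataclasses import dataclass


@dataclass
class MergeGroup:
    canonical_id: str
    canonical_name: str
    member_ids: list[str]
    confidence: float
    reason: str


@dataclass
class VariantLink:
    parent_id: str
    child_id: str
    confidence: float
    reason: str


def parse_resolution_response(raw: str) -> tuple[list[MergeGroup], list[VariantLink]]:
    """Parse the JSON described in the prompt into groups and variants."""
    data = json.loads(raw)
    groups: list[MergeGroup] = []
    for g in data.get("groups", []):
        # Assumption: the canonical may appear in member_ids; never merge it into itself.
        members = [m for m in g.get("member_ids", []) if m != g["canonical_id"]]
        if not members:
            continue  # nothing left to merge
        groups.append(
            MergeGroup(
                canonical_id=g["canonical_id"],
                canonical_name=g.get("canonical_name", ""),
                member_ids=members,
                confidence=float(g.get("confidence", 0.0)),
                reason=g.get("reason", ""),
            )
        )
    variants = [
        VariantLink(
            parent_id=v["parent_id"],
            child_id=v["child_id"],
            confidence=float(v.get("confidence", 0.0)),
            reason=v.get("reason", ""),
        )
        for v in data.get("variants", [])
    ]
    return groups, variants


# Hypothetical response with made-up IDs, for illustration only.
raw_response = json.dumps({
    "groups": [{
        "canonical_id": "person:jane_doe",
        "canonical_name": "Jane Doe",
        "member_ids": ["person:jane_doe", "person:ms_doe"],
        "confidence": 0.92,
        "reason": "Same person with a title prefix",
    }],
    "variants": [],
})
groups, variants = parse_resolution_response(raw_response)
```

Keeping `groups` and `variants` as separate record types mirrors the prompt's distinction: groups lead to merges, while variants lead to EXTENDS links.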
When a domain provides system_context, it’s prepended to the prompt:
    system_context: |
      You are analyzing academic papers to map the intellectual landscape.
      Distinguish carefully:
      - "Transformer" as an architecture is a THEORY
      - "GPT-2" the trained model is a SYSTEM
      These should NOT be merged — GPT-2 EXTENDS Transformer
This helps the LLM make domain-appropriate decisions.
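Prepending the context might look like the following. This is a hedged sketch, not the sift-kg code: `build_resolution_prompt` is a hypothetical helper, and the blank-line separator is an assumption.

```python
from typing import Optional


def build_resolution_prompt(base_prompt: str, system_context: Optional[str] = None) -> str:
    """Return the base prompt, with the domain's system_context in front when present."""
    context = (system_context or "").strip()
    if not context:
        return base_prompt  # domains without system_context use the prompt unchanged
    return f"{context}\n\n{base_prompt}"


prompt = build_resolution_prompt(
    "Analyze these PERSON entities and identify:",
    "You are analyzing academic papers to map the intellectual landscape.",
)
```

Placing the context first means the model reads the domain rules before it sees the generic merge instructions.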
Entities that share the same name but were extracted under different types are merged automatically (no LLM call needed):
Canonical type: The one with more connections (higher degree)
Reason: “Same name across types (CONCEPT vs PHENOMENON). Relations will be combined.”
From src/sift_kg/resolve/resolver.py:190:
    def _find_cross_type_duplicates(kg: KnowledgeGraph) -> list[MergeProposal]:
        # Group by normalized name
        name_groups: dict[str, list[tuple[str, str, int]]] = defaultdict(list)
        for nid, data in kg.graph.nodes(data=True):
            entity_type = data.get("entity_type", "")
            if entity_type in SKIP_TYPES:
                continue
            name = data.get("name", "").strip().lower()
            degree = kg.graph.degree(nid)
            name_groups[name].append((nid, entity_type, degree))

        proposals = []
        for _name, group in name_groups.items():
            types = {t for _, t, _ in group}
            if len(types) < 2:  # Only one type, no cross-type dup
                continue
            # Canonical = highest degree
            group.sort(key=lambda x: x[2], reverse=True)
            canonical_id, canonical_type, _ = group[0]
            # Create merge proposal...
    ┌─ Merge 1/23 ─────────────────────────────────────┐
    │ Merge into: Bradley Edwards (person:bradley_ed…) │
    │ Type: PERSON                                     │
    │                                                  │
    │ Members to merge                                 │
    │ Member             ID                   Confid…  │
    │ Mr. Edwards        person:mr_edwards    92%      │
    │ Detective Edwards  person:detective_e…  88%      │
    │                                                  │
    │ Reason: Same person with different titles        │
    └──────────────────────────────────────────────────┘
      [a]pprove  [r]eject  [s]kip  [q]uit →
User decisions:
Approve: Status changes to CONFIRMED, will be applied
For each CONFIRMED proposal, sift-kg:

1. Merge Node Data

Canonical entity accumulates data from merged members:
    # Combine source documents
    canonical_docs = canonical.get("source_documents", [])
    member_docs = member.get("source_documents", [])
    for doc in member_docs:
        if doc not in canonical_docs:
            canonical_docs.append(doc)

    # Keep higher confidence
    if member.get("confidence") > canonical.get("confidence"):
        canonical["confidence"] = member["confidence"]

    # Merge attributes (canonical takes precedence)
    for key, value in member.get("attributes", {}).items():
        if key not in canonical_attrs:
            canonical_attrs[key] = value

    # Track member names as aliases
    member_name = member.get("name")
    if member_name not in aliases:
        aliases.append(member_name)
From src/sift_kg/resolve/engine.py:95.

2. Rewrite Edges

All relations pointing to/from merged members are redirected to the canonical:
    # Snapshot the edge list first, because the loop mutates the graph
    for source, target, key, data in list(kg.graph.edges(data=True, keys=True)):
        new_source = merge_map.get(source, source)
        new_target = merge_map.get(target, target)
        if new_source != source or new_target != target:
            kg.graph.remove_edge(source, target, key=key)
            # Skip self-loops
            if new_source == new_target:
                stats["self_loops_removed"] += 1
                continue
            kg.graph.add_edge(new_source, new_target, key=key, **data)
3. Remove Merged Nodes
    for member_id in valid_map:
        if kg.graph.has_node(member_id):
            kg.graph.remove_node(member_id)
            stats["nodes_removed"] += 1
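The edge rewrite and node removal can be tried without a graph library. The sketch below models edges as plain tuples; `apply_merge_map`, the entity IDs, and the `SAME_AS` relation type are hypothetical and only illustrate the redirect, self-loop, and removal behaviour described above.

```python
from collections import Counter

Edge = tuple[str, str, str]  # (source_id, target_id, relation_type)


def apply_merge_map(
    nodes: set[str], edges: list[Edge], merge_map: dict[str, str]
) -> tuple[set[str], list[Edge], Counter]:
    """Redirect edges from merged members to their canonical, then drop the members."""
    stats: Counter = Counter()
    rewritten: list[Edge] = []
    for source, target, rel_type in edges:
        new_source = merge_map.get(source, source)
        new_target = merge_map.get(target, target)
        changed = (new_source, new_target) != (source, target)
        if changed and new_source == new_target:
            # Both ends now point at the same canonical entity: drop the edge.
            stats["self_loops_removed"] += 1
            continue
        rewritten.append((new_source, new_target, rel_type))
    remaining = set()
    for node in nodes:
        if node in merge_map:
            stats["nodes_removed"] += 1
        else:
            remaining.add(node)
    return remaining, rewritten, stats


nodes = {"person:jane_doe", "person:ms_doe", "org:acme"}
edges = [
    ("person:ms_doe", "org:acme", "EMPLOYED_BY"),
    ("person:ms_doe", "person:jane_doe", "SAME_AS"),
]
merge_map = {"person:ms_doe": "person:jane_doe"}  # member -> canonical
remaining, rewritten, stats = apply_merge_map(nodes, edges, merge_map)
```

After the call, the employment edge starts at the canonical entity, the edge between the two variants has collapsed into a self-loop and been dropped, and the member node is gone.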
4. Remove Rejected Relations

Relations marked REJECTED in relation_review.yaml are deleted:
    rejection_keys: set[tuple[str, str, str]] = set()
    for entry in review_file.rejected:
        rejection_keys.add((entry.source_id, entry.target_id, entry.relation_type))
        # Also handle symmetric
        rejection_keys.add((entry.target_id, entry.source_id, entry.relation_type))

    removed = 0
    # Snapshot the edge list first, because the loop mutates the graph
    for source, target, key, data in list(kg.graph.edges(data=True, keys=True)):
        rel_type = data.get("relation_type", "")
        if (source, target, rel_type) in rejection_keys:
            kg.graph.remove_edge(source, target, key=key)
            removed += 1
From src/sift_kg/resolve/engine.py:140.
    ┌─ Merge 1/1 ──────────────────────────────────────┐
    │ Merge into: Bradley Edwards                      │
    │ Type: PERSON                                     │
    │                                                  │
    │ Members to merge                                 │
    │ Member       ID                 Confidence       │
    │ Mr. Edwards  person:mr_edwards  92%              │
    │                                                  │
    │ Reason: Same person with title variations        │
    └──────────────────────────────────────────────────┘
      [a]pprove  [r]eject  [s]kip  [q]uit → a

The duplicate EMPLOYED_BY relation becomes a self-loop and is dropped.

Final result: One consolidated entity with complete provenance and all name variations tracked.
Provide domain context to help the LLM make better decisions:
    name: "Academic Research"
    system_context: |
      When resolving entities:
      - "Transformer" (architecture) vs "BERT" (model): DO NOT MERGE
        BERT EXTENDS Transformer, they are related but distinct
      - "ResNet" vs "Residual Networks": MERGE (same thing)
      - "John Smith" (researcher) vs "J. Smith" (author): LIKELY SAME
        unless there's evidence of different affiliations
If the LLM makes systematic mistakes, add guidance to system_context:
    system_context: |
      Common mistakes to avoid:
      - "University of X" and "X University" are the SAME (merge them)
      - "John Smith" and "John Smith Jr." are DIFFERENT (father and son)
      - "ACL" (conference) and "ACL" (association) are DIFFERENT entities