The JARVIS pipeline is the core of Mega Brain, transforming raw transcriptions into structured knowledge through 5 phases. This guide walks through each phase with real examples.

Pipeline Overview

1. Phase 1: Initialization - validates input, extracts metadata, loads state files, detects duplicates
2. Phase 2: Chunking - breaks content into semantic segments (~300 words each)
3. Phase 3: Entity Resolution - canonicalizes person names, themes, and concepts
4. Phase 4: Insight Extraction - extracts frameworks, heuristics, and actionable insights
5. Phase 5: Narrative Synthesis - creates coherent narratives by person and theme
The complete pipeline takes 2-5 minutes per material depending on length.

Starting the Pipeline

Basic Processing

Process a single file:
/process-jarvis inbox/cole-gordon/MASTERCLASS/closing-techniques.txt

Auto-Process on Ingest

Combine ingestion and processing:
/ingest https://youtube.com/watch?v=abc123 --process

Phase 1: Initialization

1.1 Input Validation

IF file does not exist:
  → LOG ERROR: "File not found"
  → EXIT with status: FILE_NOT_FOUND
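The validation step above can be sketched in Python. `validate_input` is a hypothetical helper, not the actual JARVIS entry point:

```python
import sys
from pathlib import Path

def validate_input(file_path: str) -> None:
    """Phase 1.1 sketch: fail fast when the input file is missing."""
    if not Path(file_path).is_file():
        # Log the error, then exit with a named status
        print(f"ERROR: File not found: {file_path}", file=sys.stderr)
        raise SystemExit("FILE_NOT_FOUND")
```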

1.2 Metadata Extraction

From the file path:
inbox/cole-gordon/MASTERCLASS/video-title.txt
        ↓             ↓             ↓
  SOURCE_PERSON  SOURCE_TYPE    FILENAME
Extracted metadata:
  • SOURCE_PERSON: "Cole Gordon"
  • SOURCE_COMPANY: "Cole Gordon"
  • SOURCE_TYPE: "MASTERCLASS"
  • SOURCE_ID: "CG003" (auto-generated)
  • SCOPE: "company" or "personal"
  • CORPUS: "closers_io"
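Path-based metadata extraction might look like the sketch below. `extract_metadata` is a hypothetical helper; the real SOURCE_ID and SCOPE logic is not shown here:

```python
from pathlib import Path

def extract_metadata(file_path: str) -> dict:
    """Derive source metadata from an inbox path (illustrative only)."""
    # e.g. ('inbox', 'cole-gordon', 'MASTERCLASS', 'closing-techniques.txt')
    parts = Path(file_path).parts
    person_slug, source_type, filename = parts[1], parts[2], parts[3]
    # Turn the slug back into a display name: 'cole-gordon' -> 'Cole Gordon'
    person = person_slug.replace("-", " ").title()
    return {
        "SOURCE_PERSON": person,
        "SOURCE_TYPE": source_type,
        "FILENAME": filename,
    }
```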

1.3 State Files Loading

Loads or creates:
  • CHUNKS-STATE.json - All semantic chunks
  • CANONICAL-MAP.json - Entity normalization
  • INSIGHTS-STATE.json - Extracted insights
  • NARRATIVES-STATE.json - Synthesized narratives

1.4 Duplicate Detection

A 6-level check prevents reprocessing:
  ✓ MD5 hash comparison
  ✓ Content hash (ignores formatting)
  ✓ Partial content matching
  ✓ YouTube ID lookup
  ✓ File registry check
  ✓ Chunk fingerprint analysis
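The first two levels can be sketched as follows. `raw_hash` and `content_hash` are illustrative names, and the normalization rule (collapse whitespace, lowercase) is an assumption:

```python
import hashlib
import re

def raw_hash(data: bytes) -> str:
    """Level 1: exact-bytes MD5 comparison."""
    return hashlib.md5(data).hexdigest()

def content_hash(text: str) -> str:
    """Level 2: hash after normalizing whitespace and case,
    so a reformatted copy of the same transcript still matches."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

original = "The CLOSER needs   to master NEPQ."
reformatted = "the closer needs to master NEPQ."
# content_hash matches (same words), raw_hash does not (different bytes)
```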

Phase 2: Chunking

Semantic Segmentation

Content is broken into ~300-word semantic chunks preserving:
  • Timestamps
  • Speaker labels
  • Formatting
  • Context boundaries
{
  "id_chunk": "chunk_CG003_042",
  "source_id": "CG003",
  "source_path": "inbox/cole-gordon/MASTERCLASS/...",
  "source_type": "lecture",
  "text": "The CLOSER needs to master NEPQ...",
  "speaker": "Cole Gordon",
  "word_count": 287,
  "pessoas": ["Cole Gordon", "closer"],
  "temas": ["sales", "objection handling"],
  "key_concepts": ["NEPQ", "discovery call"],
  "chunk_sequence": 42
}
Chunks are the foundation of traceability - every insight traces back to specific chunks.
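A minimal sketch of the ~300-word sizing. The real chunker also honors semantic boundaries, timestamps, and speaker labels; this only shows the fixed-size split:

```python
def chunk_by_words(text: str, target: int = 300) -> list[str]:
    """Split text into consecutive chunks of at most `target` words."""
    words = text.split()
    return [" ".join(words[i:i + target]) for i in range(0, len(words), target)]
```

For a 650-word transcript this yields three chunks of 300, 300, and 50 words.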

Phase 3: Entity Resolution

Canonicalization Process

Normalizes variations of the same entity:
Problem: multiple variations of the same name
  "Sam oven"
  "Sam Ovens"
  "sam"
  "Samuel Ovens"
Solution: one canonical form
  Canonical: "Sam Ovens"
  Aliases: ["sam", "Sam oven", "Samuel Ovens"]
  Confidence: 0.95

Merge Thresholds

Entity resolution uses confidence thresholds to prevent false merges:
  • ≥ 0.95: Auto-merge (high confidence)
  • 0.85-0.94: Add to review queue
  • < 0.85: Keep separate
Output:
Phase 3/5 - Resolution ............ OK (12 entities)

Entities resolved: 12
Aliases added: 5
Review queue: 2 (manual review needed)
Collisions: 0 (no name conflicts)
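The threshold routing can be expressed as a small function (function and action names assumed; the thresholds are taken from the list above):

```python
def route_entity_match(confidence: float) -> str:
    """Map a match confidence to a merge action."""
    if confidence >= 0.95:
        return "auto-merge"      # high confidence
    if confidence >= 0.85:
        return "review-queue"    # ambiguous: needs manual review
    return "keep-separate"       # too risky to merge
```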

Phase 4: Insight Extraction

Insight Classification

Extracts structured knowledge with priority levels:
HIGH Priority - Impacts money, structure, risk, critical decisions
"Close rate below 60% means you need script work, not more leads"
→ HIGH (affects revenue directly)
MEDIUM Priority - Improves process/clarity but not urgent
"Use CRM tags to track objection types by prospect stage"
→ MEDIUM (operational improvement)
LOW Priority - Contextual or peripheral information
"Cole Gordon started his sales career at age 19"
→ LOW (background context)

Insight Structure

{
  "insight_id": "INS_CG003_042",
  "category": "HEURISTIC",
  "priority": "HIGH",
  "content": "If close rate < 60%, problem is script, not lead volume",
  "chunks": ["chunk_CG003_042", "chunk_CG003_043"],
  "confidence": 0.92,
  "actionable_by": ["closer", "sales-manager"],
  "frameworks_referenced": ["NEPQ"],
  "status": "new"
}
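Traceability in practice: given an insight like the one above and CHUNKS-STATE.json loaded as a dict keyed by chunk ID (an assumption about its in-memory shape), resolving the evidence is a simple lookup:

```python
def trace_insight(insight: dict, chunks_state: dict) -> list[dict]:
    """Resolve an insight's chunk IDs to the full chunk records
    that serve as its evidence."""
    return [chunks_state[cid] for cid in insight["chunks"]]
```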

Knowledge Layers (DNA Schema)

1. L1: Philosophies - core beliefs and worldview
   • Appear 3+ times in different contexts
   • No numeric thresholds
   • Example: "Philosophy beats tactics"
2. L2: Mental Models - thinking frameworks and lenses
   • Generate specific questions
   • Change how you see problems
   • Example: "3 Audience Buckets (YES/NO/MAYBE)"
3. L3: Heuristics - rules with numeric thresholds (MOST VALUABLE)
   • Format: "If X then Y"
   • Contains numbers
   • Example: "If show rate < 75%, fix confirmation system"
4. L4: Frameworks - structured methodologies
   • Named components
   • No rigid order
   • Example: "NEPQ Framework (Situation, Problem, Implication, Need-Payoff)"
5. L5: Methodologies - step-by-step processes
   • Rigid order required
   • Success criteria per step
   • Example: "7-Step Closing Process"
Output:
Phase 4/5 - Extraction ............ OK (12 insights)

Total extracted: 12
HIGH priority: 5
MEDIUM priority: 4
LOW priority: 3
Contradictions: 0
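Because L3 heuristics always pair a conditional with a number, a crude pre-filter can flag candidates. This regex sketch is illustrative only; the actual extraction is model-driven:

```python
import re

def looks_like_heuristic(text: str) -> bool:
    """Flag likely L3 heuristics: an 'If X then Y' shape plus a number."""
    has_conditional = bool(re.search(r"\bif\b", text, re.IGNORECASE))
    has_number = bool(re.search(r"\d", text))
    return has_conditional and has_number
```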

Phase 5: Narrative Synthesis

Creating Coherent Stories

Synthesizes insights into executive memory format:
Aggregates all insights from a person:
## Alex Hormozi - Narrative Synthesis

### Position on Pricing
Hormozi consistently advocates for value-based pricing...
[chunk_AH001_023, chunk_AH002_045]

### Patterns Identified
1. Always ties price to value equation (4 variables)
2. Rejects cost-plus pricing in all contexts
3. References Porsche pricing as case study

### Open Loops
- How does this apply to services vs products?
- What's the threshold for "premium" positioning?

Incremental Updates

Narratives are APPENDED to, never replaced:
Merge rules:
  • narrative: CONCATENATE with separator
  • insights_included[]: APPEND chunk_ids
  • tensions[]: APPEND new tensions
  • open_loops[]: APPEND new, mark RESOLVED for answered
  • next_questions[]: REPLACE (only exception)
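The merge rules above can be sketched as a pure function (field names assumed from the bullet list, not the exact implementation):

```python
def merge_narrative(existing: dict, update: dict) -> dict:
    """Apply the append-only merge rules: concatenate the narrative,
    append list fields, and replace only next_questions."""
    return {
        "narrative": existing["narrative"] + "\n\n---\n\n" + update["narrative"],
        "insights_included": existing["insights_included"] + update["insights_included"],
        "tensions": existing["tensions"] + update["tensions"],
        "open_loops": existing["open_loops"] + update["open_loops"],
        # next_questions is the only field that is replaced, never appended
        "next_questions": update["next_questions"],
    }
```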
Output:
Phase 5/5 - Synthesis ............. OK (3 narratives)

Persons updated: 1 (Cole Gordon)
Themes updated: 2 (Sales Process, Objection Handling)
Open loops: 4 identified
Tensions: 1 documented

Dossier Compilation

After the five pipeline phases complete, JARVIS compiles Markdown dossiers:
# DOSSIER: Cole Gordon

**Sources:** CG001, CG002, CG003
**Last Updated:** 2026-03-06
**Density:** ◐◐◐◯◯ (3/5)

## TL;DR

Closing expert focused on high-ticket sales...
[CG001_012, CG002_034]

## Central Philosophy

"The prospect already knows if they want to buy..."
[CG001_001]

## Modus Operandi

### Discovery-First Approach [CG001_023, CG001_024]
...

Complete Pipeline Output

═══════════════════════════════════════════════
        JARVIS PIPELINE COMPLETE
         Cole Gordon (CG003)
═══════════════════════════════════════════════

[INPUT] SOURCE
   File: inbox/cole-gordon/MASTERCLASS/closing.txt
   Person: Cole Gordon (Cole Gordon)
   Type: MASTERCLASS
   Words: 6,647

[CHUNK] CHUNKING
   Chunks created: 23
   Avg chunk size: 289 words

[ENTITY] ENTITY RESOLUTION
   Entities resolved: 12
   Aliases added: 5
   [!] Review queue: 2
   [!] Collisions: 0

[INSIGHT] INSIGHTS
   Total extracted: 12
   HIGH priority: 5
   MEDIUM priority: 4
   LOW priority: 3
   Contradictions: 0

[NARRATIVE] NARRATIVES
   Persons updated: 1
   Themes updated: 2
   Open loops: 4
   Tensions: 1

[DOSSIER] DOSSIERS
   Persons: 0 created, 1 updated
   Themes: 1 created, 1 updated
   RAG indexed: 2 files

[OK] STATUS: SUCCESS
   Time: 2m 34s

═══════════════════════════════════════════════

Troubleshooting

Issue: "File not found"
Solution: Verify that the file path is correct:
/process-jarvis inbox/[PERSON]/[TYPE]/[FILE].txt

Issue: "Duplicate detected"
Solution: The file was already processed. Check file-registry.json.
To reprocess, remove the entry from the registry first.

Issue: "Review queue has entries"
Solution: Ambiguous entities need manual review.
Check: /processing/canonical/REVIEW-QUEUE.json

Issue: Low insight extraction (< 5 insights)
Possible causes:
- Content too generic (not expert-level)
- Poor transcription quality
- Wrong content-type classification

Next Steps

Extract DNA

Create expert mind clones from processed materials

Use Agents

Query agents enriched with new knowledge

Run Conclave

Multi-agent deliberation on strategic decisions

Manage Sessions

Save and resume processing sessions
