/process-jarvis - Knowledge Extraction Pipeline

Transforms raw content from inbox into structured knowledge through an 8-phase semantic processing pipeline. This is the core engine of Mega Brain.

Syntax

/process-jarvis [FILE_PATH]

FILE_PATH (string, required)

Path to a file in the inbox/ directory.
Example: inbox/ALEX HORMOZI/MASTERCLASSES/video-title.txt

Pipeline Overview

The JARVIS pipeline processes content through 8 mandatory phases:

┌──────────────────────────────────────────────────────────────────────────┐
│                         JARVIS PIPELINE v2.2                                  │
├──────────────────────────────────────────────────────────────────────────┤
│  Phase 1: Initialization + Validation          [PRE-1, POST-1]              │
│  Phase 2: Chunking (~300 words/chunk)           [PRE-2, POST-2]              │
│  Phase 3: Entity Resolution                     [PRE-3, POST-3]              │
│  Phase 4: Insight Extraction                    [PRE-4, POST-4]              │
│  Phase 5: Narrative Synthesis                   [PRE-5, POST-5]              │
│  Phase 6: Dossier Compilation                   [PRE-6, POST-6]              │
│  Phase 7: Agent Enrichment                      [User Prompt]                │
│  Phase 8: Finalization + Registry Update        [CHECKPOINT 7]               │
└──────────────────────────────────────────────────────────────────────────┘

All 8 phases are MANDATORY. The pipeline is NOT complete after Phase 7: it pauses there for user confirmation and then continues to Phase 8. Skipping Phase 8 results in incomplete propagation.

Phase-by-Phase Breakdown

Phase 1: Initialization

Purpose: Validate input and extract metadata from file path

Validate File Exists

# Checks if file exists at specified path
test -f "$FILE_PATH" || exit 1

Extract Path Metadata

Path: inbox/COLE GORDON/MASTERMINDS/video-title.txt

Extracted:
  SOURCE_PERSON: "Cole Gordon"
  SOURCE_COMPANY: "Cole Gordon"
  SOURCE_TYPE: "MASTERMINDS" → mapped to "lecture"
  SOURCE_ID: "CG003" (auto-generated hash)
  SCOPE: "personal" (auto-determined)
  CORPUS: "closers_io" (from known sources)
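The path-based part of this extraction can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function name `extract_path_metadata` and the `SOURCE_TYPE_MAP` table are assumptions, and only the MASTERMINDS → lecture entry comes from the example above. SOURCE_ID, SCOPE, and CORPUS are not derived here because the rules for them are not documented.

```python
from pathlib import PurePosixPath

# Hypothetical folder-to-type table; only this one entry is taken from the docs.
SOURCE_TYPE_MAP = {"MASTERMINDS": "lecture"}


def extract_path_metadata(file_path: str) -> dict:
    """Derive person and source type from an inbox/PERSON/[TYPE/]file path."""
    parts = PurePosixPath(file_path).parts
    if len(parts) < 3 or parts[0] != "inbox":
        raise ValueError(f"expected inbox/PERSON/.../file, got: {file_path}")
    # The TYPE folder is only present when the path is at least four levels deep.
    type_dir = parts[2] if len(parts) > 3 else None
    return {
        "SOURCE_PERSON": parts[1].title(),  # "COLE GORDON" -> "Cole Gordon"
        "SOURCE_TYPE": SOURCE_TYPE_MAP.get(type_dir) if type_dir else None,
        "SOURCE_TITLE": parts[-1],
    }


meta = extract_path_metadata("inbox/COLE GORDON/MASTERMINDS/video-title.txt")
print(meta)
```

Unknown type folders resolve to None here rather than guessing a mapping.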

Load State Files

Creates if missing:

CHUNKS-STATE.json
CANONICAL-MAP.json
INSIGHTS-STATE.json
NARRATIVES-STATE.json

Duplicate Detection (CRITICAL)

░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
⛔ DUPLICATE DETECTION - STOPS PROCESSING IF FOUND
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

Checks:
1. MD5 hash (exact duplicate)
2. Content hash (same content, different file)
3. Fingerprint (partial duplicate)
4. YouTube ID (same video already processed)

If duplicate found: EXIT immediately
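The first two checks (exact MD5 match and same content in a different file) could look roughly like the sketch below. The normalization rule, the use of SHA-256 for the content hash, and the in-memory `registry` dict are all assumptions made for illustration; the fingerprint and YouTube ID checks are left out because their logic is not described here.

```python
import hashlib
from typing import Optional


def md5_of(text: str) -> str:
    """MD5 of the raw text, used for the exact-duplicate check."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def content_hash(text: str) -> str:
    """Hash of case- and whitespace-normalized text (hypothetical normalization)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicate(text: str, registry: dict) -> Optional[str]:
    """Return the SOURCE_ID of an already registered duplicate, or None."""
    for key in (md5_of(text), content_hash(text)):
        if key in registry:
            return registry[key]
    return None


# Register one source, then probe with an exact copy, a reformatted copy,
# and unrelated text.
registry = {}
original = "Philosophy beats tactics.\n"
registry[md5_of(original)] = "CG001"
registry[content_hash(original)] = "CG001"

exact = find_duplicate(original, registry)
reformatted = find_duplicate("philosophy   beats tactics.", registry)
fresh = find_duplicate("Something entirely new.", registry)
```

A caller would exit immediately whenever `find_duplicate` returns a SOURCE_ID, matching the rule above.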

Checkpoint: PRE-1 + POST-1 must pass

Phase 2: Chunking

Purpose: Split content into ~300-word semantic chunks with metadata

Chunk Structure

{
  "id_chunk": "chunk_CG003_001",
  "content": "Full text of chunk...",
  "word_count": 287,
  "pessoas": ["Cole Gordon", "Alex Hormozi"],
  "temas": ["Sales Process", "Closing Techniques"],
  "meta": {
    "source_type": "lecture",
    "source_id": "CG003",
    "source_title": "video-title.txt",
    "source_path": "inbox/COLE GORDON/...",
    "source_datetime": "2026-03-06T10:00:00Z",
    "scope": "personal",
    "corpus": "closers_io"
  }
}

Chunking Rules

Target size: ~300 words (~1000 tokens)
Preserve: Timestamps, speaker labels, formatting
Extract: People mentioned (raw), themes (raw)
Sequential IDs: chunk_{SOURCE_ID}_{001...NNN}

Output

Appends to processing/chunks/CHUNKS-STATE.json:

{
  "chunks": [
    {...},  // existing chunks
    {...}   // new chunks from this file
  ],
  "meta": {
    "last_updated": "2026-03-06T10:15:23Z",
    "total_chunks": 1247,
    "version": "v1"
  }
}
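As a rough illustration of the target size and the sequential ID scheme, the sketch below splits on a fixed word count. It is a simplification under stated assumptions: the real phase is described as semantic and as preserving formatting, whereas `str.split()` here collapses whitespace, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, source_id: str, target_words: int = 300) -> list[dict]:
    """Split text into fixed-size word windows with chunk_{SOURCE_ID}_{NNN} IDs."""
    words = text.split()
    chunks = []
    for start in range(0, len(words), target_words):
        piece = words[start:start + target_words]
        chunks.append({
            # IDs are 1-based and zero-padded to three digits.
            "id_chunk": f"chunk_{source_id}_{len(chunks) + 1:03d}",
            "content": " ".join(piece),
            "word_count": len(piece),
        })
    return chunks


sample = "word " * 650
chunks = chunk_text(sample, "CG003")
print([(c["id_chunk"], c["word_count"]) for c in chunks])
```

The last chunk simply holds whatever words remain, so it can be much shorter than the target.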

Checkpoint: PRE-2 + POST-2 must pass

Phase 3: Entity Resolution

Purpose: Canonicalize person/company names and themes

Why? “Cole”, “Cole Gordon”, “CG” all refer to the same person. This phase unifies them.

Canonical Mapping:
  "Alex" → "Alex Hormozi"
  "Hormozi" → "Alex Hormozi"
  "acquisition.com" → "Acquisition.com"
  
Threshold: 0.85 confidence
Output: CANONICAL-MAP.json + updated CHUNKS-STATE.json

Handles:

Name variations (“Cole” vs “Cole Gordon”)
Typos (“Hormozi” vs “Hormozzi”)
Abbreviations (“CG” → “Cole Gordon”)
Collisions (same name in different corpora)
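One plausible way to combine an alias table with the 0.85 threshold is sketched below. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the actual scoring method is not documented, and the `ALIASES` table and `resolve` function are hypothetical. Collision handling across corpora is not covered.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical alias table built from the examples in this section.
ALIASES = {
    "alex": "Alex Hormozi",
    "hormozi": "Alex Hormozi",
    "alex hormozi": "Alex Hormozi",
    "cole": "Cole Gordon",
    "cg": "Cole Gordon",
    "cole gordon": "Cole Gordon",
}
THRESHOLD = 0.85


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def resolve(raw_name: str) -> Optional[str]:
    """Return the canonical name, or None when no alias is close enough."""
    key = raw_name.strip().lower()
    if key in ALIASES:  # exact alias or abbreviation
        return ALIASES[key]
    # Otherwise fall back to the closest known alias (catches typos).
    best_alias = max(ALIASES, key=lambda alias: similarity(key, alias))
    if similarity(key, best_alias) >= THRESHOLD:
        return ALIASES[best_alias]
    return None


print(resolve("Hormozzi"), resolve("CG"), resolve("Unknown Person"))
```

Names that return None would be candidates for the review queue shown in the execution report.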

Checkpoint: PRE-3 + POST-3 must pass

Phase 4: Insight Extraction

Purpose: Extract actionable insights with priority classification

Priority Levels

HIGH:
  - Affects money, structure, risk, decisions
  - Operational criticality
  - Example: "Commission structure must be 10% base + 5% accelerator"

MEDIUM:
  - Improves process/clarity
  - Not urgent
  - Example: "Weekly team meetings improve morale"

LOW:
  - Peripheral context
  - Background information
  - Example: "Cole started his career in 2015"

Insight Structure

{
  "insight_id": "insight_CG003_042",
  "content": "Close rate drops 40% without proper qualification",
  "priority": "HIGH",
  "confidence": 0.92,
  "chunks": ["chunk_CG003_012", "chunk_CG003_013"],
  "actionable_by": ["CLOSER", "SALES-MANAGER"],
  "theme": "02-PROCESSO-VENDAS",
  "status": "new"
}

Contradiction Detection

If insight contradicts existing insight:
  - Mark status: "contradiction"
  - Document both sides
  - Require human review

Example:
  Source A: "Cold calls work best 9-11am"
  Source B: "Cold calls work best 4-6pm"
  → Flag as contradiction, include in dossier

Output: processing/insights/INSIGHTS-STATE.json

Checkpoint: PRE-4 + POST-4 must pass

Phase 5: Narrative Synthesis

Purpose: Synthesize insights into executive narratives

Style: “Executive memory” - clear, strategic, evidence-based

Narrative Structure:
  - Patterns identified
  - Positions (expert's stance)
  - Tensions (contradictions)
  - Open loops (unanswered questions)
  - Consensus points
  - Next questions

Merge Rules (CRITICAL):
  narrative: CONCATENATE with separator
  insights_included[]: APPEND (don't replace)
  tensions[]: APPEND (don't replace)
  open_loops[]: APPEND, mark RESOLVED if answered
  next_questions[]: REPLACE (only exception)
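The merge rules above translate fairly directly into code. The sketch below is an assumption-laden illustration: the function name, the separator string, and the flat dict layout are hypothetical, and the "mark RESOLVED if answered" step for open loops is omitted because it needs a judgment about whether a question was answered.

```python
def merge_narrative(existing: dict, new: dict, separator: str = "\n\n---\n\n") -> dict:
    """Merge a new narrative into an existing one following the rules above."""
    merged = dict(existing)
    # narrative: CONCATENATE with separator (skipping empty sides)
    parts = (existing.get("narrative", ""), new.get("narrative", ""))
    merged["narrative"] = separator.join(p for p in parts if p)
    # insights_included, tensions, open_loops: APPEND, never replace
    for field in ("insights_included", "tensions", "open_loops"):
        merged[field] = existing.get(field, []) + new.get(field, [])
    # next_questions: REPLACE (the only exception)
    merged["next_questions"] = list(new.get("next_questions", []))
    return merged


existing = {
    "narrative": "A",
    "insights_included": ["i1"],
    "tensions": [],
    "open_loops": ["q1"],
    "next_questions": ["old"],
}
new = {
    "narrative": "B",
    "insights_included": ["i2"],
    "tensions": ["t1"],
    "open_loops": [],
    "next_questions": ["new"],
}
merged = merge_narrative(existing, new)
```

Replacing instead of appending would silently drop earlier sources, which is presumably why the rules are marked CRITICAL.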

Output: processing/narratives/NARRATIVES-STATE.json

Checkpoint: PRE-5 + POST-5 must pass

Phase 6: Dossier Compilation

Purpose: Transform narratives into Markdown dossiers

CRITICAL RULE: Every section MUST have chunk_ids for traceability

### Christmas Tree Structure [CG001_012, SS001_045]
  ✓ Correct

### Christmas Tree Structure
  ✗ BLOCKED - No chunk_ids
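A check for this rule can be approximated with a regular expression over each section heading. The pattern below is inferred from the example IDs (uppercase prefix, digits, underscore, digits) and is an assumption, as is the function name `heading_has_chunk_ids`.

```python
import re

# Matches a trailing bracketed list such as [CG001_012, SS001_045].
CHUNK_REF = re.compile(r"\[\s*[A-Z]+\d+_\d+(?:\s*,\s*[A-Z]+\d+_\d+)*\s*\]\s*$")


def heading_has_chunk_ids(heading: str) -> bool:
    """True when the heading line ends with at least one chunk reference."""
    return bool(CHUNK_REF.search(heading))


ok = heading_has_chunk_ids("### Christmas Tree Structure [CG001_012, SS001_045]")
blocked = heading_has_chunk_ids("### Christmas Tree Structure")
```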

Person Dossiers

# DOSSIER: COLE GORDON

**Voice:** 1st person ("I believe...")
**Sources:** CG001, CG002, CG003

## TL;DR
[1-2 sentence essence]

## Central Philosophy [CG001_045, CG002_012]
Core beliefs and worldview...

## Modus Operandi [CG001_067, CG003_023]
How this person operates...

## Technical Arsenal [CG002_089]
Frameworks and methodologies...

## Traps & Antipatterns [CG001_123]
What to avoid...

## Signature Quotes
> "Philosophy beats tactics" — [CG001_001]

Theme Dossiers

# DOSSIER: 02-PROCESSO-VENDAS

**Voice:** Neutral narrator
**Contributors:** Cole Gordon, Alex Hormozi

## Overview
[Theme summary]

## Consensus Points [CG001_045, AH002_067]
What experts agree on...

## Divergences
Where experts disagree...

## Frameworks
### STAR Qualification [CG001_089]
- Situation
- Timing
- Authority
- Resources

Incremental Updates

IF dossier exists:
  MODE = "INCREMENTAL"
  
  Actions:
  1. APPEND new source to header
  2. APPEND new patterns/positions
  3. MERGE contradictions into Tensions section
  4. UPDATE last_updated timestamp
  
ELSE:
  MODE = "CREATE"
  Generate from template

Output:

knowledge/dossiers/persons/DOSSIER-{PERSON}.md
knowledge/dossiers/THEMES/DOSSIER-{THEME}.md

Checkpoint: PRE-6 + POST-6 must pass

Phase 7: Agent Enrichment

Purpose: Update agent MEMORYs with relevant knowledge

Theme-to-Agent Mapping:
  "02-PROCESSO-VENDAS" → [CLOSER, SDS, LNS]
  "04-COMISSIONAMENTO" → [SALES-MANAGER, CRO, CFO]
  "07-PRICING" → [CRO, CFO, CLOSER]
  
Framework-to-Agent Mapping:
  "3 Audience Buckets" → [CLOSER, SDS, LNS]
  "STAR Qualification" → [SDS, CLOSER]
  "28 Rules of Closing" → [CLOSER, SALES-MANAGER]

Process:

Identify themes in processed content
Map themes → relevant agents
Update each agent’s MEMORY.md
Append source_id to memory
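The mapping step can be pictured as two lookup tables feeding one set of target agents. The tables below copy the entries listed in this section; the function name `agents_to_update` is hypothetical, and writing to each MEMORY.md is left out.

```python
THEME_TO_AGENTS = {
    "02-PROCESSO-VENDAS": ["CLOSER", "SDS", "LNS"],
    "04-COMISSIONAMENTO": ["SALES-MANAGER", "CRO", "CFO"],
    "07-PRICING": ["CRO", "CFO", "CLOSER"],
}
FRAMEWORK_TO_AGENTS = {
    "3 Audience Buckets": ["CLOSER", "SDS", "LNS"],
    "STAR Qualification": ["SDS", "CLOSER"],
    "28 Rules of Closing": ["CLOSER", "SALES-MANAGER"],
}


def agents_to_update(themes: list[str], frameworks: list[str]) -> list[str]:
    """Union of agents mapped from the detected themes and frameworks."""
    agents = set()
    for theme in themes:
        agents.update(THEME_TO_AGENTS.get(theme, []))
    for framework in frameworks:
        agents.update(FRAMEWORK_TO_AGENTS.get(framework, []))
    return sorted(agents)


targets = agents_to_update(["07-PRICING"], ["STAR Qualification"])
print(targets)
```

Unmapped themes contribute no agents in this sketch; how the pipeline treats them is not specified.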

Output: Updated agents/cargo/{AREA}/{ROLE}/MEMORY.md files

Checkpoint: User confirmation prompt

Phase 8: Finalization (MANDATORY)

This phase is NON-OPTIONAL. Pipeline is incomplete without it.

Update RAG Index

python scripts/rag_index.py --knowledge --force

Re-indexes all knowledge files for semantic search

Update File Registry

python scripts/file_registry.py --scan

Registers MD5 hash and marks file as PROCESSED

Update SESSION-STATE.md

Adds entry to “Processed Files” table

Update INBOX-REGISTRY.md

Marks file as COMPLETE with propagation status

Verify Agent Coverage

CRITICAL CHECK:
For each theme/framework:
  → List expected agents
  → Verify each agent has source_id in MEMORY.md
  → If missing: LOG ERROR and FAIL

Example failure:

❌ AGENT COVERAGE FAILED

Expected: [CLOSER, SDS, LNS]
Received: [CLOSER, SDS]
MISSING: LNS

Framework "3 Audience Buckets" was NOT propagated to LNS
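The coverage check amounts to a set difference between expected agents and agents whose memory mentions the source. In the sketch below the MEMORY.md contents are passed in as strings so the example runs standalone; the function name and the sample memory texts are hypothetical.

```python
def missing_agents(expected: list[str], memory_texts: dict, source_id: str) -> list[str]:
    """Agents whose MEMORY text does not mention the given source_id."""
    return [agent for agent in expected if source_id not in memory_texts.get(agent, "")]


# Illustrative memory contents, keyed by agent name.
memories = {
    "CLOSER": "Sources: CG001, CG002, CG003",
    "SDS": "Sources: CG003",
    "LNS": "Sources: CG001",
}
missing = missing_agents(["CLOSER", "SDS", "LNS"], memories, "CG003")
if missing:
    print(f"AGENT COVERAGE FAILED, MISSING: {', '.join(missing)}")
```

An agent with no memory entry at all is treated as missing, which matches the "LOG ERROR and FAIL" rule.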

Role Tracking (Optional)

python scripts/role_tracker.py --scan

Counts role mentions. Auto-creates agent if ≥ 10 mentions.

Checkpoint: CHECKPOINT 7 (10 validation items) must all pass

Execution Report

After successful completion:

═══════════════════════════════════════════════════════════════════════════════
JARVIS PIPELINE COMPLETE: COLE GORDON (CG003)
═══════════════════════════════════════════════════════════════════════════════

[INPUT] SOURCE
   File: inbox/COLE GORDON/MASTERMINDS/video-title.txt
   Person: Cole Gordon (Closers.io)
   Type: lecture
   Words: 8,542

[CHUNK] CHUNKING
   Chunks created: 29
   Avg chunk size: 294 words

[ENTITY] ENTITY RESOLUTION
   Entities resolved: 47
   Aliases added: 12
   [!] Review queue: 0
   [!] Collisions: 0

[INSIGHT] INSIGHTS
   Total extracted: 63
   HIGH priority: 18
   MEDIUM priority: 32
   LOW priority: 13
   Contradictions: 0

[NARRATIVE] NARRATIVES
   Persons updated: 1 (Cole Gordon)
   Themes updated: 5
   Open loops: 8
   Tensions: 2

[DOSSIER] DOSSIERS (PHASE 6)
   Persons: 0 created, 1 updated (DOSSIER-COLE-GORDON.md)
   Themes: 0 created, 5 updated
   RAG indexed: 6 files

[AGENT] AGENT ENRICHMENT (PHASE 7)
   MEMORYs updated: 8 agents
   ✓ CLOSER
   ✓ SDS
   ✓ LNS
   ✓ SALES-MANAGER
   ✓ SALES-LEAD
   ✓ BDR
   ✓ CRO
   ✓ SALES-COORDINATOR

[FINALIZE] PHASE 8 COMPLETE
   ✓ RAG index updated (127 files)
   ✓ File registry updated
   ✓ SESSION-STATE updated
   ✓ INBOX-REGISTRY updated
   ✓ Agent coverage: 100% (8/8 agents)
   ✓ Role tracking: 3 roles scanned

[OK] STATUS: SUCCESS
═══════════════════════════════════════════════════════════════════════════════

Examples

/process-jarvis "inbox/ALEX HORMOZI/MASTERCLASSES/scaling-masterclass.txt"

Performance

Processing Time

Content Size	Chunks	Time
3k words	~10 chunks	2-3 min
10k words	~33 chunks	5-8 min
30k words	~100 chunks	15-20 min

Time varies based on chunk count, not file size. More chunks = longer processing.
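The chunk counts in the table follow from the ~300-word target. A small estimate, assuming simple rounding (the pipeline's exact rule is not stated), reproduces them:

```python
def estimate_chunks(word_count: int, words_per_chunk: int = 300) -> int:
    """Approximate chunk count for a transcript of the given length."""
    return max(1, round(word_count / words_per_chunk))


estimates = {words: estimate_chunks(words) for words in (3_000, 10_000, 30_000)}
print(estimates)
```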

Resource Usage

RAM: ~500MB per file
Disk: Temporary files in artifacts/
API calls: 0 (runs locally)

Error Handling

File Not Found

✗ PIPELINE FAILED

File not found: inbox/PERSON/file.txt

Check:
  1. File path is correct
  2. File exists in inbox/
  3. Spelling matches exactly

Duplicate Detected

⛔ DUPLICATE EXACT DETECTED - PROCESSING STOPPED
┌─────────────────────────────────────────────────────────────────────────┐
│  Current file: inbox/COLE GORDON/video.txt                                  │
│  MD5: abc123def456...                                                       │
│                                                                            │
│  Duplicate of: inbox/COLE GORDON/old-video.txt                             │
│  Registered: 2026-02-15T10:30:00Z                                          │
│  SOURCE_ID: CG001                                                          │
│                                                                            │
│  This file will NOT be processed.                                          │
└─────────────────────────────────────────────────────────────────────────┘

Agent Coverage Failed

❌ VERIFICATION FAILED: AGENT COVERAGE

Framework "3 Audience Buckets" detected

Expected agents: [CLOSER, SDS, LNS]
Agents updated: [CLOSER, SDS]
MISSING: LNS

Phase 7 must be re-run to fix coverage.

Phase Checkpoint Failure

✗ CHECKPOINT POST-4 FAILED

Insights extraction incomplete:
  - 0 HIGH priority insights (expected: > 0)
  - No chunk_ids in insights

Pipeline stopped. Review Phase 4 output.
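The two conditions in this failure message suggest what a POST-4 check might test. The sketch below covers only those two conditions and assumes insights carry the `priority` and `chunks` fields shown in Phase 4; the full checkpoint definition is not documented here, and `check_post_4` is a hypothetical name.

```python
def check_post_4(insights: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the check passes."""
    problems = []
    if not any(i.get("priority") == "HIGH" for i in insights):
        problems.append("0 HIGH priority insights (expected: > 0)")
    if any(not i.get("chunks") for i in insights):
        problems.append("No chunk_ids in insights")
    return problems


failing = check_post_4([{"priority": "LOW", "chunks": []}])
passing = check_post_4([{"priority": "HIGH", "chunks": ["chunk_CG003_012"]}])
```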

Troubleshooting

“Pipeline stopped at Phase 7”

Issue: Pipeline does not continue to Phase 8
Solution: This is intentional. Phase 8 requires confirmation:

# Phase 7 completes, then prompts:
"Continue to Phase 8 (Finalization)? [Y/n]"

# Type 'Y' to proceed

“Chunk_ids missing in dossier”

Issue: Dossier compilation fails validation
Solution: Phase 6 requires chunk_ids. Check:

### Section Title [CG001_045, CG002_067]
  ✓ Has chunk_ids

### Section Title
  ✗ Missing chunk_ids - BLOCKED

“Agent MEMORY not updated”

Issue: Agent doesn’t have source_id after Phase 7
Solution: Check theme/framework mapping:

Theme: "02-PROCESSO-VENDAS"
Expected agents: [CLOSER, SDS, LNS]

Verify:
  - Theme is correctly identified
  - Agent files exist at agents/cargo/SALES/{AGENT}/MEMORY.md
  - No file permission issues

Best Practices

1. Process in Order

Process files chronologically when possible:

# Good: Chronological order
/process-jarvis "inbox/PERSON/video-2024-01-15.txt"
/process-jarvis "inbox/PERSON/video-2024-02-20.txt"
/process-jarvis "inbox/PERSON/video-2024-03-10.txt"

# Suboptimal: Random order
/process-jarvis "inbox/PERSON/video-2024-03-10.txt"
/process-jarvis "inbox/PERSON/video-2024-01-15.txt"
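When filenames embed a date as in the examples above, the processing order can be derived automatically. This sketch assumes a YYYY-MM-DD date somewhere in the filename; files without one are sorted last, which is an arbitrary choice.

```python
import re
from datetime import date

DATE_IN_NAME = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def file_date(path: str) -> date:
    """Date embedded in the filename, or date.max when none is found."""
    match = DATE_IN_NAME.search(path)
    if match is None:
        return date.max  # undated files go to the end of the queue
    year, month, day = map(int, match.groups())
    return date(year, month, day)


paths = [
    "inbox/PERSON/video-2024-03-10.txt",
    "inbox/PERSON/video-2024-01-15.txt",
    "inbox/PERSON/video-2024-02-20.txt",
]
ordered = sorted(paths, key=file_date)
for path in ordered:
    print(f'/process-jarvis "{path}"')
```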

2. Review High Priority Insights

After processing, check:

# View insights
cat processing/insights/INSIGHTS-STATE.json | jq '.insights_state.persons["Cole Gordon"] | .[] | select(.priority == "HIGH")'

3. Verify Agent Coverage

After Phase 8:

# Check which agents were updated
grep -l "CG003" agents/cargo/*/*/MEMORY.md

4. Monitor Contradictions

If contradictions found:

# List contradictions
cat processing/insights/INSIGHTS-STATE.json | jq '.insights_state.persons["Cole Gordon"] | .[] | select(.status == "contradiction")'
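Both jq queries above can also be expressed as one small Python filter. The sketch assumes the layout implied by the jq path (`insights_state.persons[<name>]` holding a list of insights). The second sample record is made up for illustration, and `select_insights` is a hypothetical helper.

```python
def select_insights(state: dict, person: str, **criteria) -> list[dict]:
    """Insights for one person whose fields equal all the given criteria."""
    insights = state["insights_state"]["persons"].get(person, [])
    return [i for i in insights if all(i.get(k) == v for k, v in criteria.items())]


# In practice the state would come from json.load() on INSIGHTS-STATE.json;
# an inline sample keeps this example self-contained.
state = {
    "insights_state": {
        "persons": {
            "Cole Gordon": [
                {"insight_id": "insight_CG003_042", "priority": "HIGH", "status": "new"},
                # Hypothetical record, not from the docs:
                {"insight_id": "insight_CG003_043", "priority": "MEDIUM", "status": "contradiction"},
            ]
        }
    }
}

high = select_insights(state, "Cole Gordon", priority="HIGH")
contradictions = select_insights(state, "Cole Gordon", status="contradiction")
```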

Advanced Usage

Batch Processing

Reprocessing

If file was already processed:

⚠️  File already processed: CG003

Reprocess? This will:
  - Remove old chunks for this source_id
  - Re-extract all insights
  - Update existing dossiers

[y/N]
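The first reprocessing step, removing old chunks for a source_id, might look like the following. It assumes the CHUNKS-STATE.json layout shown in Phase 2, with `meta.source_id` on each chunk and `total_chunks` in the top-level meta; the function name is hypothetical.

```python
def drop_source_chunks(chunks_state: dict, source_id: str) -> dict:
    """Return a copy of the state without chunks from the given source."""
    kept = [c for c in chunks_state["chunks"] if c["meta"]["source_id"] != source_id]
    meta = dict(chunks_state.get("meta", {}), total_chunks=len(kept))
    return {"chunks": kept, "meta": meta}


state = {
    "chunks": [
        {"id_chunk": "chunk_CG001_001", "meta": {"source_id": "CG001"}},
        {"id_chunk": "chunk_CG003_001", "meta": {"source_id": "CG003"}},
        {"id_chunk": "chunk_CG003_002", "meta": {"source_id": "CG003"}},
    ],
    "meta": {"total_chunks": 3, "version": "v1"},
}
cleaned = drop_source_chunks(state, "CG003")
```

The original state is left untouched, so the removal can be abandoned if the user answers N.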

Incremental Updates

Dossiers update incrementally:

## Modus Operandi

[Previous content...]

--- Update 2026-03-06 via CG003 ---

[New content from latest source...]
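The update marker shown above can be produced with a small helper. This is a sketch that only formats one section body; locating the section inside the dossier file is not covered, and `append_update` is a hypothetical name.

```python
from datetime import date


def append_update(section_body: str, new_content: str, source_id: str, on: date) -> str:
    """Append new content under an '--- Update <date> via <source> ---' marker."""
    marker = f"--- Update {on.isoformat()} via {source_id} ---"
    return f"{section_body.rstrip()}\n\n{marker}\n\n{new_content.strip()}\n"


updated = append_update(
    "[Previous content...]\n",
    "[New content from latest source...]",
    "CG003",
    date(2026, 3, 6),
)
print(updated)
```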

Next Steps

Extract DNA

Generate cognitive DNA after 3+ sources

JARVIS Briefing

Check processing statistics

Dossiers Guide

Understanding dossier structure

Agent System

How agents use processed knowledge
