Spacebot’s ingestion system processes files from a watched directory, extracts text, chunks it, and imports each chunk as memories via the standard memory recall + save flow.
How It Works
Ingestion is a background polling loop:
Process Chunk
Create a fresh branch for each chunk. The branch uses memory_recall to check for duplicates, then memory_save to store new knowledge.
Track Progress
Each chunk’s completion is recorded in the ingestion_progress table. If the server restarts mid-file, already-completed chunks are skipped.
Configuration
agent.toml
Whether to run the ingestion loop (default: false)
poll_interval_secs: How often to scan for new files, in seconds (default: 60)
chunk_size: Target chunk size in characters (default: 4000). Chunks split at line boundaries, so no line is ever cut in half.
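As a rough illustration, the three settings above could be modeled as a Rust struct with matching defaults. This is a sketch, not Spacebot's actual config type: the struct name `IngestionConfig` and the field name `enabled` are assumptions, while `poll_interval_secs` and `chunk_size` are the names this page uses.

```rust
use std::time::Duration;

/// Hypothetical model of the ingestion settings described above.
#[derive(Debug, Clone, PartialEq)]
pub struct IngestionConfig {
    /// Whether to run the ingestion loop. The name `enabled` is an assumption.
    pub enabled: bool,
    /// How often to scan for new files, in seconds.
    pub poll_interval_secs: u64,
    /// Target chunk size in characters.
    pub chunk_size: usize,
}

impl Default for IngestionConfig {
    /// Defaults as documented: loop off, 60 second polls, 4000 character chunks.
    fn default() -> Self {
        Self {
            enabled: false,
            poll_interval_secs: 60,
            chunk_size: 4000,
        }
    }
}

impl IngestionConfig {
    /// Poll interval as a `Duration`, convenient for a sleep between scans.
    pub fn poll_interval(&self) -> Duration {
        Duration::from_secs(self.poll_interval_secs)
    }
}
```

Because ingestion is off by default, a config that omits these settings would not start the polling loop.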
Ingestion Directory
Files are read from {workspace}/ingest/.
Supported Formats
Text-like files and PDFs:
- Plain text: .txt, .md, .log
- Structured data: .json, .jsonl, .csv, .tsv, .yaml, .yml, .toml
- Markup: .xml, .html, .htm, .rst, .org
- Documents: .pdf
Source: src/agent/ingestion.rs
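The extension list above can be expressed as a small filter. The following is a minimal sketch, not the code from src/agent/ingestion.rs: the names `TEXT_EXTENSIONS` and `is_supported` are hypothetical, and matching extensions case-insensitively is an assumption.

```rust
use std::path::Path;

/// Text-like extensions from the list above (PDF is handled separately).
const TEXT_EXTENSIONS: &[&str] = &[
    "txt", "md", "log", // plain text
    "json", "jsonl", "csv", "tsv", "yaml", "yml", "toml", // structured data
    "xml", "html", "htm", "rst", "org", // markup
];

/// Returns true if the file extension is one of the supported formats.
/// Case-insensitive matching is an assumption of this sketch.
fn is_supported(path: &Path) -> bool {
    match path.extension().and_then(|ext| ext.to_str()) {
        Some(ext) => {
            let ext = ext.to_ascii_lowercase();
            ext == "pdf" || TEXT_EXTENSIONS.contains(&ext.as_str())
        }
        // Files without an extension are skipped.
        None => false,
    }
}
```

With this filter, `notes.md` and `paper.pdf` would be accepted, while `image.png` or an extensionless `README` would be ignored.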
Chunking
Text is split at line boundaries to preserve semantic units (see src/agent/ingestion.rs).
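One way to split at line boundaries is sketched below. It is an illustration of the technique rather than the actual implementation: the function name `chunk_text` is hypothetical, and letting a single over-long line form its own oversized chunk (instead of cutting it) is an assumption.

```rust
/// Split `text` into chunks of at most `chunk_size` characters where possible,
/// breaking only at line boundaries. A single line longer than `chunk_size`
/// becomes its own oversized chunk rather than being cut.
fn chunk_text(text: &str, chunk_size: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut current_chars = 0usize;

    // `split_inclusive` keeps the trailing '\n' on each line, so joining
    // the chunks reproduces the original text exactly.
    for line in text.split_inclusive('\n') {
        let line_chars = line.chars().count();

        // Close the current chunk if adding this line would exceed the target.
        if current_chars > 0 && current_chars + line_chars > chunk_size {
            chunks.push(std::mem::take(&mut current));
            current_chars = 0;
        }

        current.push_str(line);
        current_chars += line_chars;
    }

    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Sizes are counted in characters rather than bytes, matching the way chunk_size is described on this page.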
Processing Flow
Each chunk gets a fresh LLM agent with memory tools (see src/agent/ingestion.rs). Its steps are:
- Read the chunk
- Use memory_recall to check for duplicates or related memories
- Extract facts, decisions, preferences, or events
- Save via memory_save with appropriate types and importance
Each chunk is independent—no history carries over between chunks. This keeps memory usage bounded.
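To make the per-chunk flow concrete, here is a hypothetical prompt builder. The wording of the prompt, the function name `build_chunk_prompt`, and its parameters are all illustrative assumptions; only the four steps, the tool names memory_recall and memory_save, and the fact that the filename appears in the chunk prompt come from this page.

```rust
/// Build a self-contained prompt for one chunk. Nothing from earlier chunks
/// is included, so each agent starts with no history.
fn build_chunk_prompt(
    filename: &str,
    chunk_number: usize,
    total_chunks: usize,
    chunk: &str,
) -> String {
    format!(
        "You are ingesting chunk {} of {} from \"{}\".\n\
         1. Read the chunk below.\n\
         2. Use memory_recall to check for duplicates or related memories.\n\
         3. Extract facts, decisions, preferences, or events.\n\
         4. Save new knowledge via memory_save with an appropriate type and importance.\n\
         \n\
         --- chunk ---\n\
         {}",
        chunk_number, total_chunks, filename, chunk
    )
}
```

Since the prompt depends only on the filename, the position, and the chunk text, chunks can be processed or retried in any order.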
Progress Tracking
Progress is tracked by content hash (SHA-256).
Failure Handling
If any chunk errors (e.g., provider 401, rate limit):
- The file stays in ingest/
- Progress records persist
- Status is marked failed in ingestion_files
- The next poll retries failed chunks
Source: src/agent/ingestion.rs
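The resume and retry behavior described above can be sketched with an in-memory stand-in for the ingestion_progress table. Everything named here (`ProgressStore`, `FileStatus`, `ingest_file`) is hypothetical, the content hash is assumed to be computed elsewhere (for example with a SHA-256 crate), and continuing past a failed chunk instead of stopping at the first error is an assumption of this sketch.

```rust
use std::collections::{HashMap, HashSet};

/// Outcome of one pass over a file.
#[derive(Debug, PartialEq)]
enum FileStatus {
    Completed,
    Failed,
}

/// In-memory stand-in for the ingestion_progress table:
/// content hash -> indexes of chunks that finished successfully.
#[derive(Default)]
struct ProgressStore {
    completed: HashMap<String, HashSet<usize>>,
}

/// Process every chunk not yet recorded as complete for `content_hash`.
/// Successful chunks are recorded immediately, so a restart or a later
/// poll only retries the chunks that are still missing.
fn ingest_file<F>(
    store: &mut ProgressStore,
    content_hash: &str,
    chunks: &[String],
    mut process: F,
) -> FileStatus
where
    F: FnMut(usize, &str) -> Result<(), String>,
{
    let mut any_failed = false;
    for (index, chunk) in chunks.iter().enumerate() {
        let already_done = store
            .completed
            .get(content_hash)
            .map_or(false, |done| done.contains(&index));
        if already_done {
            continue; // completed in an earlier pass
        }
        match process(index, chunk) {
            Ok(()) => {
                store
                    .completed
                    .entry(content_hash.to_string())
                    .or_default()
                    .insert(index);
            }
            // e.g. provider 401 or rate limit: leave the chunk unrecorded
            Err(_) => any_failed = true,
        }
    }
    if any_failed {
        FileStatus::Failed
    } else {
        FileStatus::Completed
    }
}
```

In this sketch a caller would leave the file in ingest/ whenever `FileStatus::Failed` is returned, so the next poll calls `ingest_file` again and only the unrecorded chunks are processed.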
API Access
Upload files via HTTP. The server writes the file to {workspace}/ingest/ and marks it as queued in ingestion_files. The polling loop picks it up on the next scan.
Ingestion status can also be queried over HTTP.
Use Cases
Meeting Transcripts
Export Zoom/Meet transcripts as text. Spacebot ingests and saves key decisions, action items, and context.
Documentation
Import project docs, READMEs, or API references as memories for later recall.
Research Papers
Upload PDFs. Spacebot extracts text, chunks, and saves findings.
Email Archives
Export mailbox as JSONL or CSV. Ingestion creates memories from important threads.
Best Practices
Pre-process Large Files
Files over 1MB should be split before ingestion. Chunking happens in-memory—extremely large files may hit memory limits.
Clean Up Noise
Remove headers, footers, or boilerplate before ingesting. The LLM processes everything, so cleaner input = better memories.
Monitor Failures
Check the ingestion_files table for failed status. Common causes: rate limits, invalid UTF-8, or corrupted PDFs.
Use Descriptive Filenames
The filename is included in the chunk prompt. 2026-02-28-standup.txt is better than notes.txt.
Performance Tuning
poll_interval_secs
Lower = faster pickup, higher = less polling overhead. 60s is reasonable for most use cases.
chunk_size
Smaller chunks = faster per-chunk processing, more chunks total. 4000 chars is ~1000 tokens, a good balance.
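The trade-off can be estimated with simple arithmetic. The helpers below are a back-of-the-envelope sketch with hypothetical names; the 4 characters per token ratio is taken from the "4000 chars is ~1000 tokens" figure above, and because real chunks break at line boundaries the actual chunk count may differ slightly.

```rust
/// Rough number of chunks for a file of `file_chars` characters
/// (ceiling division; ignores line-boundary effects).
fn estimated_chunks(file_chars: usize, chunk_size: usize) -> usize {
    assert!(chunk_size > 0, "chunk_size must be positive");
    (file_chars + chunk_size - 1) / chunk_size
}

/// Rough token count, assuming about 4 characters per token.
fn approx_tokens(chars: usize) -> usize {
    chars / 4
}
```

For example, a file of 1,000,000 characters works out to about 250 chunks at the default chunk_size of 4000, or about 500 chunks at 2000, with each default-sized chunk costing roughly 1000 tokens of input.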