Overview
The document processing pipeline handles three main workflows:
Supported Formats
- Word Documents (.doc, .docx): converted via LibreOffice + python-docx
- PDF Documents (.pdf): converted via pymupdf4llm or LibreOffice
- Excel Spreadsheets (.xls, .xlsx): loaded into SQLite tables
Conversion Pipeline
Word to Markdown (.docx)
Direct conversion using python-docx preserves document structure:
- Preserves heading hierarchy (H1-H6)
- Converts tables to Markdown table format
- Escapes special characters in cells
- Maintains paragraph structure
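The behaviors listed above can be sketched roughly as follows. This is a minimal illustration, not the code in convertidor.py: the helper names (`paragraph_to_markdown`, `escape_cell`, `table_to_markdown`, `docx_to_markdown`) are hypothetical, and for brevity the wrapper emits all paragraphs first and all tables afterwards instead of keeping them interleaved.

```python
import re

# Matches the built-in Word style names "Heading 1" .. "Heading 6".
HEADING_STYLE = re.compile(r"^Heading ([1-6])$")


def paragraph_to_markdown(style_name, text):
    """Turn one paragraph into a Markdown heading or a plain paragraph."""
    match = HEADING_STYLE.match(style_name or "")
    if match:
        return "#" * int(match.group(1)) + " " + text.strip()
    return text.strip()


def escape_cell(text):
    """Escape pipes and flatten newlines so a cell stays on one table row."""
    return text.replace("|", "\\|").replace("\n", " ").strip()


def table_to_markdown(rows):
    """Render a list of rows (lists of strings); the first row is the header."""
    if not rows:
        return ""
    lines = ["| " + " | ".join(escape_cell(c) for c in row) + " |" for row in rows]
    separator = "| " + " | ".join("---" for _ in rows[0]) + " |"
    return "\n".join([lines[0], separator, *lines[1:]])


def docx_to_markdown(path):
    """Hypothetical wrapper around python-docx (imported lazily)."""
    from docx import Document

    doc = Document(path)
    parts = [
        paragraph_to_markdown(p.style.name, p.text)
        for p in doc.paragraphs
        if p.text.strip()
    ]
    parts += [
        table_to_markdown([[cell.text for cell in row.cells] for row in table.rows])
        for table in doc.tables
    ]
    return "\n\n".join(parts)
```

The pure helpers take plain strings, so they can be checked without a .docx file on disk.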
Legacy Word (.doc) via LibreOffice
Older .doc files are converted through LibreOffice in headless mode.
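A headless LibreOffice call might look like the sketch below. It assumes the `soffice` binary is on the PATH and that the .doc file is converted to .docx before further processing; both are assumptions, as are the function names. The 120-second timeout mirrors the two-minute limit listed under Performance Considerations.

```python
import subprocess
from pathlib import Path

LIBREOFFICE_TIMEOUT_S = 120  # two minutes per file


def build_convert_command(src, out_dir, binary="soffice"):
    """Build the LibreOffice command line for a .doc -> .docx conversion."""
    return [
        binary,
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_doc_to_docx(src, out_dir):
    """Run LibreOffice and return the expected output path (hypothetical helper)."""
    src, out_dir = Path(src), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_convert_command(src, out_dir),
        check=True,                     # raise CalledProcessError on a non-zero exit
        timeout=LIBREOFFICE_TIMEOUT_S,  # raise TimeoutExpired on a stuck file
        capture_output=True,
    )
    return out_dir / (src.stem + ".docx")
```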
Linux vs. Windows: The Windows version used PowerShell + Word COM automation. The Linux version uses LibreOffice headless, which doesn’t require Microsoft Office.
PDF to Markdown
PDFs are converted using pymupdf4llm with LibreOffice as fallback:
- Primary: pymupdf4llm (direct PDF → Markdown, better table preservation)
- Fallback: LibreOffice → .docx → python-docx (if pymupdf4llm fails)
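The primary/fallback order above could be wired up as in this sketch. The function names are hypothetical, and treating empty output as a failure is an assumption rather than documented behavior. The fallback is passed in as a callable so the LibreOffice → .docx → python-docx chain can be plugged in.

```python
import logging

logger = logging.getLogger("convertidor_sketch")  # hypothetical logger name


def pymupdf4llm_primary(pdf_path):
    """Primary converter: direct PDF -> Markdown (library imported lazily)."""
    import pymupdf4llm

    return pymupdf4llm.to_markdown(str(pdf_path))


def convert_with_fallback(pdf_path, primary, fallback):
    """Try `primary`; on an exception or empty output, use `fallback`.

    Returns a (markdown, route) tuple where route is "primary" or "fallback".
    """
    try:
        markdown = primary(pdf_path)
        if markdown and markdown.strip():
            return markdown, "primary"
        logger.warning("Primary converter returned no text for %s", pdf_path)
    except Exception as exc:  # any converter failure triggers the fallback
        logger.warning("Primary converter failed for %s: %s", pdf_path, exc)
    return fallback(pdf_path), "fallback"
```

Returning the route alongside the text makes it easy to log how often the fallback path is taken.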
Excel to SQLite
Spreadsheets are loaded into SQLite tables for structured queries:
convertidor.py:94-115
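As a rough sketch of the loading step, the snippet below writes already-extracted rows into a SQLite table named after the folder. Reading the cells from the workbook (for example with openpyxl or pandas) is left out. The helper names, the all-TEXT column types, and the drop-and-recreate behavior are assumptions, not a description of convertidor.py.

```python
import re
import sqlite3


def safe_identifier(name):
    """Reduce a folder or column name to a conservative SQL identifier."""
    cleaned = re.sub(r"\W+", "_", name.strip().lower()).strip("_")
    if not cleaned or cleaned[0].isdigit():
        cleaned = "t_" + cleaned
    return cleaned


def load_rows(conn, folder_name, header, rows):
    """Create (or replace) a table named after the folder and insert the rows."""
    table = safe_identifier(folder_name)
    columns = [safe_identifier(h) for h in header]
    column_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({column_defs})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return table
```

Values are always bound through `?` placeholders; only the sanitized identifiers are interpolated into the SQL text.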
Chunking with Overlap
Markdown documents are split into fixed-size chunks with overlap to preserve context.
Chunking Algorithm
Why Overlap Matters
Without Overlap (Bad)
❌ Context lost at boundaries
❌ Poor semantic search performance
With Overlap (Good)
✅ Context maintained
✅ Overlap ensures continuity
Chunk Configuration
siaa_proxy.py:288-296
- Chunk 1: chars 0-800
- Chunk 2: chars 500-1300 (overlap: 300 chars)
- Chunk 3: chars 1000-1800 (overlap: 300 chars)
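The offsets above imply a chunk size of 800 characters, an overlap of 300, and therefore a step of 500. A minimal sliding-window sketch under those inferred values follows; the function name and defaults are illustrative and may not match siaa_proxy.py.

```python
def chunk_text(text, chunk_size=800, overlap=300):
    """Split text into fixed-size chunks whose edges overlap.

    Returns a list of (start_offset, chunk) tuples.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks
```

For an 1800-character document this produces chunks starting at 0, 500, and 1000, and the last 300 characters of each chunk are repeated at the start of the next.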
Section Tracking
Each chunk remembers the last heading seen before it:
siaa_proxy.py:661-667
Fuente: PSAA16-10476 [ARTÍCULO 5 - RESPONSABILIDAD]
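One way to produce a label like the one above ("Fuente" is Spanish for "Source") is to scan the Markdown for the last heading at or before the chunk's start offset. The sketch below assumes ATX-style `#` headings and uses hypothetical helper names.

```python
import re

# ATX headings: one to six '#' characters, then the title on the same line.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def last_heading_before(markdown, offset):
    """Return the title of the last heading that starts at or before `offset`."""
    section = ""
    for match in HEADING_RE.finditer(markdown):
        if match.start() > offset:
            break
        section = match.group(2)
    return section


def format_source(doc_name, section):
    """Build the citation label attached to a chunk."""
    if section:
        return f"Fuente: {doc_name} [{section}]"
    return f"Fuente: {doc_name}"
```

Combined with a chunker that returns start offsets, `last_heading_before(markdown, start)` gives each chunk its section.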
Index Generation
During document loading, multiple indexes are built:
TF-IDF Keywords Index
siaa_proxy.py:807-816
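For orientation, a generic TF-IDF keyword index can be built as sketched below. The tokenizer, the smoothed IDF formula, and the `top_n` cutoff are assumptions chosen for illustration and may differ from the scoring in siaa_proxy.py.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens of at least three characters."""
    return re.findall(r"\w{3,}", text.lower())


def build_keyword_index(docs, top_n=5):
    """Map each document name to its top_n terms ranked by TF-IDF.

    `docs` is a dict of {name: text}.
    """
    term_counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # count each term once per document
    n_docs = len(docs)

    index = {}
    for name, counts in term_counts.items():
        total = sum(counts.values()) or 1
        scores = {
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        index[name] = ranked[:top_n]
    return index
```

Terms that appear in every document get the minimum IDF weight, so document-specific vocabulary rises to the top of each list.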
Density Index
siaa_proxy.py:819-829
Pre-computed Chunks
siaa_proxy.py:779-780
Document Loading Process
siaa_proxy.py:730-843
Running the Converter
Command Line Usage
Default Paths (Linux)
convertidor.py:63-67
Auto-Reload
After conversion, the system automatically triggers index reload:
convertidor.py:560-567
Folder Structure
The converter expects this structure:
Installation Requirements
System Dependencies
Python Packages
Error Handling
When conversion fails, an error Markdown file is generated:
convertidor.py:361-371
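The exact contents of the error file are not shown here, so the following is only a plausible sketch: it renders a small Markdown report and writes it where the converted document would have gone. The layout, field labels, and function names are all hypothetical.

```python
from pathlib import Path


def render_error_markdown(source_name, error):
    """Render a short Markdown report for a failed conversion."""
    return "\n".join([
        f"# Conversion error: {source_name}",
        "",
        f"- Source file: {source_name}",
        f"- Error type: {type(error).__name__}",
        f"- Details: {error}",
        "",
    ])


def write_error_markdown(src, out_dir, error):
    """Write the report as <stem>.md in out_dir and return its path."""
    src = Path(src)
    out_path = Path(out_dir) / f"{src.stem}.md"
    out_path.write_text(render_error_markdown(src.name, error), encoding="utf-8")
    return out_path
```

Writing a placeholder file keeps the failure visible alongside the successfully converted documents.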
Reloading Documents
Reload documents without restarting the server:
Performance Considerations
- LibreOffice Timeout: 2 minutes per file. Files taking longer are likely corrupted.
- Chunk Pre-computation: once at load time. Avoids re-chunking on every query.
- Index Build Time: ~2-5 seconds per 10 documents (TF-IDF + density + chunks).
- Memory Usage: ~5 MB per 100 chunks. All chunks are cached in RAM.
Best Practices
Document Organization
- One folder per document type or department
- One Word/PDF + one Excel per folder
- Use descriptive folder names (becomes table name in SQLite)
- Avoid special characters in filenames
Chunk Size Tuning
- Smaller chunks (600-800): Better precision, more chunks
- Larger chunks (1000-1200): More context, fewer chunks
- Always maintain 30-40% overlap to preserve sentence boundaries
- Monitor total_chunks in /siaa/status and aim for fewer than 1000 total
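The chunk-count check can be automated with a small script like the sketch below. It assumes /siaa/status returns JSON with a top-level `total_chunks` field, and the base URL is supplied by the caller because the server address is not given here; the function names are hypothetical.

```python
import json
from urllib.request import urlopen

CHUNK_BUDGET = 1000  # target from the tuning advice above


def chunk_budget_report(status, budget=CHUNK_BUDGET):
    """Summarize whether the reported chunk count is under the budget."""
    total = int(status.get("total_chunks", 0))
    return {"total_chunks": total, "within_budget": total < budget}


def fetch_status(base_url):
    """Fetch and decode the status document (base_url is caller-supplied)."""
    with urlopen(base_url.rstrip("/") + "/siaa/status", timeout=10) as response:
        return json.load(response)
```

If the count is over budget, larger chunks or a smaller document set bring it back down, at the cost of precision.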
PDF Conversion Quality
- Use pymupdf4llm for PDFs with complex tables
- Use LibreOffice for scanned PDFs or image-based content
- For best results, prefer source Word files over PDFs
- Test conversion with /siaa/ver/<filename> to verify formatting