Document ingestion is the process that converts raw files into searchable vector embeddings inside NISIRA’s knowledge base. Every file, whether uploaded manually through the Admin Panel or pulled automatically from Google Drive, travels through the same five-stage pipeline: parsing, format-aware chunking, embedding generation, vector storage, and metadata registration. The following sections describe each stage in detail.
Supported File Formats
NISIRA’s DOCUMENT_PROCESSING_CONFIG (in rag_system/config.py) and GOOGLE_DRIVE_CONFIG both declare the same set of accepted extensions:
| Extension | Parser Library | Notes |
|---|---|---|
| .pdf | PyPDFLoader (LangChain) → pdfplumber → PyPDF2 (fallbacks) | Page number preserved in chunk metadata |
| .txt | TextProcessor | Plain UTF-8; no structural extraction |
| .docx / .doc | python-docx via TextProcessor | Section structure preserved |
| .pptx | python-pptx via TextProcessor | Slide-level chunking |
| .xlsx | openpyxl via TextProcessor | Sheet content extracted as text |
Files arriving from Google Drive are skipped if they exceed 50 MB (max_file_size = 50 × 1024 × 1024 bytes). The download_file() method returns the sentinel value "TOO_LARGE" in that case and logs a warning, so the file does not block the rest of the sync.
Processing Pipeline
Ingestion entry point
Documents enter through one of two paths:
- Admin Panel upload: files are received by the Django API and written to PostgresFileStore (binary storage in the database) or the local filesystem as a fallback.
- Google Drive sync: GoogleDriveManager.sync_documents() downloads only files that are new or have been modified since the last sync (modification-time comparison) and stores them the same way.
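The Drive sync rules described so far (download only new or modified files, skip anything over 50 MB without aborting) can be sketched roughly as below. This is an illustrative sketch, not NISIRA's actual code: RemoteFile, needs_download, plan_sync, and the last_synced mapping are hypothetical names, and the real download_file() may apply the size check at a different point.

```python
from dataclasses import dataclass
from datetime import datetime

MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB limit quoted in the text
TOO_LARGE = "TOO_LARGE"           # sentinel value named in the text


@dataclass
class RemoteFile:
    """Hypothetical stand-in for one entry of a Google Drive listing."""
    file_id: str
    name: str
    size: int
    modified_time: datetime


def needs_download(remote: RemoteFile, last_synced: dict) -> bool:
    """True when the file is new or was modified after the last recorded sync."""
    previous = last_synced.get(remote.file_id)
    return previous is None or remote.modified_time > previous


def plan_sync(files, last_synced):
    """Split a listing into names to download and (name, sentinel) skips."""
    to_download, skipped = [], []
    for remote in files:
        if not needs_download(remote, last_synced):
            continue  # unchanged since the last sync
        if remote.size > MAX_FILE_SIZE:
            # Oversized files are reported but never stop the loop.
            skipped.append((remote.name, TOO_LARGE))
            continue
        to_download.append(remote.name)
    return to_download, skipped
```

Keeping the decision separate from the download itself makes the skip rules easy to unit-test without touching the Drive API.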
In both cases the file is recorded in the UploadedDocument Django model with fields: file_name, file_path (postgres://<uuid> or a local path), file_size, file_type, drive_file_id, processed, chunks_created, and embeddings_generated.
Format-specific parsing
RAGPipeline.process_document() dispatches to the correct processor based on the file extension:
- PDF → PDFProcessor.process_pdf(). The primary loader is LangChain’s PyPDFLoader with extraction_mode="layout". If PyPDF returns fewer than 50 characters per page, the processor retries with pdfplumber, then PyPDF2 as a last resort.
- All other formats → TextProcessor.process_document(), which routes internally to python-docx, python-pptx, or openpyxl.
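The dispatch and the PDF fallback chain can be illustrated with a small sketch. It makes several assumptions: pick_processor and extract_with_fallback are hypothetical helpers, loaders are modelled as plain callables returning one string per page, and "fewer than 50 characters per page" is interpreted as an average over the pages, which the text does not confirm.

```python
from pathlib import Path
from typing import Callable, List, Sequence

MIN_CHARS_PER_PAGE = 50  # threshold quoted in the text

# A loader takes a file path and returns one text string per page.
PageLoader = Callable[[str], List[str]]


def pick_processor(path: str) -> str:
    """Return the name of the processor responsible for this extension."""
    is_pdf = Path(path).suffix.lower() == ".pdf"
    return "PDFProcessor" if is_pdf else "TextProcessor"


def extract_with_fallback(path: str, loaders: Sequence[PageLoader]) -> List[str]:
    """Try loaders in order; accept the first whose pages average enough text."""
    pages: List[str] = []
    for load in loaders:
        try:
            candidate = load(path)
        except Exception:
            continue  # a failing loader hands over to the next one
        pages = candidate
        if pages:
            average = sum(len(p.strip()) for p in pages) / len(pages)
            if average >= MIN_CHARS_PER_PAGE:
                return pages
    # Best effort: output of the last loader that ran without raising.
    return pages
```

In the real pipeline the loader sequence would correspond to PyPDFLoader, pdfplumber, and PyPDF2 in that order; here any callables with the same shape can be passed in.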
Both processors return Document objects carrying page_content and metadata (including source filename, page number for PDFs, total_pages, and word_count).
Format-aware chunking
Documents are split by RecursiveCharacterTextSplitter using per-format sizes defined in DOCUMENT_PROCESSING_CONFIG["chunk_config"]:
| Format | chunk_size | chunk_overlap | min_chunk_size |
|---|---|---|---|
| .pdf | 1300 | 260 | 180 |
| .txt | 1100 | 220 | 150 |
| .docx | 1300 | 260 | 180 |
| Default | 1000 | 200 | 100 |
Chunks smaller than min_chunk_size are discarded. Each surviving chunk receives a chunk_id index and a chunk_size character count appended to its metadata.
Embedding generation
EmbeddingManager.create_embeddings_batch() converts chunk texts into 768-dimensional float vectors. The active provider is selected at startup:
- Google text-embedding-004: used when GOOGLE_API_KEY is set (production default).
- sentence-transformers/all-mpnet-base-v2: local fallback via SentenceTransformerAdapter or HuggingFaceEmbeddings, processed in mini-batches of 4 to avoid memory pressure.
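A rough sketch of the provider choice and the mini-batching follows. It is not the EmbeddingManager implementation: select_provider, mini_batches, and embed_in_batches are hypothetical names, the embedding backend is passed in as a plain callable, and the dimension check is an added safeguard that the text does not mention.

```python
import os
from typing import Callable, Iterator, List, Mapping, Sequence

EMBEDDING_DIM = 768   # vector size quoted in the text
LOCAL_BATCH_SIZE = 4  # mini-batch size quoted for the local fallback

Vector = List[float]


def select_provider(env: Mapping[str, str] = os.environ) -> str:
    """Choose the Google provider when an API key is present, else local."""
    return "google" if env.get("GOOGLE_API_KEY") else "local"


def mini_batches(texts: Sequence[str], size: int = LOCAL_BATCH_SIZE) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` texts."""
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
) -> List[Vector]:
    """Embed texts a few at a time so only one small batch is in flight."""
    vectors: List[Vector] = []
    for batch in mini_batches(texts):
        for vector in embed_batch(batch):
            if len(vector) != EMBEDDING_DIM:
                raise ValueError(f"expected {EMBEDDING_DIM} dimensions, got {len(vector)}")
            vectors.append(vector)
    return vectors
```

Passing the backend as a callable keeps the batching logic independent of whichever model is active, so the same loop could wrap a sentence-transformers encoder or a stub in tests.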
Vector storage
Valid chunk-embedding pairs are stored through RAGPipeline.chroma_manager.add_documents(), which routes to the configured backend:
- PostgreSQL pgvector: default in production (VECTOR_STORE_BACKEND=postgres). Chunks are stored with their full metadata for hybrid lexical + semantic search.
- ChromaDB: used as the local development fallback or when DATABASE_URL is absent.
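The backend routing and the "valid pairs" filter might look something like the sketch below. Both helpers are hypothetical, and two points are assumptions rather than documented behaviour: that pgvector is chosen only when VECTOR_STORE_BACKEND=postgres and DATABASE_URL are both set, and that a "valid" pair means a non-missing embedding of the expected length.

```python
import os
from typing import List, Mapping, Optional, Sequence, Tuple

EMBEDDING_DIM = 768  # vector size quoted earlier in the text


def select_vector_backend(env: Mapping[str, str] = os.environ) -> str:
    """Pick pgvector when requested and a database URL exists, else ChromaDB."""
    requested = env.get("VECTOR_STORE_BACKEND", "").strip().lower()
    if requested == "postgres" and env.get("DATABASE_URL"):
        return "pgvector"
    return "chroma"


def valid_pairs(
    chunks: Sequence[str],
    embeddings: Sequence[Optional[List[float]]],
    dim: int = EMBEDDING_DIM,
) -> List[Tuple[str, List[float]]]:
    """Keep only chunks whose embedding exists and has the expected length."""
    return [
        (chunk, vector)
        for chunk, vector in zip(chunks, embeddings)
        if vector is not None and len(vector) == dim
    ]
```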
Original files are served from /api/documents/<filename>/ via serve_document(), which reads directly from PostgresFileStore or the local filesystem. PDF responses include an inline Content-Disposition header so the browser’s native PDF viewer can honour #page=N fragment links generated from chunk page_number metadata.
Re-Embedding Documents
If you change the embedding model or need to rebuild the index from scratch, use the Admin Panel:
- Open the Admin Panel → Embeddings tab.
- Click Generate Embeddings to reprocess all documents currently in the vector store.
UploadedDocument Model Fields
| Field | Type | Description |
|---|---|---|
| file_name | CharField | Original filename including extension |
| file_path | CharField | postgres://<uuid> or absolute local path |
| file_size | BigIntegerField | File size in bytes |
| file_type | CharField | File extension (e.g. .pdf, .docx) |
| drive_file_id | CharField | Google Drive file ID (blank for manual uploads) |
| processed | BooleanField | Whether chunking has completed |
| chunks_created | IntegerField | Number of text chunks generated |
| embeddings_generated | IntegerField | Number of embeddings stored in the vector database |
Document Serving and PDF Deep Linking
Chunks that include a page metadata field (PDFs) expose page-level deep links in the sources array returned by /api/rag/query/, from which the frontend can construct a document URL.
The ?token= query parameter on that URL is required because PDF iframes cannot send Authorization headers. The server validates the token identically to the standard Bearer header flow before streaming the file inline.
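As an illustration, a deep link could be assembled from the pieces named on this page: the /api/documents/<filename>/ endpoint, the ?token= query parameter, and a #page=N fragment. The exact URL shape is an assumption pieced together from those descriptions, and build_pdf_deep_link and the example host are hypothetical.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def build_pdf_deep_link(
    base_url: str,
    filename: str,
    token: str,
    page: Optional[int] = None,
) -> str:
    """Build a document URL with a token query and an optional page fragment."""
    # Percent-encode the filename so spaces and similar characters are safe.
    path = f"{base_url.rstrip('/')}/api/documents/{quote(filename)}/"
    url = f"{path}?{urlencode({'token': token})}"
    if page is not None:
        # The fragment is handled by the browser's PDF viewer, not the server.
        url += f"#page={page}"
    return url


link = build_pdf_deep_link("https://example.test", "Annual Report.pdf", "abc123", page=7)
```

Because the fragment is never sent to the server, only the token in the query string takes part in authentication; the page number affects the viewer alone.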