Document ingestion is the process that converts raw files into searchable vector embeddings inside NISIRA’s knowledge base. Every file — whether uploaded manually through the Admin Panel or pulled automatically from Google Drive — travels through the same five-stage pipeline: parsing, format-aware chunking, embedding generation, vector storage, and metadata registration. The following sections describe each stage in detail.

Supported File Formats

NISIRA’s DOCUMENT_PROCESSING_CONFIG (in rag_system/config.py) and GOOGLE_DRIVE_CONFIG both declare the same set of accepted extensions:
| Extension | Parser Library | Notes |
| --- | --- | --- |
| .pdf | PyPDFLoader (LangChain) → pdfplumber → PyPDF2 (fallbacks) | Page number preserved in chunk metadata |
| .txt | TextProcessor | Plain UTF-8; no structural extraction |
| .docx / .doc | python-docx via TextProcessor | Section structure preserved |
| .pptx | python-pptx via TextProcessor | Slide-level chunking |
| .xlsx | openpyxl via TextProcessor | Sheet content extracted as text |
Files arriving from Google Drive are skipped if they exceed 50 MB (max_file_size = 50 × 1024 × 1024 bytes). The download_file() method returns the sentinel value "TOO_LARGE" in that case and logs a warning — the file does not block the rest of the sync.
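The size guard described above can be sketched as follows. This is a minimal illustration, not the project's code: the helper name `check_download_size` and its signature are hypothetical, and only the 50 MB limit and the "TOO_LARGE" sentinel come from the text.

```python
import logging
from typing import Optional

logger = logging.getLogger(__name__)

MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB limit quoted in the text
TOO_LARGE = "TOO_LARGE"           # sentinel value quoted in the text


def check_download_size(name: str, size_bytes: int,
                        max_bytes: int = MAX_FILE_SIZE) -> Optional[str]:
    """Hypothetical helper: return the sentinel for oversized files, else None."""
    if size_bytes > max_bytes:
        # Log and skip; the caller continues with the next file in the sync.
        logger.warning("Skipping %s: %d bytes exceeds limit of %d",
                       name, size_bytes, max_bytes)
        return TOO_LARGE
    return None
```

A caller would treat a "TOO_LARGE" result as "skip this file" and carry on with the remaining files.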

Processing Pipeline

1. Ingestion entry point

Documents enter through one of two paths:
  • Admin Panel upload: files are received by the Django API and written to PostgresFileStore (binary storage in the database) or the local filesystem as a fallback.
  • Google Drive sync: GoogleDriveManager.sync_documents() downloads only files that are new or have been modified since the last sync (modification-time comparison) and stores them the same way.
Every file is registered in the UploadedDocument Django model with fields: file_name, file_path (postgres://<uuid> or a local path), file_size, file_type, drive_file_id, processed, chunks_created, and embeddings_generated.
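The modification-time comparison used by the Drive sync might look roughly like the sketch below. It assumes each remote file is a dict with `id` and an RFC 3339 `modifiedTime` string (the shape the Google Drive v3 API returns) and that the last-seen timestamps are kept in a dict. The function name `files_needing_sync` and that bookkeeping structure are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Dict, List


def files_needing_sync(remote_files: List[dict],
                       last_synced: Dict[str, datetime]) -> List[dict]:
    """Hypothetical helper: keep files that are new or modified since last sync."""
    pending = []
    for remote in remote_files:
        # Drive timestamps end in "Z"; normalise so fromisoformat accepts them.
        modified = datetime.fromisoformat(
            remote["modifiedTime"].replace("Z", "+00:00"))
        previous = last_synced.get(remote["id"])
        if previous is None or modified > previous:
            pending.append(remote)
    return pending
```

Files that were never seen before have no recorded timestamp, so they always qualify for download.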
2. Format-specific parsing

RAGPipeline.process_document() dispatches to the correct processor based on the file extension:
  • PDF: PDFProcessor.process_pdf(). The primary loader is LangChain’s PyPDFLoader with extraction_mode="layout". If PyPDF returns fewer than 50 characters per page, the processor retries with pdfplumber, then PyPDF2 as a last resort.
  • All other formats: TextProcessor.process_document(), which routes internally to python-docx, python-pptx, or openpyxl.
Each parser outputs a list of LangChain Document objects carrying page_content and metadata (including source filename, page number for PDFs, total_pages, and word_count).
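The dispatch and the PDF fallback chain can be illustrated with a small, library-free sketch. The extractors are passed in as plain callables standing in for PyPDFLoader, pdfplumber, and PyPDF2. The names `pick_processor` and `extract_with_fallbacks` are hypothetical, and reading "fewer than 50 characters per page" as an average over pages is an assumption.

```python
from pathlib import Path
from typing import Callable, List, Sequence

MIN_CHARS_PER_PAGE = 50  # threshold quoted in the text


def pick_processor(path: str) -> str:
    """Route by extension: PDFs to the PDF processor, the rest to the text one."""
    return "pdf" if Path(path).suffix.lower() == ".pdf" else "text"


def extract_with_fallbacks(
    path: str,
    extractors: Sequence[Callable[[str], List[str]]],
) -> List[str]:
    """Try extractors in order; accept the first with enough text per page."""
    pages: List[str] = []
    for extract in extractors:
        try:
            pages = extract(path)
        except Exception:
            continue  # a failing extractor simply hands over to the next one
        # Assumption: the threshold is applied to the average page length.
        if pages and sum(len(p) for p in pages) / len(pages) >= MIN_CHARS_PER_PAGE:
            return pages
    # Best effort: output of the last extractor that ran without raising.
    return pages
```

Passing the extractors as a sequence keeps the retry order explicit and makes each stage easy to stub in tests.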
3. Format-aware chunking

Documents are split by RecursiveCharacterTextSplitter using per-format sizes defined in DOCUMENT_PROCESSING_CONFIG["chunk_config"]:
| Format | chunk_size | chunk_overlap | min_chunk_size |
| --- | --- | --- | --- |
| .pdf | 1300 | 260 | 180 |
| .txt | 1100 | 220 | 150 |
| .docx | 1300 | 260 | 180 |
| Default | 1000 | 200 | 100 |
Chunks smaller than min_chunk_size are discarded. Each surviving chunk receives a chunk_id index and a chunk_size character count appended to its metadata.
4. Embedding generation

EmbeddingManager.create_embeddings_batch() converts chunk texts into 768-dimensional float vectors. The active provider is selected at startup:
  1. Google text-embedding-004 — used when GOOGLE_API_KEY is set (production default).
  2. sentence-transformers/all-mpnet-base-v2 — local fallback via SentenceTransformerAdapter or HuggingFaceEmbeddings, processed in mini-batches of 4 to avoid memory pressure.
If a single embedding fails, the chunk is excluded from storage; successful chunks and their embeddings are kept paired.
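The pairing rule, where failed chunks are dropped and successful chunks stay aligned with their vectors, can be sketched as below. The `embed_batch` callable is a stand-in for whichever provider is active and is assumed to return one vector (or `None` on failure) per input text. That contract and the name `embed_in_batches` are illustrative assumptions, not the signature of EmbeddingManager.create_embeddings_batch().

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[Optional[Vector]]],
    batch_size: int = 4,  # mini-batch size quoted for the local fallback
) -> List[Tuple[str, Vector]]:
    """Embed in mini-batches and keep only (text, vector) pairs that succeeded."""
    kept: List[Tuple[str, Vector]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors = embed_batch(batch)
        for text, vector in zip(batch, vectors):
            if vector is not None:  # a failed embedding excludes its chunk
                kept.append((text, vector))
    return kept
```

Returning pairs rather than two parallel lists makes it harder for chunks and vectors to drift out of alignment before storage.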
5. Vector storage

Valid chunk–embedding pairs are stored through RAGPipeline.chroma_manager.add_documents(), which routes to the configured backend:
  • PostgreSQL pgvector — default in production (VECTOR_STORE_BACKEND=postgres). Chunks are stored with their full metadata for hybrid lexical + semantic search.
  • ChromaDB — local development fallback or when DATABASE_URL is absent.
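One way to read the backend rules above is the following sketch. The environment variable names come from the text, but the exact precedence (pgvector only when VECTOR_STORE_BACKEND is `postgres` and DATABASE_URL is present, ChromaDB otherwise) is an assumption, as are the function name `select_vector_backend` and the returned labels.

```python
import os
from typing import Mapping


def select_vector_backend(env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical selector: 'postgres' when fully configured, else 'chroma'."""
    backend = env.get("VECTOR_STORE_BACKEND", "").strip().lower()
    if backend == "postgres" and env.get("DATABASE_URL"):
        return "postgres"
    # Local development, or no database URL available: fall back to ChromaDB.
    return "chroma"
```

Accepting the environment as a parameter keeps the decision testable without mutating `os.environ`.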
Documents are served back to users at /api/documents/<filename>/ via serve_document(), which reads directly from PostgresFileStore or the local filesystem. PDF responses include an inline Content-Disposition header so the browser’s native PDF viewer can honour #page=N fragment links generated from chunk page_number metadata.

Re-Embedding Documents

If you change the embedding model or need to rebuild the index from scratch, use the Admin Panel:
  1. Open the Admin Panel → Embeddings tab.
  2. Click Generate Embeddings to reprocess all documents currently in the vector store.
Alternatively, use the Django management command:
python manage.py rag_manage
Pass force_reprocess=True when calling sync_and_process_documents() programmatically to re-chunk and re-embed every file in the download directory, regardless of whether it has changed.

UploadedDocument Model Fields

| Field | Type | Description |
| --- | --- | --- |
| file_name | CharField | Original filename including extension |
| file_path | CharField | postgres://<uuid> or absolute local path |
| file_size | BigIntegerField | File size in bytes |
| file_type | CharField | File extension (e.g. .pdf, .docx) |
| drive_file_id | CharField | Google Drive file ID (blank for manual uploads) |
| processed | BooleanField | Whether chunking has completed |
| chunks_created | IntegerField | Number of text chunks generated |
| embeddings_generated | IntegerField | Number of embeddings stored in the vector database |

Document Serving and PDF Deep Linking

Chunks that include a page metadata field (PDFs) expose page-level deep links in the sources array returned by /api/rag/query/. The frontend can construct a URL of the form:
/api/documents/<filename>?token=<JWT>#page=<N>
The ?token= query parameter is required because PDF iframes cannot send Authorization headers. The server validates the token identically to the standard Bearer header flow before streaming the file inline.
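Although the frontend would normally build this link in the browser, the construction is easy to show in Python. The sketch below follows the URL form given above; the function name `build_pdf_deep_link` is hypothetical, and percent-encoding the filename is an added precaution rather than something the source specifies.

```python
from urllib.parse import quote, urlencode


def build_pdf_deep_link(filename: str, token: str, page: int) -> str:
    """Build /api/documents/<filename>?token=<JWT>#page=<N> with safe encoding."""
    # Encode the filename so spaces and slashes cannot break the path.
    path = "/api/documents/" + quote(filename, safe="")
    query = urlencode({"token": token})
    # The fragment is never sent to the server; the PDF viewer interprets it.
    return f"{path}?{query}#page={page}"


print(build_pdf_deep_link("Annual Report.pdf", "abc.def", 3))
# prints /api/documents/Annual%20Report.pdf?token=abc.def#page=3
```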