In a retrieval-augmented generation pipeline, a retriever queries a vector store and returns a list of Document objects, each carrying the retrieved text and the metadata needed to trace it back to its source. RAG PDF Highlighter closes the loop by accepting those same Document objects and returning the original PDF with every retrieved passage highlighted. Users can immediately see which passages informed the model's answer, without any manual cross-referencing.
The Document schema
RAG PDF Highlighter works directly with langchain_core.documents.Document objects, the standard unit of retrieved content in LangChain-based pipelines. No conversion or adapter layer is required.
Two fields are read from each document:
page_content: the text chunk to locate and highlight in the PDF.
metadata["page"]: the zero-indexed page number where the chunk appears.
metadata["page"] is zero-indexed: page 0 is the first page of the document. This matches the convention used by both PyMuPDF (which powers the highlighter) and most LangChain document loaders, such as PyPDFLoader.
Any other keys in metadata are ignored, so documents enriched with source filenames, scores, or other retriever metadata pass through safely.
End-to-end RAG example
The most common integration pattern is to run the highlighter as a microservice alongside your RAG application. After the retriever returns its documents, serialize them and POST them to the /highlight endpoint along with the PDF URL. The response body is the annotated PDF, ready to stream to your user.
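A minimal sketch of that request, using only the standard library, might look like the following. The payload field names (pdf_url, documents), the service address, and the RetrievedDoc stand-in class are assumptions made for illustration, not the documented request schema, so check the service's API reference for the actual shape. In a real pipeline the documents would be langchain_core.documents.Document instances, which expose the same page_content and metadata attributes.

```python
import json
import urllib.request
from dataclasses import dataclass, field


@dataclass
class RetrievedDoc:
    """Stand-in for langchain_core.documents.Document (same two attributes)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_highlight_payload(pdf_url, docs):
    """Serialize retrieved documents into a JSON-ready dict.

    ASSUMPTION: the keys "pdf_url" and "documents" are hypothetical names.
    """
    return {
        "pdf_url": pdf_url,
        "documents": [
            {"page_content": d.page_content, "metadata": dict(d.metadata)}
            for d in docs
        ],
    }


def post_highlight(service_url, payload, timeout=60):
    """POST the payload to <service_url>/highlight and return the response bytes."""
    req = urllib.request.Request(
        service_url.rstrip("/") + "/highlight",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()


docs = [
    RetrievedDoc(
        "Revenue grew 12% year over year.",
        {"page": 0, "source": "report.pdf"},
    )
]
payload = build_highlight_payload("https://example.com/report.pdf", docs)

# Requires a running highlighter service; the address is an assumption.
# pdf_bytes = post_highlight("http://localhost:8000", payload)
```

Building the payload separately from sending it keeps the serialization step easy to inspect and test without a running service.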
Using the library directly
When the PDF is already on disk (for example, after loading it with PyPDFLoader during an ingestion step), you can call highlight_chunks_in_pdf directly without going through the HTTP layer. This is useful for batch processing or for embedding the highlighter into a larger Python application.
highlight_chunks_in_pdf writes the annotated PDF to a new temporary file and returns its path. The caller is responsible for moving or deleting that file. The function raises plain Python exceptions (NoDocumentsError, PDFNotFoundError) with no FastAPI dependency, so it integrates cleanly into any Python environment.
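Because the caller owns the temporary output file, it can help to wrap the call in a small helper that moves the result to its final location. The sketch below is one way to do that under stated assumptions: highlight_to is a hypothetical helper, and highlight_fn stands in for highlight_chunks_in_pdf, assumed here to be callable as highlight_fn(pdf_path, docs). The real import path and argument order are not shown in this page and may differ.

```python
import os
import shutil


def highlight_to(highlight_fn, pdf_path, docs, dest_path):
    """Run the highlighter and move its temporary output to dest_path.

    ASSUMPTION: highlight_fn stands in for highlight_chunks_in_pdf and is
    called as highlight_fn(pdf_path, docs); the real signature may differ.
    """
    tmp_path = highlight_fn(pdf_path, docs)  # path of a new temporary file
    try:
        shutil.move(tmp_path, dest_path)
    except Exception:
        # The caller owns the temp file, so remove it if the move fails.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path
```

Exceptions raised by the highlighter itself, such as NoDocumentsError or PDFNotFoundError, propagate unchanged, so the calling code can decide how to handle them.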
When using the library directly, the PDF must already be available on the local filesystem. Use download_pdf from rag_pdf_highlighter.utils.pdf_helpers if you need to fetch it from a URL before calling highlight_chunks_in_pdf.
Page number conventions
Both PyMuPDF and the standard LangChain loaders use zero-indexed page numbers. Page 0 is the first page, page 1 is the second, and so on. The metadata["page"] value you pass must use this same convention to match the correct page in the PDF.
For example, if your loader yields metadata={"page": 4} meaning “the fourth page” (1-indexed), subtract 1 before passing the document to the highlighter, so that the fourth page becomes page 3.
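The conversion can be sketched as a small helper that returns a corrected copy of each document. Both to_zero_indexed and the RetrievedDoc stand-in class are hypothetical names introduced for illustration; the helper rebuilds the object with keyword arguments, so it should work equally with langchain_core.documents.Document, which accepts page_content and metadata as keywords.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievedDoc:
    """Stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def to_zero_indexed(doc):
    """Return a copy of doc with metadata["page"] shifted from 1- to 0-indexed."""
    page = doc.metadata["page"]
    if page < 1:
        raise ValueError(f"expected a 1-indexed page number, got {page}")
    # Rebuild the document so the original object is left untouched.
    return type(doc)(
        page_content=doc.page_content,
        metadata={**doc.metadata, "page": page - 1},
    )


doc = RetrievedDoc("Quarterly summary ...", {"page": 4, "source": "report.pdf"})
fixed = to_zero_indexed(doc)
print(fixed.metadata["page"])  # 3
```

Apply the helper only to documents from loaders known to be 1-indexed; running it on already zero-indexed documents would shift every highlight to the previous page.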