Perplexica can analyze documents you upload and answer questions about their content. Upload PDFs, Word documents, or text files, then search across both the web and your files simultaneously.

Supported file types

Perplexica currently supports three file formats:

PDF

.pdf - Portable Document Format

Word

.docx - Microsoft Word documents

Text

.txt - Plain text files
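
As a rough illustration of the file type check, the sketch below matches a filename's extension against the three supported formats. The helper name and its return shape are hypothetical and not taken from Perplexica's code; only the extension list comes from this page.

```typescript
// Hypothetical helper: only the extension list comes from the docs above.
const SUPPORTED_EXTENSIONS = ['.pdf', '.docx', '.txt'] as const;
type SupportedExtension = (typeof SUPPORTED_EXTENSIONS)[number];

// Returns the normalized extension if the file type is supported, else null.
function getSupportedExtension(filename: string): SupportedExtension | null {
  const dot = filename.lastIndexOf('.');
  if (dot <= 0) return null; // no extension, or a bare dotfile such as ".txt"
  const ext = filename.slice(dot).toLowerCase();
  const index = (SUPPORTED_EXTENSIONS as readonly string[]).indexOf(ext);
  return index === -1 ? null : SUPPORTED_EXTENSIONS[index];
}
```

Under this sketch, `getSupportedExtension('Report.PDF')` returns `'.pdf'`, while `getSupportedExtension('archive.zip')` returns `null`.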

How file uploads work

When you upload a file, Perplexica processes it in several steps:
  1. File validation: Confirms the file type is supported
  2. Text extraction: Extracts text content from the document
  3. Content chunking: Splits the text into manageable chunks (512 tokens with 128 token overlap)
  4. Embedding generation: Creates vector embeddings for semantic search
  5. Storage: Saves the file and its processed content for future searches
Processing starts automatically when you upload. Once the processing indicator completes, the content is embedded and ready to search.

Uploading files

To upload a file:
  1. Look for the file upload button in the search interface
  2. Click to select one or more files from your device
  3. Wait for the processing indicator to complete
  4. The file is now available for searching
What happens during upload:
  • Each file receives a unique identifier
  • Original filename is preserved for display
  • Upload timestamp is recorded
  • Processed content is stored separately from the original

Asking questions about files

Once a file is uploaded, you can ask questions about its content.
Example queries:
  • “Summarize the main points in this document”
  • “What does the report say about revenue growth?”
  • “Find all mentions of AI in my uploaded papers”
  • “Compare the conclusions from both PDFs”
Perplexica searches your uploaded files alongside web sources, combining information from both to answer your question.
File content is searched using semantic similarity, not just keyword matching. Ask questions naturally, and Perplexica will find relevant passages.

How file search works

Text extraction

Each file type has its own extraction path.
PDF files:
  • Parsed using the pdf-parse library
  • Text is extracted from all pages
  • Preserves document structure where possible
Word documents (.docx):
  • Processed using the officeparser library
  • Extracts text content from the document
  • Handles formatted text and structure
Text files (.txt):
  • Read directly as UTF-8 text
  • No additional processing needed
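
The per-type handling above amounts to a dispatch on the file extension. The sketch below shows one way to structure it, with the three extractors passed in as functions so that no particular library signature is assumed. `extractText`, `Extractor`, and `Extractors` are hypothetical names for illustration, and the real code may wire pdf-parse and officeparser differently.

```typescript
// Assumption: each extractor takes the raw file bytes and resolves to plain text.
type Extractor = (data: Uint8Array) => Promise<string>;

interface Extractors {
  pdf: Extractor;  // e.g. a wrapper around pdf-parse
  docx: Extractor; // e.g. a wrapper around officeparser
  txt: Extractor;  // e.g. a UTF-8 decode
}

// Picks the extractor that matches the file extension.
function extractText(
  filename: string,
  data: Uint8Array,
  extractors: Extractors,
): Promise<string> {
  const dot = filename.lastIndexOf('.');
  const ext = dot === -1 ? '' : filename.slice(dot + 1).toLowerCase();
  switch (ext) {
    case 'pdf':
      return extractors.pdf(data);
    case 'docx':
      return extractors.docx(data);
    case 'txt':
      return extractors.txt(data);
    default:
      return Promise.reject(new Error('Unsupported file type: ' + filename));
  }
}
```

Injecting the extractors keeps the dispatch logic testable without real documents and makes it straightforward to add a format later.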

Content chunking

Extracted text is split into chunks for efficient searching:
  • Chunk size: 512 tokens
  • Overlap: 128 tokens between chunks
  • Purpose: Maintains context across chunk boundaries
const splittedText = splitText(content, 512, 128)
Overlapping ensures that relevant information near chunk boundaries isn’t missed during search.
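
To make the overlap concrete, here is a minimal sliding-window splitter. It is a sketch and not Perplexica's `splitText`: it approximates tokens with whitespace-separated words, which is an assumption, and the function name is hypothetical.

```typescript
// Sketch only: approximates tokens with whitespace-separated words.
function splitTextSketch(text: string, chunkSize = 512, overlap = 128): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error('Require chunkSize > 0 and 0 <= overlap < chunkSize');
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = chunkSize - overlap; // how far the window advances each time
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= words.length) break; // this window reached the end
  }
  return chunks;
}
```

With a chunk size of 4 and an overlap of 2, the ten words `w0` through `w9` yield four chunks: `w0 w1 w2 w3`, `w2 w3 w4 w5`, `w4 w5 w6 w7`, and `w6 w7 w8 w9`. Each chunk repeats the last two words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.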

Embedding and storage

Each chunk is converted to a vector embedding:
const embeddings = await embeddingModel.embedText(splittedText)
The chunks and embeddings are stored together:
{
  "chunks": [
    {
      "content": "text content",
      "embedding": [0.123, -0.456, ...]
    }
  ]
}
This enables semantic search: finding content based on meaning rather than exact word matches.
When you ask a question about uploaded files:
  1. Your query is converted to an embedding vector
  2. Perplexica compares it against all file chunk embeddings
  3. The most semantically similar chunks are retrieved
  4. These chunks provide context for the AI to answer your question
Semantic search finds relevant content even if your question uses different words than the document. It understands meaning, not just keywords.
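
The retrieval steps above can be sketched as a nearest-neighbor lookup over the stored chunks. The example assumes cosine similarity as the comparison metric and a plain in-memory scan; this page does not say which metric or index Perplexica uses, and the function names are hypothetical.

```typescript
// Matches the stored shape shown above: text plus its embedding vector.
interface Chunk {
  content: string;
  embedding: number[];
}

// Cosine similarity: 1 means same direction, 0 means unrelated (or a zero vector).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Embedding dimensions differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Scores every chunk against the query and returns the k best matches.
function topKChunks(queryEmbedding: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.chunk);
}
```

The query embedding must come from the same embedding model as the chunk embeddings. Vectors from different models are not comparable, even when their dimensions happen to match.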

File management

Uploaded files are managed automatically.
Storage location:
  • Files are stored in the data/uploads/ directory
  • Each file gets a unique identifier
  • Processed content is saved alongside the original
File metadata:
{
  id: string;              // Unique file identifier
  name: string;            // Original filename
  filePath: string;        // Path to stored file
  contentPath: string;     // Path to processed content
  uploadedAt: string;      // ISO timestamp
}
Persistence:
  • Uploaded files remain available across sessions
  • File metadata is stored in uploaded_files.json
  • Files can be referenced in multiple chats
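
Because each record carries a stable identifier, a chat can refer to files by ID and resolve them against the stored metadata. The sketch below works on an in-memory array of records with the fields listed above. The interface and function names are hypothetical, and the exact layout of uploaded_files.json is not specified on this page.

```typescript
// Field names follow the metadata shape shown above.
interface UploadedFileRecord {
  id: string;
  name: string;
  filePath: string;
  contentPath: string;
  uploadedAt: string; // ISO timestamp
}

// Resolves file IDs to records, keeping the requested order and skipping unknown IDs.
function findFilesByIds(
  records: UploadedFileRecord[],
  fileIds: string[],
): UploadedFileRecord[] {
  const found: UploadedFileRecord[] = [];
  for (const id of fileIds) {
    const match = records.filter((record) => record.id === id)[0];
    if (match !== undefined) found.push(match);
  }
  return found;
}
```

Silently skipping unknown IDs is a design choice for this sketch. A stricter variant could throw instead, so that a stale file reference is reported rather than ignored.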

Privacy and security

Your uploaded files stay under your control:
  • Files are processed and stored locally on your Perplexica instance
  • File content is not sent to any service other than the AI providers you configure
  • Embeddings are generated using your configured embedding model
  • Only you have access to your uploaded files
Files are stored only on your Perplexica instance. Processing happens locally or with your configured AI providers. If those providers are hosted remotely, text from your files is sent to them for embedding and answering.

Using files with search modes

Files work with all search modes:
  • Speed mode: Quick answers from file content
  • Balanced mode: Combines file and web search
  • Quality mode: Deep analysis of file content with comprehensive web research
The search mode determines how thoroughly the AI analyzes your files:
config: {
  sources: SearchSources[];
  fileIds: string[];      // Your uploaded files
  mode: 'speed' | 'balanced' | 'quality';
}

File size and limits

While there’s no hard limit on file size, consider:
  • Larger files take longer to process
  • More content means more chunks and embeddings
  • Very large files may affect search performance
  • Embedding generation time increases with content length
For best performance, consider splitting very large documents into smaller, topic-focused files.
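
To get a feel for how chunk count grows with document length, the sketch below estimates the number of chunks for a given token count. It assumes a sliding window that advances by chunk size minus overlap and stops once a window reaches the end of the text. That stopping rule is an assumption, so treat the result as an estimate.

```typescript
// Estimates chunk count (and therefore embedding count) for a document.
function estimateChunkCount(tokenCount: number, chunkSize = 512, overlap = 128): number {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error('Require chunkSize > 0 and 0 <= overlap < chunkSize');
  }
  if (tokenCount <= 0) return 0;
  if (tokenCount <= chunkSize) return 1; // everything fits in one chunk
  const step = chunkSize - overlap; // 384 with the defaults
  return 1 + Math.ceil((tokenCount - chunkSize) / step);
}
```

With the defaults, a 10,000-token document gives 1 + ceil(9,488 / 384) = 26 chunks, so 26 embeddings to generate and compare at query time. Overlap makes this about a third more than the 20 chunks you would get without it.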

Troubleshooting

File won’t upload:
  • Verify the file type is supported (.pdf, .docx, or .txt)
  • Check that the file isn’t corrupted
  • Ensure Perplexica has disk space for storage
Search not finding content:
  • Try rephrasing your question
  • Make sure the information actually exists in the file
  • Check that the file processed successfully
Slow processing:
  • Large files take longer to embed
  • Check your embedding model performance
  • Consider your system resources

Upcoming features

  • Support for additional file formats
  • Batch upload capabilities
  • File management interface
  • Search within specific files
