Overview
EduMate transforms educational PDF documents into a searchable knowledge base through a multi-stage processing pipeline. The system loads PDFs, splits them into semantic chunks, and generates vector embeddings for efficient retrieval.

Processing Pipeline
PDF Discovery
The system finds PDF files from various input sources including:
- Individual PDF files
- Directories containing PDFs
- Glob patterns for batch processing
Document Loading
PDFs are loaded using LangChain's PyPDFLoader, which extracts text content page by page. Each page becomes a separate document with metadata including the source file path.

Text Chunking
Documents are split into smaller, overlapping chunks using RecursiveCharacterTextSplitter to maintain semantic coherence while staying within embedding model limits.

PDF Discovery and Loading
The find_pdfs() function handles multiple input formats:
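The implementation is not reproduced here; as a rough sketch, a helper along these lines could resolve all three input types into a flat list of paths (the body below is illustrative, not EduMate's exact code):

```python
from pathlib import Path


def find_pdfs(inputs: list[str]) -> list[Path]:
    """Resolve files, directories, and glob patterns into PDF paths (illustrative sketch)."""
    pdfs: list[Path] = []
    for item in inputs:
        path = Path(item)
        if path.is_file() and path.suffix.lower() == ".pdf":
            pdfs.append(path)                          # individual PDF file
        elif path.is_dir():
            pdfs.extend(sorted(path.rglob("*.pdf")))   # directory containing PDFs
        else:
            pdfs.extend(sorted(Path(".").glob(item)))  # glob pattern, e.g. "notes/*.pdf"
    return pdfs
```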
Each discovered PDF is then loaded with PyPDFLoader, preserving source metadata:
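A minimal sketch of this loading step, assuming the standard langchain-community loader (the exact module path can vary between LangChain versions):

```python
import logging

from langchain_community.document_loaders import PyPDFLoader

logger = logging.getLogger(__name__)


def load_pdf(path: str) -> list:
    """Load one PDF into per-page Documents (sketch; not EduMate's exact code)."""
    pages = PyPDFLoader(path).load()                      # one Document per page
    pages = [p for p in pages if p.page_content.strip()]  # drop pages with no extractable text
    if not pages:
        logger.warning("No extractable text found in %s", path)
    # Each Document carries metadata such as {"source": path, "page": <page number>}.
    return pages
```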
The system gracefully handles PDFs with no extractable text and logs warnings for debugging.
Text Chunking Configuration
EduMate uses RecursiveCharacterTextSplitter with carefully tuned parameters to balance context preservation and embedding efficiency:
Chunking Parameters
- Chunk Size: 15,000 characters - large enough to capture complete concepts and maintain semantic coherence across paragraphs and sections.
- Chunk Overlap: 4,000 characters - ensures continuity between chunks and prevents important information from being split across boundaries.
The large chunk size (15,000 characters) is optimized for educational content, allowing entire topics and concepts to remain together in a single retrievable unit.
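In code, this configuration corresponds to a splitter roughly like the following (the import paths assume the langchain-text-splitters and langchain-core packages; older LangChain versions expose the same classes from the langchain package):

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Splitter configured with the chunk size and overlap described above.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=15_000,    # keep whole topics and multi-step explanations together
    chunk_overlap=4_000,  # generous overlap so boundary-spanning content appears in both chunks
)

# `pages` would normally come from the loading step above.
pages = [Document(page_content="...", metadata={"source": "chemistry.pdf", "page": 1})]
chunks = splitter.split_documents(pages)
```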
Why These Values?
- Large chunks preserve complete explanations, formulas, and multi-step procedures
- Significant overlap ensures that concepts spanning chunk boundaries are captured in multiple chunks
- Better context for the LLM when generating questions, as each retrieved chunk contains more complete information
Vector Embeddings
EduMate uses the Qwen3 embedding model to convert text chunks into high-dimensional vector representations.

Embedding Model: qwen3-embedding:0.6b
Why Qwen3 Embedding?
- Lightweight: 0.6B parameters allows fast local inference via Ollama
- Multilingual: Supports educational content in multiple languages
- High quality: Produces semantically meaningful embeddings for similarity search
- Local deployment: Runs entirely on-premises without external API calls
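A hedged sketch of how these embeddings can be produced through a local Ollama instance, assuming the langchain-ollama integration (EduMate's actual wiring may differ):

```python
from langchain_ollama import OllamaEmbeddings

# Embeddings are served by a local Ollama instance; no external API calls are made.
embeddings = OllamaEmbeddings(model="qwen3-embedding:0.6b")

# One dense vector per text; used both when indexing chunks and when embedding queries.
vector = embeddings.embed_query("What is covalent bonding?")
```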
Vector Storage in Qdrant
Processed chunks are stored in Qdrant, a high-performance vector database:

- Each document collection is stored separately (e.g., “chemistry”, “physics”)
- Chunks are indexed by their vector embeddings for fast similarity search
- Metadata (source file, page number) is preserved for traceability
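A sketch of the storage step using the langchain-qdrant integration, reusing the `chunks` and `embeddings` objects from the sketches above; the URL and collection name are illustrative assumptions:

```python
from langchain_qdrant import QdrantVectorStore

# Index the embedded chunks in a per-subject collection.
store = QdrantVectorStore.from_documents(
    chunks,                       # chunks from the splitting step
    embeddings,                   # the Qwen3 embedding model above
    url="http://localhost:6333",
    collection_name="chemistry",
)

# Chunk metadata (source file, page number) travels with each stored vector.
results = store.similarity_search("balancing redox reactions", k=4)
```

Because each subject lives in its own collection, retrieval can be scoped to a single course without additional filtering.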
Processing Return Value
After successful processing, the system returns metadata about the operation.

Complete Processing Flow
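The exact function names and return schema are not documented here; the sketch below ties the earlier stages together under assumed names (process_documents, plus find_pdfs, load_pdf, splitter, and embeddings from the previous sketches) and ends with an illustrative metadata dict:

```python
from langchain_qdrant import QdrantVectorStore


def process_documents(inputs: list[str], collection_name: str) -> dict:
    """End-to-end sketch of the pipeline; names and return fields are assumptions."""
    # 1. Discover PDFs from files, directories, or glob patterns.
    pdf_paths = find_pdfs(inputs)

    # 2. Load each PDF page by page, keeping source metadata.
    pages = [page for path in pdf_paths for page in load_pdf(str(path))]

    # 3. Split pages into large overlapping chunks.
    chunks = splitter.split_documents(pages)

    # 4. Embed the chunks and store them in a per-subject Qdrant collection.
    QdrantVectorStore.from_documents(
        chunks,
        embeddings,
        url="http://localhost:6333",
        collection_name=collection_name,
    )

    # 5. Return metadata about the operation (illustrative schema, not EduMate's exact one).
    return {
        "collection": collection_name,
        "pdf_count": len(pdf_paths),
        "chunk_count": len(chunks),
    }
```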
Next Steps
- Assessment Generation - Learn how processed documents are used to generate assessments
- RAG System - Understand the retrieval-augmented generation architecture