Overview
Retrieval-Augmented Generation (RAG) is the core architecture powering EduMate’s intelligent assessment generation. RAG combines the strengths of vector search (retrieval) with large language model generation to produce contextually grounded, accurate questions.
What is RAG?
RAG is a technique that enhances LLM outputs by:
- Retrieving relevant information from a knowledge base
- Augmenting the LLM prompt with this retrieved context
- Generating responses grounded in factual source material
RAG mitigates the hallucination problem: instead of generating from parametric memory alone, the LLM reasons over real document content.
EduMate’s RAG Pipeline
Document Indexing
PDFs are chunked and embedded into the Qdrant vector database (an offline, one-time process).
Similarity Search
Qdrant retrieves the top-k most similar document chunks based on cosine similarity.
Architecture Components
1. Langchain
Langchain orchestrates the RAG pipeline, providing:
- Document loaders: `PyPDFLoader` for PDF extraction
- Text splitters: `RecursiveCharacterTextSplitter` for chunking
- Vector stores: `QdrantVectorStore` integration
- Embeddings: `OllamaEmbeddings` wrapper
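A minimal sketch of the indexing side using these components (the 15,000-character chunk size matches the example later on this page; the overlap value is an assumption):

```python
# Indexing sketch: load a PDF and split it into chunks.
# chunk_overlap is an assumed value, not taken from EduMate's code.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

pages = PyPDFLoader("organic_chemistry.pdf").load()   # one Document per page
splitter = RecursiveCharacterTextSplitter(chunk_size=15000, chunk_overlap=200)
chunks = splitter.split_documents(pages)              # source + page kept in metadata
```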
2. Qdrant Vector Database
Qdrant stores and retrieves document embeddings:
- URL: `http://localhost:6333`
- Collections: Separate collection per subject (e.g., “chemistry”, “physics”)
- Metadata: Stores source file and page number with each chunk
Why Qdrant?
- High performance: Optimized for similarity search at scale
- Metadata filtering: Filter by source, page, or custom fields
- Local deployment: Runs on-premises for data privacy
- REST API: Easy integration with Python and other languages
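For illustration, a small sketch of talking to Qdrant directly with its Python client, including a metadata filter (the payload key `metadata.source` is an assumption about how Langchain lays out chunk metadata):

```python
# Sketch: inspect a per-subject collection and count chunks from one source file.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")
print(client.get_collection("chemistry").points_count)  # total stored chunks

# Metadata filtering: restrict to chunks from a single uploaded PDF
flt = Filter(must=[FieldCondition(key="metadata.source",
                                  match=MatchValue(value="organic_chemistry.pdf"))])
print(client.count(collection_name="chemistry", count_filter=flt, exact=True).count)
```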
3. Ollama for Embeddings
Ollama runs the Qwen3 embedding model locally:
- Model: `qwen3-embedding:0.6b`
- Endpoint: `http://localhost:11434`
- Benefits: Fast, local inference with no external API calls
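Generating a query embedding through this endpoint looks roughly like this (a sketch, assuming the `OllamaEmbeddings` wrapper named above):

```python
# Sketch: embed a query locally via Ollama
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="qwen3-embedding:0.6b",
                              base_url="http://localhost:11434")
vector = embeddings.embed_query("Alkanes and Alkenes")
print(len(vector))  # dimensionality of the embedding space
```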
Vector Similarity Search
The core of retrieval is vector similarity search:
How Similarity Search Works
- Query embedding: User query is converted to a vector using Qwen3
- Cosine similarity: Qdrant computes cosine similarity between query vector and all document vectors
- Top-k retrieval: Returns the k most similar chunks (default: 5)
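Put together, these three steps amount to a single call in Langchain; a sketch, reusing the collection and query examples from elsewhere on this page:

```python
# Sketch: top-k retrieval against a per-subject collection
from langchain_ollama import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

store = QdrantVectorStore(
    client=QdrantClient(url="http://localhost:6333"),
    collection_name="chemistry",
    embedding=OllamaEmbeddings(model="qwen3-embedding:0.6b"),
)
# Embeds the query, computes cosine similarity, returns the k best chunks
for doc, score in store.similarity_search_with_score("Alkanes and Alkenes", k=5):
    print(f"{score:.3f}", doc.metadata.get("source"), doc.metadata.get("page"))
```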
Cosine Similarity Explained
Cosine similarity measures the angle between two vectors:
- 1.0: Identical meaning (0° angle)
- 0.0: Orthogonal (90° angle, unrelated)
- -1.0: Opposite meaning (180° angle)
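The computation itself is simple; a dependency-free sketch reproducing the three reference values above:

```python
# Cosine similarity from first principles
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   #  1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   #  0.0 (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```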
Top-k Parameter
Default: k=5
Retrieves 5 chunks to balance context richness with token efficiency.
Configurable
Can be increased for broader context or decreased for focused questions.
Context Formatting
Retrieved chunks are formatted to separate metadata from content; a sketch of the format follows the points below.
Why This Format?
Metadata Isolation
Clearly marking metadata as “DO NOT MENTION IN OUTPUT” instructs the LLM to use it for verification only, not in generated questions.
Content Clarity
Separating educational content makes it clear what information should be used for question generation.
Source Traceability
Including source and page metadata allows debugging and verification of question grounding.
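A hypothetical formatter following these conventions (the marker text comes from the description above; the exact layout is an assumption):

```python
# Hypothetical chunk formatter: metadata isolated from educational content
from langchain_core.documents import Document

def format_chunk(doc: Document) -> str:
    meta = (f"[METADATA - DO NOT MENTION IN OUTPUT: "
            f"source={doc.metadata.get('source')}, page={doc.metadata.get('page')}]")
    return f"{meta}\nCONTENT:\n{doc.page_content}"
```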
Complete RAG Workflow
The full RAG implementation lives in `backend/queue/chat.py`.
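The file itself is not reproduced here; the following is a condensed, illustrative sketch of the flow it implements (function and variable names such as `generate_assessment` are assumptions):

```python
# Illustrative end-to-end RAG flow; names are assumptions, not the actual
# contents of backend/queue/chat.py.
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

def generate_assessment(topic: str, subject: str) -> str:
    # 1. Open the per-subject collection with the same embedding model
    #    that was used at indexing time
    store = QdrantVectorStore(
        client=QdrantClient(url="http://localhost:6333"),
        collection_name=subject,
        embedding=OllamaEmbeddings(model="qwen3-embedding:0.6b"),
    )
    # 2. Retrieve the top-5 chunks most similar to the topic
    docs = store.similarity_search(topic, k=5)
    # 3. Augment: metadata isolated from content (see format above)
    context = "\n\n".join(
        f"[METADATA - DO NOT MENTION IN OUTPUT: source={d.metadata.get('source')}, "
        f"page={d.metadata.get('page')}]\nCONTENT:\n{d.page_content}"
        for d in docs
    )
    prompt = f"Generate 20 MCQs on '{topic}' using only this material:\n\n{context}"
    # 4. Generate questions grounded in the retrieved context
    return ChatGoogleGenerativeAI(model="gemini-2.5-flash-lite").invoke(prompt).content
```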
Vector Store Initialization
EduMate uses helper functions to initialize the embedding model and vector database. The same embedding model (`qwen3-embedding:0.6b`) must be used for both indexing (document processing) and retrieval (query embedding) to ensure vectors live in the same semantic space.
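A sketch of what such helpers might look like (the names `get_embeddings` and `get_vector_store` are assumptions):

```python
# Hypothetical initialization helpers; the key point is a single shared
# embedding model for both indexing and retrieval.
from langchain_ollama import OllamaEmbeddings
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

EMBED_MODEL = "qwen3-embedding:0.6b"  # must match the model used at indexing time

def get_embeddings() -> OllamaEmbeddings:
    return OllamaEmbeddings(model=EMBED_MODEL, base_url="http://localhost:11434")

def get_vector_store(collection: str) -> QdrantVectorStore:
    return QdrantVectorStore(
        client=QdrantClient(url="http://localhost:6333"),
        collection_name=collection,
        embedding=get_embeddings(),
    )
```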
LLM Clients
EduMate configures clients for both Gemini (production) and Ollama (local alternative):
Gemini (Default)
Production LLM
- Model: `gemini-2.5-flash-lite`
- Structured output support
- High-quality generation
Ollama (Alternative)
Local LLM
- Model: `llama3.2:1b` (commented out)
- Fully offline operation
- Privacy-focused deployment
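Configured with Langchain’s chat wrappers, the two clients might look like this (a sketch; `ChatGoogleGenerativeAI` reads the `GOOGLE_API_KEY` environment variable):

```python
# Sketch of the two LLM client configurations
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama

# Production: Gemini (structured output support, high-quality generation)
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash-lite")  # needs GOOGLE_API_KEY

# Local alternative, kept commented out as in the source:
# llm = ChatOllama(model="llama3.2:1b", base_url="http://localhost:11434")
```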
Benefits of RAG Architecture
Factual Grounding
Questions are based on actual document content, not LLM parametric memory, reducing hallucinations.
Source Traceability
Each generated question can be traced back to specific pages and documents for verification.
RAG vs. Fine-Tuning
| Criterion | RAG (EduMate) | Fine-Tuning |
|---|---|---|
| Setup Time | Minutes (just index docs) | Days/weeks (training required) |
| Cost | Low (inference only) | High (GPU training) |
| Updatability | Instant (add new docs) | Slow (retrain model) |
| Factual Accuracy | High (grounded in sources) | Variable (can hallucinate) |
| Traceability | Full (source + page metadata) | None (black box) |
Performance Considerations
Embedding Generation
- Speed: Qwen3 (0.6B params) embeds ~100 tokens/sec on CPU
- Batch processing: Documents are embedded in batches during indexing
- Query latency: Single query embedding takes ~50-100ms
Vector Search
- Qdrant performance: Less than 10ms for top-5 retrieval on 100k vectors
- Scaling: Sub-linear scaling with HNSW index
- Memory: ~1GB RAM per 100k vectors (1024-dim embeddings)
End-to-End Latency
Typical Request Timeline
- Query embedding: 50-100ms (Ollama)
- Vector search: 10-20ms (Qdrant)
- LLM generation: 5-15 seconds (Gemini, 20 questions)
Storage Architecture
- PostgreSQL: Stores assessment metadata and results
- Qdrant: Stores document vectors and metadata
- Separation of concerns: Structured data in Postgres, vectors in Qdrant
Async Processing with Redis Queue
Both document processing and assessment generation run asynchronously. Handling these jobs via Redis Queue (RQ) prevents long-running operations from blocking the FastAPI web server, keeping the user experience responsive even under heavy workloads.
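Enqueueing a generation job with RQ looks roughly like this (a sketch; the dotted function path reuses the hypothetical `generate_assessment` name from above):

```python
# Sketch: enqueue a long-running job instead of executing it in the request
from redis import Redis
from rq import Queue

queue = Queue(connection=Redis())  # defaults to localhost:6379

# The FastAPI handler returns immediately; an RQ worker runs the job
job = queue.enqueue("backend.queue.chat.generate_assessment",  # assumed name
                    topic="Alkanes and Alkenes", subject="chemistry")
print(job.id)  # poll job status with this id
```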
System Requirements
To run EduMate’s RAG system:
Ollama
Embedding Model
- Install Ollama
- Pull `qwen3-embedding:0.6b`
- Run on port 11434
Qdrant
Vector Database
- Docker image: `qdrant/qdrant`
- Or standalone binary
- Run on port 6333
PostgreSQL
Structured Storage
- Store assessments
- User data
- JSONB support required
Redis
Task Queue
- Redis server
- RQ workers
- Async job processing
Example: Complete RAG Flow
Let’s trace a complete request:
1. User uploads `organic_chemistry.pdf` → document processing job queued
2. RQ worker chunks the document into 42 chunks (15,000 chars each)
3. Ollama generates embeddings for all 42 chunks
4. Qdrant stores the vectors in the “chemistry” collection
5. User requests an assessment on “Alkanes and Alkenes”
6. Query embedding generated for “Alkanes and Alkenes”
7. Qdrant retrieves the 5 most similar chunks (pages 12, 13, 14, 18, 19)
8. Context formatted with metadata and content
9. Gemini generates 20 MCQs following the Bloom’s taxonomy distribution
10. Assessment stored in PostgreSQL with JSONB content
11. User receives the generated questions in the UI
Debugging and Monitoring
The RAG pipeline includes logging for debugging.
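As an illustration of the kind of logging involved (the logger name and helper are assumptions):

```python
# Hypothetical logging around the retrieval step
import logging

logger = logging.getLogger("edumate.rag")

def log_retrieval(query: str, docs) -> None:
    logger.info("RAG query: %r", query)
    for doc in docs:
        logger.debug("retrieved chunk: source=%s page=%s",
                     doc.metadata.get("source"), doc.metadata.get("page"))
```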
Next Steps
Document Processing
Learn about PDF loading, chunking, and embedding generation
Assessment Generation
Deep dive into prompt engineering and question generation