Why RAG matters for architecture
LLMs trained on general code and documentation tend to produce plausible-sounding but context-free recommendations. For architecture work this is especially risky: a suggestion to “use a repository pattern” is only useful if it accounts for your actual tech stack, your team’s coding standards, and the compliance requirements of your domain. RAG addresses this by:

- Grounding responses in the project’s own knowledge base (Tech Packs, templates, examples)
- Reducing hallucinations by injecting verified technical content before the LLM generates anything
- Enabling offline operation — the vector store is local, so no data leaves your machine
The RAG pipeline
User query
The user sends a message through the Flutter chat UI. Before reaching the backend, the message is validated against `CHAT_MAX_MESSAGE_LENGTH` (default 20,000 characters) and passed through the input sanitizer.

Prompt sanitization
`InputSanitizer.sanitize_message()` applies a two-step defence: HTML special characters are escaped, and injection phrases such as `ignore previous instructions` or `you are now a different` are detected and logged. The sanitised text continues through the pipeline — the detection is non-blocking by design, so legitimate architecture prompts are never silently dropped.
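A minimal sketch of this two-step defence (the pattern list and class shape are illustrative, not the project's actual `InputSanitizer`):

```python
import html
import logging
import re

log = logging.getLogger("sanitizer")

# Illustrative subset of the injection patterns described in this document.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+previous\s+instructions", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+a\s+different", re.IGNORECASE),
]

class InputSanitizer:
    @staticmethod
    def sanitize_message(message: str) -> str:
        # Step 1: escape HTML so markup survives as literal, harmless text.
        sanitized = html.escape(message)
        # Step 2: detect (but do not block) injection attempts — non-blocking
        # by design, so legitimate prompts are never silently dropped.
        for pattern in INJECTION_PATTERNS:
            if pattern.search(message):
                log.warning("Possible prompt injection: %r", pattern.pattern)
        return sanitized
```

Note that detection only logs; the sanitised text always continues through the pipeline.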
Dual-channel retrieval

The `SequentialOrchestrator` runs two parallel retrieval paths:

- Global knowledge-base search — queries ChromaDB’s `softarchitect_kb` collection for general software-engineering knowledge relevant to the user’s input.
- Per-project semantic search — queries a project-specific ChromaDB collection using the composite query `"Context for {doc_type}: {user_input}"`, returning the top-k most relevant chunks from documents already generated for this project.
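The two channels can be sketched as follows. The collection objects mimic ChromaDB's `query(query_texts=..., n_results=...)` call; the function itself is illustrative, not `SequentialOrchestrator`'s real API:

```python
# Sketch of the two retrieval channels against duck-typed collections that
# expose ChromaDB's query(query_texts=..., n_results=...) interface.
def dual_channel_retrieve(kb_collection, project_collection,
                          user_input: str, doc_type: str, k: int = 3):
    # Channel 1: global knowledge-base search over softarchitect_kb.
    global_hits = kb_collection.query(query_texts=[user_input], n_results=k)
    # Channel 2: per-project search with the composite query.
    composite = f"Context for {doc_type}: {user_input}"
    project_hits = project_collection.query(query_texts=[composite], n_results=k)
    return global_hits, project_hits
```

Both result sets are handed to the prompt-assembly step below.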
Prompt assembly
The orchestrator assembles the final prompt in a deterministic order, using XML tags to delimit each section for the LLM. A hard safety cap truncates the assembled prompt to `LLM_MAX_PROMPT_CHARS` (default 200,000) before it is sent. The cap never silently drops RAG context blocks: it truncates from the end of the assembled string, so critical architectural context always reaches the model.
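A sketch of the assembly and cap; the section/tag names here are assumptions, but the ordering and end-truncation behaviour follow the description above:

```python
import os

LLM_MAX_PROMPT_CHARS = int(os.getenv("LLM_MAX_PROMPT_CHARS", "200000"))

def assemble_prompt(system: str, kb_context: str, project_context: str,
                    user_input: str, max_chars: int = LLM_MAX_PROMPT_CHARS) -> str:
    # Deterministic order, each section delimited by an XML tag.
    prompt = (
        f"<system>{system}</system>\n"
        f"<knowledge_base>{kb_context}</knowledge_base>\n"
        f"<project_context>{project_context}</project_context>\n"
        f"<user_request>{user_input}</user_request>"
    )
    # Hard cap: cut from the end, keeping the earlier context blocks intact.
    return prompt[:max_chars]
```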
LLM generation (streaming)

The assembled prompt is sent to the active LLM provider via `BaseLLMClient.stream_generate()`. Tokens are yielded incrementally as they arrive:
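The streaming contract can be illustrated with a fake echo provider; the real `BaseLLMClient` subclasses wrap actual LLM APIs, so everything below is a sketch:

```python
import asyncio
from typing import AsyncIterator

class EchoLLMClient:
    """Fake provider demonstrating the async-generator streaming contract."""
    async def stream_generate(self, prompt: str) -> AsyncIterator[str]:
        for token in prompt.split():
            await asyncio.sleep(0)  # stand-in for network latency
            yield token + " "

async def consume(client, prompt: str) -> str:
    """Collect the stream; in the app each token is forwarded to the chat UI."""
    parts = []
    async for token in client.stream_generate(prompt):
        parts.append(token)
    return "".join(parts)
```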
Vector store: ChromaDB

All knowledge base content is embedded and stored in ChromaDB, running as a local Docker container. After ingestion the store contains:

| Metric | Value |
|---|---|
| Documents | 29 |
| Vector embeddings | 934 |
| Collections | 3 (tech-packs, templates, examples) |
*The softarchitect collection after ingestion.*
A dedicated `/api/v1/knowledge/status` API endpoint is planned for Phase 2. Until then, use the ChromaDB API at http://localhost:8001 to inspect collection state.

`VectorStoreService` connects to ChromaDB over HTTP and wraps all operations with a retry decorator (exponential backoff, 3 attempts) and deterministic SHA-256 document IDs for idempotent ingestion:
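The retry-and-idempotency pattern can be sketched like this; the decorator and helper names are illustrative, not `VectorStoreService`'s actual internals:

```python
import functools
import hashlib
import time

def with_retry(attempts: int = 3, base_delay: float = 0.5):
    """Retry a flaky call with exponential backoff: delay, 2*delay, 4*delay."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == attempts - 1:
                        raise  # out of attempts: surface the error
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator

def document_id(content: str) -> str:
    """Deterministic SHA-256 ID: re-ingesting the same chunk upserts in place."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()
```

Because the ID is derived from the content, running ingestion twice produces the same IDs and therefore no duplicate vectors.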
Chunking and embedding
The `DocumentLoader` reads every `.md` file under `packages/knowledge_base/`, cleans it with `MarkdownCleaner`, and performs semantic splitting. `MarkdownCleaner` also strips unsafe constructs (`javascript:` links, `data:` URIs, `<script>` and `<iframe>` blocks) before any content reaches the vector store.
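A simplified sketch of the clean-then-split stage; the real `DocumentLoader` and `MarkdownCleaner` have richer logic, and the heading-based splitter here is only one plausible form of semantic splitting:

```python
import re

def clean_markdown(text: str) -> str:
    """Strip unsafe constructs before anything reaches the vector store."""
    text = re.sub(r"<script.*?</script>", "", text, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<iframe.*?</iframe>", "", text, flags=re.DOTALL | re.IGNORECASE)
    # Neutralise javascript:/data: link targets.
    text = re.sub(r"\(\s*(javascript|data):[^)]*\)", "()", text, flags=re.IGNORECASE)
    return text

def split_by_headings(text: str) -> list[str]:
    """Semantic split: one chunk per markdown heading section."""
    chunks = re.split(r"(?m)^(?=#{1,6}\s)", text)
    return [c.strip() for c in chunks if c.strip()]
```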
Configuration
Two environment variables control the RAG context budget. Set them in `.env` based on your LLM provider’s context window:
| Variable | Default | Description |
|---|---|---|
| `LLM_MAX_PROMPT_CHARS` | 200000 | Hard cap on assembled prompt size (characters ≈ tokens × 4). |
| `RAG_MAX_CHUNKS` | 3 | Number of per-project chunks returned by semantic search. |
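For example, a provider with a roughly 32k-token context window might use a smaller budget (illustrative values only; characters ≈ tokens × 4, and some of the window must be left for the model's output):

```
# .env — illustrative values for a ~32k-token context window
LLM_MAX_PROMPT_CHARS=100000
RAG_MAX_CHUNKS=3
```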
`LLM_MAX_PROMPT_CHARS` truncates the assembled prompt from the end and never silently drops RAG context blocks, so architectural recommendations remain grounded in the knowledge base even when truncation occurs.

Prompt safety
SoftArchitect AI applies three distinct sanitization layers before any user content reaches the LLM.

HTML escaping

`html.escape()` converts `<script>` to `&lt;script&gt;`, preserving code snippets like `List<String>` that regex stripping would destroy.
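A quick standard-library comparison of escaping versus naive tag stripping (the stripping regex is a deliberately naive strawman, not the project's code):

```python
import html
import re

snippet = "List<String> items; <script>alert(1)</script>"

# Escaping keeps both the legitimate generic type and the neutralised
# script tag visible as literal text.
escaped = html.escape(snippet)

# A regex that strips <...> tags also deletes the type parameter.
stripped = re.sub(r"<[^>]+>", "", snippet)
```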
Seven regex patterns cover common prompt injection attempts (`ignore previous instructions`, `you are now`, `system:`, etc.). Detections are logged with full context for audit.

Content sanitization
`MarkdownCleaner` strips `javascript:` links, `data:` URIs, and `<iframe>`/`<script>` tags from knowledge base content during ingestion, so malicious payloads can never enter the vector store.