The SoftArchitect AI chat interface is a Flutter desktop application that streams architectural guidance token-by-token as the LLM generates it. Every message goes through the RAG pipeline before the model sees it, so responses are grounded in your project’s specific technology decisions rather than generic advice.
## How streaming works
The Flutter client sends a POST request to /api/v1/chat/stream and keeps the connection open to receive Server-Sent Events (SSE). The backend yields three event types:
| Event | Payload | When |
|---|---|---|
| `message` | `{"token": "...", "is_final": false}` | Each incremental token from the LLM |
| `done` | `{"full_response": "", "sources": [...], "metadata": {...}}` | After the last token |
| `error` | `{"error": "...", "code": "...", "retry": bool}` | On any failure |
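A client consumes these frames by reading the SSE stream line by line. As a minimal sketch, independent of any HTTP library, the parsing itself can be done with a small generator; the helper name `parse_sse` and the assumption that each event is one `event:` line followed by one `data:` line are illustrative, not part of the API:

```python
import json

def parse_sse(lines):
    """Parse raw SSE lines into (event, payload) tuples.

    Minimal parser for the three event types this endpoint emits;
    assumes each event is an `event:` line followed by one `data:` line.
    """
    event = None
    for line in lines:
        line = line.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:") and event is not None:
            yield event, json.loads(line[len("data:"):].strip())
            event = None

raw = [
    'event: message',
    'data: {"token": "Hi", "is_final": false}',
    'event: done',
    'data: {"full_response": "", "sources": [], "metadata": {}}',
]
events = list(parse_sse(raw))
# events[0] == ("message", {"token": "Hi", "is_final": False})
```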
The response headers disable caching and proxy buffering so tokens reach the client immediately:
```python
return StreamingResponse(
    event_generator(),
    media_type="text/event-stream",
    headers={
        "Cache-Control": "no-cache",   # never cache streamed tokens
        "Connection": "keep-alive",    # hold the connection open for SSE
        "X-Accel-Buffering": "no",     # disable nginx/proxy buffering
    },
)
```
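The body of `event_generator` is not shown above; a minimal sketch of one possible shape, with `stream_tokens` as a stand-in for the real LLM streaming call (both names here are illustrative):

```python
import asyncio
import json

async def stream_tokens():
    # Stand-in for the real LLM streaming call; the actual provider
    # integration is not shown in this document.
    for t in ("For", " authentication"):
        yield t

async def event_generator():
    """Yield SSE frames: one `message` per token, then a single `done`."""
    try:
        async for token in stream_tokens():
            payload = json.dumps({"token": token, "is_final": False})
            yield f"event: message\ndata: {payload}\n\n"
        done = json.dumps({"full_response": "", "sources": [], "metadata": {}})
        yield f"event: done\ndata: {done}\n\n"
    except ConnectionError as exc:
        err = json.dumps(
            {"error": str(exc), "code": "LLM_CONNECTION_ERROR", "retry": True}
        )
        yield f"event: error\ndata: {err}\n\n"

async def collect(gen):
    return [frame async for frame in gen]

frames = asyncio.run(collect(event_generator()))
# frames: two "message" events followed by one "done" event
```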
## Request and response shapes

### Chat stream request
`POST /api/v1/chat/stream`

```json
{
  "conversation_id": "550e8400-e29b-41d4-a716-446655440000",
  "message": "How should I structure authentication in my Flutter app?",
  "project_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
  "user_name": "Developer",
  "doc_type": "TECH_STACK_DECISION",
  "history": [
    {"role": "user", "content": "I'm building a mobile banking app."},
    {"role": "assistant", "content": "Great. Let's start with the project manifesto..."}
  ],
  "project_context": {
    "context/10-CONTEXT/PROJECT_MANIFESTO.md": "# Mobile Banking App\n..."
  }
}
```
| Field | Type | Description |
|---|---|---|
| `conversation_id` | UUID | Unique identifier for this conversation session. |
| `message` | string | User input. Max `CHAT_MAX_MESSAGE_LENGTH` characters (default 20,000). |
| `project_id` | UUID | Used to route per-project semantic RAG retrieval. |
| `user_name` | string | Injected into the system prompt for personalized responses. |
| `doc_type` | string \| null | Target document type (e.g. `TECH_STACK_DECISION`). Determines which template and example the orchestrator injects. |
| `history` | array | Previous messages for multi-turn context (max `CHAT_MAX_HISTORY_MESSAGES`; code default 100, the `.env.example` template sets 50). |
| `project_context` | object | Key-value map of already-generated project files. Prevents document inconsistency across workflow steps. |
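On the client side it is worth enforcing the same limits before sending, so requests are never rejected at the API boundary. A sketch, assuming the defaults above; the helper name `clamp_request` is illustrative, not part of the API:

```python
def clamp_request(message, history, max_len=20_000, max_history=50):
    """Trim a chat request to the server limits before sending.

    Mirrors the server-side ChatRequest limits (CHAT_MAX_MESSAGE_LENGTH,
    CHAT_MAX_HISTORY_MESSAGES); the defaults here are illustrative.
    """
    if len(message) > max_len:
        raise ValueError(f"message exceeds {max_len} characters")
    # Keep only the most recent messages, as the server would reject more.
    return message, history[-max_history:]

msg, hist = clamp_request(
    "How should I structure auth?",
    [{"role": "user", "content": "hi"}] * 120,
)
# hist keeps only the 50 most recent messages
```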
### Streaming response (SSE)

```
event: message
data: {"token": "For", "is_final": false}

event: message
data: {"token": " authentication", "is_final": false}

event: message
data: {"token": " in Flutter", "is_final": false}

event: done
data: {"full_response": "", "sources": ["TECH_STACK_DECISION"], "metadata": {"template_used": "TECH_STACK_DECISION"}}
```
### Error events

```
event: error
data: {"error": "AI Engine is unreachable", "code": "LLM_CONNECTION_ERROR", "retry": true}
```
Error codes and their retry guidance:
| Code | Meaning | Retry |
|---|---|---|
| `LLM_CONNECTION_ERROR` | LLM provider is unreachable | Yes |
| `RAG_RETRIEVAL_ERROR` | ChromaDB query failed | No |
| `STREAM_ERROR` | Unexpected error during streaming | No |
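A client can combine the `retry` flag with this table to decide whether to reconnect. A sketch; the function name, the attempt cap, and the `RETRYABLE` set are client-side assumptions, not part of the API:

```python
import json

RETRYABLE = {"LLM_CONNECTION_ERROR"}

def should_retry(error_event, attempt, max_attempts=3):
    """Decide whether to retry a failed stream.

    Honors both the `retry` flag in the error payload and the code table
    above; `max_attempts` is an illustrative client-side cap.
    """
    data = json.loads(error_event)
    return (
        attempt < max_attempts
        and data.get("retry", False)
        and data.get("code") in RETRYABLE
    )

evt = '{"error": "AI Engine is unreachable", "code": "LLM_CONNECTION_ERROR", "retry": true}'
# should_retry(evt, attempt=1) -> True; attempts at or past the cap return False
```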
## Document generation endpoint
`POST /api/v1/chat/generate` is a legacy endpoint kept for backward compatibility. It accepts a slightly different request shape:

```json
{
  "message": "Build me an architecture for a fintech app",
  "doc_type": "PROJECT_MANIFESTO",
  "project_context": {},
  "chat_history": []
}
```
New integrations should use `/api/v1/chat/stream`, which integrates with FastAPI’s dependency injection system and supports API key verification.
## Available doc types

The `doc_type` field maps directly to one of the 24 steps in the Master Workflow. Each type has a corresponding template and example file that the `WorkflowInjector` reads from disk and injects into the LLM prompt.
### Phase 1 — Context

| doc_type | Output file |
|---|---|
| PROJECT_MANIFESTO | context/10-CONTEXT/PROJECT_MANIFESTO.md |
| DOMAIN_LANGUAGE | context/10-CONTEXT/DOMAIN_LANGUAGE.md |
| USER_JOURNEY_MAP | context/10-CONTEXT/USER_JOURNEY_MAP.md |

### Phase 2 — Requirements

| doc_type | Output file |
|---|---|
| REQUIREMENTS_MASTER | context/20-REQUIREMENTS/REQUIREMENTS_MASTER.md |
| USER_STORIES_MASTER | context/20-REQUIREMENTS/USER_STORIES_MASTER.json |
| SECURITY_PRIVACY_POLICY | context/20-REQUIREMENTS/SECURITY_PRIVACY_POLICY.md |
| COMPLIANCE_MATRIX | context/20-REQUIREMENTS/COMPLIANCE_MATRIX.md |

### Phase 3 — Architecture

| doc_type | Output file |
|---|---|
| TECH_STACK_DECISION | context/30-ARCHITECTURE/TECH_STACK_DECISION.md |
| DATA_MODEL_SCHEMA | context/30-ARCHITECTURE/DATA_MODEL_SCHEMA.md |
| API_INTERFACE_CONTRACT | context/30-ARCHITECTURE/API_INTERFACE_CONTRACT.md |
| PROJECT_STRUCTURE_MAP | context/30-ARCHITECTURE/PROJECT_STRUCTURE_MAP.md |
| SECURITY_THREAT_MODEL | context/30-ARCHITECTURE/SECURITY_THREAT_MODEL.md |
| ARCH_DECISION_RECORDS | context/30-ARCHITECTURE/ARCH_DECISION_RECORDS.md |

### Phases 4–6

| doc_type | Phase |
|---|---|
| DESIGN_SYSTEM | 4 — UX/UI |
| UI_WIREFRAMES_FLOW | 4 — UX/UI |
| ACCESSIBILITY_GUIDE | 4 — UX/UI |
| ROADMAP_PHASES | 5 — Planning |
| DEPLOYMENT_INFRASTRUCTURE | 5 — Planning |
| CI_CD_PIPELINE | 5 — Planning |
| TESTING_STRATEGY | 5 — Planning |
| RULES | 6 — Root synthesis |
| CONTRIBUTING | 6 — Root synthesis |
| AGENTS | 6 — Root synthesis |
| README | 6 — Root synthesis |
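The mapping from `doc_type` to template and example files might be resolved as below. This is a sketch only; the real `WorkflowInjector`'s directory layout and naming scheme are not documented here, so `templates/` and the `.template.md` / `.example.md` suffixes are assumptions:

```python
from pathlib import Path

# Illustrative layout only; the real WorkflowInjector's template
# directory and naming scheme may differ.
TEMPLATE_DIR = Path("templates")

def template_paths(doc_type: str) -> tuple[Path, Path]:
    """Resolve the template and example files for a workflow step."""
    return (
        TEMPLATE_DIR / f"{doc_type}.template.md",
        TEMPLATE_DIR / f"{doc_type}.example.md",
    )

template, example = template_paths("TECH_STACK_DECISION")
# template -> templates/TECH_STACK_DECISION.template.md
```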
## Chat history and message limits
Two environment variables control memory usage and prevent context-window overflow:
```bash
# .env — these values override the code defaults
CHAT_MAX_HISTORY_MESSAGES=50   # code default: 100; .env.example sets 50
CHAT_MAX_MESSAGE_LENGTH=20000
```
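The override behavior can be reduced to a one-liner per setting: read the environment variable, fall back to the code default. A sketch, with the function name invented for illustration:

```python
def load_limits(env):
    """Resolve chat limits: environment overrides, else code defaults."""
    return (
        int(env.get("CHAT_MAX_HISTORY_MESSAGES", 100)),   # code default 100
        int(env.get("CHAT_MAX_MESSAGE_LENGTH", 20_000)),  # code default 20,000
    )

# With a .env that sets only the history limit (as .env.example does):
history_limit, message_limit = load_limits({"CHAT_MAX_HISTORY_MESSAGES": "50"})
# -> (50, 20000)
```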
The ChatRequest validator enforces both limits at the API boundary:
```python
@field_validator("message")
@classmethod
def sanitize_message(cls, v: str) -> str:
    if len(v) > settings.CHAT_MAX_MESSAGE_LENGTH:
        raise ValueError(
            f"Message exceeds maximum length of {settings.CHAT_MAX_MESSAGE_LENGTH} characters."
        )
    return InputSanitizer.sanitize_message(v)

@field_validator("history")
@classmethod
def validate_history(cls, v: list[dict[str, str]]) -> list[dict[str, str]]:
    max_msgs = settings.CHAT_MAX_HISTORY_MESSAGES
    if len(v) > max_msgs:
        raise ValueError(
            f"Chat history exceeds maximum length ({max_msgs} messages)."
        )
    return v  # a field_validator must return the (possibly transformed) value
```
The orchestrator additionally keeps only the last 4 history messages when building the LLM prompt, replacing large assistant responses with a short placeholder to avoid saturating the context window:
```python
for msg in raw_history[-4:]:
    role = msg.get("role", "user")
    content = msg.get("content", "")
    if role == "assistant" and any(
        marker in content for marker in ("**Path:**", "Path:", "[document]")
    ):
        content = (
            "[Previous document generated and saved successfully."
            " Omitted from memory to save context]"
        )
```
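Wrapped in a function for illustration (the name `compact_history` is not from the source), the effect of this truncation on a longer conversation looks like:

```python
def compact_history(raw_history):
    """Keep the last 4 messages, replacing saved documents with a placeholder."""
    compact = []
    for msg in raw_history[-4:]:
        role = msg.get("role", "user")
        content = msg.get("content", "")
        if role == "assistant" and any(
            marker in content for marker in ("**Path:**", "Path:", "[document]")
        ):
            content = (
                "[Previous document generated and saved successfully."
                " Omitted from memory to save context]"
            )
        compact.append({"role": role, "content": content})
    return compact

history = [{"role": "user", "content": f"question {i}"} for i in range(6)]
history.append({"role": "assistant", "content": "**Path:** context/10-CONTEXT/PROJECT_MANIFESTO.md\n..."})
trimmed = compact_history(history)
# trimmed has 4 entries; the large assistant document became the placeholder
```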
## Optimistic UI
The Flutter client renders tokens as they arrive rather than waiting for the full response. This streaming-first rendering keeps perceived latency low: the first visible character appears as soon as the first token lands, typically well before the full generation completes, although the exact time to first token still depends on the LLM provider and network conditions.
When switching from Gemini or Groq to a local Ollama model, reduce `LLM_MAX_PROMPT_CHARS` to 30,000 and `RAG_MAX_CHUNKS` to 2. A smaller prompt keeps assembly and prefill time down, so the first token still appears quickly on local hardware.
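As a `.env` fragment, the suggested local-model tuning (values taken from the guidance above; exact variable names should be checked against your `.env.example`):

```shell
# .env — suggested tuning for a local Ollama model
LLM_MAX_PROMPT_CHARS=30000   # smaller prompt, faster prefill
RAG_MAX_CHUNKS=2             # fewer retrieved chunks per query
```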