The SoftArchitect AI chat interface is a Flutter desktop application that streams architectural guidance token-by-token as the LLM generates it. Every message goes through the RAG pipeline before the model sees it, so responses are grounded in your project’s specific technology decisions rather than generic advice.

How streaming works

The Flutter client sends a POST request to /api/v1/chat/stream and keeps the connection open to receive Server-Sent Events (SSE). The backend yields three event types:
| Event | Payload | When |
| --- | --- | --- |
| `message` | `{"token": "...", "is_final": false}` | Each incremental token from the LLM |
| `done` | `{"full_response": "", "sources": [...], "metadata": {...}}` | After the last token |
| `error` | `{"error": "...", "code": "...", "retry": bool}` | On any failure |
The response headers disable caching and proxy buffering so tokens reach the client immediately:
return StreamingResponse(
    event_generator(),
    media_type="text/event-stream",
    headers={
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "X-Accel-Buffering": "no",
    },
)
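The `event_generator` itself is not shown above. A minimal sketch of how such a generator might emit the three SSE event types (the `format_sse` helper and the token source are assumptions for illustration, not the actual backend code):

```python
import json

def format_sse(event: str, data: dict) -> str:
    # An SSE frame is an "event:" line, a "data:" line, then a blank line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

def event_generator_sketch(tokens):
    # Hypothetical generator: stream each token, then close with a "done" frame.
    try:
        for token in tokens:
            yield format_sse("message", {"token": token, "is_final": False})
        yield format_sse("done", {"full_response": "", "sources": [], "metadata": {}})
    except Exception as exc:
        yield format_sse("error", {"error": str(exc), "code": "STREAM_ERROR", "retry": False})
```

Passing a generator like this to `StreamingResponse` lets FastAPI flush each frame as soon as it is yielded.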

Request and response shapes

Chat stream request

POST /api/v1/chat/stream
{
  "conversation_id": "550e8400-e29b-41d4-a716-446655440000",
  "message": "How should I structure authentication in my Flutter app?",
  "project_id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
  "user_name": "Developer",
  "doc_type": "TECH_STACK_DECISION",
  "history": [
    {"role": "user", "content": "I'm building a mobile banking app."},
    {"role": "assistant", "content": "Great. Let's start with the project manifesto..."}
  ],
  "project_context": {
    "context/10-CONTEXT/PROJECT_MANIFESTO.md": "# Mobile Banking App\n..."
  }
}
| Field | Type | Description |
| --- | --- | --- |
| `conversation_id` | UUID | Unique identifier for this conversation session. |
| `message` | string | User input. Max `CHAT_MAX_MESSAGE_LENGTH` characters (default 20,000). |
| `doc_type` | string \| null | Target document type (e.g. `TECH_STACK_DECISION`). Determines which template and example the orchestrator injects. |
| `project_id` | UUID | Used to route per-project semantic RAG retrieval. |
| `user_name` | string | Injected into the system prompt for personalized responses. |
| `history` | array | Previous messages for multi-turn context (max `CHAT_MAX_HISTORY_MESSAGES`; code default 100, `.env.example` template sets 50). |
| `project_context` | object | Key-value map of already-generated project files. Prevents document inconsistency across workflow steps. |

Streaming response (SSE)

event: message
data: {"token": "For", "is_final": false}

event: message
data: {"token": " authentication", "is_final": false}

event: message
data: {"token": " in Flutter", "is_final": false}

event: done
data: {"full_response": "", "sources": ["TECH_STACK_DECISION"], "metadata": {"template_used": "TECH_STACK_DECISION"}}
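A client consuming this stream splits the text on blank lines and pairs each `event:` line with its `data:` line. A minimal parser sketch (the function name is illustrative; a production client would parse incrementally as bytes arrive):

```python
import json

def parse_sse(raw: str) -> list[tuple[str, dict]]:
    # Blank lines delimit SSE frames; each frame carries one event/data pair.
    events = []
    for block in raw.strip().split("\n\n"):
        event, data = None, None
        for line in block.splitlines():
            if line.startswith("event: "):
                event = line[len("event: "):]
            elif line.startswith("data: "):
                data = json.loads(line[len("data: "):])
        events.append((event, data))
    return events
```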

Error events

event: error
data: {"error": "AI Engine is unreachable", "code": "LLM_CONNECTION_ERROR", "retry": true}
Error codes and their retry guidance:
| Code | Meaning | Retry |
| --- | --- | --- |
| `LLM_CONNECTION_ERROR` | LLM provider is unreachable | Yes |
| `RAG_RETRIEVAL_ERROR` | ChromaDB query failed | No |
| `STREAM_ERROR` | Unexpected error during streaming | No |
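The retry guidance can be applied mechanically on the client. A sketch, assuming the client prefers the explicit `retry` flag and falls back to the code table (the helper and the set name are illustrative, not part of the API):

```python
# Codes the table marks as retryable; fallback when no "retry" flag is present.
RETRYABLE_CODES = {"LLM_CONNECTION_ERROR"}

def should_retry(error_event: dict) -> bool:
    # Prefer the server's explicit retry flag when it is present.
    if "retry" in error_event:
        return bool(error_event["retry"])
    return error_event.get("code") in RETRYABLE_CODES
```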

Document generation endpoint

POST /api/v1/chat/generate is a legacy endpoint kept for backward compatibility. It accepts a slightly different request shape:
{
  "message": "Build me an architecture for a fintech app",
  "doc_type": "PROJECT_MANIFESTO",
  "project_context": {},
  "chat_history": []
}
New integrations should use /api/v1/chat/stream, which integrates with FastAPI’s dependency injection system and supports API key verification.

Available doc types

The doc_type field maps directly to one of the 24 steps in the Master Workflow. Each type has a corresponding template and example file that the WorkflowInjector reads from disk and injects into the LLM prompt.
| doc_type | Output file |
| --- | --- |
| `PROJECT_MANIFESTO` | `context/10-CONTEXT/PROJECT_MANIFESTO.md` |
| `DOMAIN_LANGUAGE` | `context/10-CONTEXT/DOMAIN_LANGUAGE.md` |
| `USER_JOURNEY_MAP` | `context/10-CONTEXT/USER_JOURNEY_MAP.md` |
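The mapping above can be expressed as a simple lookup. A sketch covering only the three types listed here (the dict and function names are illustrative; the full 24-entry table and the WorkflowInjector internals are not reproduced):

```python
# Illustrative subset of the doc_type -> output file mapping.
DOC_TYPE_OUTPUT_FILES = {
    "PROJECT_MANIFESTO": "context/10-CONTEXT/PROJECT_MANIFESTO.md",
    "DOMAIN_LANGUAGE": "context/10-CONTEXT/DOMAIN_LANGUAGE.md",
    "USER_JOURNEY_MAP": "context/10-CONTEXT/USER_JOURNEY_MAP.md",
}

def output_path_for(doc_type: str) -> str:
    # Fail loudly on an unknown doc_type instead of writing to a wrong path.
    try:
        return DOC_TYPE_OUTPUT_FILES[doc_type]
    except KeyError:
        raise ValueError(f"Unknown doc_type: {doc_type}")
```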

Chat history and message limits

Two environment variables control memory usage and prevent context-window overflow:
# .env — these values override the code defaults
CHAT_MAX_HISTORY_MESSAGES=50    # Code default: 100; .env.example sets 50
CHAT_MAX_MESSAGE_LENGTH=20000
The ChatRequest validator enforces both limits at the API boundary:
@field_validator("message")
@classmethod
def sanitize_message(cls, v: str) -> str:
    if len(v) > settings.CHAT_MAX_MESSAGE_LENGTH:
        raise ValueError(
            f"Message exceeds maximum length of {settings.CHAT_MAX_MESSAGE_LENGTH} characters."
        )
    return InputSanitizer.sanitize_message(v)

@field_validator("history")
@classmethod
def validate_history(cls, v: list[dict[str, str]]) -> list[dict[str, str]]:
    max_msgs = settings.CHAT_MAX_HISTORY_MESSAGES
    if len(v) > max_msgs:
        raise ValueError(
            f"Chat history exceeds maximum length ({max_msgs} messages)."
        )
    return v
The orchestrator additionally keeps only the last 4 history messages when building the LLM prompt, replacing large assistant responses with a short placeholder to avoid saturating the context window:
prompt_history = []
for msg in raw_history[-4:]:  # only the 4 most recent messages
    role = msg.get("role", "user")
    content = msg.get("content", "")
    # Large generated documents are swapped for a short placeholder.
    if role == "assistant" and any(
        marker in content for marker in ("**Path:**", "Path:", "[document]")
    ):
        content = (
            "[Previous document generated and saved successfully."
            " Omitted from memory to save context]"
        )
    prompt_history.append({"role": role, "content": content})

Optimistic UI

The Flutter client renders tokens as they arrive rather than waiting for the full response. This pattern (optimistic UI) keeps perceived latency low: the first visible character typically appears within about 200 ms, long before the full response completes, whatever the LLM provider.
When switching from Gemini or Groq to a local Ollama model, reduce LLM_MAX_PROMPT_CHARS to 30000 and RAG_MAX_CHUNKS to 2. This prevents the longer assembly time from making the first token appear slow.
