Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/techjarves/USB-Uncensored-LLM/llms.txt

Use this file to discover all available pages before exploring further.

chat_server.py is a zero-dependency Python HTTP server built entirely from the standard library — no pip install required. It serves the FastChatUI.html web interface, persists conversation history as JSON files on the drive, proxies all Ollama API calls (eliminating browser CORS restrictions), and exposes live CPU and RAM metrics. It runs on port 3333 and binds to all network interfaces so it is reachable from phones and other devices on the same LAN.

Configuration Constants

These values are set at the top of chat_server.py and control core server behaviour:
ConstantDefaultDescription
CHAT_SERVER_PORT3333TCP port the HTTP server listens on
OLLAMA_HOSThttp://127.0.0.1:11434Address of the local Ollama engine
LLAMA_CPP_MODEfalseActivated by passing --llama-cpp as a CLI argument; changes OLLAMA_HOST to http://127.0.0.1:8080 and enables OpenAI API translation
All file paths inside the server are resolved relative to the location of chat_server.py itself (i.e., Shared/). This ensures the server always reads and writes to the USB drive, regardless of the current working directory when it is launched.

API Endpoints

GET / and GET /index.html

Serves the main chat interface.
  • Response: 200 text/html — contents of Shared/FastChatUI.html
  • Error: 404 if FastChatUI.html is missing from the Shared/ directory

GET /api/chats

Returns the full saved chat history from disk.
  • Response: 200 application/json — a JSON array of all chat objects stored in Shared/chat_data/chats.json
  • Fallback: If the file is missing or contains malformed JSON, returns 200 with an empty array [] rather than an error, so the UI always loads cleanly

POST /api/chats

Persists the current chat history to disk.
  • Body: JSON array of chat objects (the full chat history from the UI)
  • Response: 200 application/json
    { "ok": true }
    
  • Error: 500 application/json
    { "error": "..." }
    
  • Atomicity: Writes to chats.json.tmp first, then renames to chats.json via os.replace(). This prevents corruption if the process is killed mid-write.

GET /api/settings

Returns the current user settings from Shared/chat_data/settings.json.
  • Response: 200 application/json
    {
      "globalSystemPrompt": "",
      "temperature": 0.7,
      "logMode": "errors_only"
    }
    
  • If the file is missing, these defaults are returned automatically.

POST /api/settings

Merges a partial or full settings object with the existing settings and saves to disk.
  • Body: A partial JSON object — only keys you want to change are required
    { "temperature": 0.9, "logMode": "all" }
    
  • The incoming object is merged on top of the existing settings (missing keys are preserved).
  • logMode is normalized: any value other than "all" is coerced to "errors_only".
  • The in-memory log mode is updated immediately without restart.
  • Response: 200 application/json
    { "ok": true, "logMode": "errors_only" }
    
  • Error: 500 application/json
    { "error": "..." }
    

GET /api/stats

Returns real-time CPU and RAM usage for the host machine.
  • Response: 200 application/json
    {
      "cpu_percent": 14.2,
      "ram_percent": 67.8,
      "has_psutil": false
    }
    
  • has_psutil — indicates whether the optional psutil library was found. The server works without it using platform-native fallbacks:
    PlatformRAM sourceCPU source
    WindowsGlobalMemoryStatusEx (kernel32)GetSystemTimes (kernel32)
    Linux/proc/meminfo/proc/stat delta
    macOS
    On macOS, CPU and RAM are returned as 0.0 in the stdlib fallback path to avoid potential permission issues. Install psutil (pip install psutil) for accurate macOS stats.
  • Error: 500 application/json on unexpected failure

GET | POST | DELETE /ollama/*

Transparent reverse proxy to the local Ollama engine. All Ollama API calls from the browser are routed here to avoid CORS errors.
  • Path rewriting: The /ollama prefix is stripped before forwarding. For example, GET /ollama/api/tags is forwarded to GET http://127.0.0.1:11434/api/tags.
  • Streaming: For /api/chat and /api/generate, the response is streamed back in 4096-byte chunks as they arrive from Ollama, enabling token-by-token rendering in the UI.
  • Validation: POST /ollama/api/chat validates the request body before forwarding:
    • model must be a non-empty string — returns 400 if missing or blank
    • messages must be a non-empty array — returns 400 if missing or empty
  • LLAMA_CPP_MODE bridging: When started with --llama-cpp, the proxy translates /api/chat payloads from Ollama JSONL format to OpenAI /v1/chat/completions format for llama.cpp’s llama-server. SSE (data: {...}) responses from llama-server are bridged back to Ollama-style JSONL ({"message": ..., "done": false}\n) for the UI.
  • Responses:
    • Upstream status code on success
    • 400 — validation failure (missing model or messages)
    • 502 — Ollama engine is unreachable (not running or not yet ready)
    • 500 — unexpected proxy error

Concurrency Model

chat_server.py uses a custom ThreadedHTTPServer class that extends Python’s built-in http.server.HTTPServer:
class ThreadedHTTPServer(http.server.HTTPServer):
    def process_request(self, request, client_address):
        thread = threading.Thread(target=self._handle, args=(request, client_address))
        thread.daemon = True
        thread.start()
Each incoming request is dispatched to a new daemon thread. This means:
  • A long-running streaming response (a model generating thousands of tokens) does not block the UI from loading or saving chats.
  • Hardware stats polls from the UI run concurrently with active generation.
  • Thread safety for shared file access is enforced with threading.RLock()DATA_FILE_LOCK guards chats.json and settings.json, and LOG_MODE_LOCK guards the in-memory log mode state.

CORS

Every response from the server includes the following CORS headers, regardless of endpoint:
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, PUT, DELETE, OPTIONS
Access-Control-Allow-Headers: Content-Type, Authorization
Preflight OPTIONS requests return 204 No Content with these headers and no body. This configuration allows the UI to be accessed from any origin — critical for LAN access from mobile devices whose IP differs from the host machine.

Logging

Log entries are written to Shared/logs/chat_server.log via Python’s logging.handlers.RotatingFileHandler:
SettingValue
Max file size10 MB
Backup count1 (chat_server.log.1)
EncodingUTF-8
Log writes are asynchronous: a QueueHandler / QueueListener pair moves records off the request thread immediately, preventing I/O latency from slowing down responses. Each record is flushed to disk immediately after being written. Each log entry is a structured, multi-line block that includes:
  • Timestamp with timezone
  • Request ID (UUID, unique per request)
  • HTTP method and path
  • Client IP address
  • User-Agent header
  • Model name, temperature, and stream flag
  • Python module, function, and line number
  • Thread name
  • Full hardware snapshot (platform, CPU count, total RAM, Python version)
  • Exception traceback (if applicable)
Log mode is controlled at runtime via POST /api/settings:
logMode valueWhat is logged
"errors_only"Only ERROR-level and above (default)
"all"All levels including INFO (every request)
The mode change takes effect immediately — no server restart required.

Atomic File Writes

Both chats.json and settings.json are saved using a write-then-rename pattern to prevent data loss:
temp_file = CHATS_FILE + ".tmp"
with open(temp_file, "w", encoding="utf-8") as f:
    json.dump(chats, f, ensure_ascii=False, indent=2)
    f.flush()
os.replace(temp_file, CHATS_FILE)
os.replace() is an atomic operation on both POSIX and Windows. If the process is killed between the open and the replace, the original file is untouched. If it is killed after replace, the new data is fully committed. There is no window where the file can be half-written or empty.

Starting the Server Manually

The server is normally started by the OS-specific start script, but it can be run directly:
# Standard mode (proxies Ollama at :11434)
python3 Shared/chat_server.py

# llama.cpp mode (proxies llama-server at :8080, translates API format)
python3 Shared/chat_server.py --llama-cpp

# Suppress automatic browser open on launch
python3 Shared/chat_server.py --no-browser

# Combine flags
python3 Shared/chat_server.py --llama-cpp --no-browser
Use --no-browser when running headlessly (e.g. on a remote Linux server or in a Termux background session). The server still binds to 0.0.0.0:3333 and is reachable over the network.

Build docs developers (and LLMs) love