Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/techjarves/USB-Uncensored-LLM/llms.txt

Use this file to discover all available pages before exploring further.

Common questions about how USB-Uncensored-LLM works, what hardware it needs, and how to customize it. If you’re running into an error rather than a general question, see the Troubleshooting page instead.

Setup & Portability

No. The Ollama engine binary is self-contained in Shared/bin/ and runs directly without installation. The only external requirement is Python 3, which is needed to run the chat server.On Windows, Python is handled automatically — start-fast-chat.bat checks for a portable Python at Shared/python/python.exe first, then falls back to system Python, and downloads an embedded Python (~11 MB) to the USB drive if neither is found.On Linux and macOS, Python 3 is included with the OS on virtually all modern distributions.There are no registry edits, no PATH changes, and no administrator rights required for normal operation.
Yes. Clone or download the repository to any folder on your hard drive (for example, C:\AI or D:\Models) and run the installer from that folder. The system works identically whether the root folder is on a USB drive or an internal disk.Running from an internal SSD is significantly faster than USB — model loading is near-instant compared to the multi-second loads typical on USB 3.0 flash drives. The “USB” in the project name refers to its design goal of portability, not a requirement.
You need to run the installer once per operating system to download the appropriate engine binary for that OS. Specifically:
  • Windows downloads ollama-windows.exe to Shared/bin/
  • macOS downloads ollama-darwin to Shared/bin/
  • Linux downloads ollama-linux and the llama-server library to Shared/bin/ and Shared/lib/
Models are shared — the .gguf weight files in Shared/models/ and the Ollama model registry in Shared/models/ollama_data/ work across all platforms. You do not need to re-download any models when switching operating systems or computers.On subsequent startups on a computer you’ve already set up, just run the start script directly.
Yes. Run Windows/install.bat on the Windows machine (downloads ollama-windows.exe) and Mac/install.command on the Mac (downloads ollama-darwin). Both engine binaries are stored separately in Shared/bin/ — they do not overwrite each other.Both platforms read model weights from the same Shared/models/ directory and the same Ollama registry in Shared/models/ollama_data/, so you only download each model once regardless of how many operating systems you install the engine for.The USB drive must be formatted as exFAT (not FAT32 or NTFS) to be read-write on both Windows and macOS natively, and to support individual files larger than 4 GB (required for most model weights).

Privacy & Security

After the initial model download during installation, USB-Uncensored-LLM operates completely offline. Specifically:
  • The Ollama engine runs entirely locally with no telemetry or usage reporting.
  • The Python chat server (Shared/chat_server.py) has no outbound network connections — it only communicates with the local Ollama engine on 127.0.0.1:11434 and serves responses back to the browser.
  • The FastChatUI is a self-contained HTML file served from Shared/FastChatUI.html — it makes no requests to external CDNs, analytics endpoints, or APIs.
The only time any network traffic occurs after initial setup is if you manually trigger a model re-download via the installer.
All conversations are saved locally to Shared/chat_data/chats.json on the drive. The file is written using atomic operations (data is first written to chats.json.tmp, then renamed over the target using os.replace()) to prevent corruption if the drive is ejected unexpectedly.Nothing is sent to any external server. There is no account system, no cloud sync, and no remote backup. Your chat history exists solely on the physical drive and is lost if the drive is lost or the file is deleted.
No. The chat server binds to 0.0.0.0:3333, which makes it reachable on your local network (LAN) but not from the internet. Your router does not forward external traffic to LAN addresses unless you explicitly configure port forwarding.The 0.0.0.0 bind address is intentional — it allows you to access the chat UI from your phone or tablet on the same Wi-Fi network using the IP address printed in the terminal at startup (for example, http://192.168.1.15:3333). If you only want local access, your OS firewall can restrict port 3333 to 127.0.0.1 only.Exposing the server to the internet via port forwarding is not recommended — the chat server has no authentication system.

Models & Performance

The models in the curated catalog use one of two techniques to remove safety-alignment restrictions:
  • Abliteration — a post-training technique that mathematically identifies and removes the “refusal direction” vectors from a fine-tuned model’s weight matrices. The result is a model that retains all of its knowledge and reasoning ability but no longer has a learned tendency to refuse requests. Gemma 2 2B Abliterated uses this approach.
  • Heretic fine-tuning — instruction-tuning the base model on a dataset that explicitly rewards full compliance with all user requests regardless of content. Gemma 4 E4B Ultra Uncensored Heretic uses this approach.
Both types are open-source community fine-tunes distributed on HuggingFace in GGUF format.
Yes, in two ways:Via the installer (recommended): During install.bat, enter C at the model selection menu when prompted. Paste any direct HuggingFace GGUF URL (ending in .gguf), give the model a short local name, and optionally provide a custom system prompt. The installer downloads the file to Shared/models/, creates a Modelfile, and imports it into the Ollama registry automatically.Manually: Download a .gguf file and place it in Shared/models/. Create a Modelfile at Shared/models/Modelfile-<your-local-name>:
FROM ./<your-model-filename>.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM You are a helpful AI assistant.
Then import it into Ollama:
# From the Shared/models/ directory
ollama-windows.exe create your-local-name -f Modelfile-your-local-name
See Custom Models for a full walkthrough.
All curated models use Q4_K_M quantization (4-bit, K-quant medium). This format offers a practical balance between file size and output quality:
  • File size is approximately 55–60% smaller than the original BF16 weights.
  • Quality loss compared to the full-precision model is minimal for conversational tasks — typically imperceptible in practice.
  • Speed is better than higher-bit quantizations because less data is read from memory per token generated.
If you want higher output quality at the cost of larger file size and slower generation, use the custom model feature to download Q5_K_M or Q8_0 variants of any model from HuggingFace. These are drop-in replacements — just update the Modelfile FROM line to point to the new file.
Generation speed depends heavily on available RAM and whether GPU acceleration is active. Approximate token-per-second speeds for the Gemma 2 2B model (1.6 GB, fully loaded in RAM):
HardwareApproximate Speed
Modern CPU only (no GPU)~10–30 tokens/second
NVIDIA RTX 3080 (CUDA)~50–100 tokens/second
Apple M2 (Metal)~30–60 tokens/second
Android ARM64, 8 GB RAM~3–10 tokens/second
Speeds for larger models (9B, 12B) are roughly proportional — expect 3–5× slower than the 2B model on the same hardware. If generation is slower than expected, see Very slow text generation in the Troubleshooting guide.
Yes. The chat server uses a ThreadedHTTPServer that spawns a new daemon thread per HTTP request, so multiple clients on the LAN can connect simultaneously and receive responses without blocking each other’s connections.However, the Ollama engine processes one inference request at a time. If two users submit chat messages simultaneously, the second request will queue behind the first. Response streaming begins for the second user as soon as the first inference finishes. For light concurrent use (2–3 users), this queuing is generally unnoticeable.

Technical

USB-Uncensored-LLM supports two backend engines:Ollama mode (Windows, macOS, Linux): The Shared/bin/ollama-* binary provides a model registry (ollama create, ollama list) and exposes an Ollama-native HTTP API on port 11434. The chat server proxies /ollama/* requests directly to this API.llama.cpp mode (Android): The Android installer compiles llama-server from the llama.cpp project source directly on the device’s ARM64 processor. This binary exposes an OpenAI-compatible API on port 8080. Because the chat server normally speaks the Ollama API, passing --llama-cpp to chat_server.py enables a translation layer that converts Ollama-style /api/chat payloads to OpenAI /v1/chat/completions requests before forwarding them to llama-server, and translates the SSE streaming response back into Ollama JSONL format for the UI.
All persistent file writes in the chat server use an atomic write pattern:
  1. The new data is serialized to a temporary file alongside the target (e.g., chats.json.tmp).
  2. The file is explicitly flushed to disk with f.flush().
  3. The temporary file is renamed over the target using os.replace(), which is an atomic operation on all supported operating systems.
This means that if the USB drive is ejected, the machine loses power, or the process is killed at any point during a write, one of two outcomes is guaranteed: either the old file remains fully intact, or the new file is fully written. A partial or corrupted write is not possible.
The chat server port is defined by the CHAT_SERVER_PORT constant near the top of Shared/chat_server.py:
CHAT_SERVER_PORT = 3333
To change it, open the file in any text editor and modify that value. The change takes effect the next time the chat server starts.If you change the port, remember to:
  • Update any firewall rules that reference port 3333.
  • Update any bookmarks or shortcuts pointing to http://localhost:3333.
  • Note that the URL printed in the terminal at startup will reflect the new port automatically.

Build docs developers (and LLMs) love