Common questions about how USB-Uncensored-LLM works, what hardware it needs, and how to customize it. If you’re running into an error rather than a general question, see the Troubleshooting page instead.Documentation Index
Fetch the complete documentation index at: https://mintlify.com/techjarves/USB-Uncensored-LLM/llms.txt
Use this file to discover all available pages before exploring further.
Setup & Portability
Do I need to install anything on the host computer?
Do I need to install anything on the host computer?
Shared/bin/ and runs directly without installation. The only external requirement is Python 3, which is needed to run the chat server.On Windows, Python is handled automatically — start-fast-chat.bat checks for a portable Python at Shared/python/python.exe first, then falls back to system Python, and downloads an embedded Python (~11 MB) to the USB drive if neither is found.On Linux and macOS, Python 3 is included with the OS on virtually all modern distributions.There are no registry edits, no PATH changes, and no administrator rights required for normal operation.Can I use this without a USB drive — just on my computer?
Can I use this without a USB drive — just on my computer?
C:\AI or D:\Models) and run the installer from that folder. The system works identically whether the root folder is on a USB drive or an internal disk.Running from an internal SSD is significantly faster than USB — model loading is near-instant compared to the multi-second loads typical on USB 3.0 flash drives. The “USB” in the project name refers to its design goal of portability, not a requirement.Do I need to run the installer on every computer?
Do I need to run the installer on every computer?
- Windows downloads
ollama-windows.exetoShared/bin/ - macOS downloads
ollama-darwintoShared/bin/ - Linux downloads
ollama-linuxand thellama-serverlibrary toShared/bin/andShared/lib/
.gguf weight files in Shared/models/ and the Ollama model registry in Shared/models/ollama_data/ work across all platforms. You do not need to re-download any models when switching operating systems or computers.On subsequent startups on a computer you’ve already set up, just run the start script directly.Can I use the same USB drive on both my Windows PC and my Mac?
Can I use the same USB drive on both my Windows PC and my Mac?
Windows/install.bat on the Windows machine (downloads ollama-windows.exe) and Mac/install.command on the Mac (downloads ollama-darwin). Both engine binaries are stored separately in Shared/bin/ — they do not overwrite each other.Both platforms read model weights from the same Shared/models/ directory and the same Ollama registry in Shared/models/ollama_data/, so you only download each model once regardless of how many operating systems you install the engine for.The USB drive must be formatted as exFAT (not FAT32 or NTFS) to be read-write on both Windows and macOS natively, and to support individual files larger than 4 GB (required for most model weights).Privacy & Security
Does anything phone home?
Does anything phone home?
- The Ollama engine runs entirely locally with no telemetry or usage reporting.
- The Python chat server (
Shared/chat_server.py) has no outbound network connections — it only communicates with the local Ollama engine on127.0.0.1:11434and serves responses back to the browser. - The FastChatUI is a self-contained HTML file served from
Shared/FastChatUI.html— it makes no requests to external CDNs, analytics endpoints, or APIs.
Where is my chat history stored?
Where is my chat history stored?
Shared/chat_data/chats.json on the drive. The file is written using atomic operations (data is first written to chats.json.tmp, then renamed over the target using os.replace()) to prevent corruption if the drive is ejected unexpectedly.Nothing is sent to any external server. There is no account system, no cloud sync, and no remote backup. Your chat history exists solely on the physical drive and is lost if the drive is lost or the file is deleted.Is the chat server accessible from the internet?
Is the chat server accessible from the internet?
0.0.0.0:3333, which makes it reachable on your local network (LAN) but not from the internet. Your router does not forward external traffic to LAN addresses unless you explicitly configure port forwarding.The 0.0.0.0 bind address is intentional — it allows you to access the chat UI from your phone or tablet on the same Wi-Fi network using the IP address printed in the terminal at startup (for example, http://192.168.1.15:3333). If you only want local access, your OS firewall can restrict port 3333 to 127.0.0.1 only.Exposing the server to the internet via port forwarding is not recommended — the chat server has no authentication system.Models & Performance
Why are the models "uncensored"?
Why are the models "uncensored"?
- Abliteration — a post-training technique that mathematically identifies and removes the “refusal direction” vectors from a fine-tuned model’s weight matrices. The result is a model that retains all of its knowledge and reasoning ability but no longer has a learned tendency to refuse requests. Gemma 2 2B Abliterated uses this approach.
- Heretic fine-tuning — instruction-tuning the base model on a dataset that explicitly rewards full compliance with all user requests regardless of content. Gemma 4 E4B Ultra Uncensored Heretic uses this approach.
Can I add my own models?
Can I add my own models?
install.bat, enter C at the model selection menu when prompted. Paste any direct HuggingFace GGUF URL (ending in .gguf), give the model a short local name, and optionally provide a custom system prompt. The installer downloads the file to Shared/models/, creates a Modelfile, and imports it into the Ollama registry automatically.Manually: Download a .gguf file and place it in Shared/models/. Create a Modelfile at Shared/models/Modelfile-<your-local-name>:What quantization format are the models?
What quantization format are the models?
- File size is approximately 55–60% smaller than the original BF16 weights.
- Quality loss compared to the full-precision model is minimal for conversational tasks — typically imperceptible in practice.
- Speed is better than higher-bit quantizations because less data is read from memory per token generated.
Q5_K_M or Q8_0 variants of any model from HuggingFace. These are drop-in replacements — just update the Modelfile FROM line to point to the new file.How fast will responses be?
How fast will responses be?
| Hardware | Approximate Speed |
|---|---|
| Modern CPU only (no GPU) | ~10–30 tokens/second |
| NVIDIA RTX 3080 (CUDA) | ~50–100 tokens/second |
| Apple M2 (Metal) | ~30–60 tokens/second |
| Android ARM64, 8 GB RAM | ~3–10 tokens/second |
Can multiple people use the chat server at the same time?
Can multiple people use the chat server at the same time?
ThreadedHTTPServer that spawns a new daemon thread per HTTP request, so multiple clients on the LAN can connect simultaneously and receive responses without blocking each other’s connections.However, the Ollama engine processes one inference request at a time. If two users submit chat messages simultaneously, the second request will queue behind the first. Response streaming begins for the second user as soon as the first inference finishes. For light concurrent use (2–3 users), this queuing is generally unnoticeable.Technical
What is the difference between Ollama mode and llama.cpp mode?
What is the difference between Ollama mode and llama.cpp mode?
Shared/bin/ollama-* binary provides a model registry (ollama create, ollama list) and exposes an Ollama-native HTTP API on port 11434. The chat server proxies /ollama/* requests directly to this API.llama.cpp mode (Android): The Android installer compiles llama-server from the llama.cpp project source directly on the device’s ARM64 processor. This binary exposes an OpenAI-compatible API on port 8080. Because the chat server normally speaks the Ollama API, passing --llama-cpp to chat_server.py enables a translation layer that converts Ollama-style /api/chat payloads to OpenAI /v1/chat/completions requests before forwarding them to llama-server, and translates the SSE streaming response back into Ollama JSONL format for the UI.How does the chat server handle data corruption on the USB drive?
How does the chat server handle data corruption on the USB drive?
- The new data is serialized to a temporary file alongside the target (e.g.,
chats.json.tmp). - The file is explicitly flushed to disk with
f.flush(). - The temporary file is renamed over the target using
os.replace(), which is an atomic operation on all supported operating systems.
Can I change the server port from 3333?
Can I change the server port from 3333?
CHAT_SERVER_PORT constant near the top of Shared/chat_server.py:- Update any firewall rules that reference port
3333. - Update any bookmarks or shortcuts pointing to
http://localhost:3333. - Note that the URL printed in the terminal at startup will reflect the new port automatically.