ik_llama.cpp builds and runs successfully on Android. The ARM NEON kernels used on arm64-v8a devices are the same well-optimised kernels used on Linux ARM64, so on-device inference is practical on modern Android phones.
## Method 1: Build in Termux
Termux gives you a Linux-like environment on Android without root. Install it from F-Droid or the GitHub releases page (the Play Store version is outdated and no longer receives updates).

### Build
Follow the standard Linux build instructions. The ARM NEON backend is detected and compiled automatically on arm64 devices.
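On a fresh Termux install, the whole build might look like this sketch (the repository URL and Termux package names are assumptions, not taken from this page):

```shell
# Install build dependencies from the Termux repos
pkg install -y git cmake clang

# Clone and build (repository URL assumed)
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)
```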
### Move your model to the home directory

Model files on the SD card or in the Downloads folder are accessed over FUSE, which is significantly slower than the internal storage Termux uses for `~/`. Move the model before running inference.
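A typical sequence, assuming the model was downloaded to the shared Downloads folder (the filename is a placeholder):

```shell
# One-time: grant Termux access to shared storage and create the
# ~/storage symlinks
termux-setup-storage

# Move the model off FUSE-mounted shared storage into Termux home
mv ~/storage/downloads/model-file.gguf ~/
```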
## Method 2: Cross-compile with the Android NDK

Cross-compiling on a desktop machine is faster than building in Termux and produces the same binaries. You then push them to the device with `adb`.
### Obtain the Android NDK
Download the Android NDK and note the path. The steps below use `$NDK` to refer to it.
### Configure and build

Configure and build on your desktop (Linux or macOS). The `-march=armv8.4a+dotprod` flag enables the dot-product instructions available on most ARMv8.4+ SoCs (Snapdragon 865 and newer); omit it for broader compatibility.
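One plausible configure-and-build invocation, as a sketch: the toolchain file path is the NDK's standard one, but the API level (`android-28`) and build directory name are assumptions to adjust for your device.

```shell
# $NDK is the Android NDK root; android-28 (Android 9) is an assumed
# minimum API level
cmake -B build-android \
  -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
  -DANDROID_ABI=arm64-v8a \
  -DANDROID_PLATFORM=android-28 \
  -DCMAKE_C_FLAGS="-march=armv8.4a+dotprod" \
  -DCMAKE_CXX_FLAGS="-march=armv8.4a+dotprod"
cmake --build build-android --config Release -j$(nproc)
```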
### Install Termux on the device

Install Termux from F-Droid or GitHub, then run `termux-setup-storage` in Termux to grant storage access. On Android 11 and later, run the command twice.
### Push binaries to the device

Copy the compiled binaries to the device SD card with `adb push`, then move them inside Termux.
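A sketch of the push-and-move sequence, assuming the binaries landed in `build-android/bin` (the directory names are assumptions):

```shell
# On the desktop: copy the binaries to shared storage
adb push build-android/bin /sdcard/ik_llama

# In Termux on the device: move them to home and make them executable
# (files on /sdcard cannot be executed in place)
mv ~/storage/shared/ik_llama ~/
chmod +x ~/ik_llama/*
```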
### Copy your model

Push the model to the SD card first, then move it to internal storage for best performance.
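For example (the model filename is a placeholder):

```shell
# On the desktop
adb push model-file.gguf /sdcard/

# In Termux: move it off FUSE-mounted shared storage into home
mv ~/storage/shared/model-file.gguf ~/
```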
## Performance tips
### Store models in ~/ (internal storage)
Android's SD card and shared storage (`/sdcard/`) are mounted over FUSE, which adds significant overhead for the large sequential reads that model inference requires. Always move `.gguf` files to `~/` (i.e. `/data/data/com.termux/files/home/`) before running inference.
### Use the ARM NEON backend
On `arm64-v8a` devices the NEON backend is compiled in by default; no extra flags are needed. The dot-product extension (`+dotprod`) further accelerates quantised matrix operations on ARMv8.4+ SoCs. Enable it at build time with `-DCMAKE_C_FLAGS="-march=armv8.4a+dotprod"`.
### Match thread count to physical cores
Set `-t` to the number of physical CPU cores, not the total thread count. On a phone with a mix of performance and efficiency cores, start with the number of performance cores and adjust from there. See the performance tips page for a systematic approach.
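A quick way to see what the kernel reports, with a hypothetical invocation left commented out (the `llama-cli` binary name and model path are assumptions):

```shell
# Total core count as seen by the kernel (performance + efficiency cores)
NCORES=$(nproc)
echo "total cores: $NCORES"

# Start with the performance-core count (e.g. 4 on a typical 4+4 SoC)
# and benchmark up and down from there:
# ./llama-cli -m ~/model-file.gguf -t 4 -p "Hello"
```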
### Use a quantised model
Phones have no dedicated VRAM; model weights share system RAM with the OS and other apps. Use a quantised model (`Q4_K_M` or `IQ4_XS` are good starting points) to keep memory use low and inference speed reasonable.