
What you’ll learn
Intelligent audio assistants on the edge are possible, and this repository is just one example of how to build one.

Quick start
Understanding the architecture
This example is a 100% local audio-to-text transcription CLI that runs on your machine thanks to llama.cpp. Neither input audio nor output text is sent to any server; everything runs locally. The Python code automatically downloads the llama.cpp builds required for your platform, so you don’t need to manage them yourself. Audio support in llama.cpp is still experimental and not yet fully integrated into the main branch of the llama.cpp project. Because of this, the Liquid AI team has released specialized llama.cpp builds that support the LFM2-Audio-1.5B model.

Supported platforms

The following platforms are currently supported:
- android-arm64
- macos-arm64
- ubuntu-arm64
- ubuntu-x64
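The platform detection can be pictured with a minimal sketch. The mapping below uses the build names listed above, but the `BUILDS` table, the `pick_build` helper, and the detection logic are illustrative assumptions, not the repository's actual download code (note that Android detection is more involved, since `platform.system()` reports "Linux" there, so it is omitted from this sketch):

```python
import platform

# Hypothetical mapping from (system, machine) to the build names listed above.
# The repository's real download logic may differ; this is only a sketch.
BUILDS = {
    ("Darwin", "arm64"): "macos-arm64",
    ("Linux", "aarch64"): "ubuntu-arm64",
    ("Linux", "x86_64"): "ubuntu-x64",
}

def pick_build() -> str:
    """Return the llama.cpp build name matching the current platform."""
    key = (platform.system(), platform.machine())
    try:
        return BUILDS[key]
    except KeyError:
        raise RuntimeError(f"Unsupported platform: {key}")
```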
llama.cpp support for audio models
llama.cpp is a fast, lightweight open-source inference engine for language models. It is written in C++ and can be used to run LLMs on your local machine. Our Python CLI uses llama.cpp under the hood to deliver fast transcriptions, instead of relying on PyTorch or the higher-level transformers library.
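Driving llama.cpp from Python typically amounts to spawning the bundled binary as a subprocess and capturing its output. The sketch below shows that pattern; the binary name and the extra flags are placeholders (the exact executable and options of the specialized audio builds are not documented here), and only the common `-m` model flag is assumed:

```python
import subprocess

def run_llama_cli(binary: str, model_path: str, extra_args: list[str]) -> str:
    """Invoke a llama.cpp executable and capture its stdout.

    `binary` and `extra_args` are placeholders: the specialized audio
    builds may use a different executable name and flags.
    """
    cmd = [binary, "-m", model_path, *extra_args]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```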
The examples.sh script contains three examples of how to run inference with LFM2-Audio-1.5B for three common use cases:
- Audio to text transcription (ASR)
- Text to speech (TTS)
- Text to speech with voice instructions
Further improvements
The decoded text is not perfect: overlapping chunks produce partial sentences that are grammatically incorrect. To improve the transcription, we can use a text cleaning model in a local two-step workflow for real-time speech recognition:

- LFM2-Audio-1.5B for audio to text extraction
- LFM2-350M for text cleaning
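The two-step workflow above can be sketched in a few lines. Here `asr_model` and `cleaner_model` are hypothetical callables standing in for LFM2-Audio-1.5B and LFM2-350M; the chunking parameters are illustrative, not taken from the repository:

```python
def overlapping_chunks(samples, chunk_size, overlap):
    # Slide a window of `chunk_size` samples forward by (chunk_size - overlap)
    # each step, so consecutive chunks share `overlap` samples.
    step = chunk_size - overlap
    return [samples[i:i + chunk_size]
            for i in range(0, max(len(samples) - overlap, 1), step)]

def transcribe_and_clean(chunks, asr_model, cleaner_model):
    # Step 1: transcribe each chunk (LFM2-Audio-1.5B in the real workflow).
    raw = " ".join(asr_model(chunk) for chunk in chunks)
    # Step 2: hand the stitched, possibly ungrammatical text to a
    # cleaning model (LFM2-350M in the real workflow).
    return cleaner_model(raw)
```

Because the chunks overlap, step 1 produces duplicated fragments at chunk boundaries; the cleaning model in step 2 is what turns that raw stream into readable text.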