Audio Capture Pipeline: WASAPI, PCM, and AAC Encoding

Flashback’s audio pipeline is implemented entirely in audio.rs, deliberately isolated from capture.rs so that the two modules share no types and can evolve independently. Each audio source (system loopback and an optional microphone) runs on its own dedicated OS thread and delivers packets to a caller-provided sink: raw PCM16 to the IMFSinkWriter for manual recording, or AAC encoded in-thread via a raw IMFTransform to the ReplayBuffer ring for Instant Replay.

Like the capture pipeline, the audio pipeline is Windows-only. All logic lives inside audio.rs and is compiled only when target_os = "windows". If a device fails to open, the thread exits cleanly without affecting the rest of the capture session.

Track Model

Every audio source is represented by a TrackKind and an Encoding. These two enums drive the entire per-track setup:

#[derive(Clone)]
pub enum TrackKind {
    SystemLoopback,
    Microphone(String),  // WinRT DeviceInformation ID
}

pub enum Encoding {
    Pcm,          // raw PCM16 delivered to the sink (manual recording)
    Aac(u32),     // bitrate in bits/sec; encoded in-thread (Instant Replay)
}

TrackKind::SystemLoopback opens the default render endpoint with AUDCLNT_STREAMFLAGS_LOOPBACK, which taps whatever the system is currently playing without routing audio through a virtual device. TrackKind::Microphone(id) takes a WinRT DeviceInformation ID (the format produced by DeviceInformation::FindAllAsyncDeviceClass(DeviceClass::AudioCapture)). The WinRT ID wraps the MMDevice endpoint ID: a prefix ending in MMDEVAPI# precedes it and a #-separated interface suffix follows it. audio.rs strips both to recover the bare endpoint ID before calling IMMDeviceEnumerator::GetDevice:

fn mmdevice_id_from_winrt(id: &str) -> String {
    const MARK: &str = "MMDEVAPI#";
    if let Some(pos) = id.find(MARK) {
        let rest = &id[pos + MARK.len()..];
        return rest.split('#').next().unwrap_or(rest).to_string();
    }
    id.to_string()
}
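
As an illustration, the helper can be exercised against an ID of the shape WinRT typically reports for audio endpoints. The function is repeated so the snippet stands alone; SAMPLE_WINRT_ID is a hypothetical value whose GUIDs are placeholders, not a real device.

```rust
/// Copy of the helper above, repeated so the snippet compiles on its own.
fn mmdevice_id_from_winrt(id: &str) -> String {
    const MARK: &str = "MMDEVAPI#";
    if let Some(pos) = id.find(MARK) {
        // Keep everything after the marker, up to the next '#'.
        let rest = &id[pos + MARK.len()..];
        return rest.split('#').next().unwrap_or(rest).to_string();
    }
    // No marker: assume the caller already passed a bare endpoint ID.
    id.to_string()
}

/// Hypothetical sample ID; every GUID here is a placeholder.
const SAMPLE_WINRT_ID: &str = r"\\?\SWD#MMDEVAPI#{0.0.1.00000000}.{11111111-2222-3333-4444-555555555555}#{aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee}";
```

An ID that does not contain the marker is returned unchanged, so the helper is safe to call on IDs that are already in MMDevice form.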

Per-Track Thread Model

spawn_track creates one OS thread per audio source. The thread initializes its own COM apartment (MTA), registers with MMCSS, opens WASAPI, and loops until the TrackHandle is stopped. No COM objects are passed across threads.

pub fn spawn_track(
    kind: TrackKind,
    encoding: Encoding,
    sample_rate: u32,
    channels: u16,
    sink: Arc<dyn AudioSink>,
    pcm_tap: Option<Arc<dyn PcmTap>>,
) -> TrackHandle

TrackHandle holds an Arc<AtomicBool> stop flag and a JoinHandle. Calling stop() sets the flag and joins the thread; TrackHandle also implements Drop so threads are always joined on engine shutdown.

pub struct TrackHandle {
    stop: Arc<AtomicBool>,
    handle: Option<JoinHandle<()>>,
}
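
The stop-and-join behavior described above might look roughly like the sketch below. The struct fields come from the text; spawn_demo, is_running, the `&mut self` receiver on stop(), and the sleeping worker body are assumptions made for illustration, not Flashback's actual code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

pub struct TrackHandle {
    stop: Arc<AtomicBool>,
    handle: Option<JoinHandle<()>>,
}

impl TrackHandle {
    /// Hypothetical constructor: spawns a worker that polls the stop flag.
    pub fn spawn_demo() -> Self {
        let stop = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&stop);
        let handle = thread::spawn(move || {
            while !flag.load(Ordering::SeqCst) {
                // Stand-in for one wait-and-drain iteration of the capture loop.
                thread::sleep(Duration::from_millis(5));
            }
        });
        Self { stop, handle: Some(handle) }
    }

    /// Set the flag and join the thread. Calling it again is a no-op.
    pub fn stop(&mut self) {
        self.stop.store(true, Ordering::SeqCst);
        if let Some(h) = self.handle.take() {
            let _ = h.join();
        }
    }

    /// Hypothetical helper: true until the thread has been joined.
    pub fn is_running(&self) -> bool {
        self.handle.is_some()
    }
}

impl Drop for TrackHandle {
    fn drop(&mut self) {
        // Dropping the handle joins the thread if stop() was never called.
        self.stop();
    }
}
```

Keeping the JoinHandle in an Option is what lets both stop() and Drop take it exactly once, so an explicit stop followed by a drop does not attempt a second join.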

WASAPI Capture Loop

Inside each thread, run_track resolves the device, creates an IAudioClient in shared mode with AUDCLNT_STREAMFLAGS_EVENTCALLBACK, and starts capturing. For loopback, the flag also includes AUDCLNT_STREAMFLAGS_LOOPBACK. The event-driven loop calls WaitForSingleObject with a 100 ms timeout and then drains all available packets from IAudioCaptureClient. The timeout acts as a poll interval — for system loopback, the event is not always signaled (this is a known WASAPI behavior), so the loop does not gate packet draining on the event result:

while !stop.load(Ordering::SeqCst) {
    unsafe { WaitForSingleObject(*event, 100); }
    loop {
        let packet = unsafe { capture.GetNextPacketSize() }.unwrap_or(0);
        if packet == 0 { break; }
        // GetBuffer → process → ReleaseBuffer
    }
}

A 2-second buffer is requested from the IAudioClient (20_000_000 in 100-ns units) so that occasional thread scheduling delays are very unlikely to overrun the capture buffer and drop packets.
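
To make the units concrete, the arithmetic behind the buffer size can be sketched as follows. The helper names are hypothetical, and the buffer WASAPI actually allocates may differ from the requested duration.

```rust
/// Number of 100-ns units (REFERENCE_TIME ticks) in one second.
const HNS_PER_SEC: i64 = 10_000_000;

/// Hypothetical helper: a duration in whole seconds expressed in 100-ns units.
fn buffer_duration_hns(seconds: i64) -> i64 {
    seconds * HNS_PER_SEC
}

/// Hypothetical helper: a duration in milliseconds expressed in 100-ns units.
fn millis_to_hns(ms: i64) -> i64 {
    ms * (HNS_PER_SEC / 1_000)
}

/// Hypothetical helper: how many audio frames fit in a duration at a given rate.
fn frames_in(duration_hns: i64, sample_rate: u32) -> i64 {
    duration_hns * sample_rate as i64 / HNS_PER_SEC
}
```

At 48 000 Hz the requested buffer holds 96 000 frames, while one 100 ms poll interval covers 4 800 frames, so the loop would have to miss roughly twenty consecutive wake-ups before data could be lost.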

Float-to-PCM16 Conversion

Modern Windows audio devices nearly always report their mix format as IEEE 754 float (32-bit little-endian), either with WAVE_FORMAT_IEEE_FLOAT or via WAVE_FORMAT_EXTENSIBLE with KSDATAFORMAT_SUBTYPE_IEEE_FLOAT. Flashback detects this and converts each sample to signed PCM16 before any further processing:

fn float_to_pcm16(raw: &[u8]) -> Vec<u8> {
    let samples = raw.len() / 4;
    let mut out = Vec::with_capacity(samples * 2);
    for i in 0..samples {
        let f = f32::from_le_bytes([raw[i*4], raw[i*4+1], raw[i*4+2], raw[i*4+3]]);
        let v = (f.clamp(-1.0, 1.0) * 32767.0).round() as i16;
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

If the format is neither IEEE float nor PCM16, the packet is discarded (silence is substituted) and a warning is printed. In practice this path is never reached on consumer hardware. Silent packets (flagged AUDCLNT_BUFFERFLAGS_SILENT by WASAPI) are replaced with a zero-filled buffer of the same size, which avoids a branch in the downmix and encoding paths.
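
The conversion can be checked on a few boundary values. The sketch below is an equivalent formulation of float_to_pcm16 that uses chunks_exact instead of manual indexing; pack_f32 and unpack_i16 are hypothetical helpers added only for the demonstration.

```rust
/// Same conversion as above, written with chunks_exact to avoid index math.
fn float_to_pcm16(raw: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(raw.len() / 2);
    for chunk in raw.chunks_exact(4) {
        let f = f32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]);
        // Clamp first so out-of-range floats saturate instead of overflowing.
        let v = (f.clamp(-1.0, 1.0) * 32767.0).round() as i16;
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Hypothetical helper: pack f32 samples as little-endian bytes.
fn pack_f32(samples: &[f32]) -> Vec<u8> {
    samples.iter().flat_map(|s| s.to_le_bytes()).collect()
}

/// Hypothetical helper: unpack little-endian PCM16 bytes into samples.
fn unpack_i16(bytes: &[u8]) -> Vec<i16> {
    bytes
        .chunks_exact(2)
        .map(|c| i16::from_le_bytes([c[0], c[1]]))
        .collect()
}
```

Because the scale factor is 32767, full-scale negative input maps to -32767 rather than -32768, which keeps the conversion symmetric around zero.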

Downmix for Multichannel Audio

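Flashback feeds the Media Foundation AAC encoder only mono or stereo input (the encoder accepts 1 or 2 channels on every supported Windows version; 6-channel input is available only on Windows 10 and later). Game audio is frequently output as 5.1 or 7.1 surround, so Flashback applies a standard downmix matrix before encoding: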

Source channels	L output	R output
1 (mono)	mono	mono
2 (stereo)	L	R
≥ 3 (surround)	L + 0.707·C + 0.707·Ls…	R + 0.707·C + 0.707·Rs…

The LFE channel (index 3 in standard channel maps) is dropped — it carries sub-bass content below 120 Hz that is inaudible through most playback hardware and would add distortion if mixed into a full-range stereo signal.

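fn downmix(pcm: &[u8], src: usize, dst: usize) -> Vec<u8> {
    // src == dst: no-op, return as-is
    if src == dst || src == 0 { return pcm.to_vec(); }
    let frames = pcm.len() / (src * 2); // 2 bytes per PCM16 sample
    // rd: read interleaved sample i as f32; wr: round, clamp to i16, append
    let rd = |i: usize| i16::from_le_bytes([pcm[i * 2], pcm[i * 2 + 1]]) as f32;
    let wr = |out: &mut Vec<u8>, v: f32| {
        out.extend_from_slice(&(v.round().clamp(-32768.0, 32767.0) as i16).to_le_bytes())
    };
    let mut out = Vec::with_capacity(frames * dst * 2);
    for f in 0..frames {
        let base = f * src;
        if dst == 1 {
            // dst == 1: average all channels
            let sum: f32 = (0..src).map(|c| rd(base + c)).sum();
            wr(&mut out, sum / src as f32);
            continue;
        }
        // dst == 2: front L/R full; center + surrounds at 0.707; LFE dropped
        let mut l = rd(base);                                 // front left
        let mut r = if src >= 2 { rd(base + 1) } else { l };  // front right (mono is duplicated)
        if src >= 3 {
            let c = 0.707 * rd(base + 2);  // center
            l += c; r += c;
        }
        // index 3 is LFE (skipped); surrounds at indices 4+ alternate L/R
        for i in 4..src {
            let s = 0.707 * rd(base + i);
            if (i - 4) % 2 == 0 { l += s; } else { r += s; }
        }
        wr(&mut out, l);
        wr(&mut out, r);
    }
    out
}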

Results are clamped to the i16 range after accumulation to prevent wrap-around distortion.
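
The clamping step matters because several full-scale channels summed together easily exceed the i16 range. The following single-frame sketch applies the coefficients from the table above and then clamps; downmix_frame_to_stereo and clamp_i16 are hypothetical names, and the function assumes at least two interleaved channels.

```rust
/// Round and saturate an accumulated sample to the i16 range.
fn clamp_i16(v: f32) -> i16 {
    v.round().clamp(i16::MIN as f32, i16::MAX as f32) as i16
}

/// Downmix one interleaved frame (>= 2 channels) to a stereo pair.
/// Front L/R at unity, center and surrounds at 0.707, LFE (index 3) dropped.
fn downmix_frame_to_stereo(frame: &[i16]) -> (i16, i16) {
    let mut l = frame[0] as f32;
    let mut r = frame[1] as f32;
    if frame.len() >= 3 {
        let c = 0.707 * frame[2] as f32;
        l += c;
        r += c;
    }
    // Channels from index 4 onward alternate between left and right.
    for (i, &s) in frame.iter().enumerate().skip(4) {
        let s = 0.707 * s as f32;
        if (i - 4) % 2 == 0 {
            l += s;
        } else {
            r += s;
        }
    }
    (clamp_i16(l), clamp_i16(r))
}
```

Rust's float-to-integer `as` cast already saturates, but the explicit clamp keeps the intent visible and matches the behavior the text describes.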

AAC Encoding with IMFTransform

For manual recording, the IMFSinkWriter resolves its own internal AAC MFT from the declared PCM input / AAC output types — the same automatic resolution it performs for the H.264 video encoder. The capture thread simply calls WriteSample with raw PCM data. For Instant Replay, Flashback needs the encoded AAC packets directly so they can be pushed into the ring buffer. For this path, AacEncoder manages the IMFTransform lifecycle manually:

struct AacEncoder {
    mft: IMFTransform,
    provides_output: bool,  // true if the MFT allocates its own output buffers
    out_size: u32,
    user_data_sent: bool,   // true once AudioSpecificConfig has been forwarded
}

Encoder Type Negotiation

The AAC encoder MFT has a strict requirement: the output type must be set before the input type, and the output type must be chosen from the types the encoder itself enumerates via GetOutputAvailableType. Constructing a type manually (even with identical attribute values) results in MF_E_INVALIDMEDIATYPE. Flashback iterates the available output types and picks the first one that matches the desired sample rate, channel count, and payload type (0 = raw AAC, no ADTS/LOAS framing):

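// `bitrate` is the bits/sec value carried by Encoding::Aac (variable name assumed here).
let want_bytes = bitrate / 8;
let mut chosen: Option<IMFMediaType> = None;
let mut idx = 0u32;
loop {
    // Enumeration ends with an error (MF_E_NO_MORE_TYPES) once idx runs past the list.
    let t = match unsafe { mft.GetOutputAvailableType(0, idx) } {
        Ok(t) => t,
        Err(_) => break,
    };
    idx += 1;
    let ch = unsafe { t.GetUINT32(&MF_MT_AUDIO_NUM_CHANNELS).unwrap_or(0) };
    let sr = unsafe { t.GetUINT32(&MF_MT_AUDIO_SAMPLES_PER_SECOND).unwrap_or(0) };
    let pt = unsafe { t.GetUINT32(&MF_MT_AAC_PAYLOAD_TYPE).unwrap_or(0) };
    // Skip types with the wrong channel count, sample rate, or framing.
    if ch != channels as u32 || sr != sample_rate || pt != 0 { continue; }
    let b = unsafe { t.GetUINT32(&MF_MT_AUDIO_AVG_BYTES_PER_SECOND).unwrap_or(0) };
    if b == want_bytes { chosen = Some(t); break; } // exact bitrate match
    chosen.get_or_insert(t); // keep the first compatible type as fallback
}
// Assumes the enclosing function returns windows::core::Result<_>.
let out_type = chosen.ok_or(MF_E_INVALIDMEDIATYPE)?;
unsafe { mft.SetOutputType(0, &out_type, 0)?; }
// Then set PCM16 input type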
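
The selection rule (prefer an exact bitrate match, otherwise fall back to the first compatible type) can be isolated from COM for testing. The sketch below models each enumerated media type as a plain struct; OutType and pick_output_type are hypothetical names, not part of audio.rs.

```rust
/// Plain-data stand-in for the attributes read from each enumerated media type.
#[derive(Clone, Copy, Debug, PartialEq)]
struct OutType {
    channels: u32,
    sample_rate: u32,
    payload_type: u32,
    bytes_per_sec: u32,
}

/// Return the exact-bitrate match if one exists, else the first compatible type.
fn pick_output_type(
    available: &[OutType],
    channels: u32,
    sample_rate: u32,
    want_bytes: u32,
) -> Option<OutType> {
    let mut fallback = None;
    for t in available {
        // Only raw AAC (payload type 0) with matching layout is compatible.
        if t.channels != channels || t.sample_rate != sample_rate || t.payload_type != 0 {
            continue;
        }
        if t.bytes_per_sec == want_bytes {
            return Some(*t);
        }
        fallback.get_or_insert(*t);
    }
    fallback
}

/// Hypothetical convenience constructor used by the example.
fn out_type(channels: u32, sample_rate: u32, payload_type: u32, bytes_per_sec: u32) -> OutType {
    OutType { channels, sample_rate, payload_type, bytes_per_sec }
}
```

Returning None when nothing is compatible lets the caller decide whether to fail the track or retry with a different target format.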

Bitrate Selection

The Windows AAC encoder accepts only four bitrate values (selected via MF_MT_AUDIO_AVG_BYTES_PER_SECOND):

Bytes/sec	Kbps
12 000	96
16 000	128
20 000	160
24 000	192

Flashback uses 96 kbps for mono and 128 kbps for stereo:

fn aac_bitrate(channels: u16) -> u32 {
    if channels <= 1 { 96_000 } else { 128_000 }
}
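
Since the encoder is configured in bytes per second while Encoding::Aac carries bits per second, a small conversion with validation is useful. The sketch below is one way to write it; AAC_BYTES_PER_SEC and aac_bytes_per_sec are hypothetical names, and the accepted values are taken from the table above.

```rust
/// The four MF_MT_AUDIO_AVG_BYTES_PER_SECOND values listed in the table above.
const AAC_BYTES_PER_SEC: [u32; 4] = [12_000, 16_000, 20_000, 24_000];

/// Convert bits/sec to bytes/sec, returning None for unsupported bitrates.
fn aac_bytes_per_sec(bits_per_sec: u32) -> Option<u32> {
    let bytes = bits_per_sec / 8;
    (bits_per_sec % 8 == 0 && AAC_BYTES_PER_SEC.contains(&bytes)).then_some(bytes)
}

/// Copy of the selection helper above so the snippet stands alone.
fn aac_bitrate(channels: u16) -> u32 {
    if channels <= 1 { 96_000 } else { 128_000 }
}
```

Both defaults returned by aac_bitrate fall inside the accepted set, so the exact-match branch of the type negotiation should normally be taken.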

AudioSpecificConfig and Payload Type

On the first successful ProcessOutput drain, Flashback reads the encoder’s current output media type and extracts:

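MF_MT_USER_DATA: the blob that carries the AudioSpecificConfig (typically 2–5 bytes that describe the AAC codec parameters). Microsoft's AAC encoder documentation describes this attribute as the HEAACWAVEINFO fields that follow WAVEFORMATEX, followed by the AudioSpecificConfig bytes, so a consumer that needs the bare config should check the blob length first. The MP4 muxer needs this data to write the esds box; without it, IMFSinkWriter::Finalize fails with MF_E_SINK_HEADERS_NOT_FOUND.
MF_MT_AAC_PAYLOAD_TYPE: confirmed to be 0 (raw AAC). The muxer must declare the same value.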

These are forwarded to the sink via AudioSink::set_user_data and AudioSink::set_payload_type:

if !enc.user_data_sent {
    if let Ok(mt) = unsafe { enc.mft.GetOutputCurrentType(0) } {
        if let Some(ud) = blob(&mt, &MF_MT_USER_DATA) {
            sink.set_user_data(ud);
            let payload_type = unsafe { mt.GetUINT32(&MF_MT_AAC_PAYLOAD_TYPE) }.unwrap_or(0);
            sink.set_payload_type(payload_type);
            enc.user_data_sent = true;
        }
    }
}
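
For debugging, it can help to decode the first two bytes of an AudioSpecificConfig and confirm that they agree with the negotiated format. The sketch below follows the MPEG-4 bit layout (5 bits object type, 4 bits sampling-frequency index, 4 bits channel configuration). It assumes the slice starts at the AudioSpecificConfig itself and does not handle the escape values (object type 31, frequency index 15). AscHeader and parse_asc are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
struct AscHeader {
    object_type: u8,    // 2 = AAC-LC
    freq_index: u8,     // 3 = 48 000 Hz, 4 = 44 100 Hz
    channel_config: u8, // 1 = mono, 2 = stereo
}

impl AscHeader {
    /// Sample rate for the two indices relevant to this pipeline.
    fn sample_rate(&self) -> Option<u32> {
        match self.freq_index {
            3 => Some(48_000),
            4 => Some(44_100),
            _ => None,
        }
    }
}

/// Decode the leading 13 bits of an AudioSpecificConfig.
fn parse_asc(asc: &[u8]) -> Option<AscHeader> {
    if asc.len() < 2 {
        return None;
    }
    let bits = u16::from_be_bytes([asc[0], asc[1]]);
    Some(AscHeader {
        object_type: (bits >> 11) as u8,            // top 5 bits
        freq_index: ((bits >> 7) & 0x0F) as u8,     // next 4 bits
        channel_config: ((bits >> 3) & 0x0F) as u8, // next 4 bits
    })
}
```

For example, the two-byte config 0x11 0x90 decodes to AAC-LC, 48 000 Hz, stereo.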

AudioSink and PcmTap Traits

The pipeline is decoupled from its destination via two traits:

pub trait AudioSink: Send + Sync + 'static {
    fn push(&self, data: Vec<u8>, time: i64, dur: i64);
    fn set_user_data(&self, _data: Vec<u8>) {}   // AAC-only, default no-op
    fn set_payload_type(&self, _v: u32) {}        // AAC-only, default no-op
}

pub trait PcmTap: Send + Sync + 'static {
    fn on_pcm(&self, pcm: &[u8], time: i64, dur: i64);
}

AudioSink is the primary destination. There are two concrete implementations:

EncoderAudioSink (manual recording): wraps Arc<Mutex<Encoder>> and calls encoder.push_audio(stream, data, time, dur), forwarding raw PCM to the SinkWriter under the same mutex that guards video frame writes.
ReplayAudioSink (Instant Replay): pushes already-encoded AAC packets directly into Arc<Mutex<ReplayBuffer>>, rebasing timestamps against video_base so audio and video share the same time origin.

PcmTap provides a side-channel for the post-downmix, pre-encode PCM data. It is used to feed a mixer that produces a blended system+microphone waveform track. Time values are identical to those on the encoded packets (QPC-based), so the mixer can align sources on a shared clock.
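
Because the sink is a trait object, a test double is easy to write. The sketch below repeats the AudioSink trait from above and adds a hypothetical CollectingSink that records what it receives; it is meant for unit tests and is not one of Flashback's two real implementations.

```rust
use std::sync::Mutex;

pub trait AudioSink: Send + Sync + 'static {
    fn push(&self, data: Vec<u8>, time: i64, dur: i64);
    fn set_user_data(&self, _data: Vec<u8>) {}
    fn set_payload_type(&self, _v: u32) {}
}

/// Hypothetical sink that stores packet metadata and the codec config blob.
#[derive(Default)]
pub struct CollectingSink {
    packets: Mutex<Vec<(usize, i64, i64)>>, // (byte length, time, duration)
    user_data: Mutex<Option<Vec<u8>>>,
}

impl AudioSink for CollectingSink {
    fn push(&self, data: Vec<u8>, time: i64, dur: i64) {
        self.packets.lock().unwrap().push((data.len(), time, dur));
    }

    fn set_user_data(&self, data: Vec<u8>) {
        *self.user_data.lock().unwrap() = Some(data);
    }
    // set_payload_type keeps the default no-op.
}

impl CollectingSink {
    pub fn packet_count(&self) -> usize {
        self.packets.lock().unwrap().len()
    }

    pub fn user_data(&self) -> Option<Vec<u8>> {
        self.user_data.lock().unwrap().clone()
    }
}
```

The methods take `&self`, so interior mutability (here a Mutex) is what allows a sink shared through an Arc to accumulate state from the audio thread.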

probe_format: Device Format Discovery

Before opening a WASAPI stream, Flashback calls probe_format to discover the device’s native sample rate and channel count without starting a capture session. This lets the caller declare correct stream metadata in the SinkWriter or ring buffer before any audio data arrives:

pub fn probe_format(kind: &TrackKind) -> Option<(u32, u16)> {
    let device = resolve_device(kind).ok()?;
    let client: IAudioClient = unsafe { device.Activate(CLSCTX_ALL, None).ok()? };
    let pwfx = unsafe { client.GetMixFormat().ok()? };
    let (rate, channels) = unsafe { ((*pwfx).nSamplesPerSec, (*pwfx).nChannels) };
    unsafe { CoTaskMemFree(Some(pwfx as *const _)) };
    Some((rate, channels))
}

The result feeds aac_target_format, which validates that the sample rate is one of the two values the AAC encoder accepts (44 100 or 48 000 Hz) and returns the stereo/mono target channel count. If the device uses an unsupported rate, the audio track is silently omitted rather than breaking the capture:

pub fn aac_target_format(rate: u32, channels: u16) -> Option<(u32, u16)> {
    if rate != 44100 && rate != 48000 { return None; }
    Some((rate, target_channels(channels)))
}
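
The text does not show target_channels. A plausible reconstruction, assuming mono stays mono and every other layout is downmixed to stereo, is sketched below together with a copy of aac_target_format so the snippet stands alone.

```rust
/// Hypothetical reconstruction: mono stays mono, everything else becomes stereo.
fn target_channels(channels: u16) -> u16 {
    if channels <= 1 { 1 } else { 2 }
}

/// Copy of the function above, repeated so the snippet compiles on its own.
pub fn aac_target_format(rate: u32, channels: u16) -> Option<(u32, u16)> {
    // Only the two sample rates named in the text are accepted.
    if rate != 44_100 && rate != 48_000 {
        return None;
    }
    Some((rate, target_channels(channels)))
}
```

A 5.1 device at 48 000 Hz therefore maps to a 48 000 Hz stereo track, while a 96 000 Hz device yields None and the track is omitted.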

MMCSS Priority

Audio threads register with MMCSS under the "Pro Audio" task, a higher priority class than the one used by the video capture threads ("Capture"). This makes it much less likely that a CPU-intensive game starves the audio threads. If an audio thread were starved while holding the ReplayBuffer mutex, the video encoding pump (which also needs that mutex) would block behind it, an effect similar to priority inversion. Raising the audio threads' priority through MMCSS reduces the chance of this scenario.

let _mmcss = mmcss_register("Pro Audio");
// MmcssGuard: RAII, calls AvRevertMmThreadCharacteristics on drop

Registration is best-effort: if AvSetMmThreadCharacteristicsW fails (uncommon but possible in sandboxed environments), the thread continues at normal priority and a warning is printed.

Temporal Alignment with Video

Both the EncoderAudioSink and the ReplayAudioSink discard audio packets that arrive before the first video frame establishes a common time origin (video_base). This avoids the alignment problem that arises because WASAPI starts delivering audio immediately, while the first WGC frame may arrive tens of milliseconds later.

For manual recording, video_base is the raw WGC SystemRelativeTime of the first frame. Audio packet timestamps (qpc) are rebased: ts = (qpc - video_base).max(0).

For Instant Replay, video_base is stored in an Arc<AtomicI64> shared between the video pump and all audio sinks. The atomic is initialized to i64::MIN; audio sinks check for this sentinel and discard packets until the video pump sets a real value at the first encoded frame.
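
The sentinel check and the rebasing step can be sketched with the standard library alone. The function names and the choice of Acquire/Release ordering are assumptions; only the i64::MIN sentinel and the (qpc - video_base).max(0) formula come from the text.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Sentinel meaning "no video frame has been seen yet".
const UNSET: i64 = i64::MIN;

/// Hypothetical helper: a fresh, unset time origin.
fn new_video_base() -> AtomicI64 {
    AtomicI64::new(UNSET)
}

/// Hypothetical helper: called by the video pump at the first frame.
fn publish_video_base(base: &AtomicI64, first_frame_time: i64) {
    base.store(first_frame_time, Ordering::Release);
}

/// Rebase an audio timestamp, or return None if the packet must be discarded
/// because the video origin is not known yet.
fn rebase(base: &AtomicI64, qpc: i64) -> Option<i64> {
    let origin = base.load(Ordering::Acquire);
    if origin == UNSET {
        return None;
    }
    // Timestamps slightly before the origin are pinned to zero.
    Some(qpc.saturating_sub(origin).max(0))
}
```

Using saturating_sub keeps the subtraction well-defined even for extreme inputs, which a plain `-` would not guarantee in debug builds.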

All timestamps in Flashback’s audio and video pipelines are in 100-nanosecond units (the Media Foundation / WGC convention). A 48 000 Hz audio frame of 1024 samples has a duration of 1024 * 10_000_000 / 48_000 = 213_333 units (~21.3 ms).
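
The duration arithmetic above can be written as a small helper. The sketch also shows one reason to derive packet start times from a cumulative frame count instead of summing rounded durations: integer division truncates, so summed durations drift slowly. The function names are hypothetical.

```rust
/// Number of 100-ns units in one second.
const HNS_PER_SEC: i64 = 10_000_000;

/// Duration of `frames` audio frames at `sample_rate`, in 100-ns units (truncated).
fn frames_to_hns(frames: i64, sample_rate: u32) -> i64 {
    frames * HNS_PER_SEC / sample_rate as i64
}

/// Start time of packet `index` when every packet holds `frames_per_packet` frames.
/// Computed from the cumulative frame count so rounding error does not accumulate.
fn packet_start_hns(index: i64, frames_per_packet: i64, sample_rate: u32) -> i64 {
    frames_to_hns(index * frames_per_packet, sample_rate)
}
```

At 48 000 Hz, three 1024-frame packets span exactly 640 000 units, whereas adding the truncated per-packet duration three times gives 639 999.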

Related Pages

Architecture

Overall system design, module map, and the Tauri command communication model.

Capture Pipeline

WGC frame acquisition, D3D11 texture path, hardware encoder selection, and ring buffer mechanics.

Capture Settings

How to configure microphone device, system audio, and bitrate from the settings UI.

Troubleshooting

Common issues with audio capture, including silent loopback and microphone device errors.
