Flashback’s audio pipeline is implemented entirely in audio.rs, deliberately isolated from capture.rs so the two modules have no shared types and can evolve independently. Each audio source, system loopback or an optional microphone, runs on its own dedicated OS thread, encodes to AAC via a raw IMFTransform, and delivers packets to a sink that is provided by the caller (either the IMFSinkWriter for manual recording, or the ReplayBuffer ring for Instant Replay).
Like the capture pipeline, the audio pipeline is Windows-only. All logic lives inside audio.rs and is compiled only when target_os = "windows". If a device fails to open, the thread exits cleanly without affecting the rest of the capture session.
Track Model
Every audio source is represented by a TrackKind and an Encoding. These two enums drive the entire per-track setup.
TrackKind::SystemLoopback opens the default render endpoint with AUDCLNT_STREAMFLAGS_LOOPBACK. It taps into whatever the system is currently playing without needing to route audio through a virtual device.
TrackKind::Microphone(id) takes a WinRT DeviceInformation ID (the format produced by DeviceInformation::FindAllAsyncDeviceClass(DeviceClass::AudioCapture)). Because the WinRT ID includes an MMDEVAPI# prefix, audio.rs strips it back to the bare MMDevice endpoint ID before calling IMMDeviceEnumerator::GetDevice.
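
The prefix stripping described above can be sketched as a small string helper. This is an illustration only: the function name `mmdevice_id_from_winrt` is hypothetical, and the sketch assumes the WinRT ID has the shape `\\?\SWD#MMDEVAPI#<endpoint-id>#<interface-guid>`, so it also trims a trailing `#`-delimited suffix if one is present.

```rust
/// Hypothetical helper: extracts the bare MMDevice endpoint ID from a WinRT
/// DeviceInformation ID.
/// Assumed input shape: \\?\SWD#MMDEVAPI#{0.0.1.00000000}.{guid}#{interface-guid}
fn mmdevice_id_from_winrt(winrt_id: &str) -> Option<&str> {
    const PREFIX: &str = "MMDEVAPI#";
    // Everything after the prefix belongs to the endpoint ID (plus an optional suffix).
    let start = winrt_id.find(PREFIX)? + PREFIX.len();
    let rest = &winrt_id[start..];
    // Assumption: a following '#' introduces the interface class GUID, which is dropped.
    let end = rest.find('#').unwrap_or(rest.len());
    let id = &rest[..end];
    if id.is_empty() {
        None
    } else {
        Some(id)
    }
}
```

Returning `None` for IDs without the prefix lets the caller treat an unexpected ID as a device that failed to open instead of passing a malformed string to GetDevice.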
Per-Track Thread Model
spawn_track creates one OS thread per audio source. The thread initializes its own COM apartment (MTA), registers with MMCSS, opens WASAPI, and loops until the TrackHandle is stopped. No COM objects are passed across threads.
TrackHandle holds an Arc<AtomicBool> stop flag and a JoinHandle. Calling stop() sets the flag and joins the thread; TrackHandle also implements Drop so threads are always joined on engine shutdown.
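
The stop-flag and join pattern can be sketched with the standard library alone. The field names, the `spawn` constructor, and the `&mut self` receiver on `stop` are assumptions for illustration; the real thread body (COM, MMCSS, WASAPI) is replaced here by an arbitrary closure.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Sketch of a handle that owns one worker thread and its stop flag.
struct TrackHandle {
    stop: Arc<AtomicBool>,
    thread: Option<JoinHandle<()>>,
}

impl TrackHandle {
    /// Hypothetical stand-in for spawn_track: calls `body` until stop is requested.
    fn spawn(mut body: impl FnMut() + Send + 'static) -> Self {
        let stop = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&stop);
        let thread = thread::spawn(move || {
            while !flag.load(Ordering::Acquire) {
                body();
                thread::sleep(Duration::from_millis(1));
            }
        });
        Self { stop, thread: Some(thread) }
    }

    /// Sets the flag, then joins. Safe to call more than once.
    fn stop(&mut self) {
        self.stop.store(true, Ordering::Release);
        if let Some(t) = self.thread.take() {
            let _ = t.join();
        }
    }
}

impl Drop for TrackHandle {
    /// Dropping the handle always joins the thread.
    fn drop(&mut self) {
        self.stop();
    }
}
```

Keeping the JoinHandle in an `Option` is what allows both an explicit `stop()` and the `Drop` implementation to run without joining twice.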
WASAPI Capture Loop
Inside each thread, run_track resolves the device, creates an IAudioClient in shared mode with AUDCLNT_STREAMFLAGS_EVENTCALLBACK, and starts capturing. For loopback, the flags also include AUDCLNT_STREAMFLAGS_LOOPBACK.
The event-driven loop calls WaitForSingleObject with a 100 ms timeout and then drains all available packets from IAudioCaptureClient. The timeout acts as a poll interval: for system loopback, the event is not always signaled (a known WASAPI behavior), so the loop does not gate packet draining on the event result.
A 2-second buffer is requested from IAudioClient (20_000_000 in 100-ns units) so that occasional thread scheduling delays never cause the capture client to lose packets.
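
The "wait, then drain regardless of the wait result" logic can be shown without touching WASAPI. In this sketch the `PacketSource` trait, `TimedOutSource`, and `poll_once` are hypothetical stand-ins for WaitForSingleObject and IAudioCaptureClient, not Flashback's actual types.

```rust
use std::collections::VecDeque;

/// Hypothetical abstraction over the event wait and the capture client.
trait PacketSource {
    /// Stand-in for WaitForSingleObject: true if signaled before the timeout.
    fn wait(&mut self, timeout_ms: u32) -> bool;
    /// Stand-in for GetBuffer/ReleaseBuffer: None when nothing is pending.
    fn next_packet(&mut self) -> Option<Vec<u8>>;
}

/// Models a loopback device whose event never fires but which still has data.
struct TimedOutSource {
    pending: VecDeque<Vec<u8>>,
}

impl PacketSource for TimedOutSource {
    fn wait(&mut self, _timeout_ms: u32) -> bool {
        false
    }
    fn next_packet(&mut self) -> Option<Vec<u8>> {
        self.pending.pop_front()
    }
}

/// One loop iteration: the wait only paces the loop; draining is unconditional.
fn poll_once(src: &mut impl PacketSource, out: &mut Vec<Vec<u8>>) -> usize {
    let _signaled = src.wait(100); // result deliberately ignored
    let mut drained = 0;
    while let Some(packet) = src.next_packet() {
        out.push(packet);
        drained += 1;
    }
    drained
}
```

With this shape, a source that always times out still delivers every pending packet on the next iteration, which is the behavior the loopback case relies on.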
Float-to-PCM16 Conversion
Modern Windows audio devices nearly always report their mix format as IEEE 754 float (32-bit little-endian), either with WAVE_FORMAT_IEEE_FLOAT or via WAVE_FORMAT_EXTENSIBLE with KSDATAFORMAT_SUBTYPE_IEEE_FLOAT. Flashback detects this and converts each sample to signed PCM16 before any further processing.
Silent packets (flagged AUDCLNT_BUFFERFLAGS_SILENT by WASAPI) are replaced with a zero-filled buffer of the same size, which avoids a branch in the downmix and encoding paths.
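
A minimal sketch of the float-to-PCM16 step follows. The function names are hypothetical, and the scale factor of 32767 with clamping to [-1.0, 1.0] is an assumption; the source does not state which scaling Flashback uses.

```rust
/// Converts interleaved little-endian f32 samples to signed 16-bit PCM.
/// Assumption: samples are clamped to [-1.0, 1.0] and scaled by 32767.
fn f32_le_to_pcm16(bytes: &[u8]) -> Vec<i16> {
    bytes
        .chunks_exact(4) // trailing bytes that do not form a full sample are ignored
        .map(|c| {
            let sample = f32::from_le_bytes([c[0], c[1], c[2], c[3]]);
            (sample.clamp(-1.0, 1.0) * 32767.0).round() as i16
        })
        .collect()
}

/// Zero-filled replacement for a packet flagged as silent.
fn silent_pcm16(frames: usize, channels: usize) -> Vec<i16> {
    vec![0; frames * channels]
}
```

Clamping before scaling means out-of-range float samples saturate at full scale rather than wrapping when cast to i16.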
Downmix for Multichannel Audio
The AAC encoder in Windows Media Foundation only accepts 1 or 2 channels. Game audio is frequently output as 5.1 or 7.1 surround. Flashback applies a standard downmix matrix before encoding:

| Source channels | L output | R output |
|---|---|---|
| 1 (mono) | mono | mono |
| 2 (stereo) | L | R |
| ≥ 3 (surround) | L + 0.707·C + 0.707·Ls… | R + 0.707·C + 0.707·Rs… |

Samples are clamped to the i16 range after accumulation to prevent wrap-around distortion.
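
The accumulate-then-clamp step can be sketched for a single frame. This is not the full matrix: only the terms visible in the table are reproduced, the 6-channel order `[FL, FR, FC, LFE, SL, SR]` is an assumption, ignoring LFE is an assumption, and other layouts are left unhandled in this sketch.

```rust
const GAIN: f32 = 0.707; // center/surround coefficient from the table

/// Rounds and saturates to the i16 range instead of wrapping.
fn clamp_i16(v: f32) -> i16 {
    v.round().clamp(i16::MIN as f32, i16::MAX as f32) as i16
}

/// Downmixes one interleaved frame to (left, right).
/// Returns None for layouts this sketch does not cover.
fn downmix_frame(frame: &[i16]) -> Option<(i16, i16)> {
    match frame {
        &[m] => Some((m, m)),
        &[l, r] => Some((l, r)),
        // Assumed 5.1 order: FL, FR, FC, LFE, SL, SR (LFE ignored here).
        &[fl, fr, fc, _lfe, sl, sr] => {
            let center = GAIN * fc as f32;
            let left = fl as f32 + center + GAIN * sl as f32;
            let right = fr as f32 + center + GAIN * sr as f32;
            Some((clamp_i16(left), clamp_i16(right)))
        }
        _ => None,
    }
}
```

Accumulating in f32 and clamping once at the end is what prevents a loud center channel from overflowing the 16-bit sum.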
AAC Encoding with IMFTransform
For manual recording, the IMFSinkWriter resolves its own internal AAC MFT from the declared PCM input / AAC output types, the same automatic resolution it performs for the H.264 video encoder. The capture thread simply calls WriteSample with raw PCM data.
For Instant Replay, Flashback needs the encoded AAC packets directly so they can be pushed into the ring buffer. For this path, AacEncoder manages the IMFTransform lifecycle manually.
Encoder Type Negotiation
The AAC encoder MFT has a strict requirement: the output type must be set before the input type, and the output type must be chosen from the types the encoder itself enumerates via GetOutputAvailableType. Constructing a type manually (even with identical attribute values) results in MF_E_INVALIDMEDIATYPE. Flashback iterates the available output types and picks the first one that matches the desired sample rate, channel count, and payload type (0 = raw AAC, no ADTS/LOAS framing).
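
The selection rule can be sketched over plain data. The `AacOutputType` struct and `pick_output_type` are hypothetical: they model the attributes read from each enumerated type (sample rate, channel count, payload type) and do not call Media Foundation. In the real path, the returned index would identify which enumerated IMFMediaType to pass to SetOutputType.

```rust
/// Hypothetical mirror of the attributes read from one enumerated output type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct AacOutputType {
    sample_rate: u32,
    channels: u32,
    payload_type: u32, // 0 = raw AAC, no ADTS/LOAS framing
    avg_bytes_per_sec: u32,
}

/// Returns the index and value of the first enumerated type that matches.
/// The index matters because the encoder's own type object must be reused,
/// not a hand-built copy with equal attributes.
fn pick_output_type(
    available: &[AacOutputType],
    sample_rate: u32,
    channels: u32,
) -> Option<(usize, AacOutputType)> {
    available.iter().copied().enumerate().find(|(_, t)| {
        t.sample_rate == sample_rate && t.channels == channels && t.payload_type == 0
    })
}
```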
Bitrate Selection
The Windows AAC encoder accepts only four bitrate values (selected via MF_MT_AUDIO_AVG_BYTES_PER_SECOND):

| Bytes/sec | Kbps |
|---|---|
| 12 000 | 96 |
| 16 000 | 128 |
| 20 000 | 160 |
| 24 000 | 192 |
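
As a worked example of the table, a requested bitrate in kbps maps to bytes per second as `kbps * 1000 / 8`. The helper below is a sketch: its name is hypothetical, and snapping an unsupported request to the nearest allowed value is an assumed policy, not something the source describes.

```rust
/// The four values from the table, in bytes per second.
const AAC_BYTES_PER_SEC: [u32; 4] = [12_000, 16_000, 20_000, 24_000];

/// Converts kbps to bytes/sec and snaps to the nearest accepted value.
/// Assumption: nearest-value snapping; the real policy may differ.
fn nearest_aac_bytes_per_sec(requested_kbps: u32) -> u32 {
    let requested = requested_kbps.saturating_mul(1000) / 8;
    AAC_BYTES_PER_SEC
        .iter()
        .copied()
        .min_by_key(|b| b.abs_diff(requested))
        .unwrap() // the array is non-empty, so a minimum always exists
}
```
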
AudioSpecificConfig and Payload Type
On the first successful ProcessOutput drain, Flashback reads the encoder’s current output media type and extracts:
MF_MT_USER_DATA: the AudioSpecificConfig blob (2–5 bytes that describe the AAC codec parameters). The MP4 muxer needs this to write the esds box; without it, IMFSinkWriter::Finalize fails with MF_E_SINK_HEADERS_NOT_FOUND.
MF_MT_AAC_PAYLOAD_TYPE: confirmed to be 0 (raw AAC). The muxer must declare the same value.
Both values are forwarded to the sink through AudioSink::set_user_data and AudioSink::set_payload_type.
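
To make the AudioSpecificConfig concrete, the sketch below decodes its first two bytes, which pack the audio object type (5 bits), sampling frequency index (4 bits), and channel configuration (4 bits). The type and function names are hypothetical, the sketch assumes the bytes passed in are the AudioSpecificConfig itself, and escape values (object type 31, frequency index 15) are not handled.

```rust
/// First fields of an AudioSpecificConfig.
#[derive(Debug, PartialEq, Eq)]
struct AscHeader {
    object_type: u8,    // 2 = AAC-LC
    freq_index: u8,     // 3 = 48 000 Hz, 4 = 44 100 Hz
    channel_config: u8, // 1 = mono, 2 = stereo
}

/// Decodes the 5 + 4 + 4 bit header. Returns None if fewer than two bytes
/// are given or an escape value is present (unsupported in this sketch).
fn parse_asc_header(asc: &[u8]) -> Option<AscHeader> {
    let b0 = *asc.first()?;
    let b1 = *asc.get(1)?;
    let object_type = b0 >> 3;
    let freq_index = ((b0 & 0x07) << 1) | (b1 >> 7);
    let channel_config = (b1 >> 3) & 0x0F;
    if object_type == 31 || freq_index == 15 {
        return None;
    }
    Some(AscHeader { object_type, freq_index, channel_config })
}
```

For AAC-LC stereo at 48 000 Hz the two bytes are `0x11 0x90`; at 44 100 Hz they are `0x12 0x10`.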
AudioSink and PcmTap Traits
The pipeline is decoupled from its destination via two traits, AudioSink and PcmTap.
AudioSink is the primary destination. There are two concrete implementations:
EncoderAudioSink (manual recording): wraps Arc<Mutex<Encoder>> and calls encoder.push_audio(stream, data, time, dur), forwarding raw PCM to the SinkWriter under the same mutex that guards video frame writes.
ReplayAudioSink (Instant Replay): pushes already-encoded AAC packets directly into Arc<Mutex<ReplayBuffer>>, rebasing timestamps against video_base so audio and video share the same time origin.
PcmTap provides a side-channel for the post-downmix, pre-encode PCM data. It is used to feed a mixer that produces a blended system+microphone waveform track. Time values are identical to those on the encoded packets (QPC-based), so the mixer can align sources by wall clock.
probe_format: Device Format Discovery
Before opening a WASAPI stream, Flashback calls probe_format to discover the device’s native sample rate and channel count without starting a capture session. This lets the caller declare correct stream metadata in the SinkWriter or ring buffer before any audio data arrives.
The probed format is then passed to aac_target_format, which validates that the sample rate is one of the two values the AAC encoder accepts (44 100 or 48 000 Hz) and returns the stereo/mono target channel count. If the device uses an unsupported rate, the audio track is silently omitted rather than breaking the capture.
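
The validation rule can be sketched as a pure function. The signature shown here is an assumption (the real aac_target_format may take and return different types); only the accepted rates and the mono/stereo clamp come from the text above.

```rust
/// Sketch of the validation: returns (sample_rate, target_channels), or None
/// when the track should be omitted.
/// Assumption: the real function's parameter and return types may differ.
fn aac_target_format(sample_rate: u32, channels: u16) -> Option<(u32, u16)> {
    // Only the two rates the AAC encoder accepts are allowed through.
    if !matches!(sample_rate, 44_100 | 48_000) || channels == 0 {
        return None;
    }
    // Mono stays mono; anything wider is downmixed to stereo.
    Some((sample_rate, channels.min(2)))
}
```

A caller can treat `None` as "skip this track" and continue with video and any remaining audio sources.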
MMCSS Priority
Audio threads register with MMCSS under the "Pro Audio" task, a higher priority class than the video capture threads ("Capture"). This prevents a CPU-intensive game from starving the audio threads. If an audio thread were starved while holding the ReplayBuffer mutex, the video encoding pump (which also needs that mutex) would block, creating a priority inversion. MMCSS scheduling prevents this scenario.
If AvSetMmThreadCharacteristicsW fails (uncommon but possible in sandboxed environments), the thread continues at normal priority and a warning is printed.
Temporal Alignment with Video
Both the EncoderAudioSink and the ReplayAudioSink discard audio packets that arrive before the first video frame establishes a common time origin (video_base). This avoids the alignment problem that arises because WASAPI starts delivering audio immediately, while the first WGC frame may arrive tens of milliseconds later.
For manual recording, video_base is the raw WGC SystemRelativeTime of the first frame. Audio packet timestamps (qpc) are rebased: ts = (qpc - video_base).max(0).
For Instant Replay, video_base is stored in an Arc<AtomicI64> shared between the video pump and all audio sinks. The atomic is initialized to i64::MIN; audio sinks check for this sentinel and discard packets until the video pump sets a real value at the first encoded frame.
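
The sentinel check and the rebasing formula can be combined in a short sketch. The function name `rebase` and the constant name are hypothetical; the `i64::MIN` sentinel and the `(qpc - video_base).max(0)` formula are taken from the text above.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Sentinel meaning "no video frame has set the time origin yet".
const VIDEO_BASE_UNSET: i64 = i64::MIN;

/// Returns the rebased timestamp in 100-ns units, or None if the packet
/// must be discarded because video_base is still unset.
fn rebase(video_base: &AtomicI64, qpc: i64) -> Option<i64> {
    let base = video_base.load(Ordering::Acquire);
    if base == VIDEO_BASE_UNSET {
        return None;
    }
    // Audio slightly older than the first video frame is pinned to zero.
    Some((qpc - base).max(0))
}
```

In use, the video pump would store the first frame's timestamp into the shared atomic, after which every audio sink starts producing rebased timestamps.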
All timestamps in Flashback’s audio and video pipelines are in 100-nanosecond units (the Media Foundation / WGC convention). A 48 000 Hz audio frame of 1024 samples has a duration of 1024 * 10_000_000 / 48_000 = 213_333 units (~21.3 ms).
Related Pages
Architecture
Overall system design, module map, and the Tauri command communication model.
Capture Pipeline
WGC frame acquisition, D3D11 texture path, hardware encoder selection, and ring buffer mechanics.
Capture Settings
How to configure microphone device, system audio, and bitrate from the settings UI.
Troubleshooting
Common issues with audio capture, including silent loopback and microphone device errors.