Every exercise in VOZI involves three distinct audio operations: playing a model pronunciation of the target word, listening to the child’s attempt through the microphone, and playing a feedback sound after evaluation. These are handled by three separate services (TtsService, AudioAssetService, and SttService), each with a narrow responsibility. ExerciseScreen coordinates them, ensuring that only one audio operation is active at any moment and that the microphone is never blocked by a playback player.
TtsService
TtsService reads words aloud using the Android system TTS engine. It acts as a fallback for the “Escuchar” (Listen) button: if a real MP3 recording exists in the asset bundle, AudioAssetService plays it instead. TTS is only used when the audio asset is missing.
| Setting | Value | Reason |
|---|---|---|
| Language | es-ES | Spanish-only app |
| Speech rate | 0.45 | Slower than default so children can follow clearly |
| Pitch | 1.0 | Natural pitch |
| awaitSpeakCompletion | true | The caller can await speak() and know when audio ends |
_ensureConfigured() runs only on the first speak() call, so construction is free. A new instance can be created per screen without cost.
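The lazy configuration described above could look roughly like the following sketch. The TtsEngine interface and the LoggingTtsEngine fake are stand-ins invented for this example so it runs without a device; only the setting values, speak(), and _ensureConfigured() come from the text.

```dart
import 'dart:async';

/// Hypothetical engine surface; the method names are assumptions that mirror
/// the settings table, not VOZI's actual code.
abstract class TtsEngine {
  Future<void> setLanguage(String language);
  Future<void> setSpeechRate(double rate);
  Future<void> setPitch(double pitch);
  Future<void> awaitSpeakCompletion(bool value);
  Future<void> speak(String text);
}

/// In-memory fake that records every call, so the sketch needs no device.
class LoggingTtsEngine implements TtsEngine {
  final List<String> calls = [];
  @override
  Future<void> setLanguage(String language) async => calls.add('setLanguage($language)');
  @override
  Future<void> setSpeechRate(double rate) async => calls.add('setSpeechRate($rate)');
  @override
  Future<void> setPitch(double pitch) async => calls.add('setPitch($pitch)');
  @override
  Future<void> awaitSpeakCompletion(bool value) async => calls.add('awaitSpeakCompletion($value)');
  @override
  Future<void> speak(String text) async => calls.add('speak($text)');
}

class TtsService {
  TtsService(this._engine); // no engine calls here, so construction is cheap

  final TtsEngine _engine;
  Future<void>? _configured;

  /// Applies the settings on first use; later calls reuse the same future.
  Future<void> _ensureConfigured() => _configured ??= _configure();

  Future<void> _configure() async {
    await _engine.setLanguage('es-ES');
    await _engine.setSpeechRate(0.45);
    await _engine.setPitch(1.0);
    await _engine.awaitSpeakCompletion(true);
  }

  Future<void> speak(String word) async {
    await _ensureConfigured();
    await _engine.speak(word);
  }
}
```

Caching the configuration future, rather than a boolean flag, means two overlapping speak() calls share one configuration pass instead of racing each other.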
AudioAssetService
AudioAssetService plays real MP3 recordings bundled in the app. It uses two independent AudioPlayer instances: one for word audio (_word) and one for feedback sounds (_fx). This separation means a feedback chime never cuts off a word recording mid-play.
Audio focus configuration is critical. By default, audioplayers requests AUDIOFOCUS_GAIN on Android, which holds exclusive audio focus after playback and blocks the microphone from being available to the STT engine on subsequent words. AudioAssetService configures both players with AudioContextConfigFocus.mixWithOthers (Android: AUDIOFOCUS_NONE) so playback never seizes audio focus and the microphone is always available for the next exercise round.
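As an illustration, a player configured this way might be created as in the sketch below. It assumes a version of audioplayers that exposes AudioContextConfig with a focus parameter; createMixingPlayer is a hypothetical helper name, not part of VOZI.

```dart
import 'package:audioplayers/audioplayers.dart';

/// Creates a player whose playback mixes with other audio instead of
/// requesting audio focus. Helper name is hypothetical.
Future<AudioPlayer> createMixingPlayer() async {
  final player = AudioPlayer();
  final audioContext = AudioContextConfig(
    focus: AudioContextConfigFocus.mixWithOthers,
  ).build();
  await player.setAudioContext(audioContext);
  return player;
}
```

Under this assumption, AudioAssetService would call the helper twice, once for _word and once for _fx, so both players share the same focus behavior.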
Methods
playWord(String audioKey) → Future<bool>
Looks up assets/audio/words/<audioKey>.mp3 in the asset bundle. If the file does not exist, returns false immediately and the caller falls back to TtsService. If the file exists, plays it and waits for completion (up to 6 seconds), then returns true. Once the audio has started playing, this method always returns true even if the player emits an error mid-stream: the caller must never fall back to TTS while audio is audibly playing.
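The return-value contract of playWord() can be sketched in plain Dart. The three callbacks below are hypothetical stand-ins for the asset bundle lookup and the _word player; the real method presumably talks to those directly.

```dart
import 'dart:async';

/// Sketch of the playWord() contract. All three callbacks are assumptions
/// introduced so the logic can run without Flutter.
Future<bool> playWord(
  String audioKey, {
  required Future<bool> Function(String path) assetExists,
  required Future<void> Function(String path) startPlayback,
  required Future<void> Function() waitForCompletion,
  Duration maxWait = const Duration(seconds: 6),
}) async {
  final path = 'assets/audio/words/$audioKey.mp3';
  if (!await assetExists(path)) return false; // caller falls back to TTS

  // Assumption: a failure before playback starts propagates to the caller;
  // the text above does not describe that case.
  await startPlayback(path);

  try {
    await waitForCompletion().timeout(maxWait);
  } catch (_) {
    // Timeout or mid-stream error: audio was already audible, so the
    // result stays true and the caller does not start TTS on top of it.
  }
  return true;
}
```

A caller would then write something like `if (!await playWord(key, ...)) await tts.speak(word);`, which keeps the fallback decision in one place.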
playCorrect() / playIncorrect() / playSessionComplete()
Each method picks a random file from its feedback folder and plays it on _fx:
| Method | Asset folder | File count |
|---|---|---|
| playCorrect() | assets/audio/feedback/correct/ | 10 files (correct_01.mp3 … correct_10.mp3) |
| playIncorrect() | assets/audio/feedback/incorrect/ | 10 files (incorrect_01.mp3 … incorrect_10.mp3) |
| playSessionComplete() | assets/audio/feedback/session/ | 5 files (session_complete_01.mp3 … session_complete_05.mp3) |
If an individual feedback file is missing, the exception is silently swallowed; visual feedback remains and the exercise continues normally.
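The random pick and the swallowed exception might be implemented along these lines. Both function names and the play callback are hypothetical; only the folder layout and file naming come from the table.

```dart
import 'dart:math';

/// Builds a random feedback asset path, e.g. one of
/// assets/audio/feedback/correct/correct_01.mp3 … correct_10.mp3.
String randomFeedbackAsset(String folder, String prefix, int count, Random rng) {
  final index = rng.nextInt(count) + 1; // 1..count inclusive
  final number = index.toString().padLeft(2, '0');
  return 'assets/audio/feedback/$folder/${prefix}_$number.mp3';
}

/// Plays one feedback file and ignores failures, so a missing asset never
/// interrupts the exercise. `play` stands in for the _fx player.
Future<void> playFeedback(String path, Future<void> Function(String path) play) async {
  try {
    await play(path);
  } catch (_) {
    // Visual feedback is still shown; nothing else to do here.
  }
}
```

Passing the Random instance in, rather than creating it inside the function, makes the selection reproducible in tests.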
stop() and dispose()
stop() calls stop() then release() on both players. release() is essential: it frees the audio focus and any native audio resources, ensuring the STT microphone can acquire them without delay on the next word. dispose() tears down the AudioPlayer objects entirely when the ExerciseScreen is disposed.
Each ExerciseScreen creates its own AudioAssetService instance. The service holds two AudioPlayer objects; call dispose() in the screen’s dispose() method to release them. Do not share a single instance across screens.
SttService (Speech-to-Text)
SttService wraps the speech_to_text package for on-device Spanish speech recognition. It is a singleton — exactly one instance exists for the entire app lifetime.
speech_to_text registers its onStatus and onError callbacks only once during the first initialize() call. If each ExerciseScreen created a new SttService, the callbacks from the engine would still point to the first (now-disposed) instance, and all subsequent phoneme screens would receive no recognition results.
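The following sketch illustrates why a single long-lived instance avoids the stale-callback problem. FakeEngine imitates an engine that keeps only the first registered callback; attach() and the engine getter are hypothetical names added for the example.

```dart
/// Stand-in for the recognition engine: the first status callback sticks and
/// later registrations are ignored, mirroring the behavior described above.
class FakeEngine {
  void Function(String status)? _onStatus;

  void initialize(void Function(String status) onStatus) {
    _onStatus ??= onStatus; // only the first registration is kept
  }

  void emit(String status) => _onStatus?.call(status);
}

class SttService {
  SttService._(this._engine); // private: no second instance can be built

  /// The single app-wide instance (hypothetical wiring).
  static final SttService instance = SttService._(FakeEngine());

  final FakeEngine _engine;
  void Function(String status)? _screenListener;
  bool _initialized = false;

  /// Exposed only so this sketch can simulate engine events.
  FakeEngine get engine => _engine;

  void init() {
    if (_initialized) return;
    _initialized = true;
    // Registered once; forwards to whichever screen is attached right now.
    _engine.initialize((status) => _screenListener?.call(status));
  }

  /// Each screen swaps in its own listener instead of creating a new service.
  void attach(void Function(String status) listener) => _screenListener = listener;
}
```

Because the engine callback closes over the singleton and looks up the listener at call time, each new screen only needs to call attach() to start receiving events.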
Initialization and Locale Selection
init() requests microphone permission and initializes the recognition engine. It can be called repeatedly and retried safely: once _available is true, subsequent calls short-circuit immediately.
After a successful init, _chooseSpanishLocale() iterates the device’s available speech locales in priority order:
- Exact match es_ES
- Any locale starting with es (e.g. es_MX, es_AR)
- System locale as a last resort
If none of these yields a locale, es_ES is used as a hardcoded fallback.
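The priority order can be expressed as a small pure function. This is a sketch: the real _chooseSpanishLocale() presumably reads the locale list from the engine, whereas here both inputs are passed in as plain strings.

```dart
/// Picks a recognition locale following the priority order described above.
/// [available] holds the locale ids reported by the engine; [systemLocale]
/// may be null when the device reports none.
String chooseSpanishLocale(List<String> available, String? systemLocale) {
  // 1. Exact match.
  if (available.contains('es_ES')) return 'es_ES';

  // 2. Any other Spanish variant. Matching "es_" or "es-" rather than a bare
  //    "es" prefix is a choice made for this sketch.
  for (final id in available) {
    final lower = id.toLowerCase();
    if (lower == 'es' || lower.startsWith('es_') || lower.startsWith('es-')) {
      return id;
    }
  }

  // 3. System locale, then the hardcoded fallback.
  return systemLocale ?? 'es_ES';
}
```

Keeping the selection free of engine calls makes each branch easy to check with a list of locale ids.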
Listening Flow
The two-click “Speak” button model maps to two internal state values.
Safety Caps
listen() always sets two time limits to prevent the recognizer from running indefinitely:
| Option | Value | Purpose |
|---|---|---|
| listenFor | 60 seconds | Hard cap on total listening session |
| pauseFor | 30 seconds | Auto-stop after 30 s of silence |
When either limit is reached, _report() is called with whatever partial text was recognized up to that point, so the attempt is never silently discarded.
partialResults: true is also set so that the last partial transcript is always available in _lastWords. If the child stops the mic before the engine emits a final result, stopAndReport() can still evaluate the partial text rather than returning an empty string.
SttResult
cancel() vs stop() vs stopAndReport()
| Method | Behavior |
|---|---|
| stopAndReport() | Stops listening and immediately calls onResult with current text. Used by the second button click. |
| stop() | Stops the engine without reporting. Used when navigating away mid-attempt. |
| cancel() | Discards all state: clears _lastWords, resets flags, cancels the engine session. Called before starting a fresh listen for the next word. |
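The difference between the three methods, and the role of partial results, can be modeled without the speech engine. ListeningSession and onEngineWords() are hypothetical names; the sketch only tracks the state the table describes.

```dart
/// Engine-free model of the listening state. Names other than
/// stopAndReport(), stop(), cancel(), and _lastWords are assumptions.
class ListeningSession {
  ListeningSession(this.onResult);

  final void Function(String words) onResult;
  bool _listening = false;
  bool _reported = false;
  String _lastWords = '';

  /// First button click; callers run cancel() beforehand.
  void listen() => _listening = true;

  /// Engine callback: each partial transcript overwrites _lastWords.
  void onEngineWords(String words, {required bool isFinal}) {
    if (!_listening) return;
    _lastWords = words;
    if (isFinal) _report();
  }

  /// Second button click: report whatever text is available right now.
  void stopAndReport() {
    if (_listening) _report();
  }

  /// Navigating away: stop without reporting.
  void stop() => _listening = false;

  /// Fresh start for the next word.
  void cancel() {
    _listening = false;
    _reported = false;
    _lastWords = '';
  }

  void _report() {
    if (_reported) return; // report at most once per attempt
    _reported = true;
    _listening = false;
    onResult(_lastWords);
  }
}
```

With partial results feeding _lastWords, a child who taps the button before the engine finalizes still gets the latest partial transcript evaluated.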
Media Coordination
ExerciseScreen enforces a strict rule: no two audio operations ever overlap. Before any new audio action (playing a word, starting the mic, playing feedback), a helper method stops all three services.
The helper runs before playWord(), before listen(), and on screen disposal. The pattern prevents a scenario where, for example, a slow TTS utterance from the previous word is still playing when the child presses “Hablar” for the next.
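A minimal sketch of such a helper follows. The name stopAllMedia, the list-of-callbacks signature, and the decision to ignore individual stop failures are all assumptions made for this example.

```dart
import 'dart:async';

/// Hypothetical helper: stops every audio service, in order, before a new
/// audio action begins. Each entry is a stop function from one service.
Future<void> stopAllMedia(List<Future<void> Function()> stoppers) async {
  for (final stop in stoppers) {
    try {
      await stop();
    } catch (_) {
      // Assumption: one failing stop must not keep the others from running.
    }
  }
}
```

A screen might call `await stopAllMedia([tts.stop, audio.stop, stt.stop]);` as the first line of each audio action, so the rule is enforced in one place rather than at every call site.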
Privacy
VOZI processes speech entirely on-device. The STT engine (Android’s SpeechRecognizer API) returns a text string; VOZI never receives or stores raw audio. The recognized text (SttResult.words) is used only to:
- Evaluate correctness via WordEvaluator
- Store as SpeechAttempt.recognizedText in local shared_preferences for the adult dashboard
The practice_attempts table stores only phoneme_code, target_word, was_correct, score, age_band_code, and created_at; there is no recognized_text column.
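To make the row shape concrete, the sketch below models one practice_attempts record. The column names come from the text; the class name, the Dart field types, and the ISO 8601 timestamp encoding are guesses made for illustration.

```dart
/// Hypothetical model of one practice_attempts row. Field types are
/// assumptions; only the column names are taken from the documentation.
class PracticeAttemptRow {
  const PracticeAttemptRow({
    required this.phonemeCode,
    required this.targetWord,
    required this.wasCorrect,
    required this.score,
    required this.ageBandCode,
    required this.createdAt,
  });

  final String phonemeCode;
  final String targetWord;
  final bool wasCorrect;
  final num score;
  final String ageBandCode;
  final DateTime createdAt;

  /// Serializes exactly the six documented columns; recognized text has no
  /// field here, so it cannot end up in the row by accident.
  Map<String, Object> toMap() => {
        'phoneme_code': phonemeCode,
        'target_word': targetWord,
        'was_correct': wasCorrect,
        'score': score,
        'age_band_code': ageBandCode,
        'created_at': createdAt.toUtc().toIso8601String(),
      };
}
```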