Hi everyone,
I’m finalizing a local voice satellite setup using ESP32-S3 hardware (Waveshare ESP32-S3-Board v3 and ReSpeaker Lite). My goal is to train a robust microWakeWord model using t0tter’s notebook pipeline.
I am targeting Spanish (Spain/Castilian). The main challenge is achieving high accuracy for my specific voice without overfitting, while maintaining enough generalization for the model to work reliably in a real home environment with other family members.
My Proposed “Hybrid” Dataset Strategy:
Instead of relying on a single source, I plan to structure the positive samples in three distinct layers. I would like your feedback on this mix:
- The “Classic” Layer (Generalization):
  Standard TTS generation (Piper, Google, etc.) using the default script.
  Purpose: To ensure the model learns the phonemes of the wake word broadly and doesn’t become “deaf” to slight variations or to other household members.
- The “Cloned” Layer (Prosody & Timbre):
  Using Applio (RVC) to process expressive inputs generated by advanced TTS models (referencing this BentoML analysis of open-source models).
  Purpose: To inject my specific vocal timbre into the dataset while keeping clean synthetic prosody/intonation, which is crucial for the musicality of Spanish.
- The “Real World” Layer (Ground Truth):
  Actual raw recordings of my voice speaking the wake word naturally.
  Purpose: To capture the acoustic characteristics of the target hardware (microphone frequency response) and the natural speech imperfections (breathing, speed variations) that synthesis often misses.
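To make the layering concrete, here is a minimal sketch of how I imagine assembling the positive set from the three sources. To be clear, the `build_positive_set` helper, the directory layout, and the 40/40/20 default ratio are all my own assumptions, not part of t0tter’s notebook:

```python
import random
from pathlib import Path

def build_positive_set(classic_dir, cloned_dir, real_dir,
                       total=2000, ratios=(0.4, 0.4, 0.2), seed=0):
    """Sample WAV paths from the three layers at fixed ratios.

    Hypothetical helper: directory names and the 40/40/20 split
    are assumptions for illustration, not an official workflow.
    """
    rng = random.Random(seed)
    layers = [sorted(Path(d).glob("*.wav"))
              for d in (classic_dir, cloned_dir, real_dir)]
    picked = []
    for files, ratio in zip(layers, ratios):
        # Cap at what the layer actually contains (real recordings are scarce).
        n = min(int(total * ratio), len(files))
        picked.extend(rng.sample(files, n))
    rng.shuffle(picked)
    return picked
```

The `seed` is there so a training run can be reproduced while I compare ratio hypotheses.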
My questions for the community:
- Balancing the Mix: In a typical 2026 microWakeWord training run, what ratio would you recommend to avoid overfitting? My hypothesis: 40% generic TTS / 40% Applio-cloned / 20% real recordings.
- Real Audio Integration: When introducing “Real World” recordings into t0tter’s notebook, should I pre-process them (normalize/denoise) strictly, or is it better to leave them somewhat “raw” to help the model adapt to the noise floor of the house?
- Waveshare v3 vs ReSpeaker: Has anyone noticed a significant difference in inference performance or false rejection rates (FRR) between these two boards when using custom-trained models?
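On the pre-processing question, the kind of step I would either apply strictly or deliberately skip is something like this peak-normalization pass (a pure-Python sketch over int16 PCM samples; the `target_peak` value is an arbitrary assumption of mine, roughly −1 dBFS):

```python
def peak_normalize(samples, target_peak=0.89, full_scale=32767):
    """Scale int16 PCM samples so the loudest sample hits target_peak * full_scale.

    Sketch only: whether to run this on the "Real World" recordings,
    or leave them raw to preserve the household noise floor, is
    exactly the question I'm asking.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = (target_peak * full_scale) / peak
    # Clamp to the int16 range after rounding.
    return [max(-full_scale - 1, min(full_scale, round(s * gain)))
            for s in samples]
```

Denoising would be a separate (and more invasive) step; my instinct is that normalization is safe while aggressive denoising risks erasing the very hardware characteristics the “Real World” layer exists to capture.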
One final “Reality Check”:
Am I over-engineering this?
Considering the state of microWakeWord in 2026, is the standard workflow (just generic Piper/Google TTS plus a few raw recordings) usually sufficient for an accurate Spanish wake word? Or does the speech-to-speech (RVC) approach provide a noticeable leap in daily reliability?
I’m aiming for the “sweet spot” between a perfect dataset and an over-engineered one. Thanks for your time!