[Discussion] ESP32-S3 Wake Word Strategy: Hybrid Dataset (Real Voice + Applio RVC + Classic TTS) for Spanish

Hi everyone,

I’m finalizing a local voice satellite setup using ESP32-S3 hardware (Waveshare ESP32-S3-Board v3 and ReSpeaker Lite). My goal is to train a robust microWakeWord model using t0tter’s notebook pipeline.

I am targeting Spanish (Spain/Castilian). The main challenge is achieving high accuracy on my own voice without overfitting to it, so the model still generalizes well enough to work reliably in a real home environment with other family members.

My Proposed “Hybrid” Dataset Strategy:

Instead of relying on a single source, I plan to structure the positive samples in three distinct layers. I would like your feedback on this mix:

  1. The “Classic” Layer (Generalization):
    Standard TTS generation (Piper, Google, etc.) using the default script.

    • Purpose: To ensure the model learns the phonemes of the wake word broadly and doesn’t become “deaf” to slight variations or other household members.
  2. The “Cloned” Layer (Prosody & Timbre):
    Using Applio (RVC) to process expressive inputs generated by advanced TTS models (referencing this BentoML analysis on open source models).

    • Purpose: To inject my specific vocal timbre into the dataset while maintaining perfect synthetic prosody/intonation, which is crucial for the musicality of Spanish.
  3. The “Real World” Layer (Ground Truth):
    Actual raw recordings of my voice speaking the wake word naturally.

    • Purpose: To capture the specific acoustic characteristics of the intended hardware (microphone frequency response) and natural speech imperfections (breathing, speed variations) that synthesis often misses.
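For concreteness, here is roughly how I'd assemble the positive set from those three layers before handing it to the notebook. The directory names, ratios, and the helper itself are placeholders of my own, not part of t0tter's pipeline:

```python
# Sketch: draw positive samples from each layer directory according to a
# fixed mix. Layer names and ratios below are my own placeholders.
import random
from pathlib import Path

MIX = {"generic_tts": 0.40, "applio_cloned": 0.40, "real_recordings": 0.20}

def build_positive_set(root: Path, total: int, seed: int = 42) -> list[Path]:
    """Pick WAV files from root/<layer> subdirectories per the MIX ratios."""
    rng = random.Random(seed)
    selected: list[Path] = []
    for layer, ratio in MIX.items():
        files = sorted((root / layer).glob("*.wav"))
        n = min(int(total * ratio), len(files))  # cap at what's available
        selected.extend(rng.sample(files, n))
    rng.shuffle(selected)  # avoid layer-ordered batches
    return selected
```

The fixed seed is just so I can reproduce a given mix when comparing training runs.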

My questions for the community:

  1. Balancing the Mix: In a typical 2026 microWakeWord training run, what ratio would you recommend to avoid overfitting?
    • Hypothesis: 40% Generic TTS / 40% Applio Cloned / 20% Real Recordings?
  2. Real Audio Integration: When introducing “Real World” recordings into t0tter’s notebook, should I pre-process them aggressively (normalization, denoising), or leave them somewhat “raw” so the model adapts to the noise floor of the house?
  3. Waveshare v3 vs ReSpeaker: Has anyone noticed a significant difference in inference performance or false rejection rates (FRR) between these two boards when using custom-trained models?
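On question 2, if I go the light-touch route, I'd only do peak normalization and deliberately skip denoising, so the room's noise floor survives into training. A minimal sketch; the -3 dBFS target is my own guess, not a microWakeWord requirement:

```python
import numpy as np

def peak_normalize(samples: np.ndarray, target_dbfs: float = -3.0) -> np.ndarray:
    """Scale a float waveform (range -1..1) so its peak sits at target_dbfs.

    No denoising on purpose: the noise floor is scaled along with the speech,
    so the model still sees realistic room conditions.
    """
    peak = float(np.max(np.abs(samples)))
    if peak == 0.0:
        return samples  # silent clip, nothing to scale
    target_peak = 10.0 ** (target_dbfs / 20.0)
    return samples * (target_peak / peak)
```

That way every real recording reaches the trainer at a consistent level, while quiet breaths and background hum are preserved rather than gated out.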

One final “Reality Check”:
Am I over-engineering this?
Considering the state of microWakeWord in 2026, is the standard workflow (just generic Piper/Google TTS plus a few raw recordings) usually sufficient for a reliably accepted Spanish wake word? Or does the speech-to-speech (RVC) approach deliver a noticeable leap in day-to-day reliability?

I’m aiming for the “sweet spot” between a perfect dataset and an over-engineered one. Thanks for your time!