The ESP BOX uses the newer ESP32-S3.
Willow uses the absolute latest ESP-SR framework with their Audio Front End (AFE) framework. We place the AFE between the dual-mic I2S hardware and everything downstream, so that all audio fed to wake word detection, on-device command recognition, and audio streaming to the inference server has (see the configuration sketch after this list):
- AEC (Acoustic Echo Cancellation)
- NS (Noise Suppression)
- BSS (Blind Source Separation)
- MISO (Multi Input Single Output)
- VAD (Voice Activity Detection)
- AGC (Automatic Gain Control)
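For reference, here is a minimal sketch of how these AFE stages are enabled through the ESP-SR API. The handle and field names follow recent ESP-SR releases and are an illustration, not Willow's exact code; check the version you build against:

```c
#include "esp_afe_sr_iface.h"
#include "esp_afe_sr_models.h"

// Minimal AFE setup sketch (names per recent ESP-SR releases; assumed, not
// Willow's actual code). Each flag enables one stage from the list above.
esp_afe_sr_data_t *afe_init(void)
{
    esp_afe_sr_iface_t *afe_handle = &ESP_AFE_SR_HANDLE;

    afe_config_t afe_config = AFE_CONFIG_DEFAULT();
    afe_config.aec_init = true;      // AEC: acoustic echo cancellation
    afe_config.se_init = true;       // speech enhancement: NS / BSS
    afe_config.vad_init = true;      // VAD: voice activity detection
    afe_config.wakenet_init = true;  // WakeNet wake word detection
    afe_config.agc_mode = AFE_MN_PEAK_AGC_MODE_2;  // AGC preset (assumed name)

    // The AFE mixes the processed microphone channels down to a single
    // output stream (MISO) that feeds wake, recognition, and streaming.
    return afe_handle->create_from_config(&afe_config);
}
```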
Additionally, the ESP BOX enclosure has been acoustically engineered by Espressif, with tuned microphone cavities and so on. Thanks to all of this, ESP-SR has actually been tested and certified by Amazon themselves (I see the irony) for use as an Alexa platform device.
Wake word detection is effectively instant (imperceptible), and the VAD timeout is currently set to 100 ms (sketched below). We expose a multitude of ESP-SR and ESP-DSP tuning parameters for each of these features. Also, while they all use the same engine, we support the Alexa, Hi ESP, and Hi Lexin wake words, each of which has been trained and recorded by Espressif and professional audio engineers on 20,000 speech samples across 500 individual speakers (a mix of genders, including 100 children) at distances of 1-3 m.
We will be using this same Espressif training process for "Hi Willow" and other wake words as it makes sense.
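To make the wake → VAD timeout behavior described above concrete, here is a sketch of the kind of fetch loop involved. The ESP-SR fetch API and result fields follow recent releases, but the helper functions and the exact timeout handling are hypothetical, not Willow's actual implementation:

```c
#include <stdbool.h>
#include <stdint.h>
#include "esp_err.h"
#include "esp_afe_sr_iface.h"

#define VAD_TIMEOUT_MS 100  // end the voice session after 100 ms of silence

// Hypothetical helpers for illustration: stream audio out, and convert a
// buffer size to its duration in milliseconds.
extern void send_audio(int16_t *data, int bytes);
extern int frame_ms(int bytes);

// Sketch: start streaming on wake word, stop after VAD_TIMEOUT_MS of silence.
// afe_handle/afe_data come from an AFE setup like the one shown earlier.
void fetch_loop(esp_afe_sr_iface_t *afe_handle, esp_afe_sr_data_t *afe_data)
{
    bool streaming = false;
    int silence_ms = 0;

    for (;;) {
        afe_fetch_result_t *res = afe_handle->fetch(afe_data);
        if (res == NULL || res->ret_value != ESP_OK) {
            continue;
        }
        if (res->wakeup_state == WAKENET_DETECTED) {
            streaming = true;  // wake word heard: open a voice session
            silence_ms = 0;
        }
        if (streaming) {
            send_audio(res->data, res->data_size);  // cleaned, AFE-processed audio
            if (res->vad_state == AFE_VAD_SILENCE) {
                silence_ms += frame_ms(res->data_size);
                if (silence_ms >= VAD_TIMEOUT_MS) {
                    streaming = false;  // end of utterance
                }
            } else {
                silence_ms = 0;  // speech resumed, reset the timeout
            }
        }
    }
}
```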
This process (which is pretty much industry standard for commercial-grade wake word implementations) is, in fact, very involved. You can see the metrics for wake activation, false wakes, resource utilization, etc. here:
https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/benchmark/README.html
We have reliable wake word detection across all of our supported wake words, and clean speech for speech recognition, from at least 25-30 ft away (even without line of sight to the device - around corners, etc.) in acoustically challenging environments (acoustic echo, noise, etc.). You can see from the benchmarks above that WakeNet activation is 94-98% reliable depending on environmental conditions, all while minimizing false wakes.
The ESP32-S3 has two cores. We pin the AFE and audio tasks to core 1 at varying FreeRTOS scheduling priorities, while leaving "misc" tasks on core 0, again with priorities set per task. We currently have plenty of CPU time to spare across both cores for future tasks, and we will be able to optimize this further.
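As an illustration, core pinning in ESP-IDF is done with xTaskCreatePinnedToCore(). The task names, stack sizes, and priorities below are hypothetical placeholders, not Willow's actual task layout:

```c
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

// Hypothetical task bodies for illustration only.
static void afe_feed_task(void *arg)  { for (;;) { /* feed I2S frames into the AFE */ vTaskDelay(1); } }
static void afe_fetch_task(void *arg) { for (;;) { /* fetch processed audio / wake events */ vTaskDelay(1); } }
static void misc_task(void *arg)      { for (;;) { /* Wi-Fi, UI, housekeeping */ vTaskDelay(1); } }

void start_tasks(void)
{
    // AFE/audio work pinned to core 1; the feed path gets the higher priority.
    xTaskCreatePinnedToCore(afe_feed_task,  "afe_feed",  8192, NULL, 6, NULL, 1);
    xTaskCreatePinnedToCore(afe_fetch_task, "afe_fetch", 8192, NULL, 5, NULL, 1);
    // Everything else stays on core 0 at a lower, per-task priority.
    xTaskCreatePinnedToCore(misc_task,      "misc",      4096, NULL, 3, NULL, 0);
}
```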
Combined with our inference server, or with on-device command recognition (via ESP-SR MultiNet 6), we get response times and interactivity that are actually superior to Echo/Alexa because of local control. Please see the demo:
Since recording that on Monday we have shaved off another ~200 ms in the recognition → transcript → send-to-HA pipeline.
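For the on-device path, a MultiNet command-recognition flow in ESP-SR looks roughly like this. The API names follow recent ESP-SR releases (MultiNet 6 accepts plain-text English commands), and the command phrase and ID are made up for illustration:

```c
#include <stdint.h>
#include <stdio.h>
#include "esp_mn_iface.h"
#include "esp_mn_models.h"
#include "esp_mn_speech_commands.h"
#include "model_path.h"

// Sketch of on-device command recognition with MultiNet (names per recent
// ESP-SR releases; the command and ID below are illustrative only).
void multinet_demo(int16_t *audio_frame)
{
    // Load models from the flash partition and pick an English MultiNet.
    srmodel_list_t *models = esp_srmodel_init("model");
    char *mn_name = esp_srmodel_filter(models, ESP_MN_PREFIX, ESP_MN_ENGLISH);
    esp_mn_iface_t *multinet = esp_mn_handle_from_name(mn_name);
    model_iface_data_t *mn_data = multinet->create(mn_name, 6000); // 6 s timeout

    // Register a hypothetical command phrase under ID 1.
    esp_mn_commands_clear();
    esp_mn_commands_add(1, "turn on the kitchen lights");
    esp_mn_commands_update();

    // Feed AFE-processed audio frames into the recognizer.
    if (multinet->detect(mn_data, audio_frame) == ESP_MN_STATE_DETECTED) {
        esp_mn_results_t *res = multinet->get_results(mn_data);
        printf("command id: %d\n", res->command_id[0]); // dispatch to HA, etc.
    }
}
```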
Additionally, we have an initial audio test harness and have verified this pipeline across at least 1,000 consecutive runs with roughly a 1% failure rate.
All in all, not bad for a $50 device you can just take out of the box, flash, and put on the counter!