Year of the Voice - Chapter 4: Wake words

The MAX9814 is a microphone preamp with AGC; it's external to the ESP32, as the internal ADC is pretty bad.
A MAX9814 feeding an I2S ADC should work better than most setups, as an analogue preamp with AGC amplifies small signals before they are quantised, rather than quantising small signals into blocky signals as a straight digital mic does.
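A rough stdlib-only illustration of that quantisation point (a toy model, not tied to any particular board or the MAX9814's exact gain figures): a quiet far-field voice uses only a sliver of the ADC's range, so boosting it in the analogue domain first leaves far less quantisation noise.

```python
import math

def quantise(signal, bits=12, full_scale=1.0):
    """Quantise a signal to an ADC of the given bit depth (signed)."""
    levels = 2 ** (bits - 1)
    return [round(s / full_scale * levels) / levels * full_scale for s in signal]

def snr_db(clean, quantised):
    """Signal-to-quantisation-noise ratio in dB."""
    sig = sum(c * c for c in clean)
    err = sum((c - q) ** 2 for c, q in zip(clean, quantised))
    return 10 * math.log10(sig / err)

# A quiet far-field voice might only use ~1% of the ADC's input range.
t = [i / 16000 for i in range(1600)]
quiet = [0.01 * math.sin(2 * math.pi * 440 * x) for x in t]

# The same signal after ~30 dB of analogue gain (the sort of boost an
# AGC preamp applies) before it ever reaches the ADC.
boosted = [s * 32 for s in quiet]

print(f"quiet into ADC:   {snr_db(quiet, quantise(quiet)):.1f} dB SNR")
print(f"boosted into ADC: {snr_db(boosted, quantise(boosted)):.1f} dB SNR")
```

The boosted signal comes out roughly 30 dB cleaner, which is the whole argument for analogue gain ahead of the converter.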

I use them with USB sound cards on the Pi, and they do far-field much better, as most mics are set up for a broadcast distance of <0.3 m.
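To put a number on why that <0.3 m voicing matters across a room, here is a quick free-field inverse-square-law calculation (an idealised estimate that ignores room reflections):

```python
import math

def level_drop_db(d_near, d_far):
    """Free-field (inverse square law) sound level drop from d_near to d_far."""
    return 20 * math.log10(d_far / d_near)

# A mic voiced for ~0.3 m broadcast use, spoken to from across a 3 m room:
print(f"{level_drop_db(0.3, 3.0):.1f} dB quieter at 3 m")  # → 20.0 dB quieter at 3 m
```

Every doubling of distance costs about 6 dB, so the far-field signal arriving at the mic is drastically quieter than what it was designed around.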

I’m not sure what the difference is yet, but I hear you. I’ll read up on the two types you mentioned ASAP so I can get this one running. I also have a voice assistant unit I’m working on, which I can share with you, from an Everything Smart Home Year of the Voice tutorial. It uses an external mic, an amp, and an ESP32-S3 dev board, because it is fast enough to process on the device, or something like that.

The MAX9814 is pure analogue, but a great chip.
I don’t think HA uses the Espressif IDF, which as far as I know is the only framework with the vector optimisations for the ESP32-S3.
Without those, an ESP32-S3 isn’t much faster than a standard ESP32, whereas with the DSP/ML vector instructions it can be 10x faster.

I don’t know, perhaps I am being lazy, but I have tried to create my own openWakeWord at https://colab.research.google.com, and after six attempts each one fails with the following error.

Now I do not want to change the environment. I would rather wait for the guidance of someone more knowledgeable!
If anyone has a solution please let me know!

Dunno why but the issue is fixed. Happy me!

Here is quite the collection of English wake words.

Those are only "openWakeWord" models for remote on-server wake word detection (with the server always listening), so they do not work on the Home Assistant Voice Preview Edition, which requires "microWakeWord" models for on-device wake word detection that can run locally on the ESP32 microcontroller. While you could technically get the old openWakeWord to work running remotely, the new default standard is to use microWakeWord instead:

About on-device wake word processing (microWakeWord)

The microWakeWord created by Kevin Ahrendt enables ESPHome to detect wake words on devices like the ESP32-S3-BOX-3.

Because openWakeWord is too large to run on low-power devices like the S3-BOX-3, openWakeWord runs wake word detection on the Home Assistant server.

Doing wake word detection on Home Assistant allows low-power devices like the M5 ATOM Echo Development Kit to simply stream audio and let all of the processing happen elsewhere. The downside is that adding more voice assistants requires more CPU usage in Home Assistant as well as more network traffic.
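To make that network-traffic downside concrete, here is a back-of-the-envelope calculation, assuming each satellite streams 16 kHz, 16-bit, mono PCM (a typical format for these pipelines; actual Home Assistant deployments may differ):

```python
# Cost of streaming raw audio from each always-listening satellite,
# assuming 16 kHz, 16-bit, mono PCM.
sample_rate_hz = 16_000
bytes_per_sample = 2
channels = 1

bytes_per_second = sample_rate_hz * bytes_per_sample * channels
print(f"{bytes_per_second / 1000:.0f} kB/s per always-listening satellite")

for n in (1, 5, 10):
    kbps = n * bytes_per_second * 8 / 1000
    print(f"{n:2d} satellites: {kbps:.0f} kbit/s of constant network traffic")
```

So each extra assistant adds a constant ~32 kB/s stream, plus the server-side CPU to run detection on it.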

Enter microWakeWord: a more lightweight model based on Google’s Inception neural network. Because this new model is not as large, it can run on low-power devices with an ESP32 chip, such as the ESP32-S3 inside the S3-BOX-3! (It also works on the now-discontinued S3-BOX and S3-BOX-Lite.)

Currently, there are three models trained for microWakeWord:

  • okay nabu
  • hey jarvis
  • alexa

It doesn’t have to be a server (Getting started - Local - Home Assistant), as the system should run on a Pi 3/4. But using a heavyweight neural ASR such as Whisper means things are mismatched: Piper is an embedded TTS happy on a Pi 3/4, whilst Whisper still struggles with its more accurate models even on a Pi 5.

The huge amount of work by Kevin Ahrendt is being let down by the contributions to the dataset creation and training code, which are pretty woeful, because the quality of a model has one huge defining factor: the quality of the dataset.

The dataset sucks currently. I have posted issues and created a few demo repos, but I am still stuck using synthetic data; at least I have a datum that is far more accurate as an example.

openWakeWord’s compute requirement is still relatively low when not running on an ESP32-S3, but TensorFlow Lite for Microcontrollers only offers a subset of the ML layers of TFLite. So it’s not just about compute: openWakeWord likely uses layers not available in TFLite Micro.

There is likely always going to be a problem with the ESP32-S3, as the commercial models for speech enhancement are much bigger than the current XMOS model and use targeted voice extraction based on user profiles that can cope with the ‘cocktail party’ problem of double-talk.
So when a TV, other media, or another voice is speaking, the ‘target’ voice can still be extracted.
Because speech enhancement needs to be first in the pipeline to feed the KWS and ASR, it is unlikely this will ever run on an ESP32-S3, or even the XMOS, as the compute is too great.
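The pipeline ordering being argued here can be sketched schematically. All three stage functions below are placeholders (a toy gain, an energy threshold, a frame counter), not real models; the point is only the data flow, with enhancement first so that both the KWS and the ASR see cleaned audio:

```python
def speech_enhancement(frame):
    """Placeholder: a real model would suppress noise / extract the target voice."""
    return [s * 0.9 for s in frame]

def kws(frame):
    """Placeholder keyword spotter: fires above a toy energy threshold."""
    return sum(abs(s) for s in frame) / len(frame) > 0.1

def asr(frames):
    """Placeholder recogniser: just reports how much audio it received."""
    return f"decoded {len(frames)} frames"

def pipeline(audio_frames):
    awake, utterance = False, []
    for frame in audio_frames:
        clean = speech_enhancement(frame)   # 1st: always-on enhancement
        if not awake:
            awake = kws(clean)              # 2nd: wake word on clean audio
        else:
            utterance.append(clean)         # 3rd: buffer clean audio for ASR
    return asr(utterance) if awake else None

quiet = [[0.01] * 160] * 5   # below the toy KWS threshold: stays asleep
loud = [[0.5] * 160] * 5     # wakes on the first loud frame
print(pipeline(quiet + loud))
```

Note that the enhancement stage runs on every frame regardless of wake state, which is exactly why its compute cost dominates the hardware question.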

The methods microWakeWord uses should create KWS models that are more accurate than openWakeWord, and there are techniques such as on-device training that can learn through use and get more accurate.
The choice is a compromise: either a relatively easy custom wake word, or a more accurate classification model such as microWakeWord, with much more involved dataset and training requirements.
Custom wake words are a problem, though, as speech enhancement can be trained to be more accurate by having the keyword in its dataset, which again means a compromise or further complex training.

Even a Pi Zero 2 can run multiple instances of either openWakeWord or microWakeWord, whilst I am unsure if even a Pi 5 could run speech enhancement to match what Big Data is using on their hardware, which is not microcontrollers… I say unsure as open source doesn’t seem to have any similarly quantised speech enhancement models; maybe they could work, but it is very unlikely they would run on a micro, as layer complexity and compute are likely to be problems.