Year of the Voice - Chapter 4: Wake words

My chain is
Atom Echo => TP-Link WiFi access point (TL-WA901ND V5) with 100 Mbit LAN connection => LAN cable => main router (O2 HomeBox 6641) with 1000 Mbit ports => LAN cable => Raspberry Pi 4

I don’t know why, but I never liked the idea of using the same MQTT network for audio.

Haven’t got one, but I’m just presuming it’s the same.

I have the issue described above. Can anyone point me in the right direction? Is there any update for the Atom?

Thank you

From how I see it, at this point in time the best we can do is collect cases until someone with more insight into the underlying tech finds the pattern/problem.

What is your setup? Hardware and software?

I have HA in a Proxmox VM, with Supervisor…

Home Assistant 2023.10.3
Supervisor 2023.10.0
Operating System 11.0
Interface: 20231005.0 - latest

Using an Atom: I just bought 2 units, and both have the same issue. The firmware in the Atom is

atom-echo-voice-assistant
by m5stack
Firmware: 2023.10.0b1 (Oct 13 2023, 23:14:59)
Hardware: 1.0

I have tried without the wake word… by pressing the button, and the Atom doesn’t have this problem, so it’s related to the wake word. Without it I can press the button multiple times and it responds every time!

I have opened an issue on GitHub for anyone who wants to join:

4 Likes

I have 1 Atom Echo flashed with the voice assistant that shows the same symptoms when using openWakeWord. I also have 2 ESP32-S3-Box-Lites with the voice assistant flashed on them, with the exact same symptoms. I can ask a few commands (2-5) and then the voice assistants freeze up.

I have another ESP32-S3-Box Lite flashed with Willow voice assistant for about 3 weeks now and it does not have any issues responding to wake words or getting locked up. Even when the Atom Echo and others don’t respond using openwakeword, the Willow box will still respond.

3 Likes

Oh, I had never heard about it until I saw it on Amazon a few days ago, noticed it was running on an ESP32, and wondered if it could be used for Assist.

Can you tell me what kind of issues there were with the device?

The one I have is more of a tech demonstrator than an actual product, as Espressif have packed every function onto an S3 microcontroller.

https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/audio_front_end/README.html

It’s the usual tiny box with a toy amplifier, and to squeeze so much in, the KWS model they run is extremely quantised.

Likely any ESP32-S3 could make a really great wireless KWS device, as the settings for the BSS could be tweaked and better models could be made.
As it is, to squeeze everything in, things are a tad thin; it makes a good demonstrator, but maybe not a good product.

Likely an ESP32-S3 would really benefit from a 2-mic design, 71.45 mm spacing @ 48 kHz, using the audio front-end framework linked above.
Create models to match: from what Espressif do, it would seem the BSS splits the audio into 2 channels and they run 2x KWS to detect which channel carries the KW and the following command.
Rather than just I2S mics, use an I2S ADC and a MAX9814, as the AGC on those is pretty awesome; apart from closer tolerances and being smaller, there is no real difference.
Willow are still using the Espressif KWS models and they can be hit or miss. I think they work on a rolling window with a fairly slow rate, and many of the rejections are the KW not fitting the current window, or it’s just bad :slight_smile:

I would like to set up a page for recommended RPi products. Really, I wish people could buy the MAX9814 mic you linked in a nice little USB package :grinning_face_with_smiling_eyes:

It seems like there’s no product that isn’t:

  1. Meant for something else and therefore more expensive (Anker C300 webcam)
  2. Meant for something else and therefore not as performant (Anker S330 speakerphone)
  3. In pieces and requires soldering, etc.

So I can’t just plug it in and select as input?

All that is needed is a 3.5 mm TRS jack lead that ends in Dupont connectors, and then no soldering is needed.
I’ve just never found one.

It’s not only the MAX9814 board; any analogue preamp with silicon AGC can extend any USB sound card into near/far field, as those really expect close-field mics, which is why the input volume is often low.
So any mic preamp, or even a MEMS mic with built-in AGC, would do; it’s just that the MAX9814 with controllable gain and AGC is widely available. We just lack 3.5 mm TRS jack plugs ending in Dupont connectors, though surely they must exist or can easily be obtained.

My fave USB card, because it’s a very rare stereo ADC, is the Plugable USB Audio Adapter ($9.95).

That simple analogue 2-mic array with 71.45 mm spacing could be used on various devices from Pi to ESP32-S3, with a low-cost ADC, so that you can use the special Alexa audio sauce in
https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/audio_front_end/README.html

Send the stereo channels to a Pi and run 2x KWS to get the KW hit and select the best channel from the BSS output.
Then you actually have a far-field mic that uses BSS for audio preprocessing.

Likely we can do the 2x KWS on an ESP32-S3 with TFLite4Micro and just send on the ‘voice command audio’.
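
To make that concrete, here is a minimal, hedged sketch of the channel-selection idea: run one keyword spotter per BSS output and keep whichever channel scores highest. The frame size, threshold, and the `kws_score()` helper are hypothetical placeholders for whatever model you actually run (openWakeWord, a TFLite4Micro model, etc.); this is not Espressif’s implementation.

```python
import numpy as np

# Sketch only: pick the BSS output channel whose keyword spotter is most
# confident, then treat that channel as the "voice command" stream.
# kws_score() is a hypothetical wrapper around whatever KWS model you run;
# it should return the wake-word score (e.g. softmax output) for one mono frame.

FRAME = 1280  # 80 ms at 16 kHz, a common KWS hop size (assumption)

def kws_score(frame: np.ndarray) -> float:
    """Placeholder: run your keyword model on one mono frame, return its score."""
    raise NotImplementedError

def select_command_channel(bss_out: np.ndarray, threshold: float = 0.6):
    """bss_out: (2, n_samples) array holding the two separated BSS streams.
    Returns (channel_index, remaining_samples) once either channel fires, else None."""
    for start in range(0, bss_out.shape[1] - FRAME + 1, FRAME):
        scores = [kws_score(bss_out[ch, start:start + FRAME]) for ch in (0, 1)]
        best = int(np.argmax(scores))
        if scores[best] >= threshold:
            # Wake word detected on `best`; what follows is the command audio
            # that would be streamed on to the ASR.
            return best, bss_out[best, start + FRAME:]
    return None
```

On an ESP32-S3 the same selection would presumably happen frame by frame as the audio front end produces output, rather than over a buffered array as above.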

But yeah, getting some easily available components would be a massive plus, considering we are not talking about much more than 3.5 mm TRS jack plugs ending in Dupont connectors and pre-made Home Assistant housings.

At least then we would be a little closer to commercial performance in recognition quality, and surely that is better than pushing units with extremely poor function and recognition purely because they already come in a housing, when for this purpose they are relative e-waste.

On a Pi I am still searching for a BSS alg to turn into a nice efficient C/C++ routine, and it’s just a shame the BSS from Espressif is a blob.
I did do a simple delay-sum beamformer, but a KWS needs to lock on to the command sentence, and even the ReSpeaker 2-Mic, which is better than some think, still needs a case.
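
For anyone curious what a delay-sum beamformer amounts to, here is a rough numpy sketch for a 2-mic array; it is not the code referred to above, and it cheats with a fixed steering angle and whole-sample delays.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(mics: np.ndarray, fs: int, spacing_m: float, angle_deg: float) -> np.ndarray:
    """Rough 2-mic delay-and-sum.
    mics: (2, n_samples) array of simultaneously captured samples.
    angle_deg: steering angle, 0 = broadside (straight ahead of the array).
    The delay sign depends on which physical mic is channel 0/1."""
    delay_s = spacing_m * np.sin(np.deg2rad(angle_deg)) / SPEED_OF_SOUND
    delay_n = int(round(delay_s * fs))      # whole samples only; crude but cheap
    shifted = np.roll(mics[1], -delay_n)    # np.roll wraps at the edges; fine per block
    return 0.5 * (mics[0] + shifted)

# Example: the 71.45 mm array at 48 kHz, steered 30 degrees off broadside
# out = delay_and_sum(stereo_block, 48000, 0.07145, 30.0)
```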

So for me, if someone who is conversant with the ESP32 could hack out the above Audio Front-end Framework (ESP32-S3, ESP-SR latest documentation) and just couple it to TFLite4Micro the same way Espressif do, that is really just 2x KWS where the BSS stream with the highest softmax is used.

Whether the 2-mic preamp with AGC is made or assembled from parts, using electrets or MEMS, I don’t really care, just that one exists.
Otherwise we are still at the same point with Year of the Voice where an Alexa or Google likely just works so much better, and, compared to the poorly fitting speakerphones and webcams being advocated, is even much cheaper.

PS: this was just a hack of 2 existing projects that I converted to realtime, but if anyone would like to clean the code up and optimise the FFT to use NEON, please do, as at least it is some initial audio processing, which is a massive part of what smart speakers do.

I name-drop the MAX9814 and that USB adapter because, apart from the 3.5 mm TRS jack to Dupont leads, you can drill a hole in a case and push-fit the mics into 9.5 mm rubber grommets; they are available and relatively easy.
The honest truth though is that devices and systems for voice control of the standard many are used to on other devices may not come this year.

I have a hunch, because its computational load is less than many other algs, that Espressif have some form of DUET BSS alg; the maths and C/C++ skills are way beyond my simple hack ability, though.

Noise with smart speakers is often command voice vs media noise, and the sources are often clearly spatially different.
BSS is not perfect, but on that 80/20 rule, where static noise filters or AEC only process ‘own’ noise, BSS will cover them all. Likely it’s a variation of this that Google use with their VoiceFilter-Lite, as they scrapped beamforming and now have just 2 mics and lower cost.
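
For illustration only, a very rough DUET-style sketch for 2 mics and 2 sources: take STFTs, estimate a level ratio and a phase-derived delay per time-frequency bin, cluster the bins into two sources, and mask. Whether Espressif’s blob does anything like this is pure speculation, and real DUET builds a weighted 2-D histogram with peak picking rather than the k-means shortcut used here.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.signal import stft, istft

def duet_separate(x_left, x_right, fs, nperseg=1024):
    """Toy DUET-flavoured separation of a 2-mic capture into 2 source estimates."""
    f, _, L = stft(x_left, fs, nperseg=nperseg)
    _, _, R = stft(x_right, fs, nperseg=nperseg)
    ratio = R[1:] / (L[1:] + 1e-12)                       # skip the DC row (f = 0)
    level = np.log(np.abs(ratio) + 1e-12)                 # relative attenuation per bin
    delay = -np.angle(ratio) / (2 * np.pi * f[1:, None])  # relative delay per bin (s)
    feats = np.column_stack([level.ravel(), delay.ravel() * fs])  # crude scale match
    _, labels = kmeans2(feats, 2, minit="points")         # 2 clusters ~ 2 sources
    sources = []
    for k in (0, 1):
        mask = (labels == k).reshape(level.shape)
        masked = np.zeros_like(L)
        masked[1:] = np.where(mask, L[1:], 0)             # binary TF mask on one channel
        _, y = istft(masked, fs, nperseg=nperseg)
        sources.append(y)
    return sources  # e.g. one stream mostly voice, the other mostly media noise
```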

I am not an ESP32-S3 fanboy either, as I keep dodging how to use their IDF, but they do have an Alexa-certified Audio Front-end Framework, and the bits needed are actually fewer than what they put in their S3 Box systems.

You cannot just chain a single mic input with no audio processing to a KWS trained on synthesized voice, to a full-vocab ASR, to simple word stemming for control, and say voilà, ‘Year of the Voice’.
Not next to the highly engineered systems most people are now used to.
Maybe call it the Home Assistant AIY voice kit and declare the scope of intent.

1 Like

During the video they mentioned sharing trained models somewhere so we don’t all create the same thing over and over; has this been set up yet?

What are some examples of custom wake words people are using?

I’m thinking…
Potato, Hey Potato, Oi Potato

4 Likes

This is amazing! Great work all!!

I have just ordered a bunch of omni microphones and speakers to make DIY satellites with ESPs.

In the meantime, I have got the wake word working on a NUC8i5BEH with built-in microphone array and it is working great :partying_face:

1 Like

I have tried the recommended satellite with a Raspberry Pi 3 and the Anker S330, and it works fine. So far no wake word freeze; it seems to be related only to the Atom.

4 Likes

How is the recognition quality / performance with the Raspi and S330? I find the Atom to be OK if I have it at my desk, or speaking directly to it <10 feet away. My ESP Boxes are better, I can speak indirectly or yell from another room in the house, but I do have to adjust my speech cadence to get the best results. I am curious what other devices or microphones people are hooking up to test with this.

Hi! I have a Jabra Speak 510 plugged in via USB to my home lab server, which is hosting the (single/main) instance of HA in Docker.

Basically: it’s not a voice satellite setup, and since it’s on Docker I can’t use the “Assist Microphone” add-on (that is mentioned for the HA OS install type). PS: I have openWakeWord deployed in Docker and connected to HA (it works well from my remote laptop mic, using “OK Nabu” in the debug assistant).

I didn’t get how (or if?) I can make the Jabra hear the wake word.

You may have to share your sound device via the docker run command / yaml.

You can have multiple Docker containers all sharing the same device, but you need an asound.conf in each container, or file(s) shared from the host as asound.conf.

Doug or Mike may be able to help, as a long time back we got it working fine with a ReSpeaker 2-Mic via ALSA for the host and a container, but it should be the same for your containers.

You just have to remember each container acts as an isolated instance, so you have to give it access and use dmix and dsnoop with an IPC key and IPC perms to share the device, so it is not blocked for multiple use.
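
As a rough sketch only (assuming the USB sound device shows up as ALSA card 1 and that the same file is bind-mounted into every container; the image name at the end is just a placeholder for whatever you actually deploy), the shared asound.conf and the device share could look like this:

```
# /etc/asound.conf, shared into each container (card index is an assumption: hw:1)
pcm.dmixed {
    type dmix
    ipc_key 1024
    ipc_key_add_uid false   # same key regardless of container user
    ipc_perm 0666           # let every container open the shared plugin
    slave.pcm "hw:1,0"
}
pcm.dsnooped {
    type dsnoop
    ipc_key 2048
    ipc_key_add_uid false
    ipc_perm 0666
    slave.pcm "hw:1,0"
}
pcm.!default {
    type asym
    playback.pcm "dmixed"
    capture.pcm "dsnooped"
}
```

```bash
# Placeholder image name; the point is --device /dev/snd plus the shared asound.conf
docker run -d \
  --device /dev/snd \
  -v /etc/asound.conf:/etc/asound.conf:ro \
  your-wakeword-or-satellite-image
```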

It is better than the Atom: it’s faster and can respond multiple times within seconds; you just need to say the wake word again. I’ve found that it sometimes failed to get the word, but that is just my native language, where the speech-to-text gets the wrong thing. I need to create a better word or alias; it’s not related to the speaker or the solution, because with simpler words in my native language it does really well, even from across the room!

So yes, it’s a better solution if you have a Raspberry Pi around and want to spend the money on the Anker…

4 Likes

Glad to know it is working! Can you point me to how to install the .yaml https://github.com/esphome/firmware/blob/1cc35128b9d3d2e7edf2dd62331a058cc27e754d/voice-assistant/esp32-s3-box-lite.yaml file on the ESP32 S3 Box? I just need to know how to start, because I’ve not messed with ESP32 stuff before. I’ve searched, but I guess I’m not searching the right words.