Year of the Voice - Chapter 4: Wake words

The one I have is more of a tech demonstrator than an actual product, as Espressif have packed every function onto an S3 microcontroller.

https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/audio_front_end/README.html

It's the usual tiny box with a toy amplifier, and to squeeze so much in, the KWS model they run is extremely quantised.

Likely any ESP32-S3 could make a really great wireless KWS device, as the BSS settings could be tweaked and better models could be made.
As it is, squeezing everything in makes it all a tad thin: a good demonstrator, but maybe not a good product.

Likely an ESP32-S3 would really benefit from a 2-mic design, 71.45mm spacing @ 48kHz, using the audio front end above.
From what Espressif do, it seems the BSS splits the audio into two channels and they run two KWS instances to detect which channel holds the KW and the following command, so models could be created for that.
Rather than just I2S mics, you could use an I2S ADC with a MAX9814, as the AGC on those is pretty awesome; apart from closer tolerances and smaller size, a MEMS mic is no different.
Willow are still using the Espressif KWS models and they can be hit or miss; I think they work on a rolling window with a fairly slow rate, and much of the rejection is the KW not fitting the current window, or it's just bad :slight_smile:

I would like to set up a page for recommended RPi products. Really, I wish people could buy the MAX9814 mic you linked in a nice little USB package :grinning_face_with_smiling_eyes:

It seems like there’s no product that isn’t:

  1. Meant for something else and therefore more expensive (Anker C300 webcam)
  2. Meant for something else and therefore not as performant (Anker S330 speakerphone)
  3. In pieces and requires soldering, etc.

So I can’t just plug it in and select as input?

All that is needed is a 3.5mm TRS jack lead ending in Dupont connectors, and then no soldering is needed.
Just never found one.

It's not only the MAX9814 board; any analogue preamp with silicon AGC can extend any USB sound card into near/far field, as they really expect close-field mics, which is why input volume is often low.
So any mic preamp, or even a MEMS mic with built-in AGC, would do; it's just that the MAX9814, with controllable gain and AGC, is widely available. We just lack 3.5mm TRS jack plugs ending in Dupont connectors, though surely they must exist or be easily obtained.

My fave USB adapter, because it's a very rare stereo ADC, is the Plugable USB Audio Adapter at $9.95.

That simple analogue 2-mic array, spaced 71.45mm, could be used on various devices from Pi to ESP32-S3 with a low-cost ADC, so that you can use the special Alexa audio sauce in
https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/audio_front_end/README.html

Send the stereo channels to a Pi and run 2x KWS to get the KW hit and select the best channel from the BSS output.
Then you actually have a farfield mic that uses BSS for audio preprocessing.

Likely we can do the 2x KWS on an ESP32-S3 with TFLite Micro and just send the 'voice command audio'.
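The channel-pick idea is only a few lines. This is a hypothetical illustration, not Espressif's actual code: `pick_channel`, the logits vectors and the keyword index are all assumed names for whatever your KWS model outputs.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over one KWS output vector.
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def pick_channel(logits_ch0, logits_ch1, kw_index):
    # Run the same KWS model on both BSS output channels and keep the
    # channel whose softmax score for the wake word is highest.
    p0 = softmax(logits_ch0)[kw_index]
    p1 = softmax(logits_ch1)[kw_index]
    return 0 if p0 >= p1 else 1

# e.g. channel 0 clearly contains the keyword (index 1):
best = pick_channel(np.array([0.1, 2.0]), np.array([0.1, 0.5]), kw_index=1)
```

The following command audio would then be streamed from whichever BSS channel won.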

But yeah, getting some easily available components would be a massive plus, considering we are not talking about much more than 3.5mm TRS jack leads ending in Dupont connectors and pre-made Home Assistant housings.

At least then we would be a little closer to commercial performance in recognition quality, and surely that is better than pushing units that are extremely poor in function and recognition purely because they come in a housing; for this purpose they are relative e-waste.

On a Pi I am still searching for a BSS algorithm to turn into a nice efficient C/C++ routine, and it's just a shame the BSS from Espressif is a binary blob.
I did do a simple delay-sum beamformer, but a KWS needs to lock on to the command sentence, and even the ReSpeaker 2-mic, which is better than some think, still needs a case.
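For reference, a delay-sum beamformer of the kind mentioned is only a few lines in NumPy. A minimal sketch, assuming a 2-mic array and an integer sample delay; the function names are my own:

```python
import numpy as np

def estimate_delay(ch0, ch1, max_delay):
    # Pick the integer lag (in samples) that maximises the
    # cross-correlation between the two mic channels.
    lags = list(range(-max_delay, max_delay + 1))
    scores = [np.dot(ch0, np.roll(ch1, -lag)) for lag in lags]
    return lags[int(np.argmax(scores))]

def delay_and_sum(ch0, ch1, delay_samples):
    # Align the second channel on the estimated delay and average:
    # coherent speech adds up while uncorrelated noise partially cancels.
    return 0.5 * (ch0 + np.roll(ch1, -delay_samples))
```

A real implementation would use fractional delays and block-wise processing; the point here is that steering at the strongest source does not tell the KWS which source is the command, which is where BSS plus 2x KWS helps.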

So for me, if someone conversant with the ESP32 could hack out the Audio Front-end Framework above (ESP32-S3, ESP-SR documentation) and just couple it to TFLite Micro the same way Espressif do, that is really just 2x KWS where the BSS stream with the highest softmax is used.

Whether the 2-mic preamp with AGC is made or assembled from parts using electrets or MEMS, I don't really care, just that one exists.
Otherwise we are still at the same point with Year of the Voice, where an Alexa or Google likely just works so much better, and, when we advocate poorly fitting speakerphones and webcams, is even much cheaper.

PS: this was just a hack of 2 existing projects that I converted to realtime, but if anyone would like to clean up the code and optimise the FFT to use NEON, please do, as it is at least some initial audio processing, which is a massive part of what smart speakers do.

I name-drop the MAX9814 and that USB adapter because, apart from the 3.5mm TRS jack to Dupont lead, you can drill a hole in a case and push-fit the mics into 9.5mm rubber grommets; they are available and relatively easy.
The honest truth, though, is that devices and systems for voice control of the standard many are used to on other devices may not arrive this year.

I have a hunch that Espressif use some form of the DUET BSS algorithm, because, as opposed to many other algorithms, its computational load is lower; the maths and C/C++ skills needed are way beyond my simple hack ability.

Noise with smart speakers is often command voice vs media noise, and the sources are often clearly spatially different.
BSS is not perfect, but on the 80/20 rule it covers cases that static noise filters or AEC (which only processes 'own' noise) cannot; likely it's a variation of what Google do with VoiceFilter-Lite, since they scrapped beamforming and now have just 2 mics and lower cost.

I am not an ESP32-S3 fanboy either, as I keep dodging how to use their IDF, but they do have an Alexa-certified Audio Front-end Framework, and the parts needed are actually fewer than they put in their S3 box systems.

You can not just stick a single mic input with no audio processing in front of a synthesized-voice KWS and a full-vocab ASR with simple word stemming for control, and say voilà, 'Year of the Voice'.
Not compared with the many engineered systems most people are now used to.
Maybe call it the Home Assistant AIY voice kit and declare the scope of intent.

1 Like

During the video they mentioned sharing trained models somewhere so we don't all create the same thing over and over; has this been set up yet?

What are some examples of custom wake words people are using?

I’m thinking…
Potato, Hey Potato, Oi Potato

4 Likes

This is amazing! Great work all!!

I have just ordered a bunch of omni microphones and speakers to make DIY satellites with ESPs.

In the meantime, I have got the wake word working on a NUC8i5BEH with built-in microphone array and it is working great :partying_face:

1 Like

I have tried the recommended satellite with a Raspberry Pi 3 and an Anker S330, and it works fine; so far no wake word freeze, so it seems to be related only to the ATOM.

4 Likes

How is the recognition quality / performance with the Raspi and S330? I find the ATOM to be OK if I have it at my desk, or when speaking directly to it from <10 feet away. My ESP Boxes are better; I can speak indirectly or yell from another room in the house, but I do have to adjust my speech cadence to get the best results. I am curious what other devices or microphones people are hooking up to test with this.

Hi! I have a Jabra Speak 510, plugged in via USB to my home lab server, which is hosting the single/main instance of HA in Docker.

Basically, it's not a voice satellite setup, and since it's on Docker, I can't use the "Assist Microphone" add-on (which is meant for HA OS). PS: I have openWakeWord deployed in Docker and well connected to HA (it works well from my remote laptop mic, using "OK Nabu" in the Debug Assistant).

I didn't get how (or if?) I can make the Jabra hear the wake word?

You may have to share your sound device via the docker run command / YAML.

You can have multiple Docker containers all sharing the same device, but each container needs an asound.conf, or file(s) shared from the host as asound.conf.

Doug or Mike may be able to help, as a long time back we got this working fine with a ReSpeaker 2-mic via ALSA for host and container; it should be the same between containers.

You just have to remember each container acts as an isolated instance, so you have to give access and use dmix and dsnoop with ipc_key and ipc_perm to share the device so it's not blocked against multiple use.
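For example, a shared asound.conf along those lines might look like the following. This is a sketch only: the card/device numbers (hw:1,0), rates and ipc_key values are assumptions, so check yours with `aplay -l` / `arecord -l`.

```
# /etc/asound.conf, bind-mounted read-only into every container.
pcm.dmixed {
    type dmix
    ipc_key 1024
    ipc_perm 0666          # let any container/user attach
    slave { pcm "hw:1,0" rate 48000 }
}
pcm.dsnooped {
    type dsnoop
    ipc_key 2048
    ipc_perm 0666
    slave { pcm "hw:1,0" rate 48000 }
}
pcm.!default {
    type asym
    playback.pcm "dmixed"
    capture.pcm "dsnooped"
}
```

With dmix for playback and dsnoop for capture, several containers can open the same card at once instead of the first one blocking the rest.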

It is better than the Atom; it's faster and can respond multiple times within seconds, you just need to say the wake word again. I've found that it sometimes fails to get the word, but that is just my native language, where the speech-to-text hears the wrong thing; I need to create a better word or alias. It's not related to the speaker or the solution, because with simpler words in my native language it does really well, even from across the room!!!

So yes, it's a better solution if you have a Raspberry around and want to spend the money on the Anker …

4 Likes

Glad to know it is working! Can you point me to how to install the esp32-s3-box-lite.yaml file (https://github.com/esphome/firmware/blob/1cc35128b9d3d2e7edf2dd62331a058cc27e754d/voice-assistant/esp32-s3-box-lite.yaml) on the ESP32 S3 Box? Just need to know how to start, because I've not messed with ESP32 stuff before. I've searched, but I guess I'm not searching the right words.

Ah, cool that you narrowed it down. The Pi was also connected via WiFi, from what I see in the picture, so I guess that rules network topology out as a reason.

Although I’m now wondering if a pi3 (or even older) wouldn’t be strong enough to do the wake word detection directly on the device without streaming audio to HA :wink:

Thanks for your response! I tried my luck by creating the (missing) file /etc/asound.conf on my host (as detailed in the thread) and re-created the volumes (docker compose down -v), with the additional 2 lines below (see comments):

  homeassistant:
    container_name: homeassistant
    image: "homeassistant/home-assistant:latest"
    volumes:
      - ~/docker/homeassistant/config:/config
      - ~/docker/data/media:/media
      - /etc/localtime:/etc/localtime:ro
      - /run/dbus:/run/dbus:ro
      - /etc/asound.conf:/etc/asound.conf:ro    # share audio to the container
    ipc: host                                   # share audio to the container
    privileged: true
    network_mode: host
    environment:
      - PUID=1000
      - PGID=1000
    restart: unless-stopped

Though, it doesn’t seem to work… even if i couldn’t find a way to really validate if the speaker/mic are sent “from the guest to the host” (no cli tools like speaker-test nor aplay, in the guest HA image).

It’s not that it bothers me too much; I suppose my setup is not that exotic, so i count on more advanced/independent users from the community, to validate a working solution soon!

Looks like libasound is missing from the container, and it's probably easier to just get @synesthesiam to update it.
You might be the first to try a local audio input rather than a satellite; you can install it if you exec into the container, then do a docker commit to save the container's changes to a new image.
Likely libasound2-plugins also needs to be recompiled after the latest versions of Speex and SpeexDSP, as the Speex plugins don't get compiled because, for some reason, Debian still lags behind on the final RC of Speex.

I am just mentioning that here so that maybe @synesthesiam will notice it.

I’m thinking of much ruder.

I’m thinking of the expressions on visitors’ faces when they see me say “hey f***er, turn the lights down”

2 Likes

I use Echo One.

Can you try adding this to your compose?

homeassistant:
    ...
    devices:
        - "/dev/snd:/dev/snd"

There is no alsa-utils in this Docker Compose setup:

version: '3.8'

services:
  openwakeword:
      container_name: openwakeword
      image: rhasspy/wyoming-openwakeword:latest
      restart: unless-stopped
      devices:
        - "/dev/snd:/dev/snd"
      ports:
        - "10400:10400"

  whisper:
      container_name: whisper
      image: rhasspy/wyoming-whisper:latest
      restart: unless-stopped
      ports:
        - "10300:10300"
      command: ["--model", "tiny-int8", "--language", "en"]
      volumes:
        - ./wyoming:/data

  piper:
      image: rhasspy/wyoming-piper:latest
      container_name: piper
      restart: unless-stopped
      ports:
        - "10200:10200"
      volumes:
        - ./wyoming:/data
      command: ["--voice", "en_US-lessac-medium"]

I don’t usually use docker compose but should it not have a dockerfile entry such as

openwakeword:
  build: 
      context: .
      dockerfile: Dockerfile_openwakeword

Then in the same dir have a file Dockerfile_openwakeword

FROM rhasspy/wyoming-openwakeword:latest
RUN apt-get update && apt-get install -y alsa-utils

?

Someone will probably correct me, as I've never used Compose before, but it looks like you could maybe have just one Dockerfile with multiple FROM clauses?
Dunno, I exec into the container and do it manually, and then:

root@91bed14e90b7:/# aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: rockchipdp0 [rockchip-dp0], device 0: rockchip-dp0 spdif-hifi-0 [rockchip-dp0 spdif-hifi-0]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: rockchiphdmi0 [rockchip-hdmi0], device 0: rockchip-hdmi0 i2s-hifi-0 [rockchip-hdmi0 i2s-hifi-0]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 2: rockchipes8388 [rockchip-es8388], device 0: dailink-multicodecs ES8323.6-0010-0 [dailink-multicodecs ES8323.6-0010-0]
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 3: Device [USB Audio Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

Dunno, docker compose is freaking me out as it doesn't seem to be ephemeral, but someone will correct me; I don't think it's running that Dockerfile.