Year of the Voice - Chapter 4: Wake words

I’m definitely keen on this too. I was hoping to use “Jarvis” but with OpenWakeWord we only have access to “Hey Jarvis”.

I’d imagine this type of thing will be common amongst us, so sharing the trained models is a great idea. That way we can also download a bunch of them to swap between if we get bored with them.

Just set up my M5 Echo. I am hearing a slight stutter during playback. Is anyone else hearing that, or does anyone know how to tweak it if it is a CPU bottleneck? Or is it slight clipping? I’m not sure.

Do you just get that at the very start of the playback? I get that when playing TTS to my M5Stack SPK kits.

I never got to the bottom of it.

No, at various points during the playback. Not sure if this will work, but here is a snippet: SndUp | Post Info. You can hear the stutter when “window” is spoken the second time.

It’s not playing for me.

So, if you had to rank these solutions, how would you order them:

  1. Raspberry Pi 3 or 4 + ReSpeaker 6-mic + ODAS
  2. ESP32-S3 Korvo 2 (2 mic) + ESP-IDF’s AFE + some VAD to avoid streaming without voice activity
  3. Any Pi-like board (like Orange Pi Zero 2 or 3) + pluggable USB audio + MAX9814 amplified microphone + some audio processing (what? how?)
  4. ESP32-S3 BOX 3 (or lite)
  5. M5 Stack Echo

In terms of output audio quality?
In terms of sensing distance?
In terms of price to own?
In terms of effort required to reach good performance?


I tried ODAS, so an X on that one, as the quality is bad. In fact, all beamforming is bad if there is no mechanism to lock the beam.
That would likely be done via a KWS that can attain a target: you collate beamform info for the KW period and maintain that beamform until the command sentence stops.
Delay-sum is the simplest sort of beamformer, and I did an example if someone ever wants to tidy and optimise my first-ever C/C++ hack.

The margin is the max TDOA, which is dictated by the mic distance (mm) divided by the distance sound travels per sample (speed of sound, 343 m/s, divided by the sample rate). With 60 mm mic spacing @ 48 kHz: 60 / (343000/48000) = 8.39, so an integer margin of 8 samples. 16 kHz only gives 2.8, so it is of no use.
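
In plain Python the same margin calculation looks like this (just the arithmetic above, nothing model-specific):

# Max-TDOA margin: mic spacing divided by the distance sound travels per sample.
SPEED_OF_SOUND_MM_S = 343_000  # 343 m/s expressed in mm/s

def mm_per_sample(sample_rate_hz):
    """Distance sound travels during one sample period, in mm."""
    return SPEED_OF_SOUND_MM_S / sample_rate_hz

def max_tdoa_samples(mic_spacing_mm, sample_rate_hz):
    """Largest possible time difference of arrival between the mics, in samples."""
    return mic_spacing_mm / mm_per_sample(sample_rate_hz)

print(max_tdoa_samples(60, 48_000))  # ~8.39 -> integer margin of 8
print(max_tdoa_samples(60, 16_000))  # ~2.80 -> too coarse to be useful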

/tmp/ds-out contains the current TDOA, so poll it to set LEDs. To monitor it: watch -n 0.1 cat /tmp/ds-out

To fix the beamformer, write a file to /tmp/ds-in: echo 1 > /tmp/ds-in sets the beam to a delay of 1. Delete the file to clear it and go back to using the live TDOA.
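
A rough Python sketch of driving that file interface; the /tmp/ds-out and /tmp/ds-in paths are as above, but the polling loop and the wake-word hook are just placeholders of mine:

import os
import time

DS_OUT = "/tmp/ds-out"   # beamformer writes the current TDOA here
DS_IN = "/tmp/ds-in"     # writing a delay here locks the beam

def current_tdoa():
    """Read the live TDOA the beamformer last reported, or None."""
    try:
        with open(DS_OUT) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None

def lock_beam(delay):
    """Fix the beam to a given delay for the rest of the command sentence."""
    with open(DS_IN, "w") as f:
        f.write(f"{delay}\n")

def unlock_beam():
    """Remove /tmp/ds-in so the beamformer follows the live TDOA again."""
    if os.path.exists(DS_IN):
        os.remove(DS_IN)

def wake_word_detected():
    """Placeholder: wire a real KWS hit signal in here."""
    return False

while True:
    tdoa = current_tdoa()            # e.g. use this to set LEDs
    if tdoa is not None and wake_word_detected():
        lock_beam(tdoa)              # hold the beam on the speaker
        time.sleep(3.0)              # crude stand-in for "command sentence over"
        unlock_beam()
    time.sleep(0.1)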

You need some feedback so the KWS controls the beam, and all of that is currently missing. That is why conference mics and speakerphones don’t work all that well with any dynamic noise: they are free-floating, beamforming onto the predominant input, with no target held for the full command sentence.

I used delay-and-sum on a simple 2-mic setup because with more advanced algorithms and more mics the computational needs rise exponentially.
There is MVDR / GSC and a whole raft of beamforming algorithms that are only really suited to high-speed DSPs.
So no to the ReSpeaker 6-mic; use their DSP with the USB version instead.
The speed of sound is 343 m/s, i.e. 343,000 mm/s; divide that by your sample rate, say 48 kHz, and you get 7.14 mm per single sample of difference, and really you don’t want to go above approx 65 mm spacing due to aliasing.
But @ 71.45 mm you get a 10-sample difference to do your beamforming on, and that is the only reason MEMS mics are beneficial: they hold really tight tolerances.
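
For what it’s worth, 2-mic delay-and-sum really is just “shift one channel by the TDOA and average”; a toy NumPy sketch of the idea (mine, not the C/C++ example above, and the TDOA estimate here is a plain cross-correlation peak):

import numpy as np

def estimate_tdoa(left, right, max_lag):
    """Integer-sample delay of `right` relative to `left`, from the
    cross-correlation peak within +/- max_lag samples."""
    corr = np.correlate(left, right, mode="full")
    mid = len(right) - 1                      # index of zero lag
    window = corr[mid - max_lag: mid + max_lag + 1]
    return int(np.argmax(window)) - max_lag

def delay_and_sum(left, right, lag):
    """Align `right` to `left` by `lag` samples and average the pair."""
    return 0.5 * (left + np.roll(right, lag))

# 60 mm spacing @ 48 kHz -> margin of 8 samples, as worked out above.
fs, margin = 48_000, 8
left = np.random.randn(fs // 10)              # stand-in for a 100 ms mic frame
right = np.roll(left, 3)                      # simulate a 3-sample arrival delay
lag = estimate_tdoa(left, right, margin)      # -> -3
beam = delay_and_sum(left, right, lag)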

So no to beamforming: wow, it takes some computational power, and unless you’re Amazon and bake it into silicon it’s probably a big no, as they often run above 48 kHz to garner a bigger sample difference.

Also, watching how Google proceeded: they copied Amazon at the start, then thought “hold on, this beamforming isn’t that great anyway” and dropped to 2 mics with some sort of BSS instead (VoiceFilterLite), as likely, in that 80/20 rule of smart speakers being a domestic consumer product, the biggest noise source is often very spatially distinct from the voice command.
So they likely use a DUET-type 2-channel BSS and maybe even run without AEC.
Of the two, Amazon is losing huge sums of money, whereas for Google it’s not so much.

I have never found any Linux realtime code for 2-channel BSS, and I have been searching for a long time; it will take a better open-source DSP guru than me.
Espressif do have a binary blob in their ADF:
https://www.espressif.com/en/solutions/audio-solutions/esp-afe

It’s any S3: likely an ADC module, 2x MAX9814 and a £7 ESP32-S3 like the T7-S3 – LILYGO® (it’s so cute) could do the job; just use the ESP-AFE (audio front-end) and websocket libs (TCP).
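
On the upstream end that can be as little as a small websocket server collecting the raw PCM frames the S3 sends and handing them to whatever KWS/ASR runs there; a rough sketch with Python’s websockets package (port, framing and the file dump are just my assumptions):

import asyncio
import websockets   # pip install websockets

async def handle_stream(websocket, path=None):
    """Receive binary frames of 16 kHz / 16-bit mono PCM from one device
    (the `path` argument is only kept for older websockets versions)."""
    pcm = bytearray()
    async for frame in websocket:
        pcm.extend(frame)            # hand `frame` to a KWS/ASR pipeline instead
    with open("capture.raw", "wb") as f:
        f.write(pcm)                 # crude: dump whatever arrived on disconnect

async def main():
    async with websockets.serve(handle_stream, "0.0.0.0", 8765):
        await asyncio.Future()       # serve forever

if __name__ == "__main__":
    asyncio.run(main())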

I hate the term satellite, as there is no such thing: in automation we deal with devices, and the term satellite seems to assume a mic system, pixel indicator and audio out, whilst for me those are all separate items allocated to a zone or subzone (assembly), but essentially separate devices.

The ESP32-S3 Korvo is far too expensive, and even the ESP32-S3-Box, as what you should be able to do is position several mics in a zone so that pickup is always favourable relative to any sources of noise.
Likely 2 or 4 mics arranged to suit the room setup, where simple positional physics is cheaper and of higher quality than trying to extract a voice command from a tsunami of noise.

On the ESP32-S3 the BSS splits the signal into 2 spatially different signals, and I have a hunch that they run 2x KWS and the one with the hit uses that stream for that command sentence.
Jettison all the dross off an ESP32-S3-Box: who needs a screen so small that, to be able to see what it says, you have to be so close that voice control is pointless and it might as well be a switch?
The onboard amp is toy-like, so is the TTS, and the ASR is pretty limited.

So I am an ESP32-S3 fan; is it the PCM1801? 2x I2S AliExpress modules are available, as are MAX9814 mics, so just hook them up to the AFE and even just send the 2 streams to a Pi 4 for KWS.
I think with TFLite for Microcontrollers we can do exactly what David Scripka has done, but to be honest I could not care less about custom KWs; I just want a KW that works 100% and well.
Big data does this with high-quality datasets of real use, collating metadata about hardware and users, and has created huge high-quality datasets.
Allow users to opt in to collate a dataset of use, but also collate KWs and command sentences locally and use transfer learning, where a smaller on-device-trained KWS model biases the weights of the pretrained one so it creates a profile of use.

I think upstream on a Pi or similar, due to the huge idle lengths, it can slowly tick away, train small models on the captured data and OTA the KW model to the KW devices.
So it’s not exactly on-device training, it’s upstream on-device training, and the KWS will get more accurate through use and adapt to the users and hardware it consists of.
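
A minimal sketch of what that upstream fine-tuning step could look like, assuming a Keras KWS model and some locally captured, labelled clips; the file names, shapes and hyperparameters are all placeholders of mine:

import numpy as np
import tensorflow as tf

# Placeholder local data: spectrogram features and labels built from the
# keyword / command-sentence captures collected during idle time.
local_features = np.zeros((64, 49, 40), dtype=np.float32)
local_labels = np.zeros(64, dtype=np.int32)

# Pretrained KWS model (path is a placeholder).
model = tf.keras.models.load_model("pretrained_kws.h5")

# Freeze everything but the classification head so the local data only
# biases the last weights, i.e. builds a profile of the users/hardware.
for layer in model.layers[:-1]:
    layer.trainable = False

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

local_ds = tf.data.Dataset.from_tensor_slices((local_features, local_labels)).batch(32)
model.fit(local_ds, epochs=5)

# Convert for OTA to the KW devices (e.g. TFLite Micro on an ESP32-S3).
tflite_model = tf.lite.TFLiteConverter.from_keras_model(model).convert()
with open("kws_profiled.tflite", "wb") as f:
    f.write(tflite_model)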

A Pi Zero is like the M5, but at least you have a choice of audio hardware.
The 2-mic HAT often ends up facing the ceiling, so you lose all the rear rejection the PCB gives, which a right-angle connector or wall mounting should fix.
It’s not capable of running any algs or KWS, but like the M5 it could transmit the streams upstream, and it also gives you the option to improve things with mic preamps and analogue AGC, or just use the 2-mic as it’s not that bad and an improvement over the M5 for not much more.

A Pi Zero 2, or many of the really great clones we have now on a Cortex-A53/A55, has more power than an ESP32-S3, but we don’t have any open-source BSS such as this commercial one:
https://vocal.com/blind-signal-separation/degenerate-unmixing-estimation/

So for me currently it’s only an ESP32-S3 that also only broadcasts on a KW hit, as I hate the idea of 24/7 bugging and eavesdropping on my own home wirelessly.
I think any $20 SBC should do an even better job, but until some open-source C/C++ DSP guy(s) donate their time to create some form of BSS, we are stuck with beamforming, which I think is a bit meh, or the ESP32-S3 and the Espressif blobs.

I should say, though, there are some amazing filters; has anyone ever checked the quality of the following?

The LADSPA filter will run on a Pi 5 (it runs on an OPi5 with just the odd buffer underrun) via ALSA or PipeWire, and its quality is actually RTX Voice-like. (Really, if someone dropped the Sonos tract engine and ported it to TFLite or ONNX, it would not be single-thread only and would not need a single big core.)
So upstreaming full-bandwidth audio to a filter is also another choice.
Also allow users to opt in and donate metadata-rich data to the dev/provider to make better KWS & ASR, as we can make great datasets and collate them as we go along.
We just need hardware, gender, age band, region and native language; we do not need your name or address…
Then open source can start to compete with the huge dataset advantage big data has.

Yup, I saw this and am actively working to get it working, but it would still be nice if this was supported natively.

Same here, let me know if you get it working. As of now I can set up the stream_assist component, but can’t configure it without the component returning a 500 error. :confused:

There’s the MATRIX Voice (matrix-io.github.io), a mic array with an ESP32 onboard. Would it be possible to use this board as the “microphone” for Home Assistant?

If you haven’t already got it, then no, and don’t.
It’s just a multi-mic array with no processing and a plain ESP32 (it’s only the more powerful vector-enabled S3 that can fully utilise the Espressif AFE).

If you do have it, then you could use those channels if you do the algs yourself, or just use a single channel (it looks the part, though).


Thanks! I’ve found this project: speechbrain/sepformer-whamr · Hugging Face, which uses the SpeechBrain library made by the authors of ODAS (no updates for two years or so). I tried it in Colab and indeed, it’s quite efficient at separating the speakers in a multi-speaker recording or cleaning up a noisy environment (just add this to their Colab to test these features):

from speechbrain.pretrained import SepformerSeparation as separator
import torchaudio

# Speaker separation
model = separator.from_hparams(source="speechbrain/sepformer-whamr", savedir='pretrained_models/sepformer-whamr')

# for custom file, change path
est_sources = model.separate_file(path='noisyenv3.wav') 

torchaudio.save("sourcenoisy1.wav", est_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save("sourcenoisy2.wav", est_sources[:, :, 1].detach().cpu(), 8000)

# And cleaning the audio
model2 = separator.from_hparams(source="speechbrain/sepformer-whamr-enhancement", savedir='pretrained_models/sepformer-whamr-enhancement')

est_sources2 = model2.separate_file(path='noisyenv2.wav') 

torchaudio.save("enhanced_noisy.wav", est_sources2[:, :, 0].detach().cpu(), 8000)

Indeed, it’s way better than ODAS, but I doubt it’d ever run on an RPi 5 in realtime unless their models are simplified, quantised and shrunk.

I presume ‘sepformer’ is some type of transformer architecture, for which there are quite a lot of optimised libs, and quantisation could speed it up.
That level of ML and C/C++ is beyond me though; I had a look at the sepformer and likely the model will need to be retrained for 16 kHz.
It’s so long ago that I ran ODAS, but it did run on a Pi 3; the results were ‘prototype/research’ level rather than anything useful.
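
If anyone wants to try it on 16 kHz captures before any retraining, resampling down to the model’s 8 kHz is the quick workaround; something like this with torchaudio (file names are placeholders, and I’m assuming SpeechBrain’s separate_batch interface):

import torchaudio
from speechbrain.pretrained import SepformerSeparation as separator

model = separator.from_hparams(source="speechbrain/sepformer-whamr",
                               savedir="pretrained_models/sepformer-whamr")

# Load a 16 kHz mono capture and resample to the 8 kHz the model was trained on.
waveform, sr = torchaudio.load("capture_16k.wav")
waveform_8k = torchaudio.functional.resample(waveform, sr, 8000)

est_sources = model.separate_batch(waveform_8k)   # (batch, time, n_sources)
torchaudio.save("source1_8k.wav", est_sources[:, :, 0].detach().cpu(), 8000)
torchaudio.save("source2_8k.wav", est_sources[:, :, 1].detach().cpu(), 8000)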

I am pretty sure there are algs, as Espressif are using one, but I’ve actually never heard the output it gives (I think it’s likely some form of DUET but could be just FastICA), so it’s possible, just not done.
If you stick a KWS on each separated channel and use transfer learning of an ‘on-device-trained model’, you’re effectively coming closer to Google’s targeted voice extraction of VoiceFilterLite for the KW.
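
A quick sketch of that “KWS per separated channel” idea, assuming two already-separated 16 kHz streams and openWakeWord’s Python API; the frame size, threshold and use of the default models are my assumptions:

import numpy as np
from openwakeword.model import Model

FRAME = 1280          # 80 ms of 16 kHz int16 audio per prediction
THRESHOLD = 0.5       # arbitrary score threshold

detectors = [Model(), Model()]   # one wake-word detector per separated channel

def pick_channel(frames):
    """frames: one int16 numpy frame per channel. Returns the index of the
    channel whose wake-word score crosses the threshold, or None."""
    best_idx, best_score = None, 0.0
    for idx, (det, frame) in enumerate(zip(detectors, frames)):
        scores = det.predict(frame)          # {model_name: score}
        score = max(scores.values())
        if score > THRESHOLD and score > best_score:
            best_idx, best_score = idx, score
    return best_idx

# A real pipeline would then stream only the winning channel to ASR
# until the end of the command sentence.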

A single channel and a filter such as GitHub - Rikorose/DeepFilterNet: Noise supression using deep filtering, or the even lighter GitHub - breizhn/DTLN: Tensorflow 2.x implementation of the DTLN real time speech denoising model. With TF-lite, ONNX and real-time audio processing support. (not the same level though), would likely be easier; just convert the model to something other than Sonos tract (single-thread only).

You could just use the simple beamformer I created to help lower reverberation, with a bit of speech enhancement, as it will easily run as-is on anything Pi Zero 2 or above (it works, but with no prior C/C++ I just hacked it), and feed that to an upstream filter.
I just would have liked someone a little more C/C++ conversant to maybe thread it as well and optimise it for NEON so it’s super light.

I did find one piece of IVA code that I should look at again, as the Android and folder structure is confusing me as to the actual source.

I looked at it before and it threw me, as the main alg seems to be named MVDR2, which I presumed meant MVDR, not IVA; the conversation made me look again and it would seem to be IVA, but it needs to be hacked out of Android into Linux audio.

I’m running HA, Wyoming, Piper, and openWakeWord in a Docker container stack, and using ATOM Echoes. The Echoes work for a while, and then stop. Looking at the ESPHome debug logs, they stop looping. Cutting power and restarting them doesn’t have any effect. I’m using ESPHome 2023.10.0 and 2023.10.1.

I did try restarting the OpenWakeWord container, and suddenly the Echoes went back to running the loop.

[12:28:41][D][voice_assistant:468]: Event Type: 2
[12:28:41][D][voice_assistant:550]: Assist Pipeline ended
[12:28:41][D][voice_assistant:366]: State changed from STREAMING_MICROPHONE to WAIT_FOR_VAD
[12:28:41][D][voice_assistant:372]: Desired state set to WAITING_FOR_VAD
[12:28:41][D][voice_assistant:176]: Waiting for speech...
[12:28:41][D][voice_assistant:366]: State changed from WAIT_FOR_VAD to WAITING_FOR_VAD
[12:28:41][D][voice_assistant:189]: VAD detected speech
[12:28:41][D][voice_assistant:366]: State changed from WAITING_FOR_VAD to START_PIPELINE
[12:28:41][D][voice_assistant:372]: Desired state set to STREAMING_MICROPHONE
[12:28:41][D][voice_assistant:206]: Requesting start...
[12:28:41][D][voice_assistant:366]: State changed from START_PIPELINE to STARTING_PIPELINE
[12:28:41][D][light:036]: 'M5Stack Atom Echo 888038' Setting:
[12:28:41][D][light:051]:   Brightness: 100%
[12:28:41][D][light:059]:   Red: 100%, Green: 0%, Blue: 100%
[12:28:41][D][voice_assistant:387]: Client started, streaming microphone
[12:28:41][D][voice_assistant:366]: State changed from STARTING_PIPELINE to STREAMING_MICROPHONE
[12:28:41][D][voice_assistant:372]: Desired state set to STREAMING_MICROPHONE
[12:28:41][D][voice_assistant:468]: Event Type: 1
[12:28:41][D][voice_assistant:471]: Assist Pipeline running
[12:28:41][D][voice_assistant:468]: Event Type: 9
[12:28:46][D][voice_assistant:468]: Event Type: 0
[12:28:46][D][voice_assistant:468]: Event Type: 2
[12:28:46][D][voice_assistant:550]: Assist Pipeline ended
[12:28:46][D][voice_assistant:366]: State changed from STREAMING_MICROPHONE to WAIT_FOR_VAD
[12:28:46][D][voice_assistant:372]: Desired state set to WAITING_FOR_VAD
[12:28:46][D][voice_assistant:176]: Waiting for speech...
[12:28:46][D][voice_assistant:366]: State changed from WAIT_FOR_VAD to WAITING_FOR_VAD
[12:28:46][D][voice_assistant:189]: VAD detected speech
[12:28:46][D][voice_assistant:366]: State changed from WAITING_FOR_VAD to START_PIPELINE
[12:28:46][D][voice_assistant:372]: Desired state set to STREAMING_MICROPHONE
[12:28:46][D][voice_assistant:206]: Requesting start...
[12:28:46][D][voice_assistant:366]: State changed from START_PIPELINE to STARTING_PIPELINE
[12:28:46][D][light:036]: 'M5Stack Atom Echo 888038' Setting:
[12:28:46][D][light:051]:   Brightness: 100%
[12:28:46][D][light:059]:   Red: 100%, Green: 0%, Blue: 100%
[12:28:46][D][voice_assistant:387]: Client started, streaming microphone
[12:28:46][D][voice_assistant:366]: State changed from STARTING_PIPELINE to STREAMING_MICROPHONE
[12:28:46][D][voice_assistant:372]: Desired state set to STREAMING_MICROPHONE
[12:28:46][D][voice_assistant:468]: Event Type: 1
[12:28:46][D][voice_assistant:471]: Assist Pipeline running
[12:28:46][D][voice_assistant:468]: Event Type: 9

There’s nothing in the OpenWakeWord container logs that shows an error.

INFO:root:Ready
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO:root:Ready
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO:root:Ready
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO:root:Ready
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO:root:Ready
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.

But clearly the stream from the Echoes to HA/OpenWakeWord is silently failing after a period of time, and the Echoes sit idle.

Edit: Failed after about 5 minutes.

[12:32:08][D][voice_assistant:468]: Event Type: 0
[12:32:08][D][voice_assistant:468]: Event Type: 2
[12:32:08][D][voice_assistant:550]: Assist Pipeline ended
[12:32:08][D][voice_assistant:366]: State changed from STREAMING_MICROPHONE to WAIT_FOR_VAD
[12:32:08][D][voice_assistant:372]: Desired state set to WAITING_FOR_VAD
[12:32:08][D][voice_assistant:176]: Waiting for speech...
[12:32:08][D][voice_assistant:366]: State changed from WAIT_FOR_VAD to WAITING_FOR_VAD
[12:32:08][D][voice_assistant:189]: VAD detected speech
[12:32:08][D][voice_assistant:366]: State changed from WAITING_FOR_VAD to START_PIPELINE
[12:32:08][D][voice_assistant:372]: Desired state set to STREAMING_MICROPHONE
[12:32:08][D][voice_assistant:206]: Requesting start...
[12:32:08][D][voice_assistant:366]: State changed from START_PIPELINE to STARTING_PIPELINE
[12:32:08][D][light:036]: 'M5Stack Atom Echo 888038' Setting:
[12:32:08][D][light:051]:   Brightness: 100%
[12:32:08][D][light:059]:   Red: 100%, Green: 0%, Blue: 100%
[12:32:08][D][voice_assistant:387]: Client started, streaming microphone
[12:32:08][D][voice_assistant:366]: State changed from STARTING_PIPELINE to STREAMING_MICROPHONE
[12:32:08][D][voice_assistant:372]: Desired state set to STREAMING_MICROPHONE
[12:32:08][D][voice_assistant:468]: Event Type: 1
[12:32:08][D][voice_assistant:471]: Assist Pipeline running
[12:32:08][D][voice_assistant:468]: Event Type: 9

Ok, maybe this is fixed in 1.8.0 of the Docker image. Wake word detection stops working · Issue #2 · rhasspy/wyoming-openwakeword · GitHub

I had to download it and then it would play…

Yep, that worked. It’s a similar stutter to what I get, but so far I’ve only noticed mine at the very start of the TTS playback. I’ll try a few tests this weekend and see if that has changed, as I haven’t used the device for a few weeks but it has had the ESPHome version updated a number of times over that period.


Has anyone managed to get wake words working with two pipelines? It seems that only the preferred voice assistant is detecting wake words for me.

I feel like I’m missing something. Why is there not an easy way to get this working on a tablet that’s running Home Assistant? For example, I have an always-on Fire tablet that has a microphone. I would love to be able to use it as a voice assistant!


Same here. Is there comprehensive documentation somewhere on how, for example, to activate a voice assistant and a satellite on a dedicated RPi?
HAOS runs on an RPi 4 8 GB, but I have no idea what to install in HA and what to install on the dedicated RPi acting as the satellite:
RPiOS + ?
HAOS?
I use Nabu Casa.