Year of the Voice - Chapter 4: Wake words

I believe the other wake words come with the Porcupine1 wake engine.

Right, ok. That explains it. Thanks.

This is … not fun :frowning:
Having installed Voice Assist, and got homeassistant-satellite on a RasPi3 to recognise my USB mic and headphones … I have been asking it to “Turn the study light on” or “Turn on the study light”.
With the --debug option I can see that it detected:

  • ’ Turn the study like on.’
  • ’ turn the study light’
  • ’ Turn the study light.’
  • ’ turn on the study flight’
  • ’ turn on the stubby light.’
  • ’ Turn on the stabbing light.’

Very much a surly teenager deliberately misinterpreting every command.

Finally I tried just “Turn on the light” and, despite it appearing to detect ’ Turn on the light.’, I once again got the “Sorry, I couldn’t understand that” :frowning: Or is that because HA doesn’t know which room my voice assist satellite is in?

DEBUG:homeassistant_satellite.remote:{'type': 'auth_required', 'ha_version': '2023.10.3'}
DEBUG:homeassistant_satellite.remote:{'type': 'auth_ok', 'ha_version': '2023.10.3'}
DEBUG:homeassistant_satellite.remote:{'id': 1, 'type': 'result', 'success': True, 'result': None}
DEBUG:homeassistant_satellite.remote:{'id': 1, 'type': 'event', 'event': {'type': 'run-start', 'data': {'pipeline': '01gzx9v1fjm5mmjvej04fadjv5', 'language': 'en', 'runner_data': {'stt_binary_handler_id': 1, 'timeout': 300}}, 'timestamp': '2023-10-16T04:51:04.091539+00:00'}}
DEBUG:homeassistant_satellite.remote:{'id': 1, 'type': 'event', 'event': {'type': 'wake_word-start', 'data': {'entity_id': 'wake_word.openwakeword', 'metadata': {'format': 'wav', 'codec': 'pcm', 'bit_rate': 16, 'sample_rate': 16000, 'channel': 1}, 'timeout': 3}, 'timestamp': '2023-10-16T04:51:04.091655+00:00'}}
DEBUG:__main__:wake_word-start {'entity_id': 'wake_word.openwakeword', 'metadata': {'format': 'wav', 'codec': 'pcm', 'bit_rate': 16, 'sample_rate': 16000, 'channel': 1}, 'timeout': 3}
DEBUG:homeassistant_satellite.remote:{'id': 1, 'type': 'event', 'event': {'type': 'wake_word-end', 'data': {'wake_word_output': {'wake_word_id': 'hey_rhasspy_v0.1', 'timestamp': 3490}}, 'timestamp': '2023-10-16T04:51:11.096731+00:00'}}
DEBUG:__main__:wake_word-end {'wake_word_output': {'wake_word_id': 'hey_rhasspy_v0.1', 'timestamp': 3490}}
DEBUG:homeassistant_satellite.remote:{'id': 1, 'type': 'event', 'event': {'type': 'stt-start', 'data': {'engine': 'stt.faster_whisper', 'metadata': {'language': 'en', 'format': 'wav', 'codec': 'pcm', 'bit_rate': 16, 'sample_rate': 16000, 'channel': 1}}, 'timestamp': '2023-10-16T04:51:11.096827+00:00'}}
DEBUG:__main__:stt-start {'engine': 'stt.faster_whisper', 'metadata': {'language': 'en', 'format': 'wav', 'codec': 'pcm', 'bit_rate': 16, 'sample_rate': 16000, 'channel': 1}}
DEBUG:homeassistant_satellite.remote:{'id': 1, 'type': 'event', 'event': {'type': 'stt-vad-start', 'data': {'timestamp': 3715}, 'timestamp': '2023-10-16T04:51:11.473481+00:00'}}
DEBUG:__main__:stt-vad-start {'timestamp': 3715}
DEBUG:homeassistant_satellite.remote:{'id': 1, 'type': 'event', 'event': {'type': 'stt-vad-end', 'data': {'timestamp': 4400}, 'timestamp': '2023-10-16T04:51:12.848287+00:00'}}
DEBUG:__main__:stt-vad-end {'timestamp': 4400}
DEBUG:homeassistant_satellite.remote:{'id': 1, 'type': 'event', 'event': {'type': 'stt-end', 'data': {'stt_output': {'text': ' Turn on the light.'}}, 'timestamp': '2023-10-16T04:51:13.319297+00:00'}}
DEBUG:__main__:stt-end {'stt_output': {'text': ' Turn on the light.'}}
DEBUG:homeassistant_satellite.remote:{'id': 1, 'type': 'event', 'event': {'type': 'intent-start', 'data': {'engine': 'homeassistant', 'language': 'en', 'intent_input': ' Turn on the light.', 'conversation_id': None, 'device_id': None}, 'timestamp': '2023-10-16T04:51:13.319343+00:00'}}
DEBUG:__main__:intent-start {'engine': 'homeassistant', 'language': 'en', 'intent_input': ' Turn on the light.', 'conversation_id': None, 'device_id': None}
DEBUG:homeassistant_satellite.remote:{'id': 1, 'type': 'event', 'event': {'type': 'intent-end', 'data': {'intent_output': {'response': {'speech': {'plain': {'speech': "Sorry, I couldn't understand that", 'extra_data': None}}, 'card': {}, 'language': 'en', 'response_type': 'error', 'data': {'code': 'no_intent_match'}}, 'conversation_id': None}}, 'timestamp': '2023-10-16T04:51:13.330464+00:00'}}
DEBUG:__main__:intent-end {'intent_output': {'response': {'speech': {'plain': {'speech': "Sorry, I couldn't understand that", 'extra_data': None}}, 'card': {}, 'language': 'en', 'response_type': 'error', 'data': {'code': 'no_intent_match'}}, 'conversation_id': None}}
DEBUG:homeassistant_satellite.remote:{'id': 1, 'type': 'event', 'event': {'type': 'tts-start', 'data': {'engine': 'tts.piper', 'language': 'en_GB', 'voice': 'en_GB-alba-medium', 'tts_input': "Sorry, I couldn't understand that"}, 'timestamp': '2023-10-16T04:51:13.330497+00:00'}}
DEBUG:__main__:tts-start {'engine': 'tts.piper', 'language': 'en_GB', 'voice': 'en_GB-alba-medium', 'tts_input': "Sorry, I couldn't understand that"}
DEBUG:homeassistant_satellite.remote:{'id': 1, 'type': 'event', 'event': {'type': 'tts-end', 'data': {'tts_output': {'media_id': "media-source://tts/tts.piper?message=Sorry,+I+couldn't+understand+that&language=en_GB&voice=en_GB-alba-medium", 'url': '/api/tts_proxy/dae2cdcb27a1d1c3b07ba2c7db91480f9d4bfd8f_en-gb_35f6e7cd1a_tts.piper.wav', 'mime_type': 'audio/x-wav'}}, 'timestamp': '2023-10-16T04:51:13.330692+00:00'}}
DEBUG:__main__:tts-end {'tts_output': {'media_id': "media-source://tts/tts.piper?message=Sorry,+I+couldn't+understand+that&language=en_GB&voice=en_GB-alba-medium", 'url': '/api/tts_proxy/dae2cdcb27a1d1c3b07ba2c7db91480f9d4bfd8f_en-gb_35f6e7cd1a_tts.piper.wav', 'mime_type': 'audio/x-wav'}}
DEBUG:root:play ffmpeg: ['ffmpeg', '-i', 'http://192.168.1.98:8123/api/tts_proxy/dae2cdcb27a1d1c3b07ba2c7db91480f9d4bfd8f_en-gb_35f6e7cd1a_tts.piper.wav', '-f', 'wav', '-ar', '22050', '-ac', '1', '-filter:a', 'volume=1.0', '-']
DEBUG:root:play: ['aplay', '-D', 'plughw:CARD=Headphones,DEV=0', '-r', '22050', '-c', '1', '-f', 'S16_LE', '-t', 'raw']
Playing raw data 'stdin' : Signed 16 bit Little Endian, Rate 22050 Hz, Mono
DEBUG:homeassistant_satellite.remote:{'id': 1, 'type': 'event', 'event': {'type': 'run-end', 'data': None, 'timestamp': '2023-10-16T04:51:13.330712+00:00'}}
DEBUG:__main__:run-end None
DEBUG:homeassistant_satellite.remote:Pipeline finished
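Since the text ' Turn on the light.' made it through STT intact, the failure above is in intent matching (`no_intent_match`), not recognition. One way to confirm that, and to check whether the missing room/device association is the culprit (note `'device_id': None` in the intent-start event), is to bypass wake word and STT and hit the default conversation agent directly over the same WebSocket API homeassistant-satellite uses. A minimal sketch, assuming a long-lived access token and the HA host from the log; the exact response shape can vary between releases:

```python
import asyncio
import json

import websockets  # pip install websockets

HA_URL = "ws://192.168.1.98:8123/api/websocket"  # host taken from the log above
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"           # placeholder

async def ask(text: str) -> None:
    async with websockets.connect(HA_URL) as ws:
        await ws.recv()                                           # auth_required
        await ws.send(json.dumps({"type": "auth", "access_token": TOKEN}))
        await ws.recv()                                           # auth_ok / auth_invalid

        # Same intent matching the pipeline runs after STT, minus wake word and STT.
        await ws.send(json.dumps({
            "id": 1,
            "type": "conversation/process",
            "text": text,
            "language": "en",
        }))
        result = json.loads(await ws.recv())
        speech = (result.get("result", {})
                        .get("response", {})
                        .get("speech", {})
                        .get("plain", {})
                        .get("speech"))
        print(f"{text!r} -> {speech!r}")

if __name__ == "__main__":
    for phrase in ("Turn on the light", "Turn on the study light"):
        asyncio.run(ask(phrase))
```

If “Turn on the study light” succeeds here while “Turn on the light” still returns the error, the issue is the missing room context rather than recognition, since “the light” on its own can only be resolved against an area.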

Ohhhh … changing microphone helped the audio quality immensely … but unfortunately not so much improvement in the recognition :frowning:

There is far too much running on an ESP32-S3-Box, and trying to do ASR & TTS on it spreads resources too thin.
Likely though it’s still a perfect platform for post-BSS (Blind Source Separation) processing, or to have KWS onboard and broadcast only on a KW hit.
Also the ESP32-S3-Box is just bloated hardware-wise, where £50 returns a lot that isn’t necessary.
Any ESP32-S3 can use the ADF, which has the only instance of a free BSS alg available, even though it’s a blob.
It would be quite simple to employ TFLite4Micro with a pretrained model to select the output from the BSS to get the ‘Voice’.
Basically Espressif run 2x KWS on the outputs to select which one is the voice command.
It just needs an ESP32-S3 add-on board with 2x mics and an ADC, and I recommend the MAX9814 on the analogue side to extend far-field.
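For what it’s worth, the channel-selection part of that idea is simple enough to sketch in Python with openWakeWord (on the ESP32-S3 it would be two TFLite4Micro KWS models instead). A rough sketch only, assuming the BSS stage already hands you two 16 kHz mono int16 streams and that 0.5 is a sensible detection threshold:

```python
import numpy as np
from openwakeword.model import Model  # pip install openwakeword

FRAME = 1280  # 80 ms of 16 kHz audio, the chunk size openWakeWord works on

# One streaming KWS instance per BSS output channel, since each keeps internal state.
kws = [Model(), Model()]  # loads the pre-trained default models

def pick_voice_channel(bss_outputs) -> int | None:
    """Run a KWS over each BSS output (int16, 16 kHz mono numpy arrays) and
    return the index of the channel whose wake-word score peaks highest,
    or None if neither crosses the assumed 0.5 threshold."""
    best_channel, best_score = None, 0.5
    for ch, audio in enumerate(bss_outputs):
        peak = 0.0
        for start in range(0, len(audio) - FRAME + 1, FRAME):
            scores = kws[ch].predict(audio[start:start + FRAME])  # {wake_word: score}
            peak = max(peak, max(scores.values()))
        if peak > best_score:
            best_channel, best_score = ch, peak
    return best_channel
```

Whichever channel wins is the one you forward to STT; the other is presumed to be the media/noise source.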

Very likely, as basic word stemming with a full-vocab simple ASR is likely to do so.
Likely you could train on the fly an n-gram LM (language model) that is implicit to the entities, and load up the ASR using that LM.
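The entity-specific LM idea is roughly this; a toy sketch with hypothetical entity names and sentence templates that just counts bigrams (a real setup would compile these into an ARPA file for the ASR decoder and rebuild it whenever entities change):

```python
from collections import Counter
from itertools import product

# Hypothetical entity names as they might be pulled from Home Assistant.
entities = ["study light", "kitchen light", "bedroom fan"]
templates = ["turn on the {}", "turn off the {}", "switch the {} on"]

# Expand templates x entities into the sentences the ASR should favour.
sentences = [t.format(e) for t, e in product(templates, entities)]

# Bigram counts with sentence boundary markers; this is the raw material
# an n-gram LM is estimated from.
bigrams = Counter()
for s in sentences:
    words = ["<s>"] + s.split() + ["</s>"]
    bigrams.update(zip(words, words[1:]))

print(bigrams.most_common(5))
```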

I still think an LLM would be much better than basic word stemming, as LLMs have really made basic word stemming obsolete.
What you do is use Langchain and have the entity sentences presented as documents.
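The retrieval half of that (the part Langchain would orchestrate before handing context to an LLM) can be sketched without Langchain itself; here with sentence-transformers and hypothetical entity sentences, just to show the idea of matching a mangled transcript to the nearest entity sentence by embedding similarity:

```python
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

# Hypothetical "documents": one sentence per exposed entity.
entity_sentences = [
    "turn on the study light",
    "turn off the study light",
    "turn on the kitchen light",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(entity_sentences, convert_to_tensor=True)

def closest_intent(transcript: str):
    """Return the entity sentence closest to a (possibly mangled) transcript."""
    query_emb = model.encode(transcript, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_emb)[0]
    best = int(scores.argmax())
    return entity_sentences[best], float(scores[best])

# e.g. try the mis-recognitions from earlier in the thread
print(closest_intent("turn on the stubby light"))
print(closest_intent("turn the study like on"))
```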

An LM would likely improve the current setup, but LLMs are really making all previous methods obsolete.
LMs are pretty old tech, but quick to create on the fly and to reload an ASR with on any entity changes.

Do you have any network infrastructure recommendations around this setup?
There are hints in a few threads that a constant UDP stream isn’t reliably supported by certain wifi setups.

I think if you’re using a Pi then it’s now using websockets, so TCP, but the M5Stack might well still be using UDP, which always was a bad idea, if only because you’re constantly broadcasting an audio stream to all endpoints on that network and they still need to check whether each packet is applicable.

Getting the same issue with an Atom Echo using HA Cloud pipelines and rhasspy/wyoming-openwakeword docker.
Toggling the wake word switch doesn’t resolve it but it seems to start responding again after 5-10 mins.

My chain is
Atom Echo => TP-Link WiFi access point (TL-WA901ND V5) with 100 Mbit LAN connection => LAN cable => main router (O2 HomeBox 6641) with 1000 Mbit ports => LAN cable => Raspberry Pi 4

I dunno but never liked the idea of using the same mqtt network for audio.

Haven’t got one but just presuming it’s the same.

I have the issue described above. Can anyone point me in the right direction? Is there any update for the Atom?

Thank you

From how I see it, at this point in time the best we can do is collect cases until someone with more insights into the underlying tech finds the pattern/problem.

What is your setup? Hardware and software?

I have HA in a Proxmox VM, with Supervisor …

Home Assistant 2023.10.3
Supervisor 2023.10.0
Operating System 11.0
Interface: 20231005.0 - latest

Using an Atom. I just bought 2 units and both have the same issue. The firmware on the Atom is:

atom-echo-voice-assistant
by m5stack
Firmware: 2023.10.0b1 (Oct 13 2023, 23:14:59)
Hardware: 1.0

I have tried without the wake word, by pressing the button, and the Atom doesn’t have this problem, so it’s related to the wake word. Without it I can press the button multiple times and it responds every time!

I have opened an issue on GitHub for anyone that wants to join:


I have 1 Atom Echo flashed with the voice assistant with the same symptoms when using openwakeword. I also have 2 ESP32-S3-Box-Lite that have voice assistant flashed on them with the exact same symptoms on those. I can ask a few commands (2-5) and then the voice assistants freeze up.

I have another ESP32-S3-Box Lite flashed with Willow voice assistant for about 3 weeks now and it does not have any issues responding to wake words or getting locked up. Even when the Atom Echo and others don’t respond using openwakeword, the Willow box will still respond.


Oh, I never heard about it until I saw it on Amazon a few days ago, noticed it was running on an ESP32 and wondered if it could be used for Assist.

Can you tell what kind of issues there were with the device?

The one I have is more of a tech demonstrator than an actual product, as Espressif have packed every function onto an S3 microcontroller.

https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/audio_front_end/README.html

It’s the usual tiny box with a toy amplifier, and to squeeze so much in, the KWS model they run is extremely quantised.

Likely any ESP32-S3 could make a really great wireless KWS device, as the settings for the BSS could be tweaked and better models could be made.
As it is, to squeeze everything in, it’s all a tad thin, which makes a good demonstrator but maybe not a good product.

Likely an ESP32-S3 would really benefit from a 2-mic design, 71.45 mm spacing @ 48 kHz, using the ADF above.
Create models accordingly, as from what Espressif do it would seem the BSS splits the audio into 2 channels and they run 2x KWS to detect which channel has the KW and the following command.
Rather than just I2S mics, use an I2S ADC and MAX9814 preamps, as the AGC on those is pretty awesome; apart from closer tolerances and being smaller, MEMS mics aren’t really any different.
Willow are still using the Espressif KWS models and they can be hit or miss; I think they work on a rolling window at a fairly slow rate, and many of the rejections are the KW not fitting the current window, or it’s just bad :slight_smile:

I would like to set up a page for recommended RPi products. Really, I wish people could buy the MAX9814 mic you linked in a nice little USB package :grinning_face_with_smiling_eyes:

It seems like there’s no product that isn’t:

  1. Meant for something else and therefore more expensive (Anker C300 webcam)
  2. Meant for something else and therefore not as performant (Anker S330 speakerphone)
  3. In pieces and requires soldering, etc.

So I can’t just plug it in and select as input?

All that is needed is a 3.5mm TRS jack lead that ends in Dupont connectors, and then no soldering is needed.
I’ve just never found one.

It’s not only the MAX9814 board; any analogue preamp with silicon AGC can extend any USB adapter into near/far field, as they are really expecting close-field mics, which is why the input volume is often low.
So any mic preamp, or even a MEMS mic with built-in AGC, would do; it’s just that the MAX9814, with controllable gain and AGC, is widely available, and we merely lack 3.5mm TRS jack plugs ending in Duponts, though surely they must exist or be easily obtained.

My fave USB adapter, because it’s a very rare stereo ADC, is the Plugable USB Audio Adapter at $9.95.

That simple analogue 2-mic array, spaced at 71.45 mm, could be used on various devices from a Pi to an ESP32-S3, and with a low-cost ADC so that you can use the special Alexa audio sauce in
https://docs.espressif.com/projects/esp-sr/en/latest/esp32s3/audio_front_end/README.html

Send the stereo channels to a Pi and run 2x KWS to get the KW hit and select the best channel from the BSS output.
Then you actually have a far-field mic that uses BSS for audio preprocessing.

Likely we can do the 2x KWS on an ESP32-S3 with TFLite4Micro and just send the ‘Voice Command Audio’.
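On the Pi side, the capture-and-split part is trivial; a small sketch, assuming a stereo capture device (e.g. the Plugable adapter) run at 16 kHz for the KWS, with each channel fed to its own KWS instance as in the selection sketch further up:

```python
import sounddevice as sd  # pip install sounddevice

RATE = 16000   # KWS rate; the BSS/AFE stage may well run at 48 kHz and resample
FRAME = 1280   # 80 ms frames to match the KWS above

def stereo_frames(device="default"):
    """Yield (left, right) int16 frames from a stereo capture device,
    e.g. two MAX9814 preamps into a stereo USB ADC."""
    with sd.InputStream(device=device, samplerate=RATE, channels=2,
                        dtype="int16", blocksize=FRAME) as stream:
        while True:
            data, _overflowed = stream.read(FRAME)   # shape: (FRAME, 2)
            yield data[:, 0].copy(), data[:, 1].copy()

# Feed each channel its own KWS instance and forward only the winning
# channel's audio once a wake word fires.
```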

But yeah, getting some easily available components would be a massive plus, considering we are not talking about much more than 3.5mm TRS jack plugs ending in Duponts and pre-made Home Assistant housings.

At least then we are a little closer to commercial performance in quality of recognition, and surely that’s better than pushing units with extremely poor function and recognition purely because they already come in a housing, when for this purpose they are relative e-waste.

On a Pi I’m still searching for a BSS alg to turn into a nice efficient C/C++ routine, and it’s just a shame the BSS from Espressif is a blob.
I did do a simple delay-sum beamformer, but a KWS needs to lock on to the command sentence, and even the ReSpeaker 2-Mic, which is better than some think, still needs a case.
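For reference, the delay-sum part itself is only a few lines; a toy numpy sketch (not the realtime code referred to above), with the steering delay assumed to come from a cross-correlation peak between the two mics:

```python
import numpy as np

def delay_and_sum(left: np.ndarray, right: np.ndarray, delay: int) -> np.ndarray:
    """Steer a 2-mic delay-and-sum beam: shift the right channel by `delay`
    samples (positive means the right mic lags) and average with the left."""
    shifted = np.roll(right, delay)
    # zero the wrapped-around samples instead of letting them alias
    if delay > 0:
        shifted[:delay] = 0
    elif delay < 0:
        shifted[delay:] = 0
    mixed = (left.astype(np.int32) + shifted.astype(np.int32)) // 2
    return mixed.astype(np.int16)

# The steering delay per block is usually taken from the cross-correlation
# peak (GCC-PHAT or plain np.correlate) between the two mic signals.
```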

So for me, if someone who is conversant with the ESP32 could hack out the above Audio Front-end Framework (ESP32-S3, ESP-SR latest documentation) and just couple it to TFLite4Micro, doing the same as Espressif do: really just 2x KWS, where the BSS stream with the highest softmax is used.

Whether the 2-mic preamp with AGC is made or assembled from parts, using electrets or MEMS, I don’t really care; just that one exists.
Otherwise we are still at the same point with Year of the Voice, where an Alexa or Google likely just works so much better and, when we’re advocating poorly fitting speakerphones and webcams, is even much cheaper.

PS: this was just a hack of 2 existing projects that I converted to realtime, but if anyone would like to clean the code up and optimise the FFT to use NEON, please do, as at least it is some initial audio processing, which is a massive part of what smart speakers do.

I name-drop the MAX9814 and that USB adapter because, apart from the 3.5mm TRS jack to Duponts, you can drill a hole in a case and push-fit the mics into 9.5mm rubber grommets; they are available and relatively easy.
The honest truth, though, is that devices and systems for voice control of the standard many are used to on other devices may not arrive this year.

I have a hunch that Espressif have some form of DUET BSS alg, because as opposed to many other algs its computational load is less; the maths and C/C++ skills are way beyond my simple hack ability.

Noise with smart speakers is often command voice vs media noise, and the sources are often clearly spatially different.
BSS is not perfect, but on the 80/20 rule, where static noise filters or AEC only process ‘own’ noise, BSS will cover them all; likely it’s a variation of this that Google use with their VoiceFilter-Lite, as they scrapped beamforming and now have just 2 mics and lower cost.

I am not an ESP32-S3 fanboy either, as I keep dodging how to use their IDF, but they do have an Alexa-certified Audio Front-end Framework, and the bits needed are actually less than they put in their S3-Box systems.

You can not just stick a single mic input with no audio processing onto a KWS trained on synthesised voices, onto a full-vocab ASR, using simple word stemming for control, and say voilà, ‘Year of the Voice’.
Not compared to the engineered systems most people are now used to.
Maybe call it the Home Assistant AIY voice kit and declare the scope of intent.
