The era of open voice assistants has arrived

popcornboy · December 22, 2024, 2:09pm

Fair enough for this “preview” device, but for the final device release it would be nice not having to add a cinch plug for every single voice device.

Merc · December 22, 2024, 2:29pm

This is THE FINAL device from Nabu Casa for now. What will change and evolve is the software.
Other people might pick up the openly available pcb designs and build other devices based on that.

It’s all in the YouTube video…

Hedda · December 22, 2024, 4:14pm

FYI, FutureProofHomes has now also announced the final hardware design of their much more advanced ”Satellite1 Dev Kit” (or rather announced a public beta pre-launch with pre-order for their the USA only) so that development board hardware looks fully ready too even if not available to ship as of yet.

As you can see in their video they taken a very different approach by making it modular using a two-board design that seperate the compute board from the voice board, and making it compatible with the Raspberry Pi Zero standard it will be both flexible today and upgradable to other compute boards in the future.

Lakini · December 22, 2024, 8:37pm

Okay, not sure if this popped up anywhere else yet, but anyway: coming back to the topic around “switching between internal speaker and 3.5 plug output”. I initially thought that the 3.5mm plug is used in the “oldschool” way, where the signal to the plug by default is connected to the amp for the internal speaker, and only gets disconnected mechanically when a plug is inserted. Luckily enough the VPE uses the same approach as the respeaker_lite: it uses the microphone detection pin of the 3.5mm plug to switch the amp for the internal speaker on and off. The respeaker docs also have the infos how to switch off the plug detection: GitHub - respeaker/ReSpeaker_Lite
The nice thing is that this could technically also be made controllable, via one of the unused gpio pins of the esp and a few parts.

With that it would be possible to use the internal speaker in parallel to the 3.5mm jack, but basically “on demand”, aka: switchable

I didn’t look far enough into the DAC yet, maybe there is also a way to easily switch the signal to the 3.5mm jack on and off, then it should be possible to fully choose which audio-out to use in a given situation without having to plug and unplug the 3.5mm jack: both, only internal, only external, none

Hedda · December 22, 2024, 8:46pm

Would probably also be a good idea to make use of that Grove port to connect a relay so can switch off power to the external speaker/reciever when it is not being used.

Maybe a good idea to have a simple way to have a eaxh to use stand-by feature that will power off external speaker/reciever via the Grove port deing for example the night or away mode so external speaker/reciever draw to much power.

I known that having many external speaker/reciever running 24-hours pay day will quickly become expensive in electricity usage.

nickrout · December 23, 2024, 4:29am

I don’t know if it is significant, but there is a sensor in the esphome source which reports whether the speaker is recently plugged or unplugged.

AshaiRey · December 23, 2024, 7:20am

I hope that you really understand that you are comparing apples with pears as you do in the case with google/alexa/siri and the voice PE.

GuySie · December 23, 2024, 11:03am

The two Australian retailers are:
Smart Guys: https://smartguys.com.au/product/home-assistant-voice-preview-edition/
OZ Smart Things: HA Voice Preview Edition,Smart Home Australia

We’ll update the website to match soon.

stuartiannaylor · December 23, 2024, 11:06am

Its purely voice technolgy and originally the focus was on beamforming but it is not as effective as our brains at focussing and extracting the voice of interest.
Google in 2019 published VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
The TensorflowLite version ran and runs on thier Pixel phones, but if you have been following voice tech and reading published papers you would know this provides the best results with less microphones and computational usage.
Its just the current “state of the art” and last couple of years many papers have been published on this form of tech whilst beamforming has took a back seat.

Much of what we have is from papers and repos of big tech, MicroWakeWord if you browse the source [ Copyright 2023 The Google Research Authors.] (GitHub - kahrendt/microWakeWord: A TensorFlow based wake word detection training framework using synthetic sample generation suitable for certain microcontrollers.) kahrendt has done a stella job of extracting it from google-research/kws_streaming at master · google-research/google-research · GitHub

Microwake word seems to be working brilliantly even if handicapped by only having synthetic data to work with. I should fire of an idea to kahrendt that we should capture ondevice data locally and do something called on-device training. Where a locally trained model of capture bias’s the weights of the large pretarined model and makes it more accurate for the users using it.
These are all methods from Tensorflow (tflite) provided by Google.

Same with openwake word as the paper and embedding model is by Google.
Piper if I remember rightly is a refactored and rebranded wavglow that might of been nvidia and current we use Whisper an OpenAI product.

I am making direct comparisons as opensource due to limited resources and expertease is borrowing the cast off’s of older tech to produce its own systems.
From Meta to Tencent there are various models out there, but no-one in the community is at the level of creating models from scratch.
You can not get away from comparing against big tech as the original software source we are using is from big tech.

Some of my objections are that we should not be comparing Voice PE to other smart speakers because smart speakers are peer2peer because they fit a commercial model.

I firmly believe Speaker and Microphone are seperate opensource products and opensource already has great wireless audio in microcontroller Squeezelite or Sonos competing Snapcast
Both are client/server and not peer2peer but great opensource and in production for many years.
I don’t understand why we are cloning commercial peer2peer ‘smart speakers’ as seperation of Microphone and speaker has numerous advantages but we are not.

From software to format the comparison is very valid as even Nanbu Casa has a cloud sevice.
The biggest hurdle for opensource is that we don’t have the Gold Standard datasets that have the discipline and metadata of collection that bigdata has.
The old adage of garbage in garbage out means we are always handicapped by using synthetic data and still we don’t seem to have an opt in to collate on device, so that it can be submitted to an opensource repositary.

When I say what we have is still a long way behind Google and others is true and the comparison is valid when the products in form, software origin, purpose and use is near identical.

What kahrendt did with microWakeWord is really great as I have used the Google reaserch to train various KWS, but wow the framework would of took some digging out as also tensorflow/tensorflow/lite/experimental/microfrontend at master · tensorflow/tensorflow · GitHub

I had a go at a 1st C project to hack together a 2 Channel Beamformer and maybe someone of kahrendt level could polish my hack and use it on micro.
It would be really great if microWakeWord could be exported for Arm as DSP doesn’t really lending itself to Python unless you throw hardware at it.

I mention that as currently unlike big data I don’t think the beamforming and KWS are linked as the recognition of the KW should also give the direction of the beam to lock onto for that command sentence, which is implemented in the above even though a rough hack. I have hunch that is how the early smart speakers worked, but really without that you have a conference mic.

The comparison that opensource is some way behind big tech is just true unfortunately.

donburch888 · December 24, 2024, 8:41am

Gotta love that their web page says they have stock … but are closed till Jan 7th

Hicks12 · December 24, 2024, 2:32pm

Am I being dense or I cannot see a documented way of adjusting volume of the hardware via automations? Not sure why this is missed off the list of options as it seems like a basic feature to me.

I just want to set the volume to 100% make an announcement and then set them back as the speaker is too quiet for some things to be left low.

I’m going with me having a brain fog filled day after finishing for the holidays so I’ll check again later but if someone knows that would be ideal!

Lakini · December 24, 2024, 3:20pm

I guess you can hook into the logic that the hardware volume wheel uses: home-assistant-voice-pe/home-assistant-voice.yaml at 44eaaafdb3bfb2672a8c6f91563ce8c062bfde0f · esphome/home-assistant-voice-pe · GitHub

alander · December 24, 2024, 4:01pm

to change the volume on the hardware itself - use the media_player set volume action? That works for me

action: media_player.volume_set
target:
  entity_id: media_player.home_assistant_voice_1234567_media_player
data:
  volume_level: 0.49

Hicks12 · December 24, 2024, 10:34pm

Perfect, I was being an idiot not thinking of that as a media player as output so that makes sense I even seen it in the visual one not just by yaml when I am using the correct logic.

Thanks for the two pointing it out, was definitely too tired when looking earlier haha! Problem solved.

jessendelft · December 25, 2024, 9:26pm

I have an idea on how to use the HA voice as a module for a larger speaker with some additional features, like a display. With that in mind have I been looking into the Grove port and its capabilities.

For the display to make sense, I need metadata from the speaker. For example:

when it is talking.
what the text is that’s being spoken.
what type of device is being controlled (if any), and how.

As far as I can tell, the Grove port is currently set up to read sensor data. Can I also use it to read data from the speaker (e.g. push i2c messages to a device when it’s processing/talking)? If yes, is there documentation on the API?

Thanks, and great work so far!

Fugazzy · December 25, 2024, 9:37pm

Installing the voice comes to defining wifi, after that I get this message:

Rafaille · December 25, 2024, 10:08pm

Just guessing as I don’t have the device but is your HA instance accessible through http:// (and not https://)?

alander · December 26, 2024, 6:37am

I got an error following configuring the wifi (assuming the German text says the same thing I got) that I don’t know why but the device was using what was defined in the Home assistant URL rather than the Internal URL.

My external access is not on all the time (have cloudflare) - but after I switched on my external access it got past the wifi screen and using VAPE would only work when the external access was switched on.

I since then configured the Home assistant URL to be the same as the internal URL and is now no longer dependant on cloudflare tunnel to be running. My mobile app still has the external URL.

Bit puzzled why the VAPE can’t use the internal URL?

Fugazzy · December 26, 2024, 9:40am

Thank you! i was not aware of that.
I found the necessary setting under
settings - network
The HomeAssistant URL there did not work.
It works with my https URL (via nginx).
But it also works with automatic switched on which seems to use just the IP address.

Pops1 · December 26, 2024, 7:34pm

The documentation indicates the VPE currently uses just three wakewords . Will it be possible to stream other wakewords stored locally on the Home Assistant server by selecting them through them home assistant voice assistant?

Thanks!