The era of open voice assistants has arrived

Why do you ask? Read the source.

They have called it the Preview Edition, but I think the hardware design is probably set in stone.
Some of the sales speak was likely optimism, but at least this time it is sold as a preview of what it says is the future of voice assistants.
Google is some way ahead of that: the targeted voice extraction of the later Nest Audio devices without doubt outperforms what we just saw in the preview video.
We never got any demonstration of far-field use or third-party media noise, which the current crop of closed-source devices handle quite well.
Google has halted all assistant development apart from local models running on accelerators like their Tensor chip in the Pixel devices.
Nearly all the big players are moving away from cloud devices, as those don't make revenue for them, but hardware costs have limited local models to mid-range and flagship phones and tablets, whilst they still sell the original cloud-based devices.

HA seems to be tackling the problem by building up to LLM-driven accelerated devices, but a stereo-mic beamformer, even if powered by that XMOS chip, is a considerable way behind Google and Apple.

There were some basic, elementary 101 errors in its design, and this hoovering up of permissively licensed projects to refactor and rebrand just wastes time, whilst training existing models with new language models and capturing voice data would be faster.

I keep repeating a request to allow an opt-in to collate data on-device and submit it in batches to HA, as open source still has some very poor-quality datasets compared to what big data has.
Until that has been overcome, open source will be fudging models with synthetically created data that just isn't the same as real-world capture.

Even what is being done now is essentially wrong and will create a poor dataset, as I posted in an issue that was just closed as completed…

It's going to be a long haul before some of the sales speak becomes anywhere near true, but at least there is an effort being made in the open-source community.
Open source is still making 101 errors whilst the likes of Google are past PhD level, with many active employees conducting cutting-edge development in this field.

HA do their best with what they have, even if some of the errors are frustrating at times.

Until now, the ESPHome media player could not play radio streams with AAC+ encoding, only MP3 streams (squeezelite on ESP32 plays AAC+).
Is there any hope that this new hardware also brings improvements to media-player capabilities?

How does the Home Assistant Assist pipeline currently handle multiple Voice devices? If the devices are in close proximity, is the software currently smart enough to respond from the location where it heard the command most clearly, or will both respond?

Good question. I've ordered two with a plan to have them close enough that this could be an issue.

That was addressed during the live YouTube event. The idea seems to be that the first one to hear the wake word will respond, and all others will be quieted. There was some discussion that the first to hear is usually, but not always, the closest.

1 Like

I am not sure if that is implemented yet. Rhasspy used to handle multiple satellites, but I am still not sure if it could in any way detect the best signal or distance.

I don’t like the satellite infrastructure; mics should instead connect to a WebSocket server that can handle concurrent clients.
That way it's pretty easy to organise them into zones, and a KWS generally gives a hit probability of 0–1, so the KWS that sends the highest hit probability, as long as it's the same model, should be a good indication of the best mic to use…
You have a small 'debounce latency' delay to wait for all the results to come in, then use the highest score and boot the other KWS mic arrays.
A score of 0.99 means it's 99% probable the KW is correct, and with the same model and hardware the highest hit should be the best and likely closest mic, without the need for complex additional algorithms.
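
As a rough sketch of that selection logic (not HA or Wyoming code; the port, message fields, and the 0.3 s debounce value are all my own assumptions):

```python
# Hypothetical wake-word arbitration server: each mic client sends a JSON
# hit over a WebSocket, the server debounces briefly per zone, then keeps
# the highest-probability mic and "boots" the rest.
import asyncio
import json

import websockets  # pip install websockets

DEBOUNCE_S = 0.3               # wait this long for late hits from other mics
pending: dict[str, list] = {}  # zone -> [(probability, mic_id), ...]

async def resolve_zone(zone: str) -> None:
    """After the debounce window, keep the best mic and drop the rest."""
    await asyncio.sleep(DEBOUNCE_S)
    hits = sorted(pending.pop(zone, []), reverse=True)
    if hits:
        prob, mic_id = hits[0]
        print(f"zone {zone}: using {mic_id} (p={prob:.2f}), "
              f"booting {[m for _, m in hits[1:]]}")

async def handle_mic(ws):
    # Each client sends hits like {"zone": "kitchen", "mic": "m1", "p": 0.97}
    async for message in ws:
        hit = json.loads(message)
        zone = hit["zone"]
        first_hit = zone not in pending
        pending.setdefault(zone, []).append((hit["p"], hit["mic"]))
        if first_hit:
            # the first hit in a zone starts the debounce timer
            asyncio.create_task(resolve_zone(zone))

async def main():
    async with websockets.serve(handle_mic, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())
```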

I have been banging on about that for a while: zones should be at the start of the process rather than allocated after. I will be working on a rough demo over the next couple of days as an alternative to the peer-to-peer Wyoming approach…

A lot has been cobbled together. Emerging entrepreneurs creating their own hardware (albeit backed by Nabu Casa repos) and FOMO are what I assume caused the rush to get the hardware out and the software decent enough to show off. XMOS hardware was most likely chosen because it's proven effective and a shortcut to get all of that working in one package, rather than gluing together and reinventing existing OSS projects.

Tbh, it reminds me of Steve Jobs swapping crashing iPhones during the reveal. OK, maybe not quite that, but it still raises the question of why so many things seem rushed. This isn't me talking down; I understand things aren't as cut and dried as that, but it's still something to point out.

All in all, a great job. I am travelling down the Assist rabbit hole with non-PE hardware, and in just the couple of months I've been tinkering, the intent processing has changed drastically. Last week I could ask Assist to “turn the temperature to 22” and it would; this week I get “sorry, I couldn't understand that”, which means things are moving fast.

I get what the end goal is here, but I am also bemused by this iteration of hardware and some of the software choices. It seems like Nabu wants its own ecosystem (Wyoming-backed data transfer), and I think we're all a little wary of voice assistant ecosystems.

I’ll be cautiously optimistic and take the good with the bad. A lot of hard work has gone into getting what we have out, and I am grateful for it.

Here’s to putting on sunglasses, the future looks bright :sunglasses:.

1 Like

I have never had any commercially available smart speakers / voice assistants and could not care less about them.
Zero need to ask such a thing to tell me a joke or read some crap from the Internet to me.
For smart music streaming I have been running a Squeezebox system for more than 15 years, which is perfectly integrated into Home Assistant.
So no need, for me, to have yet another music player.
If this new device has better voice recognition, which it seems to, that is all I am asking for to begin with. I would even consider using Rhasspy speech, since the intents cover 90% of my use cases.
I am very excited about the new voice assistant and curious how it will perform.
If I support all the great effort through its price, that is fine with me. All the other stuff that I am doing in my house on the shoulders of Home Assistant came for free, after all.

I have the feeling that many people seem to live in a reality where they expect everything for free… after all, Google is free, right :rofl:

If your privacy is worth nothing to you, this might not be the right platform.

Thanks to everyone at HA and Nabu Casa for the great effort that went into this, and everything else that is already there, of course.

Merry X-mas

Merc

3 Likes

Already ordered mine - thank you for this awesome release.
So, about 3D files: would you be releasing STEP files, or any other solids? STLs are fine for 3D printing and accessories, but modifications need solid files, not meshes. Yes, we can reverse-engineer the STLs, but it would be a lot easier if you just released the originals. I already have some ideas on how to make those files easier to print. Thanks!

Please also add a switch to mute the speaker(s).

Plug any 3.5mm stereo/mono plug into the ext. speaker socket. That will disconnect the internal speaker.

I think 2 voice-related things could be improved/added:

  1. Timers. I wish we had a Timers item in the side panel, so you could also set, pause, and cancel timers from the dashboard, and set the timer beep sound and output speaker.

  2. Alarm clock. It looks like this is completely missing in HA. I wish we could set an alarm clock by voice or in the dashboard. Like timers, it needs a place in the side panel too.


There could be an indicator on the dashboard when a timer or alarm clock is active.

1 Like

Does anyone know for sure, one way or the other, whether the device can initiate conversations via automations, or is it strictly always listening for the wake word to initiate an interaction?

Hmmm… Apparently you can no longer buy them through Seeedstudio.

+1, however music-playback-specific requests are probably better directed to the Music Assistant project (which shares some of the same developers with Home Assistant); check out these existing feature request discussions about ESP32 playback:

I posted this directly related feature request there, asking for "Matter Casting" (a.k.a. MatterCast):

There is by the way a good summary of the Music Assistant project in this blog post:

Would it be possible to add "Matter Casting" (a.k.a. MatterCast) audio/music and video player (streaming receiver) support for this new/upcoming video and music casting standard in the future?

Please consider researching and planning to add a custom "Matter Casting" (a.k.a. MatterCast) receiver/client to Music Assistant, and later also a "Matter Casting" streaming service for newer and upcoming connected smart speakers and smart displays/televisions (like the latest products from Amazon) that will be able to act as receiver endpoints and audio/music players for "Matter Casting" audio/music and video streaming when those become available. Matter Casting aims to democratize local video and audio casting in a universal way that can be supported by all ecosystems and platforms.

"Matter Casting" is a new open-protocol media streaming standard, and the Matter Casting APIs for casting video and audio streams over a local network are only a small part of the currently much-hyped Matter standard suite for IoT, which is being led, promoted, and developed by the CSA (Connectivity Standards Alliance) and its very impressive list of member companies:

Yep, ffmpeg is used on the HA side to convert all incoming audio into something ESPHome understands (a feature not limited to Voice PE).
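
For a rough idea of the kind of conversion involved (a sketch only; the exact codec, sample rate, and channel layout HA asks ffmpeg for are my assumptions, not taken from the HA source):

```python
# Hypothetical transcode: turn an arbitrary stream (e.g. an AAC+ radio
# station) into 16 kHz mono 16-bit PCM WAV, the sort of format a small
# ESP32 media player can handle. Requires ffmpeg on the PATH.
import subprocess

def transcode_for_esp(url: str, out_path: str = "out.wav") -> None:
    subprocess.run(
        [
            "ffmpeg",
            "-i", url,            # any input ffmpeg can demux/decode
            "-ac", "1",           # downmix to mono
            "-ar", "16000",       # resample to 16 kHz
            "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
            out_path,
        ],
        check=True,
    )

transcode_for_esp("https://example.com/stream.aac")  # hypothetical URL
```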

1 Like

Wyoming is an open voice assistant protocol that we created to allow hosting speech-to-text and similar engines in different processes or on different hosts (like your beefy server). Voice PE leverages the ESPHome protocol, as it's an ESPHome device. Everything is open, and we encourage people and companies to use and integrate them as they wish.
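
For anyone curious what that looks like on the wire, here is a minimal sketch (host and port are assumptions; Wyoming events are a line of JSON, so sending a `describe` event should get an `info` reply back):

```python
# Minimal, hypothetical Wyoming client: send a "describe" event and read
# the JSON header of the reply. Assumes a Wyoming service (e.g. a
# speech-to-text container) listening on localhost:10300.
import json
import socket

def describe(host: str = "localhost", port: int = 10300) -> dict:
    with socket.create_connection((host, port)) as sock:
        sock.sendall((json.dumps({"type": "describe"}) + "\n").encode())
        with sock.makefile("rb") as reader:
            return json.loads(reader.readline())

print(describe())
```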

This is not possible yet but planned.

2 Likes