The era of open voice assistants has arrived

Why do you ask? Read the source.

They have called it the Preview Edition, but I think the hardware design is probably set in stone.
Some of the sales speak was likely optimism, but at least this time it is sold as a preview of what it says is the future of voice assistants.
Google is some way ahead of that: the targeted voice extraction of the later Nest Audio devices without doubt outperforms what we just saw in the preview video.
We never got any demonstration of far-field use or third-party media noise, which the current crop of closed-source devices handle quite well.
Google has halted all assistant development apart from local models running on accelerators like their Tensor chip in the Pixel devices.
Nearly all the big players are moving away from cloud devices, as those don't make revenue for them, but hardware costs have limited local models to mid-range and flagship phones and tablets, whilst they still sell the original cloud-based devices.

HA seems to be tackling the problem by building up to LLM-driven accelerated devices, but a stereo-mic beamformer, even if powered by that XMOS chip, is a considerable way behind Google and Apple.

There were some basic, elementary 101 errors in its design, and this hoovering up of permissively licensed projects to refactor and rebrand just wastes time, whilst training existing models with new language models and capturing voice data would be faster.

I keep repeating a request to allow an opt-in to collate data on-device and submit it in batches to HA, as open source still has some very poor-quality datasets compared to what big data has.
Until that has been overcome, open source will be fudging models with synthetically created data that just isn't the same as real-world capture.

Even what is being done now is essentially wrong and will create a poor dataset, as I posted in an issue that was just closed as completed…

It's going to be a long haul before some of the sales speak becomes anywhere near true, but at least there is an effort being made in the open-source community.
Open source is still making 101 errors whilst the likes of Google are past PhD level, with many active employees conducting cutting-edge development in this field.

HA do their best with what they have, even if some of the errors are frustrating at times.

Until now, the ESPHome media player could not play radio streams with AAC+ encoding, only MP3 streams (squeezelite on ESP32 plays AAC+).
Is there any hope that this new hardware also brings improvements to media-player capabilities?

How does the Home Assistant Assist pipeline currently handle multiple Voice devices? If the devices are in close proximity, is the software currently smart enough to respond from the location where it heard the command most clearly, or will both respond?

Good question. I've ordered two with a plan to have them close enough that this could be an issue.

That was addressed during the live YouTube event. The idea seems to be that the first one to hear the wake word will respond, and all others will be quieted. There was some discussion that the first to hear is usually, but not always, the closest.

1 Like

I am not sure if that is implemented yet. Rhasspy used to handle multiple satellites, but I am still not sure if it could in any way detect the best signal or distance.

I don’t like the satellite infrastructure; mics should instead connect to a WebSocket server that can handle concurrent clients.
That way it's pretty easy to organise them into zones, and a KWS generally gives a hit probability of 0–1, so the KWS that sends the highest hit probability, as long as it's the same model, should be a good indication of the best mic to use…
You have a small 'debounce latency' delay to wait for all the results to come in, then use the highest score and boot the other KWS mic arrays.
A score of 0.99 means it's 99% probable the KW is correct, and with the same model and hardware the highest hit should be the best and likely closest mic, without the need for complex additional algorithms.
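
As a rough sketch of that selection logic (not HA or Wyoming code; the port, message fields, and the 0.3 s debounce value are all my own assumptions):

```python
# Hypothetical wake-word arbitration server: each mic client sends a JSON
# hit over a WebSocket, the server debounces briefly per zone, then keeps
# the highest-probability mic and "boots" the rest.
import asyncio
import json

import websockets  # pip install websockets

DEBOUNCE_S = 0.3               # wait this long for late hits from other mics
pending: dict[str, list] = {}  # zone -> [(probability, mic_id), ...]

async def resolve_zone(zone: str) -> None:
    """After the debounce window, keep the best mic and drop the rest."""
    await asyncio.sleep(DEBOUNCE_S)
    hits = sorted(pending.pop(zone, []), reverse=True)
    if hits:
        prob, mic_id = hits[0]
        print(f"zone {zone}: using {mic_id} (p={prob:.2f}), "
              f"booting {[m for _, m in hits[1:]]}")

async def handle_mic(ws):
    # Each client sends hits like {"zone": "kitchen", "mic": "m1", "p": 0.97}
    async for message in ws:
        hit = json.loads(message)
        zone = hit["zone"]
        first_hit = zone not in pending
        pending.setdefault(zone, []).append((hit["p"], hit["mic"]))
        if first_hit:
            # the first hit in a zone starts the debounce timer
            asyncio.create_task(resolve_zone(zone))

async def main():
    async with websockets.serve(handle_mic, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())
```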

I have been banging on about that for a while: zones should be at the start of the process rather than allocated after. I will be working on a rough demo over the next couple of days as an alternative to the peer-to-peer Wyoming approach…

A lot has been cobbled together. Emerging entrepreneurs creating their own hardware (albeit backed by Nabu Casa repos) and FOMO are what I assume caused the rush to get the hardware out and the software decent enough to show off. XMOS hardware was most likely chosen because it's proven effective and a shortcut to get all of that working in one package, rather than gluing together and reinventing existing OSS projects.

Tbh, it reminds me of Steve Jobs swapping crashing iPhones during the reveal. OK, maybe not quite that, but it still raises the question of why so many things seem rushed. This isn't me talking down; I understand things aren't as cut and dried as that, but it's still something to point out.

All in all, a great job. I am travelling down the Assist rabbit hole with non-PE hardware, and in just the couple of months I've been tinkering, the intent processing has changed drastically. Last week I could ask Assist to “turn the temperature to 22” and it would; this week I get “sorry, I couldn't understand that”, which means things are moving fast.

I get what the end goal is here, but I am also bemused by this iteration of hardware and some of the software choices. It seems like Nabu wants its own ecosystem (Wyoming-backed data transfer), and I think we're all a little wary of voice assistant ecosystems.

I’ll be cautiously optimistic and take the good with the bad. A lot of hard work has gone into getting what we have out, and I am grateful for it.

Here’s to putting on sunglasses, the future looks bright :sunglasses:.

1 Like

I have never had any commercially available smart speakers / voice assistants and could not care less about them.
Zero need to ask such a thing to tell me a joke or read some crap from the Internet to me.
For smart music streaming I have been running a Squeezebox system for more than 15 years, which is perfectly integrated into Home Assistant.
So no need, for me, to have yet another music player.
If this new device has better voice recognition, which it seems to, that is all I am asking for to begin with. I would even consider using Rhasspy speech, since the intents cover 90% of my use cases.
I am very excited about the new voice assistant and curious how it will perform.
If I support all the great effort through its price, that is fine with me. All the other stuff that I am doing in my house on the shoulders of Home Assistant came for free, after all.

I have the feeling that many people seem to live in a reality where they expect everything for free… after all, Google is free, right :rofl:

If your privacy is worth nothing to you, this might not be the right platform.

Thanks to everyone at HA and Nabu Casa for the great effort that went into this, and everything else that is already there, of course.

Merry X-mas

Merc

3 Likes

Already ordered mine - thank you for this awesome release.
So, about 3D files: would you be releasing STEP files, or any other solids? STLs are fine for 3D printing and accessories, but modifications need solid files, not meshes. Yes, we can reverse-engineer the STLs, but it would be a lot easier if you just released the originals. I already have some ideas on how to make those files easier to print. Thanks!

Please also add a switch to mute the speaker(s).

Plug any 3.5mm stereo/mono plug into the ext. speaker socket. That will disconnect the internal speaker.

I think 2 voice-related things could be improved/added:

  1. Timers. I wish we had a Timers item in the side panel, so you could also set, pause, and cancel timers from the dashboard, and set the timer beep sound and output speaker.

  2. Alarm clock. It looks like this is completely missing in HA. I wish we could set an alarm clock by voice or in the dashboard. Like timers, it needs a place in the side panel too.


There could be an indicator on the dashboard when a timer or alarm clock is active.

1 Like

Does anyone know for sure, one way or the other, whether the device can initiate conversations via automations, or is it strictly always listening for the wake word to initiate an interaction?

Hmmm… Apparently you can no longer buy them through Seeedstudio.

+1, however music-playback-specific requests are probably better directed to the Music Assistant project (which shares some of the same developers with Home Assistant); check out these existing feature request discussions about ESP32 playback:

I posted this directly related feature request there, asking for "Matter Casting" (a.k.a. MatterCast):

There is by the way a good summary of the Music Assistant project in this blog post:

Would it be possible to add "Matter Casting" (a.k.a. MatterCast) audio/music and video player (streaming receiver) support for this new/upcoming video and music casting standard in the future?

Please consider researching and planning to add a custom "Matter Casting" (a.k.a. MatterCast) receiver/client to Music Assistant, and later also a "Matter Casting" streaming service for newer and upcoming connected smart speakers and smart displays/televisions (like the latest products from Amazon) that will be able to act as receiver endpoints and audio/music players for "Matter Casting" audio/music and video streaming when those become available. Matter Casting aims to democratize local video and audio casting in a universal way that can be supported by all ecosystems and platforms.

"Matter Casting" is a new open-protocol media streaming standard, and the Matter Casting APIs for casting video and audio streams over a local network are only a small part of the currently much-hyped Matter standard suite for IoT, which is being led, promoted, and developed by the CSA (Connectivity Standards Alliance) and its very impressive list of member companies:

Yep, ffmpeg is used on the HA side to convert all incoming audio into something ESPHome understands (a feature not limited to Voice PE).
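
For a rough idea of the kind of conversion involved (a sketch only; the exact codec, sample rate, and channel layout HA asks ffmpeg for are my assumptions, not taken from the HA source):

```python
# Hypothetical transcode: turn an arbitrary stream (e.g. an AAC+ radio
# station) into 16 kHz mono 16-bit PCM WAV, the sort of format a small
# ESP32 media player can handle. Requires ffmpeg on the PATH.
import subprocess

def transcode_for_esp(url: str, out_path: str = "out.wav") -> None:
    subprocess.run(
        [
            "ffmpeg",
            "-i", url,            # any input ffmpeg can demux/decode
            "-ac", "1",           # downmix to mono
            "-ar", "16000",       # resample to 16 kHz
            "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
            out_path,
        ],
        check=True,
    )

transcode_for_esp("https://example.com/stream.aac")  # hypothetical URL
```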

1 Like

Wyoming is an open voice assistant protocol that we created to allow hosting speech-to-text and similar engines in different processes or on different hosts (like your beefy server). Voice PE leverages the ESPHome protocol, as it's an ESPHome device. Everything is open, and we encourage people and companies to use and integrate them as they wish.
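
For anyone curious what that looks like on the wire, here is a minimal sketch (host and port are assumptions; Wyoming events are a line of JSON, so sending a `describe` event should get an `info` reply back):

```python
# Minimal, hypothetical Wyoming client: send a "describe" event and read
# the JSON header of the reply. Assumes a Wyoming service (e.g. a
# speech-to-text container) listening on localhost:10300.
import json
import socket

def describe(host: str = "localhost", port: int = 10300) -> dict:
    with socket.create_connection((host, port)) as sock:
        sock.sendall((json.dumps({"type": "describe"}) + "\n").encode())
        with sock.makefile("rb") as reader:
            return json.loads(reader.readline())

print(describe())
```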

This is not possible yet but planned.

2 Likes