The era of open voice assistants has arrived

mnowok · December 20, 2024, 4:32pm

Can this device use a cloud speech-to-text service other than Home Assistant Cloud, for example Google STT? I’ve been using Google’s cloud STT and TTS with an M5Stack Atom Echo device with success (Atom software modified to send speech to a Google Home speaker). I don’t need Home Assistant Cloud at all (I own my domain and SSL certificate), and Google STT is very cheap for my usage (a few dollars over several months).

If an Assist configuration works on the Atom, will it work on the new HA device, or is it limited to HA Cloud STT/TTS only?

nickrout · December 20, 2024, 7:21pm

That’s not the release thread. It is a history and “state of play” blog post.

mchk · December 20, 2024, 7:26pm

You can use Vosk (with dict) or rhasspy-speech for these tasks.
These STT engines will work fast even on weak devices.

daywalker03 · December 20, 2024, 7:29pm

Or you can offload local processing to a more powerful system, which is what you’d want to do if you want locally hosted AI as well as faster responses to your commands.

Hedda · December 20, 2024, 8:55pm

For a non-preview edition or a future revision, would you please consider also adding an ESP32-H2 (or ESP32-C6) SoC as another “coprocessor” and secondary IoT radio module (so that the same PCB has both an ESP32-S3 and an ESP32-H2, and the second SoC can be used as a dedicated Thread Border Router for Home Assistant)?

The main real-world reason to add an ESP32-H2 (or ESP32-C6) module, other than using it as a generic coprocessor MCU to offload work, is that the ESP32-H2 (or ESP32-C6) has an IEEE 802.15.4 radio, which means it can be used as a “Thread Border Router” (with OpenThread Border Router firmware) for a Thread network used by the Matter integration in Home Assistant.

Adding such an ESP32-H2 (or ESP32-C6) SoC or module with its own antenna would take some space on the PCB, but it should not add much to the BOM cost since ESP32 chips are not expensive, and I think the additional possibilities such an extra ESP32 SoC could add should more than make up for that slightly higher cost! If you go with the slightly more powerful ESP32-C6, you could perhaps offload some other processes too (like maybe any sensors connected to the Grove port).

Alternatively, in the future you could possibly also use the ESP32-H2 (or ESP32-C6) on one of them as a remote Zigbee Coordinator (also known as a Serial-over-IP Zigbee controller adapter) for Home Assistant’s built-in ZHA integration (native Zigbee Gateway); see/follow this work-in-progress, but note that the ESP Zigbee radio library for zigpy is still very experimental and not yet fully working with the zha project.

jaswalters · December 20, 2024, 9:02pm

Plug your own speaker into the 3.5mm port.

GilDev · December 20, 2024, 9:30pm

Excited to try this out! I will probably wait until French is supported though.
Would love a round case design also!

NathanCu · December 20, 2024, 10:40pm

That would be a killer device H

And if I could put that in a chassis that had bangin speakers. Maybe a Squeeze lite compatible player…

HITChris · December 20, 2024, 11:06pm

Is PoE support on the roadmap?

domain_int · December 20, 2024, 11:53pm

It’s a real shame. This could have changed the entire voice landscape for a LOT of people.

Looks like I won’t be replacing my Google system in a hurry.

melloxious · December 21, 2024, 12:45am

Who will the initial retailers be in Australia?

newpond · December 21, 2024, 1:08am

Loads of stock, buy it today… unless you live in the UK, where both resellers are out of stock and on pre-order!!!

Same with France and Germany.

thomasf · December 21, 2024, 1:44am

It seems that Seeed Studio stock is sold out. No more previews for Australia. Super sad I’ve missed out. I’ve been following all the announcements but apparently wasn’t fast enough.

finity · December 21, 2024, 1:54am

It’s hard to tell the difference since it duplicates almost all the info contained here.

Either way, it’s not that big of a deal. Just an observation.

Chrisalbertson · December 21, 2024, 2:23am

I do own a 3D printer. I find that to get a high-quality finish, I need to use spray paint. If you want a black case, you should try painting a white case.

Mosher · December 21, 2024, 5:44am

Reduce the layer height to 0.12 or 0.1 mm and the finish will be much better.

Mosher · December 21, 2024, 5:46am

Does Voice PE have BLE Proxy functionality out of the box?

nickrout · December 21, 2024, 5:52am

Why do you ask? Read the source.

stuartiannaylor · December 21, 2024, 10:59am

They have called it the preview edition, but I think the hardware design is probably set in stone.
Some of the sales speak was likely optimism, but at least this time it is sold as a preview of what it says is the future of voice assistants.
Google is some way ahead of that: without doubt, with their targeted voice extraction, the later Nest Audio devices outperform what we just saw in the preview video.
We never got any demonstration of far-field use or third-party media noise, which the current crop of closed-source devices handles quite well.
Google has halted all assistant dev apart from local models running on accelerators like their Tensor chip in the Pixel devices.
Nearly all the big players are moving away from cloud devices as they don’t make revenue for them, but the hardware costs have limited them to mid-range to flagship phones and tablets, whilst they still sell the original cloud-based devices.

HA seems to be tackling the problem by building up to LLM-driven accelerated devices, but yeah, a stereo mic beamformer, even if powered by that XMOS chip, is a considerable way behind Google & Apple.

There were some elementary 101 errors in its design, and this hoovering up of permissive licences to refactor and rebrand just wastes time, whilst training existing models with new languages and capturing voice data would be faster.

I keep repeating a request to allow an opt-in to collate data on device and submit it in batches to HA, as open source still has some very poor quality datasets compared to what big data has.
Until that has been overcome, open source will be fudging models with synthetically created data that just isn’t the same as real-world capture.

Even what is being done now is essentially wrong and will create a poor dataset, as I posted in an issue that was just closed as completed…

github.com/OHF-Voice/wake-word-collective

Misunderstanding the problems with room reverberation (RIR)

opened 11:41AM - 06 Dec 24 UTC

closed 10:14PM - 13 Dec 24 UTC

StuartIanNaylor

Misunderstanding the problems with room reverberation (RIR) and how easily they …can be created

> Room Impulse Response (RIR) is an audio signal processing task that involves capturing and analyzing the acoustic characteristics of a room

https://ohf-voice.github.io/wake-word-collective/ 'the wake word while you walk around the room'

It's extremely hard (beamforming) to remove room impulse response, but very easy to create, with many GitHub projects to do this. https://github.com/LCAV/pyroomacoustics is one of many that can accurately add RIR. You only need <0.3m recordings that have very low RIR, from which tools such as pyroomacoustics can accurately create many distances.

This allows you to provide datasets for certain devices, as for a device with beamforming, walking around a room and including large RIR (large rooms) will greatly increase dataset entropy, and resultant models will be less accurate as the beamforming will attenuate RIR.

Also, for any model created with samples containing RIR: due to the nature of RIR, sound will bounce off walls and arrive at the mic at different times due to differing distances. These mix at the mic, and the more distant the mic, the more the recorded spectra will differ, as more reverberation and mixing will happen, creating very different harmonics.

At least if you're recording at a distance, supply metadata of that distance so a dataset can be filtered to only include near (<0.3m) recordings, and so you can be sure you have an even spread rather than create dataset bias. Metadata is hugely important: Recording Device, Recording Distance, Gender, Age Band, Lang and Region (regional accents) are essential so that you can filter and create evenly spread datasets or tailor a dataset to a type of metadata for more accuracy.
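To make the metadata point concrete, here is a minimal sketch of filtering a dataset to near-field recordings and checking the spread of one field. The `Sample` record, its field names, and the example values are all made up for illustration; they are not the wake-word-collective's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One wake-word recording plus the metadata fields named in the issue.

    Hypothetical record layout, not an existing dataset format.
    """
    path: str
    device: str
    distance_m: float
    gender: str
    age_band: str
    lang: str
    region: str


def near_field(samples, max_distance_m=0.3):
    """Keep only recordings made closer than max_distance_m to the mic."""
    return [s for s in samples if s.distance_m < max_distance_m]


def spread(samples, field):
    """Count samples per value of one metadata field to expose dataset bias."""
    return Counter(getattr(s, field) for s in samples)


# Invented example entries, purely for demonstration.
samples = [
    Sample("a.wav", "phone", 0.2, "f", "18-29", "en", "UK"),
    Sample("b.wav", "phone", 2.5, "m", "30-44", "en", "US"),
    Sample("c.wav", "laptop", 0.25, "m", "30-44", "en", "UK"),
]

near = near_field(samples)
print([s.path for s in near])
print(spread(near, "gender"))
```

Without a recorded distance, the `near_field` step is impossible, which is the reason for asking that the metadata be captured at submission time.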
From what I can see, https://ohf-voice.github.io/wake-word-collective/ is going to create datasets only for devices without any form of RIR attenuation (beamforming and such) and create inaccurate models, as there is a limit to how much RIR you can include in samples: further distances to the mic in big rooms give big differences in spectra, greatly increasing dataset entropy and hence lowering model accuracy. You can include a bit of RIR with non-beamforming devices to increase accuracy of more distant wake words, but the tradeoff is overall accuracy, and without the ability to filter out too-distant RIR for this type of device, the dataset is pure potluck of what environment users submit.

Bemused as always.
Stuart
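As a rough illustration of why reverberation is easy to add to a clean near-field recording, the sketch below convolves a dry signal with a toy impulse response. This is a deliberately crude stand-in (a direct path plus an exponentially decaying noise tail), not the room simulation that pyroomacoustics performs; the function names, the 0.3 tail level, and the test tone are assumptions made for the example.

```python
import math
import random


def synthetic_rir(rt60_s, sample_rate, seed=0):
    """Toy room impulse response: direct path plus a decaying noise tail.

    The tail envelope falls by 60 dB (a factor of 1000) over rt60_s seconds.
    This is an illustrative model only, not a geometric room simulation.
    """
    rng = random.Random(seed)
    n = max(1, int(rt60_s * sample_rate))
    decay = math.log(1000.0) / n
    rir = [0.3 * rng.uniform(-1.0, 1.0) * math.exp(-decay * i) for i in range(n)]
    rir[0] = 1.0  # direct sound arrives first, at full level
    return rir


def convolve(signal, kernel):
    """Direct-form linear convolution; output has len(signal) + len(kernel) - 1 samples."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


sample_rate = 8000
# Stand-in for a dry, close-mic recording: 50 ms of a 440 Hz tone.
dry = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(400)]
rir = synthetic_rir(rt60_s=0.05, sample_rate=sample_rate)
wet = convolve(dry, rir)
print(len(dry), len(rir), len(wet))
```

Going the other way, recovering `dry` from `wet` without knowing `rir`, is the hard problem, which is why clean close recordings are the more valuable raw material.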

It’s going to be a long haul before some of the sales speak becomes anywhere near true, but at least there is an effort being made in the open-source community.
Open source is still making 101 errors whilst the likes of Google are past PhD level, with many active employees conducting cutting-edge dev in this field.

HA do their best with what they have, even if some of the errors are frustrating at times.

ratsepa · December 21, 2024, 11:33am

Until now, the ESPHome media player could not play radio streams with AAC+ encoding, only MP3 streams (squeezelite on ESP32 plays AAC+).
Is there any hope that this new hardware also brings improvements to media player capabilities?