It's sort of strange, as the audio out is being returned from the upstream ASR, when really the upstream ASR and intent response could just stream to wireless audio.
I am not really a fan of how Mike sets out the voice infrastructure, as audio out is central and likely should not need to be in a microphone enclosure.
The Python Wyoming 'open standard' is freshly created, whilst Linux has a huge array of high-performance C audio libs. It doesn't make a lot of sense, for me at least, when we have ALSA through PulseAudio to the newer PipeWire. You can just pipe to a network socket if you wished, which again uses high-performance existing Linux libs rather than Python creating unnecessary load on embedded hardware…
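On the "just pipe to a network socket" point, here's a minimal sketch of the idea, written in Python purely for illustration (in practice something like `arecord -f S16_LE -r 16000 -t raw | nc host port` does the same with existing C tools and zero code). The host and port below are placeholders, not a real service:

```python
# Minimal sketch: forward raw PCM bytes from any file-like source to a
# TCP socket, which is all a networked "audio out" transport really is.
# In real use the source would be the mic capture pipeline.
import socket


def stream_pcm(source, host, port, chunk=4096):
    """Send raw PCM bytes from a file-like source to host:port."""
    with socket.create_connection((host, port)) as sock:
        while True:
            data = source.read(chunk)
            if not data:
                break
            sock.sendall(data)
```

The receiving end can be anything that reads a socket, which is why the author's point stands: no bespoke protocol layer is strictly required to move audio around.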
But when you have great open-source wireless audio such as Squeezelite or Snapcast, it makes even less sense to me.
I have been following Rhasspy and Mycroft from the early days and have a repo at StuartIanNaylor · GitHub, but I'm thinking of starting again with LinuxVoiceContainer · GitHub, just to create some tutorials on how to DIY and use some of the first-class, high-performance audio libs Linux already has to offer.
I'm building a beamforming microphone array on a Pi Zero 2 or Radxa, as I think I can do better with open source than opting for closed-source hardware such as the XMOS…
Over the next couple of days I will be making some vids and tutorials on LinuxVoiceContainer · GitHub as an alternative to the HA offering, as the implementation often has me bemused.
Only little things, but they add up: with stereo beamforming you generally have a front-facing device, where the enclosure itself acts to attenuate sound from the rear.
Mics-up as with HA, with 2 mics on top, the beamforming is only on the x axis; three mics in a triangular config is the minimum needed to also cover the y axis…
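To illustrate the geometry point: the far-field delay between a mic pair depends only on the source direction's component along the mic-to-mic axis, so a 2-mic pair on one axis can't tell mirrored directions apart, while a third mic off that axis can. A quick sketch (spacings and angles are illustrative, not the HA unit's actual layout):

```python
# Far-field time difference of arrival (TDOA) between two mics.
# The delay is the dot product of the source direction with the
# mic-to-mic axis divided by the speed of sound, so any direction
# component perpendicular to that axis contributes nothing.
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature


def pair_delay(mic_a, mic_b, azimuth_deg, elevation_deg=0.0):
    """Far-field TDOA in seconds between two mics for a given direction."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    # Unit vector pointing toward the source.
    d = (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))
    axis = tuple(b - a for a, b in zip(mic_a, mic_b))
    return sum(di * xi for di, xi in zip(d, axis)) / SPEED_OF_SOUND


# Two mics 5 cm apart along the x axis: sources at +30° and -30° azimuth
# produce identical delays, so the pair is blind on the y axis.
a, b = (0.0, 0.0, 0.0), (0.05, 0.0, 0.0)
```

Adding a third mic at, say, `(0.025, 0.043, 0.0)` breaks that symmetry, which is the triangular-config argument in numbers.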
I guess you could use the HA unit on its side, but the wheel and button don't lend themselves to that, given the manner in which it's been constructed.
In fact, why have a wheel and button on a voice input device at all… again, bemused.
Also, why use Whisper when it's huge and not that great for command sentences, and why are we waiting for HA's own ASR when so much existing ASR is already production-proven?
HA is a great piece of open-source automation control software for nearly all home control devices and protocols.
I am confused why, like Google and the rest, they seem to be making their own embedded brand of everything from ASR and TTS to wireless audio, when so much already exists in the open-source arena.
My current favourite for ASR is GitHub - wenet-e2e/wenet: Production First and Production Ready End-to-End Speech Recognition Toolkit, as it's massively lighter than Whisper and can run on much lighter hardware, or act as a central ASR on a multi-client system where recognition latency gets very small the more hardware you throw at it.
It's all been really frustrating, as open source does have competing software, but it has nowhere near the levels of discipline in the datasets that big data has to train with, and this is still true.
I am not sure why more focus isn't put on creating truly high-quality large datasets and training new language models for existing toolkits, rather than refactoring and creating own-brand modules…
But hey…
Excellent! It looks really cool, I like that I can plug it into another speaker for music. I imagine including a high quality speaker at this point bumped up the price too much. Excited for the RGB ring light too.
I think it needs a more friendly name though, something like Harvey - H(ome)A(ssistant)rV(oice)ey?
I don't think so, and it's a shame it doesn't use existing open-source wireless audio software and just act as a client to one of those.
Not sure how much resource is left on the ESP32-S3, but Squeezelite has been ported to the ESP32, as in https://raspiaudio.com/, whilst the full-blown open-source Sonos alternative Snapcast has much tighter sync that feathers the time sync with zero glitches. A Pi Zero is the minimum for that, with a Pi Zero 2 probably being a better bet, as it's a huge step up for only £5 more.
Squeezelite and Snapcast are great pieces of wireless audio open source: one for lighter hardware (Squeezelite), and the other you could argue is even better than Sonos, with tighter sync and up to 96 kHz multichannel if your hardware can cope. It's all written in high-performing C and supposedly still runs on the original Zero.
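Roughly how that feathered time sync works, as I understand it: rather than dropping or inserting whole chunks when a client's clock drifts (which is audible), the playback rate is nudged by a tiny capped amount until the measured offset decays. A toy sketch with illustrative constants, not Snapcast's actual algorithm:

```python
# Toy sketch of "feathered" sync correction: map a measured clock offset
# to a gentle playback-rate multiplier instead of a hard skip/insert.
# The 0.5% cap is illustrative; it keeps any pitch shift inaudible.
def feathered_rate(offset_ms, max_correction=0.005):
    """Return a playback-rate multiplier that gently corrects clock offset.

    offset_ms > 0 means this client is behind the server, so it plays
    slightly faster; offset_ms < 0 means it is ahead, so slightly slower.
    """
    correction = offset_ms / 1000.0
    correction = max(-max_correction, min(max_correction, correction))
    return 1.0 + correction
```

Applied every sync interval, small offsets produce proportionally small nudges and large offsets are corrected at the capped rate over several seconds, which is why there are no glitches.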
MUTE only means one thing - to stop something from producing sound. Maybe someone would use the word to imply silencing a mic, but that’s incorrect usage of the word.
One of the annoying things to me on the S3-Box-3 was that when timers triggered and the sound played, it could only be stopped by physically pressing the button on the box.
Is it possible to stop the timer alert sound via voice now?
For example, with Google Assistant you set your timer, it triggers and the bell rings, and then you can just say "stop" to have it stop ringing.
Well, “mute” is also quite an industry standard when it comes to microphones, so that in itself is fine. The tricky part is that in this device you have both directions, sound-wise.
Does it play well enough with an HA Green? I just bought one a month ago, and now I am hearing that the voice assistant that just launched might need better hardware to process the requests, which would be a pity and a waste of money on my already-made purchase.
That Rockchip SoC isn't going to be fast enough to run LLMs locally, but otherwise you should be able to run the voice model locally with decent performance (guesstimate 3-5 s response time for basic on/off commands).
Your best bet is going to be to just use Home Assistant Cloud, though, and not do it 100% locally.
To be fair: the product (documentation) page was updated quite a bit in the last 24 hours, so things that were missing before were only added later.
Also: this device is marketed as an "out of the box" experience without needing to DIY. A livestream recording for the hardcore fans is nice to have, but it shouldn't be needed when the documentation is up to date. Which it now (mostly) is.