The era of open voice assistants has arrived

I suspect you are using Ollama to keep everything local, but with GPT all the commands you mentioned work well, with a high success rate. I had to rework the prompt quite a few times to get there, as well as use OpenAI “functions” for more complex tasks.
If you stick with Ollama, chances are it will get there eventually too.
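
For anyone curious what that looks like: a minimal sketch of the kind of tool definition I mean, using the OpenAI Python SDK. The `set_light` tool and its schema here are just an illustration, not my exact prompt or tools.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# One illustrative tool; a real setup defines one per device class.
tools = [{
    "type": "function",
    "function": {
        "name": "set_light",
        "description": "Turn a light on or off, optionally setting brightness.",
        "parameters": {
            "type": "object",
            "properties": {
                "entity_id": {"type": "string", "description": "e.g. light.kitchen"},
                "state": {"type": "string", "enum": ["on", "off"]},
                "brightness": {"type": "integer", "minimum": 0, "maximum": 255},
            },
            "required": ["entity_id", "state"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Dim the kitchen light to about half"}],
    tools=tools,
)

# Instead of free text, the model returns a structured call such as
# set_light(entity_id="light.kitchen", state="on", brightness=128),
# which is what makes the more complex commands reliable.
print(resp.choices[0].message.tool_calls)
```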

That’s how I have it configured, but it’s not really end-game. HA’s sentence support is extremely lacking at the moment, and while I can add sentences and remember what’s possible, that doesn’t pass the family test; the LLM is the best option there.

This was my worry, but my understanding was that done correctly with the right model and implementation it could work well. I understand we’re not there yet and that’s ok, but I’m hoping it’s possible.

Absolutely, directive number one for my home automation is that all control remains entirely local, with zero dependence on outside connections.

Spotify is the only external connection I allow but that’s acceptable as it’s “only music”. I don’t consider this an exception to the rule as much as an exemption.

I’m not using open-web-ui yet but I’ll look into it, thanks for the heads up.

It doesn’t do closest, just whichever is 1st in any group so far.

What group?

Here’s my review of the PE after a few days:

  • packaging: I’m not usually interested in nice packaging because it’s the content I’m buying, so maybe it’s because it’s Nabu / HA, but honestly I’ve rarely seen such a beautiful package. Surprising what you can do with just cardboard!
  • case: much nicer than I expected; at first I was disappointed because I would have preferred a round one
  • setup: super smooth!

Now the meat:

  • disclaimer: I’ve never owned a Google / Amazon etc. device, but I’ve built several DIY voice assistants with ESP32 / Raspberry Pi. They were not good enough to use daily.
  • mic array: I was expecting much lower performance, but it’s actually good enough to keep the system in production and usable. I would have loved at least a 4-mic array though, because where the PE sits, it would make a big difference.
  • speaker: biggest disappointment here. It’s acceptable, no more. But hey, it’s only the Preview Edition.
  • wake word: using “OK Nabu”, it’s surprisingly good with me. I can be heard from the other level of my house, separated by a flight of stairs and 6 meters. It reaches its limits when music or TV is playing at mid to high volume. It works OK for adults, but my 3+ year old has great difficulty triggering the wake word. More on that later.
  • STT: We speak French. With my previous DIY builds I went 100% local and it was not good enough. Since local support for French is advertised as not working, I went with Google STT. With that, it recognizes almost everything I say flawlessly. From what I understand, I should pay nothing to a few cents a day.
  • Agent: As in my previous DIY tests, I quickly became frustrated with the limitations of the intent system, so I installed the OpenAI agent. It also looks like I should pay nothing to a few cents a day. I’m using the new option to try the local agent first and then fall back to OpenAI, which is super neat (see the sketch after this list).
  • With Google STT + OpenAI (or probably any other combination of cloud services), it’s honestly awesome. I can speak normally and 80% of the time it will do what I want, even with queries that can be somewhat complex, such as “Close all the roller shutters except the one in the living room”.
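
To make that local-first / cloud-fallback option concrete, the logic is roughly this (a simplified sketch of the behaviour as I understand it, not HA’s actual code; the function names and stub bodies are made up):

```python
def local_intent_match(text: str):
    """Stand-in for HA's built-in sentence/intent matching."""
    if text.lower() == "turn on the kitchen light":
        return "Turned on the kitchen light."
    return None  # no local sentence matched


def cloud_llm_agent(text: str) -> str:
    """Stand-in for the OpenAI conversation agent."""
    return f"(cloud agent handled: {text})"


def handle_command(text: str) -> str:
    # Try the fast, free, fully local path first...
    local = local_intent_match(text)
    if local is not None:
        return local
    # ...and only send commands no local sentence matched to the cloud.
    return cloud_llm_agent(text)


print(handle_command("turn on the kitchen light"))                # stays local
print(handle_command("close everything except the living room"))  # falls back
```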

Now for the local side of things:

  • Wake word: still needs work, since it’s the most crucial piece for ensuring privacy, but it’s coming along really nicely! For my 3+ year old, I used HA’s website to add her voice, but I’m wondering what the process for using those samples is. More precisely, when can I expect those samples to actually be integrated into the training of my local device?
  • STT is the main remaining problem in poorly supported languages. Sure, we can contribute to Common Voice, but again we lack any indication of when our samples will actually be used. Also, I could imagine a far better system, but it would require development, infrastructure, community involvement, and licensing issues to be solved: what if our satellites could be used at the same time with the normal Assist pipeline (using Google or Azure STT) and with a HA Cloud service that would train an STT model by collecting our sentences along with the text returned by the cloud service? Yes I know, it’s a huge project, but we can always dream :slight_smile:
  • LLM: we’re already covered there for local solutions, although it seems people are having problems. I’ll give it a try on my i3-10100 with 16 GB of RAM and no discrete GPU, but I doubt it can be any good. Even so, wouldn’t I pay more for the electricity than for a cloud LLM service…? (rough numbers below)
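
A rough back-of-envelope on the electricity question (every number here is an assumption): if the i3 draws about 65 W under inference load for a total of 30 minutes a day, that is 65 W × 0.5 h ≈ 0.033 kWh per day, or under one cent a day at €0.25/kWh. So electricity alone stays in the same “few cents a day” range as the cloud services; the real question is whether a CPU-only model is fast and accurate enough to be worth it.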

Coming back to the PE device, what I would love to see in a future Nabu device:

  • at least a 4-mic array
  • better speaker(s), enough for music in a bedroom / office, so around 20 to 30 W
  • enough punch to run an LD2450 mmWave sensor and a BT proxy on top of that
  • even more GPIO extensions and enough empty space inside to add whatever low-power sensors we want

Congratulations Nabu Casa team, you did very nice work there!!!

I dunno to be honest, as it was answered in the long thread above; I just remember thinking that 1st is going to have some bad effects.
If you have 2 in a zone, floor or whatever, then depending on how many you have, 1st will likely pick a bad microphone about 50% of the time, depending on your network, your hardware and where you are standing.
I have been advocating that cloning the commercial peer-to-peer systems that are so disliked seems an odd infrastructure choice, when sharing intensive processes such as ASR, TTS & LLM is such an obvious client/server structure with its diversity of use.
I have been saying I have a hunch that, as long as you use the same model and hardware in every zone, the KW hit probability given by the KWS would likely be analogous to the best signal received (see the sketch below).
The question has been asked before and the answer was 1st, unfortunately.
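
Something like this is what I have in mind: wait out a short window and keep the best-scoring detection rather than the first packet to arrive. A sketch of the idea only, not anything HA implements; the window length is a guess:

```python
WINDOW_S = 0.3  # assumption: co-located satellites all report within ~300 ms

def pick_satellite(detections):
    """detections: list of (satellite_id, kws_score, arrival_time_s) tuples.

    First-to-arrive rewards whichever device won the network race; this
    instead keeps the highest KWS score inside a short window, on the hunch
    that with identical models and hardware the best score tracks the best
    received signal."""
    if not detections:
        return None
    t0 = min(t for _, _, t in detections)
    in_window = [d for d in detections if d[2] - t0 <= WINDOW_S]
    return max(in_window, key=lambda d: d[1])[0]

# "kitchen" answers first, but "hall" heard you better:
print(pick_satellite([("kitchen", 0.71, 0.00), ("hall", 0.93, 0.12)]))  # hall
```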

Yeah, same experience. Ollama is just not reliable or predictable at the moment. If only there were an Ollama model designed for controlling HA.

One of the reasons I tried Ollama (the first reason being that it’s 100% local) is that faster-whisper is also not reliable, and Ollama could fix its errors.
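
What I mean by fixing errors, roughly: run the raw transcript through the local model before intent matching. A sketch under assumptions: Ollama’s documented /api/generate endpoint on its default port, with llama3.1 as a placeholder model name:

```python
import requests

def correct_transcript(raw: str) -> str:
    """Ask a local Ollama model to repair a mangled faster-whisper
    transcript before it reaches intent matching."""
    prompt = (
        "Fix likely speech-to-text errors in this smart-home command, "
        f"changing as little as possible. Reply with only the command:\n{raw}"
    )
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1", "prompt": prompt, "stream": False},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["response"].strip()

# e.g. correct_transcript("turn of the kitten light")
# might come back as "turn off the kitchen light"
```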

It’s still not clear to me what 1st means. So if I have 2 PEs in 2 zones and both are able to hear a command, will the “closest” one be activated, or is it unpredictable? I unfortunately won’t have the hardware until the end of January due to supply issues, otherwise I would test it myself.

Not closest, just whichever one reacts 1st via the network…

How do you change the wifi network of the device?

Make sure you VOTE for the timer function to be expanded so it can really be used in HA instead of just on the device itself!

Make sure you vote on this!

Agreed, and not only timers: alarm clocks would be nice to have too.

Top and bottom parts done.

Both available with STP files.

For those of you who want to keep the enclosure of the device but want to have it near-vertical:

Printing one right now in the same color as the black silk case I printed earlier this week.

A remix of my model that allows for direct embedding of the electronics and controls (jog/button) of the Voice Assistant in my model would be most welcome. It would, of course, require disassembly of the device… but that would be kind of the point of such a remix.

It’s a good file with nice tolerances. Came out nicely at 0.12 mm on an X1C.

Solid build and nicely done… although for me at least, it needs something (dunno what yet) to snap the device in. Until I get my microWakeWord in, I’m punching the button a lot, and I’ve already knocked it out twice. Don’t know what that snap thing is or looks like yet. As soon as I figure it out, I’ll pop back.
