2023: Home Assistant's year of Voice

Maybe Home Assistant should split voice recognition out into a separate open-source project and make a voice plugin for HA. Good open-source voice recognition is not only a Home Assistant problem; it is a problem for the whole open-source community.


The person they hired is the developer of a project exactly like what you’re describing, Rhasspy, so that does not make much sense.

Well, it makes some sense, and in fact that is probably what they will do.

Nabu Casa is exclusive to HA, but still not really needed for HA; it is more of a complementary product.
ESPHome is also a project that is not exclusively for HA, but HA benefits a lot from it.

I think Rhasspy will be following the ESPHome format.

Hi, I think there’s a crucial step in voice-assistant algorithms that all of them miss: intent clarification. In theory, people know specifically what they want to communicate to the voice assistant and get it done. However, most people don’t even know exactly what they want from the voice assistant; they usually have only a vague idea. This is why most voice assistants have failed to be used for more than simple mundane tasks like setting timers and such. Nobody can clearly explain what they want if it’s too complex. The voice assistant thus needs an intermediate step in its algorithm: intent clarification. This step should help the user clarify their intent by suggesting what they possibly want to get done, even from vague requests. A more conversational, ChatGPT-style implementation is the missing ingredient needed to generate strong adoption of voice as a main form of computing input once and for all.
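
To make that concrete, here is a toy sketch of what such a clarification step could look like; the intent names, keywords, and scorer are purely illustrative, not any real assistant’s API:

```python
# Toy sketch of an intent-clarification step: if the top-scoring intent is not
# clearly ahead of the runner-up, ask a follow-up question instead of guessing.
# Intent names, keywords, and the scorer are all illustrative.

def rank_intents(utterance: str, intents: dict[str, list[str]]) -> list[tuple[str, float]]:
    """Score each intent by keyword overlap with the utterance (toy scorer)."""
    words = set(utterance.lower().split())
    scored = [
        (name, len(words & set(keywords)) / len(keywords))
        for name, keywords in intents.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

def respond(utterance: str) -> str:
    intents = {
        "lights_dim": ["dim", "dimmer", "lower"],
        "lights_off": ["off"],
        "movie_scene": ["movie", "film"],
    }
    (best, best_score), (second, second_score) = rank_intents(utterance, intents)[:2]
    # Vague request: the two best guesses are too close, so clarify first.
    if best_score - second_score < 0.2:
        return f"Did you mean '{best}' or '{second}'?"
    return f"Executing intent: {best}"

print(respond("dim the lights"))                # clear enough -> executes lights_dim
print(respond("do something with the lights"))  # vague -> asks a follow-up question
```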

Voice recognition in a form like Rhasspy will mostly just be another way of triggering automations in HA, where a button or other trigger is not really practical.
It could be used to add flour to the shopping list while you are kneading dough, or, as I do, to get train times while putting on shoes and jacket in the morning (I am a subconscious time optimist by nature, so I am always balancing on seconds, and actually using the touch interface on the tablet might make me lose those important seconds :smiley:)
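
For what it’s worth, in Home Assistant such a voice intent can be wired up with the intent_script integration; a minimal sketch, assuming the voice front end (e.g. Rhasspy) sends an intent I’ve called AddFlour here:

```yaml
# Minimal sketch: map a recognized "AddFlour" voice intent (hypothetical name,
# it must match whatever the voice front end sends) to the built-in shopping list.
intent_script:
  AddFlour:
    speech:
      text: "Added flour to the shopping list."
    action:
      - service: shopping_list.add_item
        data:
          name: flour
```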

I think the devs have been clear that this sort of thing is an explicit non-goal. They want a limited vocabulary in lots of languages that is focused on the control functions of HA, and not anything like a smart speaker from Google or Amazon.

I think it’s completely correct that the community is ill-equipped to build local-only general-assistant functionality. However, based on personal experience, I don’t think this reduced functionality will be viewed as very useful when implemented, especially after folks have become used to what Google Home and Alexa can already do.

Time will tell, and I think the devs are smart and will course correct if this turns out to be less useful than people think.

It turns out that someone using Apple gear has already managed to build a dream voice smart home. There is also a video in the article. It is not clear how, but the proprietary technology in conjunction with ChatGPT works so well. ChatGPT in an iOS Shortcut — Worlds Smartest HomeKit Voice Assistant | by Mate Marschalko | Jan, 2023 | Medium


Hurrah. Another cloud solution. We are saved!


Hi, I recently found this blog post and thought you might be interested in the offline and open-source voice assistant Jaco:

It’s a bit similar to the old Snips assistant, but with a lot of architectural improvements, and this time completely offline and open source ;) The main improvement is in the skill concept: Jaco supports skills in any programming language and with any dependencies, not just Python scripts. To improve the security of such a feature, all skills are isolated in Docker containers, and a permission system prevents skills from accessing topics they shouldn’t have access to. So it’s really easy to extend the assistant’s functionality with new skills, which can then be shared with others through Jaco’s skill store.

The assistant currently supports German, English, Spanish and French, and the recognition performance is state-of-the-art in most speech-to-intent benchmarks. It’s designed to run on most Linux computers or directly on a Raspberry Pi.


Regarding your initial post about collecting intent examples for different home-automation actions, I think it would be quite easy to integrate them into a shareable Jaco skill that converts the voice intents into Home Assistant intents that trigger some actions.
(I don’t have HA myself, I’m using Jaco mainly for research purposes, so this would be a task for some motivated dev here.)
Most of Jaco’s skills are also compatible with Rhasspy, so it would be possible to use just the skills with Rhasspy as the assistant (for example, to improve the language support).
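
As a rough sketch of that bridging idea: the skill callback below is a hypothetical signature (check Jaco’s docs for the real skill API), but the Home Assistant REST endpoint it calls is the standard one:

```python
# Rough sketch of the bridge: on_intent() is a hypothetical skill entry point
# (Jaco's real skill API may differ); the Home Assistant REST call is real.
import requests

HA_URL = "http://homeassistant.local:8123"  # adjust to your instance
HA_TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"   # created under Profile -> Security in HA

# Map voice-intent names to Home Assistant service calls (entities made up).
INTENT_TO_SERVICE = {
    "LightOn":  ("light", "turn_on",  {"entity_id": "light.living_room"}),
    "LightOff": ("light", "turn_off", {"entity_id": "light.living_room"}),
}

def on_intent(intent_name: str) -> None:
    """Forward a recognized voice intent to Home Assistant's REST API."""
    domain, service, data = INTENT_TO_SERVICE[intent_name]
    response = requests.post(
        f"{HA_URL}/api/services/{domain}/{service}",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json=data,
        timeout=5,
    )
    response.raise_for_status()

on_intent("LightOn")  # e.g. after the assistant recognizes "turn on the light"
```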

Following the ideas from the posts directly above, it might also be possible to use ChatGPT to convert more sophisticated voice requests directly into HA intents. (Jaco recently got a ChatGPT skill, which could be used as a base implementation.)
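
A toy sketch of that idea with the current OpenAI Python client; the model name and prompt are my assumptions, and you need an OPENAI_API_KEY set:

```python
# Toy sketch: let ChatGPT turn a free-form request into a structured
# Home Assistant service call. Model name and prompt are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "Convert the user's smart-home request into JSON with the keys "
    "'domain', 'service' and 'data' for a Home Assistant service call. "
    "Reply with JSON only."
)

def request_to_intent(utterance: str) -> dict:
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": utterance},
        ],
    )
    return json.loads(completion.choices[0].message.content)

# Might print: {"domain": "light", "service": "turn_on", "data": {"entity_id": "light.kitchen"}}
print(request_to_intent("it's way too dark in the kitchen"))
```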


What do you think about it?

One thing is clear: all ML-based technologies consume a lot of resources. It would be desirable for HA devices to have a dedicated ML processor, like a Google Coral. HA installations could also form a peer-to-peer network and share computing resources. Huge arrays of hard drives would also be needed, because today’s neural networks encode information from the entire Internet. So it may make sense to split things: a primitive voice assistant runs locally, while a super-advanced one uses the power of Nabu Casa, for example. Of course, I would want everything to be local, but nobody yet makes hard drives big enough to hold the knowledge of all mankind. It might be possible, though, to install a Home Assistant server for a whole residential building, by agreement with the builder, serving thousands of flats.

ML is really the art of defining a limit and sticking to it.
Rhasspy can run on an RPi4 with an 8 GB drive, so it is possible, but it will not be a voice assistant that can handle everything. The corpus used to train it is limited to common words for home-automation commands, and the CPU usage is kept down by having a list of accepted commands, so searching that list is all that is needed.
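
To make the “list of accepted commands” concrete, a Rhasspy sentences.ini template looks roughly like this (the entity names are made up):

```ini
[ChangeLightState]
light_name = (living room lamp | kitchen light){name}
turn (on | off){state} [the] <light_name>

[GetTrainTimes]
when is [the] next train [to work]
```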

Because Alexa listens to everything (and stores it and uses it against you), even when you have not activated it with its ‘Alexa’ wake word.
One word: privacy.
Law enforcement today uses people’s private Facebook data to indict them.
Unlike Facebook, Alexa does not only capture what you give to it; it listens to, and saves, everything around you, ALL THE TIME, IN YOUR OWN HOME.

The guy even tries to usurp the Home Assistant name.

You do know that Google Home devices don’t do that. The Nest Hubs process all the audio locally with a trained model. No audio gets sent to Google. That’s why the voice recognition doesn’t work as well as Alexa’s. :slight_smile:

You don’t need to settle for a limited vocabulary and speech control just because you don’t want your audio sent out of the home. Local control of course is different, but it would be good if folks didn’t assert the need for a new voice interface for privacy reasons when the Google Nest Hubs already provide that, at least compared to Alexa.

Google only runs the hotword detection offline.
Once a hotword is detected, or it thinks one is detected, it transmits audio to Google’s cloud service.
By default, Google Assistant does not save the audio data, but it seems to save everything else by default: data derived from the audio, the location, and so on.
https://support.google.com/googlenest/answer/7072285


I’m not really familiar with how Alexa or Google Assistant works in the background (not that interested, as neither supports Polish on smart devices), but just one question to knowledgeable people: would it be possible to create a sort of skill that would listen to what we say to Alexa or Google and return the transcript to HA? This way the existing hardware could be used to listen for voice commands, and the new HA voice functionality would take it from there… Obviously we would not disconnect from the cloud this way, but we could use the quality microphones that are readily available.

I do not think Google and Amazon will allow that, unless they decide to abandon smart devices completely, which is unlikely.

I would think it’s possible, but it would probably work the way my car-control skills work.

First, you have to launch the skill by saying “launch my car controls”. Once in the skill, you can then say “start my car”, since at that point you are talking to “the skill”.

The problem is that the skill is still cloud-based, and you now have a two-step process (launch the skill, then ask it to do something). At that point, you might as well just use the existing Home Assistant Alexa integration options.
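
For illustration, a minimal skill along those lines with the ASK SDK for Python might look like this (the intent name is made up); it shows exactly the two-step flow described above:

```python
# Minimal sketch with the ASK SDK for Python; "StartCarIntent" is made up.
from ask_sdk_core.skill_builder import SkillBuilder
from ask_sdk_core.dispatch_components import AbstractRequestHandler
from ask_sdk_core.utils import is_request_type, is_intent_name

class LaunchHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        return is_request_type("LaunchRequest")(handler_input)

    def handle(self, handler_input):
        # Step 1: "Alexa, launch my car controls" lands here; the session stays open.
        return (handler_input.response_builder
                .speak("Car controls ready. What would you like to do?")
                .ask("What would you like to do?")
                .response)

class StartCarHandler(AbstractRequestHandler):
    def can_handle(self, handler_input):
        return is_intent_name("StartCarIntent")(handler_input)

    def handle(self, handler_input):
        # Step 2: only now can the user say "start my car".
        return handler_input.response_builder.speak("Starting the car.").response

sb = SkillBuilder()
sb.add_request_handler(LaunchHandler())
sb.add_request_handler(StartCarHandler())
handler = sb.lambda_handler()  # entry point when hosted as an AWS Lambda
```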

I’ve seen a few comments about using ESPs as satellite microphones, and replies saying they likely don’t have enough power to do wake-word detection on their own. Another comment mentioned the Nicla Voice, but at $82 USD for the chip itself it’s hard to justify. Has anyone looked into the ESP32-S3 chip for this? It seems perfect for local voice control. It has on-device voice processing, and Espressif has been training more custom wake words for it. The ESP32-S3-Korvo board looks like it could fit inside a Google Home/Nest Mini. There’s also the cheaper ESP32-Korvo, which uses an ESP32-WROVER-E chip.

The newer ESP chips look promising, and on the Rhasspy forum there are discussions on that topic, like this one:
