2023: Home Assistant's year of Voice

TL;DR: It is our goal for 2023 to let users control Home Assistant in their own language. Mike Hansen, creator of Rhasspy, has joined Nabu Casa to lead this effort. We’re starting off by building a collection of intent matching sentences in every language.

Usually, the month of December is meant to reflect back. However, we already did that last month when we hosted the State of the Open Home 2022. We didn’t only reflect, we also announced our focus for next year: 2023 is going to be the year of voice.

It is our goal for 2023 to let users control Home Assistant in their own language.

It’s a big and bold goal, but achievable given the right constraints. The amount of work laid out for us can be summarised as follows:

Our #1 priority is supporting different languages. There are enough projects out there trying to create an English voice assistant. But for us, that just doesn’t cut it. People need to be able to speak in their own language, as that is the most accessible and only acceptable language for a voice assistant for the smart home.

To keep the amount of work ahead of us manageable, we’re going to limit the number of possible actions and focus on the basics of interacting with your smart home. No web searches, making calls, or voice games. And definitely no “by the way”s!

We are going to start with a few actions and build up the language models around that. Home Assistant supports 62 different languages in its user interface. And it’s our goal to support all these languages with voice. We think that we can achieve that by leveraging Home Assistant’s strongest asset: our community.

Our history with voice assistants

If you follow the news, it might sound like voice assistants have failed. Amazon is set to lose $10 billion on Alexa this year and is planning layoffs. Google, too, is reducing its support for Google Assistant as it tries to cut costs. The truth is that voice as the next computing platform, one that drives billions of dollars of extra revenue, has failed. Instead, users mainly use their voice assistants to manage shopping lists, set timers, play music, and control their homes. Voice has failed as a source of revenue; it has not failed its users.

With Home Assistant we’ve always been interested in voice. We used to work with Snips back in the day, but they got acquired and shut down. We worked with Stanford on their Almond/Genie platform, but it is a research-driven project that never became production-ready. And yes, you can use Home Assistant to send all your data to the clouds of Google and Amazon to leverage their voice assistants, but you shouldn’t have to give up your privacy to turn on the lights by voice.

The most promising project out there is Rhasspy, created by Mike Hansen. It allows people to build their own local voice assistant, which can also tie into Home Assistant. Rhasspy stands out from other open source voice projects because Mike doesn’t focus on just English. Instead, his goal is to make it work for everyone. This is going great, as Rhasspy already supports 16 different languages today.

With Home Assistant we want to make a privacy and locally focused smart home available to everyone. Mike’s approach with Rhasspy aligns with Home Assistant, and so we’re happy to announce that Mike has joined Nabu Casa to work full-time on voice in Home Assistant.

Iterating in the open

With Home Assistant we prefer to get the things we’re building into users’ hands as early as possible. Even basic functionality allows users to find things that work and don’t work, allowing us to adjust the direction if needed.

A voice assistant has a lot of different parts: hot word detection, speech to text, intent recognition, intent execution, and text to speech. Making each work in every language is a lot of work. The most important parts are intent recognition and intent execution. We need to be able to understand your commands and execute them.

We started gathering these command sentences in our new intents repository. It will soon power the existing conversation integration in Home Assistant, allowing you to use our app to write and say commands.

The conversation integration is exposed in Home Assistant via a service call and is also available via an API to external applications or scripts. This allows developers to experiment with sending commands from various sources, like a Telegram chatbot.
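As a rough sketch of what sending a command from a script could look like, the snippet below builds (but does not send) an HTTP request. It assumes the REST endpoint `/api/conversation/process`, a bearer access token, and a JSON body with `text` and an optional `language` field; verify these against the API documentation for your Home Assistant version. The host name, token, and the helper name `build_conversation_request` are placeholders made up for illustration.

```python
import json
import urllib.request


def build_conversation_request(base_url, token, text, language=None):
    """Build a POST request for the conversation API (assumed endpoint and payload)."""
    payload = {"text": text}
    if language is not None:
        payload["language"] = language  # assumption: optional language field
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/conversation/process",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder host and token; nothing is sent over the network here.
request = build_conversation_request(
    "http://homeassistant.local:8123", "YOUR_TOKEN", "turn on the bedroom lights", "en"
)
# To actually send it: urllib.request.urlopen(request)
```

Keeping the request construction separate from the sending makes it easy to plug the same helper into a chatbot or any other source of text commands.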

How you can help

For each language we’re collecting sentences of commands that control your smart home in our intents repository. Each sentence will need to be annotated with its intention.

Take for example the sentence “Turn on the bedroom lights”. Write it up like “Turn on the {area} lights” and it becomes a generic command to turn on all the lights in a specific area. Now we need to collect all the other variations too.
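To make the idea of a generic command concrete, here is a minimal sketch that turns a template such as “Turn on the {area} lights” into a regular expression and pulls the slot value out of a spoken or typed sentence. This is only an illustration of the concept, not the matcher Home Assistant uses; the function names are hypothetical.

```python
import re


def template_to_regex(template):
    """Compile a template with {slot} placeholders into a regex with named groups."""
    # re.split with a capturing group alternates: literal, slot name, literal, ...
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)       # literal text must match exactly
        else:
            pattern += f"(?P<{part}>.+?)"    # slot captures one or more characters
    return re.compile(f"^{pattern}$", re.IGNORECASE)


def match_sentence(template, sentence):
    """Return a dict of slot values if the sentence fits the template, else None."""
    match = template_to_regex(template).match(sentence.strip())
    return match.groupdict() if match else None


print(match_sentence("Turn on the {area} lights", "Turn on the bedroom lights"))
print(match_sentence("Turn on the {area} lights", "Turn off the bedroom lights"))
```

A single template only covers one phrasing, which is why every other variation of the command still has to be collected.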

We’ve created a YAML-based format to declare and test these sentences. The next step is that we need you 🫵

For each language we’re going to need one or more language leaders. Language leaders are responsible for reviewing the contributions in their language and making sure that they are grammatically correct. If you want to apply to be a language leader, join us in #devs_voice on Discord or open an issue in our intents repository.

We also need people who want to contribute sentences in their language to help build out our collection. See our intents repository for how to get started.


This is a companion discussion topic for the original entry at https://www.home-assistant.io/blog/2022/12/20/year-of-voice/
11 Likes

Yippee! Local voice assistant! I do not use them now because they are not local. I like the approach to cover languages over function. Looking forward to seeing how HA Voice evolves in 2023! I need to start planning for HW. What is HW going to look like for local voice assistants throughout a home?

1 Like

Interesting. Will keep my eye on it, what with not wanting to give the kids smartphones.

We’re a bilingual household. Any thoughts on that situation? Different hot words picking a particular language?

4 Likes

Is it going to be possible to incorporate ownership into the intents? I.e., “Turn on {person’s} {device}”? My wife and I both have switches on our nightstand, for example, so it would be great to be able to say “Turn on my nightstand” and have my light turn on.

6 Likes

Hi!

That’s great news! Unfortunately, some languages will be more difficult to deal with than English.

Take a look at the sample sentence from this article:

Turn on the bedroom lights

“Bedroom” is an area name and it is “sypialnia” in Polish. The Polish sentence in this case would be:

Włącz światło w <area> which translates 1:1 as Turn on the lights in <area> .

Unfortunately you can’t use the base form “sypialnia” here. Polish has declension, and this sentence requires the locative case. So the form used here would be “sypialni” („Włącz światło w sypialni”, not „Włącz światło w sypialnia”). You can see all forms of “sypialnia” in Wiktionary (I’ve reached the limit of 2 links per comment for new users). If the area were the living room, the base form in Polish would be “salon” and the locative case would be “salonie”.

Do you have any plans to support more complex grammars like the Polish one?
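To illustrate the problem, here is a small sketch of one possible workaround: listing the inflected forms next to each area name and mapping whatever form was spoken back to the base name. The table only contains the two areas mentioned above, and the names `AREA_FORMS` and `resolve_area` are hypothetical, not part of Home Assistant or Rhasspy.

```python
# Each area lists its base form plus the locative form used after "w".
AREA_FORMS = {
    "sypialnia": ["sypialnia", "sypialni"],
    "salon": ["salon", "salonie"],
}

# Reverse lookup: any known form -> base area name.
FORM_TO_AREA = {
    form: area for area, forms in AREA_FORMS.items() for form in forms
}


def resolve_area(sentence):
    """Return the base area name for 'Włącz światło w <area>', or None if unknown."""
    prefix = "włącz światło w "
    text = sentence.strip().lower()
    if not text.startswith(prefix):
        return None
    return FORM_TO_AREA.get(text[len(prefix):])


print(resolve_area("Włącz światło w sypialni"))
print(resolve_area("Włącz światło w salonie"))
```

The obvious cost is that somebody has to provide the extra forms for every area, which is exactly the kind of per-language work the question is about.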

5 Likes

An ESP with a microphone sitting around waiting to accept voice commands would be amazing… Do you plan to support voice authentication too, or just work on accepting the words of anyone for a start?

9 Likes

The hardest part of a voice assistant isn’t the software, but the hardware. Google and Amazon have had some level of success because they have attractive speakers that people actually want to place around their home. They also have integrations with third parties, so e.g. my Sonos speakers can accept voice commands.

So in my case, it doesn’t matter how capable the assistant is: I won’t use it unless it can be triggered from my Sonos devices.

Don’t get me wrong; I like the direction, and am very interested in the service calls to interact with the assistant, which has potential for automation. But I can’t see myself using the voice component if it requires me to place esp32s in exposed locations around my house.

30 Likes

Rhasspy, even if it works perfectly well in all languages, is only half the solution.

The other half is the hardware. To replace Alexa in my setup for home management commands, I’d need decent far-field microphones and a speaker capable of TTS in an attractive case that could be displayed in each room of the house without looking ugly. (A bare ESP in an oblong box is not a device that will score many WAF points.)

To be an even better replacement, it would have to be able to plug into HA media to stream music.

And I’d need one in every room as I now do with Alexa, so price point would also be an important factor.

21 Likes

Must be time to have another look at ripping apart those Google home minis and replacing the guts with an ESP32 again.

8 Likes

Looking forward to this. Someday I hope to be able to ask HA to turn on device for X minutes… device turns on and X minutes later, it turns off.

9 Likes

Unfortunately, kinda agree here. I’m all for private, local voice assistants. But Alexa is pretty ingrained in our house and supports so much, that I don’t think I could swap out Alexa for Rhasspy without a mini revolt. I assume some/most of this voice goodness can make it back to Alexa/Google Home in addition to Rhasspy?

3 Likes

Why trade Alexa or anything else for “HA Assistant”? Use both. They will have different wake words so it’s like having 2 servants instead of 1.

3 Likes

Very excited about this being given some priority. I live in a bilingual family (English / Japanese) and we are constantly frustrated by the fact we can’t get Google Assistant to recognize references to entities in two different languages. Having a more locally definable and controllable assistant may be a way to work around this.

4 Likes

I’m sure that this has already been brought up as an option, but having phrases as aliases would be ideal. In my view it would be infinitely customizable, and much simpler to implement than built in actions.

For example, we could have service.voice “turn off lights” be the trigger for an automation, and it’s up to the user to complete the actions of the automation.

This would be similar to tags and how they are used.

2 Likes

This is what I’ve done in the past. There are some open source models for detecting a language from audio, but I’m not sure how good they are.

For matching text to intents, we can usually trade accuracy for flexibility by accepting more sentences than we need. In English, this would roughly be something like turn on (a | an | the) <name>, which will accept sentences like “turn on an light”. Not a problem, since the speech to text system producing the text should have a proper language model.

This will become a problem later when we want to generate sentences from the templates and train offline speech to text systems, so we will need more complex ways of filtering out ungrammatical sentences.
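A quick sketch of why generation is harder than matching: expanding the (a | an | the) template above into every sentence it covers also produces the ungrammatical ones. The template syntax handled here is a simplified assumption (flat groups separated by `|`, a literal `<name>` placeholder), and the function names are hypothetical.

```python
import itertools
import re


def expand(template):
    """Expand every (x | y | z) group in the template into all combinations."""
    # Even indexes are literal text, odd indexes are the contents of a group.
    tokens = re.split(r"\(([^()]*)\)", template)
    choices = []
    for index, token in enumerate(tokens):
        if index % 2 == 0:
            choices.append([token])
        else:
            choices.append([alt.strip() for alt in token.split("|")])
    return ["".join(combo) for combo in itertools.product(*choices)]


def generate(template, names):
    """Fill the <name> placeholder of every expansion with every given name."""
    return [
        sentence.replace("<name>", name)
        for sentence in expand(template)
        for name in names
    ]


for sentence in generate("turn on (a | an | the) <name>", ["light"]):
    print(sentence)
```

The output includes “turn on an light”, which is harmless when matching text but is the kind of sentence that would have to be filtered out before training a speech to text system on the generated data.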

Also, we have some discussion around Polish gender in responses going on now :slight_smile:

1 Like

I think this is a worthy challenge, and the geek in me loves the idea. Especially the part about local control, as an alternative to big corporations trying to monetize our every action.

The practical side of me isn’t so sure. Others have already mentioned the hardware hurdle. I just don’t see myself investing in microphones for every room. Another concern is taking away developer resources from all the existing feature requests and WTH’s out there. There are so many worthy ideas and fixes which would improve the existing HA environment. Finally, and this is just me, voice control isn’t my thing. I’ve never really wanted to hear what any computer has to say. And I’m quite certain they don’t want to hear what I have to say to them!

I wish the dev team well with this project, even if it doesn’t really help me much. If nothing else it would be a win for local control over corporate greed.

4 Likes

A lot of languages have word forms. This isn’t really a problem in Rhasspy. You simply wouldn’t put “sypialnia” in your sentences, you would put “Włącz światło w sypialni”. And if you want to allow multiple forms, that’s not a problem either - it would recognize a wrong sentence if you uttered one, but why would you?

2 Likes

I can’t help but think you’re biting off way more than you can chew here and that you should spend your resources elsewhere.

How about a dashboard that’s easy to rearrange and resize?

23 Likes

I like the idea of having a local voice assistant. However, I’d say there are other things that should be a priority for Home Assistant, like revamping the entire auto-generated dashboard. The area card is so limited right now that I don’t even bother linking my entities to them. Multiroom audio control is still very basic, although it’s getting better with Music Assistant…

13 Likes

Another multilingual household here (German and Russian, living in an English speaking country).
I had great hopes for Amazon Echo, but the devices are primarily used to play music for us, and fail almost every time I try to control something. Often this is because of how I named a device or entity, and I couldn’t remember the exact name.
So, naming things in a useful way and then also naming them in multiple languages is probably what is going to keep me up at night in this year of voice.