Speech-to-Phrase brings voice home - Voice chapter 9

Welcome to Voice chapter 9 🎉 part of our long-running series following the development of open voice.

We’re still pumped from the launch of the Home Assistant Voice Preview Edition at the end of December. It sold out 23 minutes into our announcement - wow! We’ve been working hard to keep it in stock at all our distributors.

Today, we have a lot of cool stuff to improve your experience with Voice PE or any other Assist satellite you’re using. This includes fully local and offline voice control that can be powered by nearly any Home Assistant system.

Dragon NaturallySpeaking was a popular speech recognition program introduced in 1997. To run this software you needed at least a 133 MHz Pentium processor, 32 MB of RAM, and Windows 95 or later. Nearly thirty years later, Speech-to-Text is much better, but needs orders of magnitude more resources.

Incredible technologies are being developed in speech processing, but it’s currently unrealistic for a device that costs less than $100 to take real advantage of them. It’s possible, of course, but running the previously recommended Speech-to-Text tool, Whisper, on a Raspberry Pi 4 takes at least 5 seconds to turn your speech into text, with varying levels of success. This is why we ended up recommending at least an Intel N100 to run your voice assistant fully locally. That stung. Our opt-in analytics show that over 50% of Home Assistant OS users are running their homes on affordable, low-powered machines like the Home Assistant Green or a Raspberry Pi.

What’s more, advancing the development of Whisper is largely in the hands of OpenAI, as we don’t have the resources required to add languages to that tool. We could add every possible language to Home Assistant, but if any single part of our voice pipeline lacks language support, it renders voice unusable for that language. As a result, many widely spoken languages were unsupported for local voice control.

This left many users unable to use voice to control their smart home without purchasing extra hardware or services. We’re changing this today with the launch of a key new piece of our voice pipeline.

Voice for the masses

Speech-to-Phrase is based on old, almost ancient, voice technology by today’s standards. Instead of the ability to transcribe virtually any speech into text, it is limited to a set of pre-trained phrases. Speech-to-Phrase will automatically generate the phrases and fine-tune a model based on the devices, areas, and sentence triggers in your Home Assistant server - 100% locally and offline.

The result: speech transcribed in under a second on a Home Assistant Green or Raspberry Pi 4. The Raspberry Pi 5 processes commands seven times faster, clocking in at 150 milliseconds per command!

With great speed come some limitations. Speech-to-Phrase only supports a subset of Assist’s voice commands, and more open-ended things like shopping lists, naming a timer, and broadcasts are not usable out of the box. In practice, any command that accepts arbitrary words (wildcards) will not work. For the same reasons, Speech-to-Phrase is intended for home control only, not for LLMs.

The most important home control commands are supported, including turning lights on and off, changing brightness and color, getting the weather, setting timers, and controlling media players. Custom sentences can also be added to trigger things not covered by the current commands, and we expect the community will come up with some clever new ways to use this tech.
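For example, a sentence trigger in an automation is one way to add a custom phrase for Speech-to-Phrase to train on. Here is a minimal sketch - the phrases and the scene entity are placeholders for illustration:

```yaml
# Hypothetical automations.yaml entry - phrases and scene are examples only
automation:
  - alias: "Movie night voice command"
    trigger:
      - platform: conversation
        command:
          - "start movie night"
          - "movie time"
    action:
      - service: scene.turn_on
        target:
          entity_id: scene.movie_night
```

Because Speech-to-Phrase trains on the sentence triggers in your Home Assistant server, phrases like these should become part of the recognized vocabulary once the model is regenerated.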

All you need to get started with voice

Speech-to-Phrase is launching with support for English, French, German, Dutch, Spanish, and Italian - covering nearly 70% of Home Assistant users. Nice. Adding languages to Speech-to-Phrase is much easier than with the local Speech-to-Text tools currently available. This means many more languages will be available in future releases, and we would love your help adding them!

We’re working on updating the Voice wizard to include Speech-to-Phrase. Until then, you need to install the add-on manually from the add-on store.

Building an Open Voice Ecosystem

When we launched Home Assistant Voice Preview Edition, we didn’t just launch a product; we kickstarted an ecosystem. We did this by open-sourcing all parts and ensuring that the voice experience built into Home Assistant is not tied to a single product. Any voice assistant built for the Open Home ecosystem can take advantage of all this work. Even your DIY ones!

With ESPHome 2025.2, which we’re releasing next week, any ESPHome-based voice assistant will support making broadcasts (more on that below) and will be able to use our new voice wizard, ensuring new users have everything they need to get started.

This will include updates for the $13 Atom Echo and ESP32-S3-Box-3 devices that we used for development during the Year of the Voice!

New broadcast feature in action with Atom and Box 3

Large language model improvements

We aim for Home Assistant to be the place for experimentation with AI in the smart home. We support a wide range of models, both local and cloud-based, and are constantly improving the different ways people can interact with them. We’re always running benchmarks to track the best models, and make sure our changes lead to an improved experience.

If you set up Assist, Home Assistant’s built-in voice assistant, and configure it to use an LLM, you might have noticed some new features landing recently. One major change was the new “prefer handling commands locally” setting, which always attempts to run commands with the built-in conversation agent before sending them off to an LLM. We noticed many easy-to-run commands were being sent to an LLM, which can slow things down and waste tokens. If Home Assistant understands the command (e.g., “turn on the lights”), it performs the necessary action, and it only passes the command on to your chosen LLM if it doesn’t understand it (e.g., “what’s the air quality like now?”).

Adding the above features made us realize that LLMs need to understand the commands that are handled locally. Now, the conversation history is shared with the LLM. That context lets you ask the LLM follow-up questions that refer to recent commands, regardless of which agent handled them.

Left: without shared conversations. Right: Shared conversations enable GPT to understand context.

Reducing the time to first word with streaming

When experimenting with larger models, or on slower hardware, LLMs can feel sluggish. They only respond once the entire reply has been generated, which can take frustratingly long for lengthy responses (you’ll be waiting a while if you ask it to tell you an epic fairy tale).

In Home Assistant 2025.3 we’re introducing support for LLMs to stream their response to the chat, allowing users to start reading while the response is being generated. A bonus side effect is that commands are now also faster: they will be executed as soon as they come in, without waiting for the rest of the message to be complete.

Streaming is coming initially for Ollama and OpenAI.

Model Context Protocol brings Home Assistant to every AI

In November 2024, Anthropic announced the Model Context Protocol (MCP). It is a new protocol to allow LLMs to control external services. In this release, contributed by Allen Porter, Home Assistant can speak MCP.

Using the new Model Context Protocol integration, Home Assistant can connect to external MCP servers and make their tools available to the LLMs that Home Assistant talks to (for your voice assistant or in automations). There is quite a collection of MCP servers out there, including wild ones for scraping websites (tutorial), file server access, or even Bluesky.

With the new Model Context Protocol server integration, Home Assistant’s LLM tools can be included in other AI apps, like the Claude desktop app (tutorial). If agentic AI takes off, your smart home will be ready to be integrated.

Thanks Allen!

Expanding Voice Capabilities

We keep enhancing the capabilities of the built-in conversation agent of Home Assistant. With the latest release, we’re unlocking two new features:

“Broadcast that it’s time for dinner”

The new broadcast feature lets you quickly send messages to the other Assist satellites in your home. This makes it possible to announce it’s time for dinner, or announce battles between your children 😅.
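Under the hood, broadcasts build on the announce capability of Assist satellites, which you can also call directly from your own scripts or automations. A minimal sketch, assuming a satellite entity named assist_satellite.living_room (hypothetical):

```yaml
# Hypothetical script that announces dinner on one specific satellite
script:
  dinner_broadcast:
    sequence:
      - service: assist_satellite.announce
        target:
          entity_id: assist_satellite.living_room  # replace with your satellite
        data:
          message: "It's time for dinner!"
```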

“Set the temperature to 19 degrees”

Previously Assist could only tell you the temperature, but now it can help you change the temperature of your HVAC system. Perfect for changing the temperature while staying cozy under a warm blanket.

Home Assistant phones home: analog phones are back!

Two years ago, we introduced the world’s most private voice assistant: an analog phone! You can pick it up to talk to your smart home, and only you can hear the response. A fun feature we’re adding today is that Home Assistant can now call your analog phone!

Analog phones are great when you want to notify a single room instead of the entire home. For instance, when the laundry is done, you can notify someone in the living room but not in the office. And since you need to pick up the handset to receive the call, you will know whether your notification was received.

Have your Home Assistant give you a call

If you’re using an LLM as your voice assistant, you can also start a conversation from a phone call. You can provide the opening sentence and, via a new “extra system prompt” option, give the LLM extra context for interpreting the user’s response. For example (a rough automation sketch follows the dialogue):

  • Extra system context: garage door cover.garage_door was left open for 30 minutes. We asked the user if it should be closed
  • Assistant: should the garage door be closed?
  • User: sure
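A rough automation sketch for the scenario above might look like the following - the entity ids are hypothetical, and the exact field names of the start-conversation action may differ slightly:

```yaml
# Hypothetical automation: call the analog phone when the garage door has been
# open for 30 minutes and ask whether it should be closed
automation:
  - alias: "Ask about open garage door"
    trigger:
      - platform: state
        entity_id: cover.garage_door
        to: "open"
        for: "00:30:00"
    action:
      - service: assist_satellite.start_conversation
        target:
          entity_id: assist_satellite.voip_phone  # hypothetical VoIP satellite
        data:
          start_message: "Should the garage door be closed?"
          extra_system_prompt: >-
            The garage door cover.garage_door was left open for 30 minutes.
            We asked the user if it should be closed.
```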

Thanks JaminH for the contribution.

Wyoming improvements

Wyoming is our standard for linking together all the different parts needed to build a voice assistant. Home Assistant 2025.3 will add support for announcements to Wyoming satellites, making them eligible for the new broadcast feature too.  

We’re also adding a new microWakeWord add-on (the same wake word engine running on Voice PE!) that can be used as an alternative to openWakeWord. As we collect more real-world samples from our Wake Word Collective, the models included in microWakeWord will be retrained and improved.

🫵 Help us bring choice to voice!

We’ve said it before, and we’ll say it again—the era of open voice has begun, and the more people who join us, the better it gets. Home Assistant offers many ways to start with voice control, whether by building your own Assist hardware or getting a Home Assistant Voice Preview Edition. With every update, you’ll see new features, and you’ll get to preview the future of voice today.

A huge thanks to all the language leaders and contributors helping to shape open voice in the home! There are many ways to get involved, from translating or sharing voice samples to building new features—learn more about how you can contribute here. Another great way to support development is by subscribing to Home Assistant Cloud, which helps fund the Open Home projects that power voice.


This is a companion discussion topic for the original entry at https://www.home-assistant.io/blog/2025/02/13/voice-chapter-9-speech-to-phrase

Oh, I remember this! And if I recall correctly, one had to go through a few training sessions as well. It worked fairly well, but not well enough to use regularly (for me at least); indeed, things have come a long way since then :).


Awesome! Voice just keeps getting better and better. Thanks for everyone’s hard work!

Any ETA on when the PCB design files will be released for the Voice PE? I’ve been eagerly waiting :smile:

When are we going to be able to redirect voice responses to built-in commands to other speakers?


Could you start hosting videos off of YouTube?
I constantly get the “Sign in to confirm you’re not a bot” message, and there is no way to view the video once that happens. For some reason they don’t even link to the video you’re trying to watch - but they do link to related videos, ahh.


Not logged into YT. Cannot start video. So I am a robot. :robot:


Great idea with Speech-to-Phrase.

But after a few hours I have removed it again.

It is too limited.

It does not understand “set name to 50%”. I have to say “set name brightness to 50%”. That is too geeky.

My window Venetian blinds: I can say “open both blinds” and “close both blinds” (“both blinds” is a name), but I cannot set a cover to a position.

And that practically means that of the two voice functions I use most, one works in a geeky way and the other does not work at all. So I am back to the normal mode.

I will be following improvements, but for now it is too limited.

Speech-to-Phrase is a game changer! It works so much better than faster-whisper in my testing. Previously it took me 10 attempts to turn on a lamp with a weird name. Now I can turn it on/off on the first try (though it can still fail in some cases).

Whisper was the weakest part of the local voice pipeline; now I think the weakest part is microWakeWord. While it works better than openWakeWord, it’s far behind “OK Google”. It could be related to my satellite setup though. I am looking forward to improvements in microWakeWord.


You can’t do fallback to an LLM with Speech-to-Phrase. I think a good workaround would be support for two voice pipelines at the same time:

  1. Assist with Speech-to-Phrase and the “okay nabu” wake word
  2. Assist with faster-whisper and an LLM with the “alexa” wake word

:bulb:

And an option to set any media player for responses would be nice to have too.

I recently started a thread asking about custom or additional wake words for microWakeWord (specifically for Voice PE, but others as well!), without having to “take control” and mess around with ESPHome config files.

Has there been any progress on this? When can we expect to see additional wake words or custom wake words (even if training is a complex task) implemented into Voice PE?

By the way, this is my only complaint right now - everything about Voice PE is darn near perfect. Keep up the great work!

Is it possible to do the same thing that was implemented for intents?

STP should know whether a spoken phrase matches one of the ones it was trained on, and if not, send the audio to a secondary processor (Whisper, Nabu Casa Cloud, or whatever else).

Surely that’s a better solution than having two wake words?

Personally, I don’t mind using Nabu Casa Cloud for STT since it’s fast, reliable, and (I’m hoping!) at least a little bit secure. :slight_smile: I can’t go back to a predefined phrase vocabulary after using full STT for so long. The only way I would use STP is if there was built-in fallback to standard STT.

Maybe it would be possible in the future :man_shrugging:

Surely that’s a better solution than having two wake words?

I think having two wake words is a good workaround and it would make assist more flexible.

I already had a GPU that I was using for other services (Frigate, etc.), so I happened across this issue and managed to get Whisper running on my GPU with the distil-medium.en Whisper model. I’m getting 1-2 s response times, and it feels snappy while being all local.

It would be nice if there were a pre-built container tag for it, but it works!
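For anyone wanting to try the same thing, the GPU wiring in Compose looks roughly like this - the image tag is a placeholder for whatever you build locally, and the command-line flags may differ depending on your build:

```yaml
# Rough sketch of a Compose service for a locally built GPU Whisper image
# ("wyoming-whisper-gpu:local" is a placeholder, not a published tag)
services:
  whisper:
    image: wyoming-whisper-gpu:local
    command: >-
      --model distil-medium.en
      --language en
      --device cuda
      --uri tcp://0.0.0.0:10300
    ports:
      - "10300:10300"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```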

Yes, videos hosted off YouTube - PLEASE. Personally I’d love to see them on PeerTube (TILvids or an HA instance?), but really, anywhere - OK, maybe anywhere other than Twitch - would be better.

Maybe it could be a three-stage process, moving to the next stage when the current one doesn’t work:

  1. Recognized by Speech-to-Phrase
  2. Recognized by Assist
  3. Send to LLM.

Watch the release party video; they explained there why that is not possible. You cannot have anything after STP.


A Pi 4 is good for getting your feet wet with local voice assistants, but not good for performance use. I find the low-quality Piper voices tolerable, but I want to use medium- or high-quality ones for a production system, so I’m going to build something else to host Whisper (which is not the OpenAI version, by the way), Piper, and Ollama running a tuned AI model. I mean, Tony Stark certainly wasn’t running JARVIS on cheap ARM hardware.



If you think you need Tony Stark and JARVIS to turn on a light switch, then go for it.
Still doesn’t change the fact that it’s Whisper that forces the need for at least an N100.


PS: “RTF of pre-trained models — sherpa 1.3 documentation” has the real-time factor of running various models on a Pi 4 versus the number of threads.
For $5 more, the equivalent Pi 5 will be twice as fast as the Pi 4 - an 8% saving on price costs you more than 100% in performance - but as said, it’s not true that this can’t be done for less than $100.

Also, just because the OpenAI models have been quantized and run with a 10-second beam search and various other hacks doesn’t mean the source is not the original OpenAI models; it just means the WER is likely higher than the original.

Has anyone been able to get Conversations to work properly when using ChatGPT with a VOIP phone? It doesn’t work for me, as per this issue: 2024.10 breaks VOIP integration · Issue #128372 · home-assistant/core · GitHub
(the Announce action works fine however)

I really hope the streaming direction is expanded this year.
It would be great if Piper (or other TTS systems) could process streaming data for onward transmission to the satellite in streaming mode.
It would greatly improve the responsiveness of the system.

I made a docker image for wyoming-whisper and wyoming-piper that runs on Nvidia GPUs.
