Year of the Voice - Chapter 2: Let's talk

This year is Home Assistant’s Year of the Voice. It is our goal for 2023 to let users control Home Assistant in their own language. Today we’re presenting Chapter 2, our second milestone in building towards this goal.

In Chapter 1, we focused on intents – what the user wants to do. Today, the Home Assistant community has translated common smart home commands and responses into 45 languages, closing in on the 62 languages that Home Assistant supports.

For Chapter 2, we’ve expanded beyond text to audio: turning speech into text, and text back into speech. With this functionality, Home Assistant’s Assist feature can now offer a full voice interface for users to interact with.

A voice assistant also needs hardware, so today we’re launching ESPHome support for Assist and, to top it off, the World’s Most Private Voice Assistant. Keep reading to see what that entails.

To watch the video presentation of this blog post, including live demos, check out the recording of our live stream.

Composing Voice Assistants

The new Assist Pipeline integration allows you to configure all components that make up a voice assistant in a single place.

For voice commands, pipelines start with audio. A speech-to-text system determines the words the user speaks, which are then forwarded to a conversation agent. The intent is extracted from the text by the agent and executed by Home Assistant. At this point, “turn on the light” would cause your light to turn on 💡. The last part of the pipeline is text-to-speech, where the agent’s response is spoken back to you. This may be a simple confirmation (“Turned on light”) or the answer to a question, such as “Which lights are on?”
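
Put together, a pipeline is simply an ordered chain of services. As a purely illustrative sketch (pipelines are created in the settings UI, not in YAML, and the stage and service names below are hypothetical), you can think of a pipeline like this:

# Illustrative outline only: Assist pipelines are configured in the
# UI, not via YAML; these stage and service names are hypothetical.
pipeline:
  name: "Living room assistant"
  language: en
  stages:
    - speech_to_text: whisper        # audio in -> text out
    - conversation: home_assistant   # text -> intent -> action
    - text_to_speech: piper          # response text -> audio out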

Screenshot of the new Assist configuration in Home Assistant.

With the new Voice Assistants settings page, users can create multiple assistants, mixing and matching voice services. Want a U.S. English assistant that responds with a British accent? No problem. What about a second assistant that listens for Dutch, German, or French voice commands? Or maybe you want to throw ChatGPT in the mix. Create as many assistants as you want, and use them from the Assist dialog as well as voice assistant hardware for Home Assistant.

Interacting with many different services means that many different things can go wrong. To help users figure out what went wrong, we’ve built extensive debug tooling for voice assistants into Home Assistant. You can always inspect the last 10 interactions per voice assistant.

Screenshot of the new Assist debug tool.

Voice Assistant powered by Home Assistant Cloud

Besides an end-to-end encrypted remote connection, the Home Assistant Cloud subscription includes state-of-the-art speech-to-text and text-to-speech services. This allows your voice assistant to speak 130+ languages (including dialects, like Peruvian Spanish) and to respond extremely fast. Sample:

As a subscriber, you can start using voice in Home Assistant right away; you will not need any extra hardware or software to get started.

In addition to getting high-quality speech-to-text and text-to-speech for your voice assistants, you will also be supporting the development of Home Assistant itself.

Join Home Assistant Cloud today

The fully local voice assistant

With Home Assistant you can be guaranteed two things: there will be options and one of those options will be local. With our voice assistant that’s no different.

Piper: our new model for high-quality local text-to-speech

To make quality local text-to-speech possible, we had to create our own text-to-speech system, optimized for running on a Raspberry Pi 4. It’s called Piper.

Piper uses modern machine learning algorithms for realistic-sounding speech but can still generate audio quickly. On a Raspberry Pi 4, Piper can generate 2 seconds of audio with only 1 second of processing time. More powerful CPUs, such as the Intel Core i5, can generate 17 seconds of audio in the same amount of time. Sample:

For more samples, see the Piper website.

An add-on with Piper is available now for Home Assistant, with over 40 voices across 18 languages: Catalan, Danish, German, English, Spanish, Finnish, French, Greek, Italian, Kazakh, Nepali, Dutch, Norwegian, Polish, Brazilian Portuguese, Ukrainian, Vietnamese, and Chinese. Voices for Piper are trained on open audio datasets, many of which come from free audiobooks read by volunteers. If you’re interested in contributing your voice, let us know!

Local speech-to-text with OpenAI Whisper

Whisper is an open source speech-to-text model created by OpenAI that runs locally. Since its release in 2022, Whisper has been improved by open source community projects such as whisper.cpp and faster-whisper to run on less powerful hardware. In less than a year of progress, Whisper is now capable of providing speech-to-text for dozens of languages on small servers and single-board computers!

An add-on using faster-whisper is available now for Home Assistant. On a Raspberry Pi 4, voice commands can take around 7 seconds to process with about 200 MB of RAM used. An Intel Core i5 CPU or better is capable of sub-second response times and can run larger (and more accurate) versions of Whisper.

Wyoming: the voice assistant glue

Voice assistants share many common functions, such as speech-to-text, intent recognition, and text-to-speech. We created the Wyoming protocol to provide a small set of standard messages for talking to voice assistant services, including the ability to stream audio.

Wyoming allows developers to focus on the core of a voice service without having to commit to a specific networking stack like HTTP or MQTT. This protocol is compatible with the upcoming version 3.0 of Rhasspy, so both projects can share voice services.

With Wyoming, we’re trying to kickstart a more interoperable open voice ecosystem that makes sharing components across projects and platforms easy. Developers and scientists wishing to experiment with new voice technologies need only implement a small set of messages to integrate with other voice assistant projects.
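
To give a feel for the protocol, here is a hypothetical transcription exchange. Wyoming events are newline-delimited JSON headers, optionally followed by a binary payload whose size is announced in a payload_length field; the exact event and field names below reflect our reading of the protocol and may differ in detail:

# A client asks a speech-to-text service to transcribe English audio,
# streams raw PCM chunks, and receives the recognized text back.
{"type": "transcribe", "data": {"language": "en"}}
{"type": "audio-start", "data": {"rate": 16000, "width": 2, "channels": 1}}
{"type": "audio-chunk", "data": {"rate": 16000, "width": 2, "channels": 1}, "payload_length": 2048}
<2048 bytes of raw PCM audio>
{"type": "audio-stop"}
{"type": "transcript", "data": {"text": "turn on the light"}}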

The Whisper and Piper add-ons mentioned above are integrated into Home Assistant via the new Wyoming integration. Wyoming services can also be run on other machines and still integrate into Home Assistant.

ESPHome powered voice assistants

ESPHome is our software for microcontrollers. Instead of programming, users define how their sensors are connected in a YAML file. ESPHome reads this file, then generates and installs software on the microcontroller to make this data accessible in Home Assistant.

Today we’re launching support for building voice assistants using ESPHome. Connect a microphone to your ESPHome device, and you can control your smart home with your voice. Include a speaker and the smart home will speak back.

We’ve been focusing on the M5Stack ATOM Echo for testing and development. For $13, it comes with a microphone and a speaker in a nice little box. We’ve created a tutorial to turn this device into a voice remote directly from your browser!
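
For a flavor of what the ESPHome configuration looks like, here is a minimal sketch for the ATOM Echo that wires its on-board I2S microphone and speaker into the new voice_assistant component. The pin assignments are our best understanding for this particular board, so follow the tutorial linked below for the exact, tested configuration:

# Minimal sketch for the M5Stack ATOM Echo. Pin numbers are
# assumptions for this board; see the tutorial for the real config.
i2s_audio:
  i2s_lrclk_pin: GPIO33
  i2s_bclk_pin: GPIO19

microphone:
  - platform: i2s_audio
    id: echo_microphone
    i2s_din_pin: GPIO23
    adc_type: external
    pdm: true

speaker:
  - platform: i2s_audio
    id: echo_speaker
    i2s_dout_pin: GPIO22
    dac_type: external
    mode: mono

voice_assistant:
  microphone: echo_microphone
  speaker: echo_speaker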

Tutorial: create a $13 voice remote for Home Assistant.

ESPHome Voice Assistant documentation.

World’s Most Private Voice Assistant

If you were designing the world’s most private voice assistant, what features would it have? To start, it should only listen when you’re ready to talk, rather than all the time. And when it responds, you should be the only one to hear it. This sounds strangely familiar…🤔

A phone! No, not the featureless rectangle you have in your pocket; an analog phone. These great creatures once ruled the Earth with twisty cords and unique looks to match your style. Analog phones have a familiar interface that’s hard to beat: pick up the phone to listen/speak and put it down when done.

With Home Assistant’s new Voice-over-IP integration, you can now use an “old school” phone to control your smart home!

By configuring off-hook autodial, your phone will automatically call Home Assistant when you pick it up. Speak your voice command or question, and listen for the response. The conversation will continue as long as you please: speak more commands/questions, or simply hang up. Assign a unique voice assistant/pipeline to each VoIP adapter, enabling dedicated phones for specific languages.

We’ve focused our initial efforts on supporting the Grandstream HT801 Voice-over-IP box. It works with any phone with an RJ11 connector, and connects directly to Home Assistant. There is no need for an extra server.

Tutorial: create your own World’s Most Private Voice Assistant

Give your voice assistant personality using the OpenAI integration.

Some links on this page are affiliate links and purchases using these links support the Home Assistant project.


This is a companion discussion topic for the original entry at https://www.home-assistant.io/blog/2023/04/27/year-of-the-voice-chapter-2/

Nice. How do those of us not running HA OS install Piper? Is there going to be a Docker image for us?

Who’s going to be the first person to hook up a badge worn on your shirt that chirps and listens for your voice when you tap it … sounds familiar!

Wow, an ESP speaker for $13 seems like such a steal! Out of curiosity, does anyone know if there’s a way to make something like the M5 work with a wake word instead of pressing the button? These are small enough to hide around the place easily, as long as we can wake them with words instead of a button press!

https://hub.docker.com/u/rhasspy

Wow, so much text and options for “The Voice” (let’s hope it’s more positive than the Dutch variant :-).

It seems the ATOM Echo Smart Speaker Development Kit (m5stack-store) is sold out most places.

Any good alternative?

Can the VoIP integration use the touch-tone buttons of a phone to trigger automations?

Will this work for a PCI or USB audio device connected to HAOS? I’ve been frustrated with Genie and trying to get the Google SDK to work (can’t find device registration anymore), so it’d be rad if I could use the USB speakerphone I have connected for that purpose, especially if it can be made to listen for a hot phrase of some kind.

Awesome - thank you.

When you have already started to talk about hardware, I have to ask: what are your plans? Because after 1.5 years of playing with and using Rhasspy, I can say that an ESP32 with a simple microphone (like the Atom Echo) works … but you need to be in an environment without much ambient noise, talk in the direction of the microphone, and not be too far away. Anything outside of this norm degrades output very quickly. So it’s good for development, but IMHO makes it a poor replacement for something like Amazon Alexa. There are reasons why Alexa is using a 4-microphone array (I think) and probably DSP for audio processing, and IMHO the ESP32 isn’t up to the task (performance-wise). Maybe the ESP32-S3, but last time I checked (almost a year ago) there were no open source libraries available for the necessary audio processing (far-field voice recognition, beamforming, noise cancellation). So, back to my original question: what are your plans for audio processing before it is sent to STT?

Yep 🙂

We’re looking into 2+ mic hardware solutions that can do beamforming, noise suppression, auto gain control, etc. There are plenty of boards that do this, but we also need to run a good open source wake word model on them and have it be trainable with open source, local tooling.

Phenomenal and really exciting work! Do you all see the possibility of supporting RTSP streams as an input method for Assist? A lot of people already have cameras around that could function quite well for this.

What about voice between humans?

We currently lack two-way audio for things like doorbell cameras.

I watched the video, then immediately bought two M5 ATOM Echoes, a Grandstream, and an old-timey phone.

Huh, I have an ESP32-S3-BOX with that proprietary “Hi ESP” stuff on it. I wonder if I can use it with ESPHome for control, and how a wake word would be assigned (it looks like there’s no mechanism for it yet).

Any idea how to pass model/language in docker-compose? The below is not working:

## wyoming
  whisper:
    container_name: whisper
    image: rhasspy/wyoming-whisper
    volumes:
      - /root/docker/whisper:/data
    environment:
      - TZ=Australia/Adelaide
      - model=medium-int8
      - language=en
    restart: unless-stopped
    network_mode: cust_bridge
    ports:
      - 10300:10300

  piper:
    container_name: piper
    image: rhasspy/wyoming-piper
    volumes:
      - /root/docker/piper:/data
    environment:
      - TZ=Australia/Adelaide
      - voice=en-us-lessac-medium
    restart: unless-stopped
    network_mode: cust_bridge
    ports:
      - 10200:10200

Awesome development!

Will I be able to connect HA to my telephone system through SIP, so I can call my Home Assistant from anywhere to have a chat?

This looks awesome! Can’t wait to play around with it! Though I’ll admit my main wish is that someone will make the voice of JARVIS available for it!

Instead of environment variables, you have to pass the command-line options for voice, model, … under the command: tag:

version: "3"
services:
## wyoming
  whisper:
    container_name: whisper
    # Whisper model and language are passed as command-line options
    command: --model medium --language nl
    image: rhasspy/wyoming-whisper
    volumes:
      - ./whisper:/data
    environment:
      - TZ=Europe/Brussels
    restart: unless-stopped
    ports:
      - 10300:10300

  piper:
    container_name: piper
    image: rhasspy/wyoming-piper
    # The Piper voice is also a command-line option
    command: --voice nl-nathalie-x-low
    volumes:
      - ./piper:/data
    environment:
      - TZ=Europe/Brussels
    restart: unless-stopped
    ports:
      - 10200:10200