On-device wake word on ESP32-S3 is here - Voice: Chapter 6

2023’s Year of the Voice built a solid foundation for letting users control Home Assistant by speaking in their own language.

We continue with a number of improvements to Assist.

Oh, and “one more thing”: on-device, open source wake word detection in ESPHome! 🥳🥳🥳

Check out this video of the new microWakeWord system running on an ESP32-S3-BOX-3 alongside one doing wake word detection inside Home Assistant:

On-device vs. streaming wake word

microWakeWord

Thanks to the incredible microWakeWord created by Kevin Ahrendt, ESPHome can now perform wake word detection on devices like the ESP32-S3-BOX-3. You can install it on your S3-BOX-3 today to try it out.

Back in Chapter 4, we added wake word detection using openWakeWord. Unfortunately, openWakeWord was too large to run on low-power devices like the S3-BOX-3, so we chose to run wake word detection inside Home Assistant instead.

Doing wake word detection in Home Assistant lets tiny devices like the M5 ATOM Echo Development Kit simply stream audio and have all of the processing happen elsewhere. This is great, as it transforms low-powered devices built on a plain ESP32 chip into voice assistants even though they lack the power to detect wake words themselves. The downside is that every additional voice assistant adds CPU load in Home Assistant as well as more network traffic.

Enter microWakeWord. After listening to an interview with Paulus Schoutsen (founder of Home Assistant) on the Self Hosted podcast, Kevin Ahrendt created a model based on Google’s Inception neural network. As an existing contributor to ESPHome, Kevin was able to get this new model running on the ESP32-S3 chip inside the S3-BOX-3! (It also works on the now-discontinued S3-BOX and S3-BOX-Lite.)

Kevin has trained three models for the launch of microWakeWord:

  • “okay nabu”
  • “hey jarvis”
  • “alexa”

You can try these out yourself now by following the ESP32-S3-BOX tutorial. Changing the default “okay nabu” wake word will require adjusting your ESPHome configuration and recompiling the firmware, which may take a long time and requires a machine with more than 2GB of RAM.
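If all you want is a different stock model, a minimal sketch of that configuration change is below. It assumes you are using the official wake-word-voice-assistant package, whose YAML exposes the model name as a substitution (as seen in the configs circulating at launch); check your own YAML for the exact substitution name before recompiling:

```yaml
# Sketch: switch the wake word model used by the official firmware package.
# The substitution name micro_wake_word_model is taken from the upstream
# esp32-s3-box YAML; it feeds the micro_wake_word component's model option.
substitutions:
  micro_wake_word_model: hey_jarvis  # launch models: okay_nabu, hey_jarvis, alexa
```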

We’re grateful to Kevin for developing microWakeWord, and making it a part of the open home!

Sentence trigger responses

Adding custom sentences to Assist is as easy as adding a sentence trigger to an automation. This allows you to trigger any action in Home Assistant with whatever sentences you want.

Now with the new conversation response action in HA 2024.2, you can also customize the response spoken or printed back to you. Using templating, your response can refer to the current state of your home.

You can also refer to wildcards in your sentence trigger. For example, a sentence trigger containing {album} and {artist} wildcards could have the response:

Playing {{ trigger.slots.album }} by {{ trigger.slots.artist }}

in addition to calling a media service.
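Put together, such an automation might look like the sketch below. The trigger sentence and the `script.play_album` service are illustrative placeholders; the `conversation` trigger, `trigger.slots`, and the `set_conversation_response` action are the HA 2024.2 pieces described above:

```yaml
# Sketch only: the sentence wording and script.play_album are hypothetical;
# adapt them to your own media setup.
automation:
  - trigger:
      - platform: conversation
        command:
          - "play {album} by {artist}"   # {album}/{artist} become trigger.slots
    action:
      - service: script.play_album       # placeholder script that starts playback
        data:
          album: "{{ trigger.slots.album }}"
          artist: "{{ trigger.slots.artist }}"
      - set_conversation_response: >-
          Playing {{ trigger.slots.album }} by {{ trigger.slots.artist }}
```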

You can experiment with sentence triggers and custom conversation responses now in our automation editor.

Improved errors and debugging

Assist users know the phrase “Sorry, I couldn’t understand that” all too well. This generic error message was given for a variety of reasons, such as:

  • The sentence didn’t match any known intent
  • The device/area names didn’t match
  • There weren’t any devices of a specific type in an area (lights, windows, etc.)

Starting in HA 2024.2, Assist provides different error messages for each of these cases.

Now if you encounter errors, you will know where to start looking! The first thing to check is that your device is exposed to Assist. Some types of devices, such as lights, are exposed by default. Others, like locks, are not and must be manually exposed.

Once your devices are exposed, make sure you’ve added an appropriate alias so Assist will know exactly how you’ll be referring to them. Devices and areas can have multiple aliases, even in multiple languages, so everyone’s preference can be accommodated.

If you are still having problems, the Assist debug tool has also been improved. Using the tool, you can see how Assist is interpreting a sentence, including any missing pieces.

Our community language leaders are hard at work translating sentences for Assist. If you have suggestions for new sentences to be added, please create an issue on the intents repository or drop us a line at [email protected]

Thank you

Thank you to the Home Assistant community for subscribing to Home Assistant Cloud, which supports voice work and the development of Home Assistant, ESPHome, and our other projects.

Thanks to our language leaders for extending the sentence support to all the various languages.


This is a companion discussion topic for the original entry at https://www.home-assistant.io/blog/2024/02/21/voice-chapter-6

Having trouble getting wake-word-voice-assistant for the S3-BOX-3 to compile; it dies at:

Compiling .pioenvs/esp32-s3-box-3-5ab9e0/components/esp-tflite-micro/tensorflow/lite/micro/micro_interpreter_context.o
xtensa-esp32s3-elf-g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
*** [.pioenvs/esp32-s3-box-3-5ab9e0/components/esp-tflite-micro/tensorflow/lite/micro/flatbuffer_utils.o] Error 1
xtensa-esp32s3-elf-g++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
*** [.pioenvs/esp32-s3-box-3-5ab9e0/components/esp-tflite-micro/tensorflow/lite/micro/fake_micro_context.o] Error 1
========================= [FAILED] Took 171.92 seconds =========================

What are you compiling on? I read in another thread that someone was having the same issue due to lack of resources in the RPi.

So I’ve updated HA to 2024.2.2 (Core), ESPHome to 2024.2.0, and the firmware on my S3 Box. Unfortunately I don’t have the option to process the wake word on the device (S3-BOX-3).
What am I missing? Any hints?

Did you update the yaml for the Box from the Github repo?

No, I just used the UI to update. So I thought I’d give it a try, start from scratch, and reinstall my S3 Box. I used the web tool on the “ESP32-S3-BOX voice assistant” page and got this:


Installation got stuck on “Preparing Installation”.

As for the yaml file, do I need to replace it manually on the server?

No, you don’t have to. I just did mine manually because I couldn’t find any other way to get the device to show up in the ESPHome add-on dashboard. Doing it from the ESPHome Projects site should use the latest version with the on-board wake word… I would expect…

Ok, I’ve validated the install via the ESPHome UI, and it seems to be using the up-to-date yaml file.
I tried installing via the ESPHome Projects site and got the same “undefined install” message.
I cleaned the build files and reinstalled, but there’s still no option to set the wake word on device in the UI…

Perhaps try copying the yaml locally instead of asking it to grab it from Github.

Can microWakeWord run on the M5Stack Official ATOM Echo Smart Speaker Development Kit, or does it not have enough memory?

It doesn’t appear to have the PSRAM needed.

Actually, the original firmware is missing the switch for the on-device wake word, and I cannot see any part of the config connected to micro_wake_word.

That’s interesting… It’s been updated / reverted since I copied the code this morning (Western Australian time).

Here is my version for the Gen 1 box (ESP32-S3-Box, not Box3). In this you can see the Wake Word stuff has been added. You should be able to transfer those bits to the Box3 version.

---
substitutions:
  name: esp32-s3-box
  friendly_name: ESP32 S3 Box
  loading_illustration_file: https://github.com/esphome/firmware/raw/main/voice-assistant/casita/loading_320_240.png
  idle_illustration_file: https://github.com/esphome/firmware/raw/main/voice-assistant/casita/idle_320_240.png
  listening_illustration_file: https://github.com/esphome/firmware/raw/main/voice-assistant/casita/listening_320_240.png
  thinking_illustration_file: https://github.com/esphome/firmware/raw/main/voice-assistant/casita/thinking_320_240.png
  replying_illustration_file: https://github.com/esphome/firmware/raw/main/voice-assistant/casita/replying_320_240.png
  error_illustration_file: https://github.com/esphome/firmware/raw/main/voice-assistant/casita/error_320_240.png

  loading_illustration_background_color: '000000'
  idle_illustration_background_color: '000000'
  listening_illustration_background_color: 'FFFFFF'
  thinking_illustration_background_color: 'FFFFFF'
  replying_illustration_background_color: 'FFFFFF'
  error_illustration_background_color: '000000'

  voice_assist_idle_phase_id: '1'
  voice_assist_listening_phase_id: '2'
  voice_assist_thinking_phase_id: '3'
  voice_assist_replying_phase_id: '4'
  voice_assist_not_ready_phase_id: '10'
  voice_assist_error_phase_id: '11'  
  voice_assist_muted_phase_id: '12'

  micro_wake_word_model: hey_jarvis

esphome:
  name: ${name}
  friendly_name: ${friendly_name}
  name_add_mac_suffix: true
  platformio_options:
    board_build.flash_mode: dio
  project:
    name: esphome.voice-assistant
    version: "1.0"
  min_version: 2023.11.5
  on_boot:
      priority: 600
      then: 
        - script.execute: draw_display
        - delay: 30s
        - if:
            condition:
              lambda: return id(init_in_progress);
            then:
              - lambda: id(init_in_progress) = false;
              - script.execute: draw_display

esp32:
  board: esp32s3box
  flash_size: 16MB
  framework:
    type: esp-idf
    sdkconfig_options:
      CONFIG_ESP32S3_DEFAULT_CPU_FREQ_240: "y"
      CONFIG_ESP32S3_DATA_CACHE_64KB: "y"
      CONFIG_ESP32S3_DATA_CACHE_LINE_64B: "y"

psram:
  mode: octal
  speed: 80MHz

external_components:
  - source: github://pr#5230
    components: esp_adf
    refresh: 0s

api:
  on_client_connected:
    - script.execute: draw_display
  on_client_disconnected:
    - script.execute: draw_display

ota:
logger:
  hardware_uart: USB_SERIAL_JTAG

dashboard_import:
  package_import_url: github://esphome/firmware/voice-assistant/esp32-s3-box.yaml@main

wifi:
  ap:
  on_connect:
    - script.execute: draw_display
    - delay: 5s # Gives time for improv results to be transmitted 
    - ble.disable:  
  on_disconnect:
    - script.execute: draw_display
    - ble.enable:
  ssid: !secret wifi_ssid_not_1
  password: !secret wifi_password_not_1
  manual_ip:
    static_ip: 192.168.30.250
    gateway: 192.168.30.1
    subnet: 255.255.255.0

improv_serial:

esp32_improv:
  authorizer: none

button:
  - platform: factory_reset
    id: factory_reset_btn
    name: Factory reset

binary_sensor:
  - platform: gpio
    pin:
      number: GPIO1
      inverted: true
    name: "Mute"
    disabled_by_default: true
    entity_category: diagnostic

  - platform: gpio
    pin:
      number: GPIO0
      mode: INPUT_PULLUP
      inverted: true
    name: Top Left Button
    disabled_by_default: true
    entity_category: diagnostic
    on_multi_click:
      - timing:
          - ON for at least 10s
        then:
          - button.press: factory_reset_btn
          
output:
  - platform: ledc
    pin: GPIO45
    id: backlight_output

light:
  - platform: monochromatic
    id: led
    name: LCD Backlight
    entity_category: config
    output: backlight_output
    restore_mode: RESTORE_DEFAULT_ON
    default_transition_length: 250ms

esp_adf:

microphone:
  - platform: esp_adf
    id: box_mic

speaker:
  - platform: esp_adf
    id: box_speaker

micro_wake_word:
  model: ${micro_wake_word_model}
  on_wake_word_detected: 
    - voice_assistant.start

voice_assistant:
  id: va
  microphone: box_mic
  speaker: box_speaker
  use_wake_word: true
  noise_suppression_level: 2
  auto_gain: 31dBFS
  volume_multiplier: 2.0
  vad_threshold: 3
  on_listening:
    - lambda: id(voice_assistant_phase) = ${voice_assist_listening_phase_id};
    - script.execute: draw_display
  on_stt_vad_end:
    - lambda: id(voice_assistant_phase) = ${voice_assist_thinking_phase_id};
    - script.execute: draw_display
  on_tts_stream_start:
    - lambda: id(voice_assistant_phase) = ${voice_assist_replying_phase_id};
    - script.execute: draw_display
  on_tts_stream_end:
    - lambda: id(voice_assistant_phase) = ${voice_assist_idle_phase_id};
    - script.execute: draw_display
  on_end:
    - if:
        condition:
          and:
            - switch.is_off: mute
            - lambda: return id(wake_word_engine_location).state == "On device";
        then:
          - wait_until:
              not:
                voice_assistant.is_running:
          - micro_wake_word.start:
  on_error:
    - if:
        condition:
          lambda: return !id(init_in_progress);
        then:
          - lambda: id(voice_assistant_phase) = ${voice_assist_error_phase_id};  
          - script.execute: draw_display
          - delay: 1s
          - if:
              condition:
                switch.is_off: mute
              then:
                - lambda: id(voice_assistant_phase) = ${voice_assist_idle_phase_id};
              else:
                - lambda: id(voice_assistant_phase) = ${voice_assist_muted_phase_id};
          - script.execute: draw_display
  on_client_connected: 
    - if:
        condition:
          switch.is_off: mute
        then:
          - wait_until:
              not: ble.enabled
          - if:
              condition: 
                lambda: return id(wake_word_engine_location).state == "In Home Assistant";
              then:
                - lambda: id(va).set_use_wake_word(true);
                - voice_assistant.start_continuous:
          - if:
              condition: 
                lambda: return id(wake_word_engine_location).state == "On device";
              then:
                - micro_wake_word.start
          - lambda: id(voice_assistant_phase) = ${voice_assist_idle_phase_id};
        else:
          - lambda: id(voice_assistant_phase) = ${voice_assist_muted_phase_id};
    - lambda: id(init_in_progress) = false;
    - script.execute: draw_display
  on_client_disconnected:
    - if:
        condition: 
          lambda: return id(wake_word_engine_location).state == "In Home Assistant";
        then:
          - lambda: id(va).set_use_wake_word(false);
          - voice_assistant.stop:
    - if:
        condition: 
          lambda: return id(wake_word_engine_location).state == "On device";
        then:
          - micro_wake_word.stop
    - lambda: id(voice_assistant_phase) = ${voice_assist_not_ready_phase_id};  
    - script.execute: draw_display

script:
  - id: draw_display
    then:
      - if:
          condition:
            lambda: return !id(init_in_progress);
          then:
            - if:
                condition:
                  wifi.connected:
                then:
                  - if:
                      condition:
                        api.connected:
                      then:
                        - lambda: |
                            switch(id(voice_assistant_phase)) {
                              case ${voice_assist_listening_phase_id}:
                                id(s3_box_lcd).show_page(listening_page);
                                id(s3_box_lcd).update();
                                break;
                              case ${voice_assist_thinking_phase_id}:
                                id(s3_box_lcd).show_page(thinking_page);
                                id(s3_box_lcd).update();
                                break;
                              case ${voice_assist_replying_phase_id}:
                                id(s3_box_lcd).show_page(replying_page);
                                id(s3_box_lcd).update();
                                break;
                              case ${voice_assist_error_phase_id}:
                                id(s3_box_lcd).show_page(error_page);
                                id(s3_box_lcd).update();
                                break;
                              case ${voice_assist_muted_phase_id}:
                                id(s3_box_lcd).show_page(muted_page);
                                id(s3_box_lcd).update();
                                break;
                              case ${voice_assist_not_ready_phase_id}:
                                id(s3_box_lcd).show_page(no_ha_page);
                                id(s3_box_lcd).update();
                                break;
                              default:
                                id(s3_box_lcd).show_page(idle_page);
                                id(s3_box_lcd).update();
                            }
                      else:
                        - display.page.show: no_ha_page
                        - component.update: s3_box_lcd
                else:
                  - display.page.show: no_wifi_page
                  - component.update: s3_box_lcd
          else:
            - display.page.show: initializing_page
            - component.update: s3_box_lcd

switch:
  - platform: template
    name: Mute
    id: mute
    optimistic: true
    restore_mode: RESTORE_DEFAULT_OFF
    entity_category: config
    on_turn_off:
      - if:
          condition:
            lambda: return !id(init_in_progress);
          then:      
            - lambda: id(voice_assistant_phase) = ${voice_assist_idle_phase_id};
            - if:
                condition:
                  not:
                    - voice_assistant.is_running
                then:
                  - if:
                      condition: 
                        lambda: return id(wake_word_engine_location).state == "In Home Assistant";
                      then:
                        - lambda: id(va).set_use_wake_word(true);
                        - voice_assistant.start_continuous
                  - if:
                      condition: 
                        lambda: return id(wake_word_engine_location).state == "On device";
                      then:
                        - micro_wake_word.start
            - script.execute: draw_display
    on_turn_on:
      - if:
          condition:
            lambda: return !id(init_in_progress);
          then:      
            - lambda: id(va).set_use_wake_word(false);
            - voice_assistant.stop
            - micro_wake_word.stop
            - lambda: id(voice_assistant_phase) = ${voice_assist_muted_phase_id};
            - script.execute: draw_display

select:
  - platform: template
    entity_category: config
    name: Wake word engine location
    id: wake_word_engine_location
    optimistic: True
    restore_value: True
    options:
      - In Home Assistant
      - On device
    initial_option: On device
    on_value:
      - wait_until:
          lambda: return id(voice_assistant_phase) == ${voice_assist_muted_phase_id} || id(voice_assistant_phase) == ${voice_assist_idle_phase_id};
      - if:
          condition:
            lambda: return x == "In Home Assistant";
          then:
            - micro_wake_word.stop
            - delay: 500ms
            - if:
                condition:
                  switch.is_off: mute
                then:
                  - lambda: id(va).set_use_wake_word(true);
                  - voice_assistant.start_continuous:
      - if:
          condition:
            lambda: return x == "On device";
          then:
            - lambda: id(va).set_use_wake_word(false);
            - voice_assistant.stop
            - delay: 500ms
            - micro_wake_word.start

globals:
  - id: init_in_progress
    type: bool
    restore_value: no
    initial_value: 'true'
  - id: voice_assistant_phase
    type: int
    restore_value: no
    initial_value: ${voice_assist_not_ready_phase_id}

image:
  - file: ${error_illustration_file}
    id: casita_error
    resize: 320x240
    type: RGB24
    use_transparency: true
  - file: ${idle_illustration_file}
    id: casita_idle
    resize: 320x240
    type: RGB24
    use_transparency: true
  - file: ${listening_illustration_file}
    id: casita_listening
    resize: 320x240
    type: RGB24
    use_transparency: true
  - file: ${thinking_illustration_file}
    id: casita_thinking
    resize: 320x240
    type: RGB24
    use_transparency: true
  - file: ${replying_illustration_file}
    id: casita_replying
    resize: 320x240
    type: RGB24
    use_transparency: true
  - file: ${loading_illustration_file}
    id: casita_initializing
    resize: 320x240
    type: RGB24
    use_transparency: true
  - file: https://github.com/esphome/firmware/raw/main/voice-assistant/error_box_illustrations/error-no-wifi.png
    id: error_no_wifi
    resize: 320x240
    type: RGB24
    use_transparency: true
  - file: https://github.com/esphome/firmware/raw/main/voice-assistant/error_box_illustrations/error-no-ha.png
    id: error_no_ha
    resize: 320x240
    type: RGB24
    use_transparency: true

color:
  - id: idle_color
    hex: ${idle_illustration_background_color}
  - id: listening_color
    hex: ${listening_illustration_background_color}
  - id: thinking_color
    hex: ${thinking_illustration_background_color}
  - id: replying_color
    hex: ${replying_illustration_background_color}
  - id: loading_color
    hex: ${loading_illustration_background_color}
  - id: error_color
    hex: ${error_illustration_background_color}

spi:
  clk_pin: 7
  mosi_pin: 6

display:
  - platform: ili9xxx
    id: s3_box_lcd
    model: S3BOX
    data_rate: 40MHz
    cs_pin: 5
    dc_pin: 4
    reset_pin: 48
    update_interval: never
    pages:
      - id: idle_page
        lambda: |-
          it.fill(id(idle_color));
          it.image((it.get_width() / 2), (it.get_height() / 2), id(casita_idle), ImageAlign::CENTER);
      - id: listening_page
        lambda: |-
          it.fill(id(listening_color));
          it.image((it.get_width() / 2), (it.get_height() / 2), id(casita_listening), ImageAlign::CENTER);
      - id: thinking_page
        lambda: |-
          it.fill(id(thinking_color));
          it.image((it.get_width() / 2), (it.get_height() / 2), id(casita_thinking), ImageAlign::CENTER);
      - id: replying_page
        lambda: |-
          it.fill(id(replying_color));
          it.image((it.get_width() / 2), (it.get_height() / 2), id(casita_replying), ImageAlign::CENTER);
      - id: error_page
        lambda: |-
          it.fill(id(error_color));
          it.image((it.get_width() / 2), (it.get_height() / 2), id(casita_error), ImageAlign::CENTER);
      - id: no_ha_page
        lambda: |-
          it.image((it.get_width() / 2), (it.get_height() / 2), id(error_no_ha), ImageAlign::CENTER);
      - id: no_wifi_page
        lambda: |-
          it.image((it.get_width() / 2), (it.get_height() / 2), id(error_no_wifi), ImageAlign::CENTER);
      - id: initializing_page
        lambda: |-
          it.fill(id(loading_color));
          it.image((it.get_width() / 2), (it.get_height() / 2), id(casita_initializing), ImageAlign::CENTER);
      - id: muted_page
        lambda: |-
          it.fill(Color::BLACK);

The on-device wake word seems to be included only in “firmware/wake-word-voice-assistant/esp32-s3-box-3.yaml”. You need to change the path to the correct firmware path (this is for the S3-BOX-3):

esphome.voice-assistant: github://esphome/firmware/wake-word-voice-assistant/esp32-s3-box-3.yaml@main

instead of
esphome.voice-assistant: github://esphome/firmware/voice-assistant/esp32-s3-box-3.yaml@main

and voilà, the setting appears in the UI and you have on-device wake word detection.


Does the trick

In the video covering Voice chapter 6 there was an idea presented to use the on-device wake word for some on-device logic.

I really liked the idea, so the question is:
Can you detect multiple (custom) wake words on one device to toggle different peripherals connected to the same ESPHome node?

For simple logic this could save us from needing the commands interpreted via Whisper.

What would I need to do to train my own personal wake word? Is that possible at all in a reasonable home environment?

Ah I see, I was looking at the wrong code vs what I had downloaded earlier… :man_facepalming:

Glad you got it sorted.

If you watch the blog video you will see / hear what is required. Basically you need a powerful PC / GPU and a few days up your sleeve.

Really great to see this added. All we need now is some wake word support built into the HA companion app for Android, and I could viably ditch my Alexa devices.

It will also be interesting to see how these perform with multiple devices in the vicinity when speaking - they mentioned on the stream that they were looking at adding a “fastest wins” kind of logic, but I'm not sure how useful that would be if you say something like “turn on the light” (using room awareness) and a device in another room hears it and responds a fraction faster…
