So I’ve updated HA to 2024.2.2 (Core), ESPHome 2024.2.0 and the firmware on my S3 Box. Unfortunately I don’t have the option to process the wake word on the device (S3 Box 3).
What am I missing? Any hints?
No, I just used the UI to update. So I thought I’d give it a try, start from scratch and reinstall my S3 Box. I used the web tool on the “ESP32-S3-BOX voice assistant” page and got this:
Installation got stuck on “Preparing Installation”.
As for the yaml file, do I need to replace it manually on the server?
No, you don’t have to. I just did mine manually because I couldn’t find any other way to get the device to show up in the ESPHome add-on dashboard. Doing it from the ESPHome Projects site should use the latest version with the on-board wake word… I would expect…
Ok, I’ve validated the install via the ESPHome UI, and it seems to be using the up-to-date yaml file.
Tried to install it via ESPHome Projects Site and got the same “undefined install” message.
Cleaned build files, reinstalled, still no option to set the wake word on device in the UI…
Actually the original firmware is missing the switch for the on-device wake word, and I cannot see any part of the config that references microWakeWord.
That’s interesting… It’s been updated / reverted since I copied the code this morning (Western Australian time).
Here is my version for the Gen 1 box (ESP32-S3-Box, not Box3). In this you can see the Wake Word stuff has been added. You should be able to transfer those bits to the Box3 version.
The on-device wake word seems to be included only in “firmware/wake-word-voice-assistant/esp32-s3-box-3.yaml”. You need to change the path to the correct firmware path (this is for the S3-BOX-3):
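As a sketch, that swap can be done with ESPHome’s remote-package mechanism; the file path matches the one quoted above, while the `@main` ref is my assumption (pin a release tag for reproducible builds):

```yaml
# Hedged sketch: pull the wake-word variant of the Box-3 firmware
# as a remote package instead of the plain voice-assistant yaml.
packages:
  voice_assistant: github://esphome/firmware/wake-word-voice-assistant/esp32-s3-box-3.yaml@main
```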
In the video covering Voice chapter 6, an idea was presented to use the on-device wake word for some on-device logic.
I really liked the idea, so the question is:
Can you detect multiple (custom) wake words on one device to toggle different peripherals connected to the same ESPHome node?
For simple logic this could save us from needing the commands interpreted via Whisper.
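A rough sketch of what that on-device branching could look like, assuming a micro_wake_word version that accepts multiple models and exposes the detected phrase to the trigger (the model names and the `desk_lamp_relay` switch are made-up placeholders, not verified against the 2024.2 component):

```yaml
# Hypothetical sketch: branch on which wake word fired, without any STT round-trip.
micro_wake_word:
  models:
    - okay_nabu
    - hey_jarvis
  on_wake_word_detected:
    # `wake_word` holds the phrase that fired, so simple on-device
    # logic can toggle a local peripheral directly.
    - if:
        condition:
          lambda: 'return wake_word == "hey_jarvis";'
        then:
          - switch.toggle: desk_lamp_relay  # placeholder peripheral
        else:
          - voice_assistant.start:
```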
Really great to see this added, all we need now is some wake word stuff built in to the HA companion app for android and I could viably ditch my Alexa devices.
It will also be interesting to see how these perform with multiple devices in the vicinity when speaking - they mentioned on the stream that they were looking at adding a “fastest wins” kind of logic, but not sure how useful that would be if you say something like “turn on the light” (using room awareness) and a device in another room hears and responds a fraction faster…
My daughter wasn’t cooperating and wouldn’t let me watch the entire stream. Is it possible to edit the screen to include something like a scene selector?
We initially had pushed an update to the existing boxes. However, because compilation requires a lot more memory and time (42 minutes on a Green!), we decided to revert that. People need to either update their package in the YAML or do a fresh install + adopt.
We are looking into being able to ship pre-compiled libraries, which should reduce update time and memory needs.
microWakeWord depends on TensorFlow Lite for Microcontrollers, which requires 2GB+ of free memory for compilation. We’re working on reducing that need. In the meantime, if your system can’t finish the compile, consider using the browser installer.
I watched the live stream. It has to be an ESP32-S3 and requires PSRAM, as the previous poster already stated. The Atom Echo is a vanilla ESP32. The developer who came up with this said any ESP32-S3 with PSRAM should work now, but I haven’t seen anyone posting saying they got it working. I’ve got an S3 but no mic to wire to it. It seems like removing the display stuff and changing the GPIO pins is all that’s needed, but I’m sure someone will be posting some yaml here soon for some board.
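For anyone wanting to try a bare S3 board, a minimal mic-only sketch might look like this (the GPIO numbers are placeholders for whatever your wiring uses; a standard I2S MEMS mic is assumed):

```yaml
# Hedged sketch for a bare ESP32-S3 + I2S microphone; GPIOs are placeholders.
esp32:
  board: esp32-s3-devkitc-1
  framework:
    type: esp-idf

i2s_audio:
  i2s_lrclk_pin: GPIO2   # WS / LRCLK
  i2s_bclk_pin: GPIO3    # BCLK

microphone:
  - platform: i2s_audio
    id: board_mic
    adc_type: external
    i2s_din_pin: GPIO4   # mic data out
    pdm: false
```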
After using on device wake word (S3-Box3) for a couple days, it feels snappier and recognizes the wake word even from 3 meters away. Which was not the case using the HA wake word detection. It may be a placebo, but I like it! Will be trying to reuse my old snips satellites (pi zero with 2-mics-pi-hat) for HA Assistant, and hope for a better recognition ratio. HA Voice with wake word is amazing. Thanks for all the hard work people have been putting into this!
@synesthesiam Firstly, THANKS MIKE ! Each step we are getting closer to replacing Alexa and Google devices.
I assume that those of us using Rhasspy should be swapping (or have swapped) to HA Voice Assist and seeking support here, and that the Rhasspy project (forum, GitHub and docs) will focus on the non-HA uses of Rhasspy 3 as a generic toolkit?
Chapter 5 positioned the ESP32-S3-BOX-3 in the middle between the cheap M5 Atom Echo and a RasPi - both price wise and CPU power … but chapter 6 adds microWakeWord which really narrows the gap between ESP32-S3-BOX-3 and RasPi Zero 2W with reSpeaker 2-mic HAT. Is there much benefit to RasPi Zero to justify the extra cost ?
There is still, of course, RasPi 4/5 as a higher end satellite option for those wanting to do more, and don’t mind a bigger budget.
Which is your recommendation for anyone setting up a new satellite ?
Personally I still think Nabu Casa should produce their own voice satellite (as they did to produce Home Assistant Blue, Yellow and Green) … though ESP32-S3-BOX-3 comes pretty close for now. In particular I think combining voice assistant and media player makes sense.
From my experience, the 2-mic HAT is superior to the S3 Box 3, at least as long as the S3 Box 3 within HA uses only one of its two built-in microphones. The Willow project (https://heywillow.io/) has fully integrated the S3 Box 3 and the voice recognition is quite impressive. But for now I am quite happy with the wake word detection on the S3 Box 3.
Regarding the general STT part, I favor the approach the Snips voice platform took: generating/training an individual model based on the user’s needs. Only the intents actually in use were trained, which helped keep false positives low, especially in a multi-language environment.
Thinking of something like a (Nabu Casa) service that generates/trains an individual model based on the exposed entities and the custom sentences / automations of your local HA instance. Although, to be honest, this sounds more like a deprecated approach that will only be useful for low-end devices. With AI and the increasing need for local AI processing power, the way to go may be a dedicated GPU (e.g. CUDA) at home (e.g. Pocket AI | Portable GPU | NVIDIA RTX A500 | ADLINK , https://coral.ai/).
Finding the ‘best box’ for a home automation server has been something I’ve noodled on (and spent far too much coin and brain cycles on). The post linked below from five years ago gives an insight into my dogged clinging…
The Thunderbolt-connected device you cite is interesting; in my wasted days I’ve had a couple of Thunderbolt 2 and 3 external PCIe NVIDIA GPUs attached to HA servers. Newer motherboards with Thunderbolt are now becoming available in cost-effective form factors. Back to the NVIDIA device you cite: while the work to decrease the memory requirements of AI models is advancing at a very rapid pace, I’m not sure a device with 4 GB of memory will be able to support the pipelines of the 3 to 5 models that, IMHO, make up a truly useful 100% local AI assistant (STT, TTS, video feed processing, and the ‘Jarvis’ overall smarts; though I believe you will want at least your ‘Jarvis’ model to have public-internet-based knowledge via RAG). Today, and IMHO for the near future (a couple of years at least), these will need on the order of 16 GB of memory in an AI MCU.
The ‘other’ factor for a home automation AI box is power efficiency, IMHO. I’m still in the realm of 150 to 500 watts for an Intel/AMD/NVIDIA home automation box. As I said, the new mini motherboards with 13th-gen+ Intel and newer AMD laptop CPUs with Thunderbolt bring the non-GPU power down significantly. Unfortunately, in my experience so far, an NVIDIA GPU on the far side of the Thunderbolt wire with enough RAM and cores, like CUDA- and AMD-based AI MCUs generally, still idles in the high teens and easily hits 200 watts during processing. I will skip the significant ‘significant other’ factor of fan noise in this discussion.
All this blab, coin expenditure and my ‘way back’ post below, which was looking for an ultimate home automation server with AI smarts, MCU virtualization and the possibility of 100% local-only processing, finally bring me to my ‘point’, AKA what I recommend folks keep a sharp eye on (and adopt now if you are a bit of a leading-edge experimenter):
The current Apple M-silicon Mac Minis (I recommend the M2 or newer MCUs over the M1, due to the 16-bit floating-point hardware support of the M2 and above) are the machines that meet the objectives I cite: 100% local AI control, multi-VM capability, all at a power consumption well under 100 watts continuous.
The MCU architecture of Apple M silicon (ARM-based) today lets you run multiple Linux VMs with extremely high efficiency using the available UTM hypervisors. The open-source work to date, and the upcoming announcements, letting AI LLMs run efficiently on Apple’s ‘Metal’ (read: Apple’s CUDA) GPU and ML layers is as groundbreaking as NVIDIA’s CUDA was 6 to 7 years ago. Unless you are ‘off the grid’ powering your home automation and can ignore power cost (how many years till those panels are ‘paid off’?), the Apple hardware’s power abilities take the win.
I can virtualize with UTM to the same level as Proxmox can manage multiple VMs and LXCs (and an argument can be made it is ‘better’ today, thanks to Apple’s Rosetta software ability to emulate other CPUs, including Intel/AMD; yes it is slow today, but you rarely need it, as Linux is moving to 100% ARM code faster than almost any other code effort).
As I said, give it a look: the price/power point of a 32 GB Apple M2 or M3 Mac Mini is ‘it’ for your next Home Assistant home automation server.
My understanding is that, while the demo for the reSpeaker 2-mic HAT uses all the fancy features, the driver and source code supplied by seeed do not. I recall someone commenting in particular that the reSpeaker driver uses only one mic.
This disconnect between the advertised “features” and what is actually supported, and the fact that seeed stopped updating their drivers over 5 years ago, have led me to stop recommending any seeed product. Unfortunately the several clone products (such as the Adafruit one) also use the seeed driver.
I note that the ESP32-S3-BOX demo also oversells the device. Consider that these companies are not in the business of making and selling end products: these are demos to show the potential to system integrators. For this reason I am doubtful that Espressif will be keen to ramp up production of the ESP32-S3-BOX sufficiently to supply the demand from HA Voice Assist users.
Has anyone got this to work on a device other than the S3 box? During the livestream at the 25-minute mark, the ESPHome contributor who came up with this said it will work with any ESP32-S3 with PSRAM and a microphone at a minimum, BUT towards the very end one of the other devs said one of the main ESPHome devs had been having “fun” with the S3 because of the various models (not sure exactly what he meant) and asked people to post any boards or devices that work. I will just wait and see. I was thinking about the M5Stack S3 camera, which does have a mic, but I’m sure it’s terrible. At least it could be used for something else if it didn’t work.
I also see that Espressif came out with a new ESP32-S3 Korvo development board, but it’s hard to spend that much on something we don’t even know works. It seems like it would work great IF it works; you would think it would, as it has a 3.5mm output and 3 microphones, but I’m not dealing with returns to AliExpress if it doesn’t work with HA… I was just wondering because the S3 box variants are made by Espressif too; you just can’t get one (at retail at least). I think 2 of the 3 models are discontinued anyway.
When watching the livestream they showed the microWakeWord architecture, and it really seems like the GPU is used most in the last step, based on their conversation, or at least as much as I understood. That last step used audio of dinner parties recorded by someone, along with a lot of other random real-world audio without the wake word. I believe the only requirement is that the computer you generate the wake word on needs 2GB of RAM, though it probably takes longer with less. Without a GPU it took 2 days, and sticking a mid-level Nvidia card in shaved half a day off, so roughly a 12-hour difference. I have no idea what the developer’s computer specs or OS were though.
Regarding AI, the GPU performance gain really seems to depend on what you’re doing. Things change at a rapid pace, although Nvidia has an advantage. The guy in the YouTube link below ran every LLM available on a Raspberry Pi 5 with 8GB of RAM, with the OS on an external SSD via USB 3.0. Its response times are pretty ridiculous, but 8GB of memory is ideal for 7 billion parameters, so roughly 1GB of RAM per 1 billion parameters according to the documentation. If you don’t have enough RAM for the model size, it won’t load: he tried a 13-billion-parameter model and it wouldn’t work, while the same 7-billion-parameter model did. Regarding the Google Coral, I know it helps a ton if you use Frigate for facial detection, but I’m not sure what else uses it; I’m sure some integrations I don’t use do. It doesn’t help with LLMs, at least not the Coral, due to lack of VRAM. The guy in the video below tested it and said that was why it didn’t make any difference. That’s not to say a future model won’t improve LLM performance significantly though.
I built a voice pipeline using a custom integration added via HACS called “Extended OpenAI Conversation”; it allows you to define what you want to expose to OpenAI or any OpenAI-based LLM. The default allows only exposed entities (by exposed I mean exposed to a voice assistant), so you can fine-tune the results and tell it how to behave and answer. The nice thing is you can run 2 commands at once and the syntax doesn’t have to be 100 percent correct. One example was me saying “un pause shield” and my Shield TV started playing; I don’t have any scripts, switches or aliases with “un pause” in them, it just worked. I’ve mainly been using Assist Microphone (a supported add-on), which works with any USB microphone. I’m using a dedicated speakerphone and it works amazingly well; HA doesn’t have to listen 24/7, but obviously it has to plug into the HA server. Then there’s my Android phone and some M5Stack Atom Echos with TailBats (a small battery made by M5Stack), but I only turn those on when needed, or use the button instead of openWakeWord because it hogs resources. I’m looking into a local LLM, because API calls are only free up to a certain point, and it would be all local that way.
microWakeWord architecture
Raspberry pi 5 running and testing every LLM (according to the guy who posted the video).
I’ve spent a lot of time struggling to get this board to work. My issue is that it isn’t recognising its own onboard PSRAM… I’ve contacted the seller but…China.
Once I get it to see the PSRAM (which I think is simply an ESPhome code / config thing) it should work.
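For reference, this is the kind of explicit PSRAM declaration I mean (a sketch; `mode` and `speed` depend on the actual chip fitted, and octal is only a guess appropriate for 8MB R8-style parts):

```yaml
# Hedged sketch: declare the PSRAM explicitly so ESPHome/ESP-IDF probes it.
esp32:
  board: esp32-s3-devkitc-1
  framework:
    type: esp-idf

psram:
  mode: octal   # quad on 2MB/4MB parts; octal is typical for 8MB (R8) parts
  speed: 80MHz
```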
In the meantime I’ve ordered some of these which apparently do work. I’ll be able to confirm once they arrive.
with Microphone only version of the firmware from here.
I have tried the firmware with the speaker, and it appears not to recognise the wake word with that firmware. I have not had a lot of time to play with the speaker yet. I have also ordered some N16R8 S3s, as the memory size is apparently important. This is not my field of expertise but I am making progress.
That is very similar to the one I’m trying to get working (ESP32-S3FH4R2) but with no luck. BigBobbas has been helping me but the ESPhome log shows the device not seeing that it has PSRAM…
UPDATE: I re-tried using the same GPIO as this example and it works now. There was obviously some strange conflict with the GPIO I had selected. So now I can safely say that the boards I linked earlier do work.
Thanks! The developer who mentioned that it didn’t work on every S3 with PSRAM did cite memory differences between models as one of the issues. I thought it might be that different versions had memory from different manufacturers or something (not my area of expertise either), but it sounds like he meant it requires a specific amount of PSRAM, since that board only has 2MB, with 8MB being ideal and possibly 4MB workable. I’ll stick to 8MB, as the price difference is maybe 2 dollars, if there is even an option to choose the same model with a different amount.
Also thanks to the other posters and links; it’s good to know the devkit and WROOM-1 appear to work as long as they have enough PSRAM. It really sounds like that’s the deciding factor, but obviously more boards need to be tested. They did mention to post any boards/models users get working. I’m guessing Discord is the place to post that information if you do get it working on a board that hasn’t already been confirmed. Thanks again!
I’ve no idea if this is an only-my-device thing, but the wake word appears to stop responding over time. This isn’t a new issue; it was happening on the old build without local wake word.
It seems to happen gradually and I have to restart the device. I’ve not seen it reported on the issue trackers, so I’m hedging more towards it being an issue with my S3 Box 3.
There’s also an audible “pop” every now and then. I assume this is the microphone becoming active and is normal, but it may or may not be related.
Well, I just ordered one. I’ll let everyone know how it works out. On paper it should work, but we all know that doesn’t always pan out. I just happened to search Amazon and they have them in the US store for the same price. My main issue was ordering from AliExpress and having to deal with a return if it didn’t work, but Amazon will take anything back, so if it doesn’t work I’ll just send it back for a refund. Only 7 left in stock. Not sure about the UK store.
If you live in the UK then you can buy an ESP32-S3-BOX-3 from my store.
I only have a limited amount and no idea how popular they are going to be. I hope people will understand it’s the best price I can do with all the effort that’s gone into the site etc.
I attempted to compile the code for the s3-box-3 and got the following compile termination, any ideas?
...
Compiling .pioenvs/esp32-voice-node-5a9788/src/esphome/components/micro_wake_word/micro_wake_word.o
Compiling .pioenvs/esp32-voice-node-5a9788/src/esphome/components/network/util.o
In file included from src/esphome/components/micro_wake_word/micro_wake_word.cpp:1:
src/esphome/components/micro_wake_word/micro_wake_word.h:19:10: fatal error: tensorflow/lite/core/c/common.h: No such file or directory
#include <tensorflow/lite/core/c/common.h>
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
*** [.pioenvs/esp32-voice-node-5a9788/src/esphome/components/micro_wake_word/micro_wake_word.o] Error 1
========================= [FAILED] Took 288.57 seconds =========================
Removing all the yaml and trying again fixed it, and it compiled. After adding my ESP32-S3 to HA, should I expect to be able to add it to my Assist pipeline at the bottom under wake word? It just says I don’t have a wake word engine set up yet.
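If I understand the on-device flow correctly, HA doesn’t need a wake word engine in the Assist pipeline at all when microWakeWord runs on the ESP: the device triggers the pipeline itself, roughly like this (a sketch; the model name is the stock “Okay Nabu” one and may differ in your config):

```yaml
# Hedged sketch: with on-device detection the ESP starts the pipeline,
# so the wake word engine slot on the HA side stays empty.
micro_wake_word:
  model: okay_nabu
  on_wake_word_detected:
    - voice_assistant.start:
```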
This is mostly not my work, and it could use some attention to detail regarding the included .h files used for Arduino by Espressif specifically for the Korvo-1. It works for the S3 Korvo-1 though.
I just got my Box 3; when it’s working I like it, though it seems to cut off or misunderstand words. For instance, when I say “turn on the den” I get an error saying something like “can’t find the device Din”.
My main concern, however, is that the wake word “Okay Nabu” only works around 20% of the time on the first try; I usually have to say it 3-5 times to get it to wake.
Still surprised about all the satellite talks when basically everyone has a mobile phone and most have tablets. So why spend cash on some satellites if all we really need is wake word support in the companion app.
Would even work on Android TVs, Android cars etc. etc.
No cost, lots of processing power and readily available everywhere.
I guess you do not have kids or other family members who do not always carry their phone everywhere (if they even have one, which most small kids surely do not). Even I, who carry my phone almost all the time, still use our existing Google Nest / Google smart speakers a lot for hands-free voice control.
I believe the most common use cases are when and where hands-free operation makes sense, for example in the kitchen while your hands are busy: controlling lights (brightness or on/off), setting and operating timers, reminders or alarms, adding items to shopping or to-do lists, and music controls.
Regardless, there are several use cases that appeal to mainstream users, and that is why Google and Amazon have each sold more than 500,000 Google Nest / Google Home and Amazon Echo / Alexa smart speakers so far.
Check out the results in this wish-list poll once you’ve done it yourself:
Not sure what you are trying to show me with that poll. It is about features, not hardware. And wake word support is one of the top priorities there.
I do not have kids but I do not need kids to know that I have multiple Android and iOS devices in my household. And I do not need to be the owner of the iOS devices to be able to use them to use voice control.
So the point is, that if the companion app supported wake words, anybody in my home could enter any room that has any Android or iOS device lying around and could give voice commands.
The device just needs to be in hearing distance.
And you are quoting sales for Echo, Alexa etc. with 500 k. 500 k units is not that much. And very few people here want to share all their data with big tech.
So a lot of people are buying more or less expensive satellite hardware. They are fun to play with but they have little future. They will lie around somewhere in a year or two because they are too bulky or too slow. Or because people realize that they need to buy one per room because they are not as mobile as all our Android and iOS phones and tablets.
So, sure, you can buy lots of dedicated hardware for a task that really just needs a mic and speaker. Or you could use what everybody owns; most people even have old devices lying around (I still have my Samsung S2 and S6 Edge). So my wife and I currently own more than 6 perfectly fine “satellites” in the form of phones and tablets, just waiting to be used for local voice control.
Cost for 6 ESP S3 boxes? A couple of hundred euros. For what? A big, bulky satellite with an inferior screen and sound compared to a mobile phone or tablet.
Alex, you do what works best for you. It’s great if you prefer to use the Android app rather than setup satellite devices. That is why there are several options.
Personally I find that getting my phone out, unlocking it, and starting the app before I can turn a light on or off is more painful than getting out of my chair and walking to the light switch. But speaking a command seems so easy… all I need is for it to work reliably.
The idea is to use wake words on the locked phone.
Or have devices like old phones or tablets remain unlocked.
Or speaking to my TV while I am watching.
You can already control all your devices with the locked phone by using the tiles. Now it is “just” necessary to add wake words
And the computing power of a 50 € tablet is much higher than that of a 50 € ESP device. The mic and speaker are also better. Imagine just hanging a bunch of Fire HD tablets on your walls and speaking to them; it would look much nicer than ESP devices and offer nice big screens and great touch control.
Does anyone know why the Home Assistant Voice Preview Edition hardware doesn’t seem to support microWakeWord?
I have both the Home Assistant Voice Preview Edition and an M5Stack Atom Echo. The M5 seems far more accurate at responding to wake words when using local wake word processing; it performs much worse using HA for wake word processing, on par with the Voice Preview Edition.
Both devices seem to have the same ESP32-S3 chip.
Actually, you can build your own ESP32-S3 with microphones and an audio amplifier for about 10€ (but you have to know how to solder), and the result is good. Spending more than 40€ is not worth it…
I’ve had a lot of luck with Wyoming Satellite on Android using Termux for on-device wake words. It works great on my Pixel 8a running a beta version of Android 16, and on the NSPanel Pro 120, which is lower-power ARM running Android 8, I believe. About the only issue was getting it to start at boot, so I just used a Tasker task to launch it on boot. You get to control the mic gain, which is nice, and phone mics are already meant for far-field use; most people have an older model sitting around somewhere. You don’t need the companion app installed either.