Voice Assistant - No Sound

Mortalitas · July 13, 2024, 11:20am

Hi,

Can I please have some help with my ESPHome voice assistant?

It’s detecting my wake word, but I can’t hear any response through the speaker, even though the ESPHome logs suggest that it’s trying to output something.

I’ve tried swapping out the hardware in case I have faulty boards. I’ve also tried different configurations, including a shared I2S bus and separate buses.
Text to speech works if I pick a different speaker and don’t use speech to text to trigger it.
My logs show that speech to text also works when talking to this satellite that won’t make sound.

Assist Configuration

ESPHome Configuration

substitutions:
  name: esphome-voice-satellite-dev
  friendly_name: ESPHome Voice Satellite Dev

esphome:
  name: ${name}
  friendly_name: ${friendly_name}
  name_add_mac_suffix: false
  platformio_options:
    board_build.flash_mode: dio
  project:
    name: "dan.voice_assistant"
    version: '1.0'
  min_version: 2023.11.5

esp32:
  board: esp32-s3-devkitc-1
  variant: esp32s3      # This shouldn't be needed.
  flash_size: 16MB
  framework:
    type: esp-idf             #arduino
    version: recommended #4.4.6
    sdkconfig_options:
      CONFIG_ESP32S3_DEFAULT_CPU_FREQ_240: "y"
      CONFIG_ESP32S3_DATA_CACHE_64KB: "y"
      CONFIG_ESP32S3_DATA_CACHE_LINE_64B: "y"
      CONFIG_AUDIO_BOARD_CUSTOM: "y"

psram:
  mode: octal
  speed: 80MHz

# Enable logging
logger:

# Enable Home Assistant API
api:
  on_client_connected:
    then:
      - delay: 50ms
#     - light.turn_off: led_ww
      - micro_wake_word.start:
  on_client_disconnected:
    then:
      - voice_assistant.stop: 

# Allow Over-The-Air updates
ota:
  - platform: esphome
    password: !secret ota_password


# Allow provisioning Wi-Fi via serial
improv_serial:

wifi:
  ssid: !secret wifi_iot_ssid
  password: !secret wifi_iot_password
  # Set up a wifi access point
  ap: {}

# In combination with the `ap` this allows the user
# to provision wifi credentials to the device via WiFi AP.
captive_portal:

dashboard_import:
  package_import_url: github://esphome/firmware/esphome-web/esp32s3.yaml@v2
  import_full_config: true

# Sets up Bluetooth LE (Only on ESP32) to allow the user
# to provision wifi credentials to the device.
esp32_improv:
  authorizer: none

# To have a "next url" for improv serial
web_server:


i2s_audio:
  - id: i2s_mic
    i2s_lrclk_pin: GPIO3    #WS 
    i2s_bclk_pin: GPIO5     #SCK
  - id: i2s_speaker
    i2s_lrclk_pin: GPIO6    #LRC 
    i2s_bclk_pin: GPIO7     #BCLK
  #id: i2s_main
  #i2s_lrclk_pin: GPIO7
  #i2s_bclk_pin: GPIO6
  #access_mode: duplex

microphone:
  - platform: i2s_audio
    id: va_mic
    i2s_audio_id: i2s_mic
    adc_type: external
    i2s_din_pin: GPIO4        # SD Pin of INMP441 Microphone
    channel: left             # worked without this?
    pdm: false
    bits_per_sample: 32 bit

speaker:
  - platform: i2s_audio
    id: va_speaker
    i2s_audio_id: i2s_speaker
    dac_type: external
    i2s_dout_pin: GPIO8       # DIN Pin of the MAX98357A Audio Amplifier
    mode: mono

micro_wake_word:
  on_wake_word_detected:
    # then:
    - voice_assistant.start:
        wake_word: !lambda return wake_word;
        silence_detection: true    # defaults to true.
#    - light.turn_on:
#        id: led_ww           
#        red: 30%
#        green: 30%
#        blue: 70%
#        brightness: 60%
#        effect: fast pulse 
  model: hey_jarvis

voice_assistant:
#  use_wake_word: false
  id: va
  microphone: va_mic
  auto_gain: 31dBFS
  noise_suppression_level: 2
  volume_multiplier: 2.0            #2.0
  speaker: va_speaker
  on_stt_end:
       then: 
#         - light.turn_off: led_ww
  on_error:
          - micro_wake_word.start:  
  on_end:
        then:
#          - light.turn_off: led_ww
          - wait_until:
              not:
                voice_assistant.is_running:
          - micro_wake_word.start:

ESPHome logs
“Hey Jarvis, what’s the time?”

[18:56:56][D][micro_wake_word:363]: Wake word sliding average probability is 0.574 and most recent probability is 0.957
[18:56:56][D][micro_wake_word:129]: Wake Word Detected
[18:56:56][D][micro_wake_word:178]: State changed from DETECTING_WAKE_WORD to STOP_MICROPHONE
[18:56:56][D][micro_wake_word:135]: Stopping Microphone
[18:56:56][D][micro_wake_word:178]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[18:56:56][D][esp-idf:000]: I (4556305) I2S: DMA queue destroyed
[18:56:56]
[18:56:56][D][micro_wake_word:178]: State changed from STOPPING_MICROPHONE to IDLE
[18:56:56][D][voice_assistant:504]: State changed from IDLE to START_MICROPHONE
[18:56:56][D][voice_assistant:510]: Desired state set to START_PIPELINE
[18:56:56][D][voice_assistant:221]: Starting Microphone
[18:56:56][D][ring_buffer:024]: Created ring buffer with size 16384
[18:56:56][D][voice_assistant:504]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[18:56:56][D][esp-idf:000]: I (4556311) I2S: DMA Malloc info, datalen=blocksize=1024, dma_buf_count=4
[18:56:56]
[18:56:56][D][voice_assistant:504]: State changed from STARTING_MICROPHONE to START_PIPELINE
[18:56:56][D][voice_assistant:275]: Requesting start...
[18:56:56][D][voice_assistant:504]: State changed from START_PIPELINE to STARTING_PIPELINE
[18:56:56][D][voice_assistant:525]: Client started, streaming microphone
[18:56:56][D][voice_assistant:504]: State changed from STARTING_PIPELINE to STREAMING_MICROPHONE
[18:56:56][D][voice_assistant:510]: Desired state set to STREAMING_MICROPHONE
[18:56:56][D][voice_assistant:627]: Event Type: 1
[18:56:56][D][voice_assistant:630]: Assist Pipeline running
[18:56:56][D][voice_assistant:627]: Event Type: 3
[18:56:56][D][voice_assistant:641]: STT started
[18:56:57][D][voice_assistant:627]: Event Type: 11
[18:56:57][D][voice_assistant:781]: Starting STT by VAD
[18:56:58][D][voice_assistant:627]: Event Type: 12
[18:56:58][D][voice_assistant:785]: STT by VAD end
[18:56:58][D][voice_assistant:504]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[18:56:58][D][voice_assistant:510]: Desired state set to AWAITING_RESPONSE
[18:56:58][D][voice_assistant:504]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[18:56:58][D][esp-idf:000]: I (4558783) I2S: DMA queue destroyed
[18:56:58]
[18:56:58][D][voice_assistant:504]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[18:57:04][D][voice_assistant:627]: Event Type: 4
[18:57:04][D][voice_assistant:655]: Speech recognised as: " What's the time?"
[18:57:04][D][voice_assistant:627]: Event Type: 5
[18:57:04][D][voice_assistant:660]: Intent started
[18:57:06][D][voice_assistant:627]: Event Type: 6
[18:57:06][D][voice_assistant:627]: Event Type: 7
[18:57:06][D][voice_assistant:683]: Response: "Sorry, I am not aware of any device called time?"
[18:57:06][D][voice_assistant:627]: Event Type: 98
[18:57:06][D][voice_assistant:768]: TTS stream start
[18:57:06][D][esp-idf:000][speaker_task]: I (4567203) I2S: DMA Malloc info, datalen=blocksize=512, dma_buf_count=8
[18:57:06]
[18:57:06][D][voice_assistant:627]: Event Type: 2
[18:57:06][D][voice_assistant:717]: Assist Pipeline ended
[18:57:06][D][i2s_audio.speaker:206]: Started I2S Audio Speaker
[18:57:09][D][voice_assistant:627]: Event Type: 99
[18:57:09][D][voice_assistant:776]: TTS stream end
[18:57:09][D][voice_assistant:375]: End of audio stream received
[18:57:09][D][voice_assistant:504]: State changed from STREAMING_RESPONSE to RESPONSE_FINISHED
[18:57:09][D][voice_assistant:510]: Desired state set to RESPONSE_FINISHED
[18:57:10][D][i2s_audio.speaker:210]: Stopping I2S Audio Speaker
[18:57:10][D][i2s_audio.speaker:222]: Stopped I2S Audio Speaker
[18:57:10][D][voice_assistant:407]: Speaker has finished outputting all audio
[18:57:10][D][voice_assistant:504]: State changed from RESPONSE_FINISHED to IDLE
[18:57:10][D][voice_assistant:510]: Desired state set to IDLE
[18:57:10][D][micro_wake_word:178]: State changed from IDLE to START_MICROPHONE
[18:57:10][D][micro_wake_word:116]: Starting Microphone
[18:57:10][D][micro_wake_word:178]: State changed from START_MICROPHONE to STARTING_MICROPHONE
[18:57:10][D][esp-idf:000]: I (4570425) I2S: DMA Malloc info, datalen=blocksize=1024, dma_buf_count=4
[18:57:10]
[18:57:10][D][micro_wake_word:178]: State changed from STARTING_MICROPHONE to DETECTING_WAKE_WORD

Hardware

Guides and Resources I used

I thought this bug might be relevant, but others seem to have resolved the issue, while I have not.

github.com/esphome/issues

2024.5.0 Voice assistant, no speaker sounds

opened 10:09AM - 15 May 24 UTC

hugobloem

### The problem

I am using the ESP32-S3-BOX (non 3) firmware from esphome/firmw…are. However, after updating to esphome 2024.5 I get no voice return. The text on the display does come up correctly and opening the audio link in a browser plays the audio as normal.

### Which version of ESPHome has the issue?

2024.5.0

### What type of installation are you using?

Home Assistant Add-on

### Which version of Home Assistant has the issue?

2024.5

### What platform are you using?

ESP32

### Board

_No response_

### Component causing the issue

_No response_

### Example YAML snippet

_No response_

### Anything in the logs that might be useful for us?

```txt
[11:05:18][D][voice_assistant:591]: Speech recognised as: "Tell me a joke."
[11:05:18][D][text_sensor:064]: 'text_request': Sending state 'Tell me a joke.'
[11:05:18][W][component:237]: Component voice_assistant took a long time for an operation (240 ms).
[11:05:18][W][component:238]: Components should block for at most 30 ms.
[11:05:18][D][voice_assistant:563]: Event Type: 5
[11:05:18][D][voice_assistant:596]: Intent started
[11:05:19][D][voice_assistant:563]: Event Type: 6
[11:05:19][D][voice_assistant:563]: Event Type: 7
[11:05:19][D][voice_assistant:619]: Response: "I'm here to assist with your smart home. How can I help you today?"
[11:05:19][D][text_sensor:064]: 'text_response': Sending state 'I'm here to assist with your smart home. How can I help you today?'
[11:05:19][D][voice_assistant:563]: Event Type: 98
[11:05:19][D][voice_assistant:704]: TTS stream start
[11:05:19][D][esp-idf:000][speaker_task]: I (258604) I2S: DMA Malloc info, datalen=blocksize=2048, dma_buf_count=8
[11:05:19][D][esp-idf:000][speaker_task]: I (258612) I2S: I2S0, MCLK output by GPIO2
[11:05:19][D][esp-idf:000][speaker_task]: I (258618) ESP32_S3_BOX: I2S0, MCLK output by GPIO0
[11:05:19][D][esp-idf:000][speaker_task]: I (258622) AUDIO_PIPELINE: link el->rb, el:0x3d85c2c8, tag:raw, rb:0x3d85c438
[11:05:19][D][esp-idf:000][speaker_task]: I (258629) AUDIO_ELEMENT: [raw-0x3d85c2c8] Element task created
[11:05:19][D][esp-idf:000][speaker_task]: I (258635) AUDIO_ELEMENT: [i2s-0x3d85c024] Element task created
[11:05:19][D][esp-idf:000][speaker_task]: I (258640) AUDIO_PIPELINE: Func:audio_pipeline_run, Line:359, MEM Total:8064151 Bytes, Inter:63740 Bytes, Dram:63740 Bytes
[11:05:19][D][esp-idf:000][i2s]: I (258646) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1
[11:05:19][D][esp-idf:000][i2s]: I (258648) I2S_STREAM: AUDIO_STREAM_WRITER
[11:05:19][D][esp-idf:000][speaker_task]: I (258652) AUDIO_PIPELINE: Pipeline started
[11:05:20][W][component:237]: Component voice_assistant took a long time for an operation (280 ms).
[11:05:20][W][component:238]: Components should block for at most 30 ms.
[11:05:20][D][voice_assistant:563]: Event Type: 8
[11:05:20][D][voice_assistant:639]: Response URL: "http://192.168.1.102:8123/api/tts_proxy/8e80ff9caa1ef21e0bcaaea38ac66211b3483bab_en-gb_2cdeae300d_tts.microsoft.wav"
[11:05:20][D][voice_assistant:439]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[11:05:20][D][voice_assistant:445]: Desired state set to STREAMING_RESPONSE
[11:05:20][D][voice_assistant:563]: Event Type: 2
[11:05:20][D][voice_assistant:653]: Assist Pipeline ended
[11:05:21][D][esp-idf:000][speaker_task]: W (260212) AUDIO_PIPELINE: There are no listener registered
[11:05:21][D][esp-idf:000][speaker_task]: I (260219) AUDIO_PIPELINE: audio_pipeline_unlinked
[11:05:21][D][esp-idf:000][speaker_task]: W (260226) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINATE
[11:05:21][D][esp-idf:000][speaker_task]: I (260235) I2S: DMA queue destroyed
[11:05:21][D][esp-idf:000][speaker_task]: W (260243) AUDIO_ELEMENT: [filter] Element has not create when AUDIO_ELEMENT_TERMINATE
[11:05:21][D][esp-idf:000][speaker_task]: W (260251) AUDIO_ELEMENT: [raw] Element has not create when AUDIO_ELEMENT_TERMINATE
[11:05:21][D][esp-idf:000][speaker_task]: I (260291) I2S: DMA Malloc info, datalen=blocksize=2048, dma_buf_count=8
[11:05:21][D][esp-idf:000][speaker_task]: I (260299) I2S: I2S0, MCLK output by GPIO2
[11:05:21][D][esp-idf:000][speaker_task]: I (260309) ESP32_S3_BOX: I2S0, MCLK output by GPIO0
[11:05:21][D][esp-idf:000][speaker_task]: I (260317) AUDIO_PIPELINE: link el->rb, el:0x3d85c2c8, tag:raw, rb:0x3d85c438
[11:05:21][D][esp-idf:000][speaker_task]: I (260325) AUDIO_ELEMENT: [raw-0x3d85c2c8] Element task created
[11:05:21][D][esp-idf:000][speaker_task]: I (260333) AUDIO_ELEMENT: [i2s-0x3d85c024] Element task created
[11:05:21][D][esp-idf:000][speaker_task]: I (260338) AUDIO_PIPELINE: Func:audio_pipeline_run, Line:359, MEM Total:8064243 Bytes, Inter:63832 Bytes, Dram:63832 Bytes
[11:05:21][D][esp-idf:000][i2s]: I (260345) AUDIO_ELEMENT: [i2s] AEL_MSG_CMD_RESUME,state:1
[11:05:21][D][esp-idf:000][i2s]: I (260348) I2S_STREAM: AUDIO_STREAM_WRITER
[11:05:21][D][esp-idf:000][speaker_task]: I (260350) AUDIO_PIPELINE: Pipeline started
[11:05:25][D][voice_assistant:563]: Event Type: 99
[11:05:25][D][voice_assistant:712]: TTS stream end
[11:05:25][D][voice_assistant:310]: End of audio stream received
[11:05:25][D][voice_assistant:439]: State changed from STREAMING_RESPONSE to RESPONSE_FINISHED
[11:05:25][D][voice_assistant:445]: Desired state set to RESPONSE_FINISHED
[11:05:27][D][esp-idf:000][speaker_task]: W (266700) AUDIO_PIPELINE: There are no listener registered
[11:05:27][D][esp-idf:000][speaker_task]: I (266707) AUDIO_PIPELINE: audio_pipeline_unlinked
[11:05:27][D][esp-idf:000][speaker_task]: W (266716) AUDIO_ELEMENT: [i2s] Element has not create when AUDIO_ELEMENT_TERMINATE
[11:05:27][D][esp-idf:000][speaker_task]: I (266723) I2S: DMA queue destroyed
```

### Additional information

_No response_

Coldness00 · August 25, 2024, 8:40pm

I have the same issue.

will35 · August 26, 2024, 12:02pm

Hi

Do you use the 5V pin on the ESP32-S3 for the MAX98357 Vin?
You have to solder the pad here
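
Once the amplifier has power, one way to check the speaker path on its own, independent of the Assist pipeline, is to play a short tone from a button. This is only a sketch: it assumes an ESPHome release whose `rtttl` component accepts a `speaker`, it reuses the `va_speaker` id from the configuration in the first post, and the button name and melody are made up for illustration.

```yaml
# Fragment to merge into the existing config, not a complete file.
rtttl:
  speaker: va_speaker          # id of the existing i2s_audio speaker

button:
  - platform: template
    name: "Speaker test"       # hypothetical name, shows up in Home Assistant
    on_press:
      # Short RTTTL melody; any valid RTTTL string should do.
      - rtttl.play: 'siren:d=8,o=5,b=100:d,e,d,e,d,e,d,e'
```

If the tone is audible but spoken responses are not, the wiring and power are probably fine and the problem is more likely in the voice assistant path.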

Mortalitas · August 31, 2024, 11:27am

Thanks. I hear a loud crackling sound between powering the voice assistant and the first time it speaks. But that’s progress.

However, for the voice assistant to hear me, I have to muffle the speaker with my hands.

Timu5 · September 27, 2024, 4:00pm

I’m facing the same issue, and nothing seems to help.

I think there is an issue with reading the TTS WAV file. My log:

[13:50:41][D][voice_assistant:715]: Response URL: "http://192.168.0.160:8123/api/tts_proxy/bb4c44570a13d4d6785b9dd975a41a337846fa48_pl_2c82848529_tts.google_en_com.wav"
[13:50:41][D][voice_assistant:514]: State changed from IDLE to STREAMING_RESPONSE
[13:50:41][D][voice_assistant:520]: Desired state set to STREAMING_RESPONSE
[13:50:41][D][voice_assistant:381]: End of audio stream received
[13:50:41][D][voice_assistant:514]: State changed from STREAMING_RESPONSE to RESPONSE_FINISHED
[13:50:41][D][voice_assistant:520]: Desired state set to RESPONSE_FINISHED
[13:50:42][D][i2s_audio.speaker:228]: Started I2S Audio Speaker
[13:50:42][D][voice_assistant:637]: Event Type: 2
[13:50:42][D][voice_assistant:729]: Assist Pipeline ended
[13:50:42][D][i2s_audio.speaker:233]: Stopping I2S Audio Speaker
[13:50:42][D][i2s_audio.speaker:242]: Stopped I2S Audio Speaker
[13:50:42][D][voice_assistant:417]: Speaker has finished outputting all audio
[13:50:42][D][voice_assistant:514]: State changed from RESPONSE_FINISHED to IDLE
[13:50:42][D][voice_assistant:520]: Desired state set to IDLE

I doubt it can download the WAV file that quickly: the response URL and the end of the audio stream are both logged within the same second (13:50:41).
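
To get more detail on what happens around the response, the existing `logger:` and `voice_assistant:` blocks could be extended roughly like this. It is a diagnostic sketch, not a fix. It assumes, as I understand the voice assistant documentation, that `on_tts_end` exposes the response URL as `x` and that `on_error` exposes `code` and `message`; the log message texts are arbitrary.

```yaml
# Fragment to merge into the existing config, not a complete file.
logger:
  level: VERBOSE               # more detail than the default DEBUG level

voice_assistant:
  # ...keep the existing microphone, speaker and other options...
  on_tts_end:
    - logger.log:
        format: "TTS response URL: %s"
        args: ['x.c_str()']
  on_error:
    - logger.log:
        format: "Voice assistant error %s: %s"
        args: ['code.c_str()', 'message.c_str()']
    - micro_wake_word.start:   # keep restarting wake word detection after an error
```

With this in place, an error reported by the pipeline should appear in the log with its code and message instead of passing silently.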