Voice responses are cut off/compressed at the beginning when using ESP32-S3-BOX-3

SpencerDub · December 15, 2023, 2:33am

I set up an ESP32-S3-BOX-3 as a voice assistant using this tutorial. It responds to the “ok nabu” wakeword and I can give it voice commands to turn on or off lights. So far, so good!

However, when it is playing back a voice response, such as saying “Turned off switch” after I tell it to turn off a switch, the first beat of the playback is compressed/cut off. Instead of “Turned off switch,” I hear something more like “td off switch.”

How can I fix this so I can hear the entire voice response?

I’m using the Home Assistant Cloud voice assistant, and all of this is happening within my home network.

brendon1 · December 16, 2023, 7:33pm

Got me beat. I’m just getting a click instead of audio on the one I just set up.

adynis · December 19, 2023, 12:25am

Similar… I hear that “click” and then just the last part of the sound (somewhere between the last 80% and the last 10% of the sentence sent to the ESP32-S3-BOX-3 internal speaker). I was thinking of finding a way to redirect the sound to a media player (because the internal speaker has a VERY LOW volume level anyway).

fabianluque · February 7, 2024, 8:35pm

Same here, a clicking sound and then the response is cut off slightly at the beginning.

Were you able to fix it?

SpencerDub · February 7, 2024, 8:41pm

Sadly, not yet.

Nikku · March 20, 2024, 12:41pm

Seeing the same behavior here, it appears to be most noticeable on short strings, like “Turned on light”. If I use a custom intent I have to ask for the weather forecast, which generates a long response string, it doesn’t appear to cut off the audio. Not sure if this is some sort of buffer/activation issue where it isn’t giving the audio DAC enough time to engage before trying to play the audio? I couldn’t find much else to look at in configs for Piper or on the yml for the assistant.

EDIT: Just tested and it actually does happen on long strings. The first word or two is “squished” or garbled, as if it tries to speak them really fast, and then the speech goes back to normal speed. I also noticed that this doesn’t happen when using the Cloud TTS; it seems to happen only when using Piper. Additionally, when using Assist from my phone, this does not occur at all with Piper; it works fine with the same replies.

Rich37804 · March 20, 2024, 2:17pm

I believe this is a streaming issue. I put my satellites on a separate IoT network and also gave them and the machine running HAAS priority on the network. I’ve almost completely eliminated this issue.

Nikku · March 20, 2024, 2:42pm

Hm, the machine is on a dedicated network, running on a Ryzen 9 7950X with 4 cores assigned to it at 5 GHz, and 8 GB of DDR5 RAM. The network is a UniFi WiFi 6 network, so I don’t see any reason it would have any latency. My phone is also on the same network when I test and I don’t encounter the issue there. I additionally run Willow on the same network, with its satellites also on that network, and it doesn’t have any issue with streaming audio replies, so it seems specific to something going on with the firmware (at least that’s what process of elimination seems to be pointing to). Not a big deal to me for usage, just figured I’d give some info in case it helps Nabu/the community get to the bottom of the issue.

donburch888 · April 15, 2024, 7:34am

For what it’s worth … I’m getting the same thing with my RasPis running wyoming-satellite and MPD as the media_player … so it’s not limited to your ESP hardware.

Curiously, the first unit I have been testing with has no problem speaking “This is a test”, but one device compresses the first two words, and the other plays the whole message in one burst.

pepe59 · April 15, 2024, 9:21am

Has anyone found a solution to send audio feedback to another player?

donburch888 · April 18, 2024, 1:12am

Actually, I found it’s fairly easy if you already have the other player set up as a media_player.

Call the tts.speak service with a target of tts.piper (to keep the same voice), and set the Media Player entity to the media_player device you want to hear the message on.
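
For anyone who would rather script this than click through the UI, here is a rough sketch of the same tts.speak call made through Home Assistant’s REST API from Python. The base URL, the token, and `media_player.living_room` are placeholders I made up, so substitute your own. The function only builds the request; nothing is sent until you pass it to `urlopen`.

```python
import json
import urllib.request


def build_tts_speak_request(base_url, token, media_player, message,
                            tts_entity="tts.piper"):
    """Build (but do not send) a POST request for the tts.speak service."""
    payload = {
        "entity_id": tts_entity,                 # TTS engine, keeps the Piper voice
        "media_player_entity_id": media_player,  # where the audio should play
        "message": message,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/services/tts/speak",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # long-lived access token
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # All values below are placeholders, not real settings.
    req = build_tts_speak_request(
        "http://homeassistant.local:8123",
        "YOUR_LONG_LIVED_TOKEN",
        "media_player.living_room",
        "Turned off switch",
    )
    print(req.full_url)
    # urllib.request.urlopen(req)  # uncomment to actually send the request
```

The same three fields (`entity_id`, `media_player_entity_id`, `message`) are what you fill in when calling the service from Developer Tools.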

That assumes you know which media_player device to send the message to, so if you have several satellites, you may need to translate from the (internal) device_id to the appropriate (user-friendly) media player device name.
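
As a minimal sketch of that translation step, a plain lookup table with a fallback is probably enough. Every device_id and entity name below is invented for illustration; how you obtain the satellite’s device_id depends on your automation.

```python
# Hypothetical mapping from a satellite's internal device_id to the
# media_player entity in the same room. All IDs and names are made up.
SATELLITE_TO_PLAYER = {
    "example_device_id_kitchen": "media_player.kitchen_speaker",
    "example_device_id_bedroom": "media_player.bedroom_speaker",
}

# Used when a satellite has no entry in the table above.
DEFAULT_PLAYER = "media_player.living_room_speaker"


def player_for_device(device_id, mapping=SATELLITE_TO_PLAYER,
                      default=DEFAULT_PLAYER):
    """Return the media_player entity that should speak for this satellite."""
    return mapping.get(device_id, default)


if __name__ == "__main__":
    print(player_for_device("example_device_id_kitchen"))
    print(player_for_device("some_unknown_device"))
```

The fallback means a newly added satellite still gets a spoken reply somewhere, even before you add it to the table.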

My issue was that I didn’t have another media_player, and Wyoming satellite doesn’t provide one, or any way to send a message to the satellite’s speaker. I still have to work out where to put my chime.wav so that media_player.play_media will find it.

SpencerDub · May 2, 2024, 9:57pm

Since updating the device to ESPHome 2024.4.2, the frequency of clipping has been greatly reduced for me. It’s not zero, but it’s significantly lower!

Mjbk · November 14, 2024, 1:54pm

Any update on this? I’m using Piper in TTS to send media to a chrome display, and the first few words are always cut off. Even when using the cache, the issue is present, so I’m not sure what’s going wrong.