I set up an ESP32-S3-BOX-3 as a voice assistant using this tutorial. It responds to the “ok nabu” wakeword and I can give it voice commands to turn on or off lights. So far, so good!
However, when it is playing back a voice response, such as saying “Turned off switch” after I tell it to turn off a switch, the first beat of the playback is compressed/cut off. Instead of “Turned off switch,” I hear something more like “td off switch.”
How can I fix this so I can hear the entire voice response?
I’m using the Home Assistant Cloud voice assistant, and all of this is happening within my home network.
Similar… I hear that “click” and then only the last part of the sound (somewhere between the last 80% and the last 10% of the sentence sent to the ESP32-S3-BOX-3 internal speaker). I was thinking of finding a way to redirect the sound to a media player (because the internal speaker has a VERY LOW volume level anyway).
Seeing the same behavior here. It appears to be most noticeable on short strings, like “Turned on light”. If I use a custom intent I have to ask for the weather forecast, which generates a long response string, it doesn’t appear to cut off the audio. Could this be some sort of buffer/activation issue where the audio DAC isn’t given enough time to engage before playback starts? I couldn’t find much else to look at in the configs for Piper or in the yml for the assistant.
EDIT: Just tested, and it actually does happen on long strings too. The first word or two is “squished” or garbled, as if it is being spoken really fast, and then the speech goes back to normal speed. I also noticed that this doesn’t happen when using the Cloud TTS; it seems to happen only when using Piper. Additionally, when using Assist from my phone, this does not occur at all with Piper; it works fine with the same replies.
I believe this is a streaming issue. I put my satellites on a separate IoT network and also gave them and the machine running HAAS priority on the network. I’ve almost completely eliminated this issue.
Hm, the machine is on a dedicated network, running on a Ryzen 9 7950X with 4 cores assigned to it at 5 GHz, and 8 GB of DDR5 RAM. The network is a UniFi WiFi 6 network, so I don’t see any reason it would have any latency. My phone is also on the same network when I test, and I don’t encounter the issue there. I additionally run Willow on the same network, with its satellites also on the same network, and it doesn’t have any issue with streaming audio replies, so it seems specific to something going on with the firmware (at least that’s what process of elimination seems to be pointing to). Not a big deal to me for usage, just figured I’d give some info in case it helps Nabu/the community get to the bottom of the issue.
For what it’s worth … I’m getting the same thing with my RasPis running wyoming-satellite and MPD as the media_player … so it’s not limited to your ESP hardware.
Curiously, the first unit I have been testing with has no problem speaking “This is a test”, but one device compresses the first two words, and the other plays the whole message in one burst.
Actually, I found it’s fairly easy if you already have the other player set up as a media_player.
Call the tts.speak service with a target of tts.piper (to keep the same voice), and set the Media Player entity to the media_player device you want to hear the message on.
That assumes you know which media_player device to send the message to, so if you have several satellites that may require you to translate from the (internal) device_id to the appropriate (user-friendly) media player device name.
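For anyone who wants to script this rather than use the UI, here is a rough sketch of the same tts.speak call made through Home Assistant’s REST API, with a small lookup table for the device_id translation mentioned above. The device IDs, the media_player entity names, the base URL, and the helper name build_tts_speak_request are all made up for illustration, and it assumes your Piper entity really is called tts.piper. The function only builds the request; it does not send anything.

```python
import json
import urllib.request

# Hypothetical mapping from a satellite's internal device_id to the
# media_player entity that should speak its replies. Both the IDs and the
# entity names are placeholders; substitute your own.
SATELLITE_TO_PLAYER = {
    "abc123": "media_player.kitchen_speaker",
    "def456": "media_player.office_speaker",
}


def build_tts_speak_request(base_url, token, device_id, message,
                            tts_entity="tts.piper"):
    """Build (but do not send) a POST to the tts.speak service."""
    # Raises KeyError if the satellite has no media player mapped to it.
    player = SATELLITE_TO_PLAYER[device_id]
    payload = {
        "entity_id": tts_entity,           # TTS engine, keeps the same voice
        "media_player_entity_id": player,  # where the message is played
        "message": message,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/services/tts/speak",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # long-lived access token
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To actually play the message you would pass the result to urllib.request.urlopen(), using a long-lived access token created from your Home Assistant user profile. The same three fields (entity_id, media_player_entity_id, message) are what you would fill in when calling tts.speak from an automation or from Developer Tools.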
My issue was that I didn’t have another media_player, and Wyoming satellite doesn’t provide one, or any way to send a message to the satellite’s speaker. I still have to work out where to put my chime.wav so that media_player.play_media will find it.