Streaming support for Wyoming TTS

The new release introduced enough internal changes to make this functionality possible. It is especially relevant during the transition period, until TTS engines that support streaming natively emerge. Now is a good time to start implementing this for cloud solutions.

The custom integration works on a simple principle: a stream of text chunks is combined into sentences, which are then sent for synthesis. The audio data is stored in a buffer and transmitted to HA.
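To illustrate the principle, here is a minimal sketch of the chunk-to-sentence step. It is not the integration's actual code: the names `iter_sentences` and `stream_audio`, the `synthesize` callable, and the punctuation-based boundary rule are all assumptions made for the example.

```python
import re
from typing import Callable, Iterable, Iterator

# Assumed boundary rule: ., ! or ? followed by whitespace ends a sentence.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Combine a stream of text chunks into complete sentences."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Everything except the last part is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left once the text stream ends.
    if buffer.strip():
        yield buffer.strip()


def stream_audio(
    chunks: Iterable[str], synthesize: Callable[[str], bytes]
) -> Iterator[bytes]:
    """Synthesize each sentence as soon as it is complete.

    `synthesize` is a hypothetical stand-in for the call to the TTS server.
    """
    for sentence in iter_sentences(chunks):
        yield synthesize(sentence)


if __name__ == "__main__":
    text_stream = ["Hello wor", "ld. How are", " you? Fine"]
    for sentence in iter_sentences(text_stream):
        print(sentence)
```

A real implementation would likely need a smarter boundary rule (abbreviations, decimals, other languages), but the buffering pattern stays the same.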

This already works in the interface, though ESP satellites are not fully supported yet. However, synthesis runs in step with the text response, so the satellite starts responding immediately after the text is complete. This lets Piper handle long responses on weak hardware, even if the RTF (real-time factor) is 1. It also eliminates the problem of the session being interrupted after waiting 5 seconds for a response. In the future, developers will enable ESP satellites to stream voice responses.

This is a very early version I put together overnight, and much of it may need to be reworked. If you have experience with voice integrations, feel free to fork it and improve on the idea.

In addition to external servers, you can proxy Piper from the add-on by specifying the container name (core-piper) and port (10200) during configuration.

The repository also contains a fix for Piper that switches it from temporary files to RAW output.


This is amazing. I hope the HA devs pick this up and implement it in HA. Can't wait to see snappy voice streaming responses :slight_smile:
