New near-native solution: GitHub - eslavnov/ttmg_server: Talk To Me Goose Server
TL;DR: Read below to learn how to get TTS responses in under 3 seconds, even for huge texts.
I’ve been playing with my HAVPE devices and I love them, but I noticed that they don’t handle long TTS responses that well. For example, if you ask ChatGPT to tell you a story, you either hit a timeout or, if you manually increase the timeout, you can wait dozens of seconds before getting a response. This happens because everything is sequential: you first wait for ChatGPT to finish its entire response, then pass the whole text to the TTS engine and wait again while it generates a long audio response.
But we know that LLMs can stream their responses, and the same goes for some TTS systems. So, hypothetically speaking, we could stream the LLM’s response into a TTS engine before it’s even finished and save a bunch of time. I’ve written a small prototype that does exactly that, and it seems to work surprisingly well: on average it takes only about 3 seconds to start the audio stream.
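To make the idea concrete, here’s a rough sketch of the core trick, not the actual project code: split the LLM stream into sentences and hand each one to TTS as soon as it’s complete. It uses the official `openai` Python package; the model name, the sentence-splitting regex, and the `speak()` stub are all illustrative assumptions.

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stream_sentences(prompt: str):
    """Yield complete sentences as soon as the LLM produces them."""
    buffer = ""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any streaming-capable model works
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue
        buffer += chunk.choices[0].delta.content or ""
        # Flush every time the buffer contains a sentence boundary,
        # so TTS can start long before the LLM is done.
        while (match := re.search(r"[.!?]\s", buffer)):
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()

def speak(sentence: str):
    # Placeholder: hand each sentence to your TTS engine here;
    # a real implementation would queue/stream the resulting audio.
    print("TTS <-", sentence)

for sentence in stream_sentences("Tell me a short bedtime story."):
    speak(sentence)
```

The time-to-first-audio then depends only on how fast the LLM produces the first sentence, not on the total length of the response.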
Right now it supports OpenAI as the LLM provider. For TTS, it supports OpenAI and Google Cloud. To make it work with Home Assistant (including voice devices), you need to run a Python script and create a couple of automations; all the details are available here: GitHub - eslavnov/llm-stream-tts: Stream LLMs responses directly into your TTS engine of choice
Basically, when your request starts with one of the defined trigger words, it switches to this streaming pipeline, which is perfect for stories, audiobooks, summaries, etc. (see the routing sketch below).
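The routing itself is just a prefix check; the trigger words and function name below are made up for illustration, since the project lets you configure your own:

```python
# Hypothetical trigger words; configure these to taste.
TRIGGER_WORDS = ("tell me a story", "read", "summarize")

def pick_pipeline(command: str) -> str:
    text = command.lower().strip()
    if text.startswith(TRIGGER_WORDS):  # str.startswith accepts a tuple
        return "streaming"              # long-form: stream LLM -> TTS
    return "default"                    # short replies: normal pipeline
```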
It’s still a very early work-in-progress, but I am curious to hear your thoughts!