Streaming LLM responses into TTS for near-instant replies (works with HAVPE!)

New near-native solution: GitHub - eslavnov/ttmg_server: Talk To Me Goose Server

TL;DR: Read on to learn how to get TTS responses in under 3 seconds, even for huge texts.
I’ve been playing with my HAVPE devices and I love them, but I noticed that they don’t handle long TTS responses that well. For example, if you ask ChatGPT to tell you a story, you either hit a timeout or, if you manually increase the timeout, you can wait for dozens of seconds before getting a response. This happens because everything is sequential: you first wait for the full response from ChatGPT, then pass the whole response to the TTS engine and wait again while it generates a long audio file.

But we know that LLMs can stream their responses, and so can some TTS systems - so, hypothetically, we could stream the LLM’s response (before it’s even finished) into a TTS engine and save a bunch of time. I’ve written a small prototype that does exactly that, and it seems to work surprisingly well (on average it takes only ~3 seconds to start the audio stream).
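To make the idea concrete, here is a minimal sketch of the core loop (not the actual project code): it reads streamed tokens from OpenAI, buffers them until a sentence boundary, and hands each complete sentence to the TTS engine while the LLM keeps generating. The model names, voice, and sentence-splitting regex are illustrative assumptions.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def synthesize(sentence: str):
    """Send one sentence to OpenAI TTS; a real player would queue the audio."""
    audio = client.audio.speech.create(
        model="tts-1", voice="alloy", input=sentence  # illustrative choices
    )
    audio.write_to_file("chunk.mp3")  # placeholder sink for the audio chunk

def stream_llm_to_tts(prompt: str):
    """Stream an LLM response and synthesize it sentence by sentence."""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    buffer = ""
    for chunk in stream:
        buffer += chunk.choices[0].delta.content or ""
        # Naive sentence boundary: end punctuation followed by whitespace
        while (match := re.search(r"[.!?]\s", buffer)):
            sentence, buffer = buffer[:match.end()], buffer[match.end():]
            synthesize(sentence.strip())
    if buffer.strip():
        synthesize(buffer.strip())  # flush whatever is left at the end
```

The win comes from overlap: the first sentence starts playing while the LLM is still writing the rest, so perceived latency no longer grows with the length of the response.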

Right now it supports OpenAI as an LLM provider, and OpenAI and Google Cloud as TTS options. To make it work with Home Assistant (including voice devices), you need to run a Python script and create a couple of automations; all the details are available here: GitHub - eslavnov/llm-stream-tts: Stream LLMs responses directly into your TTS engine of choice

Basically, when your command starts with one of the defined trigger words, it switches to this streaming pipeline, which is perfect for stories, audiobooks, summaries, etc. A rough sketch of the routing check follows.
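The trigger words below are made-up examples, not the project’s actual defaults; the real list is whatever you configure:

```python
# Hypothetical trigger words; the real list is user-configured
TRIGGER_WORDS = ("tell", "read", "summarize")

def should_stream(command: str) -> bool:
    """Route to the streaming pipeline if the command starts with a trigger word."""
    words = command.strip().lower().split(maxsplit=1)
    return bool(words) and words[0] in TRIGGER_WORDS
```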

It’s still a very early work-in-progress, but I am curious to hear your thoughts!


Great idea! Would this potentially also work with local LLMs, like running Ollama with a model plus Piper and Whisper locally? (I have these running on a laptop, and HA running on an RPi4.)

Ollama supports streaming, so it should be possible. Whisper does not support streaming (I think?), but it would still benefit from splitting long responses into sentences. So it will probably be a bit slower than something like Google Cloud TTS (assuming everything else is equal), but still faster than the current situation.
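For reference, here is a minimal sketch of consuming Ollama’s streaming output; it assumes a default local install on port 11434, and the model name is illustrative:

```python
import json
import requests

# Assumes a default local Ollama install; model name is illustrative
OLLAMA_URL = "http://localhost:11434/api/generate"

def stream_ollama(prompt: str, model: str = "llama3"):
    """Yield response fragments as Ollama streams them."""
    with requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                data = json.loads(line)  # one JSON object per line
                yield data.get("response", "")
                if data.get("done"):
                    break
```

These fragments could then feed the same sentence-buffering loop shown above, with the TTS call swapped for a local engine.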

On a side note, I’ve just added support for ElevenLabs!

Whisper needs to gain streaming support.

I’ve updated my solution to provide near-native real-time streaming; see the new version here: GitHub - eslavnov/ttmg_server: Talk To Me Goose Server

It also works with Piper now!