Talk To Me Goose - Stream LLM responses into TTS in real-time (HAVPE)

gyrga · February 9, 2025, 6:45pm

Heya, folks! I’m excited to present Talk To Me Goose, a project designed to improve the way your Home Assistant Voice Preview Edition (HAVPE) devices handle long responses from LLMs (Large Language Models) by streaming their responses into your TTS engine of choice in real time.

Do your HAVPE devices sometimes fail to respond, and then work just fine with the next command? Have you tried asking your HAVPE device to tell you a story, only for it to fail silently? Have you already managed to increase HAVPE’s timeout, but now you need to wait like 15 seconds to hear a long response? Then this solution is for you!

The Problem:

HAVPE devices often struggle to handle long responses. For instance, when you ask an LLM for something like a story, it can take too long to generate the response, leading to timeouts in the device (e.g., “Connection timed out before data was ready!”). Even with a longer timeout, you can still face frustrating delays as the response has to be fully generated before it gets played.

The Solution:

Talk To Me Goose solves this problem by streaming the LLM’s response directly into your TTS engine, ensuring near-instant audio output. As the LLM generates its response, the text is streamed token-by-token, and as soon as a sentence is complete, it’s passed to the TTS engine. This means you’ll hear the response in just a few seconds, even if it’s a long one!
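To make the sentence-by-sentence idea concrete, here is a minimal sketch of how streamed tokens could be grouped into complete sentences before being handed to a TTS engine. It is not the actual TTMG Server code: the function name `sentences_from_tokens`, the boundary regex, and the demo tokens are all assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumed sentence boundary: '.', '!' or '?' followed by whitespace.
# Abbreviations such as "e.g. " would be split too; a real server
# would likely need smarter rules.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream finishes them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one ends at a sentence boundary.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part is still being generated, so keep buffering it.
        buffer = parts[-1]
    # Flush whatever is left once the LLM stops generating.
    if buffer.strip():
        yield buffer.strip()


# Hypothetical token stream standing in for a streaming LLM response.
demo_tokens = ["Once", " upon", " a", " time", ".", " The", " end", "!"]
for sentence in sentences_from_tokens(demo_tokens):
    print(sentence)  # stand-in for sending the sentence to a TTS engine
```

With this approach a sentence is released only once the next token shows that it has ended, so the first audio can start while the rest of the response is still being generated.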

Additionally, just like the original OpenAI integration, Talk To Me Goose can handle tool calls and commands like “set a timer for X minutes” seamlessly, delivering both the LLM’s output and relevant commands with no delays.

Key Features:

Real-time streaming of LLM responses into your TTS engine.
As native as it gets: handles commands like “set a timer”, other tool calls and message history.
Supports multiple TTS engines.
Easy to set up with Home Assistant and HAVPE devices.
Works with multiple HAVPE devices.

The Solution Includes:

TTMG Server: The core of the system, facilitating the real-time streaming from the LLM to the TTS engine, re-encoding the audio into FLAC and streaming it to your HAVPE devices.
TTMG Conversation: Home Assistant integration for sending requests to the LLM and managing conversations.
HAVPE firmware patches: Patches to the official HAVPE firmware to make it work with the TTMG Server.

This system enables smooth integration between Home Assistant, HAVPE devices and your LLM/TTS setup, ensuring fast and uninterrupted voice responses.

Supported LLMs and TTS Engines:

LLMs:

OpenAI (e.g., GPT-3, GPT-4)

TTS Engines:

OpenAI
Google Cloud
ElevenLabs
Wyoming-piper

Getting Started:

Visit the full README for detailed installation instructions, limitations, configuration examples, flows/endpoints description, etc.

gyrga · February 12, 2025, 7:51pm

I’ve just added support for local conversation responses, so now they work with the new pipeline too!

Herian · February 13, 2025, 2:06pm

Nice! I’m gonna try it out as soon as I have time.
Any chance of using it with Gemini too?

madmobmurphy · February 19, 2025, 4:10pm

Hi,

I’ve gone through the Git repository and documentation, but as a complete beginner, I still have a few questions.

First of all, this seems to be exactly what I was looking for, and I’m really excited to try it out!

However, I’m unsure about the installation process in HASS. Can I install this using the Terminal add-on and Linux commands? Also, when adding the code to HAVPE, if I make a mistake and need to perform the regular reset process, will everything still function properly with the added code?

Apologies if these questions seem basic for a project like this—I’m still learning my way around. I really appreciate any guidance you can provide.

Thanks in advance for your help!

gyrga · February 21, 2025, 5:23pm

Heya,

Installation-wise it should not matter how exactly you install it as long as you find a way to execute the setup script. Keep in mind that you will also have to run the server after the installation or create a systemd service file for it to run automatically (see the example in the repo).
HAVPE changes are handled as custom components in your ESPHome config. If you remove the lines that load those components, you will revert to the stock ESPHome firmware. To restore the original factory firmware you can use this link. If you use the provided generate_esphome_config.py there is no risk of bricking the device or making any other irreversible changes.

gyrga · February 21, 2025, 5:24pm

Not yet, but we are tracking this request here: [Please vote] New feature: support Gemini as an LLM provider · Issue #7 · eslavnov/ttmg_server · GitHub

Smiie-2 · February 22, 2025, 4:34pm

Where would I find the logs from the server?

gyrga · February 22, 2025, 4:54pm

Depends on how you run it: if you run it directly it will output some logs to stdout (your screen), and if you run it as a systemd service then something like journalctl -u ttmg.service will do the trick (replace ttmg.service with the actual service name).
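If you would rather pull the service logs from a script, a small sketch along these lines could work. It only relies on the standard `journalctl` flags `-u`, `-n`, `-f` and `--no-pager`; the helper names and the default line count are assumptions, and `ttmg.service` is the same placeholder service name as above.

```python
import subprocess
from itertools import groupby
from typing import Iterable, Iterator, List, Optional


def journalctl_command(unit: str, follow: bool = False,
                       lines: Optional[int] = None) -> List[str]:
    """Build the journalctl argument list for a systemd unit."""
    cmd = ["journalctl", "--no-pager", "-u", unit]
    if lines is not None:
        cmd += ["-n", str(lines)]  # only the most recent N entries
    if follow:
        cmd.append("-f")  # keep printing new entries as they arrive
    return cmd


def collapse_repeats(log_lines: Iterable[str]) -> Iterator[str]:
    """Collapse runs of identical consecutive lines into 'line (xN)'."""
    for line, group in groupby(log_lines):
        count = sum(1 for _ in group)
        yield line if count == 1 else f"{line} (x{count})"


def read_service_log(unit: str, lines: int = 200) -> List[str]:
    """Fetch recent entries for a unit (needs systemd and log permissions)."""
    result = subprocess.run(journalctl_command(unit, lines=lines),
                            capture_output=True, text=True, check=True)
    return list(collapse_repeats(result.stdout.splitlines()))
```

Calling `read_service_log("ttmg.service")` on the machine that runs the server would return the recent entries, with any run of identical consecutive lines shortened to a single line and a repeat count.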

Smiie-2 · February 22, 2025, 6:09pm

Edit: Ignore this, the error was behind the keyboard. I had the wrong assistant selected after having a few too many browser windows open.

~~The log is just spamming this. And based on the TTS I get back, my speech is not getting passed to the LLM.~~

~~TTS I got back was along the lines of “I don’t know what to say if you don’t talk to me”~~

INFO 19:05:27: HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO 19:05:27: Getting LLM response...
INFO 19:05:27: Getting LLM response...
INFO 19:05:27: Getting LLM response...
[the same "Getting LLM response..." line repeats many more times at 19:05:27]

~~Based on the logs from ESPHome, the speech is getting captured correctly though.~~

[19:05:16][D][light:036]: 'voice_assistant_leds' Setting:
[19:05:16][D][light:051]: Brightness: 28%
[19:05:16][D][light:109]: Effect: 'Listening For Command'
[19:05:17][D][voice_assistant:641]: Event Type: 12
[19:05:17][D][voice_assistant:806]: STT by VAD end
[19:05:17][D][voice_assistant:515]: State changed from STREAMING_MICROPHONE to STOP_MICROPHONE
[19:05:17][D][voice_assistant:522]: Desired state set to AWAITING_RESPONSE
[19:05:17][D][voice_assistant:515]: State changed from STOP_MICROPHONE to STOPPING_MICROPHONE
[19:05:17][D][light:036]: 'voice_assistant_leds' Setting:
[19:05:17][D][light:051]: Brightness: 28%
[19:05:17][D][light:109]: Effect: 'Thinking'
[19:05:17][D][voice_assistant:515]: State changed from STOPPING_MICROPHONE to AWAITING_RESPONSE
[19:05:17][D][voice_assistant:515]: State changed from AWAITING_RESPONSE to AWAITING_RESPONSE
[19:05:18][D][power_supply:033]: Enabling power supply.
[19:05:18][D][power_supply:033]: Enabling power supply.
[19:05:18][D][power_supply:033]: Enabling power supply.
[19:05:19][D][voice_assistant:641]: Event Type: 4
[19:05:19][D][voice_assistant:669]: Speech recognised as: " What time is it?"
[19:05:19][D][voice_assistant:641]: Event Type: 5
[19:05:19][D][voice_assistant:674]: Intent started
[19:05:19][D][voice_assistant:641]: Event Type: 6
[19:05:19][D][voice_assistant:641]: Event Type: 7
[19:05:19][D][voice_assistant:697]: Response: "7:05 PM"
[19:05:19][D][light:036]: 'voice_assistant_leds' Setting:
[19:05:19][D][light:051]: Brightness: 28%
[19:05:19][D][light:109]: Effect: 'Replying'
[19:05:19][D][voice_assistant:641]: Event Type: 8
[19:05:19][D][voice_assistant:717]: Response URL: "http://192.168.0.148:8888/play/47770edf614b60bb2b7a9cbfd4b6614a.flac"
[19:05:19][D][voice_assistant:515]: State changed from AWAITING_RESPONSE to STREAMING_RESPONSE
[19:05:19][D][voice_assistant:522]: Desired state set to STREAMING_RESPONSE
[19:05:19][D][media_player:073]: 'Home Assistant Voice 0963c6' - Setting
[19:05:19][D][media_player:080]: Media URL: http://192.168.0.148:8888/play/47770edf614b60bb2b7a9cbfd4b6614a.flac
[19:05:19][D][media_player:086]: Announcement: yes
[19:05:19][D][speaker_media_player:420]: State changed to ANNOUNCING
[19:05:19][D][voice_assistant:641]: Event Type: 2
[19:05:19][D][voice_assistant:731]: Assist Pipeline ended
[19:05:19][D][ring_buffer:034][ann_read]: Created ring buffer with size 1000000
[19:05:19][D][power_supply:033]: Enabling power supply.
[19:05:19][D][speaker_media_player.pipeline:114]: Reading FLAC file type
[19:05:20][D][speaker_media_player.pipeline:124]: Decoded audio has 2 channels, 24000 Hz sample rate, and 16 bits per sample
[19:05:21][D][voice_assistant:515]: State changed from STREAMING_RESPONSE to IDLE
[19:05:21][D][voice_assistant:522]: Desired state set to IDLE

gyrga · February 22, 2025, 6:36pm

Let me know how it works for you or if you need any help setting it up!

felixschneider · April 15, 2025, 10:32am

Hey @gyrga,
I love the idea and had been thinking about this for the past few days, so I was happy to find that other people are already working on it!
In the README of your repository you state that the developers of Home Assistant. Would you mind sharing one or two GitHub repository paths where I could see the commits and state regarding this topic? Thanks