Delay in voice assistant output

For my Voice PE I set it up using whisper-fast, OpenAI gpt-4o-mini, ElevenLabs.

It’s quite slow.

The logs show me, as an example, for asking about a red wine pairing:

STT: 2.83s
NLP: 4.52s
TTS: 0.01s

However, I also measured with a stopwatch from the moment the voice-recognition LED stopped until the reply actually started playing: 14.4s

Where does this 2x gap come from? What can I do about the ~7s that don’t show up in the log?

Thanks!

You don’t say where you run those, but making some educated guesses…

While I don’t know where you’re running Whisper, that STT time tells me it’s probably running locally on CPU. Hardware acceleration can pull that sub-second.

OpenAI: cloud. ElevenLabs: cloud. I’m not sure whether the way you’ve put them together supports streaming. As it stands you have two round trips to the cloud. If you have to pick one to move local: while ElevenLabs voices sound great, ElevenLabs slows my voice pipeline to a crawl.

It’s probably a combination of model thinking time, cloud lag, and not streaming the responses, but you’ll have to experiment to figure it out.
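To see why streaming matters here, this is a toy simulation (all latencies made up, not measured from any real pipeline): a buffered pipeline can’t start TTS until the whole LLM reply exists, while a streaming one can hand the first sentence-sized chunk to TTS right away. The perceived latency is time-to-first-audio, not total processing time.

```python
import asyncio
import time

async def llm_tokens():
    # Simulate an LLM emitting 10 tokens, 0.05 s apart (hypothetical numbers).
    for i in range(10):
        await asyncio.sleep(0.05)
        yield f"token{i} "

async def non_streaming():
    # Buffer the entire LLM reply, then do one TTS round trip.
    start = time.monotonic()
    text = "".join([tok async for tok in llm_tokens()])
    await asyncio.sleep(0.2)  # one TTS round trip (hypothetical)
    return time.monotonic() - start  # first audio only after EVERYTHING

async def streaming():
    # Send the first chunk to TTS as soon as it exists; keep consuming
    # the rest of the tokens in the background.
    start = time.monotonic()
    first_audio = None
    async for tok in llm_tokens():
        if first_audio is None:
            await asyncio.sleep(0.2)  # TTS for the first chunk
            first_audio = time.monotonic() - start
    return first_audio

async def main():
    return await non_streaming(), await streaming()

buffered, streamed = asyncio.run(main())
print(f"time to first audio - buffered: {buffered:.2f}s, streamed: {streamed:.2f}s")
```

With these fake numbers the buffered path waits for all ten tokens plus TTS before any audio, while the streaming path pays only one token interval plus one TTS call; the same shape explains why a non-streaming cloud chain feels far slower than its per-stage log times suggest.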

Go to Settings > Voice assistants, pick your assistant, hit the three-dot menu to the right, and pick Debug to get details for every step.

Hi Nathan, yes, that makes sense, but I need the quality; otherwise it’s useless and I might as well pull out my iPhone and open ChatGPT. My question is: why does my stopwatch show 2x the total the logs report? That makes it hard to debug, since I’m losing ~7s somewhere invisible to the logs. Thx, David

That’s cloud lag…

There are other options than eleven labs.

You’re round-tripping the conversation TWICE.

That responsiveness costs money. To eliminate the lag you need:

  1. to go local, and
  2. a beefy NPU/GPU on the gear doing the work.

There’s no way around it.

Also, when I run Friday with a cloud agent I’ve never had less than a 5-second turnaround (using OpenAI speech models that I know stream). To break that I’ll need to go 100% local.

Makes sense! I thought that lag would appear in the logs. ChatGPT told me there’s a reported lag on the Voice PE where it waits to start playing the TTS, but I guess that’s not it.

I have the HA Green, so that’s all I’ve got in terms of hardware. I’ll play around with it though, thanks :slight_smile:

And you verified that with what third-party source?

Read Friday’s Party if you’re interested in what it takes for local inference and what you can do.

With a Green and no acceleration, you can expect to be successful with Speech-to-Phrase for home control, not for casual conversation.

The ElevenLabs integration does not support streaming. For streaming, the async_stream_tts_audio method would have to be implemented.
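For context, a streaming TTS method is essentially an async generator that yields audio chunks as they arrive, instead of returning one buffered blob at the end. This is a simplified standalone sketch, not the actual Home Assistant TTS entity API; fake_tts_chunks is a made-up stand-in for a provider that streams bytes over HTTP:

```python
import asyncio

async def fake_tts_chunks(text: str):
    # Hypothetical TTS provider that produces audio in chunks.
    # A real provider would yield bytes from a streaming HTTP response.
    for word in text.split():
        await asyncio.sleep(0.01)
        yield word.encode() + b" "

async def stream_tts_audio(text: str):
    # Streaming shape: hand each chunk to the caller immediately,
    # so playback can start before synthesis finishes.
    async for chunk in fake_tts_chunks(text):
        yield chunk

async def buffered_tts_audio(text: str) -> bytes:
    # Non-streaming shape: collect everything, then return one blob.
    return b"".join([c async for c in fake_tts_chunks(text)])

async def main():
    first = None
    chunks = []
    async for chunk in stream_tts_audio("hello streaming world"):
        if first is None:
            first = chunk  # playback could begin right here
        chunks.append(chunk)
    blob = await buffered_tts_audio("hello streaming world")
    assert b"".join(chunks) == blob  # same audio, different delivery
    return first, blob

first, blob = asyncio.run(main())
```

Both shapes produce identical audio; the only difference is that the streaming version lets the speaker start playing after the first chunk rather than after the whole synthesis, which is exactly the gap the stopwatch sees but the per-stage log times don’t.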
