Home Assistant cloud STT latency

Hello! I have started experimenting with speech-to-text and Assist with a wake word. I added the local Assist Microphone with a Jabra 410 speakerphone, which seems to be very good for the purpose. Home Assistant Cloud recognizes speech very well, although its Finnish recognition is not as strong. Unfortunately, one problem in practice prevents me from using STT: the response time is usually about 15 seconds, which is not usable. Could this be caused by something other than the Nabu Casa cloud? In some cases the delay is only about 2 seconds, but that happens only occasionally. Is there something I can do locally to end the speech recording sooner? Maybe that is the issue.

I think the voice activity detection for some reason waits until the timeout before doing anything. How can I change this behaviour?
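A minimal sketch of how a fixed 15-second wait can arise. A typical end-of-speech segmenter waits for a short run of speech before a command "starts", then for a run of silence to end it, with a hard timeout as a fallback. If the VAD never reports the start (or never reports silence afterwards), recording only ends at the timeout. The class and all parameter values here (30 ms frames, 0.3 s of speech, 0.7 s of silence, 15 s timeout) are illustrative assumptions, not Home Assistant's actual implementation, and each frame is assumed to arrive already labelled speech/non-speech by some VAD.

```python
from dataclasses import dataclass, field


@dataclass
class CommandSegmenter:
    """Hypothetical end-of-speech segmenter; all defaults are illustrative."""

    frame_ms: int = 30          # duration of one audio frame
    speech_ms: int = 300        # consecutive speech needed before a command "starts"
    silence_ms: int = 700       # consecutive silence after speech that ends the command
    timeout_ms: int = 15_000    # hard cap on recording length
    elapsed_ms: int = field(default=0, init=False)
    speech_run_ms: int = field(default=0, init=False)
    silence_run_ms: int = field(default=0, init=False)
    in_command: bool = field(default=False, init=False)

    def process(self, is_speech: bool) -> bool:
        """Feed one frame's VAD decision; return False when recording should stop."""
        self.elapsed_ms += self.frame_ms
        if self.elapsed_ms >= self.timeout_ms:
            return False  # timed out without detecting the end of speech
        if not self.in_command:
            # Count consecutive speech frames; any non-speech frame resets the run.
            self.speech_run_ms = self.speech_run_ms + self.frame_ms if is_speech else 0
            if self.speech_run_ms >= self.speech_ms:
                self.in_command = True
        elif is_speech:
            self.silence_run_ms = 0  # still talking
        else:
            self.silence_run_ms += self.frame_ms
            if self.silence_run_ms >= self.silence_ms:
                return False  # end of speech detected
        return True


def ms_until_stop(decisions: list[bool]) -> int:
    """Run a stream of per-frame VAD decisions and report when recording stops."""
    seg = CommandSegmenter()
    for is_speech in decisions:
        if not seg.process(is_speech):
            break
    return seg.elapsed_ms
```

For example, `ms_until_stop([True] * 60 + [False] * 100)` stops after 2520 ms, while a stream the VAD never labels as speech (`[False] * 600`) runs all the way to the 15000 ms timeout.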

I’ve noted this issue and raised it on GitHub, although not on homeassistant/core:
Poor end of speech detection · Issue #20 · rhasspy/wyoming-satellite · GitHub.
Can you describe what hardware, config, pipeline and wake word you are using?

I’m not sure which part of the pipeline is responsible for end-of-speech detection; it could be the satellite or the STT engine. That should determine where we raise the issue. Sorry for the tag, @synesthesiam, but perhaps you could suggest where best to raise this issue?

In the meantime, playing around with the audio enhancements helped me.
Mine were exactly 15 seconds, so that seems to be a defined limit or timeout.

My hardware is an Intel i3 running Home Assistant Supervised. There should be enough power for the wake word engine. I always keep Home Assistant up to date. Nothing else heavy runs on the server apart from normal Zigbee events. All processing is local, with no satellites. Here is the pipeline:

Language: Finnish
STT: home assistant cloud
TTS: home assistant cloud
Wake word engine: openwakeword
Audio: Assist microphone

I would very much like to find a way to reduce this 15-second latency. I have a fiber connection with 10 ms latency to the nearest Cloudflare node, so the network is not the issue.

I now have a clue about the 15-second behaviour. When I speak from 5 cm away, the response is immediate. When I am 3 meters away, there is always the 15-second wait, even when the room is completely quiet. In either case the sentences are recognized correctly.
I am using the Jabra 410 "hockey puck". It has a very sensitive microphone that can easily cover a large room. I think the recording threshold must be set correctly, but I have no idea where to look for the setting. Is it in the local Assist Microphone or in the wake word detection?
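One possible explanation for the distance effect, sketched under a free-field assumption (amplitude falling as 1/r, no room reverberation, no automatic gain control): moving from 5 cm to 3 m attenuates the signal by 20·log10(60) ≈ 35.6 dB, which can push speech below a fixed detection threshold. If the VAD then never sees the command start, recording runs until the timeout. The reference amplitude, the -40 dBFS threshold and the energy-based detector are hypothetical; the detector Home Assistant actually uses may work differently.

```python
import math

REF_DISTANCE_M = 0.05           # hypothetical: talker 5 cm from the microphone
REF_AMPLITUDE = 0.5             # hypothetical peak amplitude at that distance (full scale = 1.0)
SPEECH_THRESHOLD_DBFS = -40.0   # hypothetical energy threshold for "speech"


def rms_dbfs(samples: list[float]) -> float:
    """RMS level of a frame relative to full scale (1.0), in dB."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms)


def tone(amplitude: float, freq_hz: float = 200.0, rate: int = 16_000, n: int = 480) -> list[float]:
    """One 30 ms frame of a pure tone, standing in for voiced speech."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


def frame_at(distance_m: float) -> list[float]:
    # Free-field assumption: amplitude scales as 1/r (about 6 dB per doubling of distance).
    return tone(REF_AMPLITUDE * REF_DISTANCE_M / distance_m)


for d in (0.05, 3.0):
    level = rms_dbfs(frame_at(d))
    print(f"{d:>4} m: {level:6.1f} dBFS, speech={level > SPEECH_THRESHOLD_DBFS}")
```

With these made-up numbers the near frame sits around -9 dBFS and the far one around -44.6 dBFS, just under the threshold, which would match "immediate at 5 cm, timeout at 3 m" without any network involvement.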

Have you found any solution to this issue?
I'm facing the same problem.

Try increasing the noise suppression level. The voice activity detection inside HA isn’t very good at the moment, so any kind of ambient noise (or a lower-quality microphone) will cause it to think you’re still speaking.
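A crude illustration of the noise point above, assuming a simple energy-threshold VAD: a steady ambient noise floor that sits above the threshold keeps every frame labelled as speech, so no trailing silence is ever seen and only the timeout ends the recording. Subtracting an estimate of the noise floor (a very rough stand-in for real noise suppression, not what Home Assistant actually does) lets the noise-only frames register as silence. All power levels and the threshold are made-up values.

```python
import math

THRESHOLD_DB = -45.0   # hypothetical VAD energy threshold
NOISE_POWER = 1e-4     # hypothetical steady ambient noise (about -40 dB)
SPEECH_POWER = 1e-3    # hypothetical speech power (about -30 dB)


def to_db(power: float) -> float:
    """Convert a mean-square power to dB, clamping to avoid log10(0)."""
    return 10 * math.log10(max(power, 1e-12))


def classify(powers: list[float], noise_estimate: float = 0.0) -> list[bool]:
    """Energy-threshold VAD; optionally subtract an estimated noise floor first."""
    return [to_db(max(p - noise_estimate, 0.0)) > THRESHOLD_DB for p in powers]


def trailing_silence(flags: list[bool]) -> int:
    """Count non-speech frames at the end, i.e. how much silence the VAD saw."""
    count = 0
    for is_speech in reversed(flags):
        if is_speech:
            break
        count += 1
    return count


# 20 frames of speech over the noise, then 30 frames where only the noise remains.
frames = [SPEECH_POWER + NOISE_POWER] * 20 + [NOISE_POWER] * 30

print("no suppression:  ", trailing_silence(classify(frames)))
print("with suppression:", trailing_silence(classify(frames, noise_estimate=NOISE_POWER)))
```

Here the unsuppressed stream reports 0 trailing silent frames, while the suppressed one reports all 30, so an end-of-speech detector would only fire in the second case.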

No, I did not find one. I tried various settings and none of them had any effect, so I decided to stop wasting time. Besides, my family hated speech recognition.

Maybe with a stereo microphone (2x INMP441) we could achieve better voice recognition quality.
Please vote for this feature if you are also interested: Add support for Stereo Microphones in addition to select Left/Right · Issue #2562 · esphome/feature-requests · GitHub
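A rough sketch of why a second microphone can help, assuming both channels receive the talker time-aligned (as for a talker directly in front of the pair) and their noise is independent: averaging the channels keeps the speech while halving the noise power, roughly a 3 dB SNR gain. Real stereo or beamforming setups must first align the channels for the talker's direction, and the feature request does not say how the channels would be combined; the tone and noise levels here are synthetic.

```python
import math
import random


def snr_db(clean: list[float], noisy: list[float]) -> float:
    """SNR of a noisy signal against the known clean signal, in dB."""
    noise = [y - s for y, s in zip(noisy, clean)]
    p_sig = sum(s * s for s in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_sig / p_noise)


rng = random.Random(0)   # fixed seed so the run is repeatable
N = 48_000               # 3 s at 16 kHz
speech = [0.1 * math.sin(2 * math.pi * 200 * i / 16_000) for i in range(N)]

# Same speech on both channels, independent noise on each (the key assumption).
left = [s + rng.gauss(0, 0.05) for s in speech]
right = [s + rng.gauss(0, 0.05) for s in speech]
mixed = [(a + b) / 2 for a, b in zip(left, right)]

print(f"one mic:           {snr_db(speech, left):.1f} dB")
print(f"two mics averaged: {snr_db(speech, mixed):.1f} dB")
```

If the noise on the two channels is correlated (for example one loud noise source in the room), simple averaging gains much less, which is why real arrays use direction-aware processing.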