Please help! Voice Assistant getting progressively slower each request

I’ve been tearing my hair out over this and could really use some advice if possible.

I have Voice Assistant: Preview Edition set up with the Ollama integration. My server only has an iGPU, so I’m using a tiny model (llama3.2:1b). I have ensured the model is kept pre-loaded (so it isn’t cold-starting), prefers handling commands locally, and uses the minimum context size possible (2048, with 0 message history). I have also ensured the model is not trying to control Assist.

On a fresh reload of wyoming/VA integrations, a simple “What is your name?” command takes 1 second.

I’ll then ask it 10 commands, asking it to turn on/off various lights. When a light doesn’t exist, and the command is sent to the LLM, I get strange JSON responses back, which are voiced back to me. For example:

{"name": "HassLightSet", "parameters": {"brightness": 50, "name": "Completely fake lights"}}

(Note: this only happens when “prefer handling commands locally” is enabled; otherwise, I get sane responses back.)

After that, I’ll try to ask it “What is your name?” again. This time, it takes 6 seconds.

If I don’t reload my integrations again, this delay will steadily increase. It’s gotten to 20 seconds before, at which point I felt like breaking it.

Other things I’ve ruled out:

  • Ollama being slow (if I copy paste my prompt directly to Ollama, the response is lightning fast)
  • Slowness with STT/TTS (Debug runs without the preview edition are lightning fast)

Here is a debug log of the slow run:

stage: done
run:
  pipeline: 7fP4q0xRkM8Z2nLJ9aBvQeYt
  language: en
  conversation_id: 8QZpF6k1YJrW0D9mS2HnA5tL
  satellite_id: assist_satellite.voice_unit_7cD1A9xQeM
  tts_output:
    token: R9QwXJ6A1bZcL4TnY8P2Vd.flac
    url: /api/tts_proxy/R9QwXJ6A1bZcL4TnY8P2Vd.flac
    mime_type: audio/flac
    stream_response: true
events:
  - type: run-start
    data:
      pipeline: 7fP4q0xRkM8Z2nLJ9aBvQeYt
      language: en
      conversation_id: 8QZpF6k1YJrW0D9mS2HnA5tL
      satellite_id: assist_satellite.voice_unit_7cD1A9xQeM
      tts_output:
        token: R9QwXJ6A1bZcL4TnY8P2Vd.flac
        url: /api/tts_proxy/R9QwXJ6A1bZcL4TnY8P2Vd.flac
        mime_type: audio/flac
        stream_response: true
    timestamp: "2026-02-12T18:42:00.092965+00:00"
  - type: stt-start
    data:
      engine: stt.faster_whisper
      metadata:
        language: en
        format: wav
        codec: pcm
        bit_rate: 16
        sample_rate: 16000
        channel: 1
    timestamp: "2026-02-12T18:42:00.093165+00:00"
  - type: stt-vad-start
    data:
      timestamp: 1050
    timestamp: "2026-02-12T18:42:01.159069+00:00"
  - type: stt-vad-end
    data:
      timestamp: 2690
    timestamp: "2026-02-12T18:42:02.782167+00:00"
  - type: stt-end
    data:
      stt_output:
        text: " What is your name?"
    timestamp: "2026-02-12T18:42:02.963807+00:00"
  - type: intent-start
    data:
      engine: conversation.ollama_conversation
      language: en
      intent_input: " What is your name?"
      conversation_id: 8QZpF6k1YJrW0D9mS2HnA5tL
      device_id: D4A9n8k2WZxPq7m0FJ6L1S5C
      satellite_id: assist_satellite.voice_unit_7cD1A9xQeM
      prefer_local_intents: true
    timestamp: "2026-02-12T18:42:02.964065+00:00"
  - type: intent-progress
    data:
      chat_log_delta:
        role: assistant
        content: I
    timestamp: "2026-02-12T18:42:08.388485+00:00"
  - type: intent-progress
    data:
      chat_log_delta:
        content: " am"
    timestamp: "2026-02-12T18:42:08.421001+00:00"
  - type: intent-progress
    data:
      chat_log_delta:
        content: " Homer"
    timestamp: "2026-02-12T18:42:08.449793+00:00"
  - type: intent-progress
    data:
      chat_log_delta:
        content: ","
    timestamp: "2026-02-12T18:42:08.483069+00:00"
  - type: intent-progress
    data:
      chat_log_delta:
        content: " a"
    timestamp: "2026-02-12T18:42:08.512579+00:00"
  - type: intent-progress
    data:
      chat_log_delta:
        content: " Home"
    timestamp: "2026-02-12T18:42:08.542226+00:00"
  - type: intent-progress
    data:
      chat_log_delta:
        content: " Assistant"
    timestamp: "2026-02-12T18:42:08.573931+00:00"
  - type: intent-progress
    data:
      chat_log_delta:
        content: "."
    timestamp: "2026-02-12T18:42:08.602148+00:00"
  - type: intent-progress
    data:
      chat_log_delta:
        content: ""
    timestamp: "2026-02-12T18:42:08.631625+00:00"
  - type: intent-end
    data:
      processed_locally: false
      intent_output:
        response:
          speech:
            plain:
              speech: I am Homer, a Home Assistant.
              extra_data: null
          card: {}
          language: en
          response_type: action_done
          data:
            targets: []
            success: []
            failed: []
        conversation_id: 8QZpF6k1YJrW0D9mS2HnA5tL
        continue_conversation: false
    timestamp: "2026-02-12T18:42:08.632073+00:00"
  - type: tts-start
    data:
      engine: tts.piper
      language: en_GB
      voice: en_GB-jenny_dioco-medium
      tts_input: I am Homer, a Home Assistant.
      acknowledge_override: false
    timestamp: "2026-02-12T18:42:08.632243+00:00"
  - type: tts-end
    data:
      tts_output:
        media_id: media-source://tts/-stream-/R9QwXJ6A1bZcL4TnY8P2Vd.flac
        token: R9QwXJ6A1bZcL4TnY8P2Vd.flac
        url: /api/tts_proxy/R9QwXJ6A1bZcL4TnY8P2Vd.flac
        mime_type: audio/flac
    timestamp: "2026-02-12T18:42:08.632690+00:00"
  - type: run-end
    data: null
    timestamp: "2026-02-12T18:42:08.632908+00:00"
started: 2026-02-12T18:42:00.092Z
finished: 2026-02-12T18:42:08.632Z

The big delay is in receiving the first token back from Ollama. But since I’ve already proven Ollama itself is fast, and that the first command is also lightning fast, I think HA is filling in additional context and/or a history of messages, even though I’ve told it not to.
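You can see where the time goes by diffing the timestamps in the debug log above. A quick sketch (the timestamps are copied from the log; the breakdown is just `intent-start` to the first `intent-progress` token, then to `intent-end`):

```python
from datetime import datetime

# Timestamps copied verbatim from the debug log above.
intent_start = datetime.fromisoformat("2026-02-12T18:42:02.964065+00:00")
first_token = datetime.fromisoformat("2026-02-12T18:42:08.388485+00:00")
intent_end = datetime.fromisoformat("2026-02-12T18:42:08.632073+00:00")

# Time to first token vs. time spent streaming the rest of the reply.
ttft = (first_token - intent_start).total_seconds()
streaming = (intent_end - first_token).total_seconds()
print(f"time to first token: {ttft:.2f}s")   # ~5.42s
print(f"rest of the stream:  {streaming:.2f}s")  # ~0.24s
```

So essentially all of the 6 seconds is spent before the first token arrives, i.e. prompt processing, not generation.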

Any help would truly truly be appreciated. For now, I’m giving up on this altogether.


You haven’t ruled that out unless you’re guaranteeing the same conditions. There’s nothing wrong with an iGPU if you give it enough on the VRAM side.

What does the Ollama log say for response speed when viewing a session coming from HA (not pasted in directly)? You’ll also see exactly how much context is coming in and how many layers are going where.

What does the voice log say? Dang, it was hidden behind the text window. I’m reading it. 🙂

Edit: that was ALL LLM time (relatively). I need to see the Ollama log. I’m almost certain you’re building context in each session that you don’t know you’re adding, and the slowness is because the context fills and overflows… quickly at ctx:2048. (The strange JSON responses, BTW, are a symptom of context overflow.)

What do you mean by this? Is “no control” mode selected?

If so, then:

  • If “no control” is used, the prompt still contains a dynamic timestamp, so the KV cache goes practically unused.
  • Every control request leaves data in the chat log (history). Each request to turn on a light can add 300-500 tokens, which get added on top of your initial prompt at each new step.
  • Small models aren’t very smart and will respond with nonsense.

I think now you can understand the reason for your delays and strange responses.


Yes - I’ve turned off “Control Home Assistant” via the “Assist” checkbox.

I realise I should probably provide some additional context on what I’m trying to achieve.

I want HA to locally process any commands that match an alias and/or automation, without involving the LLM.

If a prompt cannot be handled locally (e.g. “What’s your name”), then it will pass it through to the LLM with minimal (or no) additional context. Essentially, I just want my HA voice to either a) execute useful commands, or b) be a dumb chat bot when it can’t.

It was my understanding that enabling this setting introduces an additional parsing step in the pipeline where HA uses the model to recognise intent. For my model, this introduces significant delays whenever HA can’t interpret the command locally.

Based on what you’re saying, my choices are to a) enable this setting and have permanent, significant delays whenever HA uses the model to recognise intent, or b) disable this setting and have my prompts eventually become useless each time the LLM is called?

Please excuse me if I’ve gotten anything wrong.

Thank you for your reply.

There’s nothing particularly useful in the Ollama logs, just details of how long each API call took (maybe because I’m running it in a container?).
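For what it’s worth, even without verbose server logs, the final chunk of a streamed Ollama `/api/chat` response carries timing counters, and `prompt_eval_count` is exactly “how much context came in”. A small sketch of reading them (the sample values below are invented, not from this run):

```python
# Timing fields from the final chunk of a streamed Ollama /api/chat response.
# All *_duration fields are nanoseconds; the sample values here are made up.
final_chunk = {
    "prompt_eval_count": 1900,           # tokens of context sent in (prompt + history)
    "prompt_eval_duration": 4_800_000_000,
    "eval_count": 9,                     # tokens generated
    "eval_duration": 250_000_000,
}

prompt_tps = final_chunk["prompt_eval_count"] / (final_chunk["prompt_eval_duration"] / 1e9)
gen_tps = final_chunk["eval_count"] / (final_chunk["eval_duration"] / 1e9)
print(f"context in: {final_chunk['prompt_eval_count']} tokens")
print(f"prompt eval: {prompt_tps:.0f} tok/s, generation: {gen_tps:.0f} tok/s")
```

If `prompt_eval_count` climbs with every request, that would confirm the history is growing despite the 0-message-history setting.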

I’ll reproduce some debug logs when I get some time (probably Sunday).

Pretty much.

Honestly, the problem is the model and the horsepower… it’s not really fit for the task. You really need something that can cleanly load 4k of context just to start, and 8k for useful work…

I run a 16 GB 5070 Ti as my primary and I still can’t run 100% locally (granted, my context is almost an order of magnitude larger, but… same point). You’d have zero problem running my card. But that’s $…

This problem is purely a function of hardware for you. Sorry. I know it’s not what you want to hear…

c) You can use my integration; I implement various hacks there that I need. One of them is timestamp freezing. This won’t save you from nonsensical answers, but it will speed up processing.
I don’t know if this integration works with the Ollama API; I haven’t used it in a while. But you can switch to llama.cpp (or Lemonade).
If you decide to use it, I recommend installing the component manually from the specified branch.

d) Manually remove the time template from the prompt inside the HA container.
