The assistant remains in a loop

I ask the assistant (not by voice yet; I type the question):
“name all lights of the kitchen”
and the answer is:
"The lights in the kitchen are:

  • Licht_Keuken1
  • Licht_Keuken2
  • Licht_Keuken3
  • Licht_Keuken4
  • Licht_Gr_Kamer_Lavabo
  • Licht_Kl_Kamer_Lavabo
  • Licht_Salontafel
  • Licht_Zetel_1
  • Licht_Zetel_2
  • Licht_Zetel_3
  • Licht_Zetel_4
  • Licht_Zolder1
  • Licht_Zolder2_Schuin
  • Licht_Zoldertrap
  • Licht_Inkom_Binnen
  • Licht_Kelder
  • Licht_Vloertraphal
  • Licht_Wandlicht
  • Licht_Zetel_1
  • Licht_Zetel_2
  [… the same block of names keeps repeating …]"
    … at a certain moment it starts to loop…
    It keeps repeating, and after a few minutes:
    “Timeout running pipeline”
    Any idea or suggestion? What is wrong?
    I’m running Ollama locally on a Mac mini.

You would maybe get more “focus” if you added “Ollama” to your topic header.

You haven’t said anything about which model, what GPU, or how much VRAM is assigned.

Not a large enough model, a poor model choice, or too small a context window can all do this.

Thank you for the feedback. I’m not sure whether it’s valuable; I don’t have enough knowledge to discuss it.
But your answer raises a new question:
The question I posed to the AI is an easy one: it needs to understand “lights in the kitchen” and then it should just run a query against Home Assistant. Five years ago, without AI, this would have been an easy task. Anno 2026 we have AI, and we need to worry about the GPU, having enough memory, whether the model is big enough, and the type of model…
I do have moments where I like the AI era, but I remain surprised that we (humans) suddenly accept that a computer can be wrong and is allowed to act weird.
Can anybody else add an opinion on whether the GPU, VRAM, or model can cause my initial error?

Without you providing which model, how much VRAM, or which GPU - no, they cannot, nor can I. All I can tell you is that your issue is a classic presentation of a context overrun.

Knowing what you can successfully load and how big your context can be is CRITICAL. So… kinda important.

For instance, to successfully load gpt-oss:20b you BETTER have a card with 16GB of VRAM, or it simply won’t work. And I can tell you that if you give Mistral a small context window and overrun it… it’ll do EXACTLY what you describe - in fact it won’t shut up unless you kill the Ollama runner.
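To see why the overrun happens so easily, here’s a rough back-of-envelope sketch in Python. The ~4-characters-per-token ratio and the 2048-token window are ballpark assumptions (2048 was a common Ollama default), and the entity names are made up for illustration:

```python
# Rough estimate: will a prompt listing N exposed entities overflow the
# context window? Assumes ~4 characters per token (a common rule of
# thumb, not an exact tokenizer count).

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token."""
    return len(text) // 4

def prompt_fits(system_prompt: str, entity_names: list[str], num_ctx: int) -> bool:
    """Check whether the assembled prompt leaves room for a reply.

    Heuristic: keep at least half the window free for the answer.
    """
    entity_block = "\n".join(entity_names)
    used = estimate_tokens(system_prompt) + estimate_tokens(entity_block)
    return used < num_ctx // 2

# Example: 60 exposed entities with long names, plus a big system prompt,
# against a 2048-token window (a common small default).
entities = [f"light.licht_keuken_{i:02d}" for i in range(60)]
system = "You are a voice assistant for Home Assistant. " * 100
print(prompt_fits(system, entities, 2048))  # prints: False (overrun)
```

Once the window overruns, the model loses the beginning of its own output and happily re-generates the same list - which looks exactly like the loop described above.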

Voice - does not necessarily mean LLM:

Use “local first” in your assistant setup - it tries intent and speech-to-phrase matching BEFORE sending anything to an LLM.
Speech-to-Phrase brings voice home - Voice chapter 9 - Home Assistant

But I can tell you for a FACT: that answer won’t do if you’re trying to run an LLM. This is table stakes ^^^ - you will NOT be successful without learning these numbers. They determine what model you can successfully load.

For some of the basics to drive HA, you will need a recent GPU with at LEAST 8 (preferably 12, 16, or more) GB of VRAM.

I personally do NOT accept these axioms and keep VERY strict reins on my LLMs. Friday (my primary home agent) has incredibly tight controls and audits, and understands the words “I don’t know that, Boss.” So she doesn’t wander off into the weeds - at worst she stops or is a bit overprotective.

What you describe won’t happen in my setup - because the system wouldn’t be allowed to run that way… I have items where that’s a safety issue. Too many errors and I shut the inference engine down…

The quality of response will be directly proportional to the quality of the context given - BUT to even consider that, you have to hit the basic bar of model capability and resources, and yours isn’t behaving like it has those. Once you hit that bar, THEN we can talk about required context.

If your system doesn’t meet the requirements for a local LLM, maybe a cloud LLM or Speech-to-Phrase will do?

Some notes about models and sizes and speeds ’n feeds…

Oh hey! Look at those Mistral:7b results… ^^^ Sound familiar?

1 Like

Sounds like a very “badly” written AI product… I asked Gemini to “Shut the f… up!”, and she did, with an “Ok!” :joy:

The model is qwen3:1.7b.
If I run it on my Mac M2 Pro with 16GB, I get 77.73 tokens/sec when I ask it to tell a joke. Seems performant, no?
It’s that same Mac that is used for the voice assistant in my HA setup.
I come back to my original thought: I can’t believe that we need such heavy hardware, and that we accept the fact that it results in a loop.
How is this evolution?

With 16GB… of RAM or VRAM?

LLMs require hardware. Sorry - physics. Tensor math… LOTS of tensor math…

This is why LLM vendors charge money.

You accepted it, like I said. I do not. You simply have a malfunctioning setup (looping is an error result). I strongly suspect that if you review your Ollama logs you’ll see ‘layers pushed to CPU’ and ‘context window exceeded’ warnings and errors… Basically, you have something wrong in either your configuration or your prompting.
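If you want to sift those logs quickly, something like this works. The exact warning strings vary between Ollama versions, so treat the substrings below as example patterns, not the literal messages:

```python
# Scan Ollama log output for two classic symptoms: layers offloaded off
# the GPU (not enough VRAM) and prompt truncation (context window too
# small). The patterns are illustrative; check your own log wording.

SYMPTOMS = {
    "offloaded": "possible VRAM shortfall (layers not fully on GPU)",
    "truncating input": "possible context window overrun",
}

def scan_log(log_text: str) -> list[str]:
    """Return a diagnosis for every log line matching a known symptom."""
    findings = []
    for line in log_text.splitlines():
        lower = line.lower()
        for pattern, diagnosis in SYMPTOMS.items():
            if pattern in lower:
                findings.append(f"{diagnosis}: {line.strip()}")
    return findings

sample = """\
llm_load_tensors: offloaded 20/33 layers to GPU
level=WARN msg="truncating input prompt" limit=2048
"""
for finding in scan_log(sample):
    print(finding)
```

Either symptom on its own is enough to produce the garbage-then-timeout behavior described above.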

Looking at the model first (this is your immediate problem):

You’ll need a MUCH bigger model. This is why you’re getting what you are: you’re forcing a context into it that it can’t handle (this is related to how many entities you expose).

That 1-2B model is struggling to know what’s up. It also has a tiny context window that you’ll overrun immediately. You’ll want something larger. If you’re a Qwen fan, you’ll need something like qwen3:8b to be successful.

1-2B param models are OK at some lightweight summarization but simply won’t do well controlling your home.

You basically have the equivalent of a bass/ski boat being pulled by a trolling motor, asking why the boat bogged down under load and can’t pull a water skier. You need a bigger motor.

You’ll ultimately want an 8B model that is a ‘tool user’ and supports at LEAST an 8k context window. That’s something like qwen3:8b or gpt-oss:20b. If you want advanced results, you’ll need a ‘reasoning’ (thinking) capable model and very good tools.
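For a rough sense of the numbers, here’s the back-of-envelope VRAM math. The ~4.5 bits per weight and 15% overhead are ballpark assumptions for a typical 4-bit quantization; real usage is higher because the KV cache for your context also lives in VRAM:

```python
# Back-of-envelope VRAM estimate for quantized model weights.
# Assumptions: ~4.5 effective bits per weight (4-bit quant plus scales)
# and ~15% runtime overhead. Excludes the KV cache, which grows with
# context length, so treat these as lower bounds.

def model_vram_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate GB of VRAM for the weights alone."""
    weights_gb = params_billion * bits_per_weight / 8  # billions of bytes ~= GB
    return weights_gb * 1.15

for name, params in [("qwen3:1.7b", 1.7), ("qwen3:8b", 8.0), ("gpt-oss:20b", 20.0)]:
    print(f"{name}: ~{model_vram_gb(params):.1f} GB for weights alone")
```

That’s roughly why an 8B model wants an 8GB-plus card and gpt-oss:20b wants 16GB - before you’ve budgeted a single token of context.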

This 5070 is what Friday runs her primary local inference on, and it’s STILL not good enough for 100% local for me. Probably one more card generation.

Is that for real? You are a nerd :joy: :+1:

1 Like

110% true. Also… I thought this was well established by now? :rofl:

1 Like

I assume you don’t need any heating elements in that room :slight_smile:

It’s surprisingly cool. Running BOTH that beast and the onboard Intel a770xe, with as many inference jobs as I run on the desk in my office, it’s only actually at full power for a fraction of its lifetime…

It spools down to ~30W standby very quickly. But I get near-instant responses from oss:20b, Llama3.2-vision, qwen3-vl, and qwen3:8b at high token rates with large context.

Real talk: it’s about twice as efficient (half the power use, however that’s said) as my old desktop. I run this one plus the NUC that’s now my desktop PC for less than the combined power use of the old ATX tower I had - and I don’t play FarCry (or whatever FPS is stressing boxes now; I’m old).

1 Like

I’ll never jump into AI/GPT… well, as I feel/think now, I’m old too - we are supposed to retire :grin:

1 Like