I ask the assistant (not yet by voice; I type the question):
“name all lights of the kitchen”
and the answer is:
"The lights in the kitchen are:
Licht_Keuken1
Licht_Keuken2
Licht_Keuken3
Licht_Keuken4
Licht_Gr_Kamer_Lavabo
Licht_Kl_Kamer_Lavabo
Licht_Salontafel
Licht_Zetel_1
Licht_Zetel_2
Licht_Zetel_3
Licht_Zetel_4
Licht_Zolder1
Licht_Zolder2_Schuin
Licht_Zoldertrap
Licht_Inkom_Binnen
Licht_Kelder
Licht_Vloertraphal
Licht_Wandlicht
Licht_Zetel_1
Licht_Zetel_2
Licht_Zetel_3
Licht_Zetel_4
Licht_Zolder1
Licht_Zolder2_Schuin
Licht_Zoldertrap
Licht_Inkom_Binnen
Licht_Kelder
Licht_Vloertraphal
Licht_Wandlicht
Licht_Zetel_1
Licht_Zetel_2
Licht_Zetel_3
Licht_Zetel_4
Licht_Zolder1
Licht_Zolder2_Schuin
Licht_Zoldertrap
Licht_Inkom_Binnen
Licht_Kelder
Licht_Vloertraphal
Licht_Wandlicht
Licht_Zetel_1
Licht_Zetel_2"
…at some point it starts to loop…
It keeps repeating, and after a few minutes:
“Timeout running pipeline”
Any idea or suggestion? What is wrong?
I'm running Ollama locally on a Mac mini.
Thank you for the feedback. I’m not sure whether it’s valuable; I don’t know enough to discuss that.
But your answer raises a new question:
The question I posed to the AI is an easy one: it needs to understand “lights in the kitchen” and then it should just run a query against Home Assistant. Five years ago, without AI, this would have been an easy task. Now, in 2026, we have AI, and we need to worry about the GPU, having enough memory, whether the model is big enough, and the type of the model…
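To illustrate how simple the non-AI version of this query is: the sketch below just filters an entity registry by area. The entity list and area mapping are made up for illustration; in a real Home Assistant template you'd use something like `area_entities()` instead.

```python
# Hypothetical entity registry: entity_id -> area (illustrative data only)
entities = {
    "light.licht_keuken1": "kitchen",
    "light.licht_keuken2": "kitchen",
    "light.licht_zetel_1": "living_room",
    "light.licht_kelder": "basement",
}

def lights_in_area(registry, area):
    """Return all light entities assigned to the given area, sorted by id."""
    return sorted(
        eid for eid, a in registry.items()
        if a == area and eid.startswith("light.")
    )

print(lights_in_area(entities, "kitchen"))
# -> ['light.licht_keuken1', 'light.licht_keuken2']
```

No GPU, no model, no context window: a dictionary lookup. The hard part the LLM adds is only the natural-language understanding on top.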
I do have moments where I like the AI era, but I remain surprised that we (humans) suddenly accept that a computer can be wrong and that a computer is allowed to act weird.
Can anybody else weigh in on whether the GPU, VRAM, or the model could cause my initial error?
Without you providing which model, how much VRAM, or which GPU - no, they cannot, and neither can I. All I can tell you is that your issue is a classic presentation of a context overrun.
Knowing what you can successfully load and how big your context can be is CRITICAL. So… kinda important.
For instance, to successfully load gpt-oss:20b you had BETTER have a card with 16 GB of VRAM, or it simply won’t work. And I can tell you that if you give Mistral a small context window and overrun it, it’ll do EXACTLY what you describe - in fact it won’t shut up unless you kill the Ollama runner.
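To make the overrun concrete, here's a back-of-the-envelope sketch. The ~4-characters-per-token rule, the entity count, and the 2048-token `num_ctx` are all illustrative assumptions, not measurements from the poster's setup:

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token for English-like text."""
    return max(1, len(text) // 4)

# Illustrative prompt: a system prompt plus one line per exposed entity.
system_prompt = "You are a voice assistant for Home Assistant...\n" * 10
entities = [f"light.licht_entity_{i}: Light {i} (kitchen)" for i in range(300)]
prompt = system_prompt + "\n".join(entities)

num_ctx = 2048  # a common small default; the real value depends on your config
needed = estimate_tokens(prompt)
print(f"prompt ~{needed} tokens vs num_ctx={num_ctx}")
print("overrun likely" if needed > num_ctx else "fits (for now)")
```

With a few hundred exposed entities the prompt alone can blow past a small window before the model has generated a single token, and truncated context is exactly when small models start babbling.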
Sorry, that answer won’t do if you’re trying to run an LLM. This is table stakes ^^^ you will NOT be successful without learning these numbers. They determine what model you can successfully load.
For some of the basics to drive HA, you will need a recent GPU with at LEAST 8 GB (preferably 12, 16, or more) of VRAM.
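A rough way to sanity-check those numbers yourself: a Q4-quantized model weighs in at very roughly 0.5-0.6 bytes per parameter, plus a couple of GB for the KV cache and runtime. The constants below are illustrative rules of thumb, not exact figures:

```python
def vram_needed_gb(params_billion: float, bytes_per_param: float = 0.6,
                   overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate for a Q4-ish quantized model (~0.5-0.6 bytes
    per parameter) plus a flat allowance for KV cache and runtime overhead.
    Illustrative only; real usage depends on quantization and context size."""
    return params_billion * bytes_per_param + overhead_gb

for name, size_b in [("qwen3:1.7b", 1.7), ("qwen3:8b", 8), ("gpt-oss:20b", 20)]:
    print(f"{name}: ~{vram_needed_gb(size_b):.1f} GB")
```

That puts a 20B model in the 12-16 GB ballpark, which is why the 16 GB card above is not optional, and why an 8 GB card is the realistic floor for an 8B model with a usable context.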
I personally do NOT accept these axioms and keep VERY strict reins on my LLMs. Friday (my primary home agent) has incredibly tight controls and audits, and understands the words, “I don’t know that, Boss.” So it doesn’t wander off into the weeds - at worst she stops or is a bit overprotective.
What you describe won’t happen in my setup - because the system wouldn’t be allowed to run that way… I have items where that’s a safety issue. Too many errors and I shut the inference engine down…
The quality of response will be directly proportional to the quality of the context given - BUT to even consider that you have to hit the basic bar with model capabilities and resources - yours isn’t behaving like it has those. Once you hit that bar, THEN we can talk about required context.
If your system doesn’t meet the requirements for a local LLM, maybe a cloud LLM or Speech-to-Phrase will do?
Some notes about models, sizes, and speeds ’n’ feeds…
Oh hey! Look at those Mistral:7b results… ^^^ Sound familiar?
The model is qwen3:1.7b
If I run it on my M2 Pro Mac with 16 GB, I get 77.73 tokens/sec when I ask it to tell a joke. Seems performant, no?
It’s that same Mac that is used for the voice assistant in my HA setup.
I come back to my original thought: I can’t believe that we need such heavy hardware, and that we accept that the result is a loop.
How is this evolution?
As I said, you accepted it; I do not. You simply have a malfunctioning setup (looping is an error result). I strongly suspect that if you review your Ollama logs you’ll see “layers pushed to CPU” and “context window exceeded” warnings and errors… Basically, you have something wrong in either your configuration or your prompting.
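If you want to scan the log programmatically, something like the sketch below works. The sample lines are ILLUSTRATIVE approximations of what Ollama's server log looks like; the exact wording varies by version, and where the log lives depends on how you run Ollama (e.g. `journalctl -u ollama` on Linux, `~/.ollama/logs/server.log` on macOS):

```python
# Sample log text standing in for a real Ollama server log (illustrative only).
sample_log = """\
time=... level=INFO msg="offloaded 10/29 layers to GPU"
time=... level=WARN msg="truncating input prompt" limit=2048 prompt=3270
time=... level=INFO msg="llama runner started"
"""

# Two classic warning signs: partial GPU offload and a truncated prompt.
signs = ["layers to GPU", "truncating input prompt"]

def find_warning_signs(log_text, patterns):
    """Return the log lines that contain any of the given warning patterns."""
    return [line for line in log_text.splitlines()
            if any(p in line for p in patterns)]

for line in find_warning_signs(sample_log, signs):
    print(line)
```

If you see layers spilling to CPU or the prompt being truncated, you've confirmed the context-overrun/underpowered-model diagnosis before changing anything else.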
Looking at the model first (this is your immediate problem):
You’ll need A MUCH bigger model. This is why you’re getting what you are: you’re forcing in a context it can’t handle (this is related to how many entities you expose).
That 1-2B model is struggling to know what’s up. It also has a tiny context window that you’ll overrun immediately. You’ll want something larger. If you’re a Qwen fan, you’ll need something like qwen3:8b to be successful.
1-2B-param models are OK at some lightweight summarization but simply won’t do well controlling your home.
You basically have the equivalent of a bass/ski boat being pulled by a trolling motor, and you’re asking why the boat squats under load and can’t pull a water skier. You need a bigger motor.
You’ll ultimately want an 8B model that is a ‘tool user’ and supports at LEAST an 8k context window. That’s something like qwen3:8b or gpt-oss:20b. If you want advanced results, you’ll need a ‘reasoning’ (thinking) capable model and very good tools.
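On the Ollama side, the context window is set per model with `num_ctx`, and the stock default is often far smaller than 8k. One way to raise it is a Modelfile; the model name and value here are just a sketch matching the recommendation above:

```
# Modelfile - raise the context window for a tool-capable model
FROM qwen3:8b
PARAMETER num_ctx 8192
```

Build it with `ollama create qwen3-8k -f Modelfile` and point your Home Assistant Ollama integration at `qwen3-8k` instead of the stock tag. (Some integrations also let you set the context length directly in their options; check yours first.)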
It’s surprisingly cool. Running BOTH that beast and the Intel A770 onboard, with as many inference jobs as I run on the desk in my office, it’s only actually at full power for a fraction of its lifetime…
It spools down to ~30 W standby very quickly. But I get near-instant responses from gpt-oss:20b, Llama3.2-vision, qwen3-vl, and qwen3:8b at high token rates with large contexts.
Real talk: it uses about half the power of my old desktop. I run this box plus the NUC that’s now my desktop PC for less than the combined power draw of the old ATX tower I had, and I don’t play FarCry (or whatever FPS is stressing boxes now; I’m old).