Ollama slow when "Control Home Assistant" is enabled

I have set up a local Ollama instance as a conversation agent. When I ask a question, or give a command that HA itself can't handle, it sends the request to Ollama, but it takes a very long time to get an answer. Looking at the logs, HA sends a POST to /api/chat, and that call can take anywhere between 2 and 7 minutes with the CPU running at full capacity (all cores). If I disable the control option, I get very fast responses.

Any idea what's going on here? Is it because the POST action sends all the Home Assistant information and Ollama is just taking a long time to process all that data?
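One thing I plan to check is how the time splits between prompt (context) evaluation and answer generation, since Ollama reports both in its /api/chat response. A rough sketch of that check, assuming a stock Ollama install on the default port; the model name is just what I happen to be running:

```python
# Rough check: how much of the delay is prompt (context) evaluation vs generation.
# Assumes a stock Ollama install on the default port; adjust host/model as needed.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Turn on the kitchen light"}],
        "stream": False,  # the final JSON then includes the timing stats
    },
    timeout=600,
)
stats = resp.json()
# Durations are reported in nanoseconds.
print("prompt tokens:", stats.get("prompt_eval_count"),
      "prompt eval:", stats.get("prompt_eval_duration", 0) / 1e9, "s")
print("output tokens:", stats.get("eval_count"),
      "generation:", stats.get("eval_duration", 0) / 1e9, "s")
```

If prompt_eval_duration dominates when control is enabled, that would point at the big system prompt being the culprit rather than the answer generation itself.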

I'm trying to figure out the best way to improve that. HA is running on a Raspberry Pi, but Ollama is running on a 6-core i7 PC without a GPU. I want to figure out whether a GPU is the missing piece before considering spending any money on it.

Any help would be appreciated.

For running an LLM, the number of CPU cores or the amount of RAM isn't that important. What you really need is VRAM: a GPU with at least 8 GB of VRAM. The difference is dramatic.

You are absolutely right, the CPU is not good enough. I figured out a way to run Ollama on a remote machine that has an NVIDIA GPU, so now responses take less than 1.5 seconds. Thanks!

I think you guys are oversimplifying the solution here. Sure, a GPU is much better, but should there really be this huge a difference between HA control enabled and disabled on a CPU?

I'm having the same issue. I exposed only 1 entity, no aliases, just to test, and it is still extremely slow: over 30 s. Without HA control, it takes maybe 1-3 s to respond to simple prompts. So the problem isn't the entities themselves. Sure, they must add some system prompt for this, but can't a CPU handle it at all?

Also, does anyone know how to set the "num_threads" of the model through the Ollama integration? The model they suggested, "llama3.1:8b", has it set to 6, so in my case it never utilizes more than 40% of the CPU. In OpenWebUI I can change that, and when I set it to all cores, the CPU goes to 100% and I do seem to get better results.
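For what it's worth, you can at least verify the effect outside of Home Assistant by passing the option on a direct /api/chat call (the API spells it num_thread). A quick test sketch, assuming a stock Ollama install on the default port; the model name and thread count are just examples for my box:

```python
# Quick test of num_thread outside Home Assistant: pass it per-request via
# "options" and compare generation speed. Model/thread count are examples.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "What is Home Assistant?"}],
        "stream": False,
        "options": {"num_thread": 12},  # try your physical core count
    },
    timeout=600,
)
stats = resp.json()
# eval_duration is in nanoseconds.
print("generation speed:",
      stats["eval_count"] / (stats["eval_duration"] / 1e9), "tokens/s")
```

If the tokens/s clearly improves with more threads, then getting that setting into the integration (or into the model itself) seems worth pursuing.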

Simply, yes. Because context.

Suddenly, when HA control is on, there are a ton of objects the vector DB needs to track, and each option relates to every other option, so it blows up exponentially.

So flipping that switch in my instance instantly throws about 2,500-3,000 entities and states at the LLM, plus how they all relate.

That will munch VRAM for lunch.

But there aren't a ton of objects, that is my point. I only exposed 1 entity to Assist. OK, it might be sending areas or some other metadata, but can it possibly be that much? I'll try sniffing what is actually sent to Ollama.
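My plan is to drop a tiny logging proxy between HA and Ollama: point the integration at the proxy's port and dump whatever lands on /api/chat. A rough sketch; it assumes Ollama on its default port 11434, and the proxy port (11435) is arbitrary:

```python
# Rough logging proxy to see exactly what Home Assistant sends to Ollama.
# Point the HA Ollama integration at http://<this-host>:11435 instead of the
# real server. OLLAMA_URL and the ports are assumptions for my setup.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

OLLAMA_URL = "http://127.0.0.1:11434"  # the real Ollama instance

class LoggingProxy(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)

        # Log the messages HA actually sends (system prompt, entity info, etc.).
        try:
            payload = json.loads(body)
            for msg in payload.get("messages", []):
                print(f"--- {msg.get('role')} ({len(msg.get('content', ''))} chars) ---")
                print(msg.get("content", ""))
        except json.JSONDecodeError:
            print(body[:2000])

        # Forward to the real server and relay the reply.
        # (Streaming is collapsed into one body, which is fine for inspection.)
        upstream = urlopen(Request(OLLAMA_URL + self.path, data=body,
                                   headers={"Content-Type": "application/json"}))
        self._relay(upstream)

    def do_GET(self):
        # Other calls the integration makes (e.g. listing models) pass through.
        self._relay(urlopen(OLLAMA_URL + self.path))

    def _relay(self, upstream):
        reply = upstream.read()
        self.send_response(upstream.status)
        self.send_header("Content-Type",
                         upstream.headers.get("Content-Type", "application/json"))
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

HTTPServer(("0.0.0.0", 11435), LoggingProxy).serve_forever()
```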

If this really is true, then there should be a big, fat, bold warning on the Ollama integration that it can't be used without a GPU at all.

That should go without saying. Without a GPU it's basically a slug. There's zero way I'd try to push any control through it at all. I've got a successful Intel IPEX-ARC Ollama setup running on an Alchemist-based NUC… and I still won't let it drive my home. (Not enough chooch :steam_locomotive:)

The smallest models I've considered are the 7-8B parameter fully quantized models like llama3.2:8b (sp?) or the latest phi4 with tool-use capability. For those you need, at a MINIMUM, 12 GB of RAM dedicated to the GPU/NPU/CPU-acting-as-NPU, whatever… to start. Then you go up from there for performance…

Nothing goes without saying. They are recommending models in the docs without mentioning the hardware at all. I'm running the llama3.1:8b model (as they suggested) on the CPU. As a chatbot it's usable. I don't really see why it would go without saying that exposing 1 entity to it will degrade performance 10-20x or more. One single entity… A footnote in the docs wouldn't hurt anyone…

If this is really true, then that sentence needs to be in the docs. It would stop a lot of people from even trying and wasting time. Most people don't have much room to go up… Everything above 16 GB is insanely priced, and you can't get much more VRAM than that on a single consumer GPU without spending a fortune. And you can't expect people to set up multi-GPU rigs just to control 25 HA entities. Meanwhile, without HA control, everything works fine even on the CPU… Something doesn't add up here.

I still have my doubts that this is actually OK. I repeat: one single entity degrading performance 10-20x, maybe even more. If it's really sending that much data just for 1 single entity, then there is something fundamentally wrong with this integration.

Does anyone have experience with running other integrations, like Extended OpenAI Conversation, on the CPU? Or is that a complete waste of time on a CPU as well? From what I can see, there you can at least control the full prompt (the entity info being sent), so maybe it can be tuned better.

Also, I'm still looking for a way to set "num_threads" through the Ollama integration, so I can at least give it a try with the CPU fully utilized. Currently it never goes above 40-50%.
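Since the integration doesn't seem to expose model options, the workaround I'm going to try is baking the parameter into a derived model via a Modelfile and then selecting that model in the integration. A sketch of the idea as a small Python helper; the base model, derived model name, and thread count are assumptions for my box, and whether num_thread is honored this way is worth double-checking:

```python
# Create a derived Ollama model with num_thread baked in, then pick
# "llama3.1-threads" as the model in the Home Assistant Ollama integration.
# Base model and thread count are assumptions for my setup.
import subprocess
import tempfile

MODELFILE = """\
FROM llama3.1:8b
PARAMETER num_thread 12
"""

with tempfile.NamedTemporaryFile("w", suffix=".Modelfile", delete=False) as f:
    f.write(MODELFILE)
    path = f.name

# Equivalent to running: ollama create llama3.1-threads -f <Modelfile>
subprocess.run(["ollama", "create", "llama3.1-threads", "-f", path], check=True)
```

After that, the derived model should show up in the model list the integration pulls from the server, same as any other local model.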

I came across this today in my initial explorations. In my case it's a Granite model (granite3.2:2b), but the symptoms are the same. Interestingly, it seems some others have hit this on GPU too: reddit thread https://www.reddit.com/r/homeassistant/comments/1czps3n/ollama_and_home_assistance_very_slow_to_answer/

Sample query, running on 6x 10300T vCPUs:

“what is home assistant?”
Without control enabled:

  • Time to first response character: ~3s
  • Time to finish response: +10s

With control enabled (22 entities), the response is awful:

  • Time to first response character: ~70s
  • Time to finish response: +1s?
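
For anyone who wants comparable numbers outside the Assist pipeline, this is roughly how I'm timing it: stream from Ollama's /api/chat and note when the first token arrives. A rough sketch; host and model are assumptions for my setup, and the with-control comparison still has to go through Assist itself:

```python
# Rough time-to-first-token measurement against Ollama's streaming /api/chat.
# Host and model are assumptions for my setup.
import json
import time
import requests

messages = [{"role": "user", "content": "what is home assistant?"}]

start = time.monotonic()
first_token = total = None
with requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "granite3.2:2b", "messages": messages, "stream": True},
    stream=True,
    timeout=600,
) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # First chunk with actual content = time to first response character.
        if first_token is None and chunk.get("message", {}).get("content"):
            first_token = time.monotonic() - start
        if chunk.get("done"):
            total = time.monotonic() - start

print(f"time to first token: {first_token:.1f}s, total: {total:.1f}s")
```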

Yeah, I am running into a similar delay with a dedicated GPU machine as my AI server and HA running on a NUC. My requests from other machines to the chatbot are significantly faster than HA requests. Tinkering I will go…

Yep, same issue here with two GPUs in a dedicated system. As soon as I tell HA Voice to control a single entity, response times are 20x slower.