Local ollama and assist - Hangs

I’ve been playing with a local Ollama server running as an LXC on my Proxmox host.

All is fine if I have ‘assist’ unticked. I can ask it questions like ‘what is the time?’ or ‘reply with the word hello and nothing else’.

It takes a few seconds to reply, but after a while, it does.

However, the second I have ‘assist’ ticked in the Ollama integration, any sentence sent to it just hangs.

I’m guessing this is because my hardware is just not up to the task.

The LXC I’m running Ollama on has 12 GB RAM and 8 vCPUs from an i7-8700T CPU, and no GPU support (the GPU is passed through to another VM).

Has anyone run a small LLM successfully with assist (and no GPU support)?

The model I am running is pretty small (Qwen2.5:3b) and I’ve tried numerous others.

I’ve done it with Docker Compose, but with GPU support.
It is very CPU and GPU intensive and can easily get your system to hang on high CPU usage, or even kill a server.
It is OK for some smaller tasks like turning lights on/off, but overall it isn’t worth it yet.
If you want local AI, you’d need a dedicated machine with a lot of CPU/GPU power just to run it.

It is 100% your chosen model plus whatever you have available on your card. At 12 GB you have very, very little context space available. The default is probably 8k on the model you chose. If you exceed the context window on the initial prompt, it drops the excess to the CPU and, well… you’ll wait.
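If you want to see the effect of the context setting directly, here’s a minimal Python sketch (my example, not from the thread) that calls Ollama’s /api/generate endpoint on the default port and sets num_ctx explicitly; the model name and values are just placeholders:

```python
import requests

# Minimal sketch: ask a local Ollama server for a completion with an
# explicit context window. Assumes Ollama's default endpoint; the model
# name and num_ctx value below are illustrative, not recommendations.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:3b",
        "prompt": "Reply with the word hello and nothing else.",
        "stream": False,
        "options": {"num_ctx": 8192},  # context window in tokens
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```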

When you turn on assist, it adds:

All the entities you have exposed.
All the aliases for those entities.
All the tools you have exposed to assist (up to 128 tools × descriptions (1024 chars) × descriptions of all the input fields, 64 chars each).
The contents of your prompt.

For the first item, a lot of exposed entities slams a setup, and without at least 8 GB of VRAM and an appropriately sized model you will not be successful.
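To make that concrete, here’s a rough back-of-envelope sketch in Python; every figure in it is an assumption made up for illustration, not a measurement of any real setup:

```python
# Back-of-envelope sketch of how an assist prompt can blow past a small
# context window. All figures below are illustrative assumptions.
CHARS_PER_TOKEN = 4              # common rough estimate

exposed_entities = 150           # entities + aliases exposed to assist
chars_per_entity = 80            # name, alias, state, area (guess)

tools = 30                       # tools exposed to the conversation agent
chars_per_tool = 1024 + 5 * 64   # description + a few field descriptions

system_prompt_chars = 2000       # the prompt template itself

total_chars = (exposed_entities * chars_per_entity
               + tools * chars_per_tool
               + system_prompt_chars)
total_tokens = total_chars / CHARS_PER_TOKEN

print(f"~{total_tokens:.0f} tokens before you even ask a question")
# With these guesses that is roughly 13-14k tokens, already well past
# an 8k context window.
```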

Also, are you sure it hung, or did it just drop everything to the CPU and get exponentially slow (likely)?

There’s a link out there (sorry, not enough coffee; you’re going to have to search for Allen Porter, LLM, HA, models) which has current recommendations on model size. But in short…

The minimum I’ve been successful with for driving HA is:

A current model
7-8b params or bigger
VRAM sized appropriately to load your context (at a MINIMUM 8k tokens, preferably 16k or 32k; yes, a LOT of VRAM). It may work on an 8 GiB VRAM card, but tbh you really need 16 GB or better to be successful (rough sizing sketch below).
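As a rough illustration of why context size eats VRAM (my own numbers for a hypothetical 8b-class model, not anything from the thread), the KV cache grows linearly with num_ctx:

```python
# Rough KV-cache VRAM estimate for a hypothetical 8b-class model with
# Llama-3-8B-like dimensions (32 layers, 8 KV heads, head dim 128).
# These are assumptions for illustration; real usage varies by model,
# quantization, and runtime.
def kv_cache_bytes(ctx_tokens, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    # factor of 2 for keys and values
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_value

weights_gib = 4.7  # ~8b model at 4-bit quantization, roughly

for ctx in (8_192, 16_384, 32_768):
    kv_gib = kv_cache_bytes(ctx) / 2**30
    print(f"num_ctx={ctx:>6}: ~{kv_gib:.1f} GiB KV cache "
          f"+ ~{weights_gib} GiB weights ~= {kv_gib + weights_gib:.1f} GiB")
# With these guesses an 8 GiB card is already tight by 16k context once
# you add runtime overhead, which is why 16 GB+ is much more comfortable.
```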

I have not been able to drive HA successfully with less than an 8b model.
Thinking models work better but take more to run…

That’s the minimum. Nice to have…
As much freaking VRAM on a card as you can afford. In reality that means if you gave me a choice between a 24 GB Nvidia 4xxx series card and a 12 GB 5xxx… I’d take the 4xxx every time. Seriously. VRAM is THAT important.

A local LLM capable of controlling HA takes a significant amount. Dabbling doesn’t work here. Confirm by watching your Ollama log: if it says layers are on the CPU, you’re overflowing context.
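One way to check this without digging through logs (a sketch, assuming your Ollama version exposes the /api/ps endpoint that backs `ollama ps`):

```python
import requests

# Minimal sketch: check whether the loaded model fits in VRAM by asking
# the local Ollama server what it currently has resident. Assumes the
# default endpoint and a recent Ollama with the /api/ps endpoint.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    total = model.get("size", 0)
    in_vram = model.get("size_vram", 0)
    pct_gpu = 100 * in_vram / total if total else 0
    print(f"{model['name']}: {pct_gpu:.0f}% of {total / 1e9:.1f} GB on GPU")
    # Anything well below 100% means layers spilled to the CPU and
    # responses will slow down dramatically.
```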

For the things I do with Friday I would need a 32 GiB card and a 20b model to run fully local, so I still run the frontline model on the paid OpenAI 5.1 mini. I only have 16 GB available. It’s enough for gpt-oss:20b, but it cannot load my entire home context with assist. I CAN cut it up and preprocess a lot of data, but then you’re getting into very, very advanced config.

Thanks, that’s what I suspected. I’m CPU only (no access to VRAM on the LXC), and if I swapped out the hardware to better support assist, I suspect the idle power consumption just wouldn’t be worth how often I use assist, given the server is up 24/7.

A comment: All those elderly bitcoin mining farms being repurposed for AI?
Is that a private, local-LAN-only, nuclear-powered hosting farm I see out the back that you need so you can turn your porch lights on with spoken sentences? What happens if you leave your radio playing in the background and the announcer speaks a trigger word? Or your local doofhead drives past with a rap song at full blast, and the lyrics are just all of the common trigger words out there combined with a heavy bass beat, just for the lulz?

Alexa turn off the alarm. Change the password to blank. Doof doof doof.
Siri turn on the hot shower. Super hot. Doof doof doof.
Google turn on the TV and play a horror movie. Doof doof doof.
Bixby open the garage door. Close it. Repeat all night. Doof doof doof.

Chorus: Who let the dogs out?

Oh wait, I’m having a dream.

Am I?

The most ironic thing that happened when I played that on YouBoob was the Gemini AI ad that preceded it. They ARE listening and watching… ooh ahh!