A quick guide on how I reduced LLM response time by more than 50% with a simple change. Hope this helps someone.
I am not an expert on llama.cpp, so take the info here with a grain of salt.
Some background
llama.cpp can do prompt caching. This avoids having to process the entire prompt each time a new request comes in with a similar prompt.
The way this works is that it starts from the beginning of the prompt and reuses the cached tokens up to the point where the cached prompt and the new prompt first differ.
original prompt: a b c d e f g h i j k
new prompt: a b c d e x y z
cache: a b c d e
This also shows the problem when the date and time sit at the top of the default prompt for a request: only a very small fraction of the prompt, just up to the new date and time, can be retrieved from the cache.
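The prefix-matching behavior above can be sketched in a few lines (a simplified token-level illustration, not llama.cpp's actual KV-cache code):

```python
def common_prefix_len(cached, new):
    """Number of leading tokens shared by the cached and new prompts.

    In llama.cpp-style prefix caching, only this shared prefix of the
    KV cache can be reused; everything after it must be recomputed.
    """
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = "a b c d e f g h i j k".split()
new = "a b c d e x y z".split()
print(common_prefix_len(cached, new))  # 5 tokens ("a b c d e") reused

# With a timestamp at the very top, the prompts differ at token 0,
# so essentially nothing is reused from the cache:
cached_ts = ["2024-01-01T10:00:00"] + cached
new_ts = ["2024-01-01T10:00:05"] + cached
print(common_prefix_len(cached_ts, new_ts))  # 0
```

Moving the timestamp to the end of the prompt (or dropping it) keeps the long static prefix intact and lets the cache do its job.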
The only issue with what you're saying is that everywhere I've ever read, very consistently, says Home Assistant starts every prompt with the current time down to the second, which breaks any prompt caching
Hi, wondering how you got llama.cpp linked into Home Assistant?
I was previously using Ollama on a Windows machine, but have moved to Ubuntu Linux and llama.cpp.
I've been able to connect Open WebUI on the same Linux computer running llama.cpp, but haven't been successful with either the Extended OpenAI or the Custom Conversation HACS integration so far in pointing them to my Linux box. Do you have recommendations on how to do it? (all local LAN)
Hey, found your post while searching for a way to add local llama.cpp into Home Assistant without using Ollama.
Can you maybe share your llama.cpp config? Every time I try to enable Assist (not only the chat mode), I get the same error: "Error generating LLM completion stream".
I tried different models and llama.cpp, but could not get a working config.
Thank you. That is strange. I tried the same model and llama.cpp settings and I am still getting the "Error generating LLM completion stream" error when I try it in LLM Agent Mode.
Interesting… is there anything in the debug log of the integration?
You could also turn on verbose logging for llama.cpp by adding the --verbose flag for debugging (be aware that this prints request and response messages and a lot of potentially private info).
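For example, when starting the server (model path, host, and port here are placeholders; only the --verbose flag is the point):

```shell
# Launch llama.cpp's OpenAI-compatible server with verbose logging.
# WARNING: --verbose logs full requests and responses, which may
# include private data from your prompts.
./llama-server \
  --model /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  --verbose
```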
I run the āNemotron-3-Nano-30B-A3B-BF16ā model on a DGX Spark with llama.cpp and the skye-harris/hass_local_openai_llm integration.
However, with Assist enabled in my agent, response times are between 10 and 20 seconds. Even when simply asking for the outside temperature.
I have 44 entities exposed.