Improve local LLM performance with llama.cpp and custom-conversation

A quick guide on how I was able to reduce LLM response time by more than 50% with a simple change. Hope this helps someone.

I am not an expert on llama.cpp, so take the info here with a grain of salt.

some background

llama.cpp can do prompt caching. This means it does not have to process the entire prompt each time a new request comes in with a similar prompt.

The way this works is that it starts from the beginning of the prompt and reuses the cached version up to the point where the cached prompt and the new prompt differ.

original prompt: a b c d e f g h i j k
new prompt: a b c d e x y z
cache: a b c d e

This also shows the problem: if the date and time sit at the top of the default prompt for a request, only a very small fraction of the prompt can be served from the cache, just the part before the new date and time.
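The prefix-matching idea above can be sketched in a few lines of Python. This is a simplification that matches on words rather than on model tokens, but the effect is the same: a changing timestamp at the top kills cache reuse, while the same timestamp at the bottom barely matters.

```python
def cached_prefix_len(cached, new):
    """Length of the shared leading prefix between two token lists."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

# Timestamp at the top: the match ends after the first word.
old = "Time: 10:41:02. You are a helpful assistant. Devices: lamp, fan.".split()
new = "Time: 10:41:07. You are a helpful assistant. Devices: lamp, fan.".split()
print(cached_prefix_len(old, new))   # 1

# Timestamp at the bottom: almost the whole prompt is reused.
old2 = "You are a helpful assistant. Devices: lamp, fan. Time: 10:41:02.".split()
new2 = "You are a helpful assistant. Devices: lamp, fan. Time: 10:41:07.".split()
print(cached_prefix_len(old2, new2))  # 9
```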

changes

I used the excellent GitHub - michelle-avery/custom-conversation: A very customizable version of a Conversation Agent for Home Assistant to connect to my local llama.cpp instance. But this would work with any other component that allows you to edit the prompt.

Either remove the date/time template (in custom-conversation it is in the Base Prompt), or move it as far down as possible.

improvement

In my setup, the LLM responses for qwen3:4b went from around 5 seconds to under 2 seconds.

Note: I also added --cache-reuse 256 as a llama-server parameter - I am not sure if this is required to turn on prompt caching.

The only issue with what you’re saying is that everything I’ve ever read very consistently says Home Assistant starts every prompt with the current time down to the second, which breaks any prompt caching.

That’s why I included the custom conversation component above. It allows you to edit the prompt.

Hi, wondering how you got llama.cpp linked into home assistant?
I was previously using ollama on a windows machine, but have moved to linux ubuntu and llama.cpp.

I’ve been able to connect Open WebUI on the same linux computer running llama.cpp, but haven’t been successful with either extended openai nor custom conversation HACS so far to point to my linux box. Do you have recommendations on how to do it? (all local lan)


Sure. I use llama-swap, which manages llama.cpp (and therefore llama-server) instances for me.

It requires a bit of config setup compared to ollama, but you don’t have to change it often.

llama-swap can actually do way more than just run and switch llama.cpp servers, but I only use it for this.

On the home assistant side, I use GitHub - michelle-avery/custom-conversation: A very customizable version of a Conversation Agent for Home Assistant and point it to my llama-swap instance.

Let me know if you have any specific questions regarding the config or the setup.
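For anyone stuck on the connection step: both llama-server and llama-swap expose an OpenAI-compatible `/v1/chat/completions` endpoint, so any Home Assistant integration that speaks the OpenAI API can be pointed at them. Here is a minimal sketch of the request such an integration sends; the LAN address, port, and model alias are assumptions you need to adjust to your own setup.

```python
import json
import urllib.request

# Assumptions: llama-swap (or llama-server) is reachable at this address and
# port, and "qwen3-4b" is a model alias you configured. Adjust both.
BASE_URL = "http://192.168.1.50:8080"

payload = {
    "model": "qwen3-4b",
    "messages": [{"role": "user", "content": "Turn on the living room lamp."}],
    "stream": False,
}
req = urllib.request.Request(
    BASE_URL + "/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is reachable from this machine:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```

If a request like this works from the Home Assistant host, the integration only needs the same base URL (and usually a dummy API key, since llama.cpp does not check it by default).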

Hey, found your post by searching a way to add local llama.cpp into home assistant without using ollama.

Can you maybe share your llama.cpp config? Every time I try to enable Assist (not only the chat mode), I get the same error: Error generating LLM completion stream.

I tried different models and llama.cpp, but could not get a working config.

Thanks.

Sure.

Currently I am using qwen3:4b-instruct-2507 from unsloth in Q4_K_M, and these are the parameters for llama-server:

--temp 1.0 
--min-p 
--top-k 64 
--top-p 0.95
--cache-type-k q8_0 
--cache-type-v q8_0 
--flash-attn on
-hf unsloth/Qwen3-4B-Instruct-2507-GGUF:Q4_K_M
-ngl 99 
--ctx-size 32768
--jinja

Side note: I just started testing the relatively new --cache-ram flag. This might improve prompt processing.
--cache-ram -1


Thank you. That is strange. I tried the same model and llama.cpp settings and I am still getting the "Error generating LLM completion stream" error when I try it in LLM Agent Mode.

Llama.cpp states it failed to parse tools.

If anyone has an idea let me know.

Which integration are you using in home assistant?

I am using the one you suggested in the first post (michelle-avery/custom-conversation).

Interesting… is there anything in the debug log of the integration?

You could also turn on verbose logging for llama.cpp by adding the --verbose flag for debugging (be aware that this prints request and response messages and a lot of potentially private info).

Another integration I have used with llama.cpp and Qwen that works well is: GitHub - skye-harris/hass_local_openai_llm: Home Assistant LLM integration for local OpenAI-compatible services (llamacpp, vllm, etc)


This is no longer true; I pushed a change for that a few months back.
Now the LLM relies on a tool to get the date and time if you use the Assist API.
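In OpenAI tool-calling terms, a date/time tool handed to the model looks roughly like the sketch below. The tool name and description here are hypothetical, not necessarily what Home Assistant actually registers; the point is that the timestamp no longer appears in the prompt text, so it cannot break the prefix cache.

```python
import json

# Hypothetical tool schema in the OpenAI tool-calling format; Home Assistant's
# real tool name and wording may differ.
get_time_tool = {
    "type": "function",
    "function": {
        "name": "get_current_time",
        "description": "Return the current date and time.",
        "parameters": {"type": "object", "properties": {}, "required": []},
    },
}

# This goes into the "tools" array of a /v1/chat/completions request; the model
# only calls it when the user actually asks about the time.
print(json.dumps(get_time_tool, indent=2))
```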

what are your average response times?

I run the 'Nemotron-3-Nano-30B-A3B-BF16' model on a DGX Spark with llama.cpp and the skye-harris/hass_local_openai_llm integration.
However, with Assist enabled in my agent, response times are between 10 and 20 seconds. Even when simply asking for the outside temperature.
I have 44 entities exposed.