A quick guide on how I reduced LLM response time by more than 50% with a simple change. Hope this helps someone.
I am not an expert on llama.cpp, so take the info here with a grain of salt.
Some background
llama.cpp can do prompt caching. This avoids having to process the entire prompt each time a new request comes in with a similar prompt.
The way this works is that it starts from the beginning of the prompt and reuses the cached tokens up to the point where the cached prompt and the new prompt first differ.
original prompt: a b c d e f g h i j k
new prompt: a b c d e x y z
cache: a b c d e
This also shows the problem when the date and time sit at the top of the default prompt for a request: only a very small fraction of the prompt, just up to the new date and time, can be retrieved from the cache.
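The prefix-matching behavior above can be sketched in a few lines (a simplified token-level illustration, not llama.cpp's actual KV-cache code):

```python
def common_prefix_len(cached, new):
    """Number of leading tokens shared by the cached and new prompts.

    In llama.cpp-style prefix caching, only this shared prefix of the
    KV cache can be reused; everything after it must be recomputed.
    """
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n

cached = "a b c d e f g h i j k".split()
new = "a b c d e x y z".split()
print(common_prefix_len(cached, new))  # 5 tokens ("a b c d e") reused

# With a timestamp at the very top, the prompts differ at token 0,
# so essentially nothing is reused from the cache:
cached_ts = ["2024-01-01T10:00:00"] + cached
new_ts = ["2024-01-01T10:00:05"] + cached
print(common_prefix_len(cached_ts, new_ts))  # 0
```

Moving the timestamp to the end of the prompt (or dropping it) keeps the long static prefix intact and lets the cache do its job.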
The only issue with what you're saying is that everywhere I've ever read, very consistently, says Home Assistant starts every prompt with the current time down to the second, which breaks any prompt caching
Hi, wondering how you got llama.cpp linked into Home Assistant?
I was previously using Ollama on a Windows machine, but have moved to Ubuntu Linux and llama.cpp.
I've been able to connect Open WebUI on the same Linux computer running llama.cpp, but haven't been successful with either the Extended OpenAI or the Custom Conversation HACS integration so far in pointing them to my Linux box. Do you have recommendations on how to do it? (all local LAN)
Hey, found your post while searching for a way to add local llama.cpp into Home Assistant without using Ollama.
Can you maybe share your llama.cpp config? Every time I try to enable Assist (not only the chat mode), I get the same error: "Error generating LLM completion stream".
I tried different models and llama.cpp, but could not get a working config.
Thank you. That is strange. I tried the same model and llama.cpp settings and I am still getting the "Error generating LLM completion stream" error when I try it in LLM Agent Mode.
Interesting… is there anything in the debug log of the integration?
You could also turn on verbose logging for llama.cpp by adding the --verbose flag for debugging (be aware that this prints request and response messages and a lot of potentially private info).
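For example, when starting the server (model path, host, and port here are placeholders; only the --verbose flag is the point):

```shell
# Launch llama.cpp's OpenAI-compatible server with verbose logging.
# WARNING: --verbose logs full requests and responses, which may
# include private data from your prompts.
./llama-server \
  --model /path/to/model.gguf \
  --host 0.0.0.0 --port 8080 \
  --verbose
```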
I run the āNemotron-3-Nano-30B-A3B-BF16ā model on a DGX Spark with llama.cpp and the skye-harris/hass_local_openai_llm integration.
However, with Assist enabled in my agent, response times are between 10 and 20 seconds. Even when simply asking for the outside temperature.
I have 44 entities exposed.