Local LLM for dummies

Yes, this guide is for dummies like me who want to run a local LLM but, with the large number of new technologies/articles/names everywhere, don't know where to start. I must say that development is going very fast and what is written here could be old-school tech tomorrow.
This guide is intended to get your local LLM up and running ASAP, before that happens.

I tried to get this working on Ubuntu, but after a week of digging through tutorials and typing commands in a terminal I gave up and started looking for a Windows solution. Four hours later I had my local LLM up and running. Please note this has nothing to do with the OS; it says more about me being a dummy.

But here is a quick guide to get you started on Windows/Linux/Mac

What will you need?
Software
LM Studio on the machine that runs the LLM, and on the Home Assistant side the Local LLM Conversation integration (plus Whisper and Piper if you want a full voice pipeline).

Hardware
A computer with a GPU that has at least 8 GB of VRAM. The CPU isn't that important.

Information
My setup
*I tested this on a remote system running a Ryzen 7, 32 GB RAM, and an RTX 3060 Ti with 8 GB of VRAM.*
My context length is at 5400 tokens due to the many exposed entities (80). This slows things down a bit, but the complete pipeline takes between 2 and 5 seconds for an answer.

Where to install?
This can be installed on a separate machine in your network or, if possible, on the same machine that HA runs on.

Installation
Download and install LM Studio
After starting LM Studio you need a LLM model to play with.
Download the suggested model (Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf)
When the download finishes, load the model and you're ready to go. That's all.

Information
Check the small cog next to the model name field and see if all layers are loaded into the GPU's VRAM. Running in RAM is possible, but it will be much, much slower.

At this moment this will be a chatbot to play with.
On the right you have a box called System prompt.
Here you can describe the personality of the chatbot, how it should respond, and what its limits are.
Here is an example prompt
You are the smartest chatbot around and you translate all your responses to Dutch. Give answer in the style of Glados. Don’t hold back on irritating answers.

(Have fun with this one)

Information
*To control entities in HA you need a model that can do something called function calling. This creates a response in a form that other applications understand, often JSON following the OpenAI schema. The model mentioned here supports function calling.*
LM Studio has a model search feature that searches Hugging Face. If you go to that site directly, you have better search options.
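
For reference, a function-calling exchange in the OpenAI schema looks roughly like the sketch below: the application sends a list of tool definitions with the request, and the model answers with a structured tool call instead of plain text. The names used here are made up for illustration only.

```python
# Rough illustration of the OpenAI function-calling schema (hypothetical names).
# The application describes the available functions ("tools") in the request ...
tools = [
    {
        "type": "function",
        "function": {
            "name": "turn_on_light",  # hypothetical example function
            "description": "Turn on a light in the house",
            "parameters": {
                "type": "object",
                "properties": {"name": {"type": "string", "description": "Name of the light"}},
                "required": ["name"],
            },
        },
    }
]

# ... and a model that supports function calling can reply with a structured
# tool call like this instead of a plain sentence:
assistant_tool_call = {
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {"name": "turn_on_light", "arguments": '{"name": "kitchen"}'},
        }
    ],
}
```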

Go to LM Studio
Click on the Developer icon on the left
If you run LM Studio on a remote PC, turn 'Serve on Local Network' ON; otherwise leave it OFF to run on localhost only.
Now click the Start Server button.
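
If you want to check that the server responds before wiring it into HA, you can call the OpenAI-compatible endpoint directly. The sketch below is a minimal example, assuming LM Studio's default port 1234 and the Llama 3.1 model loaded; the exact model identifier and host may differ on your machine.

```python
# Minimal sketch: test the LM Studio server (assumes the default port 1234).
import requests

# Replace localhost with the server's IP address when testing from another machine.
LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

payload = {
    "model": "meta-llama-3.1-8b-instruct",  # model identifier as shown in LM Studio; adjust to yours
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in Dutch."},
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}

response = requests.post(LMSTUDIO_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

If this prints a sentence, the endpoint that the Local LLM Conversation integration will talk to is up and running.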

If you followed the setup instructions, you have now also installed Local LLM Conversation in HA and connected the Whisper and Piper pipeline. Now you have a working system.

Information
The basic actions such as "Turn the lights on" and "What is on in the kitchen" will work. Other commands may or may not work. These can be added via the system prompt. Another way to get things working is via the config options of Local LLM Conversation, by adding an "additional attribute to expose in the context". For example, for the to-do lists I've added an entry called shopping_list.

Troubleshooting
Debugging the pipeline

Command processing errors
In LM Studio, open the Developer tab.
Turn on all the logging and see what is happening. The most common error is that the prompt contains more tokens than defined in Context Length. One place to change that is via the cog next to the model name. In My Models you can set this per model as a default. The log will show you how many tokens the prompt had.
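
If you want a quick sanity check of how big your prompt is before it hits the model, a very rough rule of thumb for English text is about four characters per token. The sketch below only gives a ballpark figure (and the file name is just a placeholder); the LM Studio log remains the authoritative count.

```python
# Very rough token estimate (~4 characters per token for English text).
# This is only a ballpark; the LM Studio log shows the real token count.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

# Hypothetical file containing the prompt you want to check.
with open("system_prompt.txt", encoding="utf-8") as f:
    prompt = f.read()

print(f"Roughly {estimate_tokens(prompt)} tokens")
```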

Time out problems
The config options of Local LLM Conversation have an option called 'Remote Request Timeout'. The default is 90 seconds, so if something goes wrong you have to wait 90 seconds before you get any response. I set this to 6 seconds and 'Max tokens to return in response' to 128 tokens.


Some progress on getting more control and more speed

Reducing the System prompt length
If you look at the System prompt you see a Tools placeholder:

You are 'Nabu', a helpful AI Assistant that controls the devices in a house and make all replies in Dutch. If you do not understand the question then do not answer. Complete the following task as instructed with the information provided only.
The current time and date is {{ (as_timestamp(now()) | timestamp_custom("%I:%M %p on %A %B %d, %Y", "")) }}

Tools: {{ tools | to_json }}

Devices:

{{ tools | to_json }} will generate a big chunk of the prompt filled with API info to make function calls. To see what is actually sent to the LLM you have to look in the log of LM Studio.
Now for me, I have quite a few custom intents that I don't want to expose, but they still show up here. So if I replace {{ tools | to_json }} with the part found in the log between Tools: and Devices:, things will still work, but now I have control over which function calls are active, and I can even add my own.

Each function has its own schema; it is described with the {"type":"function","function": tag and ends before the next one.

Example
{"type":"function","function":{"name":"HassStartTimer","description":"Starts a new timer","parameters":{"type":"object","properties":{"hours":{"type":"integer","description":""},"minutes":{"type":"integer","description":""},"seconds":{"type":"integer","description":""},"name":{"type":"string","description":""},"conversation_command":{"type":"string","description":""}},"required":[]}}}

By removing function calls that are not needed, you make the system prompt shorter and, as a result, the response faster.
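
If you have copied the Tools array from the LM Studio log into a file, a small script can prune it down to the function calls you actually want before pasting it back into the system prompt. The sketch below assumes you saved the array between Tools: and Devices: as tools.json (a hypothetical name) and lists the functions to keep by name.

```python
# Minimal sketch: prune the Tools array copied from the LM Studio log.
# Assumption: the JSON array between "Tools:" and "Devices:" was saved to tools.json.
import json

KEEP = {"HassTurnOn", "HassTurnOff", "HassListAddItem"}  # function names you want to keep

with open("tools.json", encoding="utf-8") as f:
    tools = json.load(f)

pruned = [t for t in tools if t["function"]["name"] in KEEP]

# Paste this output into the system prompt in place of {{ tools | to_json }}.
print(json.dumps(pruned, separators=(",", ":")))
```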

Here is an example to test a function call if you want to add one, where I try to add water to my shopping list.

Open Assist (via Overview/Assistant in the top right corner) and enter the following in the request box.

<functioncall> {"name":"HassListAddItem","arguments":{"name":"todo.shopping_list","item":"water"}}