Yes, this guide is for dummies like me who want to run a local LLM but, with the large number of new technologies/articles/names everywhere, don't know where to start. I must say that development is going very fast and what is written here could be old-school tech tomorrow.
This guide is intended to get your local LLM up and running asap before that happens.
I have tried to get this working on Ubuntu, but after a week of digging through so many tutorials and typing commands in a terminal I gave up and started to look for a Windows solution. 4 hours later I had my local LLM up and running. Please note this has nothing to do with the OS; it says more about me being a dummy.
But here is a quick guide to get you started on Windows/Linux/Mac
What will you need?
Software
- HACS installed
- LM Studio (choose the Windows version; Linux/Mac versions are also available)
- Local LLM conversation - Follow these steps
Hardware
A computer with a GPU that has at least 8 GB of VRAM. The CPU isn't that important.
Information
My setup
*I tested this on a remote system with a Ryzen 7, 32 GB RAM, and an RTX 3060 Ti with 8 GB VRAM.*
My context length is at 5400 tokens due to the many exposed entities (80). This slows things down a bit, but the complete pipeline takes between 2 and 5 seconds to produce an answer.
Where to install?
This can be installed on a separate machine in your network or, if possible, on the same machine that HA runs on.
Installation
Download and install LM Studio
After starting LM Studio you need an LLM model to play with.
Download the suggested model (Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf)
When the download finishes, load the model and you're ready to go. That's all.
Information
Check the small cog next to the model name field and see if all layers are loaded into the GPU's VRAM. Running in RAM is possible, but it will be much, much slower.
At this point this is just a chatbot to play with.
On the right you have a box called System prompt
Here you can describe the personality of the chatbot, how it should respond and what its limits are.
Here is an example prompt
You are the smartest chatbot around and you translate all your responses to Dutch. Give answers in the style of GLaDOS. Don't hold back on irritating answers.
(Have fun with this one)
Information
*To control entities in HA you need a model that can do something called function calling. This creates a response in a form that other applications understand, often JSON following the OpenAI schema. The model mentioned here supports function calling.*
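To give you an idea of what function calling produces, here is a sketch of the kind of tool-call message a model returns over an OpenAI-compatible API. The tool name and arguments below are made up for illustration; the Local LLM Conversation integration defines its own tools, so you never have to write this yourself.

```python
# Sketch of an OpenAI-style tool call, shown as a Python dict.
# The tool name "HassTurnOn" and its arguments are illustrative only;
# the real tool definitions come from the Local LLM Conversation integration.
tool_call_message = {
    "role": "assistant",
    "content": None,  # no plain text: the model answers with a tool call instead
    "tool_calls": [
        {
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "HassTurnOn",                      # which function to run
                "arguments": '{"name": "kitchen light"}',  # parameters as a JSON string
            },
        }
    ],
}
```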
LM Studio has a model search function that searches Hugging Face. If you go to that site directly, you have better search options.
Go to LM Studio
Click on the Developer icon on the left
If LM Studio runs on a remote PC, turn 'Serve on Local Network' ON; otherwise leave it OFF to serve on localhost only.
Now click the Start Server button.
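To quickly check that the server responds, you can send a test request to LM Studio's OpenAI-compatible endpoint. A minimal sketch in Python, assuming the default port 1234 and the suggested Llama 3.1 model loaded (the model identifier is an assumption; use the one LM Studio shows, and replace localhost with the server's IP if you serve on the local network):

```python
import requests

# LM Studio exposes an OpenAI-compatible API; port 1234 is the default.
url = "http://localhost:1234/v1/chat/completions"

payload = {
    "model": "meta-llama-3.1-8b-instruct",  # model identifier as shown in LM Studio
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello in one short sentence."},
    ],
    "max_tokens": 50,
}

response = requests.post(url, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```

If this prints a short reply, the server side is working and HA should be able to reach it the same way.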
If you followed the setup instructions you have now also installed Local LLM Conversation in HA and connected Whisper and Piper together in the pipeline. Now you have a working system.
Information
The basic actions such as 'Turn the lights on' and 'What is on in the kitchen' will work. Other commands may or may not work; these can be added via the system prompt. Another way to get things working is via the config options of Local LLM Conversation, by adding an 'additional attribute to expose in the context'. For example, for the todo lists I've added an entry called shopping_list.
Troubleshooting
Debugging the pipeline
Command processing errors
In LM Studio, open the Developer tab.
Turn on all the logging and see what is happening. The most common error is that the prompt contains more tokens than defined in Context Length. One place to change that is via the cog next to the model name; under My Models you can set this per model as a default. The log will show you how many tokens the prompt had.
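If you are experimenting with your own requests, the response from LM Studio's OpenAI-compatible API also reports token usage, so you can see how many tokens a prompt used compared to your Context Length. A short sketch, again assuming the default localhost:1234 and an illustrative model identifier:

```python
import requests

# Send a chat request and read back the token usage LM Studio reports.
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "meta-llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "What is on in the kitchen?"}],
        "max_tokens": 128,
    },
    timeout=60,
)
usage = resp.json().get("usage", {})
print("prompt tokens:", usage.get("prompt_tokens"))          # must fit inside Context Length
print("completion tokens:", usage.get("completion_tokens"))
```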
Time out problems
The config options of Local LLM Conversation include an option called 'Remote Request Timeout', which defaults to 90 seconds. So if something goes wrong, you have to wait 90 seconds before you get any response. I have set this to 6 seconds and 'Max tokens to return in response' to 128 tokens.