Integration with LocalAI

Do you think this will work with oobabooga (GitHub - oobabooga/text-generation-webui: A gradio web UI for running Large Language Models like LLaMA, llama.cpp, GPT-J, OPT, and GALACTICA)?

There’s an extension which basically mirrors the OpenAI API calls, so it should be a drop-in replacement… theoretically it should work, I think.

Yep! (I also have text-gen-webui running here, I just found LocalAI to be a bit better packaged).

Once I get some more time to work on it and get it going properly (hopefully this weekend) you’ll be able to provide any API endpoint you want, and as long as it provides an OpenAI compatible response it should work.

I’m already doing this with a few other apps, just haven’t had time to hack on it any further yet.


Better yet it seems that someone has got a working integration available now - GitHub - drndos/hass-openai-custom-conversation: Conversation support for home assistant using vicuna local llm

Have you tested it out?
I’m about to give it a spin!

EDIT: I can’t seem to get that one working

EDIT2:
I figured out the issue.
Home Assistant does not let you connect to a non-encrypted endpoint if you’re using HTTPS for the Home Assistant instance.
I put NPM (Nginx Proxy Manager) in front as a proxy and it works correctly now.


Have you been able to increase the token size when using the custom OpenAI addon? I’ve tried increasing all the “Truncate the prompt up to this length” settings within the text-generation-webui interface, but I still get an error stating:
Sorry, I had a problem talking to Custom OpenAI compatible server: This model maximum context length is 2048 tokens.

You set the context length within the configuration of your model.
This is how I do it:

  1. Follow Easy Setup - GPU Docker :: LocalAI documentation
  2. You will need quantized models using the newer quantization methods, with names ending in _k_s, _k_m, _k or _k_l
  3. Open Postman and create a POST request to http://localhost:8080/models/apply with a raw JSON body:

```json
{
  "id": "huggingface@thebloke__vicuna-13b-v1.5-ggml__vicuna-13b-v1.5.ggmlv3.q3_k_s.bin"
}
```
  4. Wait for the model to download. Usually the model with more parameters is better. q4 is usually the sweet spot, but 13b q3 should perform better in terms of “quality” than 7b q4/q5. With Windows running, I can fit thebloke__vicuna-7b-v1.5-ggml__vicuna-7b-v1.5.ggmlv3.q5_k_m.bin on my RTX 3060 with 12 GB VRAM (even q8, but there is very little difference in terms of anything), as well as thebloke__vicuna-13b-v1.5-ggml__vicuna-13b-v1.5.ggmlv3.q3_k_s.bin.
  5. Change the model parameters in thebloke__vicuna-13b-v1.5-ggml__vicuna-13b-v1.5.ggmlv3.q3_k_s.bin.yaml:
```yaml
backend: llama
context_size: 4000
f16: true
gpu_layers: 43
low_vram: false
mmap: true
mmlock: true
batch: 512
name: thebloke__vicuna-13b-v1.5-ggml__vicuna-13b-v1.5.ggmlv3.q3_k_s.bin
parameters:
  model: vicuna-13b-v1.5.ggmlv3.q3_K_S.bin
  temperature: 0.2
  top_k: 80
  top_p: 0.7
roles:
  assistant: 'ASSISTANT:'
  system: ''
  user: 'USER:'
template:
  chat: vicuna-chat
  completion: vicuna-completion
```

You can also fix vicuna-chat.tmpl according to the suggested prompt template at TheBloke/vicuna-13B-v1.5-GGML · Hugging Face.

Pay attention to the context_size parameter, where you can customize the allowed context size of your model.
Also pay attention to gpu_layers. The more layers you offload to the GPU, the faster it will go, but the more GPU VRAM will be used.

  6. Restart LocalAI
  7. Configure your Home Assistant integration:
    Set the model name to
thebloke__vicuna-7b-v1.5-ggml__vicuna-7b-v1.5.ggmlv3.q5_k_m.bin
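If you prefer not to use Postman, the model-apply request from step 3 can also be built with Python’s standard library. This is just a sketch; it assumes LocalAI is listening on its default port 8080 and reuses the model id from the example above:

```python
import json
from urllib import request

# Same payload as the Postman example above
body = json.dumps({
    "id": "huggingface@thebloke__vicuna-13b-v1.5-ggml__vicuna-13b-v1.5.ggmlv3.q3_k_s.bin"
}).encode()

req = request.Request(
    "http://localhost:8080/models/apply",  # assumed default LocalAI port
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# request.urlopen(req)  # uncomment to actually send the request
```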

You can improve the template like this:

```
This smart home is controlled by Home Assistant.

An overview of the areas and the devices in this smart home:
{%- for area in areas() %}
  {%- set area_info = namespace(printed=false) %}
  {%- for device in area_devices(area) -%}
    {%- if not device_attr(device, "disabled_by") and not device_attr(device, "entry_type") and device_attr(device, "name") %}
      {%- if not area_info.printed %}

{{ area_name(area) }}:
        {%- set area_info.printed = true %}
      {%- endif %}
      {%- for entity in device_entities(device) %}
        {%- if not is_state(entity,'unavailable') and not is_state(entity,'unknown') and not is_hidden_entity(entity) %}
 - {{ state_attr(entity, 'friendly_name') }} is {{ states(entity) }}
        {%- endif %}
      {%- endfor %}
    {%- endif %}
  {%- endfor %}
{%- endfor %}

Answer the user's questions about the world truthfully.
```
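For illustration, rendered against a hypothetical home with two areas, that template produces a prompt along these lines (area, device, and state names are invented):

```
This smart home is controlled by Home Assistant.

An overview of the areas and the devices in this smart home:

Living Room:
 - Ceiling Light is off
 - Thermostat is heat

Kitchen:
 - Coffee Maker is on

Answer the user's questions about the world truthfully.
```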

Next step would be configuring the functions so the LLM can actually call your home assistant services and control your home.


Thanks for sharing this, I’ll need to test using LocalAI vs the text-generation-webui that I’ve been using.

If you are using the llama.cpp backend in text-generation-webui, you can set the context size by defining the n_ctx parameter before loading the model.


Bear with me as I’m still pretty new to the LLM stuff.
I’m using an AutoGPTQ model (TheBloke/airoboros-33B-GPT4-m2.0-GPTQ · Hugging Face).

I’m assuming it cannot be done with that one, so I’ll try the model you’ve shared

Yeah, some models unfortunately only support a 2048-token context, so even if you increase the setting, it might not work.

So I’m using your model and selected llama.cpp as the loader and increased n_ctx to 6144, then loaded the model, but I’m still receiving the same error in Home Assistant.

I’ll probably test with LocalAI later.

LocalAI could be used as a fallback for the Assist pipelines: Pipeline chaining or fallback intent

But maybe it should be something more generic than LocalAI, so you could use privateGPT or anything else.
Like maybe just use the already implemented Wyoming protocol integration.

What do you think?

So I’ve got this set up using localai now and am no longer getting the error about context length.

I’ve set my vicuna-chat.tmpl to this:

```
A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.

USER: {prompt}
ASSISTANT:
```

And I’ve set the template within home assistant to this:

```
This smart home is controlled by Home Assistant.

An overview of the areas and the devices in this smart home:
{%- for area in areas() %}
  {%- set area_info = namespace(printed=false) %}
  {%- for device in area_devices(area) -%}
    {%- if not device_attr(device, "disabled_by") and not device_attr(device, "entry_type") and device_attr(device, "name") %}
      {%- if not area_info.printed %}

{{ area_name(area) }}:
        {%- set area_info.printed = true %}
      {%- endif %}
      {%- for entity in device_entities(device) %}
        {%- if not is_state(entity,'unavailable') and not is_state(entity,'unknown') and not is_hidden_entity(entity) %}
 - {{ state_attr(entity, 'friendly_name') }} is {{ states(entity) }}
        {%- endif %}
      {%- endfor %}
    {%- endif %}
  {%- endfor %}
{%- endfor %}

Answer the user's questions about the world truthfully.
```

but am receiving this response whenever I try to interact with the LLM within home assistant:

```
# Prompt /imagine prompt: [Your Prompt], [Your Prompt], [Your Prompt], [Your Prompt] --ar 16:9 --v 5
```

Not really sure what’s going on here, but I think the prompt template is messed up (even though I applied the one directly from Hugging Face).

Hi guys, thanks to the information in this topic I got LocalAI to work nicely with Home Assistant.

I was already playing around with LocalAI and similar projects and got this to work quite fast and nicely on my desktop, as long as I run the smaller models and keep the prompts short.

To make it all a bit more sustainable I did the following:

  • desktop sleeps after 15 minutes
  • I use an almost empty template, so I can generate prompts with automations
    • reducing the prompt speeds up some LLMs
  • the automation checks if the desktop is on, otherwise sends Wake-on-LAN
  • then calls the conversation.process service with your generated prompt
  • sends the LLM’s response wherever you want (notification, TTS media player, etc.)

Sure, this does not replace the conversational assistant, but it does allow you to generate LLM messages with some nice prompt engineering with limited hardware, on your own infrastructure :star_struck:.
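Those steps could look roughly like the automation below. This is only a sketch: the trigger, entity IDs, MAC address, and agent name are all placeholders for whatever matches your setup.

```yaml
# Hypothetical automation: wake the desktop, then send a prompt to the LLM
alias: Ask the local LLM
trigger:
  - platform: time
    at: "08:00:00"
action:
  - service: wake_on_lan.send_magic_packet
    data:
      mac: "AA:BB:CC:DD:EE:FF"          # placeholder MAC of the desktop
  - wait_template: "{{ is_state('binary_sensor.desktop', 'on') }}"
    timeout: "00:02:00"
  - service: conversation.process
    data:
      agent_id: conversation.custom_openai   # placeholder agent name
      text: "Give me a short morning summary of the house."
```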

I’m pretty green to the whole HACS and LLM thing. I managed (I think) to get LocalAI running, but after installing hass-openai-custom-conversation I’m a bit lost. Can someone give me a hint?

  • Install local-ai :white_check_mark:
  • Setup model :white_check_mark:
  • Install hass-openai-custom-conversation :white_check_mark:
  • Add custom component to your hass installation :question:
  • Set first field to any string, set second field to the address of local-ai installation :question:
  • Configure hass assist to use custom openai conversation as conversation agent, set options to contain instructions specific to your setup and model name

The two steps with question marks above are where I’m getting lost. I’m guessing the last step is adding the assistant through the regular HA voice assistant interface.

Thanks :slight_smile:

Hi,
I can’t be sure where you get stuck, but I think you have to go to your integrations page and add the just-installed custom component. You can find it by simply looking for “custom OpenAI conversation”; clicking that will start the wizard where you can “set the first field to any string” and where you have to point to your LocalAI installation.

Hope this gets you going! :+1:

Can LocalAI be run as an add-on in Home Assistant? Trying to avoid using docker if possible :slight_smile: (running HA bare metal on a NUC)


hey there @drndos thanks for sharing!

I got LocalAI running now and working great. I’d like to incorporate this in HA, but since the writing of this, it seems that ggml models are no longer supported by LocalAI and it says to use a gguf model.

Can you please share a model for LocalAI (pref. 13b) that would work with your integration and, if possible, the settings you used in the files (i.e. yaml, tmpl, etc.)?

thanks so much!

Hi, could you share whether you used the gguf or ggml model (as above)? Thanks

I don’t think that is realistic or in any way useful.

I’m running a 13B model (vicuna-13b-v1.5.ggmlv3.q3_K_S.bin)
on the following hardware:

  • AMD Ryzen 5950X
  • 64GB RAM
  • GTX1080 (somewhat dated)

Using all these resources gives me a response time of 4~30 seconds, depending on the difficulty and length of the prompt.

Also note that the desktop is completely unusable while it is processing a prompt, and for as long as the model is kept in memory.