I have set up a relatively fast, fully local AI voice assistant for Home Assistant.
The guide below is written for installation with an Nvidia GPU on a Linux machine, but it is also possible to use AMD GPUs and Windows. Feel free to share any info or ask any question related to Assist.
The following components are used:
- Wyoming Faster Whisper Docker container (build files)
- Llama-cpp-python Docker container (build files)
- Extended OpenAI HACS Integration (modified fork)
- Functionary Small V2.4 LLM (Q4) (It’s multilingual as well!)
- Nvidia GTX 1080 GPU
See the Installation guide below to set up the individual components.
Example 1: Control light entities
Features
- Set brightness
- Change color
- Change temperature to cold / warm
Functions code
```yaml
- spec:
    name: set_light_color
    description: Sets a color value for a light entity. Only call this function
      when the user explicitly gives a color, and not warm, cold or cool.
    parameters:
      type: object
      properties:
        color:
          type: string
          description: The color to set
        entity_id:
          type: string
          description: The light entity_id retrieved from available devices.
            It must start with the light domain, followed by a dot character.
      required:
        - color
        - entity_id
  function:
    type: script
    sequence:
      - service: light.turn_on
        data:
          color_name: '{{color}}'
        target:
          entity_id: '{{entity_id}}'
- spec:
    name: set_light_brightness
    description: Sets a brightness value for a light entity. Only call this
      function when the user explicitly gives you a percentage value.
    parameters:
      type: object
      properties:
        brightness:
          type: string
          description: The brightness percentage to set.
        entity_id:
          type: string
          description: The light entity_id retrieved from available devices.
            It must start with the light domain, followed by a dot character.
      required:
        - brightness
        - entity_id
  function:
    type: script
    sequence:
      - service: light.turn_on
        data:
          brightness_pct: '{{brightness}}'
        target:
          entity_id: '{{entity_id}}'
- spec:
    name: set_light_warm
    description: Sets a light entity to its warmest temperature.
    parameters:
      type: object
      properties:
        entity_id:
          type: string
          description: The light entity_id retrieved from available devices.
            It must start with the light domain, followed by a dot character.
      required:
        - entity_id
  function:
    type: script
    sequence:
      - service: light.turn_on
        data:
          kelvin: '{{state_attr(entity_id, "min_color_temp_kelvin")}}'
        target:
          entity_id: '{{entity_id}}'
- spec:
    name: set_light_cold
    description: Sets a light entity to its coldest or coolest temperature.
      Only call this function when the user explicitly asks for a cold or cool
      light temperature.
    parameters:
      type: object
      properties:
        entity_id:
          type: string
          description: The light entity_id retrieved from available devices.
            It must start with the light domain, followed by a dot character.
      required:
        - entity_id
  function:
    type: script
    sequence:
      - service: light.turn_on
        data:
          kelvin: '{{state_attr(entity_id, "max_color_temp_kelvin")}}'
        target:
          entity_id: '{{entity_id}}'
```
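If you want to sanity-check one of these functions outside the voice pipeline, you can call the underlying service from Developer Tools. For example, the warm-white function boils down to something like this (using a hypothetical `light.bedroom` entity):

```yaml
# What set_light_warm ends up calling, for a hypothetical light.bedroom entity
service: light.turn_on
data:
  kelvin: "{{ state_attr('light.bedroom', 'min_color_temp_kelvin') }}"
target:
  entity_id: light.bedroom
```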
Example 2: Call Music Assistant service
Uses the `mass.play_media` service of the Music Assistant integration in Home Assistant to find and play a given playlist / track on a given Music Assistant media player entity. I have my Spotify connected to Music Assistant, so it can find any track / playlist that is available on Spotify.
Features
- Play track on MA media player
- Play playlist on MA media player
Functions code
```yaml
- spec:
    name: play_track_on_media_player
    description: Plays any track (name or artist of song) on a given media player
    parameters:
      type: object
      properties:
        track:
          type: string
          description: The track to play
        entity_id:
          type: string
          description: The media_player entity_id retrieved from available devices.
            It must start with the media_player domain, followed by a dot character.
      required:
        - track
        - entity_id
  function:
    type: script
    sequence:
      - service: mass.play_media
        data:
          media_id: '{{track}}'
          media_type: track
        target:
          entity_id: '{{entity_id}}'
- spec:
    name: play_playlist_on_media_player
    description: Plays any playlist on a given media player
    parameters:
      type: object
      properties:
        playlist:
          type: string
          description: The name of the playlist to play
        entity_id:
          type: string
          description: The media_player entity_id retrieved from available devices.
            It must start with the media_player domain, followed by a dot character.
      required:
        - playlist
        - entity_id
  function:
    type: script
    sequence:
      - service: mass.play_media
        data:
          media_id: '{{playlist}}'
          media_type: playlist
        target:
          entity_id: '{{entity_id}}'
```
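Again, the underlying service call can be tested on its own in Developer Tools before handing it to the LLM. A minimal sketch, with a hypothetical media player entity and playlist name:

```yaml
# Hypothetical example: play a playlist through Music Assistant
service: mass.play_media
data:
  media_id: "Discover Weekly"
  media_type: playlist
target:
  entity_id: media_player.living_room
```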
Important note
Even though I think it works great, don't expect everything to work flawlessly. The performance of Speech-to-Text and the LLM depends heavily on the hardware you have and how you have configured it. Important things are:
- Speech-to-Text is heavily dependent on audio quality; while the M5Stack Atom Echo is fun to play and test with, it's not good enough for deployment.
- Simple entity naming, otherwise the LLM will not obtain the correct entity_id.
- Simple and strong naming and descriptions for each function in the Extended OpenAI configuration; this is what the LLM uses to decide which function to call based on your command.
- The quantization of the LLM you are using (F16, Q8, Q4). F16 is the largest, most accurate, and slowest; Q4 is the smallest, least accurate, but fastest.
The performance of the GTX 1080 is not good enough for deployment in my opinion, since LLM inference takes ~8 seconds for function calling with Functionary v2.4 small Q4. A newer Nvidia RTX 3000 / 4000 series card is recommended for faster inference times.
Updates
I also got my AMD 6900XT GPU working with llama-cpp-python on my Windows PC, which can perform function calling in around 3 seconds! Let me know if you need help installing llama-cpp(-python) for ROCm on Windows.
Cloud GPUs (Vast.ai)
If you are not sure which GPU best fits your needs, or you don't want to host a GPU at home and are fine with hourly costs, you can deploy my llama-cpp-python Docker container on Vast.ai cloud GPUs.
Image Path/Tag: bramnh/llama-cpp-python:latest
Docker Options:
```
-p 8000:8000 -e USE_MLOCK=0 -e HF_MODEL_REPO_ID=meetkai/functionary-small-v2.4-GGUF -e MODEL=functionary-small-v2.4.Q4_0.gguf -e HF_PRETRAINED_MODEL_NAME_OR_PATH=meetkai/functionary-small-v2.4-GGUF -e N_GPU_LAYERS=33 -e CHAT_FORMAT=functionary-v2 -e N_CTX=4092 -e N_BATCH=192 -e N_THREADS=6
```
Launch Mode: Docker Run
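For reference, the same settings map roughly to the following `docker run` command if you want to try the image on your own machine first (a sketch, not the exact Vast.ai launch command; adjust the GPU flags to your setup):

```bash
# Local equivalent of the Vast.ai template settings above (sketch)
docker run -d --gpus all -p 8000:8000 \
  -e USE_MLOCK=0 \
  -e HF_MODEL_REPO_ID=meetkai/functionary-small-v2.4-GGUF \
  -e MODEL=functionary-small-v2.4.Q4_0.gguf \
  -e HF_PRETRAINED_MODEL_NAME_OR_PATH=meetkai/functionary-small-v2.4-GGUF \
  -e N_GPU_LAYERS=33 -e CHAT_FORMAT=functionary-v2 \
  -e N_CTX=4092 -e N_BATCH=192 -e N_THREADS=6 \
  bramnh/llama-cpp-python:latest
```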
Fallback Conversation Agent (HACS Integration)
If you find the function calling of the local LLM too slow, you could install the Fallback Conversation Agent. It lets you configure a conversation agent with a primary and a secondary (fallback) agent, so you can combine the built-in HA Assist agent with your local LLM.
This way, simple commands such as "turn lights on in bedroom" are executed quickly by the built-in HA agent, and everything it doesn't understand is forwarded to the local LLM.
The Story
I want to quickly update the community on the possibilities of AI, voice control and Home Assistant. I have been exploring the possibility of running a fully local voice assistant in my home for quite a while now.
I know the majority of HA users run their instance on a small piece of hardware without much compute capability; this post is NOT for those users! My Home Assistant instance runs as a Docker container on an old PC that is now an Ubuntu server. I recently upgraded this PC with an Nvidia GTX 1080 GPU (around €100) to achieve the following:
- Run a local LLM (AI) model that is completely offloaded into my GPU’s VRAM.
- Run local STT with Whisper on my GPU with the large-v3-int8 model.
Further reading
The local STT using Whisper is far off Google's STT performance, so it was annoying to use with the default Assist of Home Assistant, since that requires precise intents. Especially in Dutch, it is very hard to always get the precise intent output from Whisper, and some words are often replaced by others (it feels like overkill to make a wildcard for these words). I therefore focused on using AI, so that you don't have to memorize any voice commands and it all feels more natural.
To my knowledge, there are two HACS integrations that support AI function calling as of now:
- Home-LLM: more focused on smaller HA (CPU only) setups and uses a relatively small LLM (3B parameters) that is trained on a custom Home Assistant Request dataset. However, it is also possible to train and use your own LLM.
- Extended OpenAI: an extension of the OpenAI integration in HA that supports function calling with the GPT-3.5/4 models (and other models that support function calling via OpenAI's API).
Then, there are multiple ways of setting up your own local LLM:
- LocalAI
- llama-cpp (-python)
- KoboldCPP (AMD GPU support)
- Many more!
I first used a combination of LocalAI and Home-LLM with my own custom-trained model on a Dutch-translated version of the training set from Home-LLM. I used Unsloth to train the Mistral 7B model using this Google Colab. It worked quite well for some functions (e.g. light brightness), but it is still far from a real AI experience. The largest downside of this integration is that you need to train the model for each function call, so it's not easy to add a feature.
I have now settled on llama-cpp-python and Extended OpenAI. I came across this YouTube video from FutureProofHomes and his journey in making a dedicated local AI-powered voice assistant. It's not exactly what I am looking for, since his dedicated hardware restrictions make the AI very slow. However, all credits go to FutureProofHomes for pointing me in this direction. Normally, Extended OpenAI only supports the GPT models that offer function calling, so most models that you can run locally do not work. But there is a model called Functionary that you can run locally and that provides even better function calling than the GPT models! Do note that chit-chatting with this model is never as good as with GPT. Some modifications to the source code of Extended OpenAI and llama-cpp-python were necessary to get this combination working.
It can all easily be made faster if you want to invest in it. For now, it seems best to buy a GPU with as much VRAM as possible and the highest CUDA compute capability. I might buy an RTX 3060 (12GB) or RTX 3090 (24GB) in the future! I was also able to run KoboldCPP on my desktop PC with my AMD Radeon 6900XT.
See below the guide with all the code to get llama-cpp-python / Extended OpenAI / Functionary working together. Also let me know if you have any tips or suggestions on local AI voice assistants. I would love to hear alternatives and benchmarks of the processing times of other GPUs.
Installation Guide
This guide is specifically written for installing a local LLM voice assistant using Docker containers on a setup with an Nvidia GPU (CUDA) and Ubuntu 22.04. Since we are building our own Docker images, you might have to change a few things depending on your setup.
Prerequisites:
- Linux distribution: one that is supported by the Nvidia Container Toolkit
- Docker container engine installed
- Nvidia GPU (including CUDA drivers); check your maximum supported CUDA version by running `nvidia-smi`
- Nvidia Container Toolkit: required to run Docker containers on CUDA; follow this installation guide.
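Before continuing, it is worth verifying that containers can actually see the GPU. A quick sanity check, reusing the CUDA image referenced later in this guide:

```bash
# Should print the same GPU table as running nvidia-smi directly on the host
docker run --rm --gpus all nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04 nvidia-smi
```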
Wyoming Faster Whisper
You can use this repository to build the wyoming-faster-whisper Docker container that runs on CUDA.
- Clone the repository and navigate into it:
```
git clone https://github.com/BramNH/wyoming-faster-whisper-docker-cuda
cd wyoming-faster-whisper-docker-cuda
```
- Because my maximum supported CUDA version is 12.2, I use the following base image in `Dockerfile` to include the CUDA environment in the built image:
```
FROM nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04
```
Faster Whisper requires the cudnn8 runtime variant of the CUDA image. You might need another image based on your CUDA version and Linux distribution (see all possible images).
- Build the image:
```
docker build --tag wyoming-whisper .
```
- Edit the container configuration in `compose.yml` to specify which model to run (see the sketch after this list). For example: `--model ellisd/faster-whisper-large-v3-int8 --language nl`
- Start the container with Docker Compose:
```
docker compose up -d
```
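The `compose.yml` in the repository is authoritative; the sketch below only illustrates roughly what it looks like, assuming the default Wyoming port 10300 and the GPU reservation syntax of Docker Compose:

```yaml
# Rough sketch of a compose.yml for the CUDA whisper container (the repo's file may differ)
services:
  wyoming-whisper:
    image: wyoming-whisper
    command: >
      --model ellisd/faster-whisper-large-v3-int8
      --language nl
      --uri tcp://0.0.0.0:10300
      --data-dir /data
    volumes:
      - ./data:/data
    ports:
      - "10300:10300"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

Once the container is running, point the Wyoming integration in Home Assistant at the host's IP and the port you exposed (10300 in this sketch), then select Whisper as the Speech-to-Text engine of your Assist pipeline.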
Llama-cpp-python
We set up llama-cpp-python specifically to work in combination with the Functionary LLM. There seems to be a bug with the chat format in the latest llama-cpp-python release, so this image pins version llama-cpp-python==0.2.64, which is stable.
- Clone the repository to get the necessary files to build and run the Docker container, then navigate into the folder:
```
git clone https://github.com/BramNH/llama-cpp-python-docker-cuda
cd llama-cpp-python-docker-cuda
```
- Llama-cpp requires the devel CUDA image for GPU support, so I import the following image in `Dockerfile`. You might have to change this to match your CUDA version / Linux distribution (see all possible images):
```
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
```
- Build the Docker image with the included `Dockerfile`:
```
docker build --tag llama-cpp-python .
```
- You can run the container using the included `compose.yml`:
```
docker compose up -d
```
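Once the container is up, a quick way to confirm that the OpenAI-compatible API is reachable (assuming port 8000 as in the Vast.ai options earlier) is:

```bash
# llama-cpp-python serves an OpenAI-compatible API; this should return the loaded model
curl http://localhost:8000/v1/models
```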
Extended OpenAI
The Extended OpenAI HACS integration talks to the OpenAI-compatible API served by llama-cpp-python. Some modifications were also necessary to get the HACS integration working with Functionary and llama-cpp-python; see this discussion.
You can either re-install the HACS integration using my fork of Extended OpenAI, or replace the `__init__.py` file within the `/custom_components/extended_openai_conversation` folder of your Home Assistant installation with the file in my fork.
Follow the Extended OpenAI guide on how to create your own functions that the LLM can call.
Important settings when using the Functionary LLM:
- Enable `Use Tools` if you defined your own functions.
- Set `Context Threshold` to 8000 so messages are cleared after 8k tokens; otherwise the model gets confused once the threshold is exceeded.
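One thing that is easy to miss when pointing the integration at a local server instead of OpenAI is the connection details. Roughly (the host is a placeholder, and llama-cpp-python ignores the API key unless you explicitly configured one; drop the `/v1` suffix if your version of the integration appends it automatically):

```
Base URL: http://<llama-cpp-host>:8000/v1
API Key:  any non-empty string
```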
Credits
- FutureProofHomes for making Functionary work with Extended OpenAI and llama-cpp-python.
- Min Jekal for creating the Extended OpenAI integration!
- m50 for the ha-fallback-conversation integration.