I have set up a relatively fast, fully local AI voice assistant for Home Assistant.
The guide below is written for installation with an Nvidia GPU on a Linux machine, but it is also possible to use AMD GPUs and Windows. Feel free to share any info or ask any question related to Assist.
The following components are used:
- Wyoming Faster Whisper Docker container (build files)
- Llama-cpp-python Docker container (build files)
- Extended OpenAI HACS Integration (modified fork)
- Functionary Small V2.4 LLM (Q4) (It’s multilingual as well!)
- Nvidia GTX 1080 GPU
See the Installation guide below to set up the individual components.
Example 1: Control light entities
Features
- Set brightness
- Change color
- Change temperature to cold / warm
Functions code
```yaml
- spec:
    name: set_light_color
    description: Sets a color value for a light entity. Only call this function
      when the user explicitly gives a color, and not warm, cold or cool.
    parameters:
      type: object
      properties:
        color:
          type: string
          description: The color to set
        entity_id:
          type: string
          description: The light entity_id retrieved from available devices.
            It must start with the light domain, followed by a dot character.
      required:
        - color
        - entity_id
  function:
    type: script
    sequence:
      - service: light.turn_on
        data:
          color_name: '{{color}}'
        target:
          entity_id: '{{entity_id}}'
- spec:
    name: set_light_brightness
    description: Sets a brightness value for a light entity. Only call this
      function when the user explicitly gives you a percentage value.
    parameters:
      type: object
      properties:
        brightness:
          type: string
          description: The brightness percentage to set.
        entity_id:
          type: string
          description: The light entity_id retrieved from available devices.
            It must start with the light domain, followed by a dot character.
      required:
        - brightness
        - entity_id
  function:
    type: script
    sequence:
      - service: light.turn_on
        data:
          brightness_pct: '{{brightness}}'
        target:
          entity_id: '{{entity_id}}'
- spec:
    name: set_light_warm
    description: Sets a light entity to its warmest temperature.
    parameters:
      type: object
      properties:
        entity_id:
          type: string
          description: The light entity_id retrieved from available devices.
            It must start with the light domain, followed by a dot character.
      required:
        - entity_id
  function:
    type: script
    sequence:
      - service: light.turn_on
        data:
          kelvin: '{{state_attr(entity_id, "min_color_temp_kelvin")}}'
        target:
          entity_id: '{{entity_id}}'
- spec:
    name: set_light_cold
    description: Sets a light entity to its coldest or coolest temperature.
      Only call this function when the user explicitly asks for a cold or cool
      light temperature.
    parameters:
      type: object
      properties:
        entity_id:
          type: string
          description: The light entity_id retrieved from available devices.
            It must start with the light domain, followed by a dot character.
      required:
        - entity_id
  function:
    type: script
    sequence:
      - service: light.turn_on
        data:
          kelvin: '{{state_attr(entity_id, "max_color_temp_kelvin")}}'
        target:
          entity_id: '{{entity_id}}'
```
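If you want to sanity-check one of these functions outside the voice pipeline, you can call the underlying service from Developer Tools. For example, the warm-white function boils down to something like this (using a hypothetical `light.bedroom` entity):

```yaml
# What set_light_warm ends up calling, for a hypothetical light.bedroom entity
service: light.turn_on
data:
  kelvin: "{{ state_attr('light.bedroom', 'min_color_temp_kelvin') }}"
target:
  entity_id: light.bedroom
```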
Example 2: Call Music Assistant service
Uses the `mass.play_media` service of the Music Assistant integration in Home Assistant to find and play a given playlist / track on a given Music Assistant media player entity. I have my Spotify connected to Music Assistant, so it can find any track / playlist that is available on Spotify.
Features
- Play track on MA media player
- Play playlist on MA media player
Functions code
```yaml
- spec:
    name: play_track_on_media_player
    description: Plays any track (name or artist of song) on a given media player
    parameters:
      type: object
      properties:
        track:
          type: string
          description: The track to play
        entity_id:
          type: string
          description: The media_player entity_id retrieved from available devices.
            It must start with the media_player domain, followed by a dot character.
      required:
        - track
        - entity_id
  function:
    type: script
    sequence:
      - service: mass.play_media
        data:
          media_id: '{{track}}'
          media_type: track
        target:
          entity_id: '{{entity_id}}'
- spec:
    name: play_playlist_on_media_player
    description: Plays any playlist on a given media player
    parameters:
      type: object
      properties:
        playlist:
          type: string
          description: The name of the playlist to play
        entity_id:
          type: string
          description: The media_player entity_id retrieved from available devices.
            It must start with the media_player domain, followed by a dot character.
      required:
        - playlist
        - entity_id
  function:
    type: script
    sequence:
      - service: mass.play_media
        data:
          media_id: '{{playlist}}'
          media_type: playlist
        target:
          entity_id: '{{entity_id}}'
```
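Again, the underlying service call can be tested on its own in Developer Tools before handing it to the LLM. A minimal sketch, with a hypothetical media player entity and playlist name:

```yaml
# Hypothetical example: play a playlist through Music Assistant
service: mass.play_media
data:
  media_id: "Discover Weekly"
  media_type: playlist
target:
  entity_id: media_player.living_room
```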
Important note
Even though I think it works great, don't expect everything to work flawlessly. The performance of Speech-to-Text and the LLM depends heavily on the hardware you have and how you have configured it. Important things are:
- Speech-to-Text is heavily dependent on audio quality; while the M5Stack Atom Echo is fun to play and test with, it's not good enough for deployment.
- Simple entity naming, otherwise the LLM will not obtain the correct entity_id.
- Simple and strong naming and descriptions for each function in the Extended OpenAI configuration; this is what the LLM uses to decide which function to call based on your command.
- The quantization of the LLM you are using (F16, Q8, Q4). F16 is the largest, most accurate, and slowest; Q4 is the smallest, least accurate, but fastest.
The performance of the GTX 1080 is not good enough for deployment in my opinion, since LLM inference takes ~8 seconds for function calling with Functionary v2.4 small Q4. A newer Nvidia RTX 3000 / 4000 series card is recommended for faster inference times.
Updates
I also got my AMD 6900XT GPU working with llama-cpp-python on my Windows PC, which can perform function calling in around 3 seconds! Let me know if you need help installing llama-cpp(-python) for ROCm on Windows.
Cloud GPUs (Vast.ai)
If you are not sure which GPU best fits your needs, or you don't want to host a GPU at home and are fine with hourly costs, you can deploy my llama-cpp-python Docker container on Vast.ai cloud GPUs.
Image Path/Tag: bramnh/llama-cpp-python:latest
Docker Options:
```
-p 8000:8000 -e USE_MLOCK=0 -e HF_MODEL_REPO_ID=meetkai/functionary-small-v2.4-GGUF -e MODEL=functionary-small-v2.4.Q4_0.gguf -e HF_PRETRAINED_MODEL_NAME_OR_PATH=meetkai/functionary-small-v2.4-GGUF -e N_GPU_LAYERS=33 -e CHAT_FORMAT=functionary-v2 -e N_CTX=4092 -e N_BATCH=192 -e N_THREADS=6
```
Launch Mode: Docker Run
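For reference, the same settings map roughly to the following `docker run` command if you want to try the image on your own machine first (a sketch, not the exact Vast.ai launch command; adjust the GPU flags to your setup):

```bash
# Local equivalent of the Vast.ai template settings above (sketch)
docker run -d --gpus all -p 8000:8000 \
  -e USE_MLOCK=0 \
  -e HF_MODEL_REPO_ID=meetkai/functionary-small-v2.4-GGUF \
  -e MODEL=functionary-small-v2.4.Q4_0.gguf \
  -e HF_PRETRAINED_MODEL_NAME_OR_PATH=meetkai/functionary-small-v2.4-GGUF \
  -e N_GPU_LAYERS=33 -e CHAT_FORMAT=functionary-v2 \
  -e N_CTX=4092 -e N_BATCH=192 -e N_THREADS=6 \
  bramnh/llama-cpp-python:latest
```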
Fallback Conversation Agent (HACS Integration)
If you find the function calling of the local LLM too slow, you could install the Fallback Conversation Agent. It lets you configure a conversation agent with a primary and a secondary (fallback) agent, so you can combine the built-in HA Assist agent with your local LLM.
This way, simple commands such as "turn lights on in bedroom" are executed quickly by the built-in HA agent, and everything it doesn't understand is forwarded to the local LLM.
The Story
I want to quickly update the community on the possibilities of AI, voice control and Home Assistant. I have been exploring the possibility of running a fully local voice assistant in my home for quite a while now.
I know the majority of HA users run their instance on a small piece of hardware without much compute capability; this post is NOT for those users! My Home Assistant instance runs as a Docker container on an old PC that is now an Ubuntu server. I recently upgraded this PC with an Nvidia GTX 1080 GPU (around €100) to achieve the following:
- Run a local LLM (AI) model that is completely offloaded into my GPU’s VRAM.
- Run local STT with Whisper on my GPU with the large-v3-int8 model.
Further reading
The local STT using Whisper is far off Google's STT performance, so it was annoying to use with the default Assist of Home Assistant, since that requires precise intents. Especially in Dutch, it is very hard to always get the precise intent output from Whisper, and some words are often replaced by others (it feels like overkill to make a wildcard for these words). I therefore focused on using AI, so that you don't have to memorize any voice commands and it all feels more natural.
To my knowledge, there are two HACS integrations that support AI function calling as of now:
- Home-LLM: more focused on smaller HA (CPU only) setups and uses a relatively small LLM (3B parameters) that is trained on a custom Home Assistant Request dataset. However, it is also possible to train and use your own LLM.
- Extended OpenAI: an extension of the OpenAI integration in HA that supports function calling with the GPT-3.5/4 models (and other models that support function calling via OpenAI's API).
Then, there are multiple ways of setting up your own local LLM:
- LocalAI
- llama-cpp (-python)
- KoboldCPP (AMD GPU support)
- Many more!
I first used a combination of LocalAI and Home-LLM with my own custom-trained model on a Dutch-translated version of the training set from Home-LLM. I used Unsloth to train the Mistral 7B model using this Google Colab. It worked quite well for some functions (e.g. light brightness), but it is still far from a real AI experience. The largest downside of this integration is that you need to train the model for each function call, so it's not easy to add a feature.
I have now settled on llama-cpp-python and Extended OpenAI. I came across this YouTube video from FutureProofHomes and his journey in making a dedicated local AI-powered voice assistant. It's not exactly what I am looking for, since his dedicated hardware restrictions make the AI very slow. However, all credits go to FutureProofHomes for pointing me in this direction. Normally, Extended OpenAI only supports the GPT models that offer function calling, so most models that you can run locally do not work. But there is a model called Functionary that you can run locally and that provides even better function calling than the GPT models! Do note that chit-chatting with this model is never as good as with GPT. Some modifications to the source code of Extended OpenAI and llama-cpp-python were necessary to get this combination working.
It can all easily be made faster if you want to invest in it. For now, it seems best to buy a GPU with as much VRAM as possible and the highest CUDA compute capability. I might buy an RTX 3060 (12GB) or RTX 3090 (24GB) in the future! I was also able to run KoboldCPP on my desktop PC with my AMD Radeon 6900XT.
See below the guide with all the code to get llama-cpp-python / Extended OpenAI / Functionary working together. Also let me know if you have any tips or suggestions on local AI voice assistants. I would love to hear alternatives and benchmarks of the processing times of other GPUs.
Installation Guide
This guide is specifically written for installing a local LLM voice assistant using Docker containers on a setup with an Nvidia GPU (CUDA) and Ubuntu 22.04. Since we are building our own Docker images, you might have to change a few things depending on your setup.
Prerequisites:
- Linux distribution: one that is supported by the Nvidia Container Toolkit
- Docker container engine installed
- Nvidia GPU (including CUDA drivers); check your maximum supported CUDA version by running `nvidia-smi`
- Nvidia Container Toolkit: required to run Docker containers on CUDA; follow this installation guide.
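Before continuing, it is worth verifying that containers can actually see the GPU. A quick sanity check, reusing the CUDA image referenced later in this guide:

```bash
# Should print the same GPU table as running nvidia-smi directly on the host
docker run --rm --gpus all nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04 nvidia-smi
```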
Wyoming Faster Whisper
You can use this repository to build the wyoming-faster-whisper Docker container that runs on CUDA.
- Clone the repository and navigate into it:
```
git clone https://github.com/BramNH/wyoming-faster-whisper-docker-cuda
cd wyoming-faster-whisper-docker-cuda
```
- Because my maximum supported CUDA version is 12.2, I use the following base image in `Dockerfile` to include the CUDA environment in the built image:
```
FROM nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04
```
Faster Whisper requires the cudnn8 runtime variant of the CUDA image. You might need another image based on your CUDA version and Linux distribution (see all possible images).
- Build the image:
```
docker build --tag wyoming-whisper .
```
- Edit the container configuration in `compose.yml` to specify which model to run (see the sketch after this list). For example: `--model ellisd/faster-whisper-large-v3-int8 --language nl`
- Start the container with Docker Compose:
```
docker compose up -d
```
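The `compose.yml` in the repository is authoritative; the sketch below only illustrates roughly what it looks like, assuming the default Wyoming port 10300 and the GPU reservation syntax of Docker Compose:

```yaml
# Rough sketch of a compose.yml for the CUDA whisper container (the repo's file may differ)
services:
  wyoming-whisper:
    image: wyoming-whisper
    command: >
      --model ellisd/faster-whisper-large-v3-int8
      --language nl
      --uri tcp://0.0.0.0:10300
      --data-dir /data
    volumes:
      - ./data:/data
    ports:
      - "10300:10300"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

Once the container is running, point the Wyoming integration in Home Assistant at the host's IP and the port you exposed (10300 in this sketch), then select Whisper as the Speech-to-Text engine of your Assist pipeline.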
Llama-cpp-python
We set up llama-cpp-python specifically to work in combination with the Functionary LLM. There seems to be a bug with the chat format in the latest llama-cpp-python release, so this image pins version llama-cpp-python==0.2.64, which is stable.
- Clone the repository to get the necessary files to build and run the Docker container, then navigate into the folder:
```
git clone https://github.com/BramNH/llama-cpp-python-docker-cuda
cd llama-cpp-python-docker-cuda
```
- Llama-cpp requires the devel CUDA image for GPU support, so I import the following image in `Dockerfile`. You might have to change this to match your CUDA version / Linux distribution (see all possible images):
```
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
```
- Build the Docker image with the included `Dockerfile`:
```
docker build --tag llama-cpp-python .
```
- You can run the container using the included `compose.yml`:
```
docker compose up -d
```
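Once the container is up, a quick way to confirm that the OpenAI-compatible API is reachable (assuming port 8000 as in the Vast.ai options earlier) is:

```bash
# llama-cpp-python serves an OpenAI-compatible API; this should return the loaded model
curl http://localhost:8000/v1/models
```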
Extended OpenAI
The Extended OpenAI HACS integration talks to the OpenAI-compatible API served by llama-cpp-python. Some modifications were also necessary to get the HACS integration working with Functionary and llama-cpp-python; see this discussion.
You can either re-install the HACS integration using my fork of Extended OpenAI, or replace the `__init__.py` file within the `/custom_components/extended_openai_conversation` folder of your Home Assistant installation with the file in my fork.
Follow the Extended OpenAI guide on how to create your own functions that the LLM can call.
Important settings when using the Functionary LLM:
- Enable `Use Tools` if you defined your own functions.
- Set `Context Threshold` to 8000 so messages are cleared after 8k tokens; otherwise the model gets confused once the threshold is exceeded.
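One thing that is easy to miss when pointing the integration at a local server instead of OpenAI is the connection details. Roughly (the host is a placeholder, and llama-cpp-python ignores the API key unless you explicitly configured one; drop the `/v1` suffix if your version of the integration appends it automatically):

```
Base URL: http://<llama-cpp-host>:8000/v1
API Key:  any non-empty string
```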
Credits
- FutureProofHomes for making Functionary work with Extended OpenAI and llama-cpp-python.
- Min Jekal for creating the Extended OpenAI integration!
- m50 for the ha-fallback-conversation integration.