I have set up a relatively fast, fully local AI voice assistant for Home Assistant.
The following components are used:
- Wyoming Faster Whisper Docker container (build files)
- Llama-cpp-python Docker container (build files)
- Extended OpenAI HACS Integration (modified fork)
- Functionary Small V2.4 LLM (Q4)
- Nvidia GTX 1080 GPU
Update
I also got my AMD 6900XT GPU working with llama-cpp-python on my Windows PC, which can perform function calling in around 3 seconds! Let me know if you need help with installing llama-cpp(-python) for ROCm on Windows.
Credits
- FutureProofHomes for making Functionary work with Extended OpenAI and llama-cpp-python.
- Min Jekal for creating the Extended OpenAI integration!
The Story
I want to quickly update the community on the possibilities in AI, Voice Control and Home Assistant. I have been exploring the possibilities of running a fully local voice assistant in my home for quite a while now.
I know the majority of HA users run their instance on a small piece of hardware without much compute capability; this post is NOT for those users! My Home Assistant instance runs as a Docker container on an old PC that is now an Ubuntu server. I recently upgraded this PC with an Nvidia GTX 1080 GPU (around €100) to achieve the following:
- Run a local LLM (AI) model that is completely offloaded into my GPU’s VRAM.
- Run local STT with Whisper on my GPU with the large-v3-int8 model.
Further reading
The local STT using Whisper is far off Google's STT performance; it was therefore annoying to use with the default Assist of Home Assistant, since that requires precisely matching intent sentences. Especially in Dutch, it is very hard to always get the precise intent out of Whisper, and some words are often replaced by others (it feels like overkill to make a wildcard for these words). I therefore focused on using AI, so that you don't have to memorize any voice commands and it all feels more natural.
To my knowledge, there are two HACS integrations that support AI function calling as of now:
- Home-LLM: more focused on smaller HA (CPU only) setups and uses a relatively small LLM (3B parameters) that is trained on a custom Home Assistant Request dataset. However, it is also possible to train and use your own LLM.
- Extended OpenAI: an extension of the OpenAI integration in HA that supports function calling with the GPT-3.5/4 models (and other models that support function calling via OpenAI's API).
Then, there are multiple ways of setting up your own local LLM:
- LocalAI
- llama-cpp (-python)
- KoboldCPP (AMD GPU support)
- Many more!
I first used a combination of LocalAI and Home-LLM with my own custom model, trained on a Dutch-translated version of the training set from Home-LLM. I used Unsloth to train the Mistral 7B model using this Google Colab. It worked quite well for some functions (e.g. light brightness), but it is still far from a real AI experience. The largest downside of this integration is that you need to train the model for each function call, so it's not easy to add a feature.
I have now settled on llama-cpp-python and Extended OpenAI. I came across this YouTube video from FutureProofHomes and his journey in making a dedicated local AI-powered voice assistant. It's not exactly what I am looking for, since his dedicated hardware restrictions make the AI very slow. However, all credits go to FutureProofHomes for pointing me in this direction. Normally, Extended OpenAI only works with the GPT models that support function calling, so most models that you can run locally do not work. But there is now a model called Functionary that you can run locally and that provides even better function calling than the GPT models! Do note that chit-chatting with this model is never as good as with GPT. Some modifications in the source code of Extended OpenAI and llama-cpp-python were necessary to get this combination working.
It can all easily be made faster if you want to invest in it. For now, it seems best to buy a GPU with as much VRAM as possible and the highest CUDA compute capability. I might buy an RTX 3060 (12GB) or RTX 3090 (24GB) in the future! I was also able to run KoboldCPP on my desktop PC with my AMD Radeon 6900XT.
See below for the guide with all the code to get llama-cpp-python / Extended OpenAI / Functionary working together. Also let me know if you have any tips or suggestions on local AI voice assistants. I would love to hear about alternatives and processing-time benchmarks for other GPUs.
Installation Guide
This guide is specifically written for installing a local LLM voice assistant using Docker containers on a setup with an Nvidia GPU (CUDA) and Ubuntu 22.04. Since we are building our own Docker images, you might have to change a few things depending on your setup.
Prerequisites:
- Linux distribution: one that is supported by Nvidia Container Toolkit
- Docker container engine installed
- Nvidia GPU (including CUDA drivers); check your maximum supported CUDA version by running the command
$ nvidia-smi
- Nvidia Container Toolkit: to be able to run Docker containers on CUDA, follow this installation guide.
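After installing the toolkit, you can verify that Docker containers can access the GPU with a quick test (use any CUDA base image that matches your driver):
$ docker run --rm --gpus all nvidia/cuda:12.0.1-base-ubuntu22.04 nvidia-smi
If this prints the same GPU table as nvidia-smi on the host, the container runtime is set up correctly.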
Wyoming Faster Whisper
You can use this repository to build the wyoming-faster-whisper Docker container that runs on CUDA.
- Clone the repository and navigate into it:
$ git clone https://github.com/BramNH/wyoming-faster-whisper-docker-cuda
$ cd wyoming-faster-whisper-docker-cuda
- Because my maximum supported CUDA version is 12.2, I am using the following base image in `Dockerfile` to include the CUDA environment in the built image:
FROM nvidia/cuda:12.0.1-cudnn8-runtime-ubuntu22.04
Faster Whisper requires the cudnn8 runtime variant of the CUDA image. You might need another image based on your CUDA version and Linux distribution.
- Build the image:
$ docker build --tag wyoming-whisper .
- Edit the container configuration in `compose.yml` to specify which model to run, for example: `--model ellisd/faster-whisper-large-v3-int8 --language nl` (a full `compose.yml` sketch is shown after these steps).
- Start the container with Docker Compose:
$ docker compose up -d
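For reference, here is a minimal sketch of what the `compose.yml` for this container can look like. The service name, port and flags are from my setup and the wyoming-faster-whisper defaults; adjust them to yours:

```yaml
services:
  wyoming-whisper:
    image: wyoming-whisper
    # Model, language and listening address for the Wyoming protocol server
    command: --model ellisd/faster-whisper-large-v3-int8 --language nl --uri tcp://0.0.0.0:10300 --device cuda
    ports:
      - "10300:10300"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```

Home Assistant's Wyoming integration can then connect to port 10300 on the Docker host.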
Llama-cpp-python
We set up llama-cpp-python to work specifically in combination with the Functionary LLM. Because llama-cpp did not work out of the box with Functionary, a few changes had to be made to `llama_types.py`
from the llama-cpp-python directory. During the Docker image build, the modified file is copied into the Python dependency folder of llama-cpp. This is a temporary solution and might be changed or fixed in the future.
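In the repository's `Dockerfile` this patch boils down to a single copy step; a sketch (the exact site-packages path is an assumption and depends on the Python version in the image):

```dockerfile
# Assumed path: overwrite the installed llama_types.py with the patched copy
COPY llama_types.py /usr/local/lib/python3.10/dist-packages/llama_cpp/llama_types.py
```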
- Clone the repository to get the necessary files to build and run the Docker container, then navigate into the folder:
$ git clone https://github.com/BramNH/llama-cpp-python-docker-cuda
$ cd llama-cpp-python-docker-cuda
- Llama-cpp requires the devel CUDA image for GPU support, so I import the following base image in `Dockerfile`. You might have to change this to match your CUDA version / Linux distribution:
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04
- Build the Docker image with the included `Dockerfile`:
$ docker build --tag llama-cpp-python .
- You can run the container using the included `compose.yml`:
$ docker compose up -d
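Once the container is running, you can sanity-check function calling against the OpenAI-compatible endpoint before wiring it into Home Assistant. A minimal sketch; the port, model alias and the turn_on_light tool are illustrative assumptions, not part of the repo:

```python
from openai import OpenAI

# llama-cpp-python serves an OpenAI-compatible API; no real key is needed locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-local")

# A hypothetical tool definition, just to see whether the model emits a tool call.
tools = [
    {
        "type": "function",
        "function": {
            "name": "turn_on_light",
            "description": "Turn on a light in a given room.",
            "parameters": {
                "type": "object",
                "properties": {"room": {"type": "string"}},
                "required": ["room"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="functionary",  # whatever model alias your server is configured with
    messages=[{"role": "user", "content": "Turn on the kitchen light."}],
    tools=tools,
    tool_choice="auto",
)
print(response.choices[0].message.tool_calls)
```

If Functionary is working, the response should contain a tool call with `{"room": "kitchen"}` as arguments rather than a plain chat answer.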
Extended OpenAI
The Extended OpenAI HACS integration will talk to the OpenAI-compatible API served by llama-cpp-python. Some modifications were also necessary to get the HACS integration working with Functionary and llama-cpp-python; see this discussion. You can either re-install the HACS integration using my fork of Extended OpenAI, or replace the `__init__.py` file within the `/custom_components/extended_openai_conversation` folder with the version below:

```python
"""The OpenAI Conversation integration."""
from __future__ import annotations
import json
import logging
from typing import Literal
from openai import AsyncAzureOpenAI, AsyncOpenAI
from openai._exceptions import AuthenticationError, OpenAIError
from openai.types.chat.chat_completion import (
ChatCompletion,
ChatCompletionMessage,
Choice,
)
import yaml
from homeassistant.components import conversation
from homeassistant.components.homeassistant.exposed_entities import async_should_expose
from homeassistant.config_entries import ConfigEntry
from homeassistant.const import ATTR_NAME, CONF_API_KEY, MATCH_ALL
from homeassistant.core import HomeAssistant
from homeassistant.exceptions import (
ConfigEntryNotReady,
HomeAssistantError,
TemplateError,
)
from homeassistant.helpers import (
config_validation as cv,
entity_registry as er,
intent,
template,
)
from homeassistant.helpers.typing import ConfigType
from homeassistant.util import ulid
from .const import (
CONF_API_VERSION,
CONF_ATTACH_USERNAME,
CONF_BASE_URL,
CONF_CHAT_MODEL,
CONF_CONTEXT_THRESHOLD,
CONF_CONTEXT_TRUNCATE_STRATEGY,
CONF_FUNCTIONS,
CONF_MAX_FUNCTION_CALLS_PER_CONVERSATION,
CONF_MAX_TOKENS,
CONF_ORGANIZATION,
CONF_PROMPT,
CONF_SKIP_AUTHENTICATION,
CONF_TEMPERATURE,
CONF_TOP_P,
CONF_USE_TOOLS,
DEFAULT_ATTACH_USERNAME,
DEFAULT_CHAT_MODEL,
DEFAULT_CONF_FUNCTIONS,
DEFAULT_CONTEXT_THRESHOLD,
DEFAULT_CONTEXT_TRUNCATE_STRATEGY,
DEFAULT_MAX_FUNCTION_CALLS_PER_CONVERSATION,
DEFAULT_MAX_TOKENS,
DEFAULT_PROMPT,
DEFAULT_SKIP_AUTHENTICATION,
DEFAULT_TEMPERATURE,
DEFAULT_TOP_P,
DEFAULT_USE_TOOLS,
DOMAIN,
EVENT_CONVERSATION_FINISHED,
)
from .exceptions import (
FunctionLoadFailed,
FunctionNotFound,
InvalidFunction,
ParseArgumentsFailed,
TokenLengthExceededError,
)
from .helpers import (
get_function_executor,
is_azure,
validate_authentication,
)
from .services import async_setup_services
_LOGGER = logging.getLogger(__name__)
CONFIG_SCHEMA = cv.config_entry_only_config_schema(DOMAIN)
# hass.data key for agent.
DATA_AGENT = "agent"
async def async_setup(hass: HomeAssistant, config: ConfigType) -> bool:
"""Set up OpenAI Conversation."""
await async_setup_services(hass, config)
return True
async def async_setup_entry(hass: HomeAssistant, entry: ConfigEntry) -> bool:
"""Set up OpenAI Conversation from a config entry."""
try:
await validate_authentication(
hass=hass,
api_key=entry.data[CONF_API_KEY],
base_url=entry.data.get(CONF_BASE_URL),
api_version=entry.data.get(CONF_API_VERSION),
organization=entry.data.get(CONF_ORGANIZATION),
skip_authentication=entry.data.get(
CONF_SKIP_AUTHENTICATION, DEFAULT_SKIP_AUTHENTICATION
),
)
except AuthenticationError as err:
_LOGGER.error("Invalid API key: %s", err)
return False
except OpenAIError as err:
raise ConfigEntryNotReady(err) from err
agent = OpenAIAgent(hass, entry)
data = hass.data.setdefault(DOMAIN, {}).setdefault(entry.entry_id, {})
data[CONF_API_KEY] = entry.data[CONF_API_KEY]
data[DATA_AGENT] = agent
conversation.async_set_agent(hass, entry, agent)
return True
async def async_unload_entry(hass: HomeAssistant, entry: ConfigEntry) -> bool:
"""Unload OpenAI."""
hass.data[DOMAIN].pop(entry.entry_id)
conversation.async_unset_agent(hass, entry)
return True
class OpenAIAgent(conversation.AbstractConversationAgent):
"""OpenAI conversation agent."""
def __init__(self, hass: HomeAssistant, entry: ConfigEntry) -> None:
"""Initialize the agent."""
self.hass = hass
self.entry = entry
self.history: dict[str, list[dict]] = {}
base_url = entry.data.get(CONF_BASE_URL)
if is_azure(base_url):
self.client = AsyncAzureOpenAI(
api_key=entry.data[CONF_API_KEY],
azure_endpoint=base_url,
api_version=entry.data.get(CONF_API_VERSION),
organization=entry.data.get(CONF_ORGANIZATION),
)
else:
self.client = AsyncOpenAI(
api_key=entry.data[CONF_API_KEY],
base_url=base_url,
organization=entry.data.get(CONF_ORGANIZATION),
)
@property
def supported_languages(self) -> list[str] | Literal["*"]:
"""Return a list of supported languages."""
return MATCH_ALL
async def async_process(
self, user_input: conversation.ConversationInput
) -> conversation.ConversationResult:
exposed_entities = self.get_exposed_entities()
if user_input.conversation_id in self.history:
conversation_id = user_input.conversation_id
else:
conversation_id = ulid.ulid()
user_input.conversation_id = conversation_id
try:
system_message = self._generate_system_message(
exposed_entities, user_input
)
except TemplateError as err:
_LOGGER.error("Error rendering prompt: %s", err)
intent_response = intent.IntentResponse(language=user_input.language)
intent_response.async_set_error(
intent.IntentResponseErrorCode.UNKNOWN,
f"Sorry, I had a problem with my template: {err}",
)
return conversation.ConversationResult(
response=intent_response, conversation_id=conversation_id
)
messages = [system_message]
user_message = {"role": "user", "content": user_input.text}
if self.entry.options.get(CONF_ATTACH_USERNAME, DEFAULT_ATTACH_USERNAME):
user = await self.hass.auth.async_get_user(user_input.context.user_id)
if user is not None and user.name is not None:
user_message[ATTR_NAME] = user.name
messages.append(user_message)
try:
query_response = await self.query(user_input, messages, exposed_entities, 0)
except OpenAIError as err:
_LOGGER.error(err)
intent_response = intent.IntentResponse(language=user_input.language)
intent_response.async_set_error(
intent.IntentResponseErrorCode.UNKNOWN,
f"Sorry, I had a problem talking to OpenAI: {err}",
)
return conversation.ConversationResult(
response=intent_response, conversation_id=conversation_id
)
except HomeAssistantError as err:
_LOGGER.error(err, exc_info=err)
intent_response = intent.IntentResponse(language=user_input.language)
intent_response.async_set_error(
intent.IntentResponseErrorCode.UNKNOWN,
f"Something went wrong: {err}",
)
return conversation.ConversationResult(
response=intent_response, conversation_id=conversation_id
)
messages.append(query_response.message.model_dump())
self.hass.bus.async_fire(
EVENT_CONVERSATION_FINISHED,
{
"response": query_response.response.model_dump(),
"user_input": user_input,
"messages": messages,
},
)
intent_response = intent.IntentResponse(language=user_input.language)
intent_response.async_set_speech(query_response.message.content)
return conversation.ConversationResult(
response=intent_response, conversation_id=conversation_id
)
def _generate_system_message(
self, exposed_entities, user_input: conversation.ConversationInput
):
raw_prompt = self.entry.options.get(CONF_PROMPT, DEFAULT_PROMPT)
prompt = self._async_generate_prompt(raw_prompt, exposed_entities, user_input)
return {"role": "system", "content": prompt}
def _async_generate_prompt(
self,
raw_prompt: str,
exposed_entities,
user_input: conversation.ConversationInput,
) -> str:
"""Generate a prompt for the user."""
return template.Template(raw_prompt, self.hass).async_render(
{
"ha_name": self.hass.config.location_name,
"exposed_entities": exposed_entities,
"current_device_id": user_input.device_id,
},
parse_result=False,
)
def get_exposed_entities(self):
states = [
state
for state in self.hass.states.async_all()
if async_should_expose(self.hass, conversation.DOMAIN, state.entity_id)
]
entity_registry = er.async_get(self.hass)
exposed_entities = []
for state in states:
entity_id = state.entity_id
entity = entity_registry.async_get(entity_id)
aliases = []
if entity and entity.aliases:
aliases = entity.aliases
exposed_entities.append(
{
"entity_id": entity_id,
"name": state.name,
"state": self.hass.states.get(entity_id).state,
"aliases": aliases,
}
)
return exposed_entities
def get_functions(self):
try:
function = self.entry.options.get(CONF_FUNCTIONS)
result = yaml.safe_load(function) if function else DEFAULT_CONF_FUNCTIONS
if result:
for setting in result:
function_executor = get_function_executor(
setting["function"]["type"]
)
setting["function"] = function_executor.to_arguments(
setting["function"]
)
return result
except (InvalidFunction, FunctionNotFound) as e:
raise e
        except Exception:
            raise FunctionLoadFailed()
async def truncate_message_history(
self, messages, exposed_entities, user_input: conversation.ConversationInput
):
"""Truncate message history."""
strategy = self.entry.options.get(
CONF_CONTEXT_TRUNCATE_STRATEGY, DEFAULT_CONTEXT_TRUNCATE_STRATEGY
)
if strategy == "clear":
last_user_message_index = None
for i in reversed(range(len(messages))):
if messages[i]["role"] == "user":
last_user_message_index = i
break
if last_user_message_index is not None:
del messages[1:last_user_message_index]
# refresh system prompt when all messages are deleted
messages[0] = self._generate_system_message(
exposed_entities, user_input
)
async def query(
self,
user_input: conversation.ConversationInput,
messages,
exposed_entities,
n_requests,
) -> OpenAIQueryResponse:
"""Process a sentence."""
model = self.entry.options.get(CONF_CHAT_MODEL, DEFAULT_CHAT_MODEL)
max_tokens = self.entry.options.get(CONF_MAX_TOKENS, DEFAULT_MAX_TOKENS)
top_p = self.entry.options.get(CONF_TOP_P, DEFAULT_TOP_P)
temperature = self.entry.options.get(CONF_TEMPERATURE, DEFAULT_TEMPERATURE)
use_tools = self.entry.options.get(CONF_USE_TOOLS, DEFAULT_USE_TOOLS)
context_threshold = self.entry.options.get(
CONF_CONTEXT_THRESHOLD, DEFAULT_CONTEXT_THRESHOLD
)
functions = list(map(lambda s: s["spec"], self.get_functions()))
function_call = "auto"
if n_requests == self.entry.options.get(
CONF_MAX_FUNCTION_CALLS_PER_CONVERSATION,
DEFAULT_MAX_FUNCTION_CALLS_PER_CONVERSATION,
):
function_call = "none"
tool_kwargs = {"functions": functions, "function_call": function_call}
if use_tools:
tool_kwargs = {
"tools": [{"type": "function", "function": func} for func in functions],
"tool_choice": function_call,
}
if len(functions) == 0:
tool_kwargs = {}
_LOGGER.info("Prompt for %s: %s", model, messages)
response: ChatCompletion = await self.client.chat.completions.create(
model=model,
messages=messages,
max_tokens=max_tokens,
top_p=top_p,
temperature=temperature,
user=user_input.conversation_id,
**tool_kwargs,
)
_LOGGER.info("Response %s", response.model_dump(exclude_none=True))
if response.usage.total_tokens > context_threshold:
await self.truncate_message_history(messages, exposed_entities, user_input)
choice: Choice = response.choices[0]
message = choice.message
if choice.finish_reason == "function_call":
return await self.execute_function_call(
user_input, messages, message, exposed_entities, n_requests + 1
)
if choice.finish_reason == "tool_calls":
return await self.execute_tool_calls(
user_input, messages, message, exposed_entities, n_requests + 1
)
if choice.finish_reason == "length":
raise TokenLengthExceededError(response.usage.completion_tokens)
return OpenAIQueryResponse(response=response, message=message)
async def execute_function_call(
self,
user_input: conversation.ConversationInput,
messages,
message: ChatCompletionMessage,
exposed_entities,
n_requests,
) -> OpenAIQueryResponse:
function_name = message.function_call.name.strip()
function = next(
(s for s in self.get_functions() if s["spec"]["name"] == function_name),
None,
)
if function is not None:
return await self.execute_function(
user_input,
messages,
message,
exposed_entities,
n_requests,
function,
)
raise FunctionNotFound(function_name)
async def execute_function(
self,
user_input: conversation.ConversationInput,
messages,
message: ChatCompletionMessage,
exposed_entities,
n_requests,
function,
) -> OpenAIQueryResponse:
function_executor = get_function_executor(function["function"]["type"])
try:
arguments = json.loads(message.function_call.arguments)
except json.decoder.JSONDecodeError as err:
raise ParseArgumentsFailed(message.function_call.arguments) from err
result = await function_executor.execute(
self.hass, function["function"], arguments, user_input, exposed_entities
)
messages.append(
{
"role": "function",
"name": message.function_call.name,
"content": str(result),
}
)
return await self.query(user_input, messages, exposed_entities, n_requests)
async def execute_tool_calls(
self,
user_input: conversation.ConversationInput,
messages,
message: ChatCompletionMessage,
exposed_entities,
n_requests,
) -> OpenAIQueryResponse:
messages.append(message.model_dump())
for tool in message.tool_calls:
function_name = tool.function.name.strip()
function = next(
(s for s in self.get_functions() if s["spec"]["name"] == function_name),
None,
)
if function is not None:
result = await self.execute_tool_function(
user_input,
tool,
exposed_entities,
function,
)
else:
raise FunctionNotFound(function_name)
return await self.query(user_input, messages, exposed_entities, n_requests)
async def execute_tool_function(
self,
user_input: conversation.ConversationInput,
tool,
exposed_entities,
function,
) -> OpenAIQueryResponse:
function_executor = get_function_executor(function["function"]["type"])
try:
arguments = json.loads(tool.function.arguments)
except json.decoder.JSONDecodeError as err:
raise ParseArgumentsFailed(tool.function.arguments) from err
result = await function_executor.execute(
self.hass, function["function"], arguments, user_input, exposed_entities
)
return result
class OpenAIQueryResponse:
"""OpenAI query response value object."""
def __init__(
self, response: ChatCompletion, message: ChatCompletionMessage
) -> None:
"""Initialize OpenAI query response value object."""
self.response = response
        self.message = message
```
Follow the Extended OpenAI guide on how to create your own functions that the LLM can call; a sketch of a function definition is shown after the settings below.
Important settings when using the Functionary LLM:
- Enable Use Tools if you have defined your own functions;
- Set Context Threshold to 8000: the message history is cleared after 8k tokens, since the model otherwise gets confused past that threshold.
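For reference, function definitions in Extended OpenAI are a YAML list of specs. A trimmed sketch along the lines of the execute_services example from the Extended OpenAI README (check the README for the exact, complete definition):

```yaml
- spec:
    name: execute_services
    description: Use this function to execute services of devices in Home Assistant.
    parameters:
      type: object
      properties:
        list:
          type: array
          items:
            type: object
            properties:
              domain:
                type: string
                description: The domain of the service, e.g. light.
              service:
                type: string
                description: The service to be called, e.g. turn_on.
              service_data:
                type: object
                description: The service data object, including the entity_id.
            required: [domain, service, service_data]
  function:
    type: native
    name: execute_service
```

With Use Tools enabled, this spec is sent to Functionary as a tool definition, which is what the `execute_tool_calls` path in the `__init__.py` above handles.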