AI Voice Control for Home Assistant (Fully Local)

Seems like I made a mistake while updating the init.py file, my mistake, sorry for the inconvenience!
After using your fork it seems to work and I get the correct answer to my question (like “light turned off”); maybe I forgot to restart? I do not know.

But thank you very much for the fast and helpful support! Can’t stress that enough!

No worries, nice you got it all working! :smiley:
Could you post your benchmarks here? I’m curious what the natural language processing time on the RTX 2070 is. You can check it in the pipeline debug.

Yes, I will surely look into that tomorrow; now it’s time for bed :smiley:


Hey, I installed it on Ubuntu now and got it working from an external SSD. Now I’m facing the following problem, and I don’t know enough about Linux and Docker to fix it…
NVIDIA’s container toolkit is installed, GPU drivers too, and docker.io is installed. What am I doing wrong? I attempted to change the CUDA version in the Dockerfile to 12.2.0 and run it again, but when using the run command from the README it still shows CUDA 12.1.1. I’m thinking maybe that’s the issue?

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama_cpp.py", line 70, in _load_shared_library
    return ctypes.CDLL(str(_lib_path), **cdll_args)  # type: ignore
  File "/usr/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcuda.so.1: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/usr/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/__init__.py", line 1, in <module>
    from .llama_cpp import *
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama_cpp.py", line 83, in <module>
    _lib = _load_shared_library(_lib_base_name)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama_cpp.py", line 72, in _load_shared_library
    raise RuntimeError(f"Failed to load shared library '{_lib_path}': {e}")
RuntimeError: Failed to load shared library '/usr/local/lib/python3.10/dist-packages/llama_cpp/libllama.so': libcuda.so.1: cannot open shared object file: No such file or directory

Which command are you running?

Did you build your llama-cpp-python image using docker build --tag llama-cpp-python . and then run it with docker compose up?

Or are you executing docker run -p 8000:8000 -e USE_MLOCK=0 -e HF_MODEL_REPO_ID=meetkai/functionary-small-v2.4-GGUF -e MODEL=functionary-small-v2.4.Q4_0.gguf -e HF_PRETRAINED_MODEL_NAME_OR_PATH=meetkai/functionary-small-v2.4-GGUF -e N_GPU_LAYERS=33 -e CHAT_FORMAT=functionary-v2 -e N_CTX=4092 -e N_BATCH=192 -e N_THREADS=6 bramnh/llama-cpp-python:latest?

If you used the second command: it runs the image bramnh/llama-cpp-python:latest directly and not the one you potentially built locally on your system, so any changes you made to your Dockerfile are not applied.

I am using CUDA 12.4.1 and get the same error when I just execute the docker run command from above.

If this applies to you, please try again by following the instructions given here in the Build image section.
Edit: I would propose that you do not run docker compose up -d the first time, but docker compose up. The -d detaches from the container, so it runs in the background and you will not see its output. By omitting the -d you will be able to see the output.
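
For reference, the whole sequence should look roughly like this (assuming the repository’s Dockerfile and compose.yml are in your current directory):

```bash
# Build the image locally so your Dockerfile changes are actually used
docker build --tag llama-cpp-python .

# Start it in the foreground first so you can watch the output;
# switch to "docker compose up -d" once everything works
docker compose up
```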

If you need more assistance feel free to ask (even though I am new to this as well and do not know how far I can really help haha :D)


Thanks!
I first tried the first command, then the second. I built it again and tried docker compose up, but got the following driver issue:

Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown

Sometimes a reboot fixes it and I get the following output:

sudo docker compose up
[+] Running 1/0
 ✔ Container llama-cpp-python  Running                                     0.0s 
Attaching to llama-cpp-python

Don’t bother - it’s now running lol. I just had to shut down the container and start it back up… Super weird.


Is there any possibility of changing the functionary quantization/model from say Q4 to Q8 or small to medium?

Short “benchmark” with an RTX 2070

Speech to text using Whisper

< 0.8 seconds for normal voice commands (like “turn the light on”) running the medium-int8 model with the language set to German. I am happy with that (and also with the precision with which it transcribes the words).

llama

Most of the time around 8.5 seconds, which is definitely too long to use in production; I need to upgrade, I guess.

Sometimes llama tells me that the light has been turned on, but it wasn’t actually turned on, and sometimes it just works. Is there a good way to debug this?
Same for getting the temperature of my room or something like that.
If I want to set the brightness I get “Something went wrong: Service light.set_brightness not found”. Is that something I would have to implement using custom functions?

I did not test this myself, but I would guess that you can change the model parameters in the .env file to match the model you want to use.
As this .env file is passed in via the compose.yml, a simple recreation of the container (docker compose down followed by docker compose up) should be enough.
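
As a rough sketch, the relevant .env entries might look something like this (variable names taken from the docker run command above; the Q8 file name is only illustrative, so check the actual file names in the Hugging Face repo):

```
# Hypothetical .env sketch: switch quantization by pointing MODEL at a
# different GGUF file from the same (or another) repository
HF_MODEL_REPO_ID=meetkai/functionary-small-v2.4-GGUF
MODEL=functionary-small-v2.4.Q8_0.gguf
HF_PRETRAINED_MODEL_NAME_OR_PATH=meetkai/functionary-small-v2.4-GGUF
CHAT_FORMAT=functionary-v2
N_GPU_LAYERS=33
N_CTX=4092
N_BATCH=192
N_THREADS=6
USE_MLOCK=0
```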

Depends on your hardware, but llama-cpp-python supports all quantizations, and the model can be changed in the .env file as mentioned by @tiko. The medium model in FP16 takes about 80GB VRAM, so probably any quantization is still too large (maybe Q4 fits?). The small model in FP16 takes 24GB VRAM, so it can fit in e.g. an RTX 3090. Any quantization of the small model will be smaller than 24GB (Q4 is about 5GB).

Oh, I did not expect that… the memory bandwidth is probably low (?).

You can enable verbose debugging in the Extended OpenAI integration and the logs will show you (in JSON format) which function the LLM called.
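
Alternatively, you can raise the log level through Home Assistant’s logger integration; a minimal sketch, assuming the component’s domain is custom_components.extended_openai_conversation:

```yaml
# configuration.yaml (sketch): enable debug logging for the integration
logger:
  default: warning
  logs:
    custom_components.extended_openai_conversation: debug
```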

Yes. See my examples in the original post above for example code of functions. And make sure you enable “Use Tools” in Extended OpenAI.
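
For the brightness case specifically: there is no light.set_brightness service in Home Assistant; brightness is set via light.turn_on. A rough sketch of a custom function in the style of the integration’s examples (the names and parameters here are illustrative, adapt them to your setup):

```yaml
# Hypothetical custom function: expose a set_brightness tool that maps
# to the real light.turn_on service with a brightness percentage
- spec:
    name: set_brightness
    description: Set the brightness of a light
    parameters:
      type: object
      properties:
        entity_id:
          type: string
          description: The light entity, e.g. light.living_room
        brightness_pct:
          type: integer
          description: Brightness in percent (0-100)
      required:
        - entity_id
        - brightness_pct
  function:
    type: script
    sequence:
      - service: light.turn_on
        target:
          entity_id: "{{ entity_id }}"
        data:
          brightness_pct: "{{ brightness_pct }}"
```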

Could you post your llama-cpp-python logs here? The RTX 2070 does have tensor cores, which it should use, and its memory bandwidth is larger than my GTX 1080’s. So definitely faster speeds than 8 seconds are expected.

Yes I will do that, but probably not before Friday :frowning:


Are the logs just the output of the container or do I have to do something else?

No, just the output of the container from startup (how the model is loaded and the inference times, so I can see tokens/s).
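
If your container is running detached, something like this should show its output (assuming the service name llama-cpp-python from the compose file):

```bash
# Follow the container output from startup onwards
docker compose logs -f llama-cpp-python
# or, equivalently, by container name:
docker logs -f llama-cpp-python
```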

Here is the output I got when I started the container, I hope this helps:

llama-cpp-python  | 
llama-cpp-python  | ==========
llama-cpp-python  | == CUDA ==
llama-cpp-python  | ==========
llama-cpp-python  | 
llama-cpp-python  | CUDA Version 12.4.1
llama-cpp-python  | 
llama-cpp-python  | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
llama-cpp-python  | 
llama-cpp-python  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
llama-cpp-python  | By pulling and using the container, you accept the terms and conditions of this license:
llama-cpp-python  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
llama-cpp-python  | 
llama-cpp-python  | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
llama-cpp-python  | 
llama-cpp-python  | None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
tokenizer_config.json: 100% 2.86k/2.86k [00:00<00:00, 18.3MB/s]
tokenizer.model: 100% 493k/493k [00:00<00:00, 5.97MB/s]
tokenizer.json: 100% 1.80M/1.80M [00:00<00:00, 3.03MB/s]
added_tokens.json: 100% 95.0/95.0 [00:00<00:00, 938kB/s]
special_tokens_map.json: 100% 660/660 [00:00<00:00, 6.67MB/s]
llama-cpp-python  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
llama-cpp-python  | llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from /var/model/functionary-small-v2.4.Q4_0.gguf (version GGUF V3 (latest))
llama-cpp-python  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama-cpp-python  | llama_model_loader: - kv   0:                       general.architecture str              = llama
llama-cpp-python  | llama_model_loader: - kv   1:                               general.name str              = .
llama-cpp-python  | llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32004
llama-cpp-python  | llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama-cpp-python  | llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama-cpp-python  | llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama-cpp-python  | llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama-cpp-python  | llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama-cpp-python  | llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama-cpp-python  | llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama-cpp-python  | llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama-cpp-python  | llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 1000000.000000
llama-cpp-python  | llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama-cpp-python  | llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama-cpp-python  | llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32004]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama-cpp-python  | llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32004]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama-cpp-python  | llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32004]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama-cpp-python  | llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama-cpp-python  | llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama-cpp-python  | llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama-cpp-python  | llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
llama-cpp-python  | llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama-cpp-python  | llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama-cpp-python  | llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama-cpp-python  | llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama-cpp-python  | llama_model_loader: - type  f32:   65 tensors
llama-cpp-python  | llama_model_loader: - type q4_0:  225 tensors
llama-cpp-python  | llama_model_loader: - type q6_K:    1 tensors
llama-cpp-python  | llm_load_vocab: special tokens definition check successful ( 263/32004 ).
llama-cpp-python  | llm_load_print_meta: format           = GGUF V3 (latest)
llama-cpp-python  | llm_load_print_meta: arch             = llama
llama-cpp-python  | llm_load_print_meta: vocab type       = SPM
llama-cpp-python  | llm_load_print_meta: n_vocab          = 32004
llama-cpp-python  | llm_load_print_meta: n_merges         = 0
llama-cpp-python  | llm_load_print_meta: n_ctx_train      = 32768
llama-cpp-python  | llm_load_print_meta: n_embd           = 4096
llama-cpp-python  | llm_load_print_meta: n_head           = 32
llama-cpp-python  | llm_load_print_meta: n_head_kv        = 8
llama-cpp-python  | llm_load_print_meta: n_layer          = 32
llama-cpp-python  | llm_load_print_meta: n_rot            = 128
llama-cpp-python  | llm_load_print_meta: n_embd_head_k    = 128
llama-cpp-python  | llm_load_print_meta: n_embd_head_v    = 128
llama-cpp-python  | llm_load_print_meta: n_gqa            = 4
llama-cpp-python  | llm_load_print_meta: n_embd_k_gqa     = 1024
llama-cpp-python  | llm_load_print_meta: n_embd_v_gqa     = 1024
llama-cpp-python  | llm_load_print_meta: f_norm_eps       = 0.0e+00
llama-cpp-python  | llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llama-cpp-python  | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llama-cpp-python  | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llama-cpp-python  | llm_load_print_meta: f_logit_scale    = 0.0e+00
llama-cpp-python  | llm_load_print_meta: n_ff             = 14336
llama-cpp-python  | llm_load_print_meta: n_expert         = 0
llama-cpp-python  | llm_load_print_meta: n_expert_used    = 0
llama-cpp-python  | llm_load_print_meta: causal attn      = 1
llama-cpp-python  | llm_load_print_meta: pooling type     = 0
llama-cpp-python  | llm_load_print_meta: rope type        = 0
llama-cpp-python  | llm_load_print_meta: rope scaling     = linear
llama-cpp-python  | llm_load_print_meta: freq_base_train  = 1000000.0
llama-cpp-python  | llm_load_print_meta: freq_scale_train = 1
llama-cpp-python  | llm_load_print_meta: n_yarn_orig_ctx  = 32768
llama-cpp-python  | llm_load_print_meta: rope_finetuned   = unknown
llama-cpp-python  | llm_load_print_meta: ssm_d_conv       = 0
llama-cpp-python  | llm_load_print_meta: ssm_d_inner      = 0
llama-cpp-python  | llm_load_print_meta: ssm_d_state      = 0
llama-cpp-python  | llm_load_print_meta: ssm_dt_rank      = 0
llama-cpp-python  | llm_load_print_meta: model type       = 8B
llama-cpp-python  | llm_load_print_meta: model ftype      = Q4_0
llama-cpp-python  | llm_load_print_meta: model params     = 7.24 B
llama-cpp-python  | llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
llama-cpp-python  | llm_load_print_meta: general.name     = .
llama-cpp-python  | llm_load_print_meta: BOS token        = 1 '<s>'
llama-cpp-python  | llm_load_print_meta: EOS token        = 2 '</s>'
llama-cpp-python  | llm_load_print_meta: UNK token        = 0 '<unk>'
llama-cpp-python  | llm_load_print_meta: PAD token        = 2 '</s>'
llama-cpp-python  | llm_load_print_meta: LF token         = 13 '<0x0A>'
llama-cpp-python  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
llama-cpp-python  | ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
llama-cpp-python  | ggml_cuda_init: found 1 CUDA devices:
llama-cpp-python  |   Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
llama-cpp-python  | llm_load_tensors: ggml ctx size =    0.30 MiB
llama-cpp-python  | llm_load_tensors: offloading 32 repeating layers to GPU
llama-cpp-python  | llm_load_tensors: offloading non-repeating layers to GPU
llama-cpp-python  | llm_load_tensors: offloaded 33/33 layers to GPU
llama-cpp-python  | llm_load_tensors:        CPU buffer size =    70.32 MiB
llama-cpp-python  | llm_load_tensors:      CUDA0 buffer size =  3847.57 MiB
llama-cpp-python  | ..................................................................................................
llama-cpp-python  | llama_new_context_with_model: n_ctx      = 4096
llama-cpp-python  | llama_new_context_with_model: n_batch    = 192
llama-cpp-python  | llama_new_context_with_model: n_ubatch   = 192
llama-cpp-python  | llama_new_context_with_model: freq_base  = 1000000.0
llama-cpp-python  | llama_new_context_with_model: freq_scale = 1
llama-cpp-python  | llama_kv_cache_init:      CUDA0 KV buffer size =   512.00 MiB
llama-cpp-python  | llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama-cpp-python  | llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
llama-cpp-python  | llama_new_context_with_model:      CUDA0 compute buffer size =   111.00 MiB
llama-cpp-python  | llama_new_context_with_model:  CUDA_Host compute buffer size =     6.00 MiB
llama-cpp-python  | llama_new_context_with_model: graph nodes  = 1030
llama-cpp-python  | llama_new_context_with_model: graph splits = 2
llama-cpp-python  | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
llama-cpp-python  | Model metadata: {'tokenizer.chat_template': '{% for message in messages %}\n{% if message[\'role\'] == \'user\' or message[\'role\'] == \'system\' %}\n{{ \'<|from|>\' + message[\'role\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% elif message[\'role\'] == \'tool\' %}\n{{ \'<|from|>\' + message[\'name\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% else %}\n{% set contain_content=\'no\'%}\n{% if message[\'content\'] is not none %}\n{{ \'<|from|>assistant\n<|recipient|>all\n<|content|>\' + message[\'content\'] }}{% set contain_content=\'yes\'%}\n{% endif %}\n{% if \'tool_calls\' in message and message[\'tool_calls\'] is not none %}\n{% for tool_call in message[\'tool_calls\'] %}\n{% set prompt=\'<|from|>assistant\n<|recipient|>\' + tool_call[\'function\'][\'name\'] + \'\n<|content|>\' + tool_call[\'function\'][\'arguments\'] %}\n{% if loop.index == 1 and contain_content == "no" %}\n{{ prompt }}{% else %}\n{{ \'\n\' + prompt}}{% endif %}\n{% endfor %}\n{% endif %}\n{{ \'<|stop|>\n\' }}{% endif %}\n{% endfor %}\n{% if add_generation_prompt %}{{ \'<|from|>assistant\n<|recipient|>\' }}{% endif %}', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'llama.context_length': '32768', 'general.name': '.', 'llama.vocab_size': '32004', 'general.file_type': '2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8'}
llama-cpp-python  | INFO:     Started server process [27]
llama-cpp-python  | INFO:     Waiting for application startup.
llama-cpp-python  | INFO:     Application startup complete.
llama-cpp-python  | INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Edit:

This is when I asked it “How old is the Earth” or “Wie alt ist die Erde” in German.

llama-cpp-python  | llama_print_timings:        load time =     281.85 ms
llama-cpp-python  | llama_print_timings:      sample time =       0.34 ms /     3 runs   (    0.11 ms per token,  8771.93 tokens per second)
llama-cpp-python  | llama_print_timings: prompt eval time =    2562.53 ms /  2476 tokens (    1.03 ms per token,   966.23 tokens per second)
llama-cpp-python  | llama_print_timings:        eval time =      51.68 ms /     2 runs   (   25.84 ms per token,    38.70 tokens per second)
llama-cpp-python  | llama_print_timings:       total time =    6041.46 ms /  2478 tokens
llama-cpp-python  | Llama.generate: prefix-match hit
llama-cpp-python  | 
llama-cpp-python  | llama_print_timings:        load time =     281.85 ms
llama-cpp-python  | llama_print_timings:      sample time =       7.46 ms /    67 runs   (    0.11 ms per token,  8976.42 tokens per second)
llama-cpp-python  | llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama-cpp-python  | llama_print_timings:        eval time =    1227.71 ms /    67 runs   (   18.32 ms per token,    54.57 tokens per second)
llama-cpp-python  | llama_print_timings:       total time =    2168.88 ms /    68 tokens
llama-cpp-python  | INFO:     192.168.178.10:54778 - "POST /v1/chat/completions HTTP/1.1" 200 OK

You definitely have faster inference times (I get around 30 tokens/s, you get 54 tokens/s). But yours seems to be doing more “runs” (each run corresponds to one generated token). I am not an expert in LLMs, but that is what drives your sample time and eval time up.

Interesting, if I find time I will look into that, but a quick Google search did not really help. Anyway, thanks for the tip!


Hello, I just got functionary 2.5 working.

I found 2 problems so far.

  1. It doesn’t seem to know how to call functions to Home Assistant.
    I fixed it with an example prompt, i.e.:
I want you to act as smart home manager of Home Assistant.
I will provide information of smart home along with a question, you will truthfully make correction or answer using information provided in one sentence in everyday language.

Current Time: {{now()}}

Available Devices:
```csv
entity_id,name,state,aliases
{% for entity in exposed_entities -%}
{{ entity.entity_id }},{{ entity.name }},{{ entity.state }},{{entity.aliases | join('/')}}
{% endfor -%}
```

Example services:
- name: fan.set_percentage
  parameters:
    - percentage: int

The current state of devices is provided in available devices.
Use execute_services function only for requested action, not for current states.
Do not execute service without user's confirmation.
Do not restate or appreciate what user says, rather make a quick inquiry.
  2. The CSV that we provided was basically useless; it doesn’t know to look at it at all when asked things like “how is ROOMNAME”.

Does anyone know how to prompt the model to take a look at the CSV instead?

For anyone who is interested in running Functionary 2.5 with vLLM:
I’ve built an image without any COPY command for you to use.

https://hub.docker.com/r/zen3515/functionary-vllm

BTW, this version doesn’t work with the fork of extended_openai_conversation.