AI Voice Control for Home Assistant (Fully Local)

BramNH · July 19, 2024, 10:06am

Don’t try to skip authentication, if this step fails you won’t get any further. Did you configure the port correctly in Docker, such that it is accessible from outside your machine?

vunhtun · July 20, 2024, 3:16am

Bro, you are right, I have a problem with the firewall that makes it inaccessible from the outside.

Now I can connect it from my home assistant.

After some quick test, this is the result.

It take ~10s for Natural Language Processing with some normal question like: “what is 1+1” and “what is the capital of France” and ~13s for some question need function calling like “what is the weather like today”.

I’m not sure what should I do to improve operational efficiency.

I have a Rtx3060 running on Windows11. I install llama-cpp-python docker in a WSL with Cuda image 12.5.1-devel-ubuntu24.04. I’m using default .env file from your github and can see it load ~5GB vram and GPU working when I ask something.

This is my Docker log

2024-07-19 14:53:15 llama-cpp-python  | ==========
2024-07-19 14:53:15 llama-cpp-python  | == CUDA ==
2024-07-19 14:53:15 llama-cpp-python  | ==========
2024-07-19 14:53:15 llama-cpp-python  | 
2024-07-19 14:53:15 llama-cpp-python  | CUDA Version 12.5.1
2024-07-19 14:53:15 llama-cpp-python  | 
2024-07-19 14:53:15 llama-cpp-python  | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2024-07-19 14:53:15 llama-cpp-python  | 
2024-07-19 14:53:15 llama-cpp-python  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2024-07-19 14:53:15 llama-cpp-python  | By pulling and using the container, you accept the terms and conditions of this license:
2024-07-19 14:53:15 llama-cpp-python  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2024-07-19 14:53:15 llama-cpp-python  | 
2024-07-19 14:53:15 llama-cpp-python  | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
2024-07-19 14:53:15 llama-cpp-python  | 
2024-07-19 14:53:17 llama-cpp-python  | None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2024-07-19 14:53:18 llama-cpp-python  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from /root/.cache/huggingface/hub/models--meetkai--functionary-small-v2.4-GGUF/snapshots/a0d171eb78e02a58858c464e278234afbcf85c5c/./functionary-small-v2.4.Q4_0.gguf (version GGUF V3 (latest))
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   0:                       general.architecture str              = llama
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   1:                               general.name str              = .
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32004
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   5:                          llama.block_count u32              = 32
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 1000000.000000
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  12:                          general.file_type u32              = 2
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32004]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32004]   = [0.000000, 0.000000, 0.000000, 0.0000...
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32004]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  24:               general.quantization_version u32              = 2
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - type  f32:   65 tensors
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - type q4_0:  225 tensors
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - type q6_K:    1 tensors
2024-07-19 14:53:19 llama-cpp-python  | llm_load_vocab: special tokens definition check successful ( 263/32004 ).
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: format           = GGUF V3 (latest)
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: arch             = llama
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: vocab type       = SPM
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_vocab          = 32004
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_merges         = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_ctx_train      = 32768
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_embd           = 4096
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_head           = 32
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_head_kv        = 8
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_layer          = 32
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_rot            = 128
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_embd_head_k    = 128
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_embd_head_v    = 128
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_gqa            = 4
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_embd_k_gqa     = 1024
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_embd_v_gqa     = 1024
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: f_norm_eps       = 0.0e+00
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: f_logit_scale    = 0.0e+00
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_ff             = 14336
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_expert         = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_expert_used    = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: causal attn      = 1
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: pooling type     = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: rope type        = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: rope scaling     = linear
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: freq_base_train  = 1000000.0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: freq_scale_train = 1
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_yarn_orig_ctx  = 32768
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: rope_finetuned   = unknown
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: ssm_d_conv       = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: ssm_d_inner      = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: ssm_d_state      = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: ssm_dt_rank      = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: model type       = 8B
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: model ftype      = Q4_0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: model params     = 7.24 B
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: general.name     = .
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: BOS token        = 1 '<s>'
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: EOS token        = 2 '</s>'
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: UNK token        = 0 '<unk>'
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: PAD token        = 2 '</s>'
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: LF token         = 13 '<0x0A>'
2024-07-19 14:53:19 llama-cpp-python  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
2024-07-19 14:53:19 llama-cpp-python  | ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
2024-07-19 14:53:19 llama-cpp-python  | ggml_cuda_init: found 1 CUDA devices:
2024-07-19 14:53:19 llama-cpp-python  |   Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
2024-07-19 14:53:19 llama-cpp-python  | llm_load_tensors: ggml ctx size =    0.30 MiB
2024-07-19 14:53:22 llama-cpp-python  | llm_load_tensors: offloading 32 repeating layers to GPU
2024-07-19 14:53:22 llama-cpp-python  | llm_load_tensors: offloading non-repeating layers to GPU
2024-07-19 14:53:22 llama-cpp-python  | llm_load_tensors: offloaded 33/33 layers to GPU
2024-07-19 14:53:22 llama-cpp-python  | llm_load_tensors:        CPU buffer size =    70.32 MiB
2024-07-19 14:53:22 llama-cpp-python  | llm_load_tensors:      CUDA0 buffer size =  3847.57 MiB
2024-07-19 14:53:23 llama-cpp-python  | ..................................................................................................
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: n_ctx      = 4096
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: n_batch    = 192
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: n_ubatch   = 192
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: freq_base  = 1000000.0
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: freq_scale = 1
2024-07-19 14:53:23 llama-cpp-python  | llama_kv_cache_init:      CUDA0 KV buffer size =   512.00 MiB
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model:      CUDA0 compute buffer size =   111.00 MiB
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model:  CUDA_Host compute buffer size =     6.00 MiB
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: graph nodes  = 1030
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: graph splits = 2
2024-07-19 14:53:23 llama-cpp-python  | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
2024-07-19 14:53:23 llama-cpp-python  | Model metadata: {'tokenizer.chat_template': '{% for message in messages %}\n{% if message[\'role\'] == \'user\' or message[\'role\'] == \'system\' %}\n{{ \'<|from|>\' + message[\'role\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% elif message[\'role\'] == \'tool\' %}\n{{ \'<|from|>\' + message[\'name\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% else %}\n{% set contain_content=\'no\'%}\n{% if message[\'content\'] is not none %}\n{{ \'<|from|>assistant\n<|recipient|>all\n<|content|>\' + message[\'content\'] }}{% set contain_content=\'yes\'%}\n{% endif %}\n{% if \'tool_calls\' in message and message[\'tool_calls\'] is not none %}\n{% for tool_call in message[\'tool_calls\'] %}\n{% set prompt=\'<|from|>assistant\n<|recipient|>\' + tool_call[\'function\'][\'name\'] + \'\n<|content|>\' + tool_call[\'function\'][\'arguments\'] %}\n{% if loop.index == 1 and contain_content == "no" %}\n{{ prompt }}{% else %}\n{{ \'\n\' + prompt}}{% endif %}\n{% endfor %}\n{% endif %}\n{{ \'<|stop|>\n\' }}{% endif %}\n{% endfor %}\n{% if add_generation_prompt %}{{ \'<|from|>assistant\n<|recipient|>\' }}{% endif %}', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'llama.context_length': '32768', 'general.name': '.', 'llama.vocab_size': '32004', 'general.file_type': '2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8'}
2024-07-19 14:53:23 llama-cpp-python  | INFO:     Started server process [27]
2024-07-19 14:53:23 llama-cpp-python  | INFO:     Waiting for application startup.
2024-07-19 14:53:23 llama-cpp-python  | INFO:     Application startup complete.
2024-07-19 14:53:23 llama-cpp-python  | INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

This is the log when I ask a question from home assistant

2024-07-20 10:02:10 llama-cpp-python  | Llama.generate: prefix-match hit
2024-07-20 10:02:19 llama-cpp-python  | 
2024-07-20 10:02:19 llama-cpp-python  | llama_print_timings:        load time =    2423.11 ms
2024-07-20 10:02:19 llama-cpp-python  | llama_print_timings:      sample time =       3.27 ms /     8 runs   (    0.41 ms per token,  2447.98 tokens per second)
2024-07-20 10:02:19 llama-cpp-python  | llama_print_timings: prompt eval time =    3169.35 ms /  2475 tokens (    1.28 ms per token,   780.92 tokens per second)
2024-07-20 10:02:19 llama-cpp-python  | llama_print_timings:        eval time =     151.53 ms /     7 runs   (   21.65 ms per token,    46.19 tokens per second)
2024-07-20 10:02:19 llama-cpp-python  | llama_print_timings:       total time =    9274.52 ms /  2482 tokens
2024-07-20 10:02:19 llama-cpp-python  | from_string grammar:
2024-07-20 10:02:19 llama-cpp-python  | root ::= [{] space time-kv [}] space 
2024-07-20 10:02:19 llama-cpp-python  | space ::= space_3 
2024-07-20 10:02:19 llama-cpp-python  | time-kv ::= ["] [t] [i] [m] [e] ["] space [:] space time- 
2024-07-20 10:02:19 llama-cpp-python  | space_3 ::= [ ] | 
2024-07-20 10:02:19 llama-cpp-python  | time- ::= ["] [c] [u] [r] [r] [e] [n] [t] ["] | ["] [t] [o] [d] [a] [y] ["] | ["] [t] [o] [m] [o] [r] [r] [o] [w] ["] 
2024-07-20 10:02:19 llama-cpp-python  | 
2024-07-20 10:02:19 llama-cpp-python  | root ::= "{" space time-kv "}" space
2024-07-20 10:02:19 llama-cpp-python  | space ::= " "?
2024-07-20 10:02:19 llama-cpp-python  | time- ::= "\"current\"" | "\"today\"" | "\"tomorrow\""
2024-07-20 10:02:19 llama-cpp-python  | time-kv ::= "\"time\"" space ":" space time-
2024-07-20 10:02:19 llama-cpp-python  | Llama.generate: prefix-match hit
2024-07-20 10:02:20 llama-cpp-python  | 
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:        load time =    2423.11 ms
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:      sample time =      49.08 ms /     8 runs   (    6.13 ms per token,   163.00 tokens per second)
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:        eval time =     212.64 ms /     8 runs   (   26.58 ms per token,    37.62 tokens per second)
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:       total time =     508.52 ms /     9 tokens
2024-07-20 10:02:20 llama-cpp-python  | Llama.generate: prefix-match hit
2024-07-20 10:02:20 llama-cpp-python  | 
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:        load time =    2423.11 ms
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:      sample time =       0.44 ms /     1 runs   (    0.44 ms per token,  2262.44 tokens per second)
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings: prompt eval time =      42.50 ms /     7 tokens (    6.07 ms per token,   164.70 tokens per second)
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:       total time =      90.35 ms /     8 tokens
2024-07-20 10:02:20 llama-cpp-python  | INFO:     172.18.0.1:34866 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-07-20 10:02:21 llama-cpp-python  | Llama.generate: prefix-match hit
2024-07-20 10:02:23 llama-cpp-python  | 
2024-07-20 10:02:23 llama-cpp-python  | llama_print_timings:        load time =    2423.11 ms
2024-07-20 10:02:23 llama-cpp-python  | llama_print_timings:      sample time =      16.67 ms /    41 runs   (    0.41 ms per token,  2459.95 tokens per second)
2024-07-20 10:02:23 llama-cpp-python  | llama_print_timings: prompt eval time =     104.15 ms /    10 tokens (   10.41 ms per token,    96.02 tokens per second)
2024-07-20 10:02:23 llama-cpp-python  | llama_print_timings:        eval time =     990.18 ms /    40 runs   (   24.75 ms per token,    40.40 tokens per second)
2024-07-20 10:02:23 llama-cpp-python  | llama_print_timings:       total time =    2319.23 ms /    50 tokens
2024-07-20 10:02:23 llama-cpp-python  | INFO:     172.18.0.1:34866 - "POST /v1/chat/completions HTTP/1.1" 200 OK

BramNH · July 21, 2024, 10:53am

Hmm 10s is way too long. It might be related with your setup. It’s also possible to manually build llama-cpp-python directly on your Windows machine. It required me to build llama.cpp and refere to the built .dll in llama-cpp-python. You could try this.

briancmpbll · July 25, 2024, 12:55am

Do you think an external gpu enclosure would work for this? I don’t think I have the budget for a new system, but I could swing a 3060 12GB. I already have a mini PC with an i5 and 16 GB RAM. I’d have to look at the m.2 slots, but if I have a slot for it I could use an m.2 adapter for the eGPU.

I’m not really familiar enough with the bandwidth these models need to know if this is a path I could consider.

kltye · July 25, 2024, 5:04am

I have a NUC running an i5 7260U (2C/4T, 3.4GHz) and an external GPU box connected via Thunderbolt using an RTX 3080. I find that the CPU is pegged when doing chat completion/function calling; the GPU is hardly taxed (apart from VRAM usage). It takes forever to complete the call, so while it’s fine for experimenting, it’s definitely not fast enough to daily drive. faster-whisper is extremely fast, however.

AshaiRey · August 28, 2024, 12:28pm

Pfff, quiet a long read to get to this but this thread is loaded with info, thanks for that.
Quiet a few things on this complicated matter are getting clearer.
But there is one thing that is not quiet sure of.
I have a small nuc that runs HA happily but it is nowhere near of capable of running a LLM. Now I was thinking of getting myself better hardware just for the LLM only. This would be a separate machine. The idea is to let the nuc with HA on it make calls to the dedicated LLM server. Will this work?

BramNH · August 28, 2024, 12:48pm

Ofcourse! I bought a GPU that I inserted into my server running HA, and while HA is a Docker container that runs processes on my CPU, all my AI related stuff runs on Docker containers that utilizes my GPU (Whisper, llama-cpp-python).

All I do within Extended OpenAI is connect to the localhost on a certain port, but if you run it on any other machine would work as well (required you set the ports open and the machine is reachable etc).

On another note. I know the Open Home Foundation is looking with Nvidia for specific hardware solutions, but expect them to be very expensive. I like my GPU solution for testing

AshaiRey · August 28, 2024, 1:28pm

Thx for the quick response.
This sounds great. This way i can keep my trusted HA nuc safe from experiments. Keeping my house in an operational state is my highest priority

AshaiRey · September 10, 2024, 4:21pm

Thanks due all the info here and the help of this thread starter I managed to get a local LLM up and running. I have written down my experiences with setting up a local LLM
Hopefully this will help starters just like me to get started quickly

Local LLM for Dummies (just like me)

prudy · September 25, 2024, 1:27pm

What is the external GPU power consumption when used solely for this?

kltye · September 29, 2024, 1:45am

I didn’t measure that, and I’ve since switched over to using a regular PC and using the RTX 3080 in that. It’s a Skylake i7-6700k (my old workstation), and it’s drawing 40 watts at idle with 16GB RAM and a M.2 SATA drive. The power supply is an older 80+ Bronze, so I’m guessing the GPU alone is only a couple of watts at idle. When inferencing, it shoots up to over 300 watts for a few seconds.

avsantos76 · November 12, 2024, 6:05pm

Hi,

Nice feature. It’s possible to implement this in an Intel NUC i3?

thank you

BramNH · November 12, 2024, 8:20pm

You can run LLMs on a CPU, but would bother trying, it’s just way too slow.

tiko · November 17, 2024, 5:52pm

Hey there, I tried to get it running again on a new environment and did not have any issues with the initial setup. Containers are running and connection from HomeAssistant works without a problem.

However somehow the return values from my LLM are not valid for the Extended OpenAI Conversation integration as I get a function not found error back.
I do not totally understand why as the parsing in general looks correct but maybe someone has already seen this issue.

Logs

2024-11-17 18:41:06.882 INFO (MainThread) [custom_components.extended_openai_conversation] Prompt for gpt-3.5-turbo-1106: [{'role': 'system', 'content': "I want you to act as smart home manager of Home Assistant.\nI will provide information of smart home along with a question, you will truthfully make correction or answer using information provided in one sentence in everyday language.\n\nCurrent Time: 2024-11-17 18:41:06.874053+01:00\n\nAvailable Devices:\n```csv\nentity_id,name,state,aliases\nscript.move_spotify_playback,Spotify: Transfer and start spotify playback,off,\nscript.spotify_play_playlist,Spotify: Play searched playlist in area,off,\nbinary_sensor.pv_generating,PV generating,unavailable,PV Anlage\nsensor.daily_consumed_energy,Heute verbrauchte Energie,unavailable,\nsensor.daily_pv_generation,Heutige PV Produktion,unavailable,\nsensor.daily_exported_energy_from_pv,Heute exportierte Energie,unavailable,\nsensor.daily_imported_energy,Heute importierte Energie,unavailable,\nswitch.pi_hole,Pi-Hole,on,\nlight.schrankwand,Schrankwand,on,\nlight.bildschirm_links,Bildschirm links,on,\nlight.bildschirm_unten,Bildschirm unten,on,\nlight.schreibtisch,Schreibtisch,on,\nlight.bildschirm_rechts,Bildschirm rechts,on,\nlight.bett,Bett,on,\nlight.lightstrip,Lightstrip,on,\nlight.deckenlampe,Deckenlampe,on,\nscene.deckenlampe_konzentrieren,Deckenlampe Konzentrieren,2024-08-11T15:21:57.143597+00:00,\nscene.deckenlampe_nachtlicht,Deckenlampe Nachtlicht,2024-11-17T00:12:00.714975+00:00,\nscene.tims_zimmer_nachtlicht,Tims Zimmer Nachtlicht,2024-11-05T22:30:21.388839+00:00,\nscene.deckenlampe_lesen,Deckenlampe Lesen,2024-11-17T15:25:43.220624+00:00,\nscene.tims_zimmer_konzentrieren,Tims Zimmer Konzentrieren,2024-10-27T11:45:54.830200+00:00,\nscene.tims_zimmer_lesen,Tims Zimmer Lesen,2024-11-17T12:20:25.118701+00:00,\nscene.bildschirm_nachtlicht,Bildschirm Nachtlicht,2024-09-13T20:14:47.997107+00:00,\nscene.deckenlampe_sonnenuntergang_savanne,Deckenlampe Sonnenuntergang Savanne,unknown,\nscene.bildschirm_lesen,Bildschirm Lesen,2024-10-17T19:34:31.094546+00:00,\nscene.deckenlampe_nordlichter,Deckenlampe Nordlichter,unknown,\nscene.bildschirm_konzentrieren,Bildschirm Konzentrieren,2024-07-17T15:45:31.630469+00:00,\nweather.forecast_rheda_wiedenbruck,Zuhause,cloudy,Rheda-Wiedenbrück\nlight.bildschirm,Bildschirm,on,\nbinary_sensor.window_tim_contact,Fenster Tim,off,Fenster Tim/Tims Fenster\nlight.fussleiste,Fußleiste,off,Fussleiste/Fußleister/Fuß leiste\nmedia_player.spotify_tim,Spotify tim,idle,\nclimate.heizung_tim,Heizung Tim,heat,\nmedia_player.this_device_2,This Device,standby,\nmedia_player.spotifyplus_tim,SpotifyPlus tim,paused,\nsensor.humidity_tim,Humidity Tim,56.20000076,\nsensor.co2_tim,CO2 Tim,1081,\nsensor.temperature_tim,Temperature Tim,21.5,Schlafzimmer Tim/Temperatur Tim\nmedia_player.sm_t970_snapcast_client_2,SM-T970 Snapcast Client,unavailable,\nmedia_player.sm_t970_snapcast_group_5,SM-T970 Snapcast Group,unavailable,\nmedia_player.snapweb_client_snapcast_group_2,Snapweb client Snapcast Group,unavailable,\nmedia_player.sm_t970_snapcast_group_6,SM-T970 Snapcast Group,unavailable,\nmedia_player.sm_t970_snapcast_group_7,SM-T970 Snapcast Group,unavailable,\nmedia_player.tim_linux_snapcast_group_2,tim-linux Snapcast Group,unavailable,\nmedia_player.tim_linux_snapcast_group_3,tim-linux Snapcast Group,idle,\nmedia_player.tim_linux,tim-linux,idle,\n```\n\nThe current state of devices is provided in available devices.\nUse execute_services function only for requested action, not for current states.\nDo not execute service without user's confirmation.\nDo not restate or appreciate what user says, rather make a quick inquiry."}, {'role': 'user', 'content': 'Turn off Deckenlampe'}]
2024-11-17 18:41:16.023 INFO (MainThread) [custom_components.extended_openai_conversation] Response {'id': 'chatcmpl-d471fb1a-20d5-4cbe-bbe8-f68a91cd6c21', 'choices': [{'finish_reason': 'tool_calls', 'index': 0, 'message': {'role': 'assistant', 'function_call': {'arguments': '{"domain": "light", "service": "turn_off"}', 'name': 'assistant\n<|recipient|> all'}, 'tool_calls': [{'id': 'call_sLGEPMjwbWVI31qE7QAMRl09', 'function': {'arguments': '{"domain": "light", "service": "turn_off"}', 'name': 'assistant\n<|recipient|> all'}, 'type': 'function'}]}}], 'created': 1731865275, 'model': 'gpt-3.5-turbo-1106', 'object': 'chat.completion', 'usage': {'completion_tokens': 1, 'prompt_tokens': 1737, 'total_tokens': 1738}}
2024-11-17 18:41:16.030 ERROR (MainThread) [custom_components.extended_openai_conversation] function 'assistant
<|recipient|> all' does not exist
Traceback (most recent call last):
  File "/config/custom_components/extended_openai_conversation/__init__.py", line 196, in async_process
    query_response = await self.query(user_input, messages, exposed_entities, 0)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/extended_openai_conversation/__init__.py", line 383, in query
    return await self.execute_tool_calls(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/config/custom_components/extended_openai_conversation/__init__.py", line 468, in execute_tool_calls
    raise FunctionNotFound(function_name)
custom_components.extended_openai_conversation.exceptions.FunctionNotFound: function 'assistant
<|recipient|> all' does not exist