AI Voice Control for Home Assistant (Fully Local)

Seems like a chat format issue. I could try that model when I have the time

I tried to use the Gorilla-OpenFunctions model in place of the functionary model from the OP. This is really the first time I’m dabbling in ML beyond being an end user, so I’m going through trial and error to try to get it to work.

The model seems to be targeted towards engineers trying to figure out how to call functions to do what they want, and that’s the sort of response I get back. I don’t think it’s actually calling the functions to generate a useful response.

The Gorilla OpenFunctions website says the following:
Note: Gorilla through our hosted end-point is currently only supported with `openai==0.28.1`. We will migrate to also include support for `openai==1.xx` soon, with which `functions` is replaced by `tool_calls`.
So function calling via the newer OpenAI API, where `functions` was replaced by `tool_calls`, is not yet implemented for this LLM.
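
For anyone comparing the two API generations, the request shape is what changed: `functions`/`function_call` in `openai==0.28.x` became `tools`/`tool_calls` in `openai>=1.x`. A minimal sketch against a local OpenAI-compatible server (the base URL, model name, and function schema here are placeholders, not Gorilla’s actual API):

```python
from openai import OpenAI  # openai>=1.x

# Placeholder endpoint/model: adjust for your own llama-cpp-python (or other) server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

weather_fn = {
    "name": "get_weather",
    "description": "Get the weather for a location",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}

# openai>=1.x: functions are wrapped in a "tools" list ...
resp = client.chat.completions.create(
    model="functionary-small-v2.4",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[{"type": "function", "function": weather_fn}],
    tool_choice="auto",
)
# ... and any call the model makes comes back in message.tool_calls.
print(resp.choices[0].message.tool_calls)

# Under openai==0.28.x the same schema would instead be passed as
# functions=[weather_fn], function_call="auto", and the result would be
# found in resp["choices"][0]["message"]["function_call"].
```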

I got some clues from the logs, but I can’t solve it. I submitted a question to the LocalAI project: error="unexpected end of JSON input" · Issue #2850 · mudler/LocalAI · GitHub
I hope someone can solve it.

Very interesting topic about local AI.

I’m installing by following your guide and have a problem when adding the local llama.cpp server to Home Assistant via Extended OpenAI. Please help me see if I have a mistake somewhere.

I installed llama-cpp-python on a Windows machine and it’s running on http://172.23.126.96:8000/. I tested whether the Docker container is working by posting some JSON to the API, and this is the result.

So yes, the Docker container is working properly.
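
For reference, a request along these lines (with a placeholder model name) is enough to smoke-test the OpenAI-compatible endpoint that llama-cpp-python exposes:

```python
import requests

# Quick smoke test of llama-cpp-python's OpenAI-compatible chat endpoint.
# Host and port are from the setup above; the model name is a placeholder
# and may not matter when only one model is loaded.
url = "http://172.23.126.96:8000/v1/chat/completions"
payload = {
    "model": "functionary-small-v2.4",
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    "max_tokens": 32,
}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```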

Next step: I adjusted my Extended OpenAI Conversation install by downloading __init__.py and replacing the original file in /homeassistant/custom_components/extended_openai_conversation. I double-checked by opening the new __init__.py and looking at line 218, where your adjustment is; I can see it is messages.append(query_response.message.model_dump()), so this is the new __init__.py file, not the old one.

Restart Home Assistant.

In the Extended OpenAI Conversation integration, I added a new service like in the picture below.

I reconfigured this service by enabling Use Tools and setting Context Threshold to 8000.

Then I added this local AI as the Voice Assistant conversation agent and tested it by typing to my assistant. But whatever I ask, it just shows “Sorry, I had a problem talking to OpenAI: Request timed out.”

When I check the logs of the llama-cpp-python Docker container, nothing has happened there. This looks like an error in Extended OpenAI, where it doesn’t send anything to llama-cpp-python.

Would love to hear your reply. Thanks

When configuring the connection in Extended OpenAI, can you verify that the connection is made by unchecking the box “Skip Authentication”?

Yes, it shows me

Failed to connect

when I try to add a new service in Extended OpenAI, so I have to skip authentication.

Don’t try to skip authentication; if this step fails, you won’t get any further. Did you configure the port correctly in Docker, such that it is accessible from outside your machine?
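
A quick way to check from another machine (e.g. the Home Assistant host) whether the port is actually reachable is something like this sketch; fill in your own host and port:

```python
import socket

# Replace with the address of the machine running llama-cpp-python.
HOST, PORT = "172.23.126.96", 8000

try:
    # If this fails, Docker port publishing, the Windows firewall, or WSL
    # networking is blocking the connection before the server ever sees it.
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"TCP connection to {HOST}:{PORT} succeeded - the port is reachable.")
except OSError as exc:
    print(f"Could not reach {HOST}:{PORT}: {exc}")
```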

Bro, you are right, I had a problem with the firewall that made it inaccessible from the outside.

Now I can connect to it from my Home Assistant.

After some quick tests, these are the results.

It takes ~10 s for natural language processing of normal questions like “what is 1+1” and “what is the capital of France”, and ~13 s for questions that need function calling, like “what is the weather like today”.

I’m not sure what I should do to improve performance.

I have an RTX 3060 running on Windows 11. I installed the llama-cpp-python Docker container in WSL with the CUDA image 12.5.1-devel-ubuntu24.04. I’m using the default .env file from your GitHub, and I can see it loads ~5 GB of VRAM and the GPU is working when I ask something.

This is my Docker log:

2024-07-19 14:53:15 llama-cpp-python  | ==========
2024-07-19 14:53:15 llama-cpp-python  | == CUDA ==
2024-07-19 14:53:15 llama-cpp-python  | ==========
2024-07-19 14:53:15 llama-cpp-python  | 
2024-07-19 14:53:15 llama-cpp-python  | CUDA Version 12.5.1
2024-07-19 14:53:15 llama-cpp-python  | 
2024-07-19 14:53:15 llama-cpp-python  | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2024-07-19 14:53:15 llama-cpp-python  | 
2024-07-19 14:53:15 llama-cpp-python  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
2024-07-19 14:53:15 llama-cpp-python  | By pulling and using the container, you accept the terms and conditions of this license:
2024-07-19 14:53:15 llama-cpp-python  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
2024-07-19 14:53:15 llama-cpp-python  | 
2024-07-19 14:53:15 llama-cpp-python  | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
2024-07-19 14:53:15 llama-cpp-python  | 
2024-07-19 14:53:17 llama-cpp-python  | None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
2024-07-19 14:53:18 llama-cpp-python  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from /root/.cache/huggingface/hub/models--meetkai--functionary-small-v2.4-GGUF/snapshots/a0d171eb78e02a58858c464e278234afbcf85c5c/./functionary-small-v2.4.Q4_0.gguf (version GGUF V3 (latest))
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   0:                       general.architecture str              = llama
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   1:                               general.name str              = .
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32004
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   5:                          llama.block_count u32              = 32
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 1000000.000000
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  12:                          general.file_type u32              = 2
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32004]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32004]   = [0.000000, 0.000000, 0.000000, 0.0000...
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32004]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - kv  24:               general.quantization_version u32              = 2
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - type  f32:   65 tensors
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - type q4_0:  225 tensors
2024-07-19 14:53:19 llama-cpp-python  | llama_model_loader: - type q6_K:    1 tensors
2024-07-19 14:53:19 llama-cpp-python  | llm_load_vocab: special tokens definition check successful ( 263/32004 ).
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: format           = GGUF V3 (latest)
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: arch             = llama
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: vocab type       = SPM
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_vocab          = 32004
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_merges         = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_ctx_train      = 32768
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_embd           = 4096
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_head           = 32
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_head_kv        = 8
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_layer          = 32
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_rot            = 128
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_embd_head_k    = 128
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_embd_head_v    = 128
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_gqa            = 4
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_embd_k_gqa     = 1024
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_embd_v_gqa     = 1024
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: f_norm_eps       = 0.0e+00
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: f_clamp_kqv      = 0.0e+00
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: f_logit_scale    = 0.0e+00
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_ff             = 14336
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_expert         = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_expert_used    = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: causal attn      = 1
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: pooling type     = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: rope type        = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: rope scaling     = linear
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: freq_base_train  = 1000000.0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: freq_scale_train = 1
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: n_yarn_orig_ctx  = 32768
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: rope_finetuned   = unknown
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: ssm_d_conv       = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: ssm_d_inner      = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: ssm_d_state      = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: ssm_dt_rank      = 0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: model type       = 8B
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: model ftype      = Q4_0
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: model params     = 7.24 B
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: general.name     = .
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: BOS token        = 1 '<s>'
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: EOS token        = 2 '</s>'
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: UNK token        = 0 '<unk>'
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: PAD token        = 2 '</s>'
2024-07-19 14:53:19 llama-cpp-python  | llm_load_print_meta: LF token         = 13 '<0x0A>'
2024-07-19 14:53:19 llama-cpp-python  | ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
2024-07-19 14:53:19 llama-cpp-python  | ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
2024-07-19 14:53:19 llama-cpp-python  | ggml_cuda_init: found 1 CUDA devices:
2024-07-19 14:53:19 llama-cpp-python  |   Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
2024-07-19 14:53:19 llama-cpp-python  | llm_load_tensors: ggml ctx size =    0.30 MiB
2024-07-19 14:53:22 llama-cpp-python  | llm_load_tensors: offloading 32 repeating layers to GPU
2024-07-19 14:53:22 llama-cpp-python  | llm_load_tensors: offloading non-repeating layers to GPU
2024-07-19 14:53:22 llama-cpp-python  | llm_load_tensors: offloaded 33/33 layers to GPU
2024-07-19 14:53:22 llama-cpp-python  | llm_load_tensors:        CPU buffer size =    70.32 MiB
2024-07-19 14:53:22 llama-cpp-python  | llm_load_tensors:      CUDA0 buffer size =  3847.57 MiB
2024-07-19 14:53:23 llama-cpp-python  | ..................................................................................................
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: n_ctx      = 4096
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: n_batch    = 192
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: n_ubatch   = 192
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: freq_base  = 1000000.0
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: freq_scale = 1
2024-07-19 14:53:23 llama-cpp-python  | llama_kv_cache_init:      CUDA0 KV buffer size =   512.00 MiB
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model:      CUDA0 compute buffer size =   111.00 MiB
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model:  CUDA_Host compute buffer size =     6.00 MiB
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: graph nodes  = 1030
2024-07-19 14:53:23 llama-cpp-python  | llama_new_context_with_model: graph splits = 2
2024-07-19 14:53:23 llama-cpp-python  | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
2024-07-19 14:53:23 llama-cpp-python  | Model metadata: {'tokenizer.chat_template': '{% for message in messages %}\n{% if message[\'role\'] == \'user\' or message[\'role\'] == \'system\' %}\n{{ \'<|from|>\' + message[\'role\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% elif message[\'role\'] == \'tool\' %}\n{{ \'<|from|>\' + message[\'name\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% else %}\n{% set contain_content=\'no\'%}\n{% if message[\'content\'] is not none %}\n{{ \'<|from|>assistant\n<|recipient|>all\n<|content|>\' + message[\'content\'] }}{% set contain_content=\'yes\'%}\n{% endif %}\n{% if \'tool_calls\' in message and message[\'tool_calls\'] is not none %}\n{% for tool_call in message[\'tool_calls\'] %}\n{% set prompt=\'<|from|>assistant\n<|recipient|>\' + tool_call[\'function\'][\'name\'] + \'\n<|content|>\' + tool_call[\'function\'][\'arguments\'] %}\n{% if loop.index == 1 and contain_content == "no" %}\n{{ prompt }}{% else %}\n{{ \'\n\' + prompt}}{% endif %}\n{% endfor %}\n{% endif %}\n{{ \'<|stop|>\n\' }}{% endif %}\n{% endfor %}\n{% if add_generation_prompt %}{{ \'<|from|>assistant\n<|recipient|>\' }}{% endif %}', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'llama.context_length': '32768', 'general.name': '.', 'llama.vocab_size': '32004', 'general.file_type': '2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8'}
2024-07-19 14:53:23 llama-cpp-python  | INFO:     Started server process [27]
2024-07-19 14:53:23 llama-cpp-python  | INFO:     Waiting for application startup.
2024-07-19 14:53:23 llama-cpp-python  | INFO:     Application startup complete.
2024-07-19 14:53:23 llama-cpp-python  | INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

This is the log when I ask a question from Home Assistant:

2024-07-20 10:02:10 llama-cpp-python  | Llama.generate: prefix-match hit
2024-07-20 10:02:19 llama-cpp-python  | 
2024-07-20 10:02:19 llama-cpp-python  | llama_print_timings:        load time =    2423.11 ms
2024-07-20 10:02:19 llama-cpp-python  | llama_print_timings:      sample time =       3.27 ms /     8 runs   (    0.41 ms per token,  2447.98 tokens per second)
2024-07-20 10:02:19 llama-cpp-python  | llama_print_timings: prompt eval time =    3169.35 ms /  2475 tokens (    1.28 ms per token,   780.92 tokens per second)
2024-07-20 10:02:19 llama-cpp-python  | llama_print_timings:        eval time =     151.53 ms /     7 runs   (   21.65 ms per token,    46.19 tokens per second)
2024-07-20 10:02:19 llama-cpp-python  | llama_print_timings:       total time =    9274.52 ms /  2482 tokens
2024-07-20 10:02:19 llama-cpp-python  | from_string grammar:
2024-07-20 10:02:19 llama-cpp-python  | root ::= [{] space time-kv [}] space 
2024-07-20 10:02:19 llama-cpp-python  | space ::= space_3 
2024-07-20 10:02:19 llama-cpp-python  | time-kv ::= ["] [t] [i] [m] [e] ["] space [:] space time- 
2024-07-20 10:02:19 llama-cpp-python  | space_3 ::= [ ] | 
2024-07-20 10:02:19 llama-cpp-python  | time- ::= ["] [c] [u] [r] [r] [e] [n] [t] ["] | ["] [t] [o] [d] [a] [y] ["] | ["] [t] [o] [m] [o] [r] [r] [o] [w] ["] 
2024-07-20 10:02:19 llama-cpp-python  | 
2024-07-20 10:02:19 llama-cpp-python  | root ::= "{" space time-kv "}" space
2024-07-20 10:02:19 llama-cpp-python  | space ::= " "?
2024-07-20 10:02:19 llama-cpp-python  | time- ::= "\"current\"" | "\"today\"" | "\"tomorrow\""
2024-07-20 10:02:19 llama-cpp-python  | time-kv ::= "\"time\"" space ":" space time-
2024-07-20 10:02:19 llama-cpp-python  | Llama.generate: prefix-match hit
2024-07-20 10:02:20 llama-cpp-python  | 
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:        load time =    2423.11 ms
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:      sample time =      49.08 ms /     8 runs   (    6.13 ms per token,   163.00 tokens per second)
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:        eval time =     212.64 ms /     8 runs   (   26.58 ms per token,    37.62 tokens per second)
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:       total time =     508.52 ms /     9 tokens
2024-07-20 10:02:20 llama-cpp-python  | Llama.generate: prefix-match hit
2024-07-20 10:02:20 llama-cpp-python  | 
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:        load time =    2423.11 ms
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:      sample time =       0.44 ms /     1 runs   (    0.44 ms per token,  2262.44 tokens per second)
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings: prompt eval time =      42.50 ms /     7 tokens (    6.07 ms per token,   164.70 tokens per second)
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
2024-07-20 10:02:20 llama-cpp-python  | llama_print_timings:       total time =      90.35 ms /     8 tokens
2024-07-20 10:02:20 llama-cpp-python  | INFO:     172.18.0.1:34866 - "POST /v1/chat/completions HTTP/1.1" 200 OK
2024-07-20 10:02:21 llama-cpp-python  | Llama.generate: prefix-match hit
2024-07-20 10:02:23 llama-cpp-python  | 
2024-07-20 10:02:23 llama-cpp-python  | llama_print_timings:        load time =    2423.11 ms
2024-07-20 10:02:23 llama-cpp-python  | llama_print_timings:      sample time =      16.67 ms /    41 runs   (    0.41 ms per token,  2459.95 tokens per second)
2024-07-20 10:02:23 llama-cpp-python  | llama_print_timings: prompt eval time =     104.15 ms /    10 tokens (   10.41 ms per token,    96.02 tokens per second)
2024-07-20 10:02:23 llama-cpp-python  | llama_print_timings:        eval time =     990.18 ms /    40 runs   (   24.75 ms per token,    40.40 tokens per second)
2024-07-20 10:02:23 llama-cpp-python  | llama_print_timings:       total time =    2319.23 ms /    50 tokens
2024-07-20 10:02:23 llama-cpp-python  | INFO:     172.18.0.1:34866 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Hmm, 10 s is way too long. It might be related to your setup. It’s also possible to build llama-cpp-python manually, directly on your Windows machine. It required me to build llama.cpp and refer to the built .dll in llama-cpp-python. You could try this.
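
If you go that route, once a CUDA-enabled build is in place you can sanity-check GPU offload by loading the model directly from Python and watching the startup log. A rough sketch (the model path is a placeholder, and `n_gpu_layers=-1` asks for all layers to be offloaded):

```python
from llama_cpp import Llama

# Placeholder path: point this at your local GGUF file.
llm = Llama(
    model_path=r"C:\models\functionary-small-v2.4.Q4_0.gguf",
    n_gpu_layers=-1,  # offload every layer; the log should report "offloaded 33/33 layers to GPU"
    n_ctx=4096,
    verbose=True,     # print the same llama.cpp loading/timing info seen in the Docker logs above
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is 1+1?"}],
    max_tokens=32,
)
print(out["choices"][0]["message"]["content"])
```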

Do you think an external GPU enclosure would work for this? I don’t think I have the budget for a new system, but I could swing a 3060 12 GB. I already have a mini PC with an i5 and 16 GB RAM. I’d have to look at the M.2 slots, but if I have a free slot I could use an M.2 adapter for the eGPU.

I’m not really familiar enough with the bandwidth these models need to know if this is a path I could consider.

I have a NUC running an i5 7260U (2C/4T, 3.4GHz) and an external GPU box connected via Thunderbolt using an RTX 3080. I find that the CPU is pegged when doing chat completion/function calling; the GPU is hardly taxed (apart from VRAM usage). It takes forever to complete the call, so while it’s fine for experimenting, it’s definitely not fast enough to daily drive. faster-whisper is extremely fast, however.

Pfff, quite a long read to get to this, but this thread is loaded with info, thanks for that.
Quite a few things on this complicated matter are getting clearer.
But there is one thing I am not quite sure of.
I have a small NUC that runs HA happily, but it is nowhere near capable of running an LLM. Now I was thinking of getting better hardware just for the LLM. This would be a separate machine. The idea is to let the NUC with HA on it make calls to the dedicated LLM server. Will this work?

Of course! I bought a GPU that I inserted into my server running HA, and while HA is a Docker container that runs its processes on my CPU, all my AI-related stuff runs in Docker containers that utilize my GPU (Whisper, llama-cpp-python).

All I do within Extended OpenAI is connect to localhost on a certain port, but running it on any other machine would work just as well (provided you open the ports and the machine is reachable, etc.).

On another note: I know the Open Home Foundation is looking into specific hardware solutions with Nvidia, but expect them to be very expensive. I like my GPU solution for testing :slight_smile:

Thanks for the quick response.
This sounds great. This way I can keep my trusted HA NUC safe from experiments. Keeping my house in an operational state is my highest priority.

Thanks to all the info here and the help of the thread starter, I managed to get a local LLM up and running. I have written down my experiences with setting up a local LLM.
Hopefully this will help starters just like me get started quickly:

Local LLM for Dummies (just like me)

What is the external GPU power consumption when used solely for this?

I didn’t measure that, and I’ve since switched over to a regular PC with the RTX 3080 in it. It’s a Skylake i7-6700K (my old workstation), and it’s drawing 40 watts at idle with 16 GB RAM and an M.2 SATA drive. The power supply is an older 80+ Bronze, so I’m guessing the GPU alone is only a couple of watts at idle. When inferencing, it shoots up to over 300 watts for a few seconds.

Hi,

Nice feature. Is it possible to implement this on an Intel NUC i3?

Thank you

You can run LLMs on a CPU, but I wouldn’t bother trying; it’s just way too slow.