AI Voice Control for Home Assistant (Fully Local)

12.0 by the looks.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:B3:00.0 Off |                  N/A |
| 27%   30C    P8     5W / 180W |   1718MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       785      G   /usr/lib/xorg/Xorg                 33MiB |
|    0   N/A  N/A     64726      C   python3                          1680MiB |
+-----------------------------------------------------------------------------+

12.0 by the looks.

As described in the installation guide above, I did not make prebuilt images for all architectures and CUDA versions. You need to use the CUDA Docker base image that matches your maximum CUDA version / Linux distribution. You can find all Docker images here (I don't see Debian listed, so you might have to try different ones).

Either downgrade the CUDA version in the Docker image, or install newer drivers for your GTX 1080 that support CUDA 12.1 or 12.2.

Ah, I had assumed that since you mentioned you were using the same video card as me, it was a turnkey situation. I am not exactly sure which image to use in the Dockerfile, and I am also not certain how to update the drivers to support the later CUDA version for this card.

The easiest way is to rebuild the Docker image. Replace line 1 of the Dockerfile with one of the following image tags (I am not sure which one works for you):

12.0.0-devel-ubi8
12.0.0-devel-ubuntu20.04
12.0.0-devel-ubuntu22.04
12.0.0-devel-rockylinux8
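For example, assuming line 1 of the Dockerfile is a standard `FROM nvidia/cuda:...` line (an assumption about the repo layout), the replacement would look roughly like this:

```dockerfile
# Pick the tag that matches the CUDA version reported by nvidia-smi (12.0 here)
# and a base distribution that works with your host drivers.
FROM nvidia/cuda:12.0.0-devel-ubuntu22.04
```

After changing it, rebuild with `docker compose build` (or `docker build .`) so the new base image is actually used.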

12.0.0-devel-ubuntu22.04 is working. Thanks! I am seeing 20-second response times for simple things like turning lights on/off. Is that expected?

==========
== CUDA ==
==========

CUDA Version 12.0.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
tokenizer_config.json: 100% 2.86k/2.86k [00:00<00:00, 10.4MB/s]
tokenizer.model: 100% 493k/493k [00:00<00:00, 5.21MB/s]
tokenizer.json: 100% 1.80M/1.80M [00:00<00:00, 10.9MB/s]
added_tokens.json: 100% 95.0/95.0 [00:00<00:00, 439kB/s]
special_tokens_map.json: 100% 660/660 [00:00<00:00, 2.52MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
functionary-small-v2.4.Q4_0.gguf: 100% 4.11G/4.11G [00:59<00:00, 68.8MB/s]
llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from /root/.cache/huggingface/hub/models--meetkai--functionary-small-v2.4-GGUF/snapshots/a0d171eb78e02a58858c464e278234afbcf85c5c/./functionary-small-v2.4.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32004
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32004]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32004]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32004]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 263/32004 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32004
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.32 MiB
llm_load_tensors:      CUDA0 buffer size =  3847.57 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 8000
llama_new_context_with_model: n_batch    = 192
llama_new_context_with_model: n_ubatch   = 192
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1000.00 MiB
llama_new_context_with_model: KV self size  = 1000.00 MiB, K (f16):  500.00 MiB, V (f16):  500.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   205.36 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     8.86 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.chat_template': '{% for message in messages %}\n{% if message[\'role\'] == \'user\' or message[\'role\'] == \'system\' %}\n{{ \'<|from|>\' + message[\'role\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% elif message[\'role\'] == \'tool\' %}\n{{ \'<|from|>\' + message[\'name\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% else %}\n{% set contain_content=\'no\'%}\n{% if message[\'content\'] is not none %}\n{{ \'<|from|>assistant\n<|recipient|>all\n<|content|>\' + message[\'content\'] }}{% set contain_content=\'yes\'%}\n{% endif %}\n{% if \'tool_calls\' in message and message[\'tool_calls\'] is not none %}\n{% for tool_call in message[\'tool_calls\'] %}\n{% set prompt=\'<|from|>assistant\n<|recipient|>\' + tool_call[\'function\'][\'name\'] + \'\n<|content|>\' + tool_call[\'function\'][\'arguments\'] %}\n{% if loop.index == 1 and contain_content == "no" %}\n{{ prompt }}{% else %}\n{{ \'\n\' + prompt}}{% endif %}\n{% endfor %}\n{% endif %}\n{{ \'<|stop|>\n\' }}{% endif %}\n{% endfor %}\n{% if add_generation_prompt %}{{ \'<|from|>assistant\n<|recipient|>\' }}{% endif %}', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'llama.context_length': '32768', 'general.name': '.', 'llama.vocab_size': '32004', 'general.file_type': '2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8'}
INFO:     Started server process [27]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

llama_print_timings:        load time =     767.59 ms
llama_print_timings:      sample time =       1.95 ms /     5 runs   (    0.39 ms per token,  2569.37 tokens per second)
llama_print_timings: prompt eval time =    9715.09 ms /  4508 tokens (    2.16 ms per token,   464.02 tokens per second)
llama_print_timings:        eval time =     167.26 ms /     4 runs   (   41.81 ms per token,    23.92 tokens per second)
llama_print_timings:       total time =   19017.32 ms /  4512 tokens
from_string grammar:
char ::= [^"\] | [\] char_1 
char_1 ::= ["\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] 
list ::= [[] space list_8 []] space 
space ::= space_19 
list_4 ::= list-item list_7 
list-item ::= [{] space list-item-domain-kv [,] space list-item-service-kv [,] space list-item-service-data-kv [}] space 
list_6 ::= [,] space list-item 
list_7 ::= list_6 list_7 | 
list_8 ::= list_4 | 
list-item-domain-kv ::= ["] [d] [o] [m] [a] [i] [n] ["] space [:] space string 
list-item-service-kv ::= ["] [s] [e] [r] [v] [i] [c] [e] ["] space [:] space string 
list-item-service-data-kv ::= ["] [s] [e] [r] [v] [i] [c] [e] [_] [d] [a] [t] [a] ["] space [:] space list-item-service-data 
string ::= ["] string_20 ["] space 
list-item-service-data ::= [{] space list-item-service-data-entity-id-kv [}] space 
list-item-service-data-entity-id-kv ::= ["] [e] [n] [t] [i] [t] [y] [_] [i] [d] ["] space [:] space string 
list-kv ::= ["] [l] [i] [s] [t] ["] space [:] space list 
root ::= [{] space root_18 [}] space 
root_17 ::= list-kv 
root_18 ::= root_17 | 
space_19 ::= [ ] | 
string_20 ::= char string_20 | 

char ::= [^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
list ::= "[" space (list-item ("," space list-item)*)? "]" space
list-item ::= "{" space list-item-domain-kv "," space list-item-service-kv "," space list-item-service-data-kv "}" space
list-item-domain-kv ::= "\"domain\"" space ":" space string
list-item-service-data ::= "{" space list-item-service-data-entity-id-kv "}" space
list-item-service-data-entity-id-kv ::= "\"entity_id\"" space ":" space string
list-item-service-data-kv ::= "\"service_data\"" space ":" space list-item-service-data
list-item-service-kv ::= "\"service\"" space ":" space string
list-kv ::= "\"list\"" space ":" space list
root ::= "{" space  (list-kv )? "}" space
space ::= " "?
string ::= "\"" char* "\"" space
Llama.generate: prefix-match hit

llama_print_timings:        load time =     767.59 ms
llama_print_timings:      sample time =     243.15 ms /    40 runs   (    6.08 ms per token,   164.51 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    1679.16 ms /    40 runs   (   41.98 ms per token,    23.82 tokens per second)
llama_print_timings:       total time =    3453.25 ms /    41 tokens
Llama.generate: prefix-match hit

llama_print_timings:        load time =     767.59 ms
llama_print_timings:      sample time =       0.39 ms /     1 runs   (    0.39 ms per token,  2577.32 tokens per second)
llama_print_timings: prompt eval time =     248.49 ms /    38 tokens (    6.54 ms per token,   152.93 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =     360.95 ms /    39 tokens
INFO:     192.168.1.68:55784 - "POST /v1/chat/completions HTTP/1.1" 200 OK

It does look like I'm using almost 7 GB of my VRAM as well with the small model. Is that expected?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:B3:00.0 Off |                  N/A |
| 34%   42C    P8     6W / 180W |   6982MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       785      G   /usr/lib/xorg/Xorg                 33MiB |
|    0   N/A  N/A     71989      C   python3                          5264MiB |
|    0   N/A  N/A     71996      C   python3                          1680MiB |
+-----------------------------------------------------------------------------+

Great to hear it works!

I am getting around 8 seconds with my GTX 1080. The first call after the model is loaded can take longer.

It does look like I'm using almost 7 GB of my VRAM as well with the small model. Is that expected?

I see it uses 5264 MiB, which seems right; is the other 1680 MiB Whisper?

Ah, yeah, that makes sense that the other amount is Whisper.

I mounted a volume so that we don't have to download the GGUF every time at startup, which has helped with the startup speeds, but I'm still getting 20 s responses for simple things like "turn on living room light". I'm trying to think of what else could be causing the slowdown. Do you have a 1080 Ti or just a plain 1080 (I have the plain one)? It could just be that my card is slow.
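For anyone wanting to do the same, a minimal compose.yml fragment (sketch; service name and host path are placeholders, the container path is taken from the model location visible in the logs above):

```yaml
services:
  llama-cpp-python:                      # placeholder service name
    build: .
    env_file: .env
    volumes:
      # Persist the Hugging Face cache so the GGUF is not re-downloaded on every start
      - ./hf-cache:/root/.cache/huggingface
```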

Also, I had to set the N_CTX property to 8000, as the default value kept throwing that error about the context being too long. With 8000 it at least returns a response.

Sorry, I had a problem talking to OpenAI: Error code: 400 - {'error': {'message': "This model's maximum context length is 4096 tokens. However, you requested 4665 tokens (4515 in the messages, 150 in the completion). Please reduce the length of the messages or completion.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
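For reference, a sketch of where that setting lives, in the .env file used by the llama-cpp-python container (8000 is simply the value that worked here):

```
# Larger context window so the system prompt with all exposed entities fits
N_CTX=8000
```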

I tested out the v2.5 small version and nothing seemed to work for me. It wasn't able to find a simple device that the v2.4 model found fine.

Something went wrong: function 'execute_services {"list": [{"domain": "light", "service": "turn_on", "service_data": {"entity_id": "living_room"}}]}' does not exist

What's odd with all this is that https://heywillow.io/ is sub-second for turning things on/off. What I need, though, is a foolproof (me being the fool) guide on how to set up Willow to control Music Assistant as well as timers and simple automations; I have yet to get all of that to work. So far this setup is at least able to do what I ask it to do, albeit incredibly slowly, which makes it sub-optimal.

I wish I knew which part is slow: the conversation/AI part (likely), Whisper, or something else. I do have faster-whisper set up on a different machine without a GPU that I could test out.

Had a thought here. Do you know if your motherboard has Resizable BAR (ReBAR) enabled? My Dell BIOS doesn't provide those options, so I'm curious if that could be the issue.

I had it like this before; it was easier for the guide to do it like this.

I don't have the Ti version, just the GTX 1080. I can check the inference speeds later today.

The GitHub page of Meetkai/Functionary states that the v2.5 model is not yet supported in llama-cpp-python. Stick to v2.4 for now.

These timings are for STT inference (Whisper) only, and at this point they can never be matched once you include an LLM with function calling (calling a function alone takes at least a second). As said in the post, the GTX 1080 will not give LLM inference speeds good enough for deployment. I think the total delay from command to execution should be at most around 3 seconds.

You can check the voice pipeline in Home Assistant and see the processing time for all components (STT, conversation agent, etc.). Go to Voice assistants, open the pipeline, then click the three dots at the top right and select Debug.

Another thing is to first test the LLM without Whisper and just type commands, so you can focus on testing one component at a time.
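You can also bypass Home Assistant entirely and send a request straight to the llama-cpp-python server; a rough sketch using the OpenAI-compatible endpoint visible in the logs above (host, port, and model name are placeholders):

```bash
# Send a plain chat request to the server started by the container
curl -s http://192.168.1.100:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "functionary-small-v2.4",
        "messages": [{"role": "user", "content": "What is 1+1?"}]
      }'
```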

No idea, and probably not. The model is completely offloaded into GPU VRAM, so all the calculations and model parameters stay on the GPU.

It's in the BIOS, if you wouldn't mind looking. The only reason I ask is that machines without it can only transfer a limited amount of data over PCIe at a time (I think), so I was wondering whether I was only getting 1-2 GB onto the card (although nvidia-smi says otherwise) and the rest was being handled in RAM instead of VRAM. I've been using the debug portion of the UI to figure out the times. It's just super odd that mine is so much slower.

@BramNH

Hi all,
I'm following your very helpful tutorial about installing llama-cpp-python on my Jetson, but I still have issues and hope maybe someone can give some tips :)

What I've done:

  • Cloned your repo into my folder, changed the CUDA version to match mine (12.2.0), and ran all the commands. I'm using the 2.4 Small Functionary model.
    I wasn't able to make it work and hope maybe someone can point me in the right direction :)

This is my .env file:

USE_MLOCK=0
#HF_MODEL_REPO_ID=meetkai/functionary-small-v2.4-GGUF
MODEL=meetkai/functionary-small-v2.4-GGUF/functionary-small-v2.4.Q4_0.gguf
HF_PRETRAINED_MODEL_NAME_OR_PATH=meetkai/functionary-small-v2.4-GGUF
CHAT_FORMAT=functionary-v2
N_GPU_LAYERS=33
N_CTX=4092
N_BATCH=192
N_THREADS=6

I've tried many combinations but none of them worked... I have the llama-cpp-python-docker-cuda folder at this path:

/AI/llama-cpp-python-docker-cuda

And the model is inside this directory:

/AI/llama-cpp-python-docker-cuda/meetkai/functionary-small-v2.4-GGUF

This is the content of the folder:

drwxr-xr-x 2 root root       4096 mag 25 15:52 ./
drwxr-xr-x 3 root root       4096 mag 25 15:48 ../
-rw-r--r-- 1 root root      45822 mag 25 15:43 added_tokens.json
-rw-r--r-- 1 root root      73390 mag 25 15:52 config.json
-rw-r--r-- 1 root root 4108940320 mag 25 15:40 functionary-small-v2.4.Q4_0.gguf
-rw-r--r-- 1 root root      65713 mag 25 15:43 special_tokens_map.json
-rw-r--r-- 1 root root     102622 mag 25 15:43 tokenizer_config.json
-rw-r--r-- 1 root root      40762 mag 25 15:43 tokenizer.json
-rw-r--r-- 1 root root      42806 mag 25 15:43 tokenizer.model

When I try to run the Docker Compose command, I get this error in the logs:

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/server/__main__.py", line 97, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/server/__main__.py", line 83, in main
    app = create_app(
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/server/app.py", line 146, in create_app
    set_llama_proxy(model_settings=model_settings)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/server/app.py", line 68, in set_llama_proxy
    _llama_proxy = LlamaProxy(models=model_settings)
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/server/model.py", line 31, in __init__
    self._current_model = self.load_llama_from_model_settings(
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/server/model.py", line 97, in load_llama_from_model_settings
    tokenizer = llama_tokenizer.LlamaHFTokenizer.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/llama_cpp/llama_tokenizer.py", line 100, in from_pretrained
    hf_tokenizer = AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 837, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 934, in from_pretrained
    config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 632, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 689, in _get_config_dict
    resolved_config_file = cached_file(
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 370, in cached_file
    raise EnvironmentError(
OSError: meetkai/functionary-small-v2.4-GGUF does not appear to have a file named config.json. Checkout 'https://huggingface.co/meetkai/functionary-small-v2.4-GGUF/tree/None' for available files.

It seems there's a file or some configuration missing; any idea how to solve this?

This is my setup:

  • NVIDIA Jetson with Jetson Linux 36 installed
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 540.3.0                Driver Version: N/A          CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Orin (nvgpu)                  N/A  | N/A              N/A |                  N/A |
| N/A   N/A  N/A               N/A /  N/A | Not Supported        |     N/A          N/A |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Any idea?

Thanks to all

Nik

According to your logs, the entire model is loaded into VRAM.

$ nvidia-smi -q | grep -i bar -A 3
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB

This should indicate that Resizable BAR is disabled?

What system prompt do you have configured in Extended OpenAI? Making it longer increases the context size and reduces speed. Maybe it's related to the CUDA version, so it might be best to update your GPU drivers.

Can you try a few different commands, one that must call a function and one simpler one (e.g. "what is 1+1?"), and send the logs here?

If you downloaded the LLM manually, did you also mount a volume to the container?

I suggest using the default .env file that I provide, which downloads the model from Hugging Face and keeps it inside the image. I can provide an alternative .env and compose.yml tomorrow.
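In the meantime, a rough sketch of what that alternative could look like, mounting the manually downloaded folder and pointing the server at it (the container path and the exact option behaviour are assumptions; check the llama-cpp-python server settings for the definitive names):

```yaml
# compose.yml fragment (sketch): mount the local model folder into the container
services:
  llama-cpp-python:
    build: .
    env_file: .env
    volumes:
      - /AI/llama-cpp-python-docker-cuda/meetkai/functionary-small-v2.4-GGUF:/models/functionary-small-v2.4-GGUF
```

```
# .env fragment (sketch): point at the mounted path instead of a Hugging Face repo id
MODEL=/models/functionary-small-v2.4-GGUF/functionary-small-v2.4.Q4_0.gguf
HF_PRETRAINED_MODEL_NAME_OR_PATH=/models/functionary-small-v2.4-GGUF
```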

I just replicated all your commands, but it didn't mount a volume for me.
And yes, I've downloaded the GGUF file locally, but I think I'm still missing something.
Is there a way to force a path for the GGUF file in order to make it work?

Thanks in advance for your support!

Nik

Yes, I believe that means ReBAR is disabled. I read somewhere that 10-series NVIDIA cards don't even support ReBAR, but I did find this: GitHub - terminatorul/NvStrapsReBar: Resizable BAR for Turing GTX 1600 / RTX 2000 GPUs. Over my head, though.

Here are the logs:

==========
== CUDA ==
==========

CUDA Version 12.0.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

*************************
** DEPRECATION NOTICE! **
*************************
THIS IMAGE IS DEPRECATED and is scheduled for DELETION.
    https://gitlab.com/nvidia/container-images/cuda/blob/master/doc/support-policy.md

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from /root/.cache/huggingface/hub/models--meetkai--functionary-small-v2.4-GGUF/snapshots/a0d171eb78e02a58858c464e278234afbcf85c5c/./functionary-small-v2.4.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32004
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32004]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32004]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32004]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 263/32004 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32004
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1080, compute capability 6.1, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.32 MiB
llm_load_tensors:      CUDA0 buffer size =  3847.57 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 8000
llama_new_context_with_model: n_batch    = 192
llama_new_context_with_model: n_ubatch   = 192
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1000.00 MiB
llama_new_context_with_model: KV self size  = 1000.00 MiB, K (f16):  500.00 MiB, V (f16):  500.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   205.36 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =     8.86 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.chat_template': '{% for message in messages %}\n{% if message[\'role\'] == \'user\' or message[\'role\'] == \'system\' %}\n{{ \'<|from|>\' + message[\'role\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% elif message[\'role\'] == \'tool\' %}\n{{ \'<|from|>\' + message[\'name\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% else %}\n{% set contain_content=\'no\'%}\n{% if message[\'content\'] is not none %}\n{{ \'<|from|>assistant\n<|recipient|>all\n<|content|>\' + message[\'content\'] }}{% set contain_content=\'yes\'%}\n{% endif %}\n{% if \'tool_calls\' in message and message[\'tool_calls\'] is not none %}\n{% for tool_call in message[\'tool_calls\'] %}\n{% set prompt=\'<|from|>assistant\n<|recipient|>\' + tool_call[\'function\'][\'name\'] + \'\n<|content|>\' + tool_call[\'function\'][\'arguments\'] %}\n{% if loop.index == 1 and contain_content == "no" %}\n{{ prompt }}{% else %}\n{{ \'\n\' + prompt}}{% endif %}\n{% endfor %}\n{% endif %}\n{{ \'<|stop|>\n\' }}{% endif %}\n{% endfor %}\n{% if add_generation_prompt %}{{ \'<|from|>assistant\n<|recipient|>\' }}{% endif %}', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.architecture': 'llama', 'llama.rope.freq_base': '1000000.000000', 'llama.context_length': '32768', 'general.name': '.', 'llama.vocab_size': '32004', 'general.file_type': '2', 'tokenizer.ggml.add_bos_token': 'true', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '1', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8'}
INFO:     Started server process [27]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

llama_print_timings:        load time =    1098.02 ms
llama_print_timings:      sample time =       1.14 ms /     3 runs   (    0.38 ms per token,  2638.52 tokens per second)
llama_print_timings: prompt eval time =    9973.59 ms /  4515 tokens (    2.21 ms per token,   452.70 tokens per second)
llama_print_timings:        eval time =      83.80 ms /     2 runs   (   41.90 ms per token,    23.87 tokens per second)
llama_print_timings:       total time =   19062.25 ms /  4517 tokens
Llama.generate: prefix-match hit

llama_print_timings:        load time =    1098.02 ms
llama_print_timings:      sample time =      20.23 ms /    51 runs   (    0.40 ms per token,  2520.76 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    2139.32 ms /    51 runs   (   41.95 ms per token,    23.84 tokens per second)
llama_print_timings:       total time =    4057.04 ms /    52 tokens
INFO:     192.168.1.68:52060 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Llama.generate: prefix-match hit

llama_print_timings:        load time =    1098.02 ms
llama_print_timings:      sample time =       1.18 ms /     3 runs   (    0.39 ms per token,  2542.37 tokens per second)
llama_print_timings: prompt eval time =    8125.97 ms /  3788 tokens (    2.15 ms per token,   466.16 tokens per second)
llama_print_timings:        eval time =      83.89 ms /     2 runs   (   41.95 ms per token,    23.84 tokens per second)
llama_print_timings:       total time =   15740.95 ms /  3790 tokens
Llama.generate: prefix-match hit

llama_print_timings:        load time =    1098.02 ms
llama_print_timings:      sample time =       1.50 ms /     4 runs   (    0.37 ms per token,  2670.23 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =     167.80 ms /     4 runs   (   41.95 ms per token,    23.84 tokens per second)
llama_print_timings:       total time =     323.15 ms /     5 tokens
INFO:     192.168.1.68:39282 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Llama.generate: prefix-match hit

llama_print_timings:        load time =    1098.02 ms
llama_print_timings:      sample time =       1.94 ms /     5 runs   (    0.39 ms per token,  2573.34 tokens per second)
llama_print_timings: prompt eval time =    8178.00 ms /  3787 tokens (    2.16 ms per token,   463.07 tokens per second)
llama_print_timings:        eval time =     168.47 ms /     4 runs   (   42.12 ms per token,    23.74 tokens per second)
llama_print_timings:       total time =   15947.85 ms /  3791 tokens
from_string grammar:
char ::= [^"\] | [\] char_1 
char_1 ::= ["\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] 
list ::= [[] space list_8 []] space 
space ::= space_19 
list_4 ::= list-item list_7 
list-item ::= [{] space list-item-domain-kv [,] space list-item-service-kv [,] space list-item-service-data-kv [}] space 
list_6 ::= [,] space list-item 
list_7 ::= list_6 list_7 | 
list_8 ::= list_4 | 
list-item-domain-kv ::= ["] [d] [o] [m] [a] [i] [n] ["] space [:] space string 
list-item-service-kv ::= ["] [s] [e] [r] [v] [i] [c] [e] ["] space [:] space string 
list-item-service-data-kv ::= ["] [s] [e] [r] [v] [i] [c] [e] [_] [d] [a] [t] [a] ["] space [:] space list-item-service-data 
string ::= ["] string_20 ["] space 
list-item-service-data ::= [{] space list-item-service-data-entity-id-kv [}] space 
list-item-service-data-entity-id-kv ::= ["] [e] [n] [t] [i] [t] [y] [_] [i] [d] ["] space [:] space string 
list-kv ::= ["] [l] [i] [s] [t] ["] space [:] space list 
root ::= [{] space root_18 [}] space 
root_17 ::= list-kv 
root_18 ::= root_17 | 
space_19 ::= [ ] | 
string_20 ::= char string_20 | 

char ::= [^"\\] | "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
list ::= "[" space (list-item ("," space list-item)*)? "]" space
list-item ::= "{" space list-item-domain-kv "," space list-item-service-kv "," space list-item-service-data-kv "}" space
list-item-domain-kv ::= "\"domain\"" space ":" space string
list-item-service-data ::= "{" space list-item-service-data-entity-id-kv "}" space
list-item-service-data-entity-id-kv ::= "\"entity_id\"" space ":" space string
list-item-service-data-kv ::= "\"service_data\"" space ":" space list-item-service-data
list-item-service-kv ::= "\"service\"" space ":" space string
list-kv ::= "\"list\"" space ":" space list
root ::= "{" space  (list-kv )? "}" space
space ::= " "?
string ::= "\"" char* "\"" space
Llama.generate: prefix-match hit

llama_print_timings:        load time =    1098.02 ms
llama_print_timings:      sample time =     238.41 ms /    39 runs   (    6.11 ms per token,   163.58 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =    1638.57 ms /    39 runs   (   42.01 ms per token,    23.80 tokens per second)
llama_print_timings:       total time =    3358.07 ms /    40 tokens
Llama.generate: prefix-match hit

llama_print_timings:        load time =    1098.02 ms
llama_print_timings:      sample time =       0.37 ms /     1 runs   (    0.37 ms per token,  2702.70 tokens per second)
llama_print_timings: prompt eval time =     248.64 ms /    38 tokens (    6.54 ms per token,   152.83 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =     361.24 ms /    39 tokens
INFO:     192.168.1.68:57354 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Llama.generate: prefix-match hit

llama_print_timings:        load time =    1098.02 ms
llama_print_timings:      sample time =       3.69 ms /    10 runs   (    0.37 ms per token,  2707.09 tokens per second)
llama_print_timings: prompt eval time =     181.49 ms /    10 tokens (   18.15 ms per token,    55.10 tokens per second)
llama_print_timings:        eval time =     379.00 ms /     9 runs   (   42.11 ms per token,    23.75 tokens per second)
llama_print_timings:       total time =     963.92 ms /    19 tokens
INFO:     192.168.1.68:57354 - "POST /v1/chat/completions HTTP/1.1" 200 OK

A simple question took 16 seconds; a function call to turn on some lights took 23.

System prompt is:

I want you to act as smart home manager of Home Assistant.
I will provide information of smart home along with a question, you will truthfully make correction or answer using information provided in one sentence in everyday language.

Current Time: {{now()}}

Available Devices:
```csv
entity_id,name,state,aliases
{% for entity in exposed_entities -%}
{{ entity.entity_id }},{{ entity.name }},{{ entity.state }},{{entity.aliases | join('/')}}
{% endfor -%}
```

The current state of devices is provided in available devices.
Use execute_services function only for requested action, not for current states.
Do not execute service without user's confirmation.
Do not restate or appreciate what user says, rather make a quick inquiry.

After a clean installation and correctly changing the CUDA version, I was able to get the llama_cpp.server web page up, but this is what I get when I try to perform a request (either via the web API or Home Assistant with Extended OpenAI):

INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
CUDA error: the resource allocation failed
  current device: 0, in function cublas_handle at /tmp/pip-install-wbsrubw5/llama-cpp-python_a1a6365a96b649b285a31839fed1b864/vendor/llama.cpp/ggml-cuda/common.cuh:526
  cublasCreate_v2(&cublas_handles[device])
GGML_ASSERT: /tmp/pip-install-wbsrubw5/llama-cpp-python_a1a6365a96b649b285a31839fed1b864/vendor/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Aborted (core dumped)

Should I change something before performing requests?

Context: 8000
Use tools: Yes
Tokens: 4096
Temperature: 0.1

Also, this might sound silly, but how do I update my GPU drivers? I followed this: https://wiki.debian.org/NvidiaGraphicsDrivers#Debian_12_.22Bookworm.22 and I have the latest for Debian.

I don't have this exact setup, but someone might be able to help me nonetheless.
I'm running text-generation-webui on Windows, with an RTX 2070S and the Functionary Small 2.5 Q4 GGUF. The conversation agent is Extended OpenAI Conversation. Everything is connected, calling the API and getting responses; however, when trying to call a service such as turning a light on, instead of calling the service it just puts the service call in the chat. How can I fix this?
The response is as follows: execute_services {"service": "turn_on", "entity_id": "light.light_philip"}

I also tried Home 3B, but with llama_conversation as the conversation agent integration, which managed to call services just fine. Any ideas?

I believe the Functionary LLM only works with vLLM or llama-cpp-python. Text-generation-webui is not a backend itself but supports multiple backends, and it seems to support llama-cpp-python; make sure it uses this backend with all the configuration used in this guide. I am not familiar with text-generation-webui, so I don't know how to set this up.

I would advise switching to either vLLM or llama-cpp-python.

Home-LLM, with its 3B model and HA integration, is completely different from this approach. That model is specifically trained to recognize Home Assistant service calls and the corresponding entities.

This approach is more general and allows you to configure your own YAML functions in Extended OpenAI, which the LLM can choose from when given a command.
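For illustration, a sketch of such a function spec that matches the execute_services grammar visible in the logs above (the default shipped with Extended OpenAI Conversation may differ in wording):

```yaml
- spec:
    name: execute_services
    description: Execute Home Assistant services for the requested action.
    parameters:
      type: object
      properties:
        list:
          type: array
          items:
            type: object
            properties:
              domain:
                type: string
                description: Domain of the service, e.g. light
              service:
                type: string
                description: Service to call, e.g. turn_on
              service_data:
                type: object
                properties:
                  entity_id:
                    type: string
                    description: Entity taken from the exposed devices list
                required:
                  - entity_id
            required:
              - domain
              - service
              - service_data
  function:
    type: native
    name: execute_service
```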

I can't say for Debian, but it might be similar to Ubuntu. If you say you have the latest, then they might not package higher CUDA versions in the drivers.

What OS are you running? I can install whatever.