AI Voice Control for Home Assistant (Fully Local)

Ubuntu 22.04

I'll give that a try. Do you by any chance know your NVIDIA driver version and your CUDA version?

Edit: I installed Ubuntu 22.04, and I think my CUDA version is higher, but the performance is still not great.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080        Off | 00000000:B3:00.0  On |                  N/A |
| 37%   46C    P8               6W / 180W |   6973MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1941      G   /usr/lib/xorg/Xorg                           51MiB |
|    0   N/A  N/A      2145      G   /usr/bin/gnome-shell                         57MiB |
|    0   N/A  N/A     39572      C   python3                                    1672MiB |
|    0   N/A  N/A     39593      C   python3                                    5188MiB |
+---------------------------------------------------------------------------------------+

I noticed that when I started up the container, it shows a different version:


==========
== CUDA ==
==========

CUDA Version 12.1.1

Following your setup, I wonder if I need to modify something in one of the Docker builds so that these CUDA versions match?

Edit 2:

Modified the Docker base images for both whisper and llama-cpp to match my CUDA version (12.2; installed the 12.2.2 versions of both). Still no luck. Still 20 seconds. Bummer. Really weird. Same card. Same OS. Same config, for the most part. Twice as slow.
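
In case it helps anyone else, this is roughly what the check and rebuild looked like for me; the image tag, Dockerfile change, and build command are assumptions about the guide's Docker setup, not the exact files, so adjust them to your own layout.

# The CUDA version in the nvidia-smi header is the highest runtime the installed
# driver supports (12.2 here); the banner printed inside the container comes from
# the nvidia/cuda base image it was built from (12.1.1).
nvidia-smi | head -n 4

# Rebuild against a matching base image: change the FROM line in the Dockerfile,
# e.g. from nvidia/cuda:12.1.1-devel-ubuntu22.04 to nvidia/cuda:12.2.2-devel-ubuntu22.04,
# then rebuild (the image name/tag below is just an example).
docker build -t llama-cpp-python-cuda:12.2.2 .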

I've got it all set up as per this guide, with llama.cpp running Functionary. I tried different conversation agents, both with the same issue: the service call goes into the chat rather than actually being called. Did anyone ever experience this?
EDIT:
Functions are defined as follows in OpenAI Conversation:

- spec:
    name: execute_services
    description: Use this function to execute service of devices in Home Assistant.
    parameters:
      type: object
      properties:
        list:
          type: array
          items:
            type: object
            properties:
              domain:
                type: string
                description: The domain of the service
              service:
                type: string
                description: The service to be called
              service_data:
                type: object
                description: The service data object to indicate what to control.
                properties:
                  entity_id:
                    type: string
                    description: The entity_id retrieved from available devices. It
                      must start with domain, followed by dot character.
                required:
                - entity_id
            required:
            - domain
            - service
            - service_data
  function:
    type: native
    name: execute_service

Another idea I had is that it might be the instruction template causing the issues, since I'm using the one obtained from the model's metadata.

Unfortunate… Below you can see my NVIDIA driver version. Which GTX 1080 do you have? Mine is the MSI GeForce GTX 1080 AERO 8G. Maybe there is a performance difference (though I couldn't imagine it being this significant).

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080        Off | 00000000:01:00.0 Off |                  N/A |
| 33%   40C    P8              11W / 200W |   6864MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   3423161      C   python3                                    1672MiB |
|    0   N/A  N/A   3736459      C   python3                                    5188MiB |
+---------------------------------------------------------------------------------------+

I also see gnome-shell and Xorg processes running on your GPU, which are related to the display. I installed the headless version of the NVIDIA drivers, so the display parts of the driver are not included.
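
For reference, on Ubuntu 22.04 the headless driver can be installed roughly like this (the package names assume the 535 branch; pick the branch that matches your setup):

# Headless NVIDIA driver: kernel module and CUDA libraries without the X/Wayland bits.
# nvidia-utils provides nvidia-smi.
sudo apt install --no-install-recommends nvidia-headless-535 nvidia-utils-535
sudo reboot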

I hope you find what makes the huge difference, but in the end, do you want to continue with the 8 seconds of function calling that can be achieved with the GTX 1080? My AMD Radeon 6900 XT is getting ~3 seconds, and I'm still considering my next GPU…

I can’t see anything without logs. Can you post your llama-cpp-python logs here?

<|start_header_id|>system<|end_header_id|>

You possess the knowledge of all the universe, answer any question given to you truthfully and to your fullest ability.
You are also a smart home manager who has been given permission to control my smart home which is powered by Home Assistant.
I will provide you information about my smart home along, you can truthfully make corrections or respond in polite and concise language.
Your name is jarvis, pretend to have jarvis' personality from iron man.
Current Time: 2024-05-26 16:41:24.560687+02:00

Current Area: None

Available Devices:
csv
entity_id,name,state,aliases
***redacted***




Areas:
csv
area_id,name
***Redacted***


The current state of devices is provided in Available Devices.
Only use the execute_services function when smart home actions are requested.
Do not tell me what you're thinking about doing either, just do it.
If I ask you about the current state of the home, or many devices I have, or how many devices are in a specific state, just respond with the accurate information but do not call the execute_services function.
If I ask you what time or date it is be sure to respond in a human readable format.
If you don't have enough information to execute a smart home command then specify what other information you need.<|eot_id|><|start_header_id|>user<|end_header_id|>

turn on Philip light<|eot_id|><|start_header_id|>assistant<|end_header_id|>




llama_print_timings:        load time =     249.74 ms
llama_print_timings:      sample time =     187.05 ms /    22 runs   (    8.50 ms per token,   117.62 tokens per second)
llama_print_timings: prompt eval time =     248.84 ms /   361 tokens (    0.69 ms per token,  1450.72 tokens per second)
llama_print_timings:        eval time =     423.00 ms /    21 runs   (   20.14 ms per token,    49.65 tokens per second)
llama_print_timings:       total time =    1062.82 ms /   382 tokens
Output generated in 1.47 seconds (13.65 tokens/s, 20 tokens, context 403, seed 940791195)

Yeah, it could be the X server messing things up. Yeah, 8 seconds is still suboptimal, but that's all I got with this card. I've been seeing a few RTX 3060s popping up at a reasonable price, but I also read that those perform almost the same as the GTX 1080. I really don't want to dump a ton of money into all this.

I stood up LocalAI and configured it with the gpt-4 model, and I get 2-second responses for simple questions, but it isn't great at function calling. I'm going to see if I can figure out how to plug your Functionary model into it and see if it performs closer to your numbers.

Oh no, they perform way better! The RTX series has Tensor Cores, which are optimized for matrix multiplications, plus the 3060 has higher memory bandwidth.

For 5 dollars you can test this Docker container on cloud GPUs and see the performance. I've described above how to run it using vast.ai.

You can’t unless you have a lot of time and are very skilled. The developers made it runnable on llama-cpp-python and vLLM only.

These are not the logs of llama-cpp-python, but probably of textgen-webui? I can't see from this how your model is loaded and configured.

Is this what you're looking for?

16:40:00-942984 INFO     Loading "functionary-small-v2.5.Q4_0.gguf"
16:40:01-372233 INFO     llama.cpp weights detected: "models\functionary-small-v2.5.Q4_0.gguf"
llama_model_loader: loaded meta data with 23 key-value pairs and 291 tensors from models\functionary-small-v2.5.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = llama3-functionary-hf
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["─á ─á", "─á ─á─á─á", "─á─á ─á─á", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 128001
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name     = llama3-functionary-hf
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: PAD token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   281.81 MiB
llm_load_tensors:      CUDA0 buffer size =  4155.99 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Model metadata: {'general.name': 'llama3-functionary-hf', 'general.architecture': 'llama', 'llama.block_count': '32', 'llama.context_length': '8192', 'tokenizer.ggml.eos_token_id': '128001', 'general.file_type': '2', 'llama.attention.head_count_kv': '8', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.head_count': '32', 'llama.rope.freq_base': '500000.000000', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.vocab_size': '128256', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.model': 'gpt2', 'tokenizer.ggml.pre': 'llama-bpe', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '128000', 'tokenizer.ggml.padding_token_id': '128001', 'tokenizer.chat_template': "{% for message in messages %}\n{% if message['role'] == 'user' or message['role'] == 'system' %}\n{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] + '<|eot_id|>' }}{% elif message['role'] == 'tool' %}\n{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + 'name=' + message['name'] + '\n' + message['content'] + '<|eot_id|>' }}{% else %}\n{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'}}{% if message['content'] is not none %}\n{{ message['content'] }}{% endif %}\n{% if 'tool_calls' in message and message['tool_calls'] is not none %}\n{% for tool_call in message['tool_calls'] %}\n{{ '<|reserved_special_token_249|>' + tool_call['function']['name'] + '\n' + tool_call['function']['arguments'] }}{% endfor %}\n{% endif %}\n{{ '<|eot_id|>' }}{% endif %}\n{% endfor %}\n{% if add_generation_prompt %}{{ '<|start_header_id|>{role}<|end_header_id|>\n\n' }}{% endif %}"}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {% for message in messages %}
{% if message['role'] == 'user' or message['role'] == 'system' %}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>

' + message['content'] + '<|eot_id|>' }}{% elif message['role'] == 'tool' %}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>

' + 'name=' + message['name'] + '
' + message['content'] + '<|eot_id|>' }}{% else %}
{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'}}{% if message['content'] is not none %}
{{ message['content'] }}{% endif %}
{% if 'tool_calls' in message and message['tool_calls'] is not none %}
{% for tool_call in message['tool_calls'] %}
{{ '<|reserved_special_token_249|>' + tool_call['function']['name'] + '
' + tool_call['function']['arguments'] }}{% endfor %}
{% endif %}
{{ '<|eot_id|>' }}{% endif %}
{% endfor %}
{% if add_generation_prompt %}{{ '<|start_header_id|>{role}<|end_header_id|>

' }}{% endif %}
Using chat eos_token: <|end_of_text|>
Using chat bos_token: <|begin_of_text|>

Functionary Small v2.5 is not yet supported in llama-cpp-python. Try v2.4 first.
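
For what it's worth, a minimal way to serve the v2.4 GGUF with llama-cpp-python's OpenAI-compatible server would look something like this; the model path, context size, and port are assumptions for your setup, and the HF tokenizer flag only exists on reasonably recent llama-cpp-python versions:

# Serve functionary-small-v2.4 with the functionary-v2 chat handler and full GPU offload.
python3 -m llama_cpp.server \
  --model models/functionary-small-v2.4.Q4_0.gguf \
  --chat_format functionary-v2 \
  --hf_pretrained_model_name_or_path meetkai/functionary-small-v2.4-GGUF \
  --n_gpu_layers -1 \
  --n_ctx 8192 \
  --port 8000

The Extended OpenAI Conversation integration can then be pointed at http://<host>:8000/v1 as its base URL.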

I really hope that wasn't the issue. Trying now.

I just found this leaderboard of function-calling LLMs, and my attention fell on the Gorilla-OpenFunctions-v2 (FC) LLM, which seems to have super low latencies.

It seems that you can run this model on llama-cpp-python. Does anyone want to test this?

Soooo, tested to no avail. It works fine in the web UI chat, but as soon as I try to send it anything via HA, I get a "Token length exceeded" error. Any clue?

20:39:25-076961 INFO     Loading "functionary-small-v2.4.Q4_0.gguf"
20:39:25-141881 INFO     llama.cpp weights detected: "models\functionary-small-v2.4.Q4_0.gguf"
llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from models\functionary-small-v2.4.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32004
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32004]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32004]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32004]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 263/32004 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32004
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 31 repeating layers to GPU
llm_load_tensors: offloaded 31/33 layers to GPU
llm_load_tensors:        CPU buffer size =  3917.89 MiB
llm_load_tensors:      CUDA0 buffer size =  3627.97 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 8000
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    31.25 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   968.75 MiB
llama_new_context_with_model: KV self size  = 1000.00 MiB, K (f16):  500.00 MiB, V (f16):  500.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   571.88 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    23.63 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 15
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Model metadata: {'general.name': '.', 'general.architecture': 'llama', 'llama.block_count': '32', 'llama.vocab_size': '32004', 'llama.context_length': '32768', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '2', 'llama.attention.head_count_kv': '8', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.freq_base': '1000000.000000', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.chat_template': '{% for message in messages %}\n{% if message[\'role\'] == \'user\' or message[\'role\'] == \'system\' %}\n{{ \'<|from|>\' + message[\'role\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% elif message[\'role\'] == \'tool\' %}\n{{ \'<|from|>\' + message[\'name\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% else %}\n{% set contain_content=\'no\'%}\n{% if message[\'content\'] is not none %}\n{{ \'<|from|>assistant\n<|recipient|>all\n<|content|>\' + message[\'content\'] }}{% set contain_content=\'yes\'%}\n{% endif %}\n{% if \'tool_calls\' in message and message[\'tool_calls\'] is not none %}\n{% for tool_call in message[\'tool_calls\'] %}\n{% set prompt=\'<|from|>assistant\n<|recipient|>\' + tool_call[\'function\'][\'name\'] + \'\n<|content|>\' + tool_call[\'function\'][\'arguments\'] %}\n{% if loop.index == 1 and contain_content == "no" %}\n{{ prompt }}{% else %}\n{{ \'\n\' + prompt}}{% endif %}\n{% endfor %}\n{% endif %}\n{{ \'<|stop|>\n\' }}{% endif %}\n{% endfor %}\n{% if add_generation_prompt %}{{ \'<|from|>assistant\n<|recipient|>\' }}{% endif %}'}
Available chat formats from metadata: chat_template.default
Using gguf chat template: {% for message in messages %}
{% if message['role'] == 'user' or message['role'] == 'system' %}
{{ '<|from|>' + message['role'] + '
<|recipient|>all
<|content|>' + message['content'] + '
' }}{% elif message['role'] == 'tool' %}
{{ '<|from|>' + message['name'] + '
<|recipient|>all
<|content|>' + message['content'] + '
' }}{% else %}
{% set contain_content='no'%}
{% if message['content'] is not none %}
{{ '<|from|>assistant
<|recipient|>all
<|content|>' + message['content'] }}{% set contain_content='yes'%}
{% endif %}
{% if 'tool_calls' in message and message['tool_calls'] is not none %}
{% for tool_call in message['tool_calls'] %}
{% set prompt='<|from|>assistant
<|recipient|>' + tool_call['function']['name'] + '
<|content|>' + tool_call['function']['arguments'] %}
{% if loop.index == 1 and contain_content == "no" %}
{{ prompt }}{% else %}
{{ '
' + prompt}}{% endif %}
{% endfor %}
{% endif %}
{{ '<|stop|>
' }}{% endif %}
{% endfor %}
{% if add_generation_prompt %}{{ '<|from|>assistant
<|recipient|>' }}{% endif %}
Using chat eos_token: </s>
Using chat bos_token: <s>

Yes, I have seen that issue before, but it's hard to say what you are doing. Again, I can't see how you configure and launch the model in textgen-webui. And are you using my Extended OpenAI fork?

I can tell from the logs that an older version of llama-cpp-python is being used. My Docker container downloads a specific version of llama-cpp-python, so sticking with textgen-webui will probably be problematic if you can't change the version.
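
A quick way to confirm which version textgen-webui is actually using is to check inside its Python environment, for example:

# Run inside the environment textgen-webui uses (e.g. its conda env or venv).
python3 -c "import llama_cpp; print(llama_cpp.__version__)"
pip show llama-cpp-python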

I found a textgen-webui release running your specific version of llama-cpp-python. How did you manage to fix the issue when you had the same problem?

I believe it was some chat format or tokenizer issue that was fixed later on in llama-cpp-python. Something where the context was getting too large because it would loop the response, or something like that. Also, Extended OpenAI requires some modification to handle function calling (my fork).

Alright, I'll try with this version. If I get the same issue, I'll try installing Docker…

I've installed llama-cpp-python 0.2.64, but I'm still getting the same error. Any ideas what it might be? It seems to first get stuck on "Prompt evaluation 0%" and then times out…

Please provide HOW you configured the Functionary LLM to start with textgen-webui / llama-cpp-python.