AI Voice Control for Home Assistant (Fully Local)

As in the download/setup process, or the parameters of the model?

Everything. I guess you don’t have any run commands since textgen webui provides a UI, so show me a screenshot of the UI where the model is launched. Is it downloaded directly from Huggingface, including references to the tokenizers, or did you manually download the model?

Alright, soooo:
Textgen:
One-click install, pulling references and requirements from a requirements.txt file, in which the llama-cpp-python version is specified.
Then, setting flags for listening on all network interfaces and enabling API / OpenAI-compatible access.
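For reference, these flags typically go in text-generation-webui's CMD_FLAGS.txt (a sketch assuming the stock flag names; older builds used --extensions openai for the OpenAI-compatible endpoint):

--listen
--api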
The model is downloaded directly from Hugging Face - functionary-small-v2.4.Q4_0.gguf.
Then, llama.cpp is selected as the backend - the startup log and parameters of the model are as follows:

10:37:54-341063 INFO     Loading "functionary-small-v2.4.Q4_0.gguf"
10:37:54-406277 INFO     llama.cpp weights detected:
                         "models\functionary-small-v2.4.Q4_0.gguf"
llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from models\functionary-small-v2.4.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32004
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32004]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32004]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32004]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 263/32004 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32004
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.32 MiB
llm_load_tensors:      CUDA0 buffer size =  3847.57 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 8000
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 50000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1000.00 MiB
llama_new_context_with_model: KV self size  = 1000.00 MiB, K (f16):  500.00 MiB, V (f16):  500.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   547.63 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    23.63 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'general.name': '.', 'general.architecture': 'llama', 'llama.block_count': '32', 'llama.vocab_size': '32004', 'llama.context_length': '32768', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '2', 'llama.attention.head_count_kv': '8', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.freq_base': '1000000.000000', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.chat_template': '{% for message in messages %}\n{% if message[\'role\'] == \'user\' or message[\'role\'] == \'system\' %}\n{{ \'<|from|>\' + message[\'role\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% elif message[\'role\'] == \'tool\' %}\n{{ \'<|from|>\' + message[\'name\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% else %}\n{% set contain_content=\'no\'%}\n{% if message[\'content\'] is not none %}\n{{ \'<|from|>assistant\n<|recipient|>all\n<|content|>\' + message[\'content\'] }}{% set contain_content=\'yes\'%}\n{% endif %}\n{% if \'tool_calls\' in message and message[\'tool_calls\'] is not none %}\n{% for tool_call in message[\'tool_calls\'] %}\n{% set prompt=\'<|from|>assistant\n<|recipient|>\' + tool_call[\'function\'][\'name\'] + \'\n<|content|>\' + tool_call[\'function\'][\'arguments\'] %}\n{% if loop.index == 1 and contain_content == "no" %}\n{{ prompt }}{% else %}\n{{ \'\n\' + prompt}}{% endif %}\n{% endfor %}\n{% endif %}\n{{ \'<|stop|>\n\' }}{% endif %}\n{% endfor %}\n{% if add_generation_prompt %}{{ \'<|from|>assistant\n<|recipient|>\' }}{% endif %}'}
Using gguf chat template: {% for message in messages %}
{% if message['role'] == 'user' or message['role'] == 'system' %}
{{ '<|from|>' + message['role'] + '
<|recipient|>all
<|content|>' + message['content'] + '
' }}{% elif message['role'] == 'tool' %}
{{ '<|from|>' + message['name'] + '
<|recipient|>all
<|content|>' + message['content'] + '
' }}{% else %}
{% set contain_content='no'%}
{% if message['content'] is not none %}
{{ '<|from|>assistant
<|recipient|>all
<|content|>' + message['content'] }}{% set contain_content='yes'%}
{% endif %}
{% if 'tool_calls' in message and message['tool_calls'] is not none %}
{% for tool_call in message['tool_calls'] %}
{% set prompt='<|from|>assistant
<|recipient|>' + tool_call['function']['name'] + '
<|content|>' + tool_call['function']['arguments'] %}
{% if loop.index == 1 and contain_content == "no" %}
{{ prompt }}{% else %}
{{ '
' + prompt}}{% endif %}
{% endfor %}
{% endif %}
{{ '<|stop|>
' }}{% endif %}
{% endfor %}
{% if add_generation_prompt %}{{ '<|from|>assistant
<|recipient|>' }}{% endif %}
Using chat eos_token: </s>
Using chat bos_token: <s>
10:37:56-029220 INFO     Loaded "functionary-small-v2.4.Q4_0.gguf" in 1.69 seconds.
10:37:56-031224 INFO     LOADER: "llama.cpp"
10:37:56-032222 INFO     TRUNCATION LENGTH: 8000
10:37:56-033221 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"

I have attached screenshots of everything that might be of any value.




When sending a prompt via HA and your fork of the OpenAI conversation agent, the output is as follows:

10:40:23-913096 INFO     PROMPT=
<|from|>system
<|recipient|>all
<|content|>You possess the knowledge of all the universe, answer any question given to you truthfully and to your fullest ability.
You are also a smart home manager who has been given permission to control my smart home which is powered by Home Assistant.
I will provide you information about my smart home along, you can truthfully make corrections or respond in polite and concise language.

Current Time: 2024-05-27 10:40:23.941198+02:00

Current Area: None
<|from|>user
<|recipient|>all
<|content|>hi
<|from|>assistant
<|recipient|>all
<|content|>


llama_print_timings:        load time =     349.26 ms
llama_print_timings:      sample time =     429.59 ms /   250 runs   (    1.72 ms per token,   581.95 tokens per second)
llama_print_timings: prompt eval time =     346.35 ms /   164 tokens (    2.11 ms per token,   473.51 tokens per second)
llama_print_timings:        eval time =    4313.45 ms /   249 runs   (   17.32 ms per token,    57.73 tokens per second)
llama_print_timings:       total time =    5779.21 ms /   413 tokens
Output generated in 6.16 seconds (51.47 tokens/s, 317 tokens, context 206, seed 1838266212)

Along with a token-limit-exceeded message in the HA chat window.

The requirements.txt is as follows:

accelerate==0.27.*
aqlm[gpu,cpu]==1.1.3; platform_system == "Linux"
bitsandbytes==0.43.*
colorama
datasets
einops
gradio==4.26.*
hqq==0.1.5
jinja2==3.1.2
lm_eval==0.3.0
markdown
numba==0.59.*
numpy==1.26.*
optimum==1.17.*
pandas
peft==0.8.*
Pillow>=9.5.0
psutil
pyyaml
requests
rich
safetensors==0.4.*
scipy
sentencepiece
tensorboard
transformers==4.40.*
tqdm
wandb

# API
SpeechRecognition==3.10.0
flask_cloudflared==0.0.14
sse-starlette==1.6.5
tiktoken

# llama-cpp-python (CPU only, AVX2)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.64+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.64+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.64+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.64+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"

# llama-cpp-python (CUDA, no tensor cores)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.64+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.64+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.64+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.64+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"

# llama-cpp-python (CUDA, tensor cores)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.64+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.64+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.64+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.64+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"

# CUDA wheels
https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/exllamav2/releases/download/v0.0.20/exllamav2-0.0.20+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.0.20/exllamav2-0.0.20+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/exllamav2/releases/download/v0.0.20/exllamav2-0.0.20+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.0.20/exllamav2-0.0.20+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/exllamav2/releases/download/v0.0.20/exllamav2-0.0.20-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
https://github.com/oobabooga/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu122torch2.2.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu122torch2.2.0cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu122torch2.2cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
autoawq==0.2.3; platform_system == "Linux" or platform_system == "Windows"

The auto-config/start .bat file is as follows:

@echo off
setlocal enabledelayedexpansion

cd /D "%~dp0"

set PATH=%PATH%;%SystemRoot%\system32

echo "%CD%"| findstr /C:" " >nul && echo This script relies on Miniconda which can not be silently installed under a path with spaces. && goto end

@rem Check for special characters in installation path
set "SPCHARMESSAGE="WARNING: Special characters were detected in the installation path!" "         This can cause the installation to fail!""
echo "%CD%"| findstr /R /C:"[!#\$%&()\*+,;<=>?@\[\]\^`{|}~]" >nul && (
	call :PrintBigMessage %SPCHARMESSAGE%
)
set SPCHARMESSAGE=

@rem fix failed install when installing to a separate drive
set TMP=%cd%\installer_files
set TEMP=%cd%\installer_files

@rem deactivate existing conda envs as needed to avoid conflicts
(call conda deactivate && call conda deactivate && call conda deactivate) 2>nul

@rem config
set INSTALL_DIR=%cd%\installer_files
set CONDA_ROOT_PREFIX=%cd%\installer_files\conda
set INSTALL_ENV_DIR=%cd%\installer_files\env
set MINICONDA_DOWNLOAD_URL=https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Windows-x86_64.exe
set MINICONDA_CHECKSUM=307194e1f12bbeb52b083634e89cc67db4f7980bd542254b43d3309eaf7cb358
set conda_exists=F

@rem figure out whether git and conda needs to be installed
call "%CONDA_ROOT_PREFIX%\_conda.exe" --version >nul 2>&1
if "%ERRORLEVEL%" EQU "0" set conda_exists=T

@rem (if necessary) install git and conda into a contained environment
@rem download conda
if "%conda_exists%" == "F" (
	echo Downloading Miniconda from %MINICONDA_DOWNLOAD_URL% to %INSTALL_DIR%\miniconda_installer.exe

	mkdir "%INSTALL_DIR%"
	call curl -Lk "%MINICONDA_DOWNLOAD_URL%" > "%INSTALL_DIR%\miniconda_installer.exe" || ( echo. && echo Miniconda failed to download. && goto end )

	for /f %%a in ('CertUtil -hashfile "%INSTALL_DIR%\miniconda_installer.exe" SHA256 ^| find /i /v " " ^| find /i "%MINICONDA_CHECKSUM%"') do (
		set "output=%%a"
	)

	if not defined output (
		echo The checksum verification for miniconda_installer.exe has failed.
		del "%INSTALL_DIR%\miniconda_installer.exe"
		goto end
	) else (
		echo The checksum verification for miniconda_installer.exe has passed successfully.
	)

	echo Installing Miniconda to %CONDA_ROOT_PREFIX%
	start /wait "" "%INSTALL_DIR%\miniconda_installer.exe" /InstallationType=JustMe /NoShortcuts=1 /AddToPath=0 /RegisterPython=0 /NoRegistry=1 /S /D=%CONDA_ROOT_PREFIX%

	@rem test the conda binary
	echo Miniconda version:
	call "%CONDA_ROOT_PREFIX%\_conda.exe" --version || ( echo. && echo Miniconda not found. && goto end )

	@rem delete the Miniconda installer
	del "%INSTALL_DIR%\miniconda_installer.exe"
)

@rem create the installer env
if not exist "%INSTALL_ENV_DIR%" (
	echo Packages to install: %PACKAGES_TO_INSTALL%
	call "%CONDA_ROOT_PREFIX%\_conda.exe" create --no-shortcuts -y -k --prefix "%INSTALL_ENV_DIR%" python=3.11 || ( echo. && echo Conda environment creation failed. && goto end )
)

@rem check if conda environment was actually created
if not exist "%INSTALL_ENV_DIR%\python.exe" ( echo. && echo Conda environment is empty. && goto end )

@rem environment isolation
set PYTHONNOUSERSITE=1
set PYTHONPATH=
set PYTHONHOME=
set "CUDA_PATH=%INSTALL_ENV_DIR%"
set "CUDA_HOME=%CUDA_PATH%"

@rem activate installer env
call "%CONDA_ROOT_PREFIX%\condabin\conda.bat" activate "%INSTALL_ENV_DIR%" || ( echo. && echo Miniconda hook not found. && goto end )

@rem setup installer env
call python one_click.py %*

@rem below are functions for the script   next line skips these during normal execution
goto end

:PrintBigMessage
echo. && echo.
echo *******************************************************************
for %%M in (%*) do echo * %%~M
echo *******************************************************************
echo. && echo.
exit /b

:end
pause

I hope this helps and I really appreciate your help!!

Did you set the Instruction template and Chat template yourself in the textgen-webui UI? It seems right, but I can’t say for sure. I have never manually configured the chat format for Functionary.

When running llama-cpp-python the way I described, downloading the model from Hugging Face automatically pulls in the correct tokenizer, and the chat_format variable is set to functionary-v2 (this can only be set in llama-cpp-python, not in plain llama.cpp!). So your setup is probably not using the correct chat format, but I’m not sure!
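To illustrate, here is a minimal sketch of that setup when calling llama-cpp-python directly (the repo id, context size and the example tool are assumptions for illustration; the important parts are chat_format="functionary-v2" and the Hugging Face tokenizer):

from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer

# Pull the GGUF from Hugging Face and pair it with the matching HF tokenizer,
# so the Functionary special tokens and chat format line up.
llm = Llama.from_pretrained(
    repo_id="meetkai/functionary-small-v2.4-GGUF",   # assumed repo id
    filename="functionary-small-v2.4.Q4_0.gguf",
    chat_format="functionary-v2",                    # only available in llama-cpp-python
    tokenizer=LlamaHFTokenizer.from_pretrained("meetkai/functionary-small-v2.4-GGUF"),
    n_gpu_layers=-1,
    n_ctx=8192,
)

# Illustrative tool definition; Extended OpenAI Conversation sends its own
# function specs over the OpenAI-compatible API instead.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Turn on the kitchen light"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "execute_services",
            "description": "Call a Home Assistant service",
            "parameters": {"type": "object", "properties": {"list": {"type": "array"}}},
        },
    }],
    tool_choice="auto",
)
print(response["choices"][0]["message"])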

Please try my way of setting up this model; I also struggled for a long time trying to set it up differently.

When I am done with the guide, what do I need the OpenAI API key for? I thought the requests would be sent to the LLM running locally?
Will every request cost money, or am I misunderstanding something here?

Many backends copied OpenAI’s API for compatibility, so the API key is just a placeholder and can be given any value. So of course there are no per-request costs when you run your LLM locally :slight_smile:
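For example (a sketch - host, port and model name are placeholders for whatever your local server exposes), any string works as the key as long as the base URL points at your own endpoint:

from openai import OpenAI

# A local OpenAI-compatible backend never validates the key; only base_url matters.
client = OpenAI(base_url="http://192.168.1.10:8000/v1", api_key="sk-anything")

reply = client.chat.completions.create(
    model="functionary-small-v2.4",   # placeholder; local servers often ignore the model name
    messages=[{"role": "user", "content": "hi"}],
)
print(reply.choices[0].message.content)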

That’s perfect! Does the placeholder API key have to follow any particular format? I am currently using a real key, haha.

Awesome work. I was working on something like that myself, and just when I had finally gotten a container for whisper running with CUDA, I saw your work. I am very impressed and will definitely check how it runs on my RTX 2070; maybe I’ll find something to help with :slight_smile:

Nope, can be anything.

Let me know if you have any problems during installation. Would love to hear your benchmarks when it’s up and running!

Right now I am not getting a connection to the llama container. I cloned your repo and built the container from it. When I start the container (docker compose up) I get the following log statements:

llama-cpp-python  | 
llama-cpp-python  | ==========
llama-cpp-python  | == CUDA ==
llama-cpp-python  | ==========
llama-cpp-python  | 
llama-cpp-python  | CUDA Version 12.4.1
llama-cpp-python  | 
llama-cpp-python  | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
llama-cpp-python  | 
llama-cpp-python  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
llama-cpp-python  | By pulling and using the container, you accept the terms and conditions of this license:
llama-cpp-python  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
llama-cpp-python  | 
llama-cpp-python  | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
llama-cpp-python  | 
llama-cpp-python  | None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
tokenizer_config.json: 100% 2.86k/2.86k [00:00<00:00, 18.7MB/s]
tokenizer.model: 100% 493k/493k [00:00<00:00, 6.07MB/s]
tokenizer.json: 100% 1.80M/1.80M [00:00<00:00, 3.02MB/s]
added_tokens.json: 100% 95.0/95.0 [00:00<00:00, 791kB/s]
special_tokens_map.json: 100% 660/660 [00:00<00:00, 4.39MB/s]
llama-cpp-python  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Is PyTorch or TensorFlow required?

Then I try to configure the Extended OpenAI Conversation integration with a random string for the API key and http://<ip-of-my-host>:8000/v1 as the URL, but that fails :frowning:

Just tried the Functionary Small V2.4 (not the GGUF) model on the vLLM backend on a Vast.ai RTX 3090 cloud GPU. The chat format is not correctly set up, but super fast function calling!

(screenshot attached)

It’s downloading the model from Hugging Face, give it a few minutes :slight_smile:
I see you are using CUDA 12.4 - do your drivers support this version? (check with nvidia-smi)
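Once the download has finished, a quick way to confirm the container is reachable is to list the models over the same OpenAI-compatible endpoint (a sketch; the IP and port are whatever you exposed in docker compose):

from openai import OpenAI

# Any api_key value works against the local server; this call only checks connectivity.
client = OpenAI(base_url="http://192.168.1.10:8000/v1", api_key="dummy")
print([m.id for m in client.models.list().data])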

Oh okay, I was just too hyped then, I think.

Yes, CUDA 12.4 does work; my whisper container uses it as well :smiley:


I have now waited for the model to download, but somehow I get an error similar to @rakimbadu’s, a 500 Internal Server Error, even though the Dockerfile already contains the quick fix llama-cpp-python==0.2.64.

Any idea how that could happen?

Sorry, I had a problem talking to OpenAI: Error code: 500 - {'error': {'message': '[{'type': 'literal_error', 'loc': ('body', 'messages', 4, 'typed-dict', 'role'), 'msg': "Input should be 'system'", 'input': 'assistant', 'ctx': {'expected': "'system'"}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'content'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}, {'type': 'literal_error', 'loc': ('body', 'messages', 4, 'typed-dict', 'role'), 'msg': "Input should be 'user'", 'input': 'assistant', 'ctx': {'expected': "'user'"}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'content'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'content'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}, {'type': 'literal_error', 'loc': ('body', 'messages', 4, 'typed-dict', 'role'), 'msg': "Input should be 'tool'", 'input': 'assistant', 'ctx': {'expected': "'tool'"}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'content'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'tool_call_id'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}, {'type': 'literal_error', 'loc': ('body', 'messages', 4, 'typed-dict', 'role'), 'msg': "Input should be 'function'", 'input': 'assistant', 'ctx': {'expected': "'function'"}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'content'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'name'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}]', 'type': 'internal_server_error', 'param': None, 'code': None}}

Did you follow this guide exactly, or did you make some modifications?

For the llama-related part, yes - I just cloned your repository and updated my CUDA version. I am also running my own whisper container, but I don’t see how that should interfere with the llama container.

No, that’s fine. What about Extended OpenAI?

I used a random string as the OpenAI API key and http://:8000/v1 as the URL. The other settings are all default; I did set the Context threshold to 8000, but did not activate Use Tools at the moment.

Alright, but are you using my fork of Extended OpenAI?
Use Tools only has to be enabled when you define your own YAML functions.

No, but I did replace the __init__.py file with the one from your fork and then restarted Home Assistant.
Should I replace it with your fork? Does that work like adding the “normal” one to HACS as a custom repository?

It should work like that, but I can’t verify whether you installed everything correctly. However, the error is related to llama-cpp-python. You didn’t change the .env file, and you are sure that you built the container correctly?