AI Voice Control for Home Assistant (Fully Local)

As in the download/setup process, or the parameters of the model?

Everything. I guess you don’t have any run commands since textgen webui provides a UI, so show me a screenshot of the UI where the model is launched. Is it downloaded directly from Huggingface, including references to the tokenizers, or did you manually download the model?

Alright, soooo:
Textgen:
One-click install, pulling references and requirements from a requirements.txt file, in which the llama-cpp-python version is specified.
Then, setting flags for listening on all network interfaces and enabling API / OpenAI-compatible access.
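For reference, these flags typically go in text-generation-webui's CMD_FLAGS.txt (a sketch assuming the stock flag names; older builds used --extensions openai for the OpenAI-compatible endpoint):

--listen
--api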
The model is downloaded directly from Hugging Face - functionary-small-v2.4.Q4_0.gguf.
Then, llama.cpp is selected as the backend - the startup log and parameters of the model are as follows:

10:37:54-341063 INFO     Loading "functionary-small-v2.4.Q4_0.gguf"
10:37:54-406277 INFO     llama.cpp weights detected:
                         "models\functionary-small-v2.4.Q4_0.gguf"
llama_model_loader: loaded meta data with 25 key-value pairs and 291 tensors from models\functionary-small-v2.4.Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32004
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32004]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32004]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32004]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  21:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  22:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 263/32004 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32004
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 7.24 B
llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW)
llm_load_print_meta: general.name     = .
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070 SUPER, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    70.32 MiB
llm_load_tensors:      CUDA0 buffer size =  3847.57 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 8000
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 50000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1000.00 MiB
llama_new_context_with_model: KV self size  = 1000.00 MiB, K (f16):  500.00 MiB, V (f16):  500.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   547.63 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    23.63 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'general.name': '.', 'general.architecture': 'llama', 'llama.block_count': '32', 'llama.vocab_size': '32004', 'llama.context_length': '32768', 'llama.rope.dimension_count': '128', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.head_count': '32', 'tokenizer.ggml.eos_token_id': '2', 'general.file_type': '2', 'llama.attention.head_count_kv': '8', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.freq_base': '1000000.000000', 'tokenizer.ggml.model': 'llama', 'general.quantization_version': '2', 'tokenizer.ggml.bos_token_id': '1', 'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.padding_token_id': '2', 'tokenizer.ggml.add_bos_token': 'true', 'tokenizer.ggml.add_eos_token': 'false', 'tokenizer.chat_template': '{% for message in messages %}\n{% if message[\'role\'] == \'user\' or message[\'role\'] == \'system\' %}\n{{ \'<|from|>\' + message[\'role\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% elif message[\'role\'] == \'tool\' %}\n{{ \'<|from|>\' + message[\'name\'] + \'\n<|recipient|>all\n<|content|>\' + message[\'content\'] + \'\n\' }}{% else %}\n{% set contain_content=\'no\'%}\n{% if message[\'content\'] is not none %}\n{{ \'<|from|>assistant\n<|recipient|>all\n<|content|>\' + message[\'content\'] }}{% set contain_content=\'yes\'%}\n{% endif %}\n{% if \'tool_calls\' in message and message[\'tool_calls\'] is not none %}\n{% for tool_call in message[\'tool_calls\'] %}\n{% set prompt=\'<|from|>assistant\n<|recipient|>\' + tool_call[\'function\'][\'name\'] + \'\n<|content|>\' + tool_call[\'function\'][\'arguments\'] %}\n{% if loop.index == 1 and contain_content == "no" %}\n{{ prompt }}{% else %}\n{{ \'\n\' + prompt}}{% endif %}\n{% endfor %}\n{% endif %}\n{{ \'<|stop|>\n\' }}{% endif %}\n{% endfor %}\n{% if add_generation_prompt %}{{ \'<|from|>assistant\n<|recipient|>\' }}{% endif %}'}
Using gguf chat template: {% for message in messages %}
{% if message['role'] == 'user' or message['role'] == 'system' %}
{{ '<|from|>' + message['role'] + '
<|recipient|>all
<|content|>' + message['content'] + '
' }}{% elif message['role'] == 'tool' %}
{{ '<|from|>' + message['name'] + '
<|recipient|>all
<|content|>' + message['content'] + '
' }}{% else %}
{% set contain_content='no'%}
{% if message['content'] is not none %}
{{ '<|from|>assistant
<|recipient|>all
<|content|>' + message['content'] }}{% set contain_content='yes'%}
{% endif %}
{% if 'tool_calls' in message and message['tool_calls'] is not none %}
{% for tool_call in message['tool_calls'] %}
{% set prompt='<|from|>assistant
<|recipient|>' + tool_call['function']['name'] + '
<|content|>' + tool_call['function']['arguments'] %}
{% if loop.index == 1 and contain_content == "no" %}
{{ prompt }}{% else %}
{{ '
' + prompt}}{% endif %}
{% endfor %}
{% endif %}
{{ '<|stop|>
' }}{% endif %}
{% endfor %}
{% if add_generation_prompt %}{{ '<|from|>assistant
<|recipient|>' }}{% endif %}
Using chat eos_token: </s>
Using chat bos_token: <s>
10:37:56-029220 INFO     Loaded "functionary-small-v2.4.Q4_0.gguf" in 1.69 seconds.
10:37:56-031224 INFO     LOADER: "llama.cpp"
10:37:56-032222 INFO     TRUNCATION LENGTH: 8000
10:37:56-033221 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"

I have attached screenshots of everything that might be of any value.




When sending a prompt via HA and your fork of the OpenAI conversation agent, the output is as follows:

10:40:23-913096 INFO     PROMPT=
<|from|>system
<|recipient|>all
<|content|>You possess the knowledge of all the universe, answer any question given to you truthfully and to your fullest ability.
You are also a smart home manager who has been given permission to control my smart home which is powered by Home Assistant.
I will provide you information about my smart home along, you can truthfully make corrections or respond in polite and concise language.

Current Time: 2024-05-27 10:40:23.941198+02:00

Current Area: None
<|from|>user
<|recipient|>all
<|content|>hi
<|from|>assistant
<|recipient|>all
<|content|>


llama_print_timings:        load time =     349.26 ms
llama_print_timings:      sample time =     429.59 ms /   250 runs   (    1.72 ms per token,   581.95 tokens per second)
llama_print_timings: prompt eval time =     346.35 ms /   164 tokens (    2.11 ms per token,   473.51 tokens per second)
llama_print_timings:        eval time =    4313.45 ms /   249 runs   (   17.32 ms per token,    57.73 tokens per second)
llama_print_timings:       total time =    5779.21 ms /   413 tokens
Output generated in 6.16 seconds (51.47 tokens/s, 317 tokens, context 206, seed 1838266212)

Along with a token-limit-exceeded message in the HA chat window.

The requirements.txt is as follows:

accelerate==0.27.*
aqlm[gpu,cpu]==1.1.3; platform_system == "Linux"
bitsandbytes==0.43.*
colorama
datasets
einops
gradio==4.26.*
hqq==0.1.5
jinja2==3.1.2
lm_eval==0.3.0
markdown
numba==0.59.*
numpy==1.26.*
optimum==1.17.*
pandas
peft==0.8.*
Pillow>=9.5.0
psutil
pyyaml
requests
rich
safetensors==0.4.*
scipy
sentencepiece
tensorboard
transformers==4.40.*
tqdm
wandb

# API
SpeechRecognition==3.10.0
flask_cloudflared==0.0.14
sse-starlette==1.6.5
tiktoken

# llama-cpp-python (CPU only, AVX2)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.64+cpuavx2-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.64+cpuavx2-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.64+cpuavx2-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/cpu/llama_cpp_python-0.2.64+cpuavx2-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"

# llama-cpp-python (CUDA, no tensor cores)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.64+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.64+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.64+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda-0.2.64+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"

# llama-cpp-python (CUDA, tensor cores)
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.64+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.64+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.64+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/llama-cpp-python-cuBLAS-wheels/releases/download/textgen-webui/llama_cpp_python_cuda_tensorcores-0.2.64+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"

# CUDA wheels
https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/jllllll/AutoGPTQ/releases/download/v0.6.0/auto_gptq-0.6.0+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/exllamav2/releases/download/v0.0.20/exllamav2-0.0.20+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.0.20/exllamav2-0.0.20+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/oobabooga/exllamav2/releases/download/v0.0.20/exllamav2-0.0.20+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/oobabooga/exllamav2/releases/download/v0.0.20/exllamav2-0.0.20+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/oobabooga/exllamav2/releases/download/v0.0.20/exllamav2-0.0.20-py3-none-any.whl; platform_system == "Linux" and platform_machine != "x86_64"
https://github.com/oobabooga/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu122torch2.2.0cxx11abiFALSE-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/oobabooga/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu122torch2.2.0cxx11abiFALSE-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu122torch2.2cxx11abiFALSE-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp311-cp311-win_amd64.whl; platform_system == "Windows" and python_version == "3.11"
https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp310-cp310-win_amd64.whl; platform_system == "Windows" and python_version == "3.10"
https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp311-cp311-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.11"
https://github.com/jllllll/GPTQ-for-LLaMa-CUDA/releases/download/0.1.1/gptq_for_llama-0.1.1+cu121-cp310-cp310-linux_x86_64.whl; platform_system == "Linux" and platform_machine == "x86_64" and python_version == "3.10"
autoawq==0.2.3; platform_system == "Linux" or platform_system == "Windows"

The auto-config/start .bat file is as follows:

@echo off
setlocal enabledelayedexpansion

cd /D "%~dp0"

set PATH=%PATH%;%SystemRoot%\system32

echo "%CD%"| findstr /C:" " >nul && echo This script relies on Miniconda which can not be silently installed under a path with spaces. && goto end

@rem Check for special characters in installation path
set "SPCHARMESSAGE="WARNING: Special characters were detected in the installation path!" "         This can cause the installation to fail!""
echo "%CD%"| findstr /R /C:"[!#\$%&()\*+,;<=>?@\[\]\^`{|}~]" >nul && (
	call :PrintBigMessage %SPCHARMESSAGE%
)
set SPCHARMESSAGE=

@rem fix failed install when installing to a separate drive
set TMP=%cd%\installer_files
set TEMP=%cd%\installer_files

@rem deactivate existing conda envs as needed to avoid conflicts
(call conda deactivate && call conda deactivate && call conda deactivate) 2>nul

@rem config
set INSTALL_DIR=%cd%\installer_files
set CONDA_ROOT_PREFIX=%cd%\installer_files\conda
set INSTALL_ENV_DIR=%cd%\installer_files\env
set MINICONDA_DOWNLOAD_URL=https://repo.anaconda.com/miniconda/Miniconda3-py310_23.3.1-0-Windows-x86_64.exe
set MINICONDA_CHECKSUM=307194e1f12bbeb52b083634e89cc67db4f7980bd542254b43d3309eaf7cb358
set conda_exists=F

@rem figure out whether git and conda needs to be installed
call "%CONDA_ROOT_PREFIX%\_conda.exe" --version >nul 2>&1
if "%ERRORLEVEL%" EQU "0" set conda_exists=T

@rem (if necessary) install git and conda into a contained environment
@rem download conda
if "%conda_exists%" == "F" (
	echo Downloading Miniconda from %MINICONDA_DOWNLOAD_URL% to %INSTALL_DIR%\miniconda_installer.exe

	mkdir "%INSTALL_DIR%"
	call curl -Lk "%MINICONDA_DOWNLOAD_URL%" > "%INSTALL_DIR%\miniconda_installer.exe" || ( echo. && echo Miniconda failed to download. && goto end )

	for /f %%a in ('CertUtil -hashfile "%INSTALL_DIR%\miniconda_installer.exe" SHA256 ^| find /i /v " " ^| find /i "%MINICONDA_CHECKSUM%"') do (
		set "output=%%a"
	)

	if not defined output (
		echo The checksum verification for miniconda_installer.exe has failed.
		del "%INSTALL_DIR%\miniconda_installer.exe"
		goto end
	) else (
		echo The checksum verification for miniconda_installer.exe has passed successfully.
	)

	echo Installing Miniconda to %CONDA_ROOT_PREFIX%
	start /wait "" "%INSTALL_DIR%\miniconda_installer.exe" /InstallationType=JustMe /NoShortcuts=1 /AddToPath=0 /RegisterPython=0 /NoRegistry=1 /S /D=%CONDA_ROOT_PREFIX%

	@rem test the conda binary
	echo Miniconda version:
	call "%CONDA_ROOT_PREFIX%\_conda.exe" --version || ( echo. && echo Miniconda not found. && goto end )

	@rem delete the Miniconda installer
	del "%INSTALL_DIR%\miniconda_installer.exe"
)

@rem create the installer env
if not exist "%INSTALL_ENV_DIR%" (
	echo Packages to install: %PACKAGES_TO_INSTALL%
	call "%CONDA_ROOT_PREFIX%\_conda.exe" create --no-shortcuts -y -k --prefix "%INSTALL_ENV_DIR%" python=3.11 || ( echo. && echo Conda environment creation failed. && goto end )
)

@rem check if conda environment was actually created
if not exist "%INSTALL_ENV_DIR%\python.exe" ( echo. && echo Conda environment is empty. && goto end )

@rem environment isolation
set PYTHONNOUSERSITE=1
set PYTHONPATH=
set PYTHONHOME=
set "CUDA_PATH=%INSTALL_ENV_DIR%"
set "CUDA_HOME=%CUDA_PATH%"

@rem activate installer env
call "%CONDA_ROOT_PREFIX%\condabin\conda.bat" activate "%INSTALL_ENV_DIR%" || ( echo. && echo Miniconda hook not found. && goto end )

@rem setup installer env
call python one_click.py %*

@rem below are functions for the script   next line skips these during normal execution
goto end

:PrintBigMessage
echo. && echo.
echo *******************************************************************
for %%M in (%*) do echo * %%~M
echo *******************************************************************
echo. && echo.
exit /b

:end
pause

I hope this helps and I really appreciate your help!!

Did you set the Instruction template and Chat template yourself in the textgen-webui UI? It seems right, but I can’t say for sure. I have never manually configured the chat format for Functionary.

When running llama-cpp-python the way I described, downloading the model from Hugging Face automatically pulls in the correct tokenizer, and the chat_format variable is set to functionary-v2 (this can only be set in llama-cpp-python, not in plain llama.cpp!). So your setup is probably not using the correct chat format, but I’m not sure!
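To illustrate, here is a minimal sketch of that setup when calling llama-cpp-python directly (the repo id, context size and the example tool are assumptions for illustration; the important parts are chat_format="functionary-v2" and the Hugging Face tokenizer):

from llama_cpp import Llama
from llama_cpp.llama_tokenizer import LlamaHFTokenizer

# Pull the GGUF from Hugging Face and pair it with the matching HF tokenizer,
# so the Functionary special tokens and chat format line up.
llm = Llama.from_pretrained(
    repo_id="meetkai/functionary-small-v2.4-GGUF",   # assumed repo id
    filename="functionary-small-v2.4.Q4_0.gguf",
    chat_format="functionary-v2",                    # only available in llama-cpp-python
    tokenizer=LlamaHFTokenizer.from_pretrained("meetkai/functionary-small-v2.4-GGUF"),
    n_gpu_layers=-1,
    n_ctx=8192,
)

# Illustrative tool definition; Extended OpenAI Conversation sends its own
# function specs over the OpenAI-compatible API instead.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Turn on the kitchen light"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "execute_services",
            "description": "Call a Home Assistant service",
            "parameters": {"type": "object", "properties": {"list": {"type": "array"}}},
        },
    }],
    tool_choice="auto",
)
print(response["choices"][0]["message"])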

Please try my way of setting up this model; I also struggled for a long time trying to set it up differently.

When I am done with the guide, what do I need the OpenAI API key for? I thought the requests would be sent to the LLM running locally?
Will every request cost money, or am I misunderstanding something here?

Many backends copied OpenAI’s API for compatibility, so the API key is just a placeholder and can be given any value. So of course there are no per-request costs when you run your LLM locally :slight_smile:
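For example (a sketch - host, port and model name are placeholders for whatever your local server exposes), any string works as the key as long as the base URL points at your own endpoint:

from openai import OpenAI

# A local OpenAI-compatible backend never validates the key; only base_url matters.
client = OpenAI(base_url="http://192.168.1.10:8000/v1", api_key="sk-anything")

reply = client.chat.completions.create(
    model="functionary-small-v2.4",   # placeholder; local servers often ignore the model name
    messages=[{"role": "user", "content": "hi"}],
)
print(reply.choices[0].message.content)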

That’s perfect! Does the placeholder API key have to follow any particular format? I am currently using a real key, haha.

Awesome work. I was working on something like that myself, and just when I had finally gotten a container for whisper running with CUDA, I saw your work. I am very impressed and will definitely check how it runs on my RTX 2070; maybe I’ll find something to help with :slight_smile:

Nope, can be anything.

Let me know if you have any problems during installation. Would love to hear your benchmarks when it’s up and running!

Right now I am not getting a connection to the llama container. I cloned your repo and built the container from it. When I start the container (docker compose up) I get the following log statements:

llama-cpp-python  | 
llama-cpp-python  | ==========
llama-cpp-python  | == CUDA ==
llama-cpp-python  | ==========
llama-cpp-python  | 
llama-cpp-python  | CUDA Version 12.4.1
llama-cpp-python  | 
llama-cpp-python  | Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
llama-cpp-python  | 
llama-cpp-python  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
llama-cpp-python  | By pulling and using the container, you accept the terms and conditions of this license:
llama-cpp-python  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
llama-cpp-python  | 
llama-cpp-python  | A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
llama-cpp-python  | 
llama-cpp-python  | None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
tokenizer_config.json: 100% 2.86k/2.86k [00:00<00:00, 18.7MB/s]
tokenizer.model: 100% 493k/493k [00:00<00:00, 6.07MB/s]
tokenizer.json: 100% 1.80M/1.80M [00:00<00:00, 3.02MB/s]
added_tokens.json: 100% 95.0/95.0 [00:00<00:00, 791kB/s]
special_tokens_map.json: 100% 660/660 [00:00<00:00, 4.39MB/s]
llama-cpp-python  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Is PyTorch or TensorFlow required?

Then I try to configure the Extended OpenAI Conversation integration with a random string for the API key and http://<ip-of-my-host>:8000/v1 as the URL, but that fails :frowning:

Just tried the Functionary Small V2.4 (not the GGUF) model on the vLLM backend on a Vast.ai RTX 3090 cloud GPU. The chat format is not correctly set up, but super fast function calling!

(screenshot attached)

It’s downloading the model from Hugging Face, give it a few minutes :slight_smile:
I see you are using CUDA 12.4 - do your drivers support this version? (check with nvidia-smi)
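Once the download has finished, a quick way to confirm the container is reachable is to list the models over the same OpenAI-compatible endpoint (a sketch; the IP and port are whatever you exposed in docker compose):

from openai import OpenAI

# Any api_key value works against the local server; this call only checks connectivity.
client = OpenAI(base_url="http://192.168.1.10:8000/v1", api_key="dummy")
print([m.id for m in client.models.list().data])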

Oh okay, I was just too hyped then, I think.

Yes, CUDA 12.4 does work; my whisper container uses it as well :smiley:


I have now waited for the model to download, but somehow I get an error similar to @rakimbadu’s, a 500 Internal Server Error, even though the Dockerfile already contains the quick fix llama-cpp-python==0.2.64.

Any idea how that could happen?

Sorry, I had a problem talking to OpenAI: Error code: 500 - {'error': {'message': '[{'type': 'literal_error', 'loc': ('body', 'messages', 4, 'typed-dict', 'role'), 'msg': "Input should be 'system'", 'input': 'assistant', 'ctx': {'expected': "'system'"}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'content'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}, {'type': 'literal_error', 'loc': ('body', 'messages', 4, 'typed-dict', 'role'), 'msg': "Input should be 'user'", 'input': 'assistant', 'ctx': {'expected': "'user'"}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'content'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'content'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}, {'type': 'literal_error', 'loc': ('body', 'messages', 4, 'typed-dict', 'role'), 'msg': "Input should be 'tool'", 'input': 'assistant', 'ctx': {'expected': "'tool'"}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'content'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'tool_call_id'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}, {'type': 'literal_error', 'loc': ('body', 'messages', 4, 'typed-dict', 'role'), 'msg': "Input should be 'function'", 'input': 'assistant', 'ctx': {'expected': "'function'"}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'content'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}, {'type': 'missing', 'loc': ('body', 'messages', 4, 'typed-dict', 'name'), 'msg': 'Field required', 'input': {'role': 'assistant', 'function_call': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'tool_calls': [{'id': 'call_rkrAZhv40vTBPile4M51kMuk', 'function': {'arguments': '{"list": [{"domain": "weather", "service": "forecast", "service_data": {"entity_id": "sensor.weather_forecast"}}]}', 'name': 'execute_services'}, 'type': 'function'}]}}]', 'type': 'internal_server_error', 'param': None, 'code': None}}

Did you follow this guide exactly, or did you make some modifications?

For the llama-related part, yes - I just cloned your repository and updated my CUDA version. I am also running my own whisper container, but I don’t see how that should interfere with the llama container.

No, that’s fine. What about Extended OpenAI?

I used a random string as the OpenAI API key and http://:8000/v1 as the URL. The other settings are all default; I did set the Context threshold to 8000, but did not activate Use Tools at the moment.

Alright, but are you using my fork of Extended OpenAI?
Use Tools only has to be enabled when you define your own YAML functions.

No, but I did replace the __init__.py file with the one from your fork and then restarted Home Assistant.
Should I replace it with your fork? Does that work like adding the “normal” one to HACS as a custom repository?

It should work like that, but I can’t verify whether you installed everything correctly. However, the error is related to llama-cpp-python. You didn’t change the .env file, and you are sure that you built the container correctly?