Local AI & LLM on Home Assistant Yellow with Llama 3, Phi 3, Gemma 2, and TinyLlama

With the recent release of state-of-the-art large language models (LLMs), there is an increased focus on deploying them on-device or on embedded hardware. There is also an opportunity for Home Assistant (HA) to leverage these advancements. Below are results from testing and experiments on deploying these models to a HA Yellow kit, which includes a Raspberry Pi Compute Module 4, validating that they can be reliably deployed and integrated. The HA Green kit should also work; please let me know if you are able to test it.

One of the main usability concerns is the performance of the LLMs in terms of tokens per second. I have provided the rates of the tested models, along with their descriptions, below. The sample sizes (number of tests per LLM) were too low for descriptive statistics but are sufficient for a proof of concept. Subjectively, the 1B and 2B models (tinyllama and gemma:2b-instruct) were the most fun, but the 3.8B model (phi3) had the best balance of accuracy and performance (tokens/s). If we extrapolate, tokens per second appears to decrease roughly logarithmically as the number of parameters increases.

The --verbose flag was used to capture metrics such as the eval rate presented above. The prompts tested were "Please write a haiku" and "Please write a Python function that implements bubble sort on a list of integers." All responses from all models were accurate.
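For reference, the eval rate for each model was collected with a command along these lines (a sketch, assuming the Docker-based setup described later in this post):

    docker exec -it ollama ollama run tinyllama --verbose "Please write a haiku"
    # --verbose prints timing statistics after the response,
    # including the prompt eval rate and eval rate in tokens/s.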

Model ID | # of Parameters | Size | Overview
llama3 | 8B | 4.7 GB | The 8B version of Meta's latest LLM, which has been tested to run on a Raspberry Pi.
llama3:8b-instruct-q2_k | 8B | 3.2 GB | The same model quantized to 2 bits instead of 4. The average tokens per second is slightly higher, and this technique could be applied to other models.
tinyllama | 1.1B | 637 MB | At about 5 tokens per second, this was the most performant model and still provided impressive responses.
phi3 | 3.8B | 2.3 GB | At a little more than 1 token per second, throughput was merely satisfactory, but accuracy was high.
gemma:2b-instruct | 2B | 1.6 GB | This was at the threshold of fun, and on some leaderboards it ranks higher than tinyllama.

The following graphs visualize the processor and memory load on the Raspberry Pi when launching tinyllama and asking it a programming question (the model had already been downloaded). Processor utilization spiked to 100% and stayed there until the end of the response. Total RAM utilization was lower than expected and, throughout testing, stayed under 3 GB even for the largest model tested, llama3 8B.
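As a cross-check on these graphs, the container's processor and memory use can also be watched directly from a terminal; a minimal sketch, assuming the container is named ollama as in the setup steps below:

    docker stats ollama
    # Streams a live table with the container's CPU %, memory usage/limit, and I/O.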

Click here to demo tinyllama

I want to thank home-llm for their existing work on this. However, that setup recommends or requires a GPU, whereas the intent of this project is to target embedded systems like the ones traditionally found in Home Assistant setups and to provide a plug-and-play solution.

This solution uses Ollama on a Raspberry Pi, deployed through Docker containers. There were unique challenges, including the limited support for running LLMs on ARM systems. Here are the requirements to recreate the results based on my testing (a quick way to check them is shown after the list):

  • SSH access to the machine
  • 1.5 GHz quad-core processor
  • 4 GB of RAM
  • 10 GB of free disk space
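To check these over SSH, standard Linux commands can be used (a sketch; output format varies by system):

    nproc        # CPU core count
    free -h      # total and available RAM
    df -h /      # free disk space on the root filesystem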

⚠️ This involves running (currently) unsupported software ⚠️

An overview of the steps taken:

  1. Gain SSH access to a terminal with the ability to execute Docker commands
  2. Run docker pull --platform linux/arm64 ollama/ollama to install Ollama, the software that runs LLMs
  3. Run docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama to run Ollama in the background
  4. Run docker exec -it ollama ollama run tinyllama to play with a performant LLM.

You should now be able to run a chat interface through your terminal. An important consideration is how to use these LLMs responsibly, and this solution will not actively support models without standard content filtering.
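Because the container publishes port 11434, the same models can also be reached over Ollama's HTTP API, which is how other services (for example a web UI or a Home Assistant integration) would talk to it. A minimal sketch of a non-interactive request, assuming tinyllama has already been pulled in step 4:

    curl http://localhost:11434/api/generate -d '{
      "model": "tinyllama",
      "prompt": "Please write a haiku",
      "stream": false
    }'
    # With "stream": false the server returns a single JSON object whose
    # "response" field contains the generated text.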


Next Steps

  • Home Assistant Add-ons use a similar containerized (Docker) environment, which makes it feasible to provide an opinionated add-on for running LLMs with a web UI.
  • The LLM will be fed the status of the Home Assistant instance and other pertinent information, which the user can then query quickly in natural language.
  • Multilingual support will also be tested.
  • Test out LLaVA so that image inputs can be provided, and explore options for image generation.
  • Analyze whether the LLMs should have write access to the machine.

The HA team is currently working with the Nvidia Jetson team to potentially leverage JetPack as an accelerator for running STT, TTS and LLMs locally. Running anything on a low-power device not designed for it is going to be painfully slow, not accurate, lock up the system, or all of the above. In my ideal world, some kind of external box like the Jetson dev kit, with an API to support remote calls from any satellite that needs an instant response, would be ideal. If someone could create satellites like a PoE-powered Josh One, that would be amazing!


I ran a 3B, 8B and 14B model with the prompts HA generates. I have too many devices: the Ollama API calls basically ignore more than half of them and only pay attention to the last third or so. It's as if the prompt is getting middle-chomped by Ollama.

How do I set the context size with the built-in Ollama integration?

For context:

Jun 19 00:58:20 roxanne.dragonfear ollama[2547720]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Jun 19 00:58:20 roxanne.dragonfear ollama[2547720]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
Jun 19 00:58:20 roxanne.dragonfear ollama[2547720]: ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
Jun 19 00:58:20 roxanne.dragonfear ollama[2547720]: ggml_cuda_init: found 1 CUDA devices:
Jun 19 00:58:20 roxanne.dragonfear ollama[2547720]:   Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
Jun 19 00:58:20 roxanne.dragonfear ollama[2547720]: llm_load_tensors: ggml ctx size =    0.30 MiB
Jun 19 00:58:27 roxanne.dragonfear ollama[2547720]: llm_load_tensors: offloading 32 repeating layers to GPU
Jun 19 00:58:27 roxanne.dragonfear ollama[2547720]: llm_load_tensors: offloading non-repeating layers to GPU
Jun 19 00:58:27 roxanne.dragonfear ollama[2547720]: llm_load_tensors: offloaded 33/33 layers to GPU
Jun 19 00:58:27 roxanne.dragonfear ollama[2547720]: llm_load_tensors:        CPU buffer size =   281.81 MiB
Jun 19 00:58:27 roxanne.dragonfear ollama[2547720]: llm_load_tensors:      CUDA0 buffer size =  4155.99 MiB
Jun 19 00:58:28 roxanne.dragonfear ollama[2547720]: llama_new_context_with_model: n_ctx      = 2048
Jun 19 00:58:28 roxanne.dragonfear ollama[2547720]: llama_new_context_with_model: n_batch    = 512
Jun 19 00:58:28 roxanne.dragonfear ollama[2547720]: llama_new_context_with_model: n_ubatch   = 512
Jun 19 00:58:28 roxanne.dragonfear ollama[2547720]: llama_new_context_with_model: flash_attn = 0
Jun 19 00:58:28 roxanne.dragonfear ollama[2547720]: llama_new_context_with_model: freq_base  = 500000.0
Jun 19 00:58:28 roxanne.dragonfear ollama[2547720]: llama_new_context_with_model: freq_scale = 1
Jun 19 00:58:28 roxanne.dragonfear ollama[2547720]: llama_kv_cache_init:      CUDA0 KV buffer size =   256.00 MiB
Jun 19 00:58:28 roxanne.dragonfear ollama[2547720]: llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
Jun 19 00:58:28 roxanne.dragonfear ollama[2547720]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.50 MiB
Jun 19 00:58:28 roxanne.dragonfear ollama[2547720]: llama_new_context_with_model:      CUDA0 compute buffer size =   258.50 MiB
Jun 19 00:58:28 roxanne.dragonfear ollama[2547720]: llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
Jun 19 00:58:28 roxanne.dragonfear ollama[2547720]: llama_new_context_with_model: graph nodes  = 1030
Jun 19 00:58:28 roxanne.dragonfear ollama[2547720]: llama_new_context_with_model: graph splits = 2
Jun 19 00:58:28 roxanne.dragonfear ollama[2656479]: INFO [main] model loaded | tid="139937710837760" timestamp=1718758708

That’s the context that the API call from HA creates.

BTW, I have verified that increasing the context length from 2048 to 4096 (in Open-WebUI) lets my HA-generated prompt work correctly. So yes, the models need more context, and we need the ability to customize it.
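For anyone hitting this, one workaround while the integration's context size is not configurable is to bake a larger context window into a derived model with an Ollama Modelfile. A sketch, where the base model and the name llama3-4k are just examples:

    # Modelfile
    FROM llama3
    PARAMETER num_ctx 4096

    # Build the derived model and point the HA Ollama integration at it:
    #   ollama create llama3-4k -f Modelfile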

Following. I seem to have the same issue.
