With the recent Ollama integration into Home Assistant, I’ve been exploring its capabilities and finding it quite good. However, I believe there’s even more potential if we could run Ollama directly as an add-on on the same hardware. Currently, I’m using an Asus Chromebox 3 with an Intel® Core™ i7-8550U processor and 16 GB of RAM, and running Ollama locally as an add-on has been a positive experience. By leveraging small language models like tinyllama, tinydolphin, phi, etc., I’m achieving quick response times of 2-3 seconds from my Assist device, an esp32-s3-box connected to the new Ollama integration.
Perhaps voting for your own request would be a good idea. I did.
I’ve just put together such an add-on: GitHub - SirUli/homeassistant-ollama-addon (provides a Home Assistant add-on configuration for Ollama).
Great add-on, works for me. But I’m unsure how to enable GPU support. With HAOS, I’m not even sure the GPU has a driver. Is there a way to get this working? The readme points to the Ollama website (and maybe I’m just not piecing it together), but that seems to be a guide for a different environment.
@SirUli, once your add-on is running, can you change the model? The readme doesn’t give much detail on that.
To me it looks like you just delete the integration and re-add it to use a different model.
Has anyone tried this Ollama add-on with the new Hailo-10H M.2 module (Hailo-10 series) AI accelerator hardware?
PS: Apparently the less expensive modules in the Hailo family, like the Hailo-8 series, will not work unless the LLM model has been specifically compiled with their compiler on very powerful hardware.
Are you using these models for Assist or only as a fallback?
In general I’d be interested in knowing what people have tried with local gen AI; I’m not sure if there’s somewhere it is already being discussed.
I would also be interested in how to select a model.
Run the integration as described in the readme; the model is selected when you set the integration up. If you need more than one model at the same time, you need to add the integration twice and make sure you have sufficient resources. If you want to change the model, delete the integration and set it up again.
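If you want to see which models the add-on already has, or pull another one before re-adding the integration, you can also talk to the add-on’s Ollama API directly. A minimal sketch with the requests library, assuming the add-on is reachable at homeassistant.local:11434 (adjust host/port to your setup); /api/tags and /api/pull are the same endpoints that show up in the add-on logs:

import requests

OLLAMA_URL = "http://homeassistant.local:11434"  # assumption: adjust to your add-on's host/port

# List the models the add-on currently knows about (GET /api/tags)
tags = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10).json()
print([m["name"] for m in tags.get("models", [])])

# Pull another model so it is available when you re-add the integration (POST /api/pull)
resp = requests.post(
    f"{OLLAMA_URL}/api/pull",
    json={"name": "tinydolphin", "stream": False},
    timeout=600,  # pulls can take a while on slow links
)
print(resp.status_code, resp.json())

The integration itself still only uses one model at a time, so after pulling you re-add it and pick the new model during setup.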
Issue:
I’ve been trying to use the native Ollama integration with the HACS Ollama add-on running tinyllama, but I cannot get things working. No matter what I do, intent recognition fails if I have Assist or Home-LLM (v1-v3) selected.
Any help is appreciated!
Background:
I’ve kept everything standardized but receive an Unknown Error whenever I attempt to use the conversation.process service:
action: conversation.process
data:
  text: What lights are on?
  agent_id: conversation.tinyllama
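The same call can also be reproduced outside Developer Tools against Home Assistant’s conversation HTTP API, which returns the full JSON instead of just “Unknown Error”. A minimal sketch with the requests library, assuming a long-lived access token in HA_TOKEN and that POST /api/conversation/process accepts the same text/language/agent_id fields as the action:

import os
import requests

HA_URL = "http://homeassistant.local:8123"  # assumption: adjust to your Home Assistant URL
TOKEN = os.environ["HA_TOKEN"]              # assumption: a long-lived access token

resp = requests.post(
    f"{HA_URL}/api/conversation/process",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "text": "What lights are on?",
        "language": "en",
        "agent_id": "conversation.tinyllama",
    },
    timeout=60,
)
print(resp.status_code)
print(resp.json())  # raw conversation response, with more detail than the UI error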
I can see in the Ollama add-on logs that attempts were made to use the chat endpoint:
time=2025-03-12T16:37:29.563Z level=INFO source=images.go:432 msg="total blobs: 6"
time=2025-03-12T16:37:29.563Z level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-03-12T16:37:29.563Z level=INFO source=routes.go:1292 msg="Listening on [::]:11434 (version 0.6.0)"
time=2025-03-12T16:37:29.563Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-03-12T16:37:29.566Z level=INFO source=gpu.go:377 msg="no compatible GPUs were discovered"
time=2025-03-12T16:37:29.566Z level=INFO source=types.go:130 msg="inference compute" id=0 library=cpu variant="" compute="" driver=0.0 name="" total="15.4 GiB" available="12.8 GiB"
[GIN] 2025/03/12 - 16:37:35 | 200 | 8.517514ms | 192.168.40.216 | GET "/api/tags"
[GIN] 2025/03/12 - 16:39:37 | 200 | 1.240583ms | 192.168.40.216 | GET "/api/tags"
[GIN] 2025/03/12 - 16:39:38 | 500 | 503.917579ms | 192.168.40.216 | POST "/api/pull"
[GIN] 2025/03/12 - 16:39:41 | 200 | 1.058605ms | 192.168.40.216 | GET "/api/tags"
[GIN] 2025/03/12 - 16:39:58 | 200 | 1.155844ms | 192.168.40.216 | GET "/api/tags"
time=2025-03-12T16:39:58.760Z level=INFO source=download.go:176 msg="downloading 2af3b81862c6 in 7 100 MB part(s)"
time=2025-03-12T16:40:07.040Z level=INFO source=download.go:176 msg="downloading af0ddbdaaa26 in 1 70 B part(s)"
time=2025-03-12T16:40:08.318Z level=INFO source=download.go:176 msg="downloading c8472cd9daed in 1 31 B part(s)"
time=2025-03-12T16:40:09.614Z level=INFO source=download.go:176 msg="downloading fa956ab37b8c in 1 98 B part(s)"
time=2025-03-12T16:40:10.909Z level=INFO source=download.go:176 msg="downloading 6331358be52a in 1 483 B part(s)"
[GIN] 2025/03/12 - 16:40:12 | 200 | 14.185524102s | 192.168.40.216 | POST "/api/pull"
[GIN] 2025/03/12 - 16:40:12 | 200 | 1.45732ms | 192.168.40.216 | GET "/api/tags"
[GIN] 2025/03/12 - 16:40:44 | 200 | 1.586762ms | 192.168.40.216 | GET "/api/tags"
[GIN] 2025/03/12 - 16:40:58 | 200 | 1.376906ms | 192.168.40.216 | GET "/api/tags"
[GIN] 2025/03/12 - 16:41:27 | 400 | 20.711739ms | 192.168.40.216 | POST "/api/chat"
[GIN] 2025/03/12 - 16:41:53 | 400 | 9.56381ms | 192.168.40.216 | POST "/api/chat"
[GIN] 2025/03/12 - 16:42:01 | 200 | 94.635µs | 192.168.40.129 | GET "/"
[GIN] 2025/03/12 - 16:43:05 | 200 | 1.269183ms | 192.168.40.216 | GET "/api/tags"
[GIN] 2025/03/12 - 16:43:10 | 200 | 1.364042ms | 192.168.40.216 | GET "/api/tags"
[GIN] 2025/03/12 - 16:43:10 | 200 | 1.291184ms | 192.168.40.216 | GET "/api/tags"
[GIN] 2025/03/12 - 16:43:28 | 200 | 1.252031ms | 192.168.40.216 | GET "/api/tags"
[GIN] 2025/03/12 - 16:43:42 | 400 | 10.410082ms | 192.168.40.216 | POST "/api/chat"
[GIN] 2025/03/12 - 16:47:12 | 400 | 7.757886ms | 192.168.40.216 | POST "/api/chat"
[GIN] 2025/03/12 - 16:48:08 | 400 | 22.350764ms | 192.168.40.216 | POST "/api/chat"
[GIN] 2025/03/12 - 16:48:11 | 400 | 11.196124ms | 192.168.40.216 | POST "/api/chat"
[GIN] 2025/03/12 - 16:52:12 | 400 | 22.39579ms | 192.168.40.216 | POST "/api/chat"
[GIN] 2025/03/12 - 16:52:18 | 400 | 8.089691ms | 192.168.40.216 | POST "/api/chat"
[GIN] 2025/03/12 - 16:54:59 | 400 | 22.844492ms | 192.168.40.216 | POST "/api/chat"
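Those GIN lines only log the status code, not why the request was rejected. Posting a request of the same shape straight at the add-on surfaces Ollama’s error body; one common cause of a 400 on /api/chat is tool definitions being sent (which Home Assistant does when the agent is allowed to control the home) to a model that doesn’t support tool calling. A minimal sketch with the requests library, assuming the add-on at homeassistant.local:11434 and using a hypothetical probe tool just to test tool support:

import requests

OLLAMA_URL = "http://homeassistant.local:11434"  # assumption: adjust to your add-on's host/port

# Hypothetical minimal tool definition, only used to probe whether the model accepts tools at all
probe_tool = {
    "type": "function",
    "function": {
        "name": "get_lights",
        "description": "Dummy tool to test tool-calling support.",
        "parameters": {"type": "object", "properties": {}},
    },
}

resp = requests.post(
    f"{OLLAMA_URL}/api/chat",
    json={
        "model": "tinyllama",
        "messages": [{"role": "user", "content": "What lights are on?"}],
        "tools": [probe_tool],
        "stream": False,
    },
    timeout=120,
)
print(resp.status_code)
print(resp.text)  # the error body explains the 400, e.g. if the model does not support tools

If the body reports that the model does not support tools, that would line up with the “No control” test further down succeeding, since no tools are passed to the model in that mode.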
Debug from the Voice Assistant provided a bit more clarity:
init_options:
  start_stage: intent
  end_stage: intent
  input:
    text: What lights are on?
  pipeline: 01jp5mkv51zkejm6kxvy1x7nnp
  conversation_id: null
stage: done
run:
  pipeline: 01jp5mkv51zkejm6kxvy1x7nnp
  language: en
  conversation_id: 01JP5MMDRS1N6C5S5DZ8S9DK06
  runner_data:
    stt_binary_handler_id: null
    timeout: 300
events:
  - type: run-start
    data:
      pipeline: 01jp5mkv51zkejm6kxvy1x7nnp
      language: en
      conversation_id: 01JP5MMDRS1N6C5S5DZ8S9DK06
      runner_data:
        stt_binary_handler_id: null
        timeout: 300
    timestamp: "2025-03-12T16:57:31.674291+00:00"
  - type: intent-start
    data:
      engine: conversation.tinyllama
      language: en
      intent_input: What lights are on?
      conversation_id: 01JP5MMDRS1N6C5S5DZ8S9DK06
      device_id: null
      prefer_local_intents: true
    timestamp: "2025-03-12T16:57:31.674383+00:00"
  - type: error
    data:
      code: intent-failed
      message: Unexpected error during intent recognition
    timestamp: "2025-03-12T16:57:31.788200+00:00"
  - type: run-end
    data: null
    timestamp: "2025-03-12T16:57:31.788282+00:00"
intent:
  engine: conversation.tinyllama
  language: en
  intent_input: What lights are on?
  conversation_id: 01JP5MMDRS1N6C5S5DZ8S9DK06
  device_id: null
  prefer_local_intents: true
  done: false
error:
  code: intent-failed
  message: Unexpected error during intent recognition
I can confirm that tinyllama is working on the Ollama add-on when I set the integration to “No control”.
Action:
action: conversation.process
data:
  text: What lights are on?
  agent_id: conversation.tinyllama
Response:
response:
  speech:
    plain:
      speech: >-
        The lights mentioned in the given text are not specified. It is unclear
        which lights are on.
      extra_data: null
  card: {}
  language: en
  response_type: action_done
  data:
    targets: []
    success: []
    failed: []
conversation_id: 01JP5MSN3WYM9A1HXGJS28ADB8
Ollama Add-On Logs:
load_tensors: CPU model buffer size = 606.53 MiB
time=2025-03-12T16:59:47.982Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 8192
llama_init_from_model: n_ctx_per_seq = 8192
llama_init_from_model: n_batch = 512
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 0
llama_init_from_model: freq_base = 10000.0
llama_init_from_model: freq_scale = 1
llama_init_from_model: n_ctx_per_seq (8192) > n_ctx_train (2048) -- possible training context overflow
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 22, can_shift = 1
llama_kv_cache_init: CPU KV buffer size = 176.00 MiB
llama_init_from_model: KV self size = 176.00 MiB, K (f16): 88.00 MiB, V (f16): 88.00 MiB
llama_init_from_model: CPU output buffer size = 0.13 MiB
llama_init_from_model: CPU compute buffer size = 544.01 MiB
llama_init_from_model: graph nodes = 710
llama_init_from_model: graph splits = 1
time=2025-03-12T16:59:48.233Z level=INFO source=server.go:624 msg="llama runner started in 0.50 seconds"
[GIN] 2025/03/12 - 16:59:51 | 200 | 3.847565655s | 192.168.40.216 | POST "/api/chat"
[GIN] 2025/03/12 - 16:59:56 | 200 | 7.355661487s | 192.168.40.216 | POST "/api/chat"
[GIN] 2025/03/12 - 17:00:03 | 200 | 1.077693ms | 192.168.40.216 | GET "/api/tags"
[GIN] 2025/03/12 - 17:00:05 | 400 | 10.554958ms | 192.168.40.216 | POST "/api/chat"
[GIN] 2025/03/12 - 17:00:14 | 200 | 1.240289ms | 192.168.40.216 | GET "/api/tags"
[GIN] 2025/03/12 - 17:00:22 | 200 | 5.820094733s | 192.168.40.216 | POST "/api/chat"
[GIN] 2025/03/12 - 17:00:25 | 200 | 2.512705131s | 192.168.40.216 | POST "/api/chat"