Ollama Addon: Running SLMs Locally on the Same Box and Not Just Remotely

With the recent Ollama integration into Home Assistant, I’ve been exploring its capabilities and finding it quite good. However, I believe there’s even more potential if we could run Ollama directly as an addon on the same hardware. Currently, I’m using an Asus Chromebox 3 with an Intel® Core™ i7-8550U processor and 16 GB of RAM, and running Ollama locally as an addon has been a positive experience. By leveraging small language models like tinyllama, tinydolphin, phi, etc., I’m getting quick response times of 2-3 seconds from my Assist device, an esp32-s3-box connected through the new Ollama integration.
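
If you want to test this without a voice satellite, you can call the local agent directly with a conversation.process action; the agent_id below is just an example and will match whatever conversation entity the Ollama integration created for your model:

action: conversation.process
data:
  text: Turn on the kitchen lights
  # example entity id; depends on the model you configured in the integration
  agent_id: conversation.tinyllama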

Perhaps voting for your own request would be a good idea. I did.

Feature Request Guidelines 📔.


I’ve just put together such an addon: GitHub - SirUli/homeassistant-ollama-addon: Provides an Home Assistant addon configuration for Ollama.


Great addon, it works for me. But I’m unsure how to enable GPU support. With HAOS, I’m not even sure the GPU has a driver. Is there a way to get this working? The readme points to the Ollama website (and maybe I’m just not piecing it together), but that seems to be a guide for a different environment.

@SirUli once your add-on is running, can you change the model? The readme doesn’t give much detail on that.

To me it looks like you just delete the integration and re-add it to use a different model.


Has anyone tried this Ollama addon with the new Hailo 10H M.2 module (Hailo 10 series) AI accelerator hardware?

PS: Apparently the less expensive modules in the Hailo family, like the Hailo 8 series, will not work unless the LLM has been specifically compiled with their compiler, which itself requires very powerful hardware.

Are you using these models for assist or only as a fallback?

In general, I’d be interested in knowing what people have tried with local generative AI; I’m not sure whether there’s somewhere this is already being discussed.

I would also be interested in how to select a model.

Set up the integration as described in the readme. If you need more than one model at the same time, you need to add the integration twice and have sufficient resources. If you want to change the model, delete the integration and set it up again with the new model.
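
If you want the new model to already be downloaded before you re-add the integration, one option (just a sketch; the hostname and model name below are placeholders, and the addon must be reachable from Home Assistant on its usual port 11434) is a rest_command that calls Ollama’s standard /api/pull endpoint:

rest_command:
  # Hypothetical helper to pre-pull a model into the Ollama addon.
  # Replace the host with whatever address your addon listens on.
  ollama_pull_model:
    url: "http://homeassistant.local:11434/api/pull"
    method: POST
    content_type: "application/json"
    payload: '{"model": "tinydolphin"}'

After reloading the configuration, calling rest_command.ollama_pull_model downloads the model in the background; once the pull finishes, you can delete and re-add the Ollama integration pointing at it.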

Issue:
I’ve been trying to use the native Ollama integration with the Ollama Add-On, running tinyllama, but I cannot get things working. No matter what I do, intent recognition fails if I have Assist or Home-LLM (v1-v3) selected.

Any help is appreciated!

Background
I’ve kept everything at the standard settings but receive an Unknown Error whenever I attempt to use the conversation.process action.

action: conversation.process
data:
  text: What lights are on?
  agent_id: conversation.tinyllama


I can see in the Ollama Add-On logs that attempts were made to hit the chat endpoint:

time=2025-03-12T16:37:29.563Z level=INFO source=images.go:432 msg="total blobs: 6"
time=2025-03-12T16:37:29.563Z level=INFO source=images.go:439 msg="total unused blobs removed: 0"
time=2025-03-12T16:37:29.563Z level=INFO source=routes.go:1292 msg="Listening on [::]:11434 (version 0.6.0)"
time=2025-03-12T16:37:29.563Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-03-12T16:37:29.566Z level=INFO source=gpu.go:377 msg="no compatible GPUs were discovered"
time=2025-03-12T16:37:29.566Z level=INFO source=types.go:130 msg="inference compute" id=0 library=cpu variant="" compute="" driver=0.0 name="" total="15.4 GiB" available="12.8 GiB"
[GIN] 2025/03/12 - 16:37:35 | 200 |    8.517514ms |  192.168.40.216 | GET      "/api/tags"
[GIN] 2025/03/12 - 16:39:37 | 200 |    1.240583ms |  192.168.40.216 | GET      "/api/tags"
[GIN] 2025/03/12 - 16:39:38 | 500 |  503.917579ms |  192.168.40.216 | POST     "/api/pull"
[GIN] 2025/03/12 - 16:39:41 | 200 |    1.058605ms |  192.168.40.216 | GET      "/api/tags"
[GIN] 2025/03/12 - 16:39:58 | 200 |    1.155844ms |  192.168.40.216 | GET      "/api/tags"
time=2025-03-12T16:39:58.760Z level=INFO source=download.go:176 msg="downloading 2af3b81862c6 in 7 100 MB part(s)"
time=2025-03-12T16:40:07.040Z level=INFO source=download.go:176 msg="downloading af0ddbdaaa26 in 1 70 B part(s)"
time=2025-03-12T16:40:08.318Z level=INFO source=download.go:176 msg="downloading c8472cd9daed in 1 31 B part(s)"
time=2025-03-12T16:40:09.614Z level=INFO source=download.go:176 msg="downloading fa956ab37b8c in 1 98 B part(s)"
time=2025-03-12T16:40:10.909Z level=INFO source=download.go:176 msg="downloading 6331358be52a in 1 483 B part(s)"
[GIN] 2025/03/12 - 16:40:12 | 200 | 14.185524102s |  192.168.40.216 | POST     "/api/pull"
[GIN] 2025/03/12 - 16:40:12 | 200 |     1.45732ms |  192.168.40.216 | GET      "/api/tags"
[GIN] 2025/03/12 - 16:40:44 | 200 |    1.586762ms |  192.168.40.216 | GET      "/api/tags"
[GIN] 2025/03/12 - 16:40:58 | 200 |    1.376906ms |  192.168.40.216 | GET      "/api/tags"
[GIN] 2025/03/12 - 16:41:27 | 400 |   20.711739ms |  192.168.40.216 | POST     "/api/chat"
[GIN] 2025/03/12 - 16:41:53 | 400 |     9.56381ms |  192.168.40.216 | POST     "/api/chat"
[GIN] 2025/03/12 - 16:42:01 | 200 |      94.635µs |  192.168.40.129 | GET      "/"
[GIN] 2025/03/12 - 16:43:05 | 200 |    1.269183ms |  192.168.40.216 | GET      "/api/tags"
[GIN] 2025/03/12 - 16:43:10 | 200 |    1.364042ms |  192.168.40.216 | GET      "/api/tags"
[GIN] 2025/03/12 - 16:43:10 | 200 |    1.291184ms |  192.168.40.216 | GET      "/api/tags"
[GIN] 2025/03/12 - 16:43:28 | 200 |    1.252031ms |  192.168.40.216 | GET      "/api/tags"
[GIN] 2025/03/12 - 16:43:42 | 400 |   10.410082ms |  192.168.40.216 | POST     "/api/chat"
[GIN] 2025/03/12 - 16:47:12 | 400 |    7.757886ms |  192.168.40.216 | POST     "/api/chat"
[GIN] 2025/03/12 - 16:48:08 | 400 |   22.350764ms |  192.168.40.216 | POST     "/api/chat"
[GIN] 2025/03/12 - 16:48:11 | 400 |   11.196124ms |  192.168.40.216 | POST     "/api/chat"
[GIN] 2025/03/12 - 16:52:12 | 400 |    22.39579ms |  192.168.40.216 | POST     "/api/chat"
[GIN] 2025/03/12 - 16:52:18 | 400 |    8.089691ms |  192.168.40.216 | POST     "/api/chat"
[GIN] 2025/03/12 - 16:54:59 | 400 |   22.844492ms |  192.168.40.216 | POST     "/api/chat"

The debug trace from the Voice Assistant pipeline provided a bit more clarity:

init_options:
  start_stage: intent
  end_stage: intent
  input:
    text: What lights are on?
  pipeline: 01jp5mkv51zkejm6kxvy1x7nnp
  conversation_id: null
stage: done
run:
  pipeline: 01jp5mkv51zkejm6kxvy1x7nnp
  language: en
  conversation_id: 01JP5MMDRS1N6C5S5DZ8S9DK06
  runner_data:
    stt_binary_handler_id: null
    timeout: 300
events:
  - type: run-start
    data:
      pipeline: 01jp5mkv51zkejm6kxvy1x7nnp
      language: en
      conversation_id: 01JP5MMDRS1N6C5S5DZ8S9DK06
      runner_data:
        stt_binary_handler_id: null
        timeout: 300
    timestamp: "2025-03-12T16:57:31.674291+00:00"
  - type: intent-start
    data:
      engine: conversation.tinyllama
      language: en
      intent_input: What lights are on?
      conversation_id: 01JP5MMDRS1N6C5S5DZ8S9DK06
      device_id: null
      prefer_local_intents: true
    timestamp: "2025-03-12T16:57:31.674383+00:00"
  - type: error
    data:
      code: intent-failed
      message: Unexpected error during intent recognition
    timestamp: "2025-03-12T16:57:31.788200+00:00"
  - type: run-end
    data: null
    timestamp: "2025-03-12T16:57:31.788282+00:00"
intent:
  engine: conversation.tinyllama
  language: en
  intent_input: What lights are on?
  conversation_id: 01JP5MMDRS1N6C5S5DZ8S9DK06
  device_id: null
  prefer_local_intents: true
  done: false
error:
  code: intent-failed
  message: Unexpected error during intent recognition

I can confirm that tinyllama is working on the Ollama Add-On when I set the integration to No control.

Action:

action: conversation.process
data:
  text: What lights are on?
  agent_id: conversation.tinyllama

Response:

response:
  speech:
    plain:
      speech: >-
        The lights mentioned in the given text are not specified. It is unclear
        which lights are on.
      extra_data: null
  card: {}
  language: en
  response_type: action_done
  data:
    targets: []
    success: []
    failed: []
conversation_id: 01JP5MSN3WYM9A1HXGJS28ADB8

Ollama Add-On Logs:

load_tensors:          CPU model buffer size =   606.53 MiB
time=2025-03-12T16:59:47.982Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 8192
llama_init_from_model: n_ctx_per_seq = 8192
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_pre_seq (8192) > n_ctx_train (2048) -- possible training context overflow
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 22, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =   176.00 MiB
llama_init_from_model: KV self size  =  176.00 MiB, K (f16):   88.00 MiB, V (f16):   88.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.13 MiB
llama_init_from_model:        CPU compute buffer size =   544.01 MiB
llama_init_from_model: graph nodes  = 710
llama_init_from_model: graph splits = 1
time=2025-03-12T16:59:48.233Z level=INFO source=server.go:624 msg="llama runner started in 0.50 seconds"
[GIN] 2025/03/12 - 16:59:51 | 200 |  3.847565655s |  192.168.40.216 | POST     "/api/chat"
[GIN] 2025/03/12 - 16:59:56 | 200 |  7.355661487s |  192.168.40.216 | POST     "/api/chat"
[GIN] 2025/03/12 - 17:00:03 | 200 |    1.077693ms |  192.168.40.216 | GET      "/api/tags"
[GIN] 2025/03/12 - 17:00:05 | 400 |   10.554958ms |  192.168.40.216 | POST     "/api/chat"
[GIN] 2025/03/12 - 17:00:14 | 200 |    1.240289ms |  192.168.40.216 | GET      "/api/tags"
[GIN] 2025/03/12 - 17:00:22 | 200 |  5.820094733s |  192.168.40.216 | POST     "/api/chat"
[GIN] 2025/03/12 - 17:00:25 | 200 |  2.512705131s |  192.168.40.216 | POST     "/api/chat"