My Journey to a reliable and enjoyable locally hosted voice assistant

I have been watching Home Assistant’s progress with Assist for some time. We previously used Google Home via Nest Minis, and have now switched to a fully local Assist setup that prefers local intent matching first, backed by llama.cpp (previously Ollama). In this post I will share the steps I took to get where I am today, the decisions I made, and why they were right for my use case specifically.

Links to Additional Improvements

Here are links to additional improvements posted about in this thread:

  • New Features
  • Fixing Unwanted HA / LLM Behaviors
  • Optimizing Performance

Hardware Details

I have tested a wide variety of hardware, from a 3050 to a 3090. Most modern discrete GPUs can run local Assist effectively; the hardware required depends on your expectations of capability and speed.

I am running HomeAssistant on my UnRaid NAS, specs are not really important as it has nothing to do with HA Voice.

Voice Hardware:

  • 1 HA Voice Preview Satellite
  • 2 Satellite1 Small Squircle Enclosures
  • 1 Pixel 7a used as a satellite/ hub with View Assist

Voice Server Hardware:

  • Beelink MiniPC with USB4 (the exact model isn’t important as long as it has USB4)
  • USB4 eGPU enclosure

GPUs

The table below shows GPUs that I have tested with this setup. Response time will vary based on the model that is used.

| GPU | Model Class | Response Time (after prompt caching) | Notes |
| --- | --- | --- | --- |
| RTX 3090 24GB | 20B-30B MoE, 9B Dense | 1 - 2 seconds | Efficiently and quickly runs models that are optimal for this setup. |
| RX 7900XTX 24GB | 20B-30B MoE, 9B Dense | 1 - 2 seconds | Efficiently and quickly runs models that are optimal for this setup. |
| RTX 5060Ti 16GB | 20B MoE, 9B Dense | 1.5 - 3 seconds | Quick enough to run models that are optimal for this setup with responses < 3 seconds. |
| RX 9060XT 16GB | 20B MoE, 9B Dense | 1.5 - 4 seconds | Quick enough to run models that are optimal for this setup with responses < 4 seconds. |
| RTX 3050 8GB | 4B Dense | 3 seconds | Good for running small models with basic functionality. |

Models

The table below shows the models I have tested with this setup, with various features and their performance.

All of the models below handle basic tool calling well. Advanced features are listed along with each model's reliability at reproducing the desired behavior.

| Model | Multi device tool calls (1) | Understands context cues (2) | Parses misheard commands (3) | Ignores unexpected text from false positives (4) |
| --- | --- | --- | --- | --- |
| GGML Gemma4 26B-A4B Q4_K_M (thinking off) | :green_circle: | :green_circle: | :green_circle: | :green_circle: |
| GGML GPT-OSS:20B MXFP4 (med reasoning effort) | :green_circle: | :green_circle: | :green_circle: | :green_circle: |
| Unsloth GLM 4.7 Flash (30B) Q4_K_XL (reasoning enabled) | :green_circle: | :green_circle: | :yellow_circle: | :yellow_circle: |
| Unsloth Qwen3-VL:8B-Instruct Q6_K_XL | :green_circle: | :green_circle: | :yellow_circle: | :yellow_circle: |
| Unsloth Qwen3-30B-A3B-Instruct Q4_K_XL | :green_circle: | :yellow_circle: | :red_circle: | :yellow_circle: |
| Unsloth Qwen3.5-35B-A3B MXFP4_MOE | :yellow_circle: | :yellow_circle: | :yellow_circle: | :green_circle: |
| Unsloth GLM 4.7 Flash (30B) Q4_K_XL (reasoning disabled) | :yellow_circle: | :green_circle: | :red_circle: | :yellow_circle: |
| Unsloth Qwen3:4b-Instruct 2507 Q6_K_XL | :yellow_circle: | :red_circle: | :red_circle: | :red_circle: |

(1) Handles commands like “Turn on the fan and off the lights”
(2) Understands when it is in a particular area and does not ask “which light?” when there is only one light in the area, but does correctly ask when there are multiple of the device type in the given area.
(3) Is able to parse misheard commands (ex: “turn on the pan”) and reliably execute the intended command
(4) Is able to reliably ignore unwanted input (false activations) without also ignoring misheard text that was an intended command.

Voice Server Software:

Model Runner:

llama.cpp is recommended for optimal performance, see my reply below for details.
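To illustrate, a typical llama.cpp server launch for this kind of setup might look like the following sketch. The model path, port, and context size are placeholders to adapt; the flags themselves (`-m`, `-ngl`, `-c`) are standard llama.cpp server options.

```sh
# Start llama.cpp's OpenAI-compatible server with the model fully on the GPU.
# -ngl 99 offloads all layers; -c sets the context window size in tokens.
llama-server \
  -m /models/your-model.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  -c 16384
```

An OpenAI-compatible conversation integration in Home Assistant can then be pointed at the server's `/v1` endpoint.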

Speech to Text (Voice In):

The following are Speech to Text options that I have tested:

| Software | Model | Notes |
| --- | --- | --- |
| Wyoming ONNX ASR | Nvidia Parakeet V3 | Tested to be more accurate than V2 at the same speed. |
| Wyoming ONNX ASR | Nvidia Parakeet V2 | Accurate and fast; can run on CPU via OpenVINO or on an Nvidia GPU. |
| Rhasspy Faster Whisper | Nvidia Parakeet V2 | Slower because it runs directly via ONNX on CPU, which is slower than OpenVINO. |

Text to Speech (Voice Out):

| Software | Notes |
| --- | --- |
| Kokoro TTS | Can mix and match multiple voices / tones to get the desired output. Handles all text well. |
| Piper running on CPU | Has multiple selectable voices; works for general text but struggles with currency, phone numbers, and addresses. |

Home Assistant LLM Integrations

  • LLM Conversation — improves on the base conversation agent for a better default experience when talking with Assist
  • LLM Intents — provides additional tools for Assist (Web Search, Place Search, Weather Forecast)

The Journey

My point in posting this is not to suggest that what I have done is “the right way” or even something others should replicate. But I learned a lot throughout this process and I figured it would be worth sharing so others could get a better idea of what to expect, pitfalls, etc.

The Problem

Over the last year or two we noticed that Google Assistant on the Nest Minis got progressively dumber / worse while not bringing any new features. This was generally fine, as the WAF was still much higher than having no voice at all, but it became increasingly annoying as we were met with more and more "Sorry, I can't help with that" or "I don't know the answer to that, but according to XYZ source here is the answer". It generally worked, but not reliably, and it was often a fuss to get answers to arbitrary questions.

Then there is the usual privacy concern of having online microphones throughout your home, and the annoyance that every time AWS or something else went down you couldn’t use voice to control lights in the house.

Starting Out

I started by playing with one of Ollama's included models. Every few weeks I would connect Ollama to HA, spin up Assist, and try to use it. Every time I was disappointed and surprised by its lack of ability; most of the time even basic tool calls would not work. I do believe HA has made things better since, but I think the biggest issue was my own understanding.

The models you see listed on Ollama's site are not even close to exhaustive in terms of what can be run. Worse yet, the default :4b tags, for example, are often heavily quantized (Q4_K), which can cause a lot of problems. Once I learned I could use Hugging Face to find GGUF models at higher-precision quantizations, Assist immediately performed much better, with no problems with tool calling.
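For reference, Ollama can pull GGUF quants directly from Hugging Face by repository and quant tag. The repository and quant below are examples only, assuming they exist in that form on Hugging Face:

```sh
# Pull a specific higher-precision quant from Hugging Face
# instead of the default (often Q4) tag on Ollama's library
ollama pull hf.co/unsloth/Qwen3-4B-Instruct-2507-GGUF:Q6_K_XL
```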

Testing with Voice

After getting to the point where the fundamental basics were possible, I ordered a Voice Preview Edition for testing so I could get a better idea of the end-to-end experience. It took me some time to get things working well. Originally I had WiFi reception issues where the ping was very inconsistent on the VPE (despite it being next to the router), which led to stuttery speech output with a lot of mid-word pauses. After switching Piper to streaming and creating a new dedicated IoT network, performance has been much better.

Making Assist Useful

Controlling devices is great, and Ollama's ability to adjust devices when the local processing missed a command was helpful. But to replace our speakers, Assist had to be capable of the following:

  • Ability to give Day and Week Weather Forecasts
  • Ability to ask about a specific business to get opening / closing times
  • Ability to do general knowledge lookup to answer arbitrary questions
  • Ability to play music with search abilities entirely with voice

At first I was under the impression these would have to be built out separately, but I eventually found the brilliant llm-intents integration, which provides a number of these services to Assist (and by extension, Ollama). After setting these up, though, the results were mediocre.

The Importance of Your LLM Prompt

For those that want to see it, here is my prompt.

This is when I learned that the prompt will make or break your voice experience. The default HA prompt won’t get you very far, as LLMs need a lot of guidance to know what to do and when.

I generally improved my prompt by pasting my current prompt into ChatGPT along with a description of the LLM's current behavior and the desired behavior, then iterating back and forth until I consistently got the desired result. After a few cycles of this, I started to get a feel for how to make these improvements myself.

I started by trying to get weather working; the first challenge was getting the LLM to even call the weather service. I have found that a dedicated # section for each important service, along with a bulleted list of details / instructions, works best.

Then I needed the weather response formatted the way I wanted, without extra information. At first, responses would include commentary such as "sounds like a nice summery day!" or other things that detracted from their conciseness. Including a specific example of the desired output in the prompt worked best to get the exact response format I wanted.
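To make that concrete, a service section in that style could look like the following. This is a hypothetical sketch, not my actual prompt:

```text
# Weather
- When the user asks about weather, ALWAYS call the weather tool. Never answer from memory.
- Respond with only the forecast. No commentary, no small talk, no sign-off.
- Example response: "Today will be partly cloudy with a high of 72 and a low of 55. Rain is unlikely."
```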

For places and search, the problem was much the same: it did not want to call the tool, and instead insisted that it did not know the user's location or the answer to specific questions. This mostly just needed specific instructions to always call the relevant tool when certain types of questions were asked, and that has worked well.

The final problem I had to solve was emojis: most responses would end with a smiley face or similar, which does not TTS well. This took a few dedicated sections in the prompt, but overall it has completely removed them without adverse effects.

Solving Some Problems Manually

NOTE: Not sure if a recent Home Assistant or Music Assistant update improved things, but the LLM is now able to naturally search and play music without the automation. I am leaving this section in as an example, as I still believe automations can be a good way to solve some problems when there is not an easy way to give the LLM access to a certain feature.

The most desirable outcome would certainly be for every function to be executed perfectly by the LLM without intervention, but at least with the model I am using that is not the case. There are cases, though, where that really is not a bad thing.

In my case, music was one of those cases. I believe this is an area where improvements are currently being made, but for me the automatic route was not working well. I started by getting Music Assistant set up. I found various LLM blueprints to create a script that lets the LLM start playing music automatically, but it did not work well for me.

That is when I realized the power of the sentence automation trigger and the beauty of Music Assistant. I created an automation that triggers on Play {music}. The automation contains a map of assist_satellite to media_player, so it plays music on the correct media player based on which satellite made the request. It then passes {music} (which can be a song, album, artist, whatever) to Music Assistant's play service, which performs the search and starts playing.

Example Automation
alias: Music Shortcut
description: ""
triggers:
  - trigger: conversation
    command:
      - Play {music}
    id: play
  - trigger: conversation
    command: Stop playing
    id: stop
conditions: []
actions:
  - choose:
      - conditions:
          - condition: trigger
            id:
              - play
        sequence:
          - action: music_assistant.play_media
            metadata: {}
            data:
              media_id: "{{ trigger.slots.music }}"
            target:
              entity_id: "{{ target_player }}"
          - set_conversation_response: Playing {{ trigger.slots.music }}
      - conditions:
          - condition: trigger
            id:
              - stop
        sequence:
          - action: media_player.media_stop
            metadata: {}
            data: {}
            target:
              entity_id: "{{ target_player }}"
          - set_conversation_response: Stopped playing music.
variables:
  satellite_player_map: |
    {{
      {
        "assist_satellite.home_assistant_voice_xyz123": "media_player.my_desired_speaker",
      }
    }}
  target_player: |
    {{
      satellite_player_map.get(trigger.satellite_id, "media_player.default_speaker")
    }}
mode: single

Training a Custom Wakeword

The next problem to solve was the wakeword. For WAF, the default included options weren't going to work. After some back and forth we decided on Hey Robot. I used this repo to train a custom microWakeWord, which is usable on the VPE and Satellite1. Training only took ~30 minutes on my GPU and the results have been quite good. There are some false positives, but overall the rate is similar to the Google Homes that were replaced, and with the ability to automate muting we can likely work around that until the training / options get better.

The End Result

I definitely would not recommend this for the average Home Assistant user. IMO a lot of patience and research is needed to understand particular problems and work toward solutions, and I imagine we will run into more problems as we continue using this setup. I am certainly not done, but that is the beauty of this solution: most aspects of it can be tuned.

The goal has been met though, overall we have a more enjoyable voice assistant that runs locally without privacy concerns, and our core tasks are handled reliably.

Let me know what you think! I am happy to answer any questions.

46 Likes

I'm still playing with voice assistant, but I did some things to get music working using media players and voice assistants linked to the same areas, plus Music Assistant.

This is an example of a sentence automation I use to play a Music Assistant playlist:

alias: Voice - Play music by playlist genre
description: ""
triggers:
  - trigger: conversation
    command:
      - >-
        (play | put on) [(some | a)] [{genre}] (music | tunes | tracks |
        playlist) [in] [the] [{def_area}]
conditions: []
actions:
  - action: script.get_ma_playlist_id_from_name
    data:
      playlistname: >-
        {{ trigger.slots.genre  | replace('jacking', 'jackin') | replace('old
        school', 'oldskool') | replace('tim liquor','tinlicker') | replace('tim
        licker','tinlicker') | replace('tin licker','tinlicker') | replace('tin
        liquor','tinlicker') | replace('anjunadeep','anjuna_deep') | replace('
        ', '_') | lower }}
    response_variable: playlist_info
  - action: media_player.shuffle_set
    metadata: {}
    data:
      shuffle: true
    target:
      area_id: >
        {% if (trigger.slots.def_area | length >0)%} {{trigger.slots.def_area}}
        {%else%} {{area_id(trigger.device_id) }} {%endif%}
  - set_conversation_response: I've put on some {{ trigger.slots.genre }} music.
    enabled: true
  - action: music_assistant.play_media
    metadata: {}
    data:
      media_id: "{{ playlist_info.uri }}"
      enqueue: replace
    target:
      area_id: >
        {% if (trigger.slots.def_area | length >0)%} {{trigger.slots.def_area}}
        {%else%} {{area_id(trigger.device_id) }} {%endif%}
    enabled: true
mode: single

And the matching script:

alias: Get MA playlist ID from name
description: ""
mode: single
variables:
  playlistname: "{{ playlistname }}"
sequence:
  - action: music_assistant.get_library
    data:
      limit: 10
      search: "{{ playlistname }}"
      media_type: playlist
      config_entry_id: 01JPFPPNTCVYQAA9JSBY4319HS
    response_variable: ma_playlist
  - repeat:
      count: "{{ ma_playlist['items'] | length }}"
      sequence:
        - variables:
            playlistinfo:
              name: "{{ ma_playlist['items'][repeat.index -1].name }}"
              uri: "{{ ma_playlist['items'][repeat.index -1].uri }}"
        - if:
            - condition: template
              value_template: >-
                {{ ma_playlist['items'][repeat.index -1].name | lower ==
                playlistname | lower }}
          then:
            - stop: Returning playlist info as a dictionary.
              response_variable: playlistinfo

6 Likes

How many entities are you exposing to Qwen 4B?

I'm using Qwen 14B non-thinking, and with just 53 entities exposed it behaves very unreliably.

Sometimes it appears to ignore or forget entities, sometimes features like brightness or volume are not set by the model.

You are describing context overrun. Your entity descriptions plus tool descriptions plus full prompt cannot exceed the context window set for your model (the default for Qwen in Ollama is 8K, I think). Look in Ollama and you will see it telling you how much it overran; adjust from there.

You can reduce the number of exposed entities or exposed tools, shrink your prompt, or, if you have enough VRAM and your model supports it, crank up the model's context window (or all of the above).

Sounds like you're in 4K or 8K land, and that would be expected at around 50-something entities, depending on the length of your names etc.
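For the "crank the context window up" option, one route is Ollama's documented Modelfile mechanism. The base model name and context size below are examples to adapt:

```text
# Modelfile: derive a variant with a larger context window
FROM qwen3:4b
PARAMETER num_ctx 16384
```

Then build and use the variant with `ollama create qwen3-4b-16k -f Modelfile` and point Home Assistant at the new model name.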

2 Likes

Right now I have 32. On top of what Nathan suggested, depending on which entities you have, maybe consider if all of those devices will be addressed individually. You can create many different types of groups in HA which would only be one entity to pass in.
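For example, a YAML light group collapses several bulbs into a single exposed entity. This is the standard Home Assistant light group platform; the entity names are placeholders:

```yaml
# configuration.yaml — expose one group entity instead of three individual lights
light:
  - platform: group
    name: Living Room Lights
    unique_id: living_room_lights
    entities:
      - light.lamp_left
      - light.lamp_right
      - light.ceiling
```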

2 Likes

Thanks for the hint. In fact I'm also using Qwen3 4B Instruct with its base 8K context. Since I'm using an A2000 Ada with 16GB VRAM, I have now doubled the context. Results are better but not perfect. For example, "turn on the light in the living room" sometimes turns on a light in another room, or even a fan or socket.

I would love to use a 7-8B Qwen instruct model. Do you know of any available?

By the way, your post helped me a lot; please keep updating if you make further progress. Thank you!

Good suggestion; I have started grouping all the lights to reduce the entity count.

You can absolutely fit gpt-oss:20b in a 16GB card. It's my mainline local inference model and, tbh, is WAY more capable than Qwen. You still have to manage the context, but given the same context size I've been more successful there. In the Friday's Party thread (no, you don't need the whole thing) I talk about how to build context, what is needed, and why. If you're fitting in context and it still misbehaves, then you have a grounding problem.

Welcome to the see-saw. Too much context: not done right. Too little context: not done at all :sunglasses:

1 Like

Are you using the base Qwen from Ollama? These are typically quite heavily quantized, which is why I recommend picking a better quant from Hugging Face.

1 Like

Okay, this is getting better and better. I tried loading the Ollama version of gpt-oss:20b into my 16GB card but it did not fit. Any tips on how I can make this work?

Also: I am looking for a way for voice assist to memorize things, like preferences or its own findings. Is there any way to achieve this?

Thank you again very much :slight_smile:

Initially I was using the "latest" quant from Hugging Face, which I think is Q4_K_S. Right now I am running Q8_0; not sure if that's optimal. Any recommendations?

1 Like

I'll look at your card specifically with OSS 20B, but it was absolutely designed to fit in a 16GB card... We should be able to figure it out. Whatever model you do end up on, push that context size as big as you can without overrunning.

Keep trying models. You want long-context, reasoning, tool-use models.

Also, everything you just asked about is in the Friday thread... Sorry, I'm 220 posts deep now, but it's in there. Memory needs some specific considerations, and the caveats are in there too.

2 Likes

Had an interesting issue I ran into. I still prefer to have local-first handling enabled, as it is a tad faster and the "chime" is more pleasant than a "Turned on the light" response. However, I noticed some weird behavior when using What is the weather?, where the answer was nonsensical, while What is the weather today? correctly used the llm_intents script.

Now that Home Assistant 2025.12 shows the tools / intents that are called and their responses, I was able to get more insight. It turns out Home Assistant has a built-in weather intent, HassGetWeather, which was being matched locally. But I didn't have any weather entities exposed to Assist, so it effectively failed, fell back to the LLM, and the LLM was apparently just making up values based on the sensors it did have access to.

For now I just overrode the local intent by creating an automation that triggers on the sentence What is the weather and re-implements the logic, using the AI Task service to summarize the information. This is a workaround; I would really love it if Home Assistant exposed all of the available intents, along with a way to disable the ones you want handed off to the LLM.

Example Automation
alias: Override HassGetWeather
description: ""
triggers:
  - trigger: conversation
    command:
      - What is the weather
      - What's the weather
      - How is the weather
conditions: []
actions:
  - action: weather.get_forecasts
    metadata: {}
    target:
      entity_id: weather.forecast
    data:
      type: hourly
    response_variable: hourly_forecast
  - variables:
      items: "{{ 24 - now().hour }}"
      formatted_forecast: >
        {% set forecasts = hourly_forecast["weather.forecast"]["forecast"] %} {%
        for item in forecasts[:items] %} - Time: {{ as_timestamp(item.datetime)
        | timestamp_custom('%-I%p', true) | lower }}-{{
        (as_timestamp(item.datetime) + 3600) | timestamp_custom('%-I%p', true) |
        lower }}
          Temperature: {{ item.temperature | int }}
          General Condition: {{ item.condition }}
          Precipitation: {% if item.precipitation_probability < 20 %}unlikely{% elif item.precipitation_probability < 50 %}possible{% else %}likely{% endif %}
        {% endfor %}
  - action: ai_task.generate_data
    metadata: {}
    data:
      task_name: Summarize Weather
      instructions: >-
        You are a weather forecaster. Below is an hourly weather forecast, and
        your task is to summarize this information in one sentence. 

        Summarize the forecast below in one to two sentences:

        {{ formatted_forecast }}
    response_variable: summary
  - set_conversation_response: "{{ summary.data }}"
mode: single
3 Likes

Had some family visiting for the holidays, and that exposed some issues with the current setup. The main problem was wake word activation; I found an improved OpenWakeWord training script for the View Assist device, which helped.

However, the bigger problem was that any time there was a false activation, the LLM would end its response with a question, which created a loop. I had originally used a silence prompt instructing it to respond with " ", but that seemed to cause the speaker to make a static noise, and for some reason the model seems less willing to output that than a true word / phrase.

We also noticed that when activating with a command, if it heard you wrong the response was way too wordy; it often gave example device names or areas, which is entirely unnecessary. I adjusted my prompt's unclear-request handling, and this has dramatically improved things.

Handling Unclear Requests prompt section
# Handling Unclear Requests

When you receive input, FIRST determine if it is a request directed at you. Follow this decision hierarchy:

## Identify Questions First (Highest Priority)

- If the input contains any question - including question marks, interrogative phrasing ("should I", "am I", "what", "how", "why", "can I", "do you think", etc.), or rhetorical questions - treat it as a request for information and ANSWER IT. Questions are inherently directed at you, regardless of how casual, conversational, or rhetorical they sound.
- Do not treat questions as "conversation not directed at you" even if they don't explicitly address you by name or sound like internal monologue.
- Questions seeking advice, opinions, information, or reassurance should always be answered.

## When to Remain Silent

- If the input is a complete, coherent STATEMENT (not a question) that appears to be part of a conversation not directed at you (someone talking to someone else, a statement that doesn't address you and doesn't seek information): respond "Sorry." and do not ask follow-up questions.
- If the input is clearly not a request or question meant for you (conversation fragments, background noise interpreted as text): respond "Sorry." and do not ask follow-up questions.

## When to Ask for Repetition

- If the input seems garbled, nonsensical, or like you may have misheard, but appears to be an attempt to ask you a question or make a request: respond "Can you repeat that?"
- If the input is incomplete or unclear but seems like it could be a question or request directed at you: respond "Can you repeat that?"
- After asking "Can you repeat that?" once, if the user responds "No" or declines, do not ask again. Simply acknowledge with "Okay" or remain silent.

## When to Ask for Specific Clarification

- If you understand the user wants to do something but don't know which device, room, or area: ask a short, specific follow-up question. For example: "Which room?" or "Which device?" or "What would you like to control?"
- ABSOLUTELY NEVER provide examples, list options, or say "for example" when asking for clarification. Ask only the question itself, such as "Which fan?" or "Which room?" Do not add any additional text after the question.

## General Rules for All Clarification Responses

- Never give long explanations about not understanding. Keep all confusion responses to one short sentence ending with a question mark.
- When the user provides a clear request after you asked for clarification, you MUST use the appropriate tools (weather tool, search tool, device controls, etc.) to fulfill that request. Do not provide answers based on conversation context alone — always use the required tools.
- If the user responds "No", "Nevermind", or declines to provide clarification after you asked for it, simply acknowledge with "Okay" or "Understood" and wait for their next request. Do not ask follow-up questions or offer additional help unless the user makes a new request.
4 Likes

I have created a script which leverages Frigate and its Home Assistant integration to get information about what is happening on cameras outside.

This sends the current camera image to an AI task (must use a vision capable model) along with information from Frigate on the count and activity of object types.

This enables asking Home Assistant questions like “Who is at my door?” or “I just heard a noise in the backyard, do you see anything?”

Note the question time will be longer as it has to run the vision analysis as well.

Camera Analysis Script
sequence:
  - variables:
      camera_snake_case: "{{ camera | lower | replace(' ', '_') }}"
      primary_objects: ['person', 'bear']
      secondary_objects: ['dog', 'cat', 'raccoon', 'squirrel', 'car', 'bicycle', 'rabbit']
      sensor_info_text: |
        # Information from AI NVR

        # Primary Objects

        {% for obj in primary_objects %}
          {% set sensor_id = 'sensor.' ~ camera_snake_case ~ '_' ~ obj ~ '_count' %}
          {% set sensor_state = states(sensor_id) %}
          {% if sensor_state is not none and sensor_state != 'unknown' and sensor_state != 'unavailable' %}
          {% set active_sensor_id = 'sensor.' ~ camera_snake_case ~ '_' ~ obj ~ '_active_count' %}
          {% set active_state = states(active_sensor_id) | default(0) %}
          - Count of {{ obj }}s: {{ sensor_state }} ({{ active_state }} of which are active).
          {% endif %}
        {% endfor %}

        {% set last_face = states('sensor.' ~ camera_snake_case ~ '_last_recognized_face') | default('') %}
        {% if last_face and last_face != 'unknown' and last_face != 'None' and last_face != '' %}
        - Name of recognized person: {{ last_face }}.
        {% endif %}

        # Secondary Objects

        {% for obj in secondary_objects %}
          {% set sensor_id = 'sensor.' ~ camera_snake_case ~ '_' ~ obj ~ '_count' %}
          {% set sensor_state = states(sensor_id) %}
          {% if sensor_state is not none and sensor_state != 'unknown' and sensor_state != 'unavailable' %}
          {% set active_sensor_id = 'sensor.' ~ camera_snake_case ~ '_' ~ obj ~ '_active_count' %}
          {% set active_state = states(active_sensor_id) | default(0) %}
          - Count of {{ obj }}s: {{ sensor_state }} ({{ active_state }} of which are active).
          {% endif %}
        {% endfor %}
      instructions_text: >
        {{ sensor_info_text }}

        # How to provide analysis

        ## General Guidelines

        The AI NVR sensor data above is authoritative and indicates the actual
        presence of objects in the camera view. Use these sensor counts as the
        definitive source of information about what is present.

        ## What to Report

        Report ONLY object types that have an active count greater than zero. Do
        not describe object types with zero active counts, even if the total count
        is greater than zero. Focus exclusively on actively moving or present
        objects.

        ## Response Format

        For each object type with active count greater than zero, provide a concise
        summary that includes:
        - What the object(s) is/are
        - Location in the frame (e.g., foreground, background, left side, center)
        - Activity or movement being engaged in
        - Any relevant identifying details (only if significant)

        Keep each object type description to 1-3 sentences maximum. Be concise and
        factual. Do not describe stationary objects, non-active objects, or provide
        exhaustive lists of every object visible.

        ## What to Exclude

        - Do not describe object types with zero active counts
        - Do not describe stationary or parked objects that are not active
        - Do not provide detailed lists of every object visible
        - Do not describe general scene elements or environmental details
        - Do not use headers, markdown formatting, or structured lists in the response

        ## When No Active Objects

        If all object types have zero active counts, simply state that no active
        objects are present in the frame.
  - action: ai_task.generate_data
    metadata: {}
    data:
      task_name: Camera Frame Analysis
      instructions: "{{ instructions_text | trim }}"
      attachments:
        media_content_id: media-source://camera/camera.{{ camera_snake_case }}
        media_content_type: application/vnd.apple.mpegurl
        metadata:
          title: Back Deck Cam
          thumbnail: /api/camera_proxy/camera.{{ camera_snake_case }}
          media_class: video
          navigateIds:
            - {}
            - media_content_type: app
              media_content_id: media-source://camera
      entity_id: ai_task.ollama_ai_task
    response_variable: analysis
  - variables:
      response:
        instructions: >
          # Camera Analysis Response Guidelines

          You have received camera analysis data from the vision model. Provide
          a concise, natural response to the user's question about the camera
          view.

          ## Response Format

          - Summarize the analysis in a conversational, natural way suitable for
          text-to-speech
          - Focus on answering the user's specific question (e.g., "who is at the
          door", "what's in the backyard", "is anyone outside")
          - Keep responses brief and to the point - typically 1-3 sentences
          - Only mention active objects and their relevant details
          - If no active objects are present, state that clearly
          - Do not repeat technical details or sensor counts unless directly
          relevant to the user's question
          - Use natural language - avoid repeating the analysis verbatim

          ## Example Response Style

          If analysis shows: "One person is visible in the foreground, standing
          near the front door and appears to be waiting."

          Good response: "There's one person at the front door waiting."

          Bad response: "Based on the camera analysis, there is one person
          visible in the foreground, standing near the front door and appears to
          be waiting."
        output: "{{ analysis.data }}"
  - stop: Returning activity on camera
    response_variable: response
fields:
  camera:
    selector:
      select:
        options:
          - Back Deck Cam
          - Back Gate Cam
          - Corner Cam
          - Front Cam
          - Front Door Cam
          - Side Cam
    required: true
alias: Camera Analysis
description: >-
  Analyzes camera feeds to identify active objects, people, and activity. Use
  this tool when users ask about what is happening outside, who is at the door,
  what is in the backyard, or any questions about activity visible on security
  cameras. Provides information about people, animals, vehicles, and other
  objects detected in the camera view.
icon: mdi:camera-metering-matrix

1 Like

That's what I was looking for! Many thanks for setting this up; it saves me the time of doing it on my own. Great!

1 Like

What weather.forecast provider / integration are you using?
I have weather.home. Using that, the automation breaks on my side with
"UndefinedError: 'dict object' has no attribute 'precipitation_probability'"

I use PirateWeather. It is interesting, though, that the format differs between weather providers.

1 Like

Different providers provide different things. Most have temperature and rainfall, but things like wind speed, UV, etc. will vary provider by provider.

Your tool should account for missing data and tell the LLM what to do when data isn't available.
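For instance, the forecast-formatting template in the weather automation earlier in the thread could guard the provider-specific field instead of assuming it exists. A sketch, using the same `item` loop variable (forecast items are dicts, so `.get` works in Jinja):

```jinja
{# Guard against providers that omit precipitation_probability #}
{% set pprob = item.get('precipitation_probability') %}
Precipitation: {% if pprob is none %}no data{% elif pprob < 20 %}unlikely{% elif pprob < 50 %}possible{% else %}likely{% endif %}
```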

Yeah, I believe this is probably an issue within GitHub - skye-harris/llm_intents: Exposes internet search tools for use by LLM-backed Assist in Home Assistant

I will make an issue for it