Have you ever wanted to ask your smart home whether your package has been delivered or the garbage has been collected yet? LLM Vision can now do just that!
v1.3 can remember events, so you can ask about them later. You need to set up ‘Event Calendar’ in the integration settings.
After you’ve completed the setup, you will see a calendar entity. This not only gives you a visual overview in the calendar view of your dashboard, but it also allows for integration with other services.
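For example, you could pull the remembered events back out with the calendar.get_events action and hand them to another action. A rough sketch (the calendar entity ID and notify target below are placeholders, adjust them to whatever your setup actually creates):

# In a script or automation's action section:
- action: calendar.get_events
  target:
    entity_id: calendar.llm_vision_events   # placeholder, use your actual calendar entity
  data:
    duration:
      hours: 24
  response_variable: events
- action: notify.mobile_app_phone            # placeholder notify target
  data:
    message: "{{ events['calendar.llm_vision_events'].events | map(attribute='summary') | join(', ') }}"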
So basically it works, but first I get an event with a picture when motion is detected, and after a while I get an updated notification with an updated AI message about the scene. The picture also gets updated, though, which means the person could already be out of the camera’s view, so I basically end up with an empty picture without a person in it.
Is there a way to keep the original picture from the moment it detects motion?
Glad you find this useful!
v1.3 should solve that (it just got released; you should see an update in HACS soon). In the blueprint you can now choose between a live preview (what you have right now) and ‘snapshot’.
You will have to re-import the blueprint to update it. (This will keep your automations.)
Thanks for sharing this!
I am working on a data_analyzer that will take a graph or other ‘visual data’ as input along with a sensor and update its value.
Would you mind sharing your workflow? That would be helpful to improve the action.
Thanks!
Sure! My setup is most probably more complicated than it needs to be. I just got all of it working and have not optimized the flow. Suggestions are welcome :). This is how it works:
I run a separate Docker container with a Python script on my Home Assistant machine. This container provides an HTTPS endpoint (using Flask) that generates a PNG for a specific Grafana dashboard/panel; the image is returned in the response. There is also another endpoint that returns the last generated image (read from disk). Generating the PNG using the Grafana API can take quite some time depending on the requested size; in my case, more than 10 seconds. The Grafana API uses an Authorization: Bearer header for the API key.
In Home Assistant, when I want to analyze a graph, I first use a RESTful Command (with a custom timeout of 30 s) to trigger the PNG graph generation. This is because the Downloader integration times out after 10 s and does not support setting a custom timeout.
Once the REST command returns, I use the Downloader to download the cached image provided by the python/docker container.
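Roughly, those two Home Assistant pieces look like this (a simplified sketch, not my exact config; the URLs and endpoint names are placeholders for my container's endpoints):

rest_command:
  grafana_generate_png:
    # Placeholder URL: the Flask endpoint that renders the panel via the Grafana API
    url: "https://<container-host>:<port>/generate"
    method: get
    timeout: 30   # the slow rendering happens here, longer than the Downloader's fixed 10 s

# ...and later, in the automation/script:
- action: rest_command.grafana_generate_png
- action: downloader.download_file
  data:
    # Placeholder URL: the endpoint that returns the last generated (cached) PNG
    url: "https://<container-host>:<port>/latest"
    subdir: grafana-fetch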
Now, with an image on disk accessible by Home Assistant, I use llmvision.image_analyzer with image_file: /config/downloader/grafana-fetch/{{ trigger.event.data.filename }} to generate a description of the graph.
Hi (while I work through my file access issues!) I just had a thought … is there a reason video clips can’t be submitted to the AI engines rather than images? At the moment you extract multiple frames, but engines like Gemini accept video files directly.
I ask because, while you can submit video directly to Gemini, neither this integration nor the native Gemini integration in HA supports videos natively.
I don’t know if it would change the results/quality; I am merely curious. (It would definitely slow the response, as the upload and processing would obviously take longer.)
That’s a good question. There are two main reasons for this:
1. LLM Vision supports multiple providers, and not all of them support video files directly (some, like Groq, don’t even support multiple images per request). Adding this feature would complicate the code for possibly very little benefit.
2. It would drive up the cost. I don’t know exactly how Gemini analyzes videos, but I guess it looks at one frame every second. LLM Vision limits the number of frames through max_frames and only considers ‘important’ frames based on how much movement there is in a frame. This optimization means the analysis of one event costs roughly $0.00007 (3 × image input + output). If we assume an average event detected by your cameras/Frigate is about 15 seconds, sending it at one frame per second would cost about $0.0003, so it would be roughly four times as expensive.
Thanks for sharing your workflow!
I assume you have an automation to update a sensor with the latest value of the chart. What sensor type do you use? Did you create a helper in Home Assistant?
The data_analyzer I’m working on will only help with the last step as it unifies reading and updating the sensor value directly. It might still come in handy for your use case.
Very good point, especially if it’s per frame. Gemini is free for 15 requests a second (obviously with a top end limit). That would probably exceed my quota and/or delay processing a lot.
The actual (bed weight) sensor posts its value to MQTT (at 10 Hz), and I have a sensor in Home Assistant that shows the current weight based on it. The MQTT data is also fed into Grafana via InfluxDB. I only use the sensor value in HA to know when I’m in bed or no longer in bed, so I know when to trigger the analysis and what time range to fetch the Grafana graph for.
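For context, the current-weight sensor on the Home Assistant side is just a plain MQTT sensor along these lines (the name and topic below are placeholders, not my exact config):

mqtt:
  sensor:
    - name: "Bed Weight"                # placeholder name
      state_topic: "bed/scale/weight"   # placeholder topic, published at 10 Hz
      unit_of_measurement: "kg"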
“update a sensor with the latest value of the chart”
Do you mean what I do with the generated text produced for the graph I send for analysis? If so, I send the text (and the graph) via Telegram.
This is the automation that runs on download completion (yes, I should add a condition to verify that the download event is the one I expect, by checking the URI):
- id: 'grafana_fetch_0x01'
  alias: 'On Grafana Fetched: Notify on Telegram'
  trigger:
    - trigger: event
      event_type: downloader_download_completed
  action:
    - action: telegram_bot.send_photo
      data:
        url: <URL to latest graph>
    - action: llmvision.image_analyzer
      metadata: {}
      data:
        max_tokens: 8192
        model: gpt-4o-mini
        include_filename: false
        temperature: 1.0
        provider: <provider_id>
        message: You are a friendly companion bot to the owner of an apartment. The owner is named Johan. You are happy, witty and funny and you admire your owner. This graph represents the weight observed in a bed. The graph is exact and precise. Analyze the graph with respect to sleep quality, duration, movements and other interesting conclusions. Answer as if you told the person who just woke up about their sleep, and don't mention the graph or anything about weight. Assume a big sharp increase in weight, more than 50 kg, is when the person is going to bed, and an equally sharp drop is when they exit the bed. Any change to the weight can be seen as the user moving position, similar to how the weight changes if one moves while standing on a bathroom scale. End with a summary (with durations in hours and minutes) of time spent in bed, time spent lying still (summing all periods of lying still), longest time lying still and a sleep score from 1 to 10.
        image_file: /config/downloader/grafana-fetch/{{ trigger.event.data.filename }}
      response_variable: response
    - action: telegram_bot.send_message
      data_template:
        parse_mode: false
        message: "{{ response.response_text|string }}"
Ah sorry I misunderstood your automation then!
I thought you’d update a sensor based on the analysis of a graph. Your automation still looks really cool though!
As long as you don’t mind your data being used for their improvement (yay, privacy!), it’s free, for the UK at least!
Works quite well for me - Halloween was amusing: “A woman dressed as a zombie nurse is walking towards the camera. She is wearing a white nurse’s uniform covered in fake blood. She is also wearing a red cross hat.”
Much simpler would be a data analyzer that you just give a Home Assistant sensor entity, a start and end date for the range, potentially also a sample size or a way to downsample the data, and a prompt. A graph is not necessarily needed; raw values would be more precise, but that might also be too much data. For model tuning there is this JSONL format; maybe it can be used for prompting with data as well (I have not looked into it).
Yeah, ideally I’d want to use Ollama. The main problem isn’t even processing power; the open-source vision models that I’ve tried so far are just very, very far behind OpenAI, Gemini, etc.
If you have raw data, that is always more precise. You could use Extended OpenAI Conversation for that: it can read sensors and also interact with entities. LLM Vision is not really needed in that case, though, because no ‘vision’ is involved.
True, I do think sending all the data for a night (to get the resolution I need to detect movements: 10 Hz × 8 h = 288k samples, plus a timestamp per sample) would be too many tokens / too expensive. But it might work well for other sensor types with less data.