LLM Vision: Let Home Assistant see!

I’m trying to configure the LLM Vision integration with Anthropic, and I’m getting an invalid key error during setup. The key starts with “sk-ant-api03…”. I’m on the evaluation plan. Do I have to purchase a plan to use this?

I just realized that my remaining balance is 0.00. But the key should still work anyway, shouldn’t it?

If I remember correctly you need to add some funds. I think you can get $5 for free though if you confirm your phone number.

I added some funds and the key was accepted.
Thanks


Thanks for this project. I’ve been having some fun with it.

I have an automation that, at night when the alarm is armed, wakes me up if a person is detected on the driveway. However, this was always a pain, as rain or even reflections would trigger it even though I have BlueIris set up nicely. I could never get it to work properly.

Now I have it set up so that if BlueIris thinks there is a person, it pings the LLM to double-check before waking me up. Works 100% more reliably.
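Roughly, such a double-check automation can look like the sketch below. The entity IDs, provider ID and notify target are hypothetical placeholders, and the image_analyzer parameters follow the LLM Vision docs (they may differ slightly between versions):

- alias: "Driveway person: verify with LLM before waking me"
  trigger:
    - trigger: state
      entity_id: binary_sensor.driveway_person    # BlueIris person detection (hypothetical entity)
      to: "on"
  condition:
    - condition: state
      entity_id: alarm_control_panel.home         # only while the alarm is armed (hypothetical entity)
      state: armed_night
  action:
    - action: llmvision.image_analyzer
      data:
        provider: <provider_id>
        message: "Is there a person on the driveway? Answer only yes or no."
        image_entity:
          - camera.driveway                       # hypothetical camera entity
        max_tokens: 5
      response_variable: verdict
    - condition: template
      value_template: "{{ 'yes' in verdict.response_text | lower }}"
    - action: notify.mobile_app_my_phone          # hypothetical notify target
      data:
        message: "Person confirmed on the driveway"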


I had the same idea. I just implemented this using Frigate and LLM Vision.

In case others have Ring cameras and use this, here are some notes on its use (having just worked through it all with @valentinfrlch).

TL;DR: to get Ring + LLM Vision working together, use an action in an automation to save a short video clip to a local folder, then point LLM Vision at it. I always use the same file name, so it’s then easy to point the LLM Vision action at the file.
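A minimal sketch of that flow, assuming hypothetical entity names and a writable /media/llmvision folder; camera.record is a standard Home Assistant action, and the video_analyzer parameters follow the LLM Vision docs and may vary by version:

- alias: "Ring motion: record a clip and analyze it"
  trigger:
    - trigger: state
      entity_id: binary_sensor.front_door_motion     # ring-mqtt motion sensor (hypothetical entity)
      to: "on"
  action:
    - action: camera.record                          # save a short clip locally
      target:
        entity_id: camera.front_door_live            # Generic Camera pointing at the ring-mqtt RTSP stream (hypothetical)
      data:
        filename: /media/llmvision/front_door.mp4    # always the same file name
        duration: 10
    - delay: "00:00:12"                              # give the recording time to finish
    - action: llmvision.video_analyzer
      data:
        provider: <provider_id>
        message: "Describe what is happening at the front door."
        video_file: /media/llmvision/front_door.mp4
        max_frames: 3
      response_variable: response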

First of all, you need Ring cameras plus both the native Ring HA integration and the Ring-MQTT HACS add-on (here) installed. You also need streaming video configured using Generic Camera(s) as per the instructions here, and all of that needs to be working. If it’s not, I’m sure @valentinfrlch will agree that this thread isn’t the right place; ask in the dedicated thread for it here.

You also need rights to save files somewhere in your Home Assistant file store. If you choose anything but the config folder, you may need to explicitly permit writing to those folders. How this is done varies depending on how you run Home Assistant, but start here.
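For example, if you write outside the config folder you may need something like this in configuration.yaml (the folder name is just an illustration; whether this is required depends on how you run Home Assistant):

homeassistant:
  allowlist_external_dirs:
    - /media/llmvision    # hypothetical folder where clips and snapshots are written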

Now to the point of the post.

I have got LLM Vision working with Ring cameras, but because they are not typical CCTV cameras and thus don’t go through a system such as Frigate, some parts of LLM Vision are more compatible with them than others. In short:

:x: Images
:white_check_mark: Video
:x: Streaming

The problems come from Ring, not LLM Vision, because of how Ring itself works. Details below, but in summary (so far as I can tell) there is no way to get Ring to take a snapshot image on demand, which means the snapshot image is useless for real-time analysis like this.

Streaming can’t ever work, because LLM Vision relies on taking several frames from the video via the camera’s entity_image attribute rather than processing a video stream. But the Generic Camera entity doesn’t quite work like that when created to host the ring-mqtt RTSP stream: as per the instructions, you point the live stream at the RTSP stream exposed by the ring-mqtt add-on, but the snapshot attribute is pointed at the official Ring add-on’s snapshot attribute.

And hence images don’t work either, because the snapshot on Ring is not related to motion; it’s configured in the Ring app as an image taken every x minutes regardless of activity. The camera.snapshot action has no effect, as Ring has no “take snapshot” function (and therefore the entity_image that the LLM Vision streaming function requires has the same limitation).

Some other pointers:

  • You could query past recorded events instead of the live stream if you want, by changing the path you request from Ring - see the ring-mqtt docs
  • Be aware that it can take a few seconds for Ring video to start live streaming from the point the stream is ‘viewed’ (including via actions like camera.record). Part of this is the delay in connecting to Ring’s live streaming, some is buffering within HA itself. You can tell streaming to start using an action, which might shave a bit off that time (see the sketch after this list)
  • Within Ring, any HA live camera activity is shown as live viewing and is recorded like any other Ring activity. This means it is subject to the same duration limitations as the Ring app.
  • Beware if you have a battery-powered Ring camera (or you are precious about your bandwidth): once you start a live stream, it may not always stop reliably (the forums are full of examples, but it’s unclear why it happens). You can issue a stop-live-stream action in HA, which helps prevent this from happening.
  • If you want to include an image in a notification after LLM Vision runs, note that no image is included in the response variable LLM Vision returns. For anything other than Ring you would normally include a snapshot; however, since Ring snapshots are periodic, the snapshot won’t have any relevance to what was actually submitted for analysis. As of writing I don’t know of a workaround, but I believe @valentinfrlch may end up helping Ring users as a by-product of adding a debug function that can write the first submitted frame to a folder - Ring users could use that frame as the notification image we can’t otherwise get.
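As a rough sketch of the stream start/stop idea mentioned in the bullets above - the entity name is hypothetical, and whether camera.turn_on / camera.turn_off actually control the stream depends on your camera integration:

# somewhere in an automation's action list (hypothetical entity; support for these
# actions varies by camera integration)
- action: camera.turn_on          # nudge the live stream to start ahead of recording/analysis
  target:
    entity_id: camera.front_door_live
# ... recording / analysis steps go here ...
- action: camera.turn_off         # explicitly stop the live stream so it doesn't keep running
  target:
    entity_id: camera.front_door_live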

I’m using LLM Vision + gpt-4o-mini to check how my 3D prints are going, and it works great!

For another project, I want to send a history graph for a sensor value (the weight of a person in bed during sleep, indicating sleep “quality” and movements). Does anyone know if it is possible to grab a graph/history graph for a sensor and feed it to llmvision.image_analyzer?
edit: I solved this using my existing Grafana setup, where I can easily download a graph as a PNG: Generating PNG images from a Grafana chart – Correct URL, Settings and Authentication | j3t.ch | Julien Perrochet's Blog

//Johan


v1.3 Event Memory

Have you ever wanted to ask your smart home whether your package has been delivered or the garbage has been collected yet? LLM Vision can now do just that!

v1.3 can remember events, so you can ask about them later. To use it, you need to set up ‘Event Calendar’ in the integration settings.
After you’ve completed the setup, you will see a calendar entity. This not only gives you a visual overview in the calendar view of your dashboard, but also allows for integration with other services.

To learn how to set this up, see this page in the docs:
https://llm-vision.gitbook.io/getting-started/asking-about-events
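As a sketch of how an event ends up in the Event Calendar, assuming the analyzer actions expose a remember toggle as described in the docs above (parameter names may differ in your version):

- action: llmvision.image_analyzer
  data:
    provider: <provider_id>
    message: "What is happening at the front door?"
    image_entity:
      - camera.front_door           # hypothetical camera entity
    remember: true                  # store the result so you can ask about the event later
  response_variable: response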

Awesome add-on, I must say. I’m using this blueprint for now to get going: AI Event Notifications (https://llm-vision.gitbook.io/examples/examples/automations).

So basically it works, but first I get a notification with a picture when motion is detected. After a while I get an updated notification with an updated AI message about the scene, but the picture also gets updated. That means the person could already be out of the camera’s view, so I basically get an empty picture without a person in it.

Is there a way to keep the original picture from the moment it detects motion?

Glad you find this useful!
v1.3 should solve that (it just got released; you should see an update in HACS soon). In the blueprint you can now choose between a live preview (what you have right now) and ‘snapshot’.
You will have to re-import the blueprint to update it. (This will keep your automations.)

Check the discussion of the blueprint here: Camera, Frigate: Intelligent AI-powered notifications - #53 by valentinfrlch

Thanks for sharing this!
I am working on a data_analyzer that will take a graph or other ‘visual data’ as input along with a sensor and update its value.
Would you mind sharing your workflow? That would be helpful to improve the action.
Thanks!


I have the update. Great work!

One final issue: is it possible to send the notification to multiple recipients? Right now I can only choose my phone.

And another issue with the new blueprint: when I select snapshot, I get no picture. But the image is processed and I do get an AI response.

Sure! My setup is most probably more complicated than it needs to be. I just got all of it working and have not optimized the flow. Suggestions are welcome :). This is how it works:

  1. I run a separate Docker container with a Python script on my Home Assistant machine. This container provides an HTTPS endpoint (using Flask) that generates a PNG for a specific Grafana dashboard/panel; the image is returned in the response. There is also another endpoint that returns the last generated image (read from disk). Generating the PNG via the Grafana API can take quite some time depending on the requested size - in my case, more than 10 seconds. The Grafana API uses an Authorization: Bearer header for the API key.
  2. In Home Assistant, when I want to analyze a graph, I first use a RESTful Command (with a custom timeout of 30 s) to trigger the PNG graph generation. I do this because the Downloader integration times out after 10 s and does not support setting a custom timeout.
  3. Once the REST command returns, I use the Downloader to download the cached image served by the Python/Docker container (see the sketch after this list).
  4. Now, with an image on disk accessible by Home Assistant, I use llmvision.image_analyzer with image_file: /config/downloader/grafana-fetch/{{ trigger.event.data.filename }} to generate a description of the graph.
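Here is a sketch of steps 2 and 3; the endpoint URLs, names and folders are illustrative, while the rest_command timeout and downloader.download_file are standard Home Assistant features:

# configuration.yaml
rest_command:
  grafana_render_graph:
    url: "http://<grafana-proxy-host>:5000/render"      # hypothetical Flask endpoint from step 1
    method: get
    timeout: 30                                          # longer than the Downloader's 10 s limit

downloader:
  download_dir: downloader

# automation actions
- action: rest_command.grafana_render_graph              # trigger PNG generation (step 2)
- action: downloader.download_file                       # fetch the cached image (step 3)
  data:
    url: "http://<grafana-proxy-host>:5000/latest.png"   # hypothetical 'last generated image' endpoint
    subdir: grafana-fetch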

//Johan

Hi (while I work through my file access issues!) I just had a thought… is there a reason video clips can’t be submitted to the AI engines rather than images? At the moment you extract multiple frames, but engines like Gemini accept video files directly.

I ask because, while you can submit video directly to Gemini, neither this integration nor the native Gemini integration in HA supports videos natively.

I don’t know if it would change the results/quality; I am merely curious (it would definitely slow the response, as the upload and processing would obviously take longer).

Chris

That’s a good question. There are two main reasons for this:

  1. LLM Vision supports multiple providers, and not all of them support video files directly (some, like Groq, don’t even support multiple images per request). Adding this feature would complicate the code for possibly very little benefit.

  2. This would drive up the cost. I don’t know exactly how Gemini analyzes videos, but I guess it looks at 1 frame every second. LLM Vision limits the number of frames through max_frames and only considers ‘important’ frames based on how much movement is in each frame (see the sketch further below). This optimization means analyzing one event costs roughly $0.00007 (3 × image input + output). If we assume an average event detected by your cameras/Frigate is about 15 seconds, analyzing 1 frame per second would cost around $0.0003, which is about four times as expensive.

When I tested 1 frame/second vs. only 3 ‘important’ frames per event I found no meaningful difference in the output.
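For reference, this is the kind of knob meant above - a minimal sketch with hypothetical file and provider values:

- action: llmvision.video_analyzer
  data:
    provider: <provider_id>
    message: "Summarize this event."
    video_file: /media/llmvision/event.mp4   # hypothetical clip
    max_frames: 3                            # only the 3 most 'important' frames are sent to the model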

Thanks for sharing your workflow!
I assume you have an automation to update a sensor with the latest value of the chart. What sensor type do you use? Did you create a helper in Home Assistant?

The data_analyzer I’m working on will only help with the last step as it unifies reading and updating the sensor value directly. It might still come in handy for your use case.

Very good point, especially if it’s per frame. Gemini’s free tier allows 15 requests per minute (obviously with a top-end limit). That would probably exceed my quota and/or delay processing a lot.

The actual (bed weight) sensor posts its value to MQTT (at 10 Hz), and I have a sensor in Home Assistant that shows the current weight based on it. The MQTT data is also fed into Grafana via InfluxDB. I only use the sensor value in HA to know when I’m in bed or no longer in bed, so I know when to trigger the analysis and what time range to fetch the Grafana graph for.
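For context, the Home Assistant side of that sensor is just an MQTT sensor along these lines (the topic and name are illustrative):

mqtt:
  sensor:
    - name: "Bed weight"
      state_topic: "home/bedroom/bed/weight"   # hypothetical topic, published at 10 Hz
      unit_of_measurement: "kg"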

update a sensor with the latest value of the chart

Do you mean what I use the generated text (produced for the graph I send for analysis) for? If so, I send the text (and the graph) via Telegram.

This is the automation that runs when the download completes (yes, I should add a condition to verify the download event is the one I expect, e.g. by checking the URI):

- id: 'grafana_fetch_0x01'
  alias: 'On Grafana Fetched: Notify on Telegram'
  trigger:
  - trigger: event
    event_type: downloader_download_completed
  action:
  # Send the downloaded graph image itself
  - action: telegram_bot.send_photo
    data:
      url: <URL to latest graph>
  # Ask LLM Vision to describe the graph
  - action: llmvision.image_analyzer
    metadata: {}
    data:
      max_tokens: 8192
      model: gpt-4o-mini
      include_filename: false
      temperature: 1.0
      provider: <provider_id>
      message: You are a friendly companion bot to the owner of an apartment. The owner is named Johan. You are happy, witty and funny and you admire your owner. This graph represents the weight observed in a bed. The graph is exact and precise. Analyze the graph with respect to sleep quality, duration, movements and other interesting conclusions. Answer as if you told the person who just woke up about their sleep, and don't mention the graph or anything about weight. Assume a big sharp increase in weight, more than 50 kg, is when the person is going to bed, and an equal sharp drop is when they exit the bed. Any change to the weight can be seen as the user moving position, similar to how the weight changes if one moves while standing on a bathroom scale. End with a summary (with durations in hours and minutes) of time spent in bed, time spent lying still (summarizing the time spent lying still), longest time lying still and a sleep score from 1 to 10.
      image_file: /config/downloader/grafana-fetch/{{ trigger.event.data.filename }}
    response_variable: response
  # Send the AI-generated description as a Telegram message
  - action: telegram_bot.send_message
    data:
      parse_mode: false
      message: "{{ response.response_text|string }}"

Last time I checked Gemini’s Free Tier wasn’t available in Europe, though it seems to me that has changed. I guess I could switch to Gemini then…