LLM Vision: Let Home Assistant see!

Question: this is my first time writing a script for it. I wrote this script and got an error. I want it to describe who is at my front door using an image.

alias: Doorbell Image With Echo
sequence:
  - action: llmvision.image_analyzer
    metadata: {}
    data:
      provider: OpenAI
      model: gpt-4o-mini
      target_width: 1280
      detail: high
      max_tokens: 100
      temperature: 0.5
      image_entity:
        - camera.doorbell
        - camera.garage
  - action: notify.alexa_media_animeking_echo_dot
    metadata: {}
    data:
      cache: true
      target: tts.piper
      message: "{{response.response_text}}"
  - action: notify.mobile_app_luffy1987
    metadata: {}
    data:
      message: "{{response.response_text}}"
description: ""
icon: phu:eufy-doorbell
fields:
  prompt:
    selector:
      text: null
    name: prompt
    required: true


For scripts you need to provide a response variable so that Home Assistant knows where to store the response and so you can use it in subsequent actions.

You need to add this line to your service call:

response_variable: response

The response is then stored in response.response_text
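
For example, a stripped-down version of your script with the response wired into a notification could look like this (untested; the notify target is a placeholder):

alias: Doorbell Image Minimal
sequence:
  - action: llmvision.image_analyzer
    data:
      provider: OpenAI
      model: gpt-4o-mini
      max_tokens: 100
      message: Describe who is at the front door
      image_entity:
        - camera.doorbell
    response_variable: response
  - action: notify.mobile_app_your_phone
    data:
      message: "{{ response.response_text }}"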

Check out the examples page, which already has scripts you can use as inspiration.

I have updated the script:

alias: Doorbell Image With Echo
sequence:
  - action: llmvision.image_analyzer
    metadata: {}
    data:
      provider: OpenAI
      model: gpt-4o-mini
      target_width: 1280
      detail: high
      max_tokens: 100
      temperature: 0.5
      image_entity:
        - camera.doorbell
        - camera.garage
    response_variable: response
  - if:
      - condition: state
        entity_id: input_boolean.gaming_pc
        state: "on"
    then:
      - action: notify.gamingking
        metadata: {}
        data:
          message: "{{response.response_text}}"
  - action: notify.alexa_media_animeking_echo_dot
    metadata: {}
    data:
      cache: true
      target: tts.piper
      message: "{{response.response_text}}"
  - action: notify.mobile_app_luffy1987
    metadata: {}
    data:
      message: "{{response.response_text}}"
description: ""
icon: phu:eufy-doorbell
fields:
  prompt:
    selector:
      text: null
    name: prompt
    required: true
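
One thing to note: the prompt field defined above isn't actually passed to the analyzer yet. A sketch of how it could be wired in through the message parameter (untested):

  - action: llmvision.image_analyzer
    data:
      # ...existing parameters...
      message: "{{ prompt }}"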

Is there a way to analyze a PDF file like the ChatGPT frontend does? If I convert my PDF to JPG, the LLM can't read it properly.

LLM Vision was created to analyze images, mostly in the context of smart homes. Even if it could read PDFs, there are already tools out there that do a far better job at this and have crucial features (like citations). I recommend taking a look at tools like private-gpt.

Thank you, but I think I'm not skilled enough, lol.
As far as you know, is there an HA integration that does this? I have only found some PDF parsers.

Well, Home Assistant is a smart home platform and has very little to do with PDFs. I would really suggest you try out some other tools. Most run as Docker containers, which are quite simple to set up if you follow the instructions. Docker also has a GUI (Docker Desktop), which might be less intimidating than using the terminal directly.

This looks awesome. I can see how I can take one image from my Ring via the camera entity, but I was wondering if there is a way to extract a few images a second apart from that sensor, or even better a brief video segment from the RTSP stream provided through ring-mqtt.

In my head there are two ways, assuming it isn't already a feature: either create a short segment using a script (no idea how!) that writes to a temp file which I then reference in an LLM Vision call, or pipe the RTSP stream into LLM Vision somehow.

The challenge I have is that the longer after the event I make the call, the easier it gets, since I can probably use recorded events, or mimic the Frigate integration somehow to grab a few static images over a period of time, stage them, and make the call. However, what I want is to take 1-2 s of video and send that off, so that when I get a response the person who walked up to my door is actually still there and hasn't left several minutes earlier, which would make the exercise pointless (being told a bored-looking DHL courier is waiting at the door for you is very unhelpful five minutes later).

Any ideas? I can't see any examples that match my use case…

Thanks

Chris

What you're looking for is the video_analyzer action.
It works much the same as image_analyzer, except that it takes one or multiple videos. In addition to video files, Frigate is also supported! All you need is the event_id, which you can find in Frigate's MQTT events.

See this page for documentation: Usage | LLM Vision | Getting Started
To get Frigate's event_id, take a look at their Home Assistant notification docs: Home Assistant notifications | Frigate
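
An untested sketch of an automation that feeds a Frigate event into video_analyzer - the MQTT payload path follows Frigate's event format, so adjust it to your setup:

alias: Analyze Frigate Event
trigger:
  - platform: mqtt
    topic: frigate/events
condition:
  - condition: template
    value_template: "{{ trigger.payload_json['type'] == 'end' }}"
action:
  - action: llmvision.video_analyzer
    data:
      provider: OpenAI
      model: gpt-4o-mini
      event_id: "{{ trigger.payload_json['after']['id'] }}"
      max_tokens: 100
      message: Describe what happens in the video
    response_variable: response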

Thanks, but as per my post I don't have the video files. I want to point the video analyzer at a streaming camera entity, not at a file. Or… I don't have Frigate, I have Ring, but I want to replicate the function to send a collection of images taken from a video stream.

I have now found that camera entities have a record action, which I didn't know. My plan is to always record a short video to the same file name, to avoid filling my Pi up with thousands of videos, and then point the video analyzer at that file.

I think that will work…

I see. Yes, what you describe should work. Here is a script that implements your idea (I didn't test it):

alias: Analyze Front Door Video
sequence:
  - action: camera.record
    metadata: {}
    data:
      duration: 30
      lookback: 10
      filename: /config/www/tmp/front_door.mp4
    target:
      entity_id: camera.front_door
  - action: llmvision.video_analyzer
    data:
      provider: OpenAI
      model: gpt-4o-mini
      interval: 3
      max_tokens: 100
      temperature: 0.5
      video_file: /config/www/tmp/front_door.mp4
      message: Describe what happens in the video
    response_variable: response
description: ""
icon: mdi:cube-scan

The response is stored in response.response_text
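
From there you can append whatever actions you need to the sequence, for example a notification (the notify target is a placeholder):

  - action: notify.mobile_app_your_phone
    data:
      message: "{{ response.response_text }}"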


Thanks for this. I'm still using the old version (gpt4vision), but it imports my meter readings into HA daily. I created a separate sensor which sets itself to the value of the helper (input_number.gas_meter_reading). I had a spare camera lying around that was dirt cheap from AliExpress, hence doing this rather than AI-on-the-edge via an ESP32.


alias: Gas Meter GPT Reading
description: ""
trigger:
  - platform: time
    at: "14:00:00"
action:
  - service: gpt4vision.image_analyzer
    data:
      provider: Anthropic
      model: claude-3-5-sonnet-20240620
      include_filename: false
      target_width: 512
      detail: high
      max_tokens: 500
      temperature: 0.5
      image_entity:
        - camera.alicam1proxy
      message: >-
        what is the 5 digit number shown in this image? don't reply with any
        other words, only the number. 
    response_variable: gpt_response
  - service: input_number.set_value
    target:
      entity_id: input_number.gas_meter_reading
    data:
      value: |
        {{ gpt_response.response_text | int }}
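
The separate sensor is just a template sensor that mirrors the helper. A sketch of it (unit and device class depend on your meter):

template:
  - sensor:
      - name: Gas Meter Reading
        unit_of_measurement: "m³"
        device_class: gas
        state_class: total_increasing
        state: "{{ states('input_number.gas_meter_reading') | float(0) }}"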

I've got a case where I'd like to be able to use this integration, but it seems that my images are in PNG format rather than JPG, so the integration throws an error:

Logger: homeassistant.helpers.script.websocket_api_script
Source: helpers/script.py:525
First occurred: September 29, 2024 at 6:17:45 PM (4 occurrences)
Last logged: 7:44:41 PM
websocket_api script: Error executing script. Unexpected error for call_service at pos 1: cannot write mode RGBA as JPEG

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/PIL/JpegImagePlugin.py", line 639, in _save
    rawmode = RAWMODE[im.mode]
              ~~~~~~~^^^^^^^^^
KeyError: 'RGBA'

The image is coming from HASS.Agent, where it's using the screenshot sensor to bring a screenshot image into HA as a camera entity.

Would it be best for this integration to be able to handle PNG format images? Or should that agent integration change to use JPG format for the screenshots? I'm not sure if there's a standard/expectation of JPG for camera entities in HA or not.

It looks like OpenAI at least supports PNG upload as well as JPEG.

Thanks for the feedback.
PNGs (or any images with transparency) will be converted in the next version.


Awesome, great to hear, will keep an eye out for it and test it when I can.

You will probably have to download it manually in HACS since it is still a beta.
You can do so in HACS > … > redownload, then select the latest version (v1.2.0-beta.5).

You can find the full changelog here: Release v1.2 Stream Analyzer, New Provider configurations · valentinfrlch/ha-llmvision · GitHub

Any feedback is welcome!
Thanks!

Or indeed the answer to my original ask is in 1.2 - the new stream analyzer - which avoids the intermediate file. I'll test that when I get a chance - brilliant, thank you!! Much easier than having fragments of files all over the place, and it hooks into ring_mqtt nicely.
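
A rough, untested sketch of what I expect a stream analyzer call to look like - the parameter names here are guesses based on image_analyzer, so the v1.2 docs are the reference:

alias: Doorbell Stream Analyzer
sequence:
  - action: llmvision.stream_analyzer
    data:
      provider: OpenAI
      model: gpt-4o-mini
      # duration / max_frames are guesses at the v1.2 schema
      duration: 5
      max_frames: 3
      max_tokens: 100
      message: Describe who is at the front door
      image_entity:
        - camera.doorbell
    response_variable: response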

One thing I would like to do is submit images as well as video. Can I include image files in video analyzer requests? (I found a great prompt someone used where they include photos of key people alongside the video and ask Gemini to call people by their names if it recognises them.)

And if I can (and I appreciate I could just test it to find out), more importantly: can I do so with stream analyzer? (Which I can't test yet, as I haven't got the prerelease installed.)

Thanks!

This is not yet possible with stream analyzer. It is also not possible with video analyzer. I will add that later (to both stream and video analyzer) as it sounds like a good idea!

If you have any feedback for v1.2 please let me know.
Thanks!

Hi,

I wonder if there's an option to use a text input to change the message sent to the AI. For example: I want to receive a notification only if someone with a white shirt has been detected in the picture, and later change the input text to look for someone with a blue shirt instead. For the trigger I want to use Frigate person detection and only get the notification if the picture matches what I asked for.
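
Something like this is roughly what I imagine for the analyzer call, though it's untested and the input_text name is just a placeholder:

  - action: llmvision.image_analyzer
    data:
      # ...other parameters...
      message: "{{ states('input_text.llmvision_prompt') }}"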

Thank you

That's great, thank you! It is accepted in the config; I assume it's just not doing anything.

I've paused this anyway for now, as I need to find out why ring_mqtt is murdering my camera batteries with continuous usage. By the time I trust ring_mqtt and/or have guardrails in place in my automations to prevent heavy unintentional streaming, it ought to be about time for 1.2!
