LLM Vision: Let Home Assistant see!

For now you can take a look at these examples:

These can be triggered when Frigate detects a person, for example.
The next release will include support for image entities, which will work better with Frigate.


For all the frigate users:
v0.3.8 has just been released which now supports entities of the image domain. If you have the frigate integration installed and have object recognition enabled, frigate will provide such image entities for every camera and every object type.

You can even use this new image_entity input together with the existing image_file to submit multiple images from both frigate and the file system in one call.

This could be useful if you want to add a reference photo vs how it looks right now.
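
For example, a call combining both inputs could look something like this (the image entity, file path, and provider are placeholders for illustration, not values from your setup):

service: gpt4vision.image_analyzer
data:
  provider: OpenAI
  max_tokens: 100
  message: Compare the reference photo with the current snapshot and describe any differences.
  image_entity:
    - image.front_door_person   # example image entity provided by Frigate
  image_file: /config/www/reference/front_door.jpg   # example reference photo on disk
response_variable: response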

Cheers!
-valentin


Awesome work with this integration! Thanks for making it! I have it responding to what is currently happening around the house via snapshots, and it gives very accurate descriptions.

Any ideas on getting this to work with Unifi cameras and Extended-OpenAI-Conversation, to look at past motion events from specific cameras, so I can ask things like “when was the last time someone was in the back yard with a blue shirt on?” or “was there a white car in the driveway, and if so, what time?”?

I’m sure it would cut down on the processing time if only car, animal, and person detections from those specific cameras were processed, rather than all detections from all cameras over x amount of time.

@WouterN Thanks very much for sharing the doorbell prompt, seems to work perfectly!

This is not currently possible as it requires sending the filename with the request. But I will add an option for that in the next release.

Some prerequisites

  • Object recognition: This is not a must, but it will work much better with it.
  • Snapshots folder: I propose a folder structure as described below:
    This is to reduce the number of images that need to be analyzed. If the LLM determines you want to know about a ‘white car’, it only has to look at images in the car folder.
- snapshots
	- person
		- 2024-06-22-07:42.jpg
		- ...
	- car
		- ...
	- package
	- bicycle
	- ...

There should be a folder inside snapshots for each object type your integration can detect, using the exact object name.

You then need two automations:

  • One that is triggered whenever an object is detected and saves the snapshot in the corresponding folder (a sketch is shown below this list).
  • One that keeps these folders from getting too large, for example by using Delete Files for Home Assistant or a similar integration to delete old files.
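
As a minimal sketch, the first automation could look something like this. The trigger and camera entities are examples (the Frigate integration typically exposes per-object occupancy binary sensors), and /config/snapshots must be listed in allowlist_external_dirs for camera.snapshot to be allowed to write there:

alias: Save person snapshot
trigger:
  - platform: state
    entity_id: binary_sensor.front_door_person_occupancy   # example Frigate occupancy sensor
    to: "on"
action:
  - service: camera.snapshot
    target:
      entity_id: camera.front_door   # example camera entity
    data:
      # filename pattern matches the folder structure above
      filename: "/config/snapshots/person/{{ now().strftime('%Y-%m-%d-%H:%M') }}.jpg"
mode: single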

Making a request

To make a request, your spec should have two inputs: object and prompt.
To get the folder contents, you can use the built-in folder sensor:

sensor:
  - platform: folder
    # creates sensor.snapshots with a file_list attribute;
    # the folder must be listed in allowlist_external_dirs
    folder: "/config/snapshots/"

With the following template you can get the contents of your snapshots folder:

{%- set folders = namespace(value=[]) %}
{%- for folder in state_attr("sensor.snapshots", "file_list") | sort %}
    {%- set folders.value = folders.value + [folder] %}
{%- endfor %}

Then you send all files in the object folder.
The include_filename option is needed because even if your snapshots have timestamps on them, LLMs are too susceptible to hallucination to read them reliably.
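
As a rough sketch, the call inside your spec could look something like this. The file paths are placeholders (in practice you would build the list from the folder sensor above), and it assumes image_file accepts multiple paths, one per line:

service: gpt4vision.image_analyzer
data:
  provider: OpenAI
  message: "{{ prompt }}"        # the prompt input from your spec
  include_filename: true         # lets the model reference each file reliably
  max_tokens: 200
  image_file: |-
    /config/snapshots/{{ object }}/2024-06-22-07:42.jpg
    /config/snapshots/{{ object }}/2024-06-22-08:15.jpg
response_variable: response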

I hope this is helpful. If you do end up writing the spec, please share it, as I think this is a very interesting use case.
And if you need help, feel free to reach out.

Thanks for the detailed reply!

I’m trying to figure out if I can get recognition images from the NVR that are already categorized. It does this on the Unifi end somehow, at least for people, cars, animals and packages (and just movement).

I had everything working great a month ago, and now I get a message saying “Error running action” - OpenAI provider is not configured.

I didn’t change a thing other than upgrade the integration. Any ideas where to change it so it works?

I uninstalled the OpenAI integration and re-added it and it worked.

This is probably due to a variable name change. A while back, gpt4vision was only compatible with OpenAI, so I renamed a variable from API_KEY to OPENAI_API_KEY for more consistency. I should have put this into the changelog as a warning, sorry! Glad it’s working again!


Thank you for creating this! I’m having trouble getting a script to work when I specify the image_entity, but it works perfectly when I specify the path to the image_file.

This works:

alias: write what you see
sequence:
  - service: gpt4vision.image_analyzer
    data:
      max_tokens: 100
      provider: OpenAI
      model: gpt-4o
      target_width: 512
      temperature: 0.5
      detail: low
      message: Please describe the scene
      image_file: /config/www/test/corridor.jpg
    response_variable: response
  - service: system_log.write
    metadata: {}
    data:
      level: error
      message: log the {{ response.response_text  }}
mode: single

but this does not (same code, except I’m specifying my camera entity instead of the image file):

alias: write what you see (streaming)
sequence:
  - service: gpt4vision.image_analyzer
    data:
      max_tokens: 100
      provider: OpenAI
      model: gpt-4o
      target_width: 512
      temperature: 0.5
      detail: low
      message: Please describe the scene
      image_entity:
        - camera.corridor
    response_variable: response
  - service: system_log.write
    metadata: {}
    data:
      level: error
      message: log the {{ response.response_text  }}
mode: single

My possibly inaccurate understanding is that I can just use the image_entity of my camera and it will grab the current frame from the stream at the time the script is called? When I try running the script I get the following error:

“Failed to call service script/write_what_you_see_streaming. cannot access local variable ‘client’ where it is not associated with a value”

Or maybe I have to use the entity_picture of my camera as the image_entity? These are the attributes from my camera: (token is cut off so I don’t think my security is compromised by pasting this)

Thank you for your help!


You are right, there is a mistake in the code: some results are not properly awaited, which is why you get this error. A fix has been pushed to the latest dev release, so you can test it right now if you want. I expect to push it as a normal public release tomorrow.

Your understanding is correct, if you submit an image or camera entity the integration will fetch the latest frame.


Same here

v0.4.7 is out now which should fix this issue.


Thanks.

If I want the response to be a notification on my mobile screen, what should I do? And another question: can I use the response on a dashboard card?

Tnx

To get the response as a notification on your phone or tablet, use the notify service in the automation or script where you call the gpt4vision service:

service: notify.mobile_app_your_phone
data:
  title: Front Door
  message: "{{response.response_text}}"

(Assuming your response_variable is response.)

To use the response on your dashboard, you could create an input_text helper to store the response in. You can set it in your script/automation with:

service: input_text.set_value
data:
  value: "{{response.response_text}}"
target:
  entity_id: input_text.gpt4vision_response
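
If you prefer YAML over creating the helper in the UI, a minimal definition could look like this (gpt4vision_response is just the example name used above):

input_text:
  gpt4vision_response:
    name: gpt4vision response
    max: 255   # allow longer responses; the default maximum is 100 characters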

I have also put together a quick vertical-stack-in-card to run your script and display the result using a markdown card:

type: custom:vertical-stack-in-card
cards:
  - type: tile
    entity: script.analyse_front_door
    icon_tap_action:
      action: toggle
    vertical: false
  - type: markdown
    content: '{{states("input_text.gpt4vision_response")}}'

Note: Requires vertical-stack-in-card


Thank you so much!

I’m getting an error: status code 500


Please open an issue here and don’t forget to attach your logs so that I can help you.

I have tried the vertical-stack-in-card above (thank you) and got this: when running the script, the response is not shown on the card.

Here is the script -
alias: Script Chiller card
sequence:
  - service: gpt4vision.image_analyzer
    metadata: {}
    data:
      provider: Anthropic
      include_filename: false
      target_width: 1280
      detail: high
      max_tokens: 100
      temperature: 0.5
      message: what red numbers do you see?
      image_entity:
        - camera.chiller
    response_variable: response
  - service: input_text.set_value
    data:
      value: "{{response.response_text}}"
    target:
      entity_id: input_text.gpt4vision_response