LLM Vision: Let Home Assistant see!

Hey everyone,

LLM Vision is a Home Assistant integration to analyze images, videos and camera feeds using the vision capabilities of multimodal LLMs.
Supported providers are OpenAI, Anthropic, Google Gemini, LocalAI, Ollama and any OpenAI compatible API.

Responses are returned as response variables for easy use with automations. The usage possibilities are limitless. You could request a car’s license plate number when one is detected, create custom delivery announcements, or set an alarm to trigger when suspicious activity is detected.
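
To give an idea, here is a minimal sketch of such an automation (untested; the trigger, camera and notify entity ids are placeholders; the gpt4vision.image_analyzer call and the response.response_text field follow the examples further down in this thread, and newer LLM Vision releases may use a different action name):

alias: Delivery announcement (sketch)
trigger:
  - platform: state
    entity_id: binary_sensor.doorbell_person  # placeholder person/motion sensor
    to: "on"
action:
  - service: camera.snapshot
    target:
      entity_id: camera.front_door  # placeholder camera
    data:
      filename: /config/www/tmp/front_door.jpg
  - service: gpt4vision.image_analyzer
    data:
      provider: OpenAI
      model: gpt-4o
      max_tokens: 100
      image_file: /config/www/tmp/front_door.jpg
      message: Describe who is at the front door in one short sentence.
    response_variable: response
  - service: notify.mobile_app_phone  # placeholder notify service
    data:
      message: "{{ response.response_text }}"
mode: single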

Features

  • Takes local images, video and camera or image entities as input
  • Compatible with OpenAI’s Models, Anthropic Claude, Google’s Gemini, LocalAI and Ollama
  • Images can be downscaled for faster processing
  • Filename or entity can be attached for more context

Installation

Install via HACS: open your Home Assistant instance and add the repository through the Home Assistant Community Store (HACS).

Resources

Check the docs for detailed instructions on how to set up LLM Vision and each of the supported providers or get inspiration from examples.


Hello. When talking with the voice assistant (Extended OpenAI Conversation), I want the following:
When I say a certain sentence, for example "What do you see in the camera?", this should happen:
1. My automation or script is executed
2. A photo is taken from the camera I specified
3. That photo is sent to ha-gpt4vision
4. ha-gpt4vision's response is converted to speech with TTS
I have given more detailed explanations here:

Since I have only just met the world of Home Assistant: is this the right approach, or did I go wrong?
This approach is very limited, because a script or automation must be defined for each request.
For example, one with this prompt: "Analyze everything you see in the picture."
Another one with this prompt: "Explain how what you see in the picture works."
Is there an easier way to handle all these requests without a separate automation or script for each one?
For example:
"How many people do you see on the camera?"
"What is the color of their clothes?"
"Do they look suspicious?"

If I understand correctly, you want to be able to ask your voice assistant what it can see in an image captured by a camera entity.
Afaik, Extended OpenAI Conversation supports function calling. This means you could write a script (with an input field for your prompt) that:

  1. Takes your prompt as input (field)
  2. Captures a snapshot from your camera
  3. Calls gpt4vision.image_analyzer with the image just captured
  4. Passes gpt4vision's response to a TTS service

This script would look roughly like this:

alias: Example
sequence:
  - service: camera.snapshot
    metadata: {}
    data:
      filename: >-
        /config/www/tmp/f51fbdc6f87267b1346efccc796dd6c450ff71b66ec5e795921d442a0305a0ac.jpg
    target:
      entity_id: camera.front_door
  - service: gpt4vision.image_analyzer
    metadata: {}
    data:
      max_tokens: 100
      model: gpt-4o
      target_width: 1280
      image_file: >-
        /config/www/tmp/f51fbdc6f87267b1346efccc796dd6c450ff71b66ec5e795921d442a0305a0ac.jpg
      message: "{{ prompt }}"
    response_variable: response
  - service: tts.speak
    metadata: {}
    data:
      cache: true
      media_player_entity_id: media_player.entity_id
      message: "{{response.response_text}}"
    target:
      entity_id: tts.piper
fields:
  prompt:
    selector:
      text: null
    name: prompt
    required: true

Note that you’ll want to modify all entity ids.

You could then make use of function calling so GPT can call this script as a function.
For function calling, see: https://platform.openai.com/docs/guides/function-calling
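
As a rough, untested sketch, a matching spec could look something like this (script.example is a hypothetical name for the script above; the spec structure follows the one shared further down in this thread):

- spec:
    name: analyze_camera_image
    description: Take a snapshot of the camera and describe what it sees
    parameters:
      type: object
      properties:
        prompt:
          type: string
          description: The question to ask about the camera image
      required:
      - prompt
  function:
    type: script
    sequence:
    - service: script.example  # hypothetical name of the script above
      data:
        prompt: "{{ prompt }}"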

You got it almost right. I tested something similar with an automation before I wrote this. I gave ha-gpt4vision this prompt: "Tell everything you see in the picture in detail." Now, when the automation is executed, a snapshot is taken from the camera and everything in the image is explained in detail… But the problem is that the ha-gpt4vision prompt is static! I want the ha-gpt4vision prompt to be variable. For example, I want to be able to ask it how many people are in the picture without writing an automation for that in advance. I don't know how to explain exactly what I mean, but right now I would have to write an automation plus a fixed prompt for each request. I want the prompt that is given to ha-gpt4vision to be variable, that is, to ask once how many people are in the picture and another time what the color of their clothes is.

Actually, I want the prompt that is sent to ha-gpt4vision to be variable; that is, when we ask the voice assistant something, it passes that question on to ha-gpt4vision instead of us writing a static prompt for it in advance.

This is what the prompt input (field) is for:

This is not an automation! It’s a script.

Thanks for this! You are right, the possibilities are endless!
I am now writing a spec for Extended OpenAI Conversation. Do you think it would be possible to send multiple files in one service call?

I think it's possible. Please open a feature request and I'll look into it.
Also feel free to share any automations, scripts etc. I will collect them as inspiration for others.

This is really cool; I was playing with it a lot last night. I restream my Frigate Birdseye camera (3 camera feeds in one view) and send it to the service with an increased image size. The resolution is good enough for GPT-4o to describe the scene in each camera accurately. It was even correctly identifying my vehicles in the driveway by make and model. I am also using the file notify service in my automation to store the full response of GPT-4o in a .txt file. Nice work on the integration and thank you for making it.


Hey everyone,

Just over a week after the initial release, I am excited to share the first significant update for gpt4vision. The integration has been completely rewritten to support different AI "providers". This update adds a second provider: LocalAI.

If you already have LocalAI running on your machine, setup is very easy and can be done entirely through Home Assistant's UI. Just enter the IP address and port and you're ready to go.

In case you don’t already use LocalAI but want to run your smart home completely locally, check out Quickstart | LocalAI documentation to get started.

  • This update also adds support for sending multiple images at once for even more context (see the sketch after this list)

  • The temperature parameter has been added for more control over the response

  • Other smaller improvements, such as better error messages, translations and input validation
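
As a rough idea of what a multi-image call could look like (untested sketch; the file paths and prompt are placeholders, and the multiline image_file format matches the spec shared further down in this thread):

  - service: gpt4vision.image_analyzer
    data:
      provider: LocalAI
      max_tokens: 100
      temperature: 0.3
      # one path per line; both snapshots are sent in a single call
      image_file: |-
        /config/www/tmp/front_door.jpg
        /config/www/tmp/driveway.jpg
      message: Compare the two camera views and describe any people you can see.
    response_variable: response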

You can update right now by going to HACS > GPT-4 Vision > (…) > Update information.

:warning: Note that due to the complete rewrite of the integration, you'll need to set it up again.

Follow the instructions at GitHub - valentinfrlch/ha-gpt4vision (Image Analyzer for Home Assistant using GPT-4o) if you need any help.

As always, please post any questions you have in this forum. If you've found a bug, please create a bug report.

Feel free to open a feature request if you have ideas for additional providers or other feature suggestions.

Enjoy.

-valentin

Awesome, thanks for adding multiple image processing!

Quick update: v0.3.5 was released

This version adds support for Ollama which, just like LocalAI, is a self-hosted alternative to OpenAI. In my testing it was also faster than LocalAI, and it seems to support multiple images per call, whereas LocalAI doesn't seem to…

For instructions on how to set up Ollama with gpt4vision, follow the docs.

As always, if you have any suggestions please create a feature request. Should you encounter any bugs, please create an issue and I will do my best to help you.

Cheers!
-valentin


Integrating Extended OpenAI Conversation and gpt4vision
Some of you wanted to use gpt4vision with OpenAI Extended Conversation. @Simone77 already wrote a spec that works well.
However, as far as I understand, it also requires a script that runs every x minutes and captures a snapshot of every camera. This means the snapshots are likely out of date by the time you ask about them.

So I finally wrote my own spec, which takes a list of cameras (you'll need to expose them to Assist) and a prompt as parameters. The LLM will dynamically decide which camera entities to include.
It then captures a snapshot of each of those cameras and passes them all into a single call to gpt4vision:

Example: “Is someone at the front door?”
The LLM understands that you want to know about the front door and therefore only passes your front door camera to gpt4vision.

Or: “What’s happening around the house?”
The LLM will pass all available cameras to gpt4vision and respond appropriately.

- spec:
    name: describe_camera_feed
    description: Get a description of what's happening on the security cameras around the house
    parameters:
      type: object
      properties:
        message:
          type: string
          description: The prompt for the image analyzer
        entity_ids:
          type: array
          description: List of camera entities
          items:
            type: string
            description: Entity id of the camera
      required:
      - message
      - entity_ids
  function:
    type: script
    sequence:
    - repeat:
        sequence:
          - service: camera.snapshot
            metadata: {}
            data:
              filename: /config/www/tmp/{{repeat.item}}.jpg
            target:
              entity_id: "{{repeat.item}}"
        for_each: "{{ entity_ids }}"
    - service: gpt4vision.image_analyzer
      metadata: {}
      data:
        provider: Ollama
        max_tokens: 100
        target_width: 1000
        temperature: 0.3
        image_file: |-
          {%for camera in entity_ids%}/config/www/tmp/{{camera}}.jpg
          {%endfor%}
        message: "{{message}}"
      response_variable: _function_result

Hope this helps!


The moment you realize you can use ChatGPT to analyze images and return the analysis as a JSON string, a whole new world opens up. The example below is a very simple one with a single variable. However, I have also created more complex examples that include multiple variables, such as counting red, white, and grey cars.

@valentinfrlch thank you for you work!

alias: Carport Cam - OpenAI make and analyze picture
sequence:
  - service: camera.snapshot
    data:
      filename: /config/www/tmp/carport.jpg
    target:
      entity_id:
        - camera.192_168_xx_xx
  - service: gpt4vision.image_analyzer
    data:
      max_tokens: 100
      image_file: /config/www/tmp/carport.jpg
      provider: OpenAI
      model: gpt-4o
      target_width: 1280
      temperature: 0.5
      detail: low
      message: >-
        Please check if there is a white car in the driveway and respond with a
        JSON object. The JSON object should have a single key,
        "car_in_driveway", which should be set to true if there is a white car
        in the driveway and false otherwise.
    response_variable: response
  - choose:
      - conditions:
          - condition: template
            value_template: >-
              {{ ((states('input_text.test') |regex_replace(find='```json

              ', replace='', ignorecase=False) |regex_replace(find='

              ```', replace='', ignorecase=False)  ) |
              from_json).car_in_driveway }}
            enabled: true
        sequence:
          - service: input_boolean.turn_on
            target:
              entity_id: input_boolean.car_in_driveway
            data: {}
    default:
      - service: input_boolean.turn_off
        target:
          entity_id: input_boolean.car_in_driveway
        data: {}
    enabled: true
mode: single


This is amazing, thanks for sharing!
Maybe it could even recognize license plates to check if it’s your car? The detail parameter would probably have to be set to high for this.
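
Something along these lines might work (untested sketch reusing the call from your script, with only detail and the prompt changed; the plate number is a placeholder):

  - service: gpt4vision.image_analyzer
    data:
      max_tokens: 100
      image_file: /config/www/tmp/carport.jpg
      provider: OpenAI
      model: gpt-4o
      target_width: 1280
      detail: high  # higher detail so the plate characters stay readable
      message: >-
        Read the license plate of the car in the driveway and respond with a
        JSON object. The JSON object should have a single key, "is_my_car",
        set to true if the plate reads "AB 123 CD" (placeholder) and false
        otherwise.
    response_variable: response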

May I include this script in the wiki as inspiration for others?

Feel free to add this to the Wiki.
I believe it's capable of recognizing license plates, as it can even accurately count items such as bicycles.

For another automation that I use for my doorbell, I use the instructions below and then announce the result on a smart speaker with TTS. ChatGPT can even identify DHL and Domino's delivery persons without fail. It really impresses me. (A rough sketch of the automation follows below the example replies.)

As a smart camera doorbell assistant, you are tasked with analyzing images captured by the camera doorbell and verbally articulating your observations. Begin every message with the sound “ding dong” to emulate the doorbell’s ring, followed by a succinct analysis of the scene. Your descriptions should be brief and informative, such as “ding dong a group of kids is at the door” or “ding dong a DHL delivery person is waiting outside.” Since your messages will be played through a smart speaker, clarity and conciseness are key. Describe what you see in a neutral and factual manner, focusing on essential details like the identification of visitors, whether they are known contacts or service providers, and avoid including extraneous information to ensure the listener’s quick comprehension and attention.

Examples of replies:

Ding dong, a DHL delivery person is waiting outside with a package.
Ding dong, a Domino's delivery person is at the door with pizza.
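
In case it helps anyone, here is a rough sketch of how such a doorbell announcement could be wired up, following the TTS script posted earlier in this thread (the trigger, camera, TTS and media player entity ids are placeholders, and the doorbell assistant instructions above go into message):

alias: Doorbell announcement (sketch)
trigger:
  - platform: state
    entity_id: binary_sensor.doorbell  # placeholder doorbell press/person sensor
    to: "on"
action:
  - service: camera.snapshot
    target:
      entity_id: camera.doorbell  # placeholder camera
    data:
      filename: /config/www/tmp/doorbell.jpg
  - service: gpt4vision.image_analyzer
    data:
      provider: OpenAI
      model: gpt-4o
      max_tokens: 100
      detail: low
      image_file: /config/www/tmp/doorbell.jpg
      message: >-
        [the smart camera doorbell assistant instructions quoted above]
    response_variable: response
  - service: tts.speak
    target:
      entity_id: tts.piper  # placeholder TTS engine
    data:
      media_player_entity_id: media_player.kitchen_speaker  # placeholder speaker
      message: "{{ response.response_text }}"
mode: single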

Hi @valentinfrlch,

Before posting the script, I cleaned up some unnecessary debug actions, but now I realize that I broke the script. The correct script is:

alias: Carport Cam - OpenAI make and analyze picture
sequence:
  - service: camera.snapshot
    data:
      filename: /config/www/tmp/carport.jpg
    target:
      entity_id:
        - camera.192_168_3_119_2
  - service: gpt4vision.image_analyzer
    data:
      max_tokens: 100
      image_file: /config/www/tmp/carport.jpg
      provider: OpenAI
      model: gpt-4o
      target_width: 512
      temperature: 0.5
      detail: low
      message: >-
        Please check if there is a white car in the driveway and respond with a
        JSON object. The JSON object should have a single key,
        "car_in_driveway", which should be set to true if there is a white car
        in the driveway and false otherwise.
    response_variable: response
 
  - choose:
      - conditions:
          - condition: template
            value_template: >-
              {{ (( response.response_text |regex_replace(find='```json

              ', replace='', ignorecase=False) |regex_replace(find='

              ```', replace='', ignorecase=False)  ) |
              from_json).car_in_driveway }}
            enabled: true
        sequence:
          - service: input_boolean.turn_on
            target:
              entity_id: input_boolean.car_in_driveway
            data: {}
    default:
      - service: input_boolean.turn_off
        target:
          entity_id: input_boolean.car_in_driveway
        data: {}
    enabled: true
mode: single

Thanks!
I have added your updated script here with some small modifications:

Thanks again for sharing!

Any chance of updating this to work with the LLM control in 2024.6 using a custom intent?

Sure! I’m always happy to add new features. Can you elaborate a little more on what exactly you want gpt4vision to be able to do?


Any examples of how to use this with Frigate? :love_letter: