LLM Vision is a Home Assistant integration to analyze images, videos and camera feeds using the vision capabilities of multimodal LLMs.
Supported providers are OpenAI, Anthropic, Google Gemini, LocalAI, Ollama and any OpenAI compatible API.
Responses are returned as response variables for easy use with automations. The usage possibilities are limitless. You could request a car’s license plate number when one is detected, create custom delivery announcements, or set an alarm to trigger when suspicious activity is detected.
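For example, a single service call can analyze a snapshot and hand the model's answer back through a response variable. A minimal sketch, using the gpt4vision.image_analyzer service as it appears in the examples later in this thread (the integration has since been renamed LLM Vision, so the exact service name may differ; the file path and notifier are placeholders):

- service: gpt4vision.image_analyzer
  data:
    provider: OpenAI
    model: gpt-4o
    max_tokens: 100
    target_width: 1280
    image_file: /config/www/tmp/snapshot.jpg
    message: "Describe what you see in this image."
  response_variable: response
- service: notify.mobile_app_phone  # placeholder notifier
  data:
    message: "{{ response.response_text }}"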
Features
Takes local images, videos, and camera or image entities as input
Compatible with OpenAI models, Anthropic Claude, Google Gemini, LocalAI and Ollama
Images can be downscaled for faster processing
Filename or entity can be attached for more context
Installation
Resources
Check the docs for detailed instructions on how to set up LLM Vision and each of the supported providers or get inspiration from examples.
Hello, I want the following to happen when I'm talking with the voice assistant (Extended OpenAI Conversation).
I say a certain sentence, for example: "What do you see on the camera?"
Then this happens:
1. My automation or script is executed
2. A photo is taken from the camera I specified
3. That photo is sent to ha-gpt4vision
4. The response from ha-gpt4vision is converted to sound with TTS
I have given a more detailed explanation here:
Since I have only just discovered the world of Home Assistant, is this the right approach, or did I go wrong somewhere?
The approach above is very limited, because a script or automation must be defined for each request.
For example, one with this prompt: "Analyze everything you see in the picture."
Another one with this prompt:
"Explain how the things you see in the picture work."
Is there an easier way to handle all these requests without a separate automation or script for each one?
For example, like this:
How many people do you see on the camera?
Or what is the color of their clothes?
Do they look suspicious?
If I understand correctly, you want to be able to ask your voice assistant what it can see in an image captured by a camera entity.
Afaik, Extended OpenAI Conversation supports function calling. This means you could write a script (with an input field for your prompt) that:
Takes your prompt as input (field)
Captures a snapshot from your camera
Calls gpt4vision.image_analyzer with the image just captured
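A rough sketch of such a script, assuming a placeholder camera entity and the gpt4vision.image_analyzer call documented later in this thread (the field name, paths and provider are illustrative, not a definitive implementation):

alias: Analyze camera with a variable prompt
fields:
  prompt:
    description: The question to ask about the camera image
    example: "How many people do you see?"
sequence:
  - service: camera.snapshot
    target:
      entity_id: camera.front_door  # placeholder camera entity
    data:
      filename: /config/www/tmp/front_door.jpg
  - service: gpt4vision.image_analyzer
    data:
      provider: OpenAI
      max_tokens: 100
      image_file: /config/www/tmp/front_door.jpg
      message: "{{ prompt }}"
    response_variable: analysis

Extended OpenAI Conversation can then call this script with whatever prompt you speak; the spec posted further down in this thread shows a more complete version of this idea.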
You got it almost right. I tested something similar with an automation before I wrote this. I wrote a prompt for ha-gpt4vision: "Tell me everything you see in the picture in detail." Now, when the automation is executed, a snapshot is taken from the camera and everything in the image is explained in detail… But the problem is that the ha-gpt4vision prompt is static! I want the ha-gpt4vision prompt to be variable. For example, I should be able to ask it how many people are in the picture without writing an automation for this in advance. I don't know exactly how to explain what I mean, but what I don't want is to write an automation plus a fixed prompt for each request. I want the prompt that is given to ha-gpt4vision to be variable, that is, once to ask how many people are in the picture and another time to ask the color of their clothes.
Actually, I want the prompt that is sent to ha-gpt4vision to be variable; that is, when we tell the voice assistant to send a prompt to ha-gpt4vision, we should not have to write a static prompt for it in advance.
Thanks for this ! You are right, possibilities are endless !!!
I am now writing a spec for Extended OpenAI Conversation. Do you think it will be possible to send multiple files in one service call?
I think it’s possible. Please open a feature request and I’ll look into it.
Also feel free to share any automations, scripts etc. I will collect them as inspiration for others.
This is really cool; I was playing with it a lot last night. I restream my Frigate Birdseye camera (3 camera feeds in one view) and send it to the service with an increased image size. The resolution is good enough for GPT-4o to describe the scene on each camera accurately. It was even correctly identifying my vehicles in the driveway by make and model. I am also using the file notify service in my automation to store the full response from GPT-4o in a .txt file. Nice work on the integration, and thank you for making it.
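For reference, writing the response to a text file could look roughly like this. This is a sketch, not the poster's exact automation; the notifier name and file path are assumptions, with the file notify platform configured in configuration.yaml:

notify:
  - platform: file
    name: gpt4o_log  # placeholder notifier name
    filename: /config/www/gpt4o_responses.txt  # placeholder path
    timestamp: true

Then, after the gpt4vision.image_analyzer step that sets response_variable: response:

- service: notify.gpt4o_log
  data:
    message: "{{ response.response_text }}"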
Just over a week since the initial release, I am excited to share the first significant update for gpt4vision. The integration has been completely rewritten to allow support for different AI “providers”. This update adds a second provider, LocalAI.
If you already have LocalAI running on your machine, setup is very easy and can be done entirely through Home Assistant’s UI. Just enter the IP address and port and you’re ready to go.
In case you don’t already use LocalAI but want to run your smart home completely locally, check out Quickstart | LocalAI documentation to get started.
This update also adds support for sending multiple images at once for even more context.
The temperature parameter has been added for more control over the response.
There are also other smaller improvements such as better error messages, translations and input validation.
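The multi-image support mentioned above might be used like this (a sketch only; the paths are placeholders, and the newline-separated image_file format follows the spec posted later in this thread):

- service: gpt4vision.image_analyzer
  data:
    provider: LocalAI
    max_tokens: 100
    image_file: |-
      /config/www/tmp/front_door.jpg
      /config/www/tmp/driveway.jpg
    message: "Compare the two images and describe what is happening."
  response_variable: response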
You can update right now by going to HACS > GPT-4 Vision > (…) > Update information.
Note that due to the complete rewrite of the integration you’ll need to set up the integration again.
This version adds support for Ollama which, just like LocalAI, is a self-hosted alternative to OpenAI. In my testing it was also faster than LocalAI, and it seems to support multiple images per call, whereas LocalAI doesn’t seem to…
For instructions on how to set up Ollama with gpt4vision, follow the docs.
As always, if you have any suggestions please create a feature request. Should you encounter any bugs, please create an issue and I will do my best to help you.
Integrating Extended OpenAI Conversation and gpt4vision
Some of you wanted to use gpt4vision with Extended OpenAI Conversation. @Simone77 already wrote a spec that works well.
However, as far as I understand, it also requires a script running every x minutes that captures a snapshot of all cameras. This means the snapshots are likely out of date by the time you ask about them.
So I finally wrote my own spec, which takes a list of cameras (you’ll need to expose them to Assist) and a prompt as parameters. The LLM will dynamically consider which camera entities to include.
It then captures a snapshot on each of those cameras and passes them all into a single call to gpt4vision:
Example: “Is someone at the front door?”
The LLM understands that you want to know about the front door and therefore only passes your front door camera to gpt4vision.
Or: “What’s happening around the house?”
The LLM will pass all available cameras to gpt4vision and respond appropriately.
- spec:
    name: describe_camera_feed
    description: Get a description of what's happening on security cameras around the house
    parameters:
      type: object
      properties:
        message:
          type: string
          description: The prompt for the image analyzer
        entity_ids:
          type: array
          description: List of camera entities
          items:
            type: string
            description: Entity id of the camera
      required:
        - message
        - entity_ids
  function:
    type: script
    sequence:
      - repeat:
          sequence:
            - service: camera.snapshot
              metadata: {}
              data:
                filename: /config/www/tmp/{{repeat.item}}.jpg
              target:
                entity_id: "{{repeat.item}}"
          for_each: "{{ entity_ids }}"
      - service: gpt4vision.image_analyzer
        metadata: {}
        data:
          provider: Ollama
          max_tokens: 100
          target_width: 1000
          temperature: 0.3
          image_file: |-
            {%for camera in entity_ids%}/config/www/tmp/{{camera}}.jpg
            {%endfor%}
          message: "{{message}}"
        response_variable: _function_result
The moment you realize you can use ChatGPT to analyze images and return the analysis as a JSON string, a whole new world opens up. The example below is a very simple one with a single variable. However, I have also created more complex examples that include multiple variables, such as counting red, white, and grey cars.
alias: Carport Cam - OpenAI make and analyze picture
sequence:
  - service: camera.snapshot
    data:
      filename: /config/www/tmp/carport.jpg
    target:
      entity_id:
        - camera.192_168_xx_xx
  - service: gpt4vision.image_analyzer
    data:
      max_tokens: 100
      image_file: /config/www/tmp/carport.jpg
      provider: OpenAI
      model: gpt-4o
      target_width: 1280
      temperature: 0.5
      detail: low
      message: >-
        Please check if there is a white car in the driveway and respond with a
        JSON object. The JSON object should have a single key,
        "car_in_driveway", which should be set to true if there is a white car
        in the driveway and false otherwise.
    response_variable: response
  - choose:
      - conditions:
          - condition: template
            value_template: >-
              {{ ((states('input_text.test') |regex_replace(find='```json
              ', replace='', ignorecase=False) |regex_replace(find='
              ```', replace='', ignorecase=False) ) |
              from_json).car_in_driveway }}
            enabled: true
        sequence:
          - service: input_boolean.turn_on
            target:
              entity_id: input_boolean.car_in_driveway
            data: {}
    default:
      - service: input_boolean.turn_off
        target:
          entity_id: input_boolean.car_in_driveway
        data: {}
        enabled: true
mode: single
This is amazing, thanks for sharing!
Maybe it could even recognize license plates to check if it’s your car? The detail parameter would probably have to be set to high for this.
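A hedged sketch of what that call might look like, reusing the script above with only the detail setting and the prompt changed:

- service: gpt4vision.image_analyzer
  data:
    provider: OpenAI
    model: gpt-4o
    max_tokens: 100
    target_width: 1280
    temperature: 0.5
    detail: high  # higher detail for small features such as license plates
    image_file: /config/www/tmp/carport.jpg
    message: Read the license plate of the car in the driveway, if one is visible.
  response_variable: response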
May I include this script in the wiki as inspiration for others?
Feel free to add this to the Wiki.
I believe it’s capable of recognizing license plates, as it can even accurately count items such as bicycles.
For another automation that I use for my doorbell, I use the following instructions and then announce the result with TTS on a smart speaker. ChatGPT can even identify DHL and Domino's delivery persons without fail. It really impresses me.
As a smart camera doorbell assistant, you are tasked with analyzing images captured by the camera doorbell and verbally articulating your observations. Begin every message with the sound “ding dong” to emulate the doorbell’s ring, followed by a succinct analysis of the scene. Your descriptions should be brief and informative, such as “ding dong a group of kids is at the door” or “ding dong a DHL delivery person is waiting outside.” Since your messages will be played through a smart speaker, clarity and conciseness are key. Describe what you see in a neutral and factual manner, focusing on essential details like the identification of visitors, whether they are known contacts or service providers, and avoid including extraneous information to ensure the listener’s quick comprehension and attention.
Examples of replies:
Ding dong, a DHL delivery person is waiting outside with a package.
Ding dong, a Domino's delivery person is at the door with pizza.
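The surrounding automation might look roughly like this; a sketch that assumes placeholder doorbell, camera, TTS and speaker entities, with the instructions above shortened into the message for brevity:

alias: Doorbell announcement
trigger:
  - platform: state
    entity_id: binary_sensor.doorbell_pressed  # placeholder doorbell trigger
    to: "on"
action:
  - service: camera.snapshot
    target:
      entity_id: camera.doorbell  # placeholder camera
    data:
      filename: /config/www/tmp/doorbell.jpg
  - service: gpt4vision.image_analyzer
    data:
      provider: OpenAI
      model: gpt-4o
      max_tokens: 100
      image_file: /config/www/tmp/doorbell.jpg
      message: >-
        You are a smart camera doorbell assistant. Begin with "ding dong",
        then briefly and factually describe who or what is at the door.
    response_variable: response
  - service: tts.speak
    target:
      entity_id: tts.home_assistant_cloud  # placeholder TTS entity
    data:
      media_player_entity_id: media_player.kitchen_speaker  # placeholder speaker
      message: "{{ response.response_text }}"
mode: single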
Before posting the script, I cleaned up some unnecessary debug actions, but now I realize that I broke the script. The correct script is:
alias: Carport Cam - OpenAI make and analyze picture
sequence:
  - service: camera.snapshot
    data:
      filename: /config/www/tmp/carport.jpg
    target:
      entity_id:
        - camera.192_168_3_119_2
  - service: gpt4vision.image_analyzer
    data:
      max_tokens: 100
      image_file: /config/www/tmp/carport.jpg
      provider: OpenAI
      model: gpt-4o
      target_width: 512
      temperature: 0.5
      detail: low
      message: >-
        Please check if there is a white car in the driveway and respond with a
        JSON object. The JSON object should have a single key,
        "car_in_driveway", which should be set to true if there is a white car
        in the driveway and false otherwise.
    response_variable: response
  - choose:
      - conditions:
          - condition: template
            value_template: >-
              {{ (( response.response_text |regex_replace(find='```json
              ', replace='', ignorecase=False) |regex_replace(find='
              ```', replace='', ignorecase=False) ) |
              from_json).car_in_driveway }}
            enabled: true
        sequence:
          - service: input_boolean.turn_on
            target:
              entity_id: input_boolean.car_in_driveway
            data: {}
    default:
      - service: input_boolean.turn_off
        target:
          entity_id: input_boolean.car_in_driveway
        data: {}
        enabled: true
mode: single