Tqvm, very cool project. It would be even more interesting if ha-gpt4vision could support video. Gemini Flash supports video, so we might get better event analysis out of video, and since Frigate supports event video recording we could use that for better analysis.
If I understand correctly, the LLM will not actually analyze the entire video (all frames) but just some selected frames.
So supporting .mp4 as a file type for the image_path parameter should be possible. gpt4vision would then extract an image every second or so and analyse those. This would also keep it compatible with all currently supported LLMs.
Definitely an interesting idea, I’ll try to implement this some time soon!
No worries!
The problem seems to be the max length of the input_text entity. If the text in the set_value service call exceeds that length, nothing is set. The only way I can think of to get around this is setting max_tokens to something small:
service: gpt4vision.image_analyzer
data:
  provider: OpenAI
  detail: low
  temperature: 0.5
  message: What do you see in the image?
  image_entity:
    - camera.front_door
  max_tokens: 60
response_variable: response
Edit:
You can change the max length in the helper settings (where you set up your input_text helper); however, the maximum length you can set it to seems to be 255.
One token is apparently about 4 characters, which is how I got to roughly 60 tokens.
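As a minimal sketch of the other half of this workaround (assuming an input_text.llm_response helper as a placeholder, and that the response object exposes response_text, as used later in this thread), you could also truncate the response before writing it, so the set_value call never exceeds the 255-character limit:

  - service: input_text.set_value
    target:
      entity_id: input_text.llm_response
    data:
      value: "{{ response.response_text[:255] }}"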
gpt4vision has only existed for a couple of months but has evolved a lot in this time. At first, it was only compatible with OpenAI’s API, hence its name gpt4vision.
Since then, more providers have been added for more flexibility, including some self-hosted options. Because of this, I felt the name ‘gpt4vision’ is no longer descriptive of the project, which is why the name is changing. gpt4vision is now llmvision.
Since the config for your providers is stored under gpt4vision, your configs can no longer be accessed after the update (v1.0.0). To migrate:
1. Delete all gpt4vision configs
2. Remove gpt4vision from HACS (do not download the update)
3. Restart Home Assistant
4. Download LLM Vision again
5. Restart Home Assistant (again)
This should ensure a smooth transition. Unfortunately, it means that you need to enter your API keys again.
If you have any automations or scripts using gpt4vision, you will need to change the service name. Previously: gpt4vision.image_analyzer. New: llmvision.image_analyzer.
The service call stays the same otherwise.
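For example, only the service name in your actions changes (parameters shortened for brevity):

Previously:
service: gpt4vision.image_analyzer
data:
  provider: OpenAI
  image_entity:
    - camera.front_door

Now:
service: llmvision.image_analyzer
data:
  provider: OpenAI
  image_entity:
    - camera.front_door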
If you have any questions or need help, feel free to reach out!
Unfortunately there is no way to expose services (which is what llmvision provides) to Assist.
However, script entities can be exposed, so you can write a script which reads the response through TTS and call that script through Assist. You can use scripts from here as inspiration: LLM Vision Wiki
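Here is a rough sketch of such a script (the TTS and media player entities are placeholders you would need to replace with your own):

alias: Describe front door
sequence:
  - service: llmvision.image_analyzer
    data:
      provider: OpenAI
      message: What do you see in the image?
      image_entity:
        - camera.front_door
      max_tokens: 100
    response_variable: response
  - service: tts.speak
    target:
      entity_id: tts.home_assistant_cloud
    data:
      media_player_entity_id: media_player.living_room
      message: "{{ response.response_text }}"

You can then expose this script entity to Assist and trigger it by voice.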
For more complex actions I highly suggest using Extended OpenAI Conversation as it allows for dynamic prompt and image entity selection and responds directly in the chat.
Also, Extended OpenAI Conversation can be used through the Assist Interface if that’s what you’re after: Installation Guide.
LLM Vision now supports video!
This provides the LLM with even more context and is especially useful for recognizing actions.
Version v1.0.1 adds the new video_analyzer service which, in addition to local video files, can also take Frigate's event IDs as input. If you have MQTT configured with Frigate, you can use video_analyzer as follows:
alias: Example
description: ""
trigger:
  - platform: mqtt
    topic: frigate/events
condition:
  - condition: template
    value_template: "{{trigger.payload_json['type'] == 'end'}}"
action:
  - service: llmvision.video_analyzer
    metadata: {}
    data:
      provider: OpenAI
      model: gpt-4o-mini
      message: >-
        Briefly describe what the person does. Are they delivering mail? Keep
        your response short.
      interval: 3
      include_filename: false
      target_width: 1280
      detail: high
      max_tokens: 100
      temperature: 0.3
      event_id: "{{trigger.payload_json['after']['id']}}"
    response_variable: response
  - service: telegram_bot.send_photo
    data:
      url: https://<homeassistant_url>/api/frigate/notifications/{{trigger.payload_json["after"]["id"]}}/snapshot.jpg
      caption: "{{ response.response_text }}"
mode: single
This will fetch the recording from Frigate, analyze one frame every 3 seconds, and send the event snapshot together with a description of what the person does as the caption.
The condition ensures that the recording is complete so it can be fetched.
If you have any questions or need help, feel free to reach out!
I'm currently using this to let Extended OpenAI Conversation process (Frigate) cameras. However, I'm also trying out double-take, and I think it should be possible to use that info as well to put names to faces in the output of LLM Vision.
I’m getting this error when I press the action to test in Developer Tools: Failed to perform the action llmvision.image_analyzer. OpenAI provider is not configured
Not quite sure what you mean by “put names to faces in the output of llm vision”.
The only way I can think of to make llm vision “recognize” faces would be to include all the faces you want it to recognize in the input along with the image you want to analyze. If you enable include_filename and have the person’s name set as the filename of your “face image”, then maybe it could make an educated guess.
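As a rough sketch of that idea (the face image paths are placeholders, and combining image_path with image_entity in a single call is an assumption, so check the LLM Vision docs for your version):

service: llmvision.image_analyzer
data:
  provider: OpenAI
  include_filename: true
  message: >-
    The first images are reference photos named after the person they show.
    Based on them, who do you see in the last image?
  image_path:
    - /config/www/faces/Alice.jpg
    - /config/www/faces/Bob.jpg
  image_entity:
    - camera.front_door
  max_tokens: 50
response_variable: response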