LLM Vision: Let Home Assistant see!

and here is the card:

type: custom:vertical-stack-in-card
cards:
  - type: tile
    entity: script.script_chiller_card
    icon_tap_action:
      action: toggle
    vertical: false
    show_entity_picture: true
  - type: markdown
    content: '{{states("input_text.gpt4vision_response")}}'

Try this instead:

"{{states('input_text.gpt4vision_response')}}"

(The order of " and ' is reversed.)
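
In the card, that markdown entry would then look like this:

  - type: markdown
    content: "{{ states('input_text.gpt4vision_response') }}"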

I’m afraid it didn’t help.

Thank you very much, very cool project. It would be even more interesting if ha-gpt4vision could support video. Gemini Flash supports video, and we might get better event analysis out of it. Since Frigate supports event video recording, we could use that for better analysis.

If I understand correctly, the LLM will not actually analyze the entire video (all frames), just some selected frames.
So supporting .mp4 as a file type for the image_path parameter should be possible. gpt4vision would then extract an image every second or so and analyze those. This would also make it compatible with all currently supported LLMs.

Definitely an interesting idea, I’ll try to implement this some time soon!

Any idea when, maybe? I don’t want to nag or bother you. The project is amazing.

No worries!
The problem seems to be the max length of the input_text entity. If the text in the set_value service call exceeds that length, nothing is set. The only way I can think of to get around this is setting max_tokens to something small:

service: gpt4vision.image_analyzer
data:
  provider: OpenAI
  detail: low
  temperature: 0.5
  message: What do you see in the image?
  image_entity:
    - camera.front_door
  max_tokens: 60
response_variable: response

Edit:

  • You can change the max length in the helper settings (where you set up your input_text helper); however, the maximum you can set seems to be 255 characters. If you manage the helper in YAML instead of the UI, see the sketch below.
  • 1 token is apparently about 4 characters, which is how I got to roughly 60 tokens.
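
A minimal sketch of the corresponding YAML helper definition, in case you manage helpers in configuration.yaml rather than in the UI (the helper name is just an example):

# configuration.yaml
input_text:
  gpt4vision_response:
    name: GPT4Vision Response
    max: 255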

I think the problem was with the “target” syntax. You have to omit it. Here is the code that eventually worked for me:


alias: Script Chiller english
sequence:
  - service: gpt4vision.image_analyzer
    metadata: {}
    data:
      provider: OpenAI
      include_filename: false
      target_width: 1280
      detail: high
      temperature: 0.5
      image_entity:
        - camera.camera_pergula
      model: gpt-4o
      message: "what red numbers do you see? "
    response_variable: response
  - service: input_text.set_value
    data:
      value: "{{response.response_text}}"
      entity_id: input_text.test
description: ""


and that’s the card:


Name Change

gpt4vision has only existed for a couple of months but has evolved a lot in this time. At first, it was only compatible with OpenAI’s API, hence its name gpt4vision.

Since then, more providers have been added for more flexibility, including some self-hosted options. Because of this, I felt the name ‘gpt4vision’ was no longer descriptive of the project, which is why the name is changing: gpt4vision is now llmvision.

What does this mean for you?

Since your provider configuration is stored under gpt4vision, it can no longer be accessed after the update (v1.0.0). To migrate:

  1. Delete all gpt4vision configs
  2. Remove gpt4vision from HACS (do not download the update)
  3. Restart Home Assistant
  4. Download LLM Vision again
  5. Restart Home Assistant (again)

This should ensure a smooth transition. Unfortunately, it means that you need to enter your API keys again.

:warning: If you have any automations or scripts using gpt4vision, you will need to change the service name:
Previously: gpt4vision.image_analyzer
New: llmvision.image_analyzer
The service call otherwise stays the same.
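
For example, an existing call only changes in its service line; all parameters stay as they are:

  # before
  service: gpt4vision.image_analyzer
  # after
  service: llmvision.image_analyzer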

If you have any questions or need help, feel free to reach out!

Cheers!


Good idea, good luck!

Hello!

Is there a way to use this integration with Assist/OpenAI Conversation (and not with Extended OpenAI Conversation)?

Thank you in advance and thanks for your work!

Unfortunately, there is no way to expose services (which is what llmvision provides) to Assist.
However, script entities can be exposed, so you can write a script that reads the response out through TTS and call that script through Assist — there’s a minimal sketch at the end of this post. You can use the scripts from the LLM Vision Wiki as inspiration.
For more complex actions I highly suggest using Extended OpenAI Conversation, as it allows for dynamic prompt and image entity selection and responds directly in the chat.

Also, Extended OpenAI Conversation can be used through the Assist Interface if that’s what you’re after: Installation Guide.
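
Here is a minimal sketch of such a script (not from the wiki, just an illustration). It assumes an OpenAI provider is configured; tts.home_assistant_cloud, media_player.living_room and camera.front_door are placeholders for your own entities:

alias: Describe front door
description: Expose this script to Assist so it can be triggered by voice
sequence:
  # Analyze the current camera image
  - service: llmvision.image_analyzer
    data:
      provider: OpenAI
      model: gpt-4o-mini
      message: Briefly describe what you see.
      image_entity:
        - camera.front_door
      max_tokens: 100
    response_variable: response
  # Read the response out loud on a media player
  - service: tts.speak
    target:
      entity_id: tts.home_assistant_cloud
    data:
      media_player_entity_id: media_player.living_room
      message: "{{ response.response_text }}"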


Video Support

LLM Vision now supports video!
This provides the LLM with even more context and is especially useful for recognizing actions.

Version v1.0.1 adds the new video_analyzer service which, in addition to local video files, can also take Frigate’s event IDs as input. If you have MQTT configured with Frigate, you can use video_analyzer as follows:

alias: Example
description: ""
trigger:
  - platform: mqtt
    topic: frigate/events
condition:
  - condition: template
    value_template: "{{trigger.payload_json['type'] == 'end'}}"
action:
  - service: llmvision.video_analyzer
    metadata: {}
    data:
      provider: OpenAI
      model: gpt-4o-mini
      message: >-
        Briefly describe what the person does. Are they delivering mail? Keep
        your response short.
      interval: 3
      include_filename: false
      target_width: 1280
      detail: high
      max_tokens: 100
      temperature: 0.3
      event_id: "{{trigger.payload_json['after']['id']}}"
    response_variable: response
  - service: telegram_bot.send_photo
    data:
      url: https://<homeassistant_url>/api/frigate/notifications/{{trigger.payload_json["after"]["id"]}}/snapshot.jpg
      caption: "{response.respose_text}"
mode: single

This will fetch the recording from Frigate, analyze one frame every 3 seconds, and send the event snapshot together with a description of what the person does as the caption.

The condition ensures that the recording has completed so it can be fetched.

If you have any questions or need help, feel free to reach out!

Cheers!


I’m currently using this to let Extended OpenAI Conversation process (Frigate) cameras. However, I’m also trying out Double Take, and I think it should be possible to use that info to put names to faces in the output of LLM Vision.

Or what would be the best approach here?

I’m getting this error when I press the action to test it in Developer Tools:
Failed to perform the action llmvision.image_analyzer. OpenAI provider is not configured

You will need to configure the OpenAI provider first (you’ll need an API key). Check the docs for instructions on how to do that:

Not quite sure what you mean by “put names to faces in the output of llm vision”.

The only way I can think of to make llm vision “recognize” faces would be to include all the faces you want it to recognize in the input, along with the image you want to analyze. If you enable include_filename and set the person’s name as the filename of your “face image”, then maybe it could make an educated guess.

This setup would not use Double Take though.
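
Something along these lines might work — a rough, untested sketch; the file paths are just examples, and I’m not certain image_path and image_entity can be combined in a single call:

service: llmvision.image_analyzer
data:
  provider: OpenAI
  model: gpt-4o
  include_filename: true
  message: >-
    The first images are reference photos named after the person they show.
    Which of these people, if any, do you see in the camera image?
  image_path:
    - /config/www/faces/alice.jpg
    - /config/www/faces/bob.jpg
  image_entity:
    - camera.front_door
  max_tokens: 100
response_variable: response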

I already created a provider and still get the error.