The Anything Sensor: Proof of Concept

Hey there, I’m Kenny! Picture me as a digital DIY/script kiddie with a side of Python, but not quite a code maestro just yet. More of a DIY enthusiast who loves tinkering around.

I’m super into AI and LLMs. Check this out: Google just rolled out their new Gemini Pro with an amazing offer of 60 free API calls a minute. Naturally, I had to try and create a do-it-all sensor. With vision tech, you could turn almost anything into a sensor. I’ve tried a bunch of things and so far, so good. From a ‘what’s in my fridge’ sensor to a baby monitor, or even a ‘find my keys’ gadget, the possibilities are endless. I even made a demo for mood and presence detection.

If you’re curious, I’ll be uploading the Python code on GitHub tomorrow. If it gets some buzz, maybe it’ll inspire a custom component. And hey, if this message seems a bit more polished, blame ChatGPT – I might be slightly tipsy, and it’s my sober-minded sidekick tonight!

DEMO-VIDEO: https://www.youtube.com/watch?v=2HbrGA65bEQ

Here’s the code from my webcam test that publishes the vision-derived state:

```python
import cv2
import paho.mqtt.client as mqtt
import google.generativeai as genai
import json
import os
import time
from dotenv import load_dotenv
import re

# Load environment variables
load_dotenv()
GEMINI_API_KEY = os.getenv('GEMINI_API_KEY')

# Ensure the API key is present
if not GEMINI_API_KEY:
    print("Error: GEMINI_API_KEY is missing. Please check your environment variables.")
    exit(1)

# Configure the Gemini Pro API key
genai.configure(api_key=GEMINI_API_KEY)

# MQTT Broker settings
MQTT_BROKER = '192.168.0.xxx'
MQTT_PORT = 1883
MQTT_USERNAME = 'bla'
MQTT_PASSWORD = 'bma'

# MQTT Topics
GEMINI_PRESENCE_STATE_TOPIC = 'homeassistant/sensor/geminiPresence/state'
GEMINI_AGE_STATE_TOPIC = 'homeassistant/sensor/geminiAge/state'
GEMINI_MOOD_STATE_TOPIC = 'homeassistant/sensor/geminiMood/state'

# Initialize and connect to the MQTT broker
client = mqtt.Client()
client.username_pw_set(MQTT_USERNAME, MQTT_PASSWORD)
client.connect(MQTT_BROKER, MQTT_PORT, 60)
client.loop_start()  # run the network loop in the background so the connection stays alive

def analyze_frame(frame):
    """Analyze the given frame using the Gemini Pro API."""
    model = genai.GenerativeModel('gemini-pro-vision')
    ret, buffer = cv2.imencode('.jpg', frame)
    if not ret:
        print("Failed to encode image")
        return None, None, None

    image_data = {
        'mime_type': 'image/jpeg',
        'data': buffer.tobytes()
    }
    # Ask Gemini for a strictly formatted reply so the regex extractors below can parse it
    contents = [
        "Your task is to analyse the amount of people you see, calculate the average age, "
        "and describe the mood in the scene. You can only reply with: count_of_people = number, "
        "average_age = number, and mood = Happy,Sad,Neutral,Drunk,Psycho,Crying,High",
        image_data
    ]

    try:
        response = model.generate_content(contents=contents)
        response.resolve()

        if not response.parts:
            feedback = response.prompt_feedback or "No candidates returned."
            print(f"Error or prompt blocked: {feedback}")
            return None, None, None

        # Extract count of people, average age, and mood from response
        count_of_people = extract_count_from_response(response.text)
        average_age = extract_age_from_response(response.text)
        mood = extract_mood_from_response(response.text)
        return count_of_people, average_age, mood
    except Exception as e:
        print(f"Error during API call: {e}")
        return None, None, None

def extract_count_from_response(text):
    # Use regular expression to find the count of people
    match = re.search(r'count_of_people = (\d+)', text)
    if match:
        return int(match.group(1))
    else:
        print("Could not extract count of people")
        return 0

def extract_age_from_response(text):
    # Use regular expression to find the average age
    match = re.search(r'average_age = (\d+)', text)
    if match:
        return int(match.group(1))
    else:
        print("Could not extract average age")
        return 0

def extract_mood_from_response(text):
    # Logic to extract mood from the response
    mood_match = re.search(r'mood = (\w+)', text)
    if mood_match:
        return mood_match.group(1)
    else:
        print("Could not extract mood")
        return "neutral"  # Default mood if not detected

def publish_sensor_states(count_of_people, average_age, mood):
    """Publish the sensor states to the MQTT broker."""
    if count_of_people is None or average_age is None or mood is None:
        print("No valid data to publish")
        return

    presence_state_payload = {"count": count_of_people}
    age_state_payload = {"average_age": average_age}
    mood_state_payload = {"mood": mood}

    print(f"Publishing to MQTT: {presence_state_payload}, {age_state_payload}, {mood_state_payload}")
    client.publish(GEMINI_PRESENCE_STATE_TOPIC, json.dumps(presence_state_payload), qos=0, retain=True)
    client.publish(GEMINI_AGE_STATE_TOPIC, json.dumps(age_state_payload), qos=0, retain=True)
    client.publish(GEMINI_MOOD_STATE_TOPIC, json.dumps(mood_state_payload), qos=0, retain=True)

def capture_and_analyze():
    """Capture video feed from the default camera, analyze it, and update sensor states."""
    cap = cv2.VideoCapture(0)  # 0 for the default webcam

    try:
        while True:
            ret, frame = cap.read()
            if not ret:
                print("Failed to read frame from camera")
                break

            print("Analyzing frame...")
            count_of_people, average_age, mood = analyze_frame(frame)

            if count_of_people is not None and average_age is not None and mood is not None:
                print(f"Detected {count_of_people} people with an average age of {average_age} and mood: {mood}")
                publish_sensor_states(count_of_people, average_age, mood)
            else:
                print("No valid data detected in frame")

            time.sleep(1)  # Sleep time between captures, can be adjusted
    finally:
        cap.release()
        client.loop_stop()
        client.disconnect()
        print("Camera released and MQTT client disconnected")

if __name__ == "__main__":
    capture_and_analyze()
```

All that code needs to be surrounded in three backticks (```) to make sense. Or select it all and press </> in the toolbar.


Thanks, didn’t notice.

I don’t think I understand the purpose and intent of this script/integration. Are you just using it to read sensor values in MQTT in sentence format? Or perhaps asking a question about a video feed and having Gemini vision answer the question - like an advanced object detection algorithm? Please elaborate. Thanks!

The purpose of this script is to harness the capabilities of a camera and Gemini AI to create a dynamic sensor system. Essentially, it turns visual information into actionable data. Here’s a step-by-step explanation of how it operates:

  1. Capture Video Feed: The script connects to a camera (the default webcam in this test script, or a camera’s RTSP stream) and continuously captures live video frames.

  2. Analyze with Gemini AI: Each frame is sent to Gemini’s multimodal vision model (gemini-pro-vision), which can describe and answer questions about the content of the frame in near real time, much like an advanced object detection algorithm.

  3. Extract Data: The AI processes the frame to extract specific information, such as the number of people present, their average age, and the mood depicted in the scene.

  4. Publish to MQTT: The extracted data is then formatted into a structured payload and published to an MQTT broker. This effectively updates the sensor state in a home automation system like Home Assistant.

  5. Home Automation Integration: Home Assistant listens to the MQTT topics and updates the sensor values accordingly, which can then be used in various home automation scenarios.

By using this method, virtually anything that can be visually detected and analyzed by the AI can be converted into a sensor’s data—for instance, identifying whether a room is occupied, estimating crowd size, or gauging the general mood of a space. Are your keys on the table, is the table clean, …
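As a rough illustration of steps 4 and 5 (not part of the script above; the sensor names and broker details are placeholders copied from the test script, and the sensors could just as well be defined manually in configuration.yaml), the matching Home Assistant sensors could be registered via MQTT discovery like this:

```python
import json

import paho.mqtt.client as mqtt

# Placeholder broker details, copied from the test script above
client = mqtt.Client()
client.username_pw_set('bla', 'bma')
client.connect('192.168.0.xxx', 1883, 60)

# One discovery config per sensor; state topics match the ones the script publishes to
sensors = {
    'geminiPresence': ('Gemini Presence', '{{ value_json.count }}'),
    'geminiAge': ('Gemini Average Age', '{{ value_json.average_age }}'),
    'geminiMood': ('Gemini Mood', '{{ value_json.mood }}'),
}

for object_id, (name, template) in sensors.items():
    config = {
        'name': name,
        'state_topic': f'homeassistant/sensor/{object_id}/state',
        'value_template': template,
        'unique_id': f'gemini_{object_id}',
    }
    # Retained config message so Home Assistant rediscovers the sensor after a restart
    client.publish(f'homeassistant/sensor/{object_id}/config', json.dumps(config), retain=True)
```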


Another example of output that turns into MQTT sensors, this time from the following prompt:

                    "You are a home assistant sensor, you're only output can be in the format like Bed_is_made=no/yes, room_clean=no/yes, toys_on_bed=number,light_is_on=no/yes, door_top_left_corner_open=no/yes, location_of_pikachu/lion/elephant_toy=bed/floor/unavailable/..., Aproximate_lux_value=number",

Analyzing frame...
 Bed_is_made=no, room_clean=no, toys_on_bed=3, light_is_on=yes, door_top_left_corner_open=no, location_of_pikachu/lion/elephant_toy=bed/floor/rocking_chair, Aproximate_lux_value=15

All from one frame.
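This isn’t in the original post, but as a sketch, turning a key=value line like that into per-key MQTT states could look something like this (the topic layout is just an assumption):

```python
import json
import re

def parse_sensor_line(text):
    """Split 'key=value' pairs like the example output above into a dict."""
    return {key.strip(): value.strip()
            for key, value in re.findall(r'([\w/]+)\s*=\s*([^,]+)', text)}

line = ("Bed_is_made=no, room_clean=no, toys_on_bed=3, light_is_on=yes, "
        "door_top_left_corner_open=no, "
        "location_of_pikachu/lion/elephant_toy=bed/floor/rocking_chair, "
        "Aproximate_lux_value=15")

for key, value in parse_sensor_line(line).items():
    # One state topic per detected key; swap print() for client.publish() in the real script
    topic = f"homeassistant/sensor/{key.replace('/', '_').lower()}/state"
    print(topic, json.dumps({'state': value}))
```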


Transmit images from inside one’s home to Google’s cloud-based image analyzer? :thinking:


Now now, there is only so much we can do locally. But imagine, for example, that you have a camera that detects motion at the side or back of your house. You could take a single snapshot and ask Google’s cloud-based image analyser what the probability is that the people in the image are likely to behave in a threatening manner.

Then you could get a notification on your phone (or TV) if Google thinks that it’s not just your neighbours or friends coming to visit you.
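Untested, but reusing the pattern from the script earlier in the thread, that single-snapshot check might look roughly like this (the prompt wording and the 0-100 scale are invented for the example, and it assumes genai has already been configured with an API key):

```python
import re

import cv2
import google.generativeai as genai

def threat_probability(frame):
    """Ask gemini-pro-vision for a rough 0-100 threat estimate for one snapshot."""
    ret, buffer = cv2.imencode('.jpg', frame)
    if not ret:
        return None
    model = genai.GenerativeModel('gemini-pro-vision')
    response = model.generate_content([
        "Reply only with threat_probability = number (0-100), estimating how likely "
        "the people in this image are to behave in a threatening manner.",
        {'mime_type': 'image/jpeg', 'data': buffer.tobytes()},
    ])
    response.resolve()
    match = re.search(r'threat_probability = (\d+)', response.text)
    return int(match.group(1)) if match else None
```

A Home Assistant automation could then fire the phone or TV notification whenever that number crosses a threshold.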


I would be impressed if Gemini Pro could infer that, with high accuracy, based on analyzing a single image.


Kinda want to test…

API data is kept private.

Yeah, but people have an irrational fear of Google.
You will often hear people say silly things like “they sell my data to advertisers”, which is of course nonsense, because the data Google knows about you is the most valuable thing they have; if they sold it, they would no longer be the sole keeper of that data, and the data would lose value.

You can use OpenAI; it’s better, but this is free, so it’s easy to test with.
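For anyone who wants to try OpenAI instead of Gemini, here is a rough, untested sketch of the vision call (the model name is only an example, and it assumes OPENAI_API_KEY is set in the environment):

```python
import base64

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_with_openai(jpeg_bytes, prompt):
    """Send one JPEG frame plus a prompt to an OpenAI vision model and return the text reply."""
    b64 = base64.b64encode(jpeg_bytes).decode()
    response = openai_client.chat.completions.create(
        model='gpt-4-vision-preview',  # example model name
        messages=[{
            'role': 'user',
            'content': [
                {'type': 'text', 'text': prompt},
                {'type': 'image_url',
                 'image_url': {'url': f'data:image/jpeg;base64,{b64}'}},
            ],
        }],
        max_tokens=100,
    )
    return response.choices[0].message.content
```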

Would you happen to know Google’s retention policy?

Given that it learns by having access to a large amount of data, I imagine whatever one submits becomes part of this pool of data and (just a guess) is kept indefinitely.

Similar to how the photos one stores in the cloud using Google Photos become raw material for refining their image analysis tools. The difference is that you decide how long to keep those photos there, whereas it’s unclear (to me) for photos submitted for use by Gemini.

Google says it does not train its generative AI models on inputs or outputs from its cloud customers.

It’s bad practice to train on user-generated data. When it was training on user-generated data ingested from the entire internet, that was fine - because it’s a cross section.

But obviously if you provide an image or video analyser, then people are going to be interested in submitting only a very small category of images to it so they can test it out, or make use of it somehow. It won’t be getting the broad selection of images that it ingested from across the internet. Thus it becomes a risk that the AI can become biased towards certain types of images or objects in those images.

As an example - the AI woman on Twitter, I think her name is Janelle - discovered that early AI, before OpenAI exploded onto the scene, decided that if presented with green grass, then it MUST have sheep on it. So when provided with a picture of a kitchen, it correctly identified it as a kitchen, but if the kitchen had a green carpet, then it would tell you it was a kitchen with sheep in it, even though there were no sheep in it. Because it had become biased that green = grass, and grass = sheep.

The first line of the link you posted states they may use it to improve Gemini Pro.

The article goes on to say the data will be anonymized.

Did you test?

Yeah, but it’s probably not used for training; it’s probably used for determining what things people are asking that cause the AI model to output potentially harmful or illegal information. Pretty much like ChatGPT - I was giving that a test last night, and it’s quite difficult now to get it to spit out something potentially harmful, and even stuff that has swear words in it now has a disclaimer that it might be breaking content policy.