How Bayes Sensors work, from a Statistics Professor (with working Google Sheets!)

I think I would like to see a bit of groundswell behind this proposal: the negated state should have consequences for the calculation. Implementing this would encourage my Bayes sensors to “turn off” / “dial down” the probability when their contributing inputs are false.

I have a PR:

I just need to improve the tests; unfortunately, I am a bit pressed for time at the moment.

Edit: PR is reviewer approved and awaiting merge.
Edit2: @teskanoo the PR is now merged, not sure what release it will be in. Hopefully 2022.10
Edit3: Released in 2022.10 - this is a breaking change but I also included some repairs which should detect and notify for most broken configs.

This explanation is top! I wish all teachers were like this. Keep up the good work, mate.

New spreadsheet for the 2022.10 update
For generating configs (only works for entities that are binary at the moment)

Thank you for creating the updated Bayesian Tester spreadsheet. It’s extremely helpful.

I’ve created a residents-asleep sensor in my setup, which works well. It triggers on with very high accuracy. But I’m struggling to implement a way for the sensor to turn off when residents are awake. At the moment, the sensor turns off at the point when the TOD sensor turns off.

I have a few things that I observe on a regular basis but don’t know how to implement:

  • My phone alarm or Google Lenovo clock alarm triggers once in the morning. Sometimes it snoozes, but once the alarm has been stopped I’m up and awake. Both my phone and Google clock are available in HA via the companion app and Google Assistant integrations.
  • There’s usually motion in the bedroom followed by the kitchen, within about 5 minutes of each other.
  • I usually play the radio on the kitchen Google speaker whilst making breakfast.

Here is my current configuration:
- platform: tod
  name: Night Time Sleeping Hours
  after: "20:00"
  before: "07:00"


- platform: "bayesian"
  name: "Residents Asleep"
  unique_id: "4ff91613-8a74-4500-b00d-4ce4ab85a28a"
  prior: 0.33
  probability_threshold: 0.9
  observations:
    - platform: "state"
      entity_id: media_player.living_room_tv 
      prob_given_true: 0.88
      prob_given_false: 0.69
      to_state: 'off'

    - platform: "state"
      entity_id: group.all_lights
      prob_given_true: 0.97
      prob_given_false: 0.75
      to_state: 'off'

    - platform: "state"
      entity_id: binary_sensor.house_occupied_residents
      prob_given_true: 0.99
      prob_given_false: 0.81
      to_state: 'on'

    - platform: "state"
      entity_id: binary_sensor.night_time_sleeping_hours
      prob_given_true: 0.88
      prob_given_false: 0.01
      to_state: 'on'
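
(As a quick back-of-the-envelope sketch in Python, chaining the sequential updates for this config when every observation is in its to_state, with each step's posterior feeding the next step as its prior, shows how dominant the TOD observation is; the helper below is my own sketch, not the integration's code.)

def update(p, p_true, p_false):
    # One Bayesian update step for an observation in its to_state
    return p_true * p / (p_true * p + p_false * (1 - p))

p = 0.33                                 # prior
for p_true, p_false in [(0.88, 0.69),    # TV off
                        (0.97, 0.75),    # all lights off
                        (0.99, 0.81),    # residents home
                        (0.88, 0.01)]:   # within sleeping hours (TOD)
    p = update(p, p_true, p_false)
print(round(p, 3))  # ~0.989; without the TOD step it is only ~0.498,
                    # which is why the sensor rides the TOD entity so closely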

For these two I would use templates to detect whether you are past your alarm time (this depends on what happens to the state of the alarm sensor once the alarm has finished, but you could always store that in a helper to get around it). For example:

      value_template: >-
        {% if as_timestamp(now()) > as_timestamp(states('sensor.google_speaker_alarms')) %}
           true
        {% else %}
           false
        {% endif %}
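
Wrapped into a full observation it might look like the sketch below (the probabilities are placeholders to tune to your own habits, not measured values):

    - platform: "template" # hypothetical: are we past this morning's alarm time?
      prob_given_true: 0.05 # placeholder: rarely past the alarm while still asleep
      prob_given_false: 0.6 # placeholder: often past it while awake
      value_template: >-
        {% if as_timestamp(now()) > as_timestamp(states('sensor.google_speaker_alarms')) %}
           true
        {% else %}
           false
        {% endif %}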

I personally use this one quite a lot

    - platform: "template" # is harvsg home with a charging phone
      prob_given_true: 0.7 # when everyone is asleep my phone will be charging and I will be home, but 30% of the time I am away from home.
      prob_given_false: 0.1 # sometimes I do a top-up charge at home.
      value_template: >-
        {% if is_state('person.harvsg', 'home')
           and is_state('sensor.phone_charger_type', 'ac') %}
           true
        {% else %}
           false
        {% endif %}

As a general rule: if you want instantaneous moments to affect the state of a Bayesian sensor, you need to find a way to make that instant last longer. Options include using an automation to change the state of a helper (e.g. sensor.hallway_then_kitchen_motion_helper) and then another automation that resets it to off when you go to bed, or using the {{ as_timestamp(now()) - as_timestamp(states.sensor.hallway_motion.last_changed) < 300 }} technique.
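
A hypothetical observation using that technique (the entity name and probabilities are illustrative only):

    - platform: "template" # hypothetical: hallway motion within the last 5 minutes
      prob_given_true: 0.05 # placeholder: motion is rare while everyone is asleep
      prob_given_false: 0.4 # placeholder: tune from your own history
      value_template: >-
        {{ (as_timestamp(now()) - as_timestamp(states.sensor.hallway_motion.last_changed)) < 300 }}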

Does anyone use Grafana, InfluxDB, or history to help guide their Bayesian sensor setups?

I track a number of sensors in Influxdb and Grafana. And with so much historical data at hand I wonder if there’s a way to make good use of it to inform the probability of certain observations?

I’m not sure how to go about it in an effective way.

For example, I’d like to use motion sensors in the house as an additional observation in an ‘asleep’ sensor. I already have an asleep Bayesian sensor which works well, but adding the motion sensors would bring another level of accuracy. Usually there is little or no motion whilst I’m asleep.

My thinking is: is there a way to review historical data (Influx, Grafana, or history) between 00:00 and 07:00 for the past 90 days, and then calculate the average number of motion events? Or some other metric that could be used to find a correlation or trend that could be turned into an observation in a Bayes sensor?

I just thought about the same thing. It should be possible to get data values for any sensor to use with Bayes, based on history. For example, a “Home Occupied” sensor: I currently have a simple input_boolean that gets triggered based on device trackers and motion trackers. So I can find correlations between this input boolean, which I know to be working reliably, and any other sensor, and a script should be able to calculate the more probable value for any sensor whenever the Home Occupied sensor is true or false.

Sorry, I totally butchered that description :smiley: The point is, I started to write a script that can get history info from Hass. Here’s what I have so far; it might be a good starting point for anyone who wants to do the same. At the moment it doesn’t do much, but it shows you how to get historical data from the Hass API. You can run it on any machine that has Python; it does not have to be the Hass instance. The only dependency is the requests library (pip install requests).

It’s a low-priority project for me, so I may or may not post any updates for this.

TOKEN = "XXXXXXXXXXXXXX"
ENTITY_ID = "switch.humidifier_plug"
BAYES_REFERENCE_ENTITY_ID = "input_boolean.home_occupied"
HASS_API_URL = "http://192.168.1.20:8123/api"

import requests
from datetime import datetime, timedelta

def last_day_of_month(any_day):
    next_month = any_day.replace(day=28) + timedelta(days=4)
    return next_month - timedelta(days=next_month.day)


def hass_date_to_datetime(s):
    # Hass timestamps may or may not include microseconds
    try:
        return datetime.strptime(s, r"%Y-%m-%dT%H:%M:%S.%f+00:00")
    except ValueError:
        return datetime.strptime(s, r"%Y-%m-%dT%H:%M:%S+00:00")

dt_fmt = r"%Y-%m-%d-%H-%M-%S-%f"

month_start = datetime.now()
month_start = datetime(year=month_start.year, month=month_start.month, day=1)
month_end = last_day_of_month(month_start)



headers = {'Authorization': f'Bearer {TOKEN}',
           'Content-Type': 'application/json'}

url = "{HASS_API_URL}/history/period/"


reference_bayes_states = requests.get(url + f"{month_start.year}-{month_start.month}-1T00:00:00+00:00?end_time={month_end.year}-{month_end.month}-{month_end.day}T00%3A00%3A00%2B00%3A00&filter_entity_id={BAYES_REFERENCE_ENTITY_ID}",
                        headers=headers)

target_entity_states = requests.get(url + f"{month_start.year}-{month_start.month}-1T00:00:00+00:00?end_time={month_end.year}-{month_end.month}-{month_end.day}T00%3A00%3A00%2B00%3A00&filter_entity_id={ENTITY_ID}",
                        headers=headers)


for state in reference_bayes_states.json()[0]:
    if state['state'] != "unknown":
        print(state['state'])

for state in target_entity_states.json()[0]:
    if state['state'] != "unknown":
        print(state['state'])

Ok, that took less time and effort than I thought. I think it kind of works, but I didn’t yet have time to think of a smart algorithm, so it’s just brute-forcing its way through states. It makes two requests to the Hass API, but then it iterates over every second between the dates you specify and checks the states of two entities: the target one, which you want to add to the Bayesian sensor, and the reference one, which tells it what state the Bayes sensor should be in (the “Home Occupied” input based on device_trackers from the example above).

It is SLOW but it seems to work. Working prototype first, optimization later :smiley:

TOKEN = "XXXXX"
ENTITY_ID = "switch.humidifier_plug"
BAYES_REFERENCE_ENTITY_ID = "input_boolean.home_occupied"
HASS_API_URL = "http://192.168.1.20:8123/api"
START_TIME = "2023.01.15 10:00"
END_TIME = "2023.01.15 16:00"
TIMEZONE_OFFSET = 0  # Timezone offset from GMT for your local time. Positive or negative number. For example if your timezone is GMT+2 - use 2 here. If it's GMT-4 then use -4.

from datetime import datetime, timedelta
import requests

def last_day_of_month(any_day):
    next_month = any_day.replace(day=28) + timedelta(days=4)
    return next_month - timedelta(days=next_month.day)


def hass_date_to_datetime(s):
    # Hass timestamps may or may not include microseconds
    try:
        return datetime.strptime(s, r"%Y-%m-%dT%H:%M:%S.%f+00:00") + timedelta(hours=TIMEZONE_OFFSET)
    except ValueError:
        return datetime.strptime(s, r"%Y-%m-%dT%H:%M:%S+00:00") + timedelta(hours=TIMEZONE_OFFSET)

def human_time_to_datetime(s):
    return datetime.strptime(s, r"%Y.%m.%d %H:%M")

dt_fmt = r"%Y-%m-%d-%H-%M-%S-%f"

# Note: the history requests below always cover the current calendar month,
# so START_TIME and END_TIME need to fall inside it.
month_start = datetime.now()
month_start = datetime(year=month_start.year, month=month_start.month, day=1)
month_end = last_day_of_month(month_start)

START_TIME = human_time_to_datetime(START_TIME)
END_TIME = human_time_to_datetime(END_TIME)

headers = {'Authorization': f'Bearer {TOKEN}',
           'Content-Type': 'application/json'}

url = f"{HASS_API_URL}/history/period/"


reference_bayes_states = requests.get(url + f"{month_start.year}-{month_start.month}-1T00:00:00+00:00?end_time={month_end.year}-{month_end.month}-{month_end.day}T00%3A00%3A00%2B00%3A00&filter_entity_id={BAYES_REFERENCE_ENTITY_ID}",
                        headers=headers).json()[0]

target_entity_states = requests.get(url + f"{month_start.year}-{month_start.month}-1T00:00:00+00:00?end_time={month_end.year}-{month_end.month}-{month_end.day}T00%3A00%3A00%2B00%3A00&filter_entity_id={ENTITY_ID}",
                        headers=headers).json()[0]


def state_at_time(states, dt):
    # States come back in chronological order; return the last state
    # whose last_changed is at or before dt.
    start_state = None
    for _state in states:
        if hass_date_to_datetime(_state['last_changed']) <= dt:
            start_state = _state
        else:
            break
    if start_state is None:
        return "OUT OF RANGE"
    return start_state['state']


# Now we can either go second-by-second between the two dates and check state
# data, brute-forcing it... or we can go over target_entity_states and calculate
# ranges between them. Brute-forcing is slow but more true and reliable.

data = {}

# This is the SLOOOOOOOOOOOOOOW part
seconds = (END_TIME-START_TIME).total_seconds()
for second in range(int(seconds)):
    dt = START_TIME + timedelta(seconds=second)
    if second % 100 == 0:
        print(dt)

    target_state = state_at_time(target_entity_states, dt)
    reference_state = state_at_time(reference_bayes_states, dt)
    if reference_state not in data:
        data[reference_state] = {"seconds": 0}
    if target_state not in data[reference_state]:
        data[reference_state][target_state] = 0
    data[reference_state]['seconds'] += 1
    data[reference_state][target_state] += 1

print(data)
for reference_state, _d in data.items():
    reference_seconds = _d.pop("seconds")
    for target_state, _dd in _d.items():
        print(f"{target_state} while reference is {reference_state}: {_dd/reference_seconds}")

Example output:

{'on': {'seconds': 18143, 'off': 17077, 'on': 600, 'unavailable': 466}, 'off': {'seconds': 3457, 'off': 3457}}
off while reference is on: 0.9412445571294714
on while reference is on: 0.033070605743261865
unavailable while reference is on: 0.025684837127266713
off while reference is off: 1.0

In my case the “humidifier plug” is actually a “coffee maker plug” right now; it was repurposed but I didn’t update its entity_id. So from this we can say that between START_TIME = “2023.01.15 10:00” and END_TIME = “2023.01.15 16:00”, while we were home the coffee maker was on 0.033 of the time and off 0.94 of the time. And when we’re not home the coffee maker is off 1.0 of the time (always off; we never use it while nobody is home). Which seems to make sense. We turn it on for 15-30 minutes a day (1-2 brews, each one on a timer that turns it off after 15 minutes so that it won’t evaporate all the coffee if we forget about it).

So, given this information, the Bayes sensor observation should be prob_given_true: 0.033 and prob_given_false: 0.0.
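
As a config sketch (entity name taken from the script above; note that an exact 0.0 makes the observation deterministic, since a single ‘on’ reading would then force the posterior to 1 regardless of everything else, so a small nonzero value is safer):

    - platform: "state"
      entity_id: switch.humidifier_plug # actually the coffee maker
      to_state: "on"
      prob_given_true: 0.033
      prob_given_false: 0.01 # measured 0.0; softened to avoid a deterministic observation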

This is great. Can’t wait to give it a shot and see what patterns / data insights I can discover. Thanks for sharing.

TLDR: Is statistical independence not fully satisfied between all your sensors, and is that contributing to an overestimation of the final posterior?

I’m trying to understand this diachronic form of Bayes, which I believe is what the code and then the spreadsheet express (where the computed posterior is applied to the next stage as the prior), versus a Venn-diagram approach to this.

What I couldn’t figure out is how this would work, given the Venn diagram for each sensor, if the sensors produced the exact same readings (given that they have the same probabilities). In this case the enclosed spaces for each sensor in the Venn diagram would perfectly overlap, confirming everything that the first sensor detected and resulting in P(H | Sensor1) == P(H | Sensor2) == P(H | Sensor1 and Sensor2). However, the diachronic calculations would produce a higher degree of confidence in the result, yet there would be no additional information, since the second sensor produced the exact same results as the first. I just couldn’t reconcile how these two different approaches could produce the same result.

I finally realized that the difference from my Venn-diagram approach (assuming the second sensor produced the same results as the first, based on the exact same probabilities) was that I was ignoring statistical independence. When you assert statistical independence, the Venn-diagram approach produces the same result as the diachronic calculation. However, if the sensors are not independent, then the result should be derated and might not be any better than with just the first sensor, which makes sense, as you are not adding as much information as you might realize.

I then considered by how much the result changes as the degree of dependence changes, by writing an empirical simulation that varies the correlation of the two sensors from 100% to 0%. The output confirms that the conditional probability given the two sensors does indeed scale between what it would be with just one sensor (if they produce the exact same data) and the diachronic value computed with the two sensors. What was interesting (unless I screwed this up) is that it didn’t do so linearly, but as a curve; I’ll attach the picture. In my case I was modeling with a prior of 0.1, P+ of 0.90, and P- of 0.1. In that case one sensor should give 0.5 and two sensors (if independent) should give 0.9, and you can see from the graph that it does go between those two values, but non-linearly.
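
A minimal reconstruction of that kind of simulation (my own sketch, not the original code; rho is the probability that sensor 2 simply copies sensor 1 rather than firing independently):

import random

PRIOR, P_T, P_F = 0.1, 0.9, 0.1  # prior, P(fire | H), P(fire | not H)

def posterior_given_both(rho, trials=500_000):
    # Estimate P(H | both sensors fired) when sensor 2 copies sensor 1
    # with probability rho and otherwise fires independently.
    hits = both = 0
    for _ in range(trials):
        h = random.random() < PRIOR
        p = P_T if h else P_F
        s1 = random.random() < p
        s2 = s1 if random.random() < rho else random.random() < p
        if s1 and s2:
            both += 1
            hits += h
    return hits / both

for rho in (1.0, 0.75, 0.5, 0.25, 0.0):
    print(f"correlation {rho:.2f}: P(H | s1, s2) = {posterior_given_both(rho):.3f}")
# runs from ~0.5 (fully dependent) up to ~0.9 (independent), non-linearly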

This is interesting because perhaps this is an area that is contributing to error in this approach for people? That is, the spreadsheet computes a degree of confidence, yet the real-life confidence might actually be lower if the two sensors don’t satisfy this kind of independence. People were complaining that they had a hard time getting this to work, and I wonder if this has been considered as an error factor.

I think this sensor may have been broken, or I’m completely missing something. (Running 2024.5.3)

My presence detector has never worked well, and I finally decided to spend some time on making it work. I happened across this thread, and duplicated the results in the initial post (many thanks for clarifying the statistics of this). I created my own sensor, and it is reporting results that shouldn’t be possible based on these calculations.

[Screenshot: spreadsheet calculation from the initial post, 2024-05-14]

When I implement these numbers, the sensor reports a probability of 0.87 (instead of the expected 0.95). If all are 0, the sensor should report 0.9, I believe (the prior should just cascade through to the final probability, as there is “no new information”). But if just the “wife home” sensor goes to 0, the sensor reports 0.12, which should be impossible.

Anyone else see behavior that doesn’t seem correct or do I have some issue that I’m just not seeing? Here is my sensor, for the record:

  - platform: bayesian
    prior: 0.9
    name: 'Family Home Bayesian'
    probability_threshold: 0.95
    observations:
      - entity_id: 'binary_sensor.me_home'
        prob_given_true: 0.99  
        prob_given_false: 0.67
        platform: 'state'
        to_state: 'on'
      - entity_id: 'binary_sensor.wife_home'
        prob_given_true: 0.99  
        prob_given_false: 0.67
        platform: 'state'
        to_state: 'on'
      - entity_id: 'binary_sensor.kid_home'
        prob_given_true: 0.95  
        prob_given_false: 0.8 
        platform: 'state'
        to_state: 'on'

Don’t read anything into the values; I am just trying to get this thing understood before actually implementing it. Right now either the sensor is broken or my ability to follow logic is. How could this sensor possibly report a probability of 0.12 in any condition?
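
For what it’s worth: since the 2022.10 change discussed earlier in this thread, an observation that is not in its to_state also updates the posterior, using the complementary probabilities. A rough sketch of what that does to the config above (my own reconstruction of the update rule, not the integration’s code):

def update(p, p_true, p_false, observed):
    # An observation NOT in its to_state applies with the complementary
    # probabilities (the post-2022.10 "negated state" behaviour)
    if not observed:
        p_true, p_false = 1 - p_true, 1 - p_false
    return p_true * p / (p_true * p + p_false * (1 - p))

p = 0.9                           # prior from the config above
p = update(p, 0.99, 0.67, True)   # me_home: on
p = update(p, 0.99, 0.67, False)  # wife_home: off
p = update(p, 0.95, 0.80, True)   # kid_home: on
print(round(p, 2))                # ~0.32: one 'off' drags it far below the prior

That is not the exact 0.12 reported (the result also depends on any unknown or unavailable entities), but it shows why a posterior well below the prior is possible once negated states count.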

Inkblotadmirer, you say “the prior should just cascade through to the final probability as there is ‘no new information’”. This is something that I’m trying to reconcile as well: the treatment of the prior in the spreadsheet and the code versus how I understand a 2-feature naive Bayes classifier should work. Asking ChatGPT “give me the formula for naive bayes classifier with 2 features”:
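
Presumably it gives the standard form (reconstructing here from the description below, since the screenshot did not survive):

P(H | F1, F2) = P(H) · P(F1 | H) · P(F2 | H) / (P(F1) · P(F2))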

Unless I’m not reading that formula correctly, it doesn’t have the prior propagating through as the spreadsheet and code seem to do. Instead it just takes the one starting/initial prior and multiplies it by the product of the P(Feature_i | hypothesis), divided by the product of all the P(Feature_i)s, resulting in a slightly different answer.

Ok, I should have taken one more step with ChatGPT, but it answered the discrepancy between the two approaches:

(Did I say how much I love ChatGPT?)

Got to say, the term Bayesian will have a whole new resonance from now on!

So funny. I noticed that also, per recent events. I just wonder what the back story is behind the naming of the yacht.

The owner developed data analysis software.

Too bad he didn’t compute P(sink | overconfidence in the design, non diligence of the crew, freak weather). (too soon?)

So after reading up on this stuff, there appear to be two basic Bayes schemes. As this code implements it, it’s called sequential Bayes, and its results can differ from the non-sequential scheme; however, it’s supposed to help account for non-independence of the features, which is what I was concerned with above.
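
For what it’s worth, a quick check (my own sketch) suggests the sequential and joint forms agree when the features really are conditionally independent, provided the joint form normalises over both hypotheses; the difference from the ChatGPT formula above comes from its dividing by the marginal P(F1) · P(F2) instead:

def update(p, p_true, p_false):
    # One sequential update step
    return p_true * p / (p_true * p + p_false * (1 - p))

prior = 0.1
sequential = update(update(prior, 0.9, 0.1), 0.9, 0.1)

# Joint update over both features, normalised over H and not-H
joint_true = prior * 0.9 * 0.9
joint_false = (1 - prior) * 0.1 * 0.1
joint = joint_true / (joint_true + joint_false)

print(sequential, joint)  # both 0.9: identical for independent features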

Probably. My fault for mentioning it.