The Future of the Bayesian Integration - Thoughts wanted

If I understood correctly negative observations are taken into account to determine the propability since 2022.10? And the prop_given_true when observation is False is 1 minus prop_given_true when observation is true?

In that case I have some doubts if this is correct. Of course statistically this make sense but in the physical world this can cause some faults.

I noticed this with my Bayesian sleeping sensor which I use. To predict if everyone in my house is sleeping one of the observations I use is a template to check the time delta since the last time a motion sensor was updated. When this delta is less than 30min it is very unlikely that everyone is sleeping (Prop_given_true= 0.05). But when the delta becomes 31 min the observation becomes False and results in a very high propability that everyone is sleeping which is not correct.

Another generic example:
If I see a man with an umbrella walking on the street it is very likely that it is raining, P is for instance 0.99. But if I don’t see that man with an umbrella I can’t say that the probability is very low that it is raining because not every man is carrying an umbrella when it’s raining. Of course it says something but definitely not 1-0.99 = 0.01

Maybe it is an idea that the user can set all 4 propabilities (true positive, true negative, false positive and false negative)

If I understood correctly negative observations are taken into account to determine the propability since 2022.10?

This is correct

And the prop_given_true when observation is False is 1 minus prop_given_true when observation is true?

This is also correct, prob_given_true is the 'probability of {entity state} == {to_state} given {bayesain sensor = ‘on’} i.e ‘probability of motion in last 30mins given everyone sleeping’. This is different from ‘probability of everyone sleeping given motion in the last 30mins’ (This seems to be the number you are using). This may seem like a subtle and irrelevant distinction but is actually what defines Bayesian statistics and gives it its magic sauce.

As such it it is a mathematical truth (for binary observations) that the probability of observation given outcome + the probability of not observation given outcome = 1. Because for any outcome the only the observation can only be observed or not observed.

For completeness prob_given_false is the 'probability of {entity state} == {to_state} given {bayesain sensor = ‘off’}

Both the instances you describe can be corrected by configuring the probabilities correctly.

binary_sensor:
  - platform: bayesian
    prior: 0.35
    name: "everyone_sleeping"
    observations:
      - entity_id: binary_sensor.montion_in_las_30s
        prob_given_true: 0.05 # When everyone is sleeping I am very unlikely to detect motion, so when motion is detected it will have a significant impact on the probability
        prob_given_false: 0.5 # When not everyone is sleeping I expect to detect motion half the time. This means that the absence of motion won't make much difference to the probability, as you desire.
        platform: "state"
        to_state: "on"

When this delta is less than 30min it is very unlikely that everyone is sleeping (Prop_given_true= 0.05)

As described above here you are confusing the probability of the data given the outcome with the probability of the outcome given the data.

If I see a man with an umbrella walking on the street it is very likely that it is raining, P is for instance 0.99. But if I don’t see that man with an umbrella I can’t say that the probability is very low that it is raining because not every man is carrying an umbrella when it’s raining.

You are exactly right, however we are not interested in ‘If I see a man with an umbrella walking on the street it is very likely that it is raining’ we are interested in ‘If it is raining, some people will have an umbrella’ and ‘If it is not raining I am very unlikely to see an umbrella’.
In bayesian stats it is the chance of the observation/entitiy-state given the event/situation/context/bayesian-sensor. AKA the chance of an umbrella given it is raining. From that information, and a prior, we work out the probability it is raining.

prob_umbrella_given_raining = 0.6 # When it is raining a slight majority (60%) of people will have umbrellas 
prob_umbrella_given_not_raining = 0.01 # When it is not raining it would be very odd 1% to see an umberalla  

You can read more here: https://www.home-assistant.io/integrations/bayesian/#theory

Of course it says something but definitely not 1-0.99 = 0.01

This is exactly one of the behaviours that was fixed in 2022.10 and why prob_given_false is now required

Maybe it is an idea that the user can set all 4 propabilities (true positive, true negative, false positive and false negative)

These are taken into account when prob_given_true and prob_given_false are set correctly. In fact the ‘in_bed’ example in the official documentation walks through a motion sensor that gives a false negative half the time someone is in bed.

1 Like

Thanks for your extensive reply. appreciate it :+1: I have indeed misinterpreted the propabilities of the data. I am going to rework my sensors!

1 Like

Yes, I saw that. It is actually also binary, even in buckets. It is easy to realize it by template returning bool if the value is in specified range, which I actually do in some cases already. I’m talking about analog (linear, exponential, …) possibility of control of the values.

Ok, do do that the use would need to create some probability function or distribution over the possible range of values. I’m not entirely sure how that would be achieved.

If you can think of (1) some reasonably common use cases where this would have a definite benefit over buckets (2) could be done in a way that a user without a background in stats could do and (3) how to implement this in code I might consider it.

But I have to say that implementing continuous probability distributions in bayesian functions stretch the limits of my understanding of Bayesian stats.

In Bayesian - Accept more than 1 sate for numeric entities by HarvsG · Pull Request #80268 · home-assistant/core · GitHub (not yet implemented) there are a series of buckets. The prior will be updated by whichever bucket the observation is in, if there are no observations then the prior won’t be updated. So this isn’t binary.

From my perspective, Mr. Bayes, in his times, didn’t expect his theorem be used continuously for automations, calculating new results in milliseconds :slight_smile: My proposal is actually like that - using his “static” formula with new input values whenever they change, even slightly, even every milli. It is not breaking his theorem IMHO.

The bucket solution is evaluating some range to true/false if I understand it well. If I’m out of that range, it is not observed at all. If I’d like to evaluate f.e. range of outdoor temperature for heating/cooling (in my country it can be from -20°C to +35°C), I’d expect linear probability change. Of course it can be realized by a lot of other ways, but I see this clean and simple. Just using a template for the threshold which returns a float between f.e. 0.3 and 0.8. Let the template output on the HA user, just accept template for threshold, given_true, given_false. Out of the range values are exceptions raising an error/unknown state.

Anyway, it is definitely up to you, you are the “pro” and author :slight_smile:

P.S. just an example of one of my Bayes sensor using “buckets” (beside other observed sensors):

      - platform: template
        value_template: "{{ now().time() <= strptime(['0845', '0655'][states('binary_sensor.workday_today') | bool(true)], '%H%M').time() }}"
        prob_given_true: 0.960
        prob_given_false: 0.050
      - platform: template
        value_template: "{{ strptime(['0845', '0655'][states('binary_sensor.workday_today') | bool(true)], '%H%M').time() < now().time() <= strptime('2050', '%H%M').time() }}"
        prob_given_true: 0.250
        prob_given_false: 0.700
      - platform: template
        value_template: "{{ strptime('2050', '%H%M').time() < now().time() <= strptime(['2300', '2150'][states('binary_sensor.workday_tomorrow') | bool(true)], '%H%M').time() }}"
        prob_given_true: 0.600
        prob_given_false: 0.250
      - platform: template
        value_template: "{{ strptime(['2300', '2150'][states('binary_sensor.workday_tomorrow') | bool(true)], '%H%M').time() < now().time() }}"
        prob_given_true: 0.860
        prob_given_false: 0.120

This I’d like to avoid by just adjusting the threshold value.

1 Like

I was inspired by @HarvsG spreadsheet
and created a python notebook that helps jumpstart my template automatically.

Problem

My cats set my motion sensors off.

Well not all of them. A handful of my motion sensors can be placed around so that it only picks up human beings.

The trouble is,there are places that I want motion sensors not to look like eyesores (the kitchen) or creepy (in a bathroom). Low placement under cabinets is great for discrete placement, but picks up my wandering kittens.

Solution

I intuitively set up a Bayesian sensor that:

  • watches for well placed motion in adjacent rooms to the kitchen
  • watches for light patterns that I believe indicate humans and not felines

It worked ok but needed some tuning. @HarvsG spreadsheet is a lifesaver but gathering the hours per day was a chore.

Over Engineering

So I wrote this python notebook which will spit out a starting point for your own bayes sensor.

Here are the steps:

  • Make sure you have History Explorer installed

  • Create a permanent or temporary dashboard with all of the sensors you might believe could help you. Something like this (yellow highlights are my cats in the kitchen)

  • Add one more thing to that dashboard: a “truth sensor” (or manually edit: see later) that reflect events (like “real” motion vs “kitty” motion)

  • Export 4-5 days of statistics in the History Explorer using its “Export as CSV”
    export as csv

  • If you don’t have a “truth sensor”, edit that csv in some tool and add a new fake named sensor, manually putting in on/off events that mark the start and end of reality. For example, I copied my motion sensor and deleted rows corresponding to “kitty motion”.

  • Save the Jupyter notebook code at the end here with the file extension .ipynb.

  • Upload this notebook to the free Jupyter Lab.

  • Upload the csv you exported

  • edit the “CONFIGURATION” section at the top

    • set your csv filename
    • set the name of your “truth” column in the csv
    • set how many correlated features (sensor + sensor state) combinations that you would like this to discover
  • Run the notebook by clicking on the triple arrow in the top toolbar

  • See example output at the bottom

The future
It would be great if I could turn this into a plugin (it would require numpy, and some bayesian libraries like scikit running on the homeassistant computing power) which would let you:

  • select some schedul representing ground truth for your goal (ie. I was asleep these hours, or this is REAL motion)
  • select a list of sensors you think my be interesting
  • click ‘go’ and have it give you a suggestion.

I am not an expert in this ecosystem so this is all I’ll have for now.

Hopefully this is useful to someone.

Example Output (fully automated)

Now this output needs some hand tweaks. For example, while the kitchen motion “on” is a great indicator of motion, there is no need to ALSO have the kitchen motion “off” observation, so I’ll just delete that.

# binary_sensor.kitchen_occupancy is on 23.93% of the time
- platform: bayesian
  name: Probably binary_sensor.kitchen_occupancy # CHANGE ME
  prior: 0.24
  probability_threshold: 0.8 # EXPERIMENT WITH ME
  observations:
    # When 'binary_sensor.den_occupancy' is 'on' ...
    - platform: state
      entity_id: binary_sensor.den_occupancy
      to_state: on
      prob_given_true: 0.30 # (1.7 hours per day)
      prob_given_false: 0.18 # (3.3 hours per day)
    # When 'binary_sensor.front_yard_occupancy' is 'on' ...
    - platform: state
      entity_id: binary_sensor.front_yard_occupancy
      to_state: on
      prob_given_true: 0.20 # (1.1 hours per day)
      prob_given_false: 0.06 # (1.1 hours per day)
    # When 'binary_sensor.hall_occupancy' is 'off' ...
    - platform: state
      entity_id: binary_sensor.hall_occupancy
      to_state: off
      prob_given_true: 0.59 # (3.4 hours per day)
      prob_given_false: 0.93 # (16.9 hours per day)
    # When 'binary_sensor.hall_occupancy' is 'on' ...
    - platform: state
      entity_id: binary_sensor.hall_occupancy
      to_state: on
      prob_given_true: 0.41 # (2.4 hours per day)
      prob_given_false: 0.07 # (1.3 hours per day)
    # When 'binary_sensor.kitchen_occupancy' is 'off' ...
    - platform: state
      entity_id: binary_sensor.kitchen_occupancy
      to_state: off
      prob_given_true: 0.01 # (0.0 hours per day)
      prob_given_false: 0.99 # (18.3 hours per day)
    # When 'binary_sensor.kitchen_occupancy' is 'on' ...
    - platform: state
      entity_id: binary_sensor.kitchen_occupancy
      to_state: on
      prob_given_true: 0.99 # (5.7 hours per day)
      prob_given_false: 0.01 # (0.0 hours per day)
    # When 'binary_sensor.laundry_occupancy' is 'on' ...
    - platform: state
      entity_id: binary_sensor.laundry_occupancy
      to_state: on
      prob_given_true: 0.18 # (1.0 hours per day)
      prob_given_false: 0.06 # (1.1 hours per day)
    # When 'light.dining_room_main_lights' is 'on' ...
    - platform: state
      entity_id: light.dining_room_main_lights
      to_state: on
      prob_given_true: 0.14 # (0.8 hours per day)
      prob_given_false: 0.03 # (0.5 hours per day)

Jupyter code

{
  "metadata": {
    "language_info": {
      "codemirror_mode": {
        "name": "python",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8"
    },
    "kernelspec": {
      "name": "python",
      "display_name": "Python (Pyodide)",
      "language": "python"
    }
  },
  "nbformat_minor": 4,
  "nbformat": 4,
  "cells": [
    {
      "cell_type": "code",
      "source": "import dateutil.parser as dparser\nimport numpy as np\nimport os\nimport pandas\nfrom sklearn.feature_selection import SelectKBest\nfrom sklearn.feature_selection import chi2\nfrom sklearn.naive_bayes import MultinomialNB\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import accuracy_score",
      "metadata": {
        "trusted": true
      },
      "execution_count": 1,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "###########################################\n# CONFIGURATION\n###########################################\nHISTORY_EXPLORER_EXPORT_FILE='entities-exported.csv'\nTRUTH_COLUMN = 'binary_sensor.kitchen_occupancy'\nINCLUDE_TRUTH_COLUMN_AS_OBSERVATION = True\nIGNORE_COLUMNS_WHEN_CHOOSING_FEATURES = [\n    'datetime',\n    'Unnamed: 0'\n]\nTOP_N_FEATURES = 8\nSAVE_DEBUG_OUTPUT = True",
      "metadata": {
        "trusted": true
      },
      "execution_count": 2,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "###########################################\n# read history explorer export\ndf = pandas.read_csv(HISTORY_EXPLORER_EXPORT_FILE)",
      "metadata": {
        "trusted": true
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "###########################################\n# remove junk entries\n###########################################\ndf = df[df.State != \"unavailable\"]\ndf = df[df.State != \"unknown\"]",
      "metadata": {
        "trusted": true
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "###########################################\n# Take the sensor name and pivot it into a\n# column with it's state underneath it\n###########################################\nsensor = None\nunrolled = []\nfor idx, row in df.iterrows():\n  try:\n    dt = dparser.parse(row[0])\n    value = row[1]\n    unrolled.append([sensor, dt, value])\n  except dparser.ParserError:\n    sensor = row[0]\n    continue\n    \nsorted_df = pandas.DataFrame(unrolled, columns=['sensor', 'datetime', 'value']).sort_values(by=['datetime']).reset_index()\npivoted_df = pandas.pivot_table(sorted_df, index='datetime', columns='sensor', values='value', aggfunc='last')",
      "metadata": {
        "trusted": true
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "###########################################\n# Sanity Check the truth column\n###########################################\nif TRUTH_COLUMN not in pivoted_df:\n    raise SystemExit(f\"TRUTH_COLUMN '{TRUTH_COLUMN}' does not exist in your CSV file. Here are the columns:{pivoted_df.columns}\")",
      "metadata": {
        "trusted": true
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "###########################################\n# fill in the data to make sure there\n# is one data point every minute\n###########################################\nprevious_values = [np.nan for x in pivoted_df.columns]\nnew_data = []\nfor dt, row in pivoted_df.iterrows():\n    cur_and_prev = list(zip(list(row), previous_values))\n    filled_row = [x[0] if not pandas.isnull(x[0]) else x[1] for x in cur_and_prev]\n    previous_values = list(filled_row)\n    \n    new_row = [dt]\n    new_row.extend(filled_row)\n    \n    new_data.append(new_row)\n    \ncolumns = ['datetime']\ncolumns.extend(pivoted_df.columns)\nfilled_state_df = pandas.DataFrame(new_data, columns=columns)\n\n# truncate data to the minute\nfilled_state_df['datetime'] = filled_state_df['datetime'].apply(lambda x: x.floor('min'))\nfilled_state_df = filled_state_df.groupby('datetime', group_keys=False).tail(1)\n\n# fill in missing records for every minute of the day\nprevious_row = []\nprevious_dt = None\nnew_data = []\nfor idx, row in filled_state_df.iterrows():\n    dt = row[0]\n    \n    if previous_dt is not None:\n        # create missing times between last row and this one\n        next_dt = previous_dt + pandas.Timedelta(seconds=60)\n        while dt > next_dt:\n            previous_row[0] = next_dt\n            new_data.append(list(previous_row))\n            next_dt = next_dt + pandas.Timedelta(seconds=60)\n    new_data.append(list(row))\n    \n    previous_row = list(row)\n    previous_dt = dt\n    \ndf = pandas.DataFrame(new_data, columns=filled_state_df.columns)",
      "metadata": {
        "trusted": true
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "###########################################\n# drop ignored columns\n###########################################\nif len(IGNORE_COLUMNS_WHEN_CHOOSING_FEATURES) > 0:\n    for ignore in IGNORE_COLUMNS_WHEN_CHOOSING_FEATURES:\n        if ignore in df.columns:\n            df = df.drop(ignore, axis=1)\n        else:\n            print(f\"Ignored column '{ignore}' did not exist in the data\")",
      "metadata": {
        "trusted": true
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "###########################################\n# create columns for each unique state\n###########################################\norig_columns = df.columns\nfor col in orig_columns:\n    \n    if not INCLUDE_TRUTH_COLUMN_AS_OBSERVATION:\n        if col == TRUTH_COLUMN:\n            continue\n\n    if col.startswith(\"sensor.\"):\n        # numeric\n        continue\n        \n    unique_states = df[col].unique()\n    if len(unique_states) > 0:\n        for state in unique_states:\n            new_col = f\"{col}/{state}\"\n            df[new_col] = df[col].copy().apply(lambda x: x == state)\n        if col != TRUTH_COLUMN:\n            df.drop(col, axis=1, inplace=True)\n        \ndf.replace('off', False, inplace=True)\ndf.replace('on', True, inplace=True)",
      "metadata": {
        "trusted": true
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "###########################################\n# create debug file\n###########################################\nif SAVE_DEBUG_OUTPUT:\n    name = \"debug.csv\"\n    if os.path.exists(name):\n      os.remove(name)\n    df.to_csv(name, sep=',')",
      "metadata": {
        "trusted": true
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "###########################################\n# train and find features\n###########################################\nX = df.drop([TRUTH_COLUMN], axis=1)    \ny = df[TRUTH_COLUMN]\n\n# Split the data into training and testing sets\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\n# Feature selection using chi-squared\nchi2_selector = SelectKBest(chi2, k=TOP_N_FEATURES)\nX_train_chi2 = chi2_selector.fit_transform(X_train, y_train)\nX_test_chi2 = chi2_selector.transform(X_test)\n\n\n# Train a Naive Bayes classifier on the selected features\nclf = MultinomialNB()\nclf.fit(X_train_chi2, y_train)\n\n# Make predictions on the test set\ny_pred = clf.predict(X_test_chi2)\n\n# Calculate and print the accuracy of the classifier\naccuracy = accuracy_score(y_test, y_pred)\nprint(f\"# Accuracy with {TOP_N_FEATURES} selected features: {accuracy * 100:.2f}%\")\n\n# Get the indices of the selected features\nselected_feature_indices = chi2_selector.get_support(indices=True)\n\n# Get the names of the selected features\nselected_features = X.columns[selected_feature_indices]",
      "metadata": {
        "trusted": true
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": "###########################################\n# print useful information for you to \n# create your baysian settings\n###########################################\ntotal_hours = len(X)/60\ntarget_on_hours = len(df[df[TRUTH_COLUMN] == 1])/60\ntarget_off_hours = len(df[df[TRUTH_COLUMN] == 0])/60\nprior = target_on_hours/total_hours\n\nprint(f\"# {TRUTH_COLUMN} is on {target_on_hours/total_hours*100:.02f}% of the time\")\nprint(f\"- platform: bayesian\")\nprint(f\"  name: Probably {TRUTH_COLUMN} # CHANGE ME\")\nprint(f\"  prior: {prior:.02f}\")\nprint(f\"  probability_threshold: 0.8 # EXPERIMENT WITH ME\")\nprint(f\"  observations:\")\n\nfor feature in selected_features:\n    hours_given_true = (len(df[(df[TRUTH_COLUMN] == 1) & (df[feature] == 1)])/60)\n    hours_per_day_given_true = hours_given_true/total_hours*24\n    \n    hours_given_false = (len(df[(df[TRUTH_COLUMN] == 0) & (df[feature] == 1)])/60)\n    hours_per_day_given_false = hours_given_false/total_hours*24\n    \n    prob_given_true = hours_given_true/target_on_hours\n    prob_given_false = hours_given_false/target_off_hours\n    \n    # make sure it is never 0 or 100%\n    prob_given_true = 0.01 if prob_given_true < 0.01 else 0.99 if prob_given_true > 0.99 else prob_given_true\n    prob_given_false = 0.01 if prob_given_false < 0.01 else 0.99 if prob_given_false > 0.99 else prob_given_false\n    \n    entity_id, to_state = feature.split('/')\n    \n    if to_state is None:\n        print(f\"# skipping feature: {feature} could not be parsed\")\n        continue\n    print(f\"    # When '{entity_id}' is '{to_state}' ...\")\n    print(f\"    - platform: state\")\n    print(f\"      entity_id: {entity_id}\")\n    print(f\"      to_state: {to_state}\")\n    print(f\"      prob_given_true: {prob_given_true:.02f} # ({hours_per_day_given_true:.01f} hours per day)\")\n    print(f\"      prob_given_false: {prob_given_false:.02f} # ({hours_per_day_given_false:.01f} hours per day)\")",
      "metadata": {
        "trusted": true
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}
2 Likes

This looks great, if you look at my goals for the Bayesian component, this overlaps significantly with points 3, 4 and 5. As you have shown, as long as you know the time periods of ‘ground truth’ and the sensors you think are correlated you can automate the production of the config.
Instead of building a plug-in it would be great to do this within core home-assistant’s config flow. We could even take it a step further and run a search on the history/state machine and pick/suggest sensors that are correlated with the ground truth time periods.

1 Like

That sounds awesome. I know a lot about programming but not the home assistant development ecosystem.

I will begin reading about config flow as you suggested.

Do you have a recommended list of baysian modules already that are knows to work well on homeassistant hardware (I’m using scikit and pandas)? After reading some code for custom components I see there is a space to list requirements - just curious if there were some to avoid to ensure maximum compatibility for others.

I am personally excited about having cousin “probably is” sensors for all of my physical sensors!

I’ve made a few notable changes to the python code in the last day or so already to further improve it’s output.

  • support for automatic numeric state experimentation. calculate the min, max, median, 25th percentile and 75th percentile… generate an above/below hypothesis for each of those
  • support for automatic “state has been X for over X minutes” experiments (5, 15, 30, 60, 180 for now) - generate template observations
  • make sure to quote values (bare on and off were doing bad things)

I’m not enormously familiar with config flow either, and particularly in order to have a nice UI we are going to want to do things like show device history graphs and have time selectors so users can input the ‘ground-truth’ in a friendly way, I am not sure how to do this, or even if it is possible. Its probably worth being familar with Options-Flow as well. This part of the docs is also helpful. I’ve asked on the discord for some idea of feasibility. Edit: I forgot I had asked this question before, apparently you can get state history in the config flow. Edit2: and here is an example of accessing hass in a config flow. And here is an example of how you can use the hass object to get state history

I don’t I’m afraid. The actual code only uses Bayes’ rule here! I think for most cases using a whole package/dependency is overkill. Perhaps it will be necessary when we want to work backwards to help the user decide which devices would be best to include in/add to the config to improve model performance as you have just shown in your automated experimentation examples - I would be very grateful for your experience and advice on this. It would be amazing to open up the config/option flow and have home assistant tell you “Adding hallway motion off for 60 mins will take your “everyone_is_asleep” sensor from 90% to 96% accuracy.”

I would love to have a look at those, do you have a git(hub) link?

Edit: Just had a thought looking at your code. I suspect we will find we cant do everything we want to in the config flow (I hope we can) but I wonder if we could fork the history-explorer card into a bayesian config generator card…

Edit: We should use the template sensor/binary sensor as a template for our config flow. And we should copy the idea of giving the option of either a binary sensor (with a threshold) or a percentage probability sensor core/homeassistant/components/template/config_flow.py at dev · home-assistant/core · GitHub

OK just ran your code and found it gives ouptuts like this

    # When 'light.study_light' is 'off' ...
    - platform: state
      entity_id: light.study_light
      to_state: off
      prob_given_true: 0.46 # (2.5 hours per day)
      prob_given_false: 0.88 # (16.5 hours per day)
    # When 'light.study_light' is 'on' ...
    - platform: state
      entity_id: light.study_light
      to_state: on
      prob_given_true: 0.54 # (2.8 hours per day)
      prob_given_false: 0.12 # (2.2 hours per day)

This is tautology

    # When 'light.study_light' is 'off' ...
    - platform: state
      entity_id: light.study_light
      to_state: off
      prob_given_true: 0.46 # (2.5 hours per day)
      prob_given_false: 0.88 # (16.5 hours per day)

Is perfectly adaquate


You don’t need to generate mirror image entries like this where prob_given_trues and prob_given_falses sum to 1 . The Bayesian integration infers it since https://github.com/home-assistant/core/pull/67631 and you should get a repair warning for using a config that does since Add to issue registry if user has mirrored entries for breaking in #67631 by HarvsG · Pull Request #79208 · home-assistant/core · GitHub


I edited your code to ‘fix’ the problem:

###########################################
# create columns for each unique state
###########################################
orig_columns = df.columns
for col in orig_columns:
    
    #if col == TRUTH_COLUMN:
    #    if not INCLUDE_TRUTH_COLUMN_AS_OBSERVATION:
    #        continue

    if col.startswith("sensor."):
        # numeric
        continue
        
    unique_states = df[col].unique()
    if len(unique_states) < 2:
        print(f"{col} does not vary so it is uninformative so drop it")
        df.drop(col, axis=1, inplace=True)
    if len(unique_states) == 2: #This is binary so we don't need duplicate work
        print(f"{col} is binary")
        new_col = f"{col}/{unique_states[0]}"
        df[new_col] = df[col].copy().apply(lambda x: x == unique_states[0])
        if col == TRUTH_COLUMN:
            TRUTH_COLUMN = new_col
        df.drop(col, axis=1, inplace=True)
    if len(unique_states) > 2:
        print(f"{col} has > 2 states")
        for state in unique_states:
            new_col = f"{col}/{state}"
            df[new_col] = df[col].copy().apply(lambda x: x == state)
        if col != TRUTH_COLUMN:
            df.drop(col, axis=1, inplace=True)

I like that you’re working on this. I think the Bayesian sensor is vastly under-appreciated. After the useless “year of voice” that we’ve had in 2023, I would love to see 2024 become a “year of automation” that focuses on things like making the Bayesian integration much easier to use, e.g. by making it UI-configurable, automatically deriving probabilities based on historical data, and better documentation.

3 Likes

I see a bayesian helper on the “add integration” menu, but it seems to do nothing?

Clicking this:
image

Just goes to the Helpers page. I don’t see a bayesian option anywhere in the list of available helpers, when I try to add a helper.

What am I missing?

Last update they made it so all YAML-configured helpers would show up in the Helper list, even though you can’t do anything with them…

https://www.home-assistant.io/blog/2024/08/07/release-20248/#integrations-and-helpers-set-up-via-yaml-now-visible-in-the-ui

1 Like

I am working on a UI set up - it will come with time

4 Likes

In my Search around Bayesian sensor adding to person entity I bumped into this post. @HarvsG any thoughts on the possibility to make it available for the person entity hence it improves presence/person detection.

Combining bayesian with the person would make Automations (for me) less vulnerable to false positieves or device malfunctions (simply replace tracker linked to person.x instead of changing al entities in several automations.

Ooo, think is an interesting and good point.

I’m not sure what the entity filter is on the person entity

Edit: it seems that it is filtered for device_trackers only and I’m not sure of a built-in way I could optionally turn the Bayesian sensor into a device tracker.

In the meantime there is some discussion of doing this with an automation here that you might find helpful:

It relies on a little know action device_tracker.see which allows one to manually change the state of a device_tracker.

If you find a simple, elegant and ideally blueprint able way of doing this then I could add it to the Bayesian Docs.

1 Like

This is quite an old thread (I see the number of active installations is down to 254 since it started :slightly_frowning_face:) and things have probably (heh) moved on, but…

I use the integration quite a lot (without, I confess, understanding much of the theory) and I find that one of its limitations is the fact that the HA sensors it generates are binary - which seems counter-intuitive, given you’re trying to decide whether something is probably true.

In practical terms, the probability attribute is extremely useful and the integration would benefit from making it a sensor in its own right.

For room-level presence detection, for example, I have a Bayesian entity for each room and a template which compares their probability attributes and picks the highest. This compensates for phones being left in the other room, dogs triggering motion sensors and all the other stuff that makes binary presence detection go wrong.

Just a thought…

1 Like

Not sure if it’s a good Idea when the bayesian is fed by the Same device_tracker that is going to be changed by the .see action

Yes, I agree with this. Especially as the probability could be converted into a binary sensory with the aid of a threshold helper.

However it would be a big breaking change to switch this. So I think the solution in future is going to be to create a template sensor from the probability attribute. I believe these can now be blueprinted, so my plan is to make an official blueprint and add it to the docs.

I expect the user base to pick up once the Config Flow UI PR is merged

2 Likes