Detecting unresponsive devices

It sometimes happens that devices silently go down without any chance to tell Home Assistant about that.

Most of the time I see this with battery-powered devices like door/window sensors (but also others). The last reported battery level was, for example, 45% or so, and the next day the device is offline and never comes back.

If you want to know about this, here is my solution to detect this issue:
(REMARK: not the latest version, see UPDATE post)

input_select:
  connectivity_check_blacklist:
    name: "Connection check: Devices blacklist"
    options:
      - device-1-which-should-never-be-alarmed
      - device-2-which-should-never-be-alarmed
      - ...
    icon: mdi:block-helper

automation:
  - id: "connectivity_check"
    alias: "Connection check: Check connectivity for all devices"
    mode: restart
    trigger:
      - platform: time
        at: "08:10"
    action:
      - service: script.connectivity_check

script:
  connectivity_check:
    alias: "Connection check: Check connectivity for all devices"
    mode: queued
    variables:
      devices: >-
        {# create empty list to hold the devices #}
        {% set ns = namespace(devices = []) %}

        {# loop through ALL entities #}
        {% for entity in (states | map(attribute='entity_id') | list) %}
          {# ignore everything which does not have a device #}
          {% if device_attr(entity, 'name_by_user') != none %}
            {# this is to avoid doubled entries, so remove possible former entries and add a new one #}
            {% set ns.devices = ns.devices | reject('in', [(device_attr(entity, 'name_by_user'), area_name(entity))]) | list %}
            {% set ns.devices = ns.devices + [(device_attr(entity, 'name_by_user'), area_name(entity))] %}
          {% endif %}
        {% endfor %}

        {# loop again through ALL entities #}
        {% for entity in (states | map(attribute='entity_id') | list) %}
          {# ignore everything which does not have a device #}
          {% if device_attr(entity, 'name_by_user') != none %}
            {# look at last updated and last changed #}
            {% if states[entity].last_changed > (now() - timedelta(hours=8)) or states[entity].last_updated > (now() - timedelta(hours=8)) %}
              {# remove every entry where any entity of the device was updated or changed within the last 8 hours #}
              {% set ns.devices = ns.devices | reject('in', [(device_attr(entity, 'name_by_user'), area_name(entity))]) | list %}
            {% endif %}
          {% endif %}
        {% endfor %}

        {# remove any device from the blacklist #}
        {% for device in ns.devices %}
          {% if device[0] in state_attr('input_select.connectivity_check_blacklist', 'options') %}
            {% set ns.devices = ns.devices | reject('in', [(device[0], device[1])]) | list %}
          {% endif %}
        {% endfor %}

        {# here we have our unresponsive devices list #}
        {{ ns.devices }}
    sequence:
      - repeat:
          # loop through our devices
          for_each: "{{ devices }}"
          sequence:
            # send message
            - service: script.message_warning_device_offline
              data_template:
                device: "{{ repeat.item[0] }}"
                area: "{{ repeat.item[1] }}"

It's fully automated; there is no need to configure anything on your devices or entities. Any future device is also covered automatically.

What is it doing?

  • first step: create a list of all devices by looping through every entity and get the device (if set) into a list (together with the area)

  • second step: loop again through all entities and check the last update and last change; if the entity changed anyhow then remove the corresponding device from the list (no need to worry, this device seems to be alive)

  • third step: remove any device defined in the blacklist (I use an input_select helper for this, if anyone has a better idea…)

  • last step: the remaining devices in the list must be those where no entity changed within the last 8 hours, so send an alarm message

The messaging script "script.message_warning_device_offline" turns everything into a Pushover message, but that is outside the scope of this post. Feel free to adapt it to your needs.
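Since the messaging script itself is not shown, here is a minimal stand-in sketch, not the Pushover-based original. It assumes you are happy with a persistent notification in the Home Assistant UI; the alias, the field descriptions and the message wording are made up for illustration. Only the script name and the two fields `device` and `area` are taken from the call in the script above.

```yaml
script:
  message_warning_device_offline:
    alias: "Warning: device offline"  # hypothetical alias
    mode: queued
    fields:
      device:
        description: "Name of the unresponsive device"
      area:
        description: "Area the device is assigned to (may be empty)"
    sequence:
      # stand-in for the Pushover call: show a notification in the HA UI
      - service: persistent_notification.create
        data:
          title: "Device offline"
          message: "{{ device }} ({{ area or 'no area' }}) has not reported recently."
```

To use another channel, replace the `persistent_notification.create` call with the notify service of your own setup.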

The solution is working fine with my Home Assistant setup and my devices. There may be configurations where this is not working as expected.

Blacklist
I defined it in a YAML file, but it should be possible to set up a dropdown helper in the UI for this purpose. Using the UI lets you change the blacklist right there.

Remark
It seems that after restarting Home Assistant, entities of dead devices get a refreshed 'last_changed' and/or 'last_updated'. To the script, these devices look like they are online. But after the defined 8 hours they are detected again.


Thank you for this, which is outstanding; I may be using it. You might want to add something in there about devices that return a state of "Unknown" or "Unavailable" as well.

Definitely an improvement… :+1:

But is it needed?
If an 'unavailable' or 'unknown' state leads to the device not being removed from the list (and so it is alarmed), then the purpose is achieved, isn't it?

But basically I agree… it's usually a bad idea to hope that undefined states lead to proper results. :wink:

UPDATE

Since it has been running for some time now and I have given this some more thought, I found an important improvement.
It comes from this: when checking whether a device is alive, it is not useful to look at ALL its entities, since some of them can (and will) be changed by Home Assistant processes.

Let's take for example a battery-powered thermostat. The climate entity may change all the time through automations or whatever, so it gives no information about the real device status. Instead we have to look at entities that will not be changed by HA, meaning sensors and binary sensors.

Here's an updated version of the script:

script:
  connectivity_check:
    alias: "Connection check: Check connectivity for all devices"
    mode: queued
    variables:
      devices: >-
        {# create empty list to hold the devices #}
        {% set ns = namespace(devices = []) %}

        {# loop through binary sensor entities #}
        {% for entity in (states.binary_sensor | map(attribute='entity_id') | list) %}
          {# ignore everything which does not have a device #}
          {% if device_attr(entity, 'name_by_user') != none %}
            {# this is to avoid doubled entries, so remove possible former entries and add a new one #}
            {% set ns.devices = ns.devices | reject('in', [(device_attr(entity, 'name_by_user'), area_name(entity))]) | list %}
            {% set ns.devices = ns.devices + [(device_attr(entity, 'name_by_user'), area_name(entity))] %}
          {% endif %}
        {% endfor %}

        {# loop through sensor entities #}
        {% for entity in (states.sensor | map(attribute='entity_id') | list) %}
          {# ignore everything which does not have a device #}
          {% if device_attr(entity, 'name_by_user') != none %}
            {# this is to avoid doubled entries, so remove possible former entries and add a new one #}
            {% set ns.devices = ns.devices | reject('in', [(device_attr(entity, 'name_by_user'), area_name(entity))]) | list %}
            {% set ns.devices = ns.devices + [(device_attr(entity, 'name_by_user'), area_name(entity))] %}
          {% endif %}
        {% endfor %}

        {# loop again through binary sensor entities #}
        {% for entity in (states.binary_sensor | map(attribute='entity_id') | list) %}
          {# ignore everything which does not have a device #}
          {% if device_attr(entity, 'name_by_user') != none %}
            {# look at last changed #}
            {% if states[entity].state not in ['unknown', 'unavailable'] %}
              {% if states[entity].last_changed > (now() - timedelta(hours=16)) %}
                {# remove every entry where any entity of the device was changed within the last 16 hours #}
                {% set ns.devices = ns.devices | reject('in', [(device_attr(entity, 'name_by_user'), area_name(entity))]) | list %}
              {% endif %}
            {% endif %}
            {# look at last updated #}
            {% if states[entity].state not in ['unknown', 'unavailable'] %}
              {% if states[entity].last_updated > (now() - timedelta(hours=16)) %}
                {# remove every entry where any entity of the device was updated within the last 16 hours #}
                {% set ns.devices = ns.devices | reject('in', [(device_attr(entity, 'name_by_user'), area_name(entity))]) | list %}
              {% endif %}
            {% endif %}
          {% endif %}
        {% endfor %}

        {# loop again through sensor entities #}
        {% for entity in (states.sensor | map(attribute='entity_id') | list) %}
          {# ignore everything which does not have a device #}
          {% if device_attr(entity, 'name_by_user') != none %}
            {# look at last changed #}
            {% if states[entity].state not in ['unknown', 'unavailable'] %}
              {% if states[entity].last_changed > (now() - timedelta(hours=8)) %}
                {# remove every entry where any entity of the device was changed within the last 8 hours #}
                {% set ns.devices = ns.devices | reject('in', [(device_attr(entity, 'name_by_user'), area_name(entity))]) | list %}
              {% endif %}
            {% endif %}
            {# look at last updated #}
            {% if states[entity].state not in ['unknown', 'unavailable'] %}
              {% if states[entity].last_updated > (now() - timedelta(hours=8)) %}
                {# remove every entry where any entity of the device was updated within the last 8 hours #}
                {% set ns.devices = ns.devices | reject('in', [(device_attr(entity, 'name_by_user'), area_name(entity))]) | list %}
              {% endif %}
            {% endif %}
          {% endif %}
        {% endfor %}

        {# remove any device from the blacklist #}
        {% for device in ns.devices %}
          {% if device[0] in state_attr('input_select.connectivity_check_blacklist', 'options') %}
            {% set ns.devices = ns.devices | reject('in', [(device[0], device[1])]) | list %}
          {% endif %}
        {% endfor %}

        {# here we have our devices list #}
        {{ ns.devices }}
    sequence:
      - repeat:
          # loop through our devices
          for_each: "{{ devices }}"
          sequence:
            # send message
            - service: script.message_warning_device_offline
              data_template:
                device: "{{ repeat.item[0] }}"
                area: "{{ repeat.item[1] }}"

i will definitely check this out. thanks!

Could you please provide an example of your script.message_warning_device_offline script?

My experience with notifications has been hit or miss, so any examples will help a lot.

Actually it's a system of several scripts. :wink:

My notification system is based on Pushover, so I'm not sure whether you will find it helpful.

Understandable. Thanks.

Thanks for your script. As I've not yet used scripts that call other actions with variables, I am currently trying to figure out how to adapt your script to send a simple notification instead, with a comma-separated list of unresponsive devices. And as I was searching, I found "First script with data_template", where a moderator wrote three years ago: "data_template: was deprecated many releases ago in favor of data:". I assume there is a good reason why you use data_template:, could you explain? Thanks a lot!

(I'm also so unfamiliar with Jinja that it surprises me that repeat: in your script is first a command that starts a loop but later seems to be a variable holding an array of items, see repeat.item[0]. How can it be both? Is the variable automatically assigned when calling a loop?)
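For the single-notification idea, one possible sketch is to replace the `sequence:` of the script with the fragment below. It assumes the `variables:` block stays unchanged, so `devices` is still the list of (name, area) pairs, and it uses a persistent notification as a stand-in for whatever notify service you actually use. The title text is made up.

```yaml
    sequence:
      # stop here if nothing is unresponsive
      - condition: template
        value_template: "{{ devices | count > 0 }}"
      # one notification with all device names, comma-separated
      - service: persistent_notification.create
        data:
          title: "Unresponsive devices"
          message: "{{ devices | map(attribute=0) | join(', ') }}"
```

`map(attribute=0)` picks the first element (the device name) of each pair; the area is dropped in this variant.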

Your finding is right, "data_template" is deprecated. Just replace it with a simple "data"; it should work the same.

The "repeat:" action has nothing to do with Jinja.
See here: https://www.home-assistant.io/docs/scripts/
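To illustrate the two roles of `repeat`: as a key in the sequence it is a script action, and while that action runs, Home Assistant makes a variable of the same name available to templates, with `repeat.item` holding the current `for_each` element. A minimal sketch, with a hypothetical script name and made-up device data, using a persistent notification only so the result is visible somewhere:

```yaml
script:
  repeat_demo:  # hypothetical name, for illustration only
    sequence:
      - repeat:
          # each element is a [device, area] pair, like in the script above
          for_each:
            - ["Door sensor", "Hallway"]
            - ["Window sensor", "Kitchen"]
          sequence:
            - service: persistent_notification.create
              data:
                # repeat.item is the current pair, repeat.index counts from 1
                title: "Item {{ repeat.index }}"
                message: "{{ repeat.item[0] }} in {{ repeat.item[1] }}"
```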

If this happens often enough for you to write an automation… shouldn't you be asking why?


Sounds like valid logic. :wink:
But in some cases I do not see myself as the creator of the issue but as the one who has to deal with it.

I have Z-Wave sensors on nearly all windows/doors. Some of them report 30% battery but work for months. Sometimes they show 40% forever but are dead because of empty battery.

A Z-Wave button reports 100% right after charging but goes down to 0% without any value in between.

In contrast… an outdoor sensor (also Z-Wave) is quite accurate and reliable in its battery level measurement.

Obviously it depends heavily on how well the vendor has implemented the battery level management. I can only deal with what is built in.

It's simply my experience that I cannot fully trust the reported battery level of several devices.

Are you using rechargeable batteries?

Only that one device. It's designed to use a small rechargeable battery.

All other devices (window sensors, fire detection sensors, …) have their own battery sizes and they are not rechargeable.

im sorry but can you please show a noob, like myself, on how to use this?
i copied n saved your script into my config folder as seen here

i assume i have to put my devices in here?
for example, here is a zwave Ring Keypad v2.

and here is an entity that im interested in monitoring

what to put in line 7 of your script?
only put in "binary_sensor.keypad_v2_motion_detection" without quote, correct?
next, how to set up automation to run once a week? if an entity does not report status change within 3 days, then device needs battery change alert to my Telegram?

No… you do not have to put any of your devices anywhere. :wink:

You only need the 'input_select' and the 'automation' from the first post and the updated script from the third post.

The script will find your sensor by itself. In fact, it goes through every sensor and binary_sensor, no matter where it comes from.

The only thing you need to adapt:

  • the 'input_select', to define devices which should be ignored
  • maybe the automation, if you want to run it at another time
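For the weekly run asked about above, one possible variant of the automation is sketched below. The id, the alias and the choice of Monday are assumptions for illustration; the time trigger and the script call are the same as in the first post.

```yaml
automation:
  - id: "connectivity_check_weekly"  # hypothetical id
    alias: "Connection check: weekly run"
    mode: restart
    trigger:
      - platform: time
        at: "08:10"
    condition:
      # the trigger fires every day, the condition lets it pass on Mondays only
      - condition: time
        weekday:
          - mon
    action:
      - service: script.connectivity_check
```

For a three-day threshold, the `timedelta(hours=8)` and `timedelta(hours=16)` values in the script would become `timedelta(days=3)`. Keep the restart remark from the first post in mind: a Home Assistant restart within those three days may make dead devices look alive. For Telegram, the messaging script would call the notify service of your own Telegram setup.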

alright, i made it this far:

i assume i put this section of the code in my HA configuration.yaml right?

and that goes back to an earlier question. what is the syntax of the device to ignore? can you just give me samples of your code that you put into input_select?

lastly, once your script report a list of dead devices, how to view?

The input select can be defined as YAML in 'configuration.yaml', but you can also define it as an input helper in the GUI. Only the name is important: the script looks for 'input_select.connectivity_check_blacklist'. If you want to use another name, you need to adapt the script.

Device names in options: Just use the friendly name as you can see it in the GUI.
Example:

input_select:
  connectivity_check_blacklist:
    name: "Verbindungsprüfung: Blacklist"
    options:
      - My-Handy
      - Button 001
      - Shelly Dimmer 2 349454729EA8
    icon: mdi:block-helper