List devices being offline for too long (based on newly introduced last_reported state attribute)

Hi All,

Newbie HA user posting for the 1st time (slowly migrating my stuff from vanilla Node-RED and MQTT while hoping to not make too many obvious mistakes or post stupid topics…).

Certainly I’m not the first one dealing with devices silently disappearing from my smart home. But I didn’t find an immediate solution addressing my desire to list devices that have been offline for too long. What I want:

  1. Classify some (not all) of my devices into importance categories (for simplicity classifying the devices themselves and not their individual entities)
  2. Calculate the time of ‘last known contact’ (any contact across all entities for one device)
  3. List devices that are offline for too long, e.g. last known contact more than 24 hours ago

I saw that HA recently introduced a new attribute to its state called last_reported and this seems to do (mostly) what I need for monitoring the availability of my devices. (At times last_changed and last_updated seem to be emotionally discussed; but in my understanding this last_reported is also updated even if no real change to the state occurred and therefore translates into time of last known contact.)
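To compare the three timestamps on a single entity, a minimal sketch for the Template tab of Developer tools could look like this. The entity ID `sensor.example_temperature` is a placeholder and not taken from the thread. The working assumption is that last_changed moves when the state value changes, last_updated when the state or its attributes change, and last_reported whenever the integration writes the state, even with an identical value.

```jinja
{# Placeholder entity ID: replace with one of your own entities #}
{%- for s in expand('sensor.example_temperature') %}
last_changed:  {{ s.last_changed }}
last_updated:  {{ s.last_updated }}
last_reported: {{ s.last_reported }}
{%- endfor %}
```

If the entity does not exist, the loop simply produces no output.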

I’m tagging some of my devices with an optional custom label (‘Availability check: daily’) and use the Jinja template below in a Markdown card on a Dashboard. It loops through all states and, for these labelled devices, keeps the most recent last_reported timestamp per device (note that I aggregate by device instead of by entity, as this is the level I’m interested in). In my case all devices labelled ‘Availability check: daily’ whose last known contact was more than 24 hours ago are listed.

All of this is work-in-progress and super-inefficient (not even sure if it isn’t crashing bigger HA installations due to excessive looping). Still it might be helpful for others. Or others might have suggestions how to improve this (e.g. making more customizable, providing as a blueprint, having active notifications, not ‘abusing’ the templates for such logic, …). I’m also ok if someone says that this is an inefficient/stupid/naive/ugly approach and that I’m better off using solution XYZ.


Screenshot from Markdown card on Dashboard (listing all devices without contact during the last hour):


Jinja template:

{# use namespace as a workaround to add data as it iterates over all states #}
{%- set ns = namespace(d1={} ) %}
  
{# get all states #}
{%- for state in states | sort(attribute='entity_id') %}
  {# get all devices for label #}
  {%- for d in label_devices('Availability check: daily') %}
    {# check if current state belongs to labelled device #}
    {%- if d == device_id(state.entity_id) %}
      {# relevant / labelled device found, determine its proper name #}
      {%- set device_name = device_attr(d, "name_by_user") %}
      {%- if device_name is none %}
        {%- set device_name = device_attr(d, "name") %}
      {%- endif %}

      {%- if device_name not in ns.d1 %} 
        {# first time device encountered #}
        {%- set d2 = { device_name : state.last_reported } %}
        {%- set ns.d1 = dict(ns.d1, **d2) %}
      {%- elif ns.d1[device_name] < state.last_reported %}
        {# more recent last_reported found #}
        {%- set d2 = { device_name : state.last_reported } %}
        {%- set ns.d1 = dict(ns.d1, **d2) %}
      {%- endif %}
    {%- endif %}  
  {%- endfor %}
{%- endfor %}

{# now we have a dictionary for every labelled device with last_reported timestamp #}
{# {{ ns.d1 }} #}

{# define cut off timestamp to be considered 'too old' #}
{%- set timestampCutoff = now() - timedelta( hours = 24, minutes = 0 ) %}

<table border="1">
<tr><th>Device</th><th>Last known contact</th></tr>
{# loop over dictionary and only list devices whose last report is older than timestampCutoff #}
{# both values are timezone-aware datetimes, so they can be compared and subtracted directly #}
{%- for key, value in ns.d1.items() %}
  {%- if value < timestampCutoff %}
<tr><td>{{ key|e }}</td><td>{{ now() - value }} ago</td></tr>
  {%- endif %}
{%- endfor %}
</table>

Note: The template also has the limitation that last_reported doesn’t ‘survive’ a HA restart, so the timers are reset on every restart. I guess the workaround could be to run this template every X hours and store the last known contact times somewhere where they survive a HA restart…
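One possible way to reduce the looping is to start from the labelled devices and only look at their own entities, instead of scanning every state. The following is a sketch under that assumption, not a tested replacement. It uses the same label and 24-hour cutoff as above, and it relies on expand() skipping entities that currently have no state.

```jinja
{# Sketch: iterate labelled devices first, then only their entities #}
{%- set cutoff = now() - timedelta(hours=24) %}
<table border="1">
<tr><th>Device</th><th>Last known contact</th></tr>
{%- for d in label_devices('Availability check: daily') %}
  {# collect last_reported for every entity of this device that has a state #}
  {%- set stamps = expand(device_entities(d)) | map(attribute='last_reported') | list %}
  {# list the device only if its newest report is older than the cutoff #}
  {%- if stamps | count > 0 and (stamps | max) < cutoff %}
<tr><td>{{ (device_attr(d, 'name_by_user') or device_attr(d, 'name')) | e }}</td><td>{{ now() - (stamps | max) }} ago</td></tr>
  {%- endif %}
{%- endfor %}
</table>
```

Because this works on device IDs rather than device names, two devices that happen to share a name would no longer be merged into one row.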


Very good idea! But it is not working for me…

You’re right - there have been a number of posts about this lately (some of them emotional :grin:).

I’m a bit puzzled by them, though, because devices don’t “silently disappear”. Honestly, they don’t. Anyone who has this experience has a serious issue which needs addressing.

Several of the devices on your list look as if they might be Zigbee sensors - are they the problem?

I like your idea of assigning devices to categories by importance, by the way.

Do you get any error message or simply an empty table?

If an error message, can you share that?

If just an empty table, did you make sure to label at least one device with the custom label ‘Availability check: daily’ (needs to be exactly like that)? Or maybe all your labelled devices already reported back within the last 24 hours? To verify this you can also adjust the cutoff time and lower it from 24 hours to maybe just one hour by adjusting the hours and minutes in this line {%- set timestampCutoff = now() - timedelta( hours = 24, minutes = 0 ) %}

Maybe this is obvious to others, but you can also paste the template into the ‘Template’ tab of the ‘Developer tools’ for faster turnaround when testing.
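To rule out a label mismatch, a small check like the following can be pasted into the Template tab. It is only a sketch and shows how many devices carry the label and what they are called. If it reports 0, the label name in the template most likely does not match the label in HA.

```jinja
{# Which devices does the label resolve to? #}
{%- set devices = label_devices('Availability check: daily') %}
Labelled devices: {{ devices | count }}
{%- for d in devices %}
- {{ device_attr(d, 'name_by_user') or device_attr(d, 'name') }}
{%- endfor %}
```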

Agreed. Devices shouldn’t go offline. But it looks like at least for some (including me) every now and then the reality is different. And I admit that it can be related to my ‘zoo’ of different devices. Zigbee, Zwave, Wifi, Cloud integration, Homekit, local gateways, …

Just a few examples of the last years or so in my setup:

  1. My Sonoff temp sensor at times decides to require a re-pairing (for Zigbee2Mqtt there are multiple reports that certain devices / manufacturers have similar problems)
  2. My Daikin cloud integration had an API change and with this the HA integration required a re-authentication
  3. The access point used in my basement just to provide connectivity to my custom reader device for my natural gas meter had a fault and therefore the reader device (connected via Wifi) went offline
  4. My wife unknowingly unplugged a local gateway used to control our windows
  5. A Hue / Zigbee bulb was faulty and as such didn’t respond anymore
  6. The water leak sensor in the washing room was running out of battery

None of this happens on a weekly (or even monthly) basis. And for frequently used devices a problem is obvious (e.g. the living room light doesn’t work anymore). But for other convenience or security devices problems might be less obvious, or even critical, as with a water leak sensor that might only be needed every other year.

To sum it up, even in a perfectly functioning setup I would still like the reassurance that everything is up and alive. And if not, I would like to be notified, similar to an ‘oil low’ light in my car.


Hello, I’m testing your card, but it’s not working.
I have 74 unavailable entities, which belong to 2 devices that are turned off, but they are not listed on the card.

Any idea?

I have to admit that I don’t really know what ‘unavailable device’ for HA means.

Maybe as a start for debugging use the template below (for simplicity it might be best to paste it into the Template tab of Developer tools). It runs through every state and lists the most recent last_reported timestamp for every device, whether labelled or not. Disclaimer: I really don’t know how well this works on installations with many devices.

Check if you can find your unavailable devices in the output. If this doesn’t produce any result (and also doesn’t crash) there might be differences between our HA versions? I’m currently on 2024.4.2.

{# Debug version: collect the most recent last_reported for ALL devices (no label filter) #}
{# namespace hack to allow growing the dictionary inside the loop #}
{%- set ns = namespace(d1={} ) %}
 
{# {{ relative_time(now()) }} #}
  
{# get all states #}
{%- for state in states | sort(attribute='entity_id') %}
    {# look up the device this entity belongs to (no label filter in this debug version) #}
    {%- set d = device_id(state.entity_id) %}
      {# determine the device's display name #}
      {%- set device_name = device_attr(d, "name_by_user") %}
      {%- if device_name is none %}
        {%- set device_name = device_attr(d, "name") %}
      {%- endif %}
 
      {%- if device_name not in ns.d1 %} 
        {# first time device encountered #}
        {%- set d2 = { device_name : state.last_reported } %}
        {%- set ns.d1 = dict(ns.d1, **d2) %}
      {%- elif ns.d1[device_name] < state.last_reported %}
        {# more recent last_reported found #}
        {%- set d2 = { device_name : state.last_reported } %}
        {%- set ns.d1 = dict(ns.d1, **d2) %}
      {%- endif %}
{%- endfor %}
 
{# now we have a dictionary mapping every device name to its most recent last_reported timestamp #}
{{ ns.d1 }}

Got this error in Developer tools:

TemplateError: Must provide a device or entity ID

Core: 2024.4.3
Supervisor: 2024.04.0
Operating System: 12.2
Frontend: 20240404.2

Maybe HA allows entities that are not linked to any device. For those, device_id() returns none, and device_attr() then fails with the error above.

I have added another ‘is not none’ check, and this template now puts out an ugly list of the most recent last_reported timestamp per device (at least for me).

{# Debug version: collect the most recent last_reported for ALL devices (no label filter) #}
{# namespace hack to allow growing the dictionary inside the loop #}
{%- set ns = namespace(d1={} ) %}
 
{# {{ relative_time(now()) }} #}
  
{# get all states #}
{%- for state in states | sort(attribute='entity_id') %}
    {# look up the device this entity belongs to (no label filter in this debug version) #}
    {%- set d = device_id(state.entity_id) %}
    {# skip entities that are not linked to any device #}
    {%- if d is not none %}
      {# determine the device's display name #}
      {%- set device_name = device_attr(d, "name_by_user") %}
      {%- if device_name is none %}
        {%- set device_name = device_attr(d, "name") %}
      {%- endif %}
 
      {%- if device_name not in ns.d1 %} 
        {# first time device encountered #}
        {%- set d2 = { device_name : state.last_reported } %}
        {%- set ns.d1 = dict(ns.d1, **d2) %}
      {%- elif ns.d1[device_name] < state.last_reported %}
        {# more recent last_reported found #}
        {%- set d2 = { device_name : state.last_reported } %}
        {%- set ns.d1 = dict(ns.d1, **d2) %}
      {%- endif %}
    {%- endif %}
{%- endfor %}
 
{# now we have a dictionary mapping every device name to its most recent last_reported timestamp #}
{{ ns.d1 }}

This generated the list with several devices.

And now?
What is the code to generate the card?

Do you see any of your previously mentioned unavailable devices in that list? If so, which timestamp do they list?

And just for clarification: earlier you mentioned ~70 unavailable entities, but this template checks the ‘overarching’ devices. So you would need to identify the actual devices these unavailable entities belong to and search for those device names in the list.

And then of course these devices also need to be labelled with the exact label name.
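To find out which devices the unavailable entities belong to, something along these lines might help. It is a sketch for the Template tab and assumes that the entities in question currently have the state `unavailable`.

```jinja
{# Collect the distinct devices behind all entities that are 'unavailable' #}
{%- set ns = namespace(devices=[]) %}
{%- for s in states if s.state == 'unavailable' %}
  {%- set d = device_id(s.entity_id) %}
  {# ignore entities without a device and avoid duplicates #}
  {%- if d is not none and d not in ns.devices %}
    {%- set ns.devices = ns.devices + [d] %}
  {%- endif %}
{%- endfor %}
{%- for d in ns.devices %}
- {{ device_attr(d, 'name_by_user') or device_attr(d, 'name') }}
{%- endfor %}
```

The names printed here are the ones to look for in the dictionary output and the ones that need the label.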

These are the devices:

'1mmw': datetime.datetime(2024, 4, 25, 11, 30, 33, 978550, tzinfo=datetime.timezone.utc), 

and

'Energia': datetime.datetime(2024, 4, 25, 11, 30, 20, 549987, tzinfo=datetime.timezone.utc), 

Both seem to have reported some data today shortly before noon (around 11:30 UTC on 2024-04-25). According to this they have not been offline for more than 24 hours; hence they wouldn’t show up in the dashboard card.

Is it possible that these devices actively report on some entities while others of their entities aren’t available anymore?

For testing purpose you could also try to reduce the cutoff timestamp of my original template from 24 hours down to maybe just 30 mins. In this case these devices you have listed (again assuming they have also been properly labeled) should then appear in the markdown card.
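To see whether only some entities of a device are still reporting, a per-entity breakdown could be printed like this. This is only a sketch, and `sensor.example_power` is a placeholder for any entity that belongs to the device in question.

```jinja
{# Placeholder entity ID: use any entity of the device you want to inspect #}
{%- set d = device_id('sensor.example_power') %}
{%- if d is not none %}
{%- for s in expand(device_entities(d)) %}
{{ s.entity_id }}: {{ s.state }} (last_reported {{ s.last_reported }})
{%- endfor %}
{%- endif %}
```

A single entity with a recent last_reported is enough to keep the whole device off the list, since the card only uses the newest timestamp per device.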

I lost this information.
I restarted HA in the morning, so I changed it to 6h.

This is my code now:

{# Collect the most recent last_reported for ALL devices (no label filter) #}
{# namespace hack to allow growing the dictionary inside the loop #}
{%- set ns = namespace(d1={} ) %}
 
{# {{ relative_time(now()) }} #}
  
{# get all states #}
{%- for state in states | sort(attribute='entity_id') %}
    {# look up the device this entity belongs to (no label filter here) #}
    {%- set d = device_id(state.entity_id) %}
    {# skip entities that are not linked to any device #}
    {%- if d is not none %}
      {# determine the device's display name #}
      {%- set device_name = device_attr(d, "name_by_user") %}
      {%- if device_name is none %}
        {%- set device_name = device_attr(d, "name") %}
      {%- endif %}
 
      {%- if device_name not in ns.d1 %} 
        {# first time device encountered #}
        {%- set d2 = { device_name : state.last_reported } %}
        {%- set ns.d1 = dict(ns.d1, **d2) %}
      {%- elif ns.d1[device_name] < state.last_reported %}
        {# more recent last_reported found #}
        {%- set d2 = { device_name : state.last_reported } %}
        {%- set ns.d1 = dict(ns.d1, **d2) %}
      {%- endif %}
    {%- endif %}
{%- endfor %}
 
{# now we have a dictionary for every labelled device with last_reported timestamp #}
{# {{ ns.d1 }} #}

{# define cut off timestamp to be considered 'too old' #}
{%- set timestampCutoff = now() - timedelta( hours = 2, minutes = 00 ) %}

<table border="1">
<tr><th>Device</th><th>Last known contact</th></tr>
{# loop over dictionary and only list entities that reported before timestampCutoff #}
{%- for key, value in ns.d1.items() %}
  {%- if as_timestamp(value|e) < as_timestamp(timestampCutoff) %}
<tr><td>{{ key|e }}</td><td>{{ ((now() - as_datetime(value|e))) }} ago</td></tr>
  {%- endif %}
{%- endfor %}
</table>

How do I show just the table?
I edited the code and commented out the line {{ ns.d1 }}

And I think these shouldn’t appear on the list:

Edit 2:

After removing some test devices:
It would be interesting to have an exclusion list

Cool. You are making some progress.

My initial template contained the line {%- for d in label_devices('Availability check: daily') %}

This checks if a device has that specific label, and only then are the entities of that device analyzed. So you can see this as an inclusion list (kind of the opposite of your exclusion list, but it should lead to the same results).

Did you label all the 6 devices from your list with that specific custom label (‘Availability check: daily’) in HA?
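If an exclusion list is preferred over the inclusion label, one option might be a second label that marks devices to skip. In the sketch below, ‘Availability check: ignore’ is a made-up label name used only for illustration. The snippet just counts the devices that would remain, so it can be tried safely in the Template tab.

```jinja
{# 'Availability check: ignore' is a hypothetical label for this sketch #}
{%- set excluded = label_devices('Availability check: ignore') %}
{%- set ns = namespace(devices=[]) %}
{%- for s in states %}
  {%- set d = device_id(s.entity_id) %}
  {# keep each device once, unless it has no device ID or is excluded #}
  {%- if d is not none and d not in excluded and d not in ns.devices %}
    {%- set ns.devices = ns.devices + [d] %}
  {%- endif %}
{%- endfor %}
{{ ns.devices | count }} devices would be checked
```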

Some thoughts on this:

  • I agree with: Devices DO go offline silently. In reality it happens. :+1:
  • Why monitor only a couple of devices and not all? Are there devices in your smart home where you do not care whether they are online or not? :wink:
  • Why check devices just to detect that they are fine? If a device is okay, then a check is a waste of time. I guess you are interested in the offline devices and not in the ones that are online.
  • The only entities which give you real knowledge about the device status are sensors and binary sensors, since these are the entities which are updated by the device itself. Think about it. :wink:

This is my solution (just for info):
https://community.home-assistant.io/t/detecting-unresponsive-devices/658030/16
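The point about sensors and binary sensors could be applied to the label-based template by only considering entities from those two domains. The following is a sketch of that idea, not the approach from the linked post. It reuses the label and 24-hour cutoff from the original template.

```jinja
{# Sketch: only sensor and binary_sensor entities count as 'contact' #}
{%- set cutoff = now() - timedelta(hours=24) %}
{%- for d in label_devices('Availability check: daily') %}
  {%- set stamps = expand(device_entities(d))
        | selectattr('domain', 'in', ['sensor', 'binary_sensor'])
        | map(attribute='last_reported') | list %}
  {# devices without any sensor or binary_sensor entity are skipped here #}
  {%- if stamps | count > 0 and (stamps | max) < cutoff %}
{{ device_attr(d, 'name_by_user') or device_attr(d, 'name') }}: {{ now() - (stamps | max) }} ago
  {%- endif %}
{%- endfor %}
```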

Good points. And am glad I’m not the only one with devices sneaking out of my system :wink:

I hadn’t seen your solution. It looks more advanced and touches areas unknown to me, so there is more for me to learn. Thanks for the link! It looks like you are using last_changed and last_updated. My first interpretation of their meaning was that they might not always be updated, which could lead to false positives (e.g. when an entity reports the same value as the previous one). Maybe that is a false conclusion?

Valid point about monitoring all devices. I wanted to categorize my devices because I don’t expect all of them to be active at the same intervals. E.g. my basement motion sensor might be lonely for a few days in a row, whereas my main door motion sensor shouldn’t be. Another case is that some things are season-dependent, like my heating thermostats. During summer they are off / without batteries, and with the tagging I thought I could quickly remove them from the list of devices to be checked. I also play around and test things out, only to realize that I won’t need them; these are often Internet-based integrations (recently weather and solar/PV forecasts). I don’t want such things on my ‘offline’ list.
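Different intervals per category could be expressed as one label per interval. In the sketch below only ‘Availability check: daily’ comes from this thread. The weekly label and the 168-hour value are assumptions added for illustration.

```jinja
{# Map each label to its allowed silence in hours; the weekly label is hypothetical #}
{%- set checks = {
      'Availability check: daily': 24,
      'Availability check: weekly': 168
    } %}
{%- for label, hours in checks.items() %}
  {%- set cutoff = now() - timedelta(hours=hours) %}
  {%- for d in label_devices(label) %}
    {%- set stamps = expand(device_entities(d)) | map(attribute='last_reported') | list %}
    {# report the device if its newest contact is older than this label's cutoff #}
    {%- if stamps | count > 0 and (stamps | max) < cutoff %}
{{ label }}: {{ device_attr(d, 'name_by_user') or device_attr(d, 'name') }} (silent for {{ now() - (stamps | max) }})
    {%- endif %}
  {%- endfor %}
{%- endfor %}
```

Seasonal devices such as the thermostats could then be taken out of monitoring by simply removing their label.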

I’m not sure about the comment re: ‘why check devices just to detect that they are fine’. Indeed, if a device is fine, I don’t want to know. That’s why only devices that have been offline for too long (in my current case 24 hours) are listed.

As for the sensors/binary sensors, I guess I don’t know enough to really comment. Do things like physical buttons and lights also fall into that category? Because I also care about them, despite them only responding to a physical interaction. E.g. it is rare that my outdoor lights, triggered at night via motion sensors, don’t go on at least once per night. So if one bulb doesn’t, the likelihood of a problem is high and I want to know.

Have you considered building in a bit of redundancy? Two sensors with different protocols - one Zigbee and one wi-fi, or something?

The power monitoring smart plugs that are in my cupboard right now? The Christmas lights?