How do you monitor your various devices and services?

So I’m at a point where I have a lot of stuff set up and integrated into HA like Docker containers, VMs and various devices or integrations.

It’s inevitable that something goes wrong at some point, like my Synology drive failing, which I didn’t notice at first. I’ve now set up an automation based on the sensors I get from the Synology integration.
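For anyone wondering what such an automation might look like, here is a rough sketch rather than my exact config. The entity ID `binary_sensor.nas_drive_1_exceeded_max_bad_sectors` is a made-up placeholder, so swap in whichever drive sensors your Synology integration actually exposes. It uses a persistent notification only to keep the example self-contained.

```yaml
- alias: 'Synology Drive Problem Alert'
  mode: single
  triggers:
  - trigger: state
    # Hypothetical entity ID; replace with a drive sensor from your own setup.
    entity_id: binary_sensor.nas_drive_1_exceeded_max_bad_sectors
    to: 'on'
  actions:
  # Shows a notification in the Home Assistant UI; use any notify action you prefer.
  - action: persistent_notification.create
    data:
      title: "Synology drive problem"
      message: "{{ trigger.to_state.name }} reports a problem. Check the NAS."
```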

My question is very simple: Do you monitor all the stuff above somehow and, if yes, how? If not, why not?

Thanks for any input!

I use a markdown card on a dashboard that shows all of the entities that are unavailable or unknown. Once you have that set up, you can tweak the code to ignore the ones you expect to be unknown or unavailable for whatever reason (note the “rejectattr” lines in the code below). Here is my YAML. An auto-entities card can be used as well, but its spacing is horrible and the card takes up a ton of space, so this one keeps it very small and concise with no wasted space:

type: markdown
title: Unknown/Unavailable Sensors
content: >
  {% set sensor_issues = states
     | selectattr('state', 'in', ['unknown','unavailable'])
     | rejectattr('entity_id', 'match', 'device_tracker.')
     | rejectattr('name', 'search', 'S23|current|Amarmfob|Reconnect|Reboot|Repairs|Reload|Restart|Alexa|holding|Estate|Room|Vibration|Backup|WS2812B|BasementWaterMeter|WAN|Identify|AlarmFob|Home Assistant Cloud|Master_Controller|Activity status|Clear hold|AirNow|Take Snapshot|Ding Dong|Last Ding|Last Pressed')
     | sort(attribute='last_changed', reverse=true)
     | list %}
  {% if sensor_issues | length > 0 %} {% for s in sensor_issues %} {{
  as_local(s.last_changed).strftime('%I:%M:%S %p') }}: {{ s.name }}

  {% endfor %} {% else %} None  - All Good! {% endif %}

The above sorts everything by the time it became unavailable or unknown, and I find it very convenient. I just restarted a bunch of services, so my list is big and then shrinks pretty quickly as everything wakes up/kicks in. Here is the top of the card so you get an idea of how it works/what it looks like. If there is nothing unavailable or unknown (that I care about), the logic shows the text “None - All Good!”.

This is of course only part of the monitoring; I have ways to monitor other things as well.


Automation:

- id: f2917319-23f4-4f1c-9dca-ea4d1ccc2969
  alias: 'Unavailable Entities Alert'
  mode: single
  max_exceeded: silent
  triggers:
  - trigger: state
    entity_id: sensor.unavailable_entities
    not_to:
    - unknown
    - unavailable
    for: 180
  conditions:
  - condition: numeric_state
    entity_id: sensor.unavailable_entities
    above: 0
  actions:
  - action: notify.telegram_alert
    data:
      title: "⚠️<b>Unavailable Entities</b>"
      message: >
        The following entities are unavailable:

        {{ state_attr('sensor.unavailable_entities','entity_ids') }}

Triggered template sensor:

- trigger:
    - trigger: time_pattern
      minutes: "/3"
  sensor:
    - name: Unavailable Entities
      unique_id: 272d21b8-48ab-4e65-8000-32c7ea62deb2
      state_class: measurement
      unit_of_measurement: ents
      icon: "mdi:cancel"
      state: >
        {% set ignore_list = states('input_text.ignore_list').split(',') | map('trim') | list %}
        {{ states
          | selectattr('state','eq', 'unavailable')
          | rejectattr('entity_id', 'in', ignore_list)
          | map(attribute='entity_id')
          | list
          | count
        }}
      attributes:
        entity_ids: >
          {% set ignore_list = states('input_text.ignore_list').split(',') | map('trim') | list %}
          {{ states
            | selectattr('state','eq', 'unavailable')
            | rejectattr('entity_id', 'in', ignore_list)
            | map(attribute='entity_id')
            | list 
            | join(', \n')
          }}

Input text:

ignore_list:
  name: Ignore List
  icon: mdi:file-document-remove-outline
  max: 255

Surprisingly, it does not go off that often.

I also have alerts for system resources (RAM, CPU, etc.).
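A resource alert can follow the same pattern as the automation above. This is only a minimal sketch: `sensor.processor_use` is an assumed entity ID (the System Monitor integration may name it differently on your install), and the 90% threshold and 5-minute window are arbitrary example values.

```yaml
- alias: 'High CPU Alert'
  mode: single
  triggers:
  - trigger: numeric_state
    # Assumed entity ID; check the actual name of your CPU usage sensor.
    entity_id: sensor.processor_use
    above: 90
    # Only fire if usage stays above the threshold for 5 minutes.
    for:
      minutes: 5
  actions:
  - action: notify.telegram_alert
    data:
      title: "High CPU"
      message: >
        CPU usage has been above 90% for 5 minutes
        (now {{ states('sensor.processor_use') }}%).
```

The `for:` block keeps short spikes from triggering a notification, in the same way the 180-second delay does for the unavailable-entities alert.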
