Home Assistant Integrations Global Watchdog (maintenance)

Hello everyone,
This is my first post here (excuse my French), and I thought I’d share a new automation I am testing. Disclaimer: I am getting help from AI to create it.

The idea is to have an automation that runs in the background and reloads any integration whose entities have become “unavailable” or “unknown”.
A couple of my integrations have dropped out a few times for no explainable reason, so I thought I’d mitigate the problem by automating their reload.

In more detail, the automation runs every five minutes. :stopwatch:

  1. If an entity is OK, the automation does nothing with it. :white_check_mark:
  2. If an entity has been KO for at least 10 minutes, the automation tries once to reload the integration it belongs to. :arrows_clockwise:
     • If the reload succeeds, nothing else happens (the automation works as it should). :white_check_mark:
     • If the reload does not succeed, then:
       1. It adds all the entities that are still KO to a helper (input_text.watchdog_global_exclusions_temp). This excludes them from upcoming global watchdog scans until the underlying problem is fixed.
       2. It adds them to a to-do list (todo.watchdog_issues) so that I remember that those entities have a problem and that I should work on them.
       3. It sends me a push notification telling me which entities have a problem and asking whether I want to “clear” or “rearm”.

Clear (the “Acquitter” button and notification action) means that the entities stay in the helper and are ignored by the next checks, so the automation doesn’t keep reloading them. This holds until the entities are erased from the helper.

Rearm (the “Réarmer” button and notification action) means that the entities are deleted from the helper and that the automation will scan them again.
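
To make the helper format concrete: the excluded entity IDs are stored as a single string separated by “|”. Below is a minimal, untested sketch of how one entity could be erased from the helper without rearming everything. The script name watchdog_global_rearm_one and its field entity are hypothetical and not part of the setup in this post.

```yaml
script:
  watchdog_global_rearm_one:        # hypothetical script name
    alias: "Watchdog global - Rearm one entity"
    fields:
      entity:                       # hypothetical field name
        description: "Entity ID to remove from the temporary exclusions"
        example: "sensor.example_a"
    sequence:
      - service: input_text.set_value
        target:
          entity_id: input_text.watchdog_global_exclusions_temp
        data:
          # Split the stored string, drop the requested entity, join it back.
          value: >-
            {% set raw = states('input_text.watchdog_global_exclusions_temp') %}
            {% set current = raw.split('|') if raw not in ['unknown', 'unavailable', 'none', ''] else [] %}
            {{ current | reject('eq', entity) | join('|') }}
```

Calling it with entity: sensor.example_a (a made-up entity ID) would leave every other exclusion in place, so only that entity is picked up again by the next scan.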

Alongside this automation, I have a “maintenance” dashboard with the to-do list and the clear and rearm buttons.

Here is the code for the helpers, the template sensors and the automations:

input_text:
  watchdog_global_exclusions_temp:
    name: "Watchdog global — Exclusions temporaires"
    max: 255  # 255 is the hard limit of input_text; a longer exclusion list will be rejected
    initial: ""

input_datetime:
  watchdog_global_dernier_reload:
    name: "Watchdog global — Dernier reload"
    has_date: true
    has_time: true

  watchdog_global_derniere_alerte:
    name: "Watchdog global — Dernière alerte"
    has_date: true
    has_time: true

  watchdog_global_derniere_mise_en_pause:
    name: "Watchdog global — Dernière mise en pause"
    has_date: true
    has_time: true

input_button:
  watchdog_global_acquitter:
    name: "Watchdog — Acquitter"
    icon: mdi:check-circle-outline

  watchdog_global_rearmer:
    name: "Watchdog — Réarmer"
    icon: mdi:reload

template:
  - sensor:
      - name: "Watchdog global — Entités KO"
        unique_id: watchdog_global_entites_ko
        icon: mdi:heart-broken
        state: >
          {% set ignored_domains = [
            'automation', 'script', 'scene', 'group', 'sun', 'person', 'zone',
            'button', 'event', 'update'
          ] %}
          {% set ignored_entities = [
            'sensor.watchdog_global_entites_ko',
            'sensor.watchdog_global_integrations_ko',
            'input_text.watchdog_global_exclusions_temp',
            'input_datetime.watchdog_global_dernier_reload',
            'input_datetime.watchdog_global_derniere_alerte',
            'input_datetime.watchdog_global_derniere_mise_en_pause',
            'input_button.watchdog_global_acquitter',
            'input_button.watchdog_global_rearmer',
            'sensor.onduleur_production_cumulee',
            'sensor.mesure_du_courant_exporte'
          ] %}
          {% set temp_excluded_raw = states('input_text.watchdog_global_exclusions_temp') %}
          {% set temp_excluded = temp_excluded_raw.split('|') if temp_excluded_raw not in ['unknown','unavailable','none',''] else [] %}
          {% set ns = namespace(items=[]) %}

          {% for s in states %}
            {% set domain = s.entity_id.split('.')[0] %}
            {% if domain not in ignored_domains
                  and s.entity_id not in ignored_entities
                  and s.entity_id not in temp_excluded
                  and s.state in ['unknown', 'unavailable'] %}
              {% set age_min = (as_timestamp(now()) - as_timestamp(s.last_changed)) / 60 %}
              {% if age_min >= 10 %}
                {% set ns.items = ns.items + [s.entity_id] %}
              {% endif %}
            {% endif %}
          {% endfor %}

          {{ ns.items | count }}

        attributes:
          entities: >
            {% set ignored_domains = [
              'automation', 'script', 'scene', 'group', 'sun', 'person', 'zone',
              'button', 'event', 'update'
            ] %}
            {% set ignored_entities = [
              'sensor.watchdog_global_entites_ko',
              'sensor.watchdog_global_integrations_ko',
              'input_text.watchdog_global_exclusions_temp',
              'input_datetime.watchdog_global_dernier_reload',
              'input_datetime.watchdog_global_derniere_alerte',
              'input_datetime.watchdog_global_derniere_mise_en_pause',
              'input_button.watchdog_global_acquitter',
              'input_button.watchdog_global_rearmer',
              'sensor.onduleur_production_cumulee',
              'sensor.mesure_du_courant_exporte'
            ] %}
            {% set temp_excluded_raw = states('input_text.watchdog_global_exclusions_temp') %}
            {% set temp_excluded = temp_excluded_raw.split('|') if temp_excluded_raw not in ['unknown','unavailable','none',''] else [] %}
            {% set ns = namespace(items=[]) %}

            {% for s in states %}
              {% set domain = s.entity_id.split('.')[0] %}
              {% if domain not in ignored_domains
                    and s.entity_id not in ignored_entities
                    and s.entity_id not in temp_excluded
                    and s.state in ['unknown', 'unavailable'] %}
                {% set age_min = (as_timestamp(now()) - as_timestamp(s.last_changed)) / 60 %}
                {% if age_min >= 10 %}
                  {% set ns.items = ns.items + [s.entity_id] %}
                {% endif %}
              {% endif %}
            {% endfor %}

            {{ ns.items }}

      - name: "Watchdog global — Intégrations KO"
        unique_id: watchdog_global_integrations_ko
        icon: mdi:puzzle-remove
        state: >
          {% set bad_entities = state_attr('sensor.watchdog_global_entites_ko', 'entities') | default([], true) %}
          {% set ns = namespace(ids=[]) %}
          {% for e in bad_entities %}
            {% set cid = config_entry_id(e) %}
            {% if cid is not none %}
              {% set ns.ids = ns.ids + [cid] %}
            {% endif %}
          {% endfor %}
          {{ ns.ids | unique | list | count }}

        attributes:
          entry_ids: >
            {% set bad_entities = state_attr('sensor.watchdog_global_entites_ko', 'entities') | default([], true) %}
            {% set ns = namespace(ids=[]) %}
            {% for e in bad_entities %}
              {% set cid = config_entry_id(e) %}
              {% if cid is not none %}
                {% set ns.ids = ns.ids + [cid] %}
              {% endif %}
            {% endfor %}
            {{ ns.ids | unique | list }}

automation:
  - id: watchdog_global_reload_integrations
    alias: "Watchdog global — Reload intégrations KO"
    description: >
      Watches for entities that have been unknown/unavailable for at least 10 minutes,
      tries to reload their integration, then alerts if the problem persists.
      Excludes Huawei Solar, which is handled by its own dedicated watchdog.
    mode: single

    trigger:
      - platform: time_pattern
        minutes: "/5"

    variables:
      reload_cooldown_minutes: 60
      alert_cooldown_minutes: 60

      bad_entities: "{{ state_attr('sensor.watchdog_global_entites_ko', 'entities') | default([], true) }}"
      entry_ids: "{{ state_attr('sensor.watchdog_global_integrations_ko', 'entry_ids') | default([], true) }}"

      last_reload: "{{ states('input_datetime.watchdog_global_dernier_reload') }}"
      can_reload: >
        {% if last_reload in ['unknown', 'unavailable', 'none', ''] %}
          true
        {% else %}
          {{ (as_timestamp(now()) - as_timestamp(as_datetime(last_reload))) > (reload_cooldown_minutes * 60) }}
        {% endif %}

      last_alert: "{{ states('input_datetime.watchdog_global_derniere_alerte') }}"
      can_alert: >
        {% if last_alert in ['unknown', 'unavailable', 'none', ''] %}
          true
        {% else %}
          {{ (as_timestamp(now()) - as_timestamp(as_datetime(last_alert))) > (alert_cooldown_minutes * 60) }}
        {% endif %}

    condition:
      - condition: template
        value_template: "{{ bad_entities | count > 0 }}"
      - condition: template
        value_template: "{{ entry_ids | count > 0 }}"
      - condition: template
        value_template: "{{ can_reload }}"

    action:
      - repeat:
          for_each: "{{ entry_ids }}"
          sequence:
            - service: homeassistant.reload_config_entry
              data:
                entry_id: "{{ repeat.item }}"
            - delay: "00:00:05"

      - service: input_datetime.set_datetime
        target:
          entity_id: input_datetime.watchdog_global_dernier_reload
        data:
          datetime: "{{ now().isoformat() }}"

      - delay: "00:00:30"

      - variables:
          still_bad: >
            {% set ns = namespace(items=[]) %}
            {% for e in bad_entities %}
              {% if states(e) in ['unknown', 'unavailable'] %}
                {% set ns.items = ns.items + [e] %}
              {% endif %}
            {% endfor %}
            {{ ns.items }}

      - choose:
          - conditions:
              - condition: template
                value_template: "{{ still_bad | count > 0 and can_alert }}"
            sequence:
              - service: input_text.set_value
                target:
                  entity_id: input_text.watchdog_global_exclusions_temp
                data:
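                  # Merge with the entities already excluded instead of overwriting them,
                  # so earlier exclusions survive until a manual rearm.
                  value: >-
                    {% set raw = states('input_text.watchdog_global_exclusions_temp') %}
                    {% set current = raw.split('|') if raw not in ['unknown', 'unavailable', 'none', ''] else [] %}
                    {{ (current + still_bad) | unique | join('|') }}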

              - service: input_datetime.set_datetime
                target:
                  entity_id: input_datetime.watchdog_global_derniere_mise_en_pause
                data:
                  datetime: "{{ now().isoformat() }}"

              - repeat:
                  for_each: "{{ still_bad }}"
                  sequence:
                    - service: todo.add_item
                      target:
                        entity_id: todo.watchdog_issues
                      data:
                        item: "Watchdog: {{ repeat.item }}"
                        description: >
                          Entity still {{ states(repeat.item) }} after an automatic reload attempt
                          on {{ now().strftime('%d.%m.%Y at %H:%M') }}.
                          It has been temporarily excluded until it is acknowledged or rearmed.
                        due_date: "{{ now().date().isoformat() }}"

              - service: persistent_notification.create
                data:
                  notification_id: "watchdog_global_ko"
                  title: "⚠️ Watchdog global: entities still KO"
                  message: >
                    After an automatic reload attempt, some entities are still
                    unknown/unavailable:
                    {{ still_bad | join(', ') }}.

                    They are now excluded from the next cycles until you acknowledge
                    or rearm them manually.

              - service: notify.yannick
                data:
                  title: "⚠️ Watchdog global: entities still KO"
                  message: >
                    Entities still KO after reload: {{ still_bad | join(', ') }}.
                    They are temporarily excluded until you acknowledge them.
                  data:
                    tag: watchdog_global_ko
                    persistent: true
                    sticky: true
                    actions:
                      - action: "WATCHDOG_ACK"
                        title: "Acquitter"
                      - action: "WATCHDOG_REARM"
                        title: "Réarmer"

              - service: input_datetime.set_datetime
                target:
                  entity_id: input_datetime.watchdog_global_derniere_alerte
                data:
                  datetime: "{{ now().isoformat() }}"

  - id: watchdog_global_ack_mobile
    alias: "Watchdog global — Acquittement depuis notification"
    mode: single
    trigger:
      - platform: event
        event_type: mobile_app_notification_action
        event_data:
          action: WATCHDOG_ACK

    action:
      - service: script.alerte_systeme
        data:
          niveau: info
          titre: "Watchdog acknowledged"
          message: >
            The entities that are currently excluded stay paused.
            The backlog remains visible in the to-do list for later handling.

  - id: watchdog_global_rearm_mobile
    alias: "Watchdog global — Réarmement depuis notification"
    mode: single
    trigger:
      - platform: event
        event_type: mobile_app_notification_action
        event_data:
          action: WATCHDOG_REARM

    action:
      - service: input_text.set_value
        target:
          entity_id: input_text.watchdog_global_exclusions_temp
        data:
          value: ""

      - service: persistent_notification.dismiss
        data:
          notification_id: "watchdog_global_ko"

      - service: script.alerte_systeme
        data:
          niveau: info
          titre: "Watchdog rearmed"
          message: "The temporary exclusions of the global watchdog have been cleared."

  - id: watchdog_global_ack_button
    alias: "Watchdog global — Acquittement depuis bouton"
    mode: single
    trigger:
      - platform: state
        entity_id: input_button.watchdog_global_acquitter

    action:
      - service: script.alerte_systeme
        data:
          niveau: info
          titre: "Watchdog acknowledged"
          message: >
            The current exclusions are kept.
            The items remain visible in the to-do list for later handling.

  - id: watchdog_global_rearm_button
    alias: "Watchdog global — Réarmement depuis bouton"
    mode: single
    trigger:
      - platform: state
        entity_id: input_button.watchdog_global_rearmer

    action:
      - service: input_text.set_value
        target:
          entity_id: input_text.watchdog_global_exclusions_temp
        data:
          value: ""

      - service: persistent_notification.dismiss
        data:
          notification_id: "watchdog_global_ko"

      - service: script.alerte_systeme
        data:
          niveau: info
          titre: "Watchdog rearmed"
          message: "The temporary exclusions of the global watchdog have been cleared."
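
The mobile and button variants above run exactly the same actions, so they could probably be merged. A minimal sketch for the rearm case, assuming a new id watchdog_global_rearm (hypothetical) that would replace watchdog_global_rearm_mobile and watchdog_global_rearm_button rather than run next to them:

```yaml
automation:
  - id: watchdog_global_rearm       # hypothetical id, replaces the two rearm automations
    alias: "Watchdog global - Rearm"
    mode: single
    trigger:
      # Fired by the "Réarmer" action of the mobile notification
      - platform: event
        event_type: mobile_app_notification_action
        event_data:
          action: WATCHDOG_REARM
      # Fired by pressing the dashboard button
      - platform: state
        entity_id: input_button.watchdog_global_rearmer
    action:
      # Empty the helper so the excluded entities are scanned again
      - service: input_text.set_value
        target:
          entity_id: input_text.watchdog_global_exclusions_temp
        data:
          value: ""
      - service: persistent_notification.dismiss
        data:
          notification_id: "watchdog_global_ko"
      - service: script.alerte_systeme
        data:
          niveau: info
          titre: "Watchdog rearmed"
          message: "The temporary exclusions of the global watchdog have been cleared."
```

The same pattern should apply to the two acknowledge automations. Either trigger starts the same action list, so the behavior stays the same with half the YAML.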

And here is the code for the dashboard:

views:
  - title: Watchdog
    path: watchdog
    icon: mdi:shield-alert
    cards:
      - type: vertical-stack
        cards:
          - type: entities
            title: Watchdog — État
            entities:
              - entity: sensor.watchdog_global_entites_ko
                name: Entités KO actives
              - entity: sensor.watchdog_global_integrations_ko
                name: Intégrations touchées
              - entity: input_text.watchdog_global_exclusions_temp
                name: Exclusions temporaires
              - entity: input_datetime.watchdog_global_dernier_reload
                name: Dernier reload
              - entity: input_datetime.watchdog_global_derniere_alerte
                name: Dernière alerte
              - entity: input_datetime.watchdog_global_derniere_mise_en_pause
                name: Dernière mise en pause
          - type: horizontal-stack
            cards:
              - type: button
                entity: input_button.watchdog_global_acquitter
                name: Acquitter
                icon: mdi:check-circle-outline
                tap_action:
                  action: perform-action
                  perform_action: input_button.press
                  target:
                    entity_id: input_button.watchdog_global_acquitter
              - type: button
                entity: input_button.watchdog_global_rearmer
                name: Réarmer
                icon: mdi:reload
                tap_action:
                  action: perform-action
                  perform_action: input_button.press
                  target:
                    entity_id: input_button.watchdog_global_rearmer
          - type: todo-list
            entity: todo.watchdog_issues
            title: Watchdog — Backlog
            hide_completed: false
            display_order: duedate_asc 
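
The input_text row shows the exclusions as one long “|”-separated string, which gets hard to read with more than a couple of entities. As a possible addition (not part of my current dashboard), a Markdown card could render them as a list. This is an untested sketch that would go in the cards list of the vertical-stack:

```yaml
type: markdown
title: Watchdog exclusions
content: |
  {% set raw = states('input_text.watchdog_global_exclusions_temp') %}
  {% if raw in ['unknown', 'unavailable', 'none', ''] %}
  No entity is currently excluded.
  {% else %}
  {% for e in raw.split('|') %}
  - {{ e }}
  {% endfor %}
  {% endif %}
```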

It’s definitely a work in progress.
Please comment and give me your thoughts on how to make it better or how you would go about it.

Happy coding fellow smart home owners! :grinning: