Tenacious Watchdog for Add-ons (Example for Zigbee2MQTT)

Hello,

My second post on the forums here was asking how to make the Zigbee2MQTT Watchdog keep attempting to start the add-on if it kept failing (as it gave up incredibly quickly - something like a minute or two). Unfortunately, nobody had any suggestions.

The reason for this is that if Zigbee2MQTT cannot reach the Zigbee coordinator (in my case, a network-connected coordinator) for 10 minutes or so, it will stop the service. The watchdog will keep restarting it for a little while, but eventually (within another 10 or 15 minutes, if I recall) it will give up too. Note that if you start the server while the coordinator is unreachable, the add-on will fail and the watchdog will give up within about a minute.

This is not ideal in a home lab where I mess around with things quite frequently, and after spending some time in my lab, I’ve had a handful of occasions where I will wander off to go to the bathroom or something, and find out none of my crap works (even though I’d reconnected the coordinator).

Anyway, why spend a few minutes occasionally troubleshooting something that comes up infrequently, when I can spend an entire evening trying to figure out how to prevent it altogether? lol. Here is my over-engineered, automation-based add-on watchdog:

First, we need to enable the Zigbee2MQTT “Running” entity so that we have a condition to check in the automation. I’ve not found any other means by which to get the add-on’s status.

  1. Go to Settings > Devices and Services > Devices. Search for “Zigbee2MQTT” and select it.
  2. Under “Sensors,” expand the entity list by clicking “+4 entities not shown.”
  3. Select “Running.”
  4. Select the “Settings” tab, expand “Advanced settings,” and change the Entity Status to “hidden.” Click “Update.”
  5. It will take about 30 seconds for HA to enable the entity. Refresh the page, and ensure it now automatically shows the entity in the “Sensors” list (without having to unhide it), and shows it as “Running (Hidden).”

Then, add the following automation. Note that it forces an update of the entity because, for some reason, HA only updates the Zigbee2MQTT “Running” entity every 5 minutes. So, this automation may not start trying until up to 5 minutes after the add-on stops. You can optionally create a separate automation that will force an update of the entity on a schedule, but I figured up to 5 minutes was fine for my use. If you happen to know where I can view/edit the YAML of automatically-created entities, please let me know - I’d modify the “scan_interval” on it if that were an option.

This automation will check every minute to see if the add-on is running. If it is, the automation does nothing. If it isn’t, it will attempt to start the service, wait, force-update the entity, and loop back again. It will keep looping, checking the service’s status and attempting to start it again as necessary, and only exits the loop once the service has been running for 5 minutes.

alias: Zigbee2MQTT Watchguard
description: ""
trigger:
  - platform: time_pattern
    minutes: "*"
condition:
  - condition: state
    entity_id: binary_sensor.zigbee2mqtt_running
    state: "off"
    for:
      hours: 0
      minutes: 0
      seconds: 0
action:
  - repeat:
      until:
        - condition: state
          entity_id: binary_sensor.zigbee2mqtt_running
          state: "on"
          for:
            hours: 0
            minutes: 5
            seconds: 0
      sequence:
        - if:
            - condition: state
              entity_id: binary_sensor.zigbee2mqtt_running
              state: "off"
          then:
            - service: hassio.addon_start
              data:
                addon: 45df7312_zigbee2mqtt
            - delay:
                hours: 0
                minutes: 1
                seconds: 0
                milliseconds: 0
            - service: homeassistant.update_entity
              data: {}
              target:
                entity_id: binary_sensor.zigbee2mqtt_running
          else:
            - delay:
                hours: 0
                minutes: 0
                seconds: 30
                milliseconds: 0
            - service: homeassistant.update_entity
              data: {}
              target:
                entity_id: binary_sensor.zigbee2mqtt_running
mode: single
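
The optional scheduled refresh mentioned above could be sketched as a second, separate automation along these lines. This is a minimal, untested sketch: the alias is made up, the one-minute interval is an assumption, and the entity ID is the same binary_sensor.zigbee2mqtt_running used in the watchdog. All it does is force an update of the “Running” entity on a time pattern, so the watchdog’s condition should notice a stopped add-on sooner than the default 5-minute poll would allow.

```yaml
alias: Zigbee2MQTT Running Refresh  # hypothetical name
description: Force a refresh of the add-on "Running" entity on a schedule
trigger:
  - platform: time_pattern
    minutes: "*"  # every minute; assumed interval, adjust to taste
action:
  # Same service call the watchdog uses inside its loop
  - service: homeassistant.update_entity
    data: {}
    target:
      entity_id: binary_sensor.zigbee2mqtt_running
mode: single
```

With something like this in place, the delay before the watchdog reacts should drop from up to 5 minutes to roughly a minute or two, at the cost of polling the Supervisor more often.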

Zigbee2mqtt has never stopped for me, although mine is connected via usb.

If your network connected coordinator is failing connection, then fix your network. Detecting failure and restarting is not really a solution.

I explained my use case in the post. “I” am what is wrong with my network - and this is the solution to that problem. There are other scenarios in which this would resolve reliability issues caused by the add-on’s design flaw, as well.

Another example of why this applies to more than just my use case is a power outage. Suppose the network-based coordinator is in another part of the house so that it is more centrally located, and you don’t want to have it on a UPS (after all, why would you need to be able to control your lights if they don’t have power?), but your server, elsewhere in the house, is on one. If the power goes out for a short enough time that your server stays up, then none of your smart home stuff will work when the power comes back on. A smart home should be reliable, regardless of the circumstances, but the current design of the Zigbee2MQTT add-on is inherently unreliable. This automation fixes that (well, it’s a workaround for it - a true “fix” would need to be done in the add-on).

I suspect that Zigbee2MQTT was originally designed with only USB adapters in mind - that’s the only explanation that makes sense for this being the default behavior. If it were designed around the idea of network coordinators, then the add-on would remain running and retry connecting to the coordinator on an interval.

You are perhaps confusing z2m and the add-on configuration.

I have a similar issue. Sometimes my ethernet-attached coordinator goes offline from a power outage, a reboot of the router/hub, etc., and can take a while to reconnect. This fixes that.
Bloody awesome!

Curious as to what coordinator you use? I got one of those preflashed Athom ones. Had some trouble getting it working using the supplied instructions, but got there in the end without swearing too much or pulling most of my hair out…

I’m not familiar enough with where the functionality/configuration of the base Zigbee2MQTT ends and the add-on’s starts. Good point, though - it is possible (and now that you mention it, likely) that it is down to how the add-on handles it.

I use this one: CC2652P2 Based Zigbee to PoE Coordinator 2022 – TubesZB

So far, it has been solid. No issues with the coordinator itself. If yours is working without any issues, I wouldn’t imagine there would be too much reason to switch, though. I just switched to this to decouple the coordinator from the physical server, so I could move the VM around within my Proxmox cluster as I please.

This is great! I didn’t even realize addons showed up as devices.

Supervisor added restart limits back in August: Set limits on watchdog retries by mdegat01 · Pull Request #3779 · home-assistant/supervisor · GitHub

Some addons may have expected the watchdog to work as before, and not implemented any sort of restart or retry logic. For rtl_433 I am probably going to switch to using s6 services, but it will take a while for the whole ecosystem to do that if that ends up becoming the best practice.

Ah, that could certainly explain why it was designed that way. Thanks for that historical info!

I have added an additional if statement in case it’s already running but we need a restart, in the form of a script (for my use case I prefer to call scripts from automations). Thought I’d share this one, which is based on the main article.


alias: Restart Zigbee2Mqtt
sequence:
  - if:
      - condition: state
        entity_id: binary_sensor.zigbee2mqtt_edge_running
        state: "on"
    then:
      - service: mqtt.publish
        data:
          topic: zigbee2mqtt/bridge/request/restart
          qos: "2"
    alias: If running then restart
  - if:
      - condition: or
        conditions:
          - condition: state
            entity_id: binary_sensor.zigbee2mqtt_edge_running
            state: "off"
          - condition: state
            entity_id: binary_sensor.zigbee2mqtt_edge_running
            state: unavailable
          - condition: state
            entity_id: binary_sensor.zigbee2mqtt_edge_running
            state: unknown
    then:
      - service: hassio.addon_start
        data:
          addon: 45df7312_zigbee2mqtt_edge
    alias: If not running then start
  - delay:
      hours: 0
      minutes: 30
      seconds: 0
      milliseconds: 0
  - service: homeassistant.update_entity
    data: {}
    target:
      entity_id: binary_sensor.zigbee2mqtt_edge_running
mode: single
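
Since the script is meant to be called from automations, here is a minimal sketch of one way that could look. It assumes the script ended up with the entity ID script.restart_zigbee2mqtt (hypothetical; the real ID depends on how the script was named or which key it sits under in scripts.yaml), and the trigger, the “Running” entity being off for 5 minutes, is just an arbitrary example. script.turn_on starts the script without waiting for it to finish, so the automation is not held up by the script’s long delay.

```yaml
alias: Call Restart Zigbee2Mqtt script  # hypothetical name
description: ""
trigger:
  - platform: state
    entity_id: binary_sensor.zigbee2mqtt_edge_running
    to: "off"
    for:
      minutes: 5  # assumed threshold, pick whatever suits you
action:
  # Fire-and-forget: does not block on the script's delay
  - service: script.turn_on
    target:
      entity_id: script.restart_zigbee2mqtt  # assumed script entity ID
mode: single
```

The script YAML itself would normally be pasted into the script editor’s YAML mode, or placed under its own key in scripts.yaml if you manage scripts by hand.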

How can I implement this script?