No worries, it can be a bit much to wrap your head around at first. I come from an embedded hardware design background, where we use actual watchdog ICs (the microcontroller must toggle a specific GPIO line in a particular pattern within a fixed time interval or the watchdog will physically reset the system), so I suppose I take the concept for granted.
I’m glad you see the value in the concept. My component is lightweight, and nearly every major Linux distribution runs systemd these days, so I see no reason HA shouldn’t ship with it as part of the core package.
For anyone else still wondering exactly what a process watchdog does: in my first post I linked to an article by one of the systemd developers that explains it, but I’ll quote the relevant part here:
First of all, to make software watchdog-supervisable it needs to be patched to send out “I am alive” signals in regular intervals in its event loop. Patching this is relatively easy. First, a daemon needs to read the WATCHDOG_USEC= environment variable. If it is set, it will contain the watchdog interval in usec formatted as ASCII text string, as it is configured for the service. The daemon should then issue sd_notify("WATCHDOG=1") calls every half of that interval. A daemon patched this way should transparently support watchdog functionality by checking whether the environment variable is set and honouring the value it is set to.
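To make that concrete, here’s a minimal Python sketch of the daemon side. The function names are mine, and in practice you’d just use the python-systemd bindings or the sdnotify package rather than hand-rolling the socket protocol, but this is all there is to it:

```python
import os
import socket


def sd_notify(message: str) -> None:
    """Send a notification datagram to systemd.

    This is the raw protocol behind libsystemd's sd_notify(): write a
    datagram to the AF_UNIX socket named in $NOTIFY_SOCKET. A leading
    '@' marks an abstract-namespace socket and maps to a NUL byte.
    """
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return  # not running under systemd, or NotifyAccess= disallows us
    if addr.startswith("@"):
        addr = "\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(addr)
        sock.send(message.encode())


def watchdog_ping_interval() -> float | None:
    """Half of WATCHDOG_USEC=, in seconds, or None if no watchdog is armed."""
    usec = os.environ.get("WATCHDOG_USEC")
    if usec is None:
        return None
    return int(usec) / 1_000_000 / 2
```

Once that’s in place, the event loop just calls sd_notify("WATCHDOG=1") every watchdog_ping_interval() seconds.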
To enable the software watchdog logic for a service (which has been patched to support the logic pointed out above) it is sufficient to set the WatchdogSec= to the desired failure latency. See systemd.service(5) for details on this setting. This causes WATCHDOG_USEC= to be set for the service’s processes and will cause the service to enter a failure state as soon as no keep-alive ping is received within the configured interval.
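On the unit-file side that’s essentially one line. Something like this (the service name, path, and 30-second latency are placeholders, not HA’s real packaging):

```ini
# home-assistant.service (illustrative)
[Service]
# Type=notify makes systemd wait for our READY=1 before considering startup done.
Type=notify
ExecStart=/usr/bin/hass
# Mark the unit failed if no WATCHDOG=1 ping arrives for 30 seconds;
# systemd exports WATCHDOG_USEC=30000000 to the process as a result.
WatchdogSec=30
```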
If a service enters a failure state as soon as the watchdog logic detects a hang, then this is hardly sufficient to build a reliable system. The next step is to configure whether the service shall be restarted and how often, and what to do if it then still fails. To enable automatic service restarts on failure set Restart=on-failure for the service. To configure how many times a service shall be attempted to be restarted use the combination of StartLimitBurst= and StartLimitInterval= which allow you to configure how often a service may restart within a time interval. If that limit is reached, a special action can be taken. This action is configured with StartLimitAction=. The default is none, i.e. that no further action is taken and the service simply remains in the failure state without any further attempted restarts. The other three possible values are reboot, reboot-force and reboot-immediate. reboot attempts a clean reboot, going through the usual, clean shutdown logic. reboot-force is more abrupt: it will not actually try to cleanly shutdown any services, but immediately kills all remaining services and unmounts all file systems and then forcibly reboots (this way all file systems will be clean but reboot will still be very fast). Finally, reboot-immediate does not attempt to kill any process or unmount any file systems. Instead it just hard reboots the machine without delay. reboot-immediate hence comes closest to a reboot triggered by a hardware watchdog. All these settings are documented in systemd.service(5).
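That restart/escalation policy maps onto the unit file like this (example values; note that current systemd spells the rate-limit option StartLimitIntervalSec= and documents all three limit settings in the [Unit] section, per systemd.unit(5)):

```ini
[Unit]
# At most 5 restarts within 10 minutes; beyond that, escalate to a clean reboot.
StartLimitIntervalSec=10min
StartLimitBurst=5
StartLimitAction=reboot

[Service]
WatchdogSec=30
Restart=on-failure
```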
Putting this all together we now have pretty flexible options to watchdog-supervise a specific service and configure automatic restarts of the service if it hangs, plus take ultimate action if that doesn’t help.
Basically, my hass-systemd module allows HA to talk to systemd over D-Bus. Periodically, my module has to tell systemd, “Hey, we’re alive!” or systemd assumes HA has locked up, in which case it force-kills the process and restarts it. The module also lets HA tell systemd when it’s performing startup and shutdown tasks, so systemd can apply reliable timeouts to those phases as well.
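Here’s the shape of that in code, reusing the helpers from the sketch above. This is a stripped-down illustration using the notify-socket protocol, not the actual module code (and the event loop here is a stand-in for HA’s real one):

```python
import asyncio


async def watchdog_task() -> None:
    """Ping systemd at half the configured watchdog interval."""
    interval = watchdog_ping_interval()  # helper from the earlier sketch
    if interval is None:
        return  # WATCHDOG_USEC= unset: no watchdog armed for this unit
    while True:
        sd_notify("WATCHDOG=1")
        await asyncio.sleep(interval)


async def main() -> None:
    # ... do startup work here; systemd's TimeoutStartSec= clock is running ...
    sd_notify("READY=1")              # startup done: unit flips to "active"
    ping = asyncio.create_task(watchdog_task())
    try:
        await asyncio.Event().wait()  # stand-in for HA's actual event loop
    finally:
        sd_notify("STOPPING=1")       # shutdown begins: TimeoutStopSec= applies
        ping.cancel()


asyncio.run(main())
```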