Custom Component: systemd (Run HA as a systemd notify daemon, with watchdog support!)

Hey guys, I just whipped up a little component to allow running HA as a proper systemd daemon, with support for the watchdog!

Now, why would anyone want such a thing? Well, this blog post by one of the systemd developers explains how the systemd watchdog works better than I can, so I won’t go into a lot of detail here. Basically, it’s a way for systemd (which is the backbone of most modern Linux systems) to make sure that HA is actually alive and working.

Let’s say, for example, HA’s core loop locks up somehow, but doesn’t actually crash. In that case, our little component would stop “petting” the watchdog every 15 seconds, and systemd would automatically kill and restart it. You can optionally set up systemd to automatically reboot the system if a service fails more than X times in a Y-minute interval!

When you combine this with a hardware watchdog, which embedded systems like the Raspberry Pi have onboard, you end up with a top-down method that ensures your critical services stay operational. (The hardware watchdog makes sure systemd is functioning, and systemd makes sure your services are functioning.)
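
(If you want the hardware side of that, systemd can drive the hardware watchdog itself. A minimal sketch for /etc/systemd/system.conf, assuming your board exposes /dev/watchdog; the 30-second value is just an example:)

[Manager]
# PID 1 pets the hardware watchdog; if systemd itself ever hangs,
# the hardware resets the machine after 30 seconds
RuntimeWatchdogSec=30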

In addition to systemd watchdog support, I’ve also added systemd notify hooks. In a nutshell, this allows HA to tell systemd when startup has completed and when it’s shutting down, and to report the PID of the main process.

This is important for a number of reasons, but the main one is this: HA can take a while to fully start up, sometimes on the order of minutes! If a process has no way to talk to the init system, the init system has no way to know whether the process has actually finished starting or is hung. Now that we can tell systemd when we’ve finished starting, we can use another cool systemd feature: startup time limits. It’s exactly what it sounds like: if a process doesn’t report a READY status within X seconds, systemd considers it failed and restarts it. There is a shutdown time limit as well, after which systemd can hard-kill the process.
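
To give you an idea of what this looks like in practice, here are the relevant [Service] settings in rough sketch form (hedged: hass.service and the exact values are just examples; see the sample service file mentioned below for the real thing):

[Service]
Type=notify
# Fail startup if no READY=1 notification arrives within 5 minutes
TimeoutStartSec=5min
# Hard-kill the process if shutdown takes longer than 1 minute
TimeoutStopSec=1min
# Consider the service hung if no WATCHDOG=1 ping arrives in time
WatchdogSec=30
Restart=on-failure
# Optional: reboot if the service fails more than 3 times in 5 minutes
# (on newer systemd these three settings live in the [Unit] section)
StartLimitBurst=3
StartLimitInterval=5min
StartLimitAction=reboot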

So, if you want to try out these awesome-sounding features, you can grab my component here.

I’ve been running it for a few days and it seems rock solid, but obviously more testers would be appreciated!

Let me know if the instructions aren’t clear enough, but it should be rather simple to set up. I’ve also provided a sample systemd service file for Home Assistant to get you going.

This is my first true HA component, so feedback on everything from the quality of the code, to bugs, to better ways I could be doing things is welcomed!


I’ve been using this component for a month. So far I can say it works OK, but I haven’t had a system hang yet. The only problem is the reboots: I think in that case HA fails to notify systemd that it has terminated during shutdown, so systemd waits out the HA unit’s timeout (“a stop job is running…”).
I had to reduce the timeout to 3 min; 5 minutes for a reboot seems excessive.

Hey, sorry I’m so late in replying, for some reason I was never notified of your post.

Yes, I concur, 5 minutes is a long timeout. Personally, I use a 60-second timeout, though this hasn’t been an issue for me since the root cause of HA not stopping was fixed in January. I set 5 minutes as the default because HA can take several minutes to start after an upgrade. This can be fixed by using TimeoutStartSec and TimeoutStopSec to give the two separate values. (I use 5m for start and 1m for stop.)
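
(In unit-file terms, that’s just:)

[Service]
TimeoutStartSec=5min
TimeoutStopSec=1min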

I’m about to upload v0.2.0, which converts the component into an integration to future-proof it. I’m also working to get the component integrated into HA itself. I’ve been using it since November with zero adverse side effects. In fact, it’s saved my bacon at least once when I was out of town and HA locked up. The watchdog worked and force-restarted HA.

I’m also working on v0.3, which will add further safeguards: namely, the plugin will monitor other components to make sure they’re functioning, to help with issues where parts of HA lock up but the core keeps running.


When will the 0.2.0 version be available to test?

It should be up on GitHub now! (Sorry about that, I had pushed it to my local server and not GitHub. Doh!)


So as for current HA: create a systemd folder in custom_components, and add __init__.py?

Do you have a way to test this, i.e. to force a hang or a lock in HA? If you just kill the PID, systemd knows about it and restarts it.

Yesterday HA was off, and yes, it’s pretty annoying when it goes down. As a safeguard, I’m going to see if I can do automations directly with Node-RED from deCONZ to ESPHome in case HA fails.

Yes, add __init__.py and manifest.json to a folder named custom_components/systemd. 🙂

Yes, I have another custom component I designed to lock up HA. Let me make sure it still works and I’ll upload it for you. (I used this to test the watchdog functionality of the systemd component during development.)
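
In the meantime, here’s a rough sketch of the idea (not my actual test component; the hang_test domain and service name are made up): an async service handler that does a blocking sleep will stall HA’s event loop, so the WATCHDOG=1 pings stop and the systemd watchdog should fire.

# custom_components/hang_test/__init__.py (hypothetical sketch)
import time

DOMAIN = "hang_test"

async def async_setup(hass, config):
    async def hang(call):
        # A blocking sleep inside a coroutine stalls HA's event loop,
        # so the watchdog stops being petted and systemd kills/restarts HA.
        time.sleep(3600)

    hass.services.async_register(DOMAIN, "hang", hang)
    return True

(You’d also need a minimal manifest.json, same as for the systemd folder above.)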

You can also directly make sure the watchdog functionality is enabled via: systemctl show --property=WatchdogTimestamp hass.service (It’ll show the last time the dog was petted, which should be every 150s by default.)


This is the sort of thing typically found on SO or Server Fault. Just to get a notification when a unit enters the failed state, in case someone needs it:

/etc/systemd/system/notify-telegram@.service

[Unit]
Description=Send Telegram

[Service]
Type=oneshot
ExecStart=/bin/bash -c '/usr/local/bin/send_telegram "<b>[SYSTEMD ALERT %i]:</b>*\n <code>Unit entered failed state, restart triggered\n `/bin/systemctl status %i`</code>"'
User=nobody
Group=systemd-journal

[Install]
WantedBy=multi-user.target

Just add this to the [Unit] section of hass.service:

OnFailure=notify-telegram@%i.service

Just so you have the time when it happened, in case you need to scroll through the logs.

The same can be done if you have a Postfix service running, or any other notification system.
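
And remember to run systemctl daemon-reload after creating or editing the unit files, or systemd won’t pick up the changes.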

Nice little hack there! I had planned to include a similar feature in the future via the Notify component: if the service is restarted due to failure, it will create a persistent notification in HA and optionally send it to a user-defined list of notify entities as well (such as mobile_app).


How will HA know it was restarted due to failure if the failure happens while HA is locked up?

You can query systemd as to the reason for the last process restart, so the idea is we’d check this field during HA startup and, if the reason is WatchdogFailure, send a message via the notify component. We’re already talking to systemd over the D-Bus interface, so it’s basically no extra work to ask for the reason for the restart.

Now, the only time this wouldn’t work is if you run into a situation where HA is locking up immediately on restart. However, I suspect that sort of failure mode would be fairly rare in unattended situations. (The only time I’ve ever seen it is either after an upgrade of HA or after editing a config file, both of which would have a user there monitoring the restart anyway.)
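
(To poke at the same information from the shell: the property systemd actually exposes is called Result, and after a watchdog kill it should read something like this:)

systemctl show --property=Result hass.service
Result=watchdog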

This is more of a convenience function for people who don’t want to set up Telegram, Boxcar, SMTP or similar services. If you’re happy with Telegram doing the notifications, there’s no reason you have to enable my notification feature once it’s implemented. Your way of doing it is the more robust option (and really goes to showcase the flexibility of systemd).

Thanks for the clarification. At the beginning I didn’t get how the watchdog feature worked; having read your code and the systemd docs, it’s clearer now. Yes, this code should be included in HA by default, given its importance as the central automation software.

No worries, it can be a bit to wrap your head around at first. I come from an embedded hardware design background, where we use actual watchdog ICs (the microcontroller must toggle a specific GPIO line in a specific manner within a specific time interval or the watchdog will physically reset the system), so I suppose I take the concept for granted.

I’m glad you see the value in the concept. My component is pretty lightweight, and pretty much every major Linux distribution runs systemd these days, so I see no reason HA shouldn’t ship with it as part of the core package.

For anyone else still confused as to exactly what a process watchdog does: I linked in my first post to an article by one of the systemd developers that explains it, but I’ll leave the relevant part here:

First of all, to make software watchdog-supervisable it needs to be patched to send out “I am alive” signals in regular intervals in its event loop. Patching this is relatively easy. First, a daemon needs to read the WATCHDOG_USEC= environment variable. If it is set, it will contain the watchdog interval in usec formatted as ASCII text string, as it is configured for the service. The daemon should then issue sd_notify(“WATCHDOG=1”) calls every half of that interval. A daemon patched this way should transparently support watchdog functionality by checking whether the environment variable is set and honouring the value it is set to.

To enable the software watchdog logic for a service (which has been patched to support the logic pointed out above) it is sufficient to set the WatchdogSec= to the desired failure latency. See systemd.service(5) for details on this setting. This causes WATCHDOG_USEC= to be set for the service’s processes and will cause the service to enter a failure state as soon as no keep-alive ping is received within the configured interval.

If a service enters a failure state as soon as the watchdog logic detects a hang, then this is hardly sufficient to build a reliable system. The next step is to configure whether the service shall be restarted and how often, and what to do if it then still fails. To enable automatic service restarts on failure set Restart=on-failure for the service. To configure how many times a service shall be attempted to be restarted use the combination of StartLimitBurst= and StartLimitInterval= which allow you to configure how often a service may restart within a time interval. If that limit is reached, a special action can be taken. This action is configured with StartLimitAction=. The default is none, i.e. that no further action is taken and the service simply remains in the failure state without any further attempted restarts. The other three possible values are reboot, reboot-force and reboot-immediate. reboot attempts a clean reboot, going through the usual, clean shutdown logic. reboot-force is more abrupt: it will not actually try to cleanly shutdown any services, but immediately kills all remaining services and unmounts all file systems and then forcibly reboots (this way all file systems will be clean but reboot will still be very fast). Finally, reboot-immediate does not attempt to kill any process or unmount any file systems. Instead it just hard reboots the machine without delay. reboot-immediate hence comes closest to a reboot triggered by a hardware watchdog. All these settings are documented in systemd.service(5).

Putting this all together we now have pretty flexible options to watchdog-supervise a specific service and configure automatic restarts of the service if it hangs, plus take ultimate action if that doesn’t help.

Basically, my hass-systemd module allows HA to talk to systemd over d-bus. Periodically, my module has to tell systemd, “Hey, we’re alive!” or systemd assumes HA has locked up, in which case it force-kills the process and restarts it. My module also allows HA to tell systemd when it’s performing startup and shutdown tasks, which allows you to reliably use timeouts for these events as well.
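
For the curious, the protocol itself is tiny. Here’s a minimal, self-contained sketch of the handshake (my paraphrase of the mechanism, not the actual component code; it just writes datagrams to the Unix socket systemd hands us via NOTIFY_SOCKET):

import os
import socket
import time

def sd_notify(message: str) -> None:
    # systemd passes the notification socket path in NOTIFY_SOCKET
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return  # not started by systemd, so nothing to do
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # abstract socket namespace
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(addr)
        sock.sendall(message.encode())

sd_notify("READY=1")  # startup complete; stops the TimeoutStartSec= clock

# Pet the watchdog every half of WATCHDOG_USEC, per the article above.
# (In a real daemon this would run on a background timer, not block here.)
usec = int(os.environ.get("WATCHDOG_USEC", "0"))
while usec:
    sd_notify("WATCHDOG=1")
    time.sleep(usec / 2 / 1_000_000)

# And on shutdown: sd_notify("STOPPING=1") before cleanup begins.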


What is the correct line to debug the component?

logger:
  default: info
  logs:
    homeassistant.components.systemd: debug

I am seeing an issue where the daemon restarts via systemd once I restart via the HA service call. Basically, it’s restarting twice. The msg I get from the OnFailure directive is:

May 27 09:18:39 ha systemd[1]: hass.service: Main process exited, code=killed, status=9/KILL
May 27 09:18:39 ha systemd[1]: hass.service: Unit entered failed state.
May 27 09:18:39 ha systemd[1]: hass.service: Triggering OnFailure= dependencies.
May 27 09:18:39 ha systemd[1]: hass.service: Failed with result 'timeout'.

I tried increasing the watchdog timeout to 7 minutes, but my guess is the issue is not there. If I restart via systemd, it doesn’t trigger twice.

Some testing:
If I restart via HA, then check the dog-pat timestamp:

systemctl show --property=WatchdogTimestamp hass.service
WatchdogTimestamp=Mon 2019-05-27 09:50:43 AEST

Then the Telegram msg comes when it triggers the restart, including the systemctl status output:

Active: activating (auto-restart) (Result: timeout) since Mon 2019-05-27 09:51:59 AEST; 17ms ago

That is shorter than half the watchdog time.

I can see the notify STATUS successfully updated, but somehow it fails to notify READY=1, as systemctl keeps reporting ‘deactivating’.
I mean it doesn’t exactly fail: I can see in the log that HA pushes the READY status, but systemd doesn’t receive it, or doesn’t acknowledge it.

Made a video so you can see the whole sequence.

https://www.dropbox.com/s/vcoxv3s0jey1tlf/Peek%202019-05-27%2011-20.mp4?dl=0

I even tried using the alternative systemd module:

from cysystemd.daemon import notify, Notification

then

    notify(Notification.READY)

in the good_dog notify_started function, but it doesn’t make a difference.

First off, thanks for the detailed bug report! (I mean that; I wish other people would put in a fraction of the amount of effort you have to track down a problem before reporting it.)

So, I think I know what’s going on here: I normally never restart HA from within HA. Generally I do it via systemctl restart hass.service, so this issue may have slipped by me. I know it did work at one point; I just completely forgot to test it with later builds.

It looks like systemd is killing HA based on the TimeoutSec= (or TimeoutStartSec=/TimeoutStopSec=) value. This is different from the WatchdogSec= value, as it controls how long systemd waits for a READY or STOPPING notification during startup or shutdown.

This tells me that either:

  1. The hass-systemd component isn’t reporting the new PID after restart, therefore systemd is ignoring READY notifications from it. (Systemd will only accept messages from process IDs it considers valid for that service.)

  2. HA is killing the hass-systemd thread before it has a chance to send a STOPPING notification to systemd, hence the perpetual deactivating status you’re seeing.

Give me 24 hours and I’ll have a new build for you to test.

There is no way of proxying or MITMing the notify socket, right? So I should be watching D-Bus to see the stop msg, right?

Not sure about point 1. The PID and the status msg are both updated: I can see the log in HA and the subsequent PID change in the systemctl status.

Not sure about point 2 either. But maybe, instead of watching the stop event, I could test sending the notify by intercepting the call_service event for homeassistant.restart.

Correct. D-Bus messages include information about the sender (process ID, cgroups, etc.) and systemd uses this information to make sure it only accepts notifications from what it considers the main PID. [I believe there *is* an option you can enable in the service file that *will* allow it to accept notifications from all child processes and not just the parent.]
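
(If memory serves, that option is NotifyAccess= in the [Service] section; a minimal sketch:)

[Service]
# Accept sd_notify messages from any process in the service's
# control group, not just the main PID
NotifyAccess=all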