Custom Component: systemd (Run HA as a systemd notify daemon, with watchdog support!)

You can query systemd as to the reason for the last process restart, so the idea is we’d check this field during HA startup and, if the reason is a watchdog failure, send a message via the notify component. We’re already talking to systemd over the d-bus interface, so asking for the reason for the restart is basically trivial.
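As a rough illustration of that d-bus query (a sketch only: it uses the third-party pydbus library and the hass.service unit name that appears later in this thread, and whether the Result property still reflects the previous run after a successful restart may depend on your systemd version):

from pydbus import SystemBus

def last_result(unit="hass.service"):
    bus = SystemBus()
    # The systemd manager object lives on the org.freedesktop.systemd1 bus name.
    manager = bus.get("org.freedesktop.systemd1")
    unit_path = manager.LoadUnit(unit)
    proxy = bus.get("org.freedesktop.systemd1", unit_path)
    # Result is a string such as "success", "timeout" or "watchdog".
    return proxy["org.freedesktop.systemd1.Service"].Result

if __name__ == "__main__":
    print(last_result())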

Now, the only time this wouldn’t work is if you run into a situation where HA is locking up immediately on restart. However, I suspect that sort of failure mode would be fairly rare in unattended situations. (The only time I’ve ever seen it is either after an upgrade of HA or after editing a config file, both of which would have a user there monitoring the restart anyway.)

This is more of a convenience function for people who don’t want to setup telegram, boxcar, SMTP or similar services. If you’re happy with telegram doing the notifications, there’s no reason you have to enable my notification feature once implemented. Your way of doing it is the more robust option (and really goes to showcase the flexibility of systemd).

Thanks for the clarification. At first I didn’t understand how the watchdog feature worked, but after reading your code and the systemd docs it’s much clearer now. Yes, this code should be included by default in HA, given its importance as the central automation software.

No worries, it can be a bit to wrap your head around at first. I come from an embedded hardware design background, where we use actual watchdog ICs (the microcontroller must toggle a specific GPIO line in a specific manner within a specific time interval or the watchdog will physically reset the system), so I suppose I take the concept for granted.

I’m glad you see the value in the concept. My component is pretty lightweight, and pretty much every major Linux distribution runs systemd these days, so I see no reason HA shouldn’t ship with it as part of the core package.

For anyone else still confused as to exactly what a process-watchdog does, I linked to an article in my first post by one of the systemd developers that explains it, but I’ll leave the relevant part here:

First of all, to make software watchdog-supervisable it needs to be patched to send out “I am alive” signals in regular intervals in its event loop. Patching this is relatively easy. First, a daemon needs to read the WATCHDOG_USEC= environment variable. If it is set, it will contain the watchdog interval in usec formatted as ASCII text string, as it is configured for the service. The daemon should then issue sd_notify(“WATCHDOG=1”) calls every half of that interval. A daemon patched this way should transparently support watchdog functionality by checking whether the environment variable is set and honouring the value it is set to.

To enable the software watchdog logic for a service (which has been patched to support the logic pointed out above) it is sufficient to set the WatchdogSec= to the desired failure latency. See systemd.service(5) for details on this setting. This causes WATCHDOG_USEC= to be set for the service’s processes and will cause the service to enter a failure state as soon as no keep-alive ping is received within the configured interval.

If a service enters a failure state as soon as the watchdog logic detects a hang, then this is hardly sufficient to build a reliable system. The next step is to configure whether the service shall be restarted and how often, and what to do if it then still fails. To enable automatic service restarts on failure set Restart=on-failure for the service. To configure how many times a service shall be attempted to be restarted use the combination of StartLimitBurst= and StartLimitInterval= which allow you to configure how often a service may restart within a time interval. If that limit is reached, a special action can be taken. This action is configured with StartLimitAction=. The default is none, i.e. that no further action is taken and the service simply remains in the failure state without any further attempted restarts. The other three possible values are reboot, reboot-force and reboot-immediate. reboot attempts a clean reboot, going through the usual, clean shutdown logic. reboot-force is more abrupt: it will not actually try to cleanly shutdown any services, but immediately kills all remaining services and unmounts all file systems and then forcibly reboots (this way all file systems will be clean but reboot will still be very fast). Finally, reboot-immediate does not attempt to kill any process or unmount any file systems. Instead it just hard reboots the machine without delay. reboot-immediate hence comes closest to a reboot triggered by a hardware watchdog. All these settings are documented in systemd.service(5).

Putting this all together we now have pretty flexible options to watchdog-supervise a specific service and configure automatic restarts of the service if it hangs, plus take ultimate action if that doesn’t help.
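To make the quoted directives concrete, here is a rough, illustrative unit sketch. The unit name, paths and values are placeholders rather than this component’s shipped service file, and note that on newer systemd versions the StartLimit* settings have moved to the [Unit] section:

# /etc/systemd/system/hass.service (illustrative only)
[Unit]
Description=Home Assistant

[Service]
Type=notify
ExecStart=/srv/hass/bin/hass -c /srv/hass
# Fail the unit if no keep-alive ping arrives within 30 seconds.
WatchdogSec=30
# Restart automatically on any failure, including a watchdog timeout.
Restart=on-failure
# Allow at most 5 restarts within 10 minutes; past that, attempt a clean reboot.
StartLimitBurst=5
StartLimitInterval=10min
StartLimitAction=reboot

[Install]
WantedBy=multi-user.target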

Basically, my hass-systemd module allows HA to talk to systemd over d-bus. Periodically, my module has to tell systemd, “Hey, we’re alive!” or systemd assumes HA has locked up, in which case it force-kills the process and restarts it. My module also allows HA to tell systemd when it’s performing startup and shutdown tasks, which allows you to reliably use timeouts for these events as well.
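In code terms, the keep-alive side of that looks roughly like the sketch below. This is only an illustration of the mechanism, not the component’s actual implementation, and it uses the cysystemd binding that comes up later in this thread; the real module may use a different notify/d-bus library.

import os
import time

from cysystemd.daemon import notify, Notification

def watchdog_loop():
    # Tell systemd that startup has finished (for Type=notify services).
    notify(Notification.READY)

    # systemd exposes the configured watchdog interval (in microseconds)
    # via the WATCHDOG_USEC environment variable; absent means no watchdog.
    usec = int(os.environ.get("WATCHDOG_USEC", "0"))
    if not usec:
        return

    # Per the systemd docs quoted above, ping at half the configured interval.
    interval = usec / 2 / 1_000_000
    try:
        while True:
            notify(Notification.WATCHDOG)   # the "I am alive" keep-alive ping
            time.sleep(interval)
    finally:
        notify(Notification.STOPPING)       # announce a clean shutdown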


What is the correct line to debug the component?

logger:
  default: info
  logs:
    homeassistant.components.systemd: debug

I am seeing an issue where the daemon is restarted by systemd when I restart via the HA service call. Basically it’s restarting twice. The messages I get around the OnFailure= directive are:

May 27 09:18:39 ha systemd[1]: hass.service: Main process exited, code=killed, status=9/KILL
May 27 09:18:39 ha systemd[1]: hass.service: Unit entered failed state.
May 27 09:18:39 ha systemd[1]: hass.service: Triggering OnFailure= dependencies.
May 27 09:18:39 ha systemd[1]: hass.service: Failed with result 'timeout'.

I tried increasing the watchdog timeout to 7 minutes, but my guess is the issue isn’t there. If I restart via systemd, it doesn’t trigger twice.

Some testing:
If I restart via HA and then check the dog-pat (watchdog) timestamp:

systemctl show --property=WatchdogTimestamp hass.service
WatchdogTimestamp=Mon 2019-05-27 09:50:43 AEST

Then the Telegram message comes when it triggers the restart, including the systemctl status output:

Active: activating (auto-restart) (Result: timeout) since Mon 2019-05-27 09:51:59 AEST; 17ms ago

That is shorter than half the watchdog time

I can see the notify STATUS is successfully updated, but it somehow fails to notify READY=1, as systemctl keeps reporting ‘deactivating’.
I mean it doesn’t exactly fail: I can see in the log that HA pushes the READY status, but systemd doesn’t receive it or doesn’t acknowledge it.

Made a video so you can see the whole sequence.

https://www.dropbox.com/s/vcoxv3s0jey1tlf/Peek%202019-05-27%2011-20.mp4?dl=0

I even tried using the alternative systemd module:

from cysystemd.daemon import notify, Notification

then

    notify(Notification.READY)

    in the good_dog notify_started function, but it doesn’t make a difference.

First off, thanks for the detailed bug report! (I mean that; I wish other people would put in a fraction of the amount of effort you have to track down a problem before reporting it.)

So, I think I know what’s going on here: I normally never restart HA from within HA. Generally I do it via systemctl restart hass.service, so this issue may have slipped by me. I know it did work at one point, I just completely forgot to test it with future builds.

It looks like systemd is killing HA based on the TimeoutStartSec=/TimeoutStopSec= (TimeoutSec=) values. These are different from the WatchdogSec= value: they control how long systemd waits for a READY or STOPPING notification during startup or shutdown.

This tells me that either:

  1. The hass-systemd component isn’t reporting the new PID after restart, therefore systemd is ignoring READY notifications from it. (Systemd will only accept messages from process IDs it considers valid for that service.)

  2. HA is killing the hass-systemd thread before it has a chance to send a STOPPING notification to systemd, hence the perpetual deactivating status you’re seeing.
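If it helps narrow that down, you can ask systemd what it currently considers the main PID and how the notify and timeout settings are configured (unit name taken from your logs):

systemctl show -p MainPID,NotifyAccess,WatchdogUSec,TimeoutStartUSec,TimeoutStopUSec hass.service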

Give me 24 hours and I’ll have a new build for you to test.

There is no way of proxying or MITM-ing the notify socket, right? So we should be watching the d-bus to see the stop message then, right?

Not sure about this. The PID and the status message are both updated; I can see the log in HA and the subsequent PID change in the systemctl status.

Not sure about point 2 either. But maybe, instead of watching the stop event to send the notify, we should be able to test by intercepting the call_service event for homeassistant.restart.

Correct. D-bus messages include information about the sender (process ID, cgroups, etc.) and systemd uses this information to make sure it only accepts notifications from what it considers the main PID. [I believe there *is* an option you can enable in the service file (NotifyAccess=) that *will* allow it to accept notifications from all child processes and not just the parent.]

Yes, I noticed that too when testing just now.

I think I’ve figured out an easy way to fix your issue (assuming it’s problem #2), basically listening for the restart event (like you suggested). I’m testing it now.

Not sure if the call service intercept will work. I added this

shell_command:
  send_systemd_notify: /usr/bin/systemd-notify STOPPING=1

Then I called the service in HA, and I can see the status shows

deactivating (stop-sigterm)

A bit below that it shows:

Status: "Home Assistant is running."

Which is correct, since I only sent STOPPING. Then I call homeassistant.restart and the same thing happens again.

For this to work you have to set

NotifyAccess=all
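For reference, one way to set that is a drop-in override (the unit name is assumed from the logs above):

# sudo systemctl edit hass.service, then add:
[Service]
NotifyAccess=all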

A dirty workaround is just to disable the STOPPING notification and only use the STATUS messages.

Sorry for the delay, holiday weekend here in the US and all.

It turns out it’s not problem #2. After some additional testing combined with your experimentation I believe I’ve figured out what’s going on. Essentially, we’re catching the ha.stop event and sending a STOPPING notification to systemd. The problem is, since we’re restarting HA it never actually stops; systemd sits there waiting for it to stop until the timeout is reached. This is confirmed by the fact that disabling the ha.stop listener fixes the problem.

So, I think we need to listen for the ha.restart event and, if detected, set a flag. We’ll have the ha.stop listener function check that flag to determine if it should send the STOPPING notification or not.

Alternatively we can have the ha.restart listener simply disable the ha.stop listener.
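A rough sketch of that first (flag-based) option, with assumed names rather than the component’s actual code:

from cysystemd.daemon import notify, Notification

restart_requested = False

def on_call_service(event):
    # Set a flag when a homeassistant.restart service call comes through.
    global restart_requested
    if (event.data.get("domain") == "homeassistant"
            and event.data.get("service") == "restart"):
        restart_requested = True

def on_homeassistant_stop(event):
    # Only tell systemd we're stopping if this is a real shutdown,
    # not the stop phase of a restart.
    if not restart_requested:
        notify(Notification.STOPPING)

# During component setup, something along the lines of:
#   hass.bus.listen(EVENT_CALL_SERVICE, on_call_service)
#   hass.bus.listen(EVENT_HOMEASSISTANT_STOP, on_homeassistant_stop)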

I’m working on this now.

I was looking at core.py, among other files. I did not see an event called restart; I did, however, see an event fired after stop called EVENT_HOMEASSISTANT_CLOSE, but I couldn’t catch it with the listener. I was thinking that event was related to a full stop.

Hi thanks very much for this watchdog, I’m going to try it out. I’m currently using the systemd instructions that are in the docs here:

I have a few quick questions when using this component. Is the best way to restart HA to simply run:

  • sudo systemctl restart hass.service

Also when upgrading HA to a different version, I guess this service should be stopped first? Is something like this OK:

sudo systemctl stop hass.service
cd homeassistant
source bin/activate
python3 -m pip install --upgrade homeassistant
deactivate
sudo systemctl start hass.service

Yes, until I get a chance to implement support for HA’s native restart ability, the best way is to simply issue a systemctl restart *ha-service*.service.

For upgrading HA, after you’ve stopped it and performed the upgrade, I’d go into your HA directory and start it by hand and let it load up fully once:

sudo systemctl stop *ha-service*.service
cd *ha-dir*
source bin/activate
sudo -u *ha-user* pip3 install --upgrade homeassistant
sudo -u *ha-user* hass -c /srv/hass
[Home Assistant Starts...]
[Home Assistant Finishes Loading...]
<CTRL-C>
[Home Assistant Stops...]
sudo systemctl start *ha-service*.service

Obviously change the parts in asterisks to match your installation. The reason I do this is that after an upgrade HA typically updates various packages during startup, which can extend the startup time and cause systemd to kill and restart it. Alternatively you could do a systemctl edit --full *ha-service*.service and extend the watchdog and startup timers to 5+ minutes. This should give HA enough time to update itself before systemd kills it.
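For example, the relevant part of that override might look something like this (the 300-second values are just illustrative):

# sudo systemctl edit --full *ha-service*.service, then in the [Service] section:
TimeoutStartSec=300
WatchdogSec=300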

Let me know if you need more help. :slight_smile:


Thanks, that’s great. I think I’ll try the start-it-by-hand method next time I need to do an HA update; they can take quite a while sometimes.

I’ve only been running the watchdog for 1 day, so far so good. :slight_smile:

Hey guys, so I’ve got a new version coming up here soon. I’ve added support for HA’s native restart functionality. It’s not super elegant, but it works! It also requires a watchdog timeout value of at least 60 seconds, but that shouldn’t be an issue for most people. (Basically when we see a restart request come through we don’t send the STOPPING message to systemd and we immediately pet the watchdog. A 60 second watchdog timeout value should be long enough for HA to stop, start and reactivate our plugin. If you have a ton of components or devices in HA, or have a very slow system, it might require an even longer timeout value.)

Like I said, not the most elegant solution, but it functions.


When do you think you’re going to submit the component to main HA?

Hi, I have a timeout value set to 5 mins as sometimes it takes a while to load HA when it’s just been updated to a later version.

However usually my HA takes less than 1 min to startup. :slight_smile: