Home Assistant Add-on: Hardware watchdog service

About

The add-on activates /dev/watchdog, the hardware watchdog device, so that the server is restarted when it stops responding. For details about the watchdog API see https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.txt.
I tested it on my Raspberry Pi 4, which has a Broadcom BCM2835 watchdog timer that is enabled by default.
The service sends a keepalive to the watchdog timer every 5 seconds. If the system hangs or another software problem stops the keepalives, the hardware restarts the system once the 15-second timeout expires.
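
For readers who want to see what such a keepalive loop looks like, here is a minimal sketch based on the kernel watchdog API document linked above. It is not the add-on's actual code. The function name run_keepalive and its should_stop and sleep parameters are made up for illustration, and the final 'V' write (the "magic close" described in the kernel document) only disarms the timer if the driver supports it.

```python
import time


def run_keepalive(dev, interval=5.0, should_stop=lambda: False, sleep=time.sleep):
    """Feed an already-open watchdog device until should_stop() returns True.

    `dev` is any binary file-like object; on real hardware it would be
    /dev/watchdog opened for writing (an assumption of this sketch).
    """
    try:
        while not should_stop():
            dev.write(b"\0")  # any write to the device counts as a keepalive
            dev.flush()
            sleep(interval)
    finally:
        # Magic close: writing 'V' just before closing asks the driver to
        # disarm the timer. This runs on a normal exit and on Python
        # exceptions, but never after kill -9.
        dev.write(b"V")
        dev.flush()
```

On real hardware this would be called roughly as `with open("/dev/watchdog", "wb", buffering=0) as dev: run_keepalive(dev)`. Keep in mind that opening the device already arms the timer.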

Repository on GitHub

Cool, I’ve been having occasional issues with my Home Assistant becoming completely unresponsive, so I’m happy to see this exists and I’m giving it a try.

Seems like this should really just be a default part of Home Assistant OS.

One thing that would be useful is to track how frequently this is triggered.

I do have an uptime sensor configured (Uptime - Home Assistant) but that won’t let me distinguish restarts due to software updates or config changes from watchdog resets.

I also have set up a notification when HA restarts (similar to the example here Home Assistant restart notification) so at least I can note & track manually.

Personally, I added notifications for HA shutting down and starting, and I think this is enough, because a watchdog-triggered reboot is not a frequent occurrence.
I think the main problem with counting these events is that watchdog restarts are outside the scope of the software: it’s a hardware restart that happens when the software becomes unresponsive. You can’t write anything down because you (as a script, as a program) are probably dead at that moment.
Another way is to count “incorrect” OS startups: create a flag on proper shutdown, and if the flag is missing on startup, treat the previous shutdown as a bad/watchdog shutdown.
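
The flag idea could be sketched like this. It is only an illustration under assumptions: the function names are invented, and the flag file has to live somewhere that survives a reboot (for example a persistent data directory of the add-on).

```python
from pathlib import Path


def mark_clean_shutdown(flag: Path) -> None:
    """Call during an orderly shutdown: leave a flag for the next startup."""
    flag.touch()


def previous_shutdown_was_clean(flag: Path) -> bool:
    """Call once at startup.

    Returns True if the flag left by the last orderly shutdown exists, and
    removes it so that the next startup is judged on its own.
    """
    if flag.exists():
        flag.unlink()
        return True
    return False
```

Note that the very first startup has no flag and would be reported as unclean, and that a power cut looks exactly like a watchdog reset with this approach.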
If anyone has an idea how to count this, don’t hold back :)
Also, if you know how to increment a counter in HA from a Supervisor container, give me an example and I’ll add this function.
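
One possible way to increment a counter from inside an add-on is to call the counter.increment service through the Home Assistant REST API that the Supervisor proxies. This is a hedged sketch, not something confirmed in this thread: it assumes the add-on has Home Assistant API access enabled in its configuration, that the SUPERVISOR_TOKEN environment variable is available, and that a counter helper exists. The entity name counter.watchdog_resets and the function name are hypothetical.

```python
import json
import os
import urllib.request


def build_increment_request(entity_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST that calls the counter.increment service."""
    return urllib.request.Request(
        "http://supervisor/core/api/services/counter/increment",
        data=json.dumps({"entity_id": entity_id}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def increment_counter(entity_id: str = "counter.watchdog_resets") -> None:
    """Send the request; only works from inside an add-on container."""
    req = build_increment_request(entity_id, os.environ["SUPERVISOR_TOKEN"])
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()
```

Combined with a clean-shutdown flag, increment_counter() could be called at startup whenever the previous shutdown was not clean.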

Please add a how-to-install section to your README. I’ve tried adding your repo to the add-on store to no avail.

Done. An automation with HA start/stop notifications is now in the README too.

@alex107 Is the “/dev/watchdog - hardware watchdog device” only active when the add-on is started and running?
Is the watchdog still active when I stop the add-on, or does it get deactivated when I stop the add-on?

The add-on activates (starts) the watchdog timer when it starts and keeps refreshing it while running. On a correct shutdown, the add-on deactivates (stops) the watchdog timer to prevent a hardware reboot.

Is there a way to simulate no response, to test that the watchdog is working correctly?

You can try to kill -9 the add-on process so that it cannot stop the watchdog timer correctly. Don’t forget to disable auto restart of the add-on. On success you will get a hardware restart by the watchdog, just as with a real software problem. It’s not a completely faithful test, but it will show whether the hardware watchdog is working.
P.S.: kill -9 is not the same as docker stop or stopping the add-on: on a stop, the add-on disables the hardware watchdog timer.
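
The difference between the two can be demonstrated without touching the real watchdog. The sketch below (Linux only, all names invented for illustration) starts a child process whose SIGTERM handler creates a marker file, standing in for "disable the watchdog timer", and then checks whether that cleanup ran after SIGTERM versus SIGKILL.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Child program: installs a SIGTERM handler that does its "cleanup"
# (creating a marker file), then idles until it is signalled.
CHILD = r"""
import signal, sys, time
marker = sys.argv[1]
def on_term(signum, frame):
    open(marker, "w").close()  # stands in for "disable the watchdog"
    sys.exit(0)
signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def cleanup_ran(sig: int) -> bool:
    """Start the child, send it `sig`, and report whether its cleanup ran."""
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "cleanup_ran")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker],
            stdout=subprocess.PIPE,
            text=True,
        )
        proc.stdout.readline()  # wait until the handler is installed
        proc.send_signal(sig)
        proc.wait()
        proc.stdout.close()
        return os.path.exists(marker)
```

Calling cleanup_ran(signal.SIGTERM) and cleanup_ran(signal.SIGKILL) shows that the handler only gets a chance to run for SIGTERM. SIGKILL cannot be caught, which is why a killed add-on leaves the timer armed.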

I’m not sure the add-on is working correctly. It shows in red. But the log seems to indicate that it started correctly.
The rest is my system’s software and hardware information.

Log

2023-01-18 16:46:01 INFO Opening watchdog device
2023-01-18 16:46:01 INFO Watchdog identity: Broadcom BCM2835 Watchdog timer
2023-01-18 16:46:01 INFO Watchdog firmware version: 0
2023-01-18 16:46:01 INFO Watchdog options: 33152
2023-01-18 16:46:01 INFO Watchdog timeout: 15
2023-01-18 16:46:01 INFO Starting main cycle with sleep time 5 sec
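
As a side note, the "Watchdog options: 33152" value in the log is a bit mask. A small sketch for decoding it follows; the flag values are the WDIOF_* constants as I understand them from the kernel's linux/watchdog.h header, and the function name is made up.

```python
# WDIOF_* capability bits (values assumed from linux/watchdog.h).
WDIOF_FLAGS = {
    0x0001: "WDIOF_OVERHEAT",
    0x0002: "WDIOF_FANFAULT",
    0x0004: "WDIOF_EXTERN1",
    0x0008: "WDIOF_EXTERN2",
    0x0010: "WDIOF_POWERUNDER",
    0x0020: "WDIOF_CARDRESET",
    0x0040: "WDIOF_POWEROVER",
    0x0080: "WDIOF_SETTIMEOUT",
    0x0100: "WDIOF_MAGICCLOSE",
    0x0200: "WDIOF_PRETIMEOUT",
    0x0400: "WDIOF_ALARMONLY",
    0x8000: "WDIOF_KEEPALIVEPING",
}


def decode_options(options: int) -> list[str]:
    """Return the names of all capability bits set in `options`."""
    return [name for bit, name in WDIOF_FLAGS.items() if options & bit]


print(decode_options(33152))
# ['WDIOF_SETTIMEOUT', 'WDIOF_MAGICCLOSE', 'WDIOF_KEEPALIVEPING']
```

So 33152 (0x8180) would mean the driver accepts a timeout setting, supports the magic close character and supports keepalive pings.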


System

Home Assistant 2023.1.2
Supervisor 2022.12.1
Operating System 9.3
Frontend 20230104.0 - latest

Hardware RPI 3 B+

watchdog
Subsystem:
misc
Device path:
/dev/watchdog
Attributes:
DEVNAME: /dev/watchdog
DEVPATH: /devices/platform/soc/3f100000.watchdog/bcm2835-wdt/misc/watchdog
MAJOR: '10'
MINOR: '130'
SUBSYSTEM: misc

All OK

It works perfectly.

Home Assistant 2023.1.5
Supervisor 2022.12.1
Operating System 9.4

When I first started dreaming about a watchdog for my Pi-based Home Assistant, I envisaged a watchdog keep alive trigger coming from my Node Red logic (so if Node Red stopped, the system would reset).

I understand that this add-on is a service (which I’m not familiar with in a HA-context).

What are the chances that something could stop Node Red from functioning while the watchdog service keeps plugging away happily?

This works great and saved me more than once when away from home and needed to rely on VPN access. HA broke, but came actually back w/o me onsite!
The only thing I wondered: it takes about 30-40 minutes until HA is actually restarted after it becomes unresponsive (and e.g. doesn’t record sensors any longer). Is there any way to make it react quicker?
Thanks, habitoti

Cool, but unfortunately does not work with recent HA. I’m getting OSError: [Errno 16] Resource busy: '/dev/watchdog'.

Hi!
Can’t find anything about it in changelogs.
Can you show the results of “ls -la /dev/watchdog” and “lsof /dev/watchdog”? Maybe they will give us a clue.
Thanks!

UPD: looks like hassio’s lsof works differently, so for me “lsof | grep watchdog” works fine

Sure


ls -la /dev/watchdog returns:
0 crw------- 1 root root 10, 130 Apr 4 2023 /dev/watchdog
and lsof | grep watchdog returns nothing.

Looks like the watchdog device exists and is not used by anyone :-/
The internet says error 16 can also appear if the kernel can’t communicate with the device, but in that case /dev/watchdog should not exist.
You can also check whether the watchdog is enabled in systemd; maybe that is what changed in hassio: “cat /etc/systemd/system.conf | grep -i watch”
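
One caveat with the grep: it also prints the commented-out defaults, and only uncommented lines take effect. If RuntimeWatchdogSec is set to a non-zero value there, systemd itself keeps the watchdog device open, which, as far as I know, would explain a "Resource busy" error for any other program. A rough sketch of filtering the file contents follows; the function name is invented, and drop-in files under system.conf.d are not covered.

```python
def active_watchdog_settings(conf_text: str) -> dict[str, str]:
    """Return uncommented key=value lines whose key mentions 'watchdog'."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments (# or ;) and section headers like [Manager].
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if "watchdog" in key.lower():
            settings[key.strip()] = value.strip()
    return settings
```

On the host it could be used as `active_watchdog_settings(open("/etc/systemd/system.conf").read())`. An empty result means no watchdog option is explicitly enabled in that file.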

Hmm, to check that I should probably log in through SSH to the host system, right? Right now I’m logging in to HA through the SSH add-on, and there is no /etc/systemd directory there.

Right, you need access to the core system. I use port 22222 and user root to connect to the core system, but I can’t recall if I did something special to set that up.