Home Assistant Add-on: Hardware watchdog service

habitoti · March 3, 2023, 12:07pm

This works great and has saved me more than once when I was away from home and had to rely on VPN access. HA broke, but actually came back without me being on site!
The only thing I wondered: it takes about 30-40 minutes until HA is actually restarted after it becomes unresponsive (and, e.g., no longer records sensors…). Is there any way to make it react quicker?
Thanks, habitoti

zdenekm · November 29, 2023, 10:47am

Cool, but unfortunately it does not work with recent HA. I’m getting OSError: [Errno 16] Resource busy: '/dev/watchdog'.

alex107 · November 29, 2023, 11:37am

Hi!
I can’t find anything about it in the changelogs.
Can you show the results of “ls -la /dev/watchdog” and “lsof /dev/watchdog”? Maybe they can give us a clue.
Thanks!

UPD: it looks like hassio’s lsof works differently, so for me “lsof | grep watchdog” works fine.

zdenekm · November 29, 2023, 12:20pm

Sure…

ls -la /dev/watchdog returns:

0 crw------- 1 root root 10, 130 Apr 4 2023 /dev/watchdog

and lsof | grep watchdog returns nothing.

alex107 · November 29, 2023, 6:09pm

It looks like the watchdog device exists and is not used by anyone :-/
The internet tells us that error 16 can also appear if the kernel can’t communicate with the device, but in that case /dev/watchdog should not exist.
You can also check whether the watchdog is enabled in systemd; maybe that’s the thing that was changed in hassio: “cat /etc/systemd/system.conf | grep -i watch”

zdenekm · November 30, 2023, 7:36am

Hmm, to check that I should probably log in to the host system through SSH, right? Right now I’m logging in to HA through the SSH add-on, and there is no /etc/systemd directory.

alex107 · November 30, 2023, 7:58am

Right, you need access to the core system. I use port 22222 and user root to connect to the core system, but I can’t recall whether I did anything special to set it up.

zdenekm · November 30, 2023, 8:10am

I’m there. It seems that the watchdog is probably disabled.

# cat /etc/systemd/system.conf | grep -i watch
#RuntimeWatchdogSec=off
#RuntimeWatchdogPreSec=off
#RuntimeWatchdogPreGovernor=
#RebootWatchdogSec=10min
#KExecWatchdogSec=off
#WatchdogDevice=

zdenekm · November 30, 2023, 9:16am

Maybe this is also interesting.

# systemctl show | grep -i watchdog
WatchdogDevice=/dev/watchdog0
WatchdogLastPingTimestamp=Thu 2023-11-30 09:16:17 UTC
WatchdogLastPingTimestampMonotonic=82726617867
RuntimeWatchdogUSec=infinity
RuntimeWatchdogPreUSec=0
RebootWatchdogUSec=10min
KExecWatchdogUSec=0
ServiceWatchdogs=yes

zdenekm · November 30, 2023, 9:31am

# lsof | grep watchdog
1	/usr/lib/systemd/systemd	9	/dev/watchdog0
2753	/package/admin/s6-2.11.3.2/command/s6-supervise	3	/run/s6/legacy-services/watchdog/supervise/lock
2753	/package/admin/s6-2.11.3.2/command/s6-supervise	4	/run/s6/legacy-services/watchdog/supervise/control
2753	/package/admin/s6-2.11.3.2/command/s6-supervise	5	/run/s6/legacy-services/watchdog/supervise/control

Maybe the addon should use /sys/class/watchdog/watchdog0/dev file instead?
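
Reading the output above: PID 1 (systemd) has /dev/watchdog0 open, while the s6-supervise entries are supervision files for a service named watchdog, not the device itself. A minimal Python sketch that filters such output for processes holding a watchdog node. It assumes the whitespace-separated "PID, command, fd, path" layout shown above, and `watchdog_holders` is a hypothetical helper name:

```python
def watchdog_holders(lsof_output):
    """Return (pid, command, path) for every open /dev/watchdog* node.

    Assumes each line has four whitespace-separated fields:
    PID, command, file descriptor, path (the layout shown in this post).
    """
    holders = []
    for line in lsof_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        pid, command, _fd, path = fields
        if pid.isdigit() and path.startswith("/dev/watchdog"):
            holders.append((int(pid), command, path))
    return holders


# Two of the lines from the output above, used as sample input.
SAMPLE = (
    "1\t/usr/lib/systemd/systemd\t9\t/dev/watchdog0\n"
    "2753\t/package/admin/s6-2.11.3.2/command/s6-supervise\t3\t"
    "/run/s6/legacy-services/watchdog/supervise/lock\n"
)

print(watchdog_holders(SAMPLE))
```

With the sample above, only the systemd entry is reported as a holder of the device.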

alex107 · November 30, 2023, 9:48am

Thanks for all your checks! It looks like the hardware watchdog is enabled in your systemd. By default it’s disabled, so maybe there is some other place where it gets enabled, or maybe it’s enabled by default in newer hassio. Good news - there is no reason for you to use an extra add-on.

zdenekm · November 30, 2023, 10:59am

I’m not sure the watchdog is actually being used. I’m suffering from random crashes (HA itself somehow works, but the Supervisor and all add-ons are dead, and the Observer can’t be reached), and I always need to do a power cycle to get out of it.

zdenekm · November 30, 2023, 11:05am

This pull request is where the watchdog was enabled in systemd: “Enable watchdog control in systemd” by sbyx (Pull Request #2628, home-assistant/operating-system on GitHub).

alex107 · November 30, 2023, 11:48am

As this documentation says:

RebootWatchdogSec= may be used to configure the hardware watchdog when the system is asked to reboot. It works as a safety net to ensure that the reboot takes place even if a clean reboot attempt times out.

RuntimeWatchdogSec … will be programmed to automatically reboot the system if it is not contacted within the specified timeout interval.

So, according to your systemd configuration (and the commit you mentioned), it looks like the hardware watchdog is only used during reboots, not at runtime (a strange decision IMHO; I mean this commit).

spanzetta · December 3, 2023, 6:06pm

I have only now realized that the Watchdog add-on, which I had been using for over a year, is now disabled… and can’t start.

I don’t know exactly when it became disabled… but it has to be related to some recent HA updates.

Using Raspberry PI3A+

This is the log output:

2023-12-03 19:13:44 INFO Opening watchdog device
Traceback (most recent call last):
  File "/watchdog.py", line 42, in <module>
    app.run()
  File "/watchdog.py", line 20, in run
    self.wdt = watchdog('/dev/watchdog')
OSError: [Errno 16] Resource busy: '/dev/watchdog'
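
For context on this traceback: the Linux watchdog core lets only one process hold the device open at a time, and /dev/watchdog is the legacy alias of the first watchdog (/dev/watchdog0), so a second open fails with EBUSY (errno 16 on Linux) once another process such as systemd holds it. A minimal Python sketch of a probe that tells this case apart from a missing or inaccessible device. `probe_watchdog` and its `opener` parameter are hypothetical names and not part of the add-on:

```python
import errno
import os

WATCHDOG_PATH = "/dev/watchdog"  # path taken from the traceback above


def probe_watchdog(path=WATCHDOG_PATH, opener=os.open):
    """Classify a watchdog device node as 'free', 'busy', 'missing' or 'denied'.

    WARNING: a successful open arms the hardware timer on most drivers.
    The "magic close" character below asks the driver to disarm on close,
    but drivers configured with nowayout ignore it, so run this on a real
    host only if a reboot is acceptable.
    """
    try:
        fd = opener(path, os.O_WRONLY)
    except OSError as exc:
        if exc.errno == errno.EBUSY:
            return "busy"      # another process already holds the device
        if exc.errno == errno.ENOENT:
            return "missing"   # no such device node
        if exc.errno in (errno.EACCES, errno.EPERM):
            return "denied"    # not enough privileges
        raise
    try:
        os.write(fd, b"V")     # magic close: request disarm before closing
    finally:
        os.close(fd)
    return "free"
```

The `opener` parameter exists only so the error handling can be exercised with a stand-in function, without touching the real device.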

matteorossininchi · December 3, 2023, 6:07pm

I’m having the same issue: my HA has been getting stuck over the last few days, and I haven’t been able to find the cause yet, but I definitely like the idea of the system rebooting by itself if needed.
When I try to start the add-on, I get the same “reply” saying the watchdog is already in use. Did you manage to disable the existing use of the watchdog, and are you now able to use the add-on?
The way the watchdog is being used now doesn’t restart the RPi4 when needed…

Thanks

spanzetta · December 8, 2023, 5:17pm

Currently, after the update to 2023.12, my RPI is totally unresponsive (but the Observer reports no problem) and the watchdog is NOT restarting my RPI.

spanzetta · December 11, 2023, 3:34pm

Now the situation is totally different… after the very latest update (see the versions below),
HA is restarting spontaneously multiple times per day…
So a watchdog is indeed working… and it must be the embedded one, since I removed the add-on that I had used for quite a long time.

Core 2023.12.1
Supervisor 2023.11.6
Operating System 11.2
Frontend 20231030.2

I noticed that every time a restart happens, it is because of the following:

23-12-11 15:08:00 ERROR (MainThread) [supervisor.homeassistant.api] Error on call https://172.30.32.1:8123/api/core/state:
23-12-11 15:08:04 ERROR (MainThread) [supervisor.homeassistant.api] Error on call https://172.30.32.1:8123/api/core/state:
23-12-11 15:08:04 ERROR (MainThread) [supervisor.misc.tasks] Watchdog found a problem with Home Assistant API!
23-12-11 15:08:12 INFO (SyncWorker_0) [supervisor.docker.manager] Restarting homeassistant
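
These lines come from the Supervisor's own software watchdog (logger supervisor.misc.tasks), which restarts the homeassistant container; that is a different mechanism from the hardware watchdog behind /dev/watchdog. To count how often it fires, a minimal Python sketch that scans a Supervisor log for those events. It assumes the line layout shown above, and `watchdog_restarts` is a hypothetical helper name:

```python
import re
from datetime import datetime

# Layout assumed from the lines above: "YY-MM-DD HH:MM:SS LEVEL (thread) [logger] message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) \((?P<thread>[^)]+)\) \[(?P<logger>[^\]]+)\] (?P<msg>.*)$"
)


def watchdog_restarts(log_text):
    """Return the timestamps at which the Supervisor watchdog flagged the API."""
    hits = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        if m["logger"] == "supervisor.misc.tasks" and "Watchdog found a problem" in m["msg"]:
            hits.append(datetime.strptime(m["ts"], "%y-%m-%d %H:%M:%S"))
    return hits


# The four log lines quoted above, used as sample input.
SAMPLE_LOG = """\
23-12-11 15:08:00 ERROR (MainThread) [supervisor.homeassistant.api] Error on call https://172.30.32.1:8123/api/core/state:
23-12-11 15:08:04 ERROR (MainThread) [supervisor.homeassistant.api] Error on call https://172.30.32.1:8123/api/core/state:
23-12-11 15:08:04 ERROR (MainThread) [supervisor.misc.tasks] Watchdog found a problem with Home Assistant API!
23-12-11 15:08:12 INFO (SyncWorker_0) [supervisor.docker.manager] Restarting homeassistant
"""

for when in watchdog_restarts(SAMPLE_LOG):
    print(when.isoformat())
```

Run over a full day of logs, the length of the returned list gives the number of watchdog-triggered restarts.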

spanzetta · February 7, 2024, 12:17pm

The situation has changed again…

See the details in my latest post here:
“Home Assistant automatic restart for API call error?” (#30 by DarthJacks, Home Assistant OS, Home Assistant Community, home-assistant.io)

UPDATE (28 June 2024):
The problem is still there… at least on my “poor” Raspberry PI3A+…

I noticed that after 2 timeout errors on the call (api/core/state), the watchdog restarts Home Assistant…

Maybe it would be appropriate to raise either the timeout (maybe it’s too short) or the maximum number of attempts (in the Supervisor code), to give the system more time to react and avoid HA restarts that may not be necessary…

I understand that, from a developer’s point of view, everything should react as it does in theory (on sufficiently powerful hardware), but given that a lot of small hardware is much slower, an option to accept slower reactions in order to avoid useless restarts could be a good idea…

Maybe these values could be configurable in the UI, so that anyone with slower hardware can tune them, accepting that the system will react more slowly.

“TimeoutError” in supervisor/supervisor/homeassistant/api.py
“HASS_WATCHDOG_MAX_API_ATTEMPTS” (currently = 2) in supervisor/supervisor/misc/tasks.py
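
To illustrate the idea, here is a minimal Python sketch of a check that triggers a restart only after a configurable number of consecutive failures. This is not the Supervisor's actual code; `FailureCounter` and `max_attempts` are hypothetical names, and the default of 2 only mirrors the value mentioned above:

```python
class FailureCounter:
    """Signal a restart only after `max_attempts` consecutive failed checks."""

    def __init__(self, max_attempts=2):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self.max_attempts = max_attempts
        self.failures = 0

    def record(self, check_ok):
        """Feed one check result; return True when a restart should be triggered."""
        if check_ok:
            self.failures = 0          # any success resets the streak
            return False
        self.failures += 1
        if self.failures >= self.max_attempts:
            self.failures = 0          # start counting again after the restart
            return True
        return False


# Slower hardware could use a larger threshold (an assumed value, for illustration).
tolerant = FailureCounter(max_attempts=4)
results = [tolerant.record(ok) for ok in (False, False, True, False)]
print(results)
```

With a threshold of 4, two timeouts followed by a successful check do not trigger a restart, whereas a threshold of 2 would already have restarted after the second timeout.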

What do you think?

firstnamelast · November 16, 2024, 9:06pm

Hello,
I have a Raspberry Pi 5 (with the official power adapter) running the ‘Home Assistant Operating System’ that now and then hangs so badly that it doesn’t even respond to PING. I found your add-on: thank you for creating it. Unfortunately it doesn’t seem to work. Here are the diagnostics I could think of that might help.

After installation (and system reboot) the watchdog log has this:

2024-11-16 20:51:00 INFO     Opening watchdog device
Traceback (most recent call last):
  File "/watchdog.py", line 42, in <module>
    app.run()
  File "/watchdog.py", line 20, in run
    self.wdt = watchdog('/dev/watchdog')
OSError: [Errno 16] Resource busy: '/dev/watchdog'

I have the SSH add-on running, but that’s a rather limited shell so I don’t seem to be able to use systemctl but I can see this:

[core-ssh ~]$ ls -l /dev/watch*
crw-------    1 root     root       10, 130 Nov 15 21:04 /dev/watchdog
crw-------    1 root     root      247,   0 Nov 15 21:04 /dev/watchdog0
[core-ssh ~]$ lsof | grep watch
[core-ssh ~]$ dmesg | grep watch
[    0.012687] hw-breakpoint: found 6 breakpoint and 4 watchpoint registers.
[    0.374311] bcm2835-wdt bcm2835-wdt: Broadcom BCM2835 watchdog timer
[    0.943965] systemd[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
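
The last dmesg line shows systemd claiming /dev/watchdog0 at boot, which would explain the "Resource busy" error. One way to inspect the watchdog without opening the device (and so without arming it) is to read its sysfs attributes. A minimal Python sketch, assuming the kernel exposes them under /sys/class/watchdog (this requires CONFIG_WATCHDOG_SYSFS, and nothing in this thread confirms that HAOS enables it); `read_watchdog_sysfs` is a hypothetical helper name:

```python
from pathlib import Path

# Attribute names from the kernel's sysfs watchdog ABI; some are driver-dependent.
ATTRIBUTES = ("identity", "state", "timeout", "timeleft", "nowayout")


def read_watchdog_sysfs(name="watchdog0", root="/sys/class/watchdog"):
    """Return the readable sysfs attributes of a watchdog as a dict.

    Attributes that are missing or unreadable are simply left out,
    so an empty dict means sysfs support is absent or not visible here.
    """
    info = {}
    base = Path(root) / name
    for attr in ATTRIBUTES:
        try:
            info[attr] = (base / attr).read_text().strip()
        except OSError:
            continue
    return info


print(read_watchdog_sysfs())
```

If the attributes are available, a "state" of "active" together with a "timeout" value would show whether the timer is actually running and how long it waits before resetting the board.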

Anything I can do to fix this or help troubleshoot?
Thank you in advance!