Thanks for all your checks! It looks like the hardware watchdog is enabled in your systemd configuration. By default it’s disabled, so maybe there is some other place that enables it, or maybe it’s enabled by default in newer hassio. Good news: there is no reason for you to use an extra add-on.
I’m not sure the watchdog is actually being used. I’m suffering from random crashes (HA itself somehow still works, but the Supervisor and all add-ons are dead and the Observer can’t be reached), and I always need to power cycle to get out of it.
RebootWatchdogSec= may be used to configure the hardware watchdog when the system is asked to reboot. It works as a safety net to ensure that the reboot takes place even if a clean reboot attempt times out.
RuntimeWatchdogSec … will be programmed to automatically reboot the system if it is not contacted within the specified timeout interval.
So, according to your systemd configuration (and the commit you mentioned), it looks like the hardware watchdog is only used during reboots, not at runtime. That’s a strange decision IMHO (I mean this commit).
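To see which case applies on a given box, you can look at the watchdog keys in the `[Manager]` section of systemd’s `system.conf`. Below is a minimal Python sketch, assuming you already have the config text (e.g. read from `/etc/systemd/system.conf`) and that no drop-in files or kernel command-line options override it; the helper names are hypothetical.

```python
import configparser

# Values that leave the runtime watchdog off. An unset key falls back to
# systemd's default, which is off for RuntimeWatchdogSec=. Other spellings of
# zero (e.g. "0min") are not handled in this sketch.
_DISABLED = {"", "0", "0s", "off"}


def watchdog_settings(conf_text: str) -> dict:
    """Return the *WatchdogSec keys found in the [Manager] section."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep systemd's CamelCase key names
    parser.read_string(conf_text)  # lines starting with '#' are comments
    if not parser.has_section("Manager"):
        return {}
    return {
        key: value.strip()
        for key, value in parser.items("Manager")
        if key.endswith("WatchdogSec")
    }


def runtime_watchdog_enabled(conf_text: str) -> bool:
    """True if RuntimeWatchdogSec= is set to something other than off/zero.

    Note: "default" counts as enabled; systemd then pings the watchdog
    without changing its hardware timeout.
    """
    value = watchdog_settings(conf_text).get("RuntimeWatchdogSec", "")
    return value.lower() not in _DISABLED
```

A config that only sets `RebootWatchdogSec=10min` (the reboot-only case described above) makes `runtime_watchdog_enabled` return `False`.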
I’ve had the same issue with my HA getting stuck over the last few days, and I haven’t been able to find the cause yet, but I would definitely like the system to reboot by itself when needed.
When I try to start the add-on, I get the same “reply” saying the watchdog is already in use. Did you manage to disable the built-in use of the watchdog, and are you now able to use the add-on?
The way the watchdog is being used now doesn’t restart the RPi 4 when needed…
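For context, a userspace watchdog (which is roughly what such an add-on provides) works by keeping the device open and writing to it regularly; if the writer hangs, the hardware timer expires and resets the board. The following is a generic sketch of the Linux watchdog device interface, not the add-on’s actual code; the function name, interval, and ping count are assumptions for illustration. Only one process can hold the device open, which is why the add-on and systemd cannot both use it.

```python
import os
import time


def feed_watchdog(path: str = "/dev/watchdog", interval: float = 5.0,
                  pings: int = 3, magic_close: bool = True) -> None:
    """Open a watchdog device, send `pings` keepalives, then close it.

    Opening the device arms the hardware timer, and every write counts as
    a keepalive. A real daemon would loop forever instead of stopping after
    a fixed number of pings; `interval` must stay below the driver timeout.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        for _ in range(pings):
            os.write(fd, b"\0")  # any write resets the hardware timer
            time.sleep(interval)
        if magic_close:
            # Writing 'V' right before close asks drivers that support
            # "magic close" to disarm the timer instead of rebooting.
            os.write(fd, b"V")
    finally:
        os.close(fd)
```

Do not point this at a real `/dev/watchdog` casually: if the keepalives stop and magic close is not honoured, the board will reboot.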
Now the situation is totally different after the very latest update (versions below):
HA is restarting spontaneously multiple times per day…
So the watchdog is indeed working, and it must be the built-in one, since I removed the add-on I had used for quite a long time.
Core 2023.12.1
Supervisor 2023.11.6
Operating System 11.2
Frontend 20231030.2
I noticed that every time a restart happens, it is because of the following:
23-12-11 15:08:00 ERROR (MainThread) [supervisor.homeassistant.api] Error on call https://172.30.32.1:8123/api/core/state:
23-12-11 15:08:04 ERROR (MainThread) [supervisor.homeassistant.api] Error on call https://172.30.32.1:8123/api/core/state:
23-12-11 15:08:04 ERROR (MainThread) [supervisor.misc.tasks] Watchdog found a problem with Home Assistant API!
23-12-11 15:08:12 INFO (SyncWorker_0) [supervisor.docker.manager] Restarting homeassistant
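To confirm that every restart follows this pattern, the Supervisor log can be scanned for restarts preceded by the watchdog message, counting the failed API calls before each one. This is a minimal Python sketch that assumes the log line format shown above (`yy-mm-dd HH:MM:SS LEVEL (thread) [logger] message`); the function name is hypothetical.

```python
import re
from datetime import datetime

# Matches Supervisor log lines in the format shown above.
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+) "
    r"\((?P<thread>[^)]+)\) \[(?P<logger>[^\]]+)\] (?P<msg>.*)$"
)


def watchdog_restarts(log_text: str) -> list:
    """Return (restart time, failed API calls before it) for each restart
    that was preceded by the Supervisor watchdog message."""
    restarts = []
    api_errors = 0
    watchdog_fired = False
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip tracebacks and other non-log lines
        msg = m["msg"]
        if (m["logger"] == "supervisor.homeassistant.api"
                and msg.startswith("Error on call")):
            api_errors += 1
        elif msg.startswith("Watchdog found a problem"):
            watchdog_fired = True
        elif msg == "Restarting homeassistant":
            if watchdog_fired:
                ts = datetime.strptime(m["ts"], "%y-%m-%d %H:%M:%S")
                restarts.append((ts, api_errors))
            # Any restart starts a fresh observation window.
            api_errors, watchdog_fired = 0, False
    return restarts
```

For the four lines above, this yields a single entry: a restart at 2023-12-11 15:08:12 preceded by 2 failed API calls.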
UPDATE (28 June 2024):
The problem is still there, at least on my “poor” Raspberry Pi 3A+…
I noticed that after 2 timeout errors on the call to api/core/state, the watchdog restarts Home Assistant.
Maybe it would be appropriate to change either the timeout (maybe it’s too short) or the max attempts value (in the Supervisor code), to give Home Assistant more time to respond and avoid all of these restarts, which may not be necessary.
I understand that from a developer’s point of view everything should react as it does in theory (on sufficiently powerful hardware), but given that there is a lot of “small” hardware that may be much slower, an option to “accept” slower reactions in order to avoid useless restarts could be a good idea.
Maybe these values could be made configurable in the UI, so that people with slower hardware can tune them, accepting that the system will react more slowly:
“TimeoutError” in supervisor/supervisor/homeassistant/api.py
“ASS_WATCHDOG_MAX_API_ATTEMPTS” (currently = 2) in supervisor/supervisor/misc/tasks.py
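The proposal boils down to a health check with two tunables: how long each check may take, and how many consecutive failures are tolerated before a restart. Below is a hypothetical asyncio sketch of that logic. The class and parameter names are invented for illustration and do not mirror the Supervisor’s actual code in the files listed above.

```python
import asyncio
from typing import Awaitable, Callable


class ApiWatchdog:
    """Restart only after `max_attempts` consecutive failed health checks.

    Hypothetical sketch: `timeout` and `max_attempts` stand in for the
    values a user could tune on slower hardware.
    """

    def __init__(self, check: Callable[[], Awaitable[bool]],
                 restart: Callable[[], Awaitable[None]],
                 timeout: float = 10.0, max_attempts: int = 2) -> None:
        self.check = check
        self.restart = restart
        self.timeout = timeout
        self.max_attempts = max_attempts
        self.failures = 0

    async def run_once(self) -> bool:
        """Run one health check; return True if it triggered a restart."""
        try:
            healthy = await asyncio.wait_for(self.check(), self.timeout)
        except asyncio.TimeoutError:
            healthy = False  # a slow answer counts as a failure
        if healthy:
            self.failures = 0  # any success resets the counter
            return False
        self.failures += 1
        if self.failures < self.max_attempts:
            return False
        self.failures = 0
        await self.restart()
        return True
```

With `max_attempts=3`, a check that always times out causes a restart only on the third call to `run_once`; raising `timeout` instead would let a slow but healthy instance pass every check.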
Hello,
I have a Raspberry Pi 5 (with the official power adapter) running Home Assistant Operating System that now and then hangs to the point that it doesn’t even respond to ping. I found your add-on; thank you for creating it. Unfortunately it doesn’t seem to work. Here are the diagnostics I could think of that might help.
After installation (and a system reboot), the watchdog log shows this:
2024-11-16 20:51:00 INFO Opening watchdog device
Traceback (most recent call last):
  File "/watchdog.py", line 42, in <module>
    app.run()
  File "/watchdog.py", line 20, in run
    self.wdt = watchdog('/dev/watchdog')
OSError: [Errno 16] Resource busy: '/dev/watchdog'
I have the SSH add-on running, but it’s a rather limited shell and I don’t seem to be able to use systemctl. Here is what I can see:
[core-ssh ~]$ ls -l /dev/watch*
crw------- 1 root root 10, 130 Nov 15 21:04 /dev/watchdog
crw------- 1 root root 247, 0 Nov 15 21:04 /dev/watchdog0
[core-ssh ~]$ lsof | grep watch
[core-ssh ~]$ dmesg | grep watch
[ 0.012687] hw-breakpoint: found 6 breakpoint and 4 watchpoint registers.
[ 0.374311] bcm2835-wdt bcm2835-wdt: Broadcom BCM2835 watchdog timer
[ 0.943965] systemd[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
Anything I can do to fix this or help troubleshoot?
Thank you in advance!
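The last dmesg line is the key clue: systemd claimed the hardware watchdog (`/dev/watchdog0`) at boot. A Linux watchdog device can only be open by one process at a time, and `/dev/watchdog` is the legacy alias for the first watchdog, so the add-on’s open fails with `[Errno 16] Resource busy`. The minimal Python sketch below detects that situation from dmesg text rather than by opening the device, because opening a free watchdog device would arm it. The function name is hypothetical, and the pattern assumes the exact message format shown above.

```python
import re
from typing import Optional

# Matches systemd's boot message, e.g.
# systemd[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
SYSTEMD_WDT_RE = re.compile(
    r"systemd\[1\]: Using hardware watchdog '(?P<name>[^']*)', "
    r"version \d+, device (?P<device>\S+)"
)


def systemd_watchdog_device(dmesg_text: str) -> Optional[str]:
    """Return the watchdog device systemd took over at boot, if any."""
    for line in dmesg_text.splitlines():
        m = SYSTEMD_WDT_RE.search(line)
        if m:
            return m["device"]
    return None
```

The text can come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`; a non-`None` result means a second watchdog user will get `EBUSY` unless systemd’s watchdog use is turned off.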