This works great and has saved me more than once when I was away from home and had to rely on VPN access. HA broke, but actually came back without me being on site!
The only thing I wonder about: it takes about 30-40 minutes until HA is actually restarted after it becomes unresponsive (and e.g. no longer records sensors…). Is there any way to make it react more quickly?
Thanks, habitoti
Cool, but unfortunately it does not work with recent HA. I’m getting OSError: [Errno 16] Resource busy: '/dev/watchdog'.
Hi!
I can’t find anything about it in the changelogs.
Can you show the output of “ls -la /dev/watchdog” and “lsof /dev/watchdog”? Maybe it will give us a clue.
Thanks!
UPD: it looks like hassio’s lsof works differently, so for me “lsof | grep watchdog” works fine.
Sure…
ls -la /dev/watchdog
returns: 0 crw------- 1 root root 10, 130 Apr 4 2023 /dev/watchdog
and lsof | grep watchdog
nothing.
It looks like the watchdog device exists and is not used by anyone :-/
The internet says that error 16 can also appear if the kernel can’t communicate with the device, but in that case /dev/watchdog should not exist.
You can also check whether the watchdog is enabled in systemd; maybe that’s the thing that changed in hassio: “cat /etc/systemd/system.conf | grep -i watch”
Hmm, to check that I should probably log in to the host system through SSH, right? Right now I’m logging in to HA through the SSH add-on, and there is no /etc/systemd directory.
Right, you need access to the core system. I use port 22222 and user root to connect to the core system, but I can’t recall whether I did something special to set that up.
I’m there. It seems that the watchdog is probably disabled.
# cat /etc/systemd/system.conf | grep -i watch
#RuntimeWatchdogSec=off
#RuntimeWatchdogPreSec=off
#RuntimeWatchdogPreGovernor=
#RebootWatchdogSec=10min
#KExecWatchdogSec=off
#WatchdogDevice=
Maybe this could be also interesting.
# systemctl show | grep -i watchdog
WatchdogDevice=/dev/watchdog0
WatchdogLastPingTimestamp=Thu 2023-11-30 09:16:17 UTC
WatchdogLastPingTimestampMonotonic=82726617867
RuntimeWatchdogUSec=infinity
RuntimeWatchdogPreUSec=0
RebootWatchdogUSec=10min
KExecWatchdogUSec=0
ServiceWatchdogs=yes
# lsof | grep watchdog
1 /usr/lib/systemd/systemd 9 /dev/watchdog0
2753 /package/admin/s6-2.11.3.2/command/s6-supervise 3 /run/s6/legacy-services/watchdog/supervise/lock
2753 /package/admin/s6-2.11.3.2/command/s6-supervise 4 /run/s6/legacy-services/watchdog/supervise/control
2753 /package/admin/s6-2.11.3.2/command/s6-supervise 5 /run/s6/legacy-services/watchdog/supervise/control
Maybe the addon should use the /sys/class/watchdog/watchdog0/dev file instead?
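On the sysfs idea: as far as I know, the `dev` file under /sys/class/watchdog/watchdog0 only contains the device's major:minor numbers as text, so it cannot be opened in place of /dev/watchdog to feed the timer. It can still be useful for checking which device node belongs to which watchdog. A minimal Python sketch, assuming the sysfs path from the post above; the function names are made up for illustration:

```python
import os
from pathlib import Path


def parse_dev_numbers(text):
    """Parse the 'MAJOR:MINOR' text found in a sysfs 'dev' file."""
    major, minor = text.strip().split(":")
    return int(major), int(minor)


def node_numbers(path):
    """Return (major, minor) of a character device node such as /dev/watchdog0."""
    st = os.stat(path)
    return os.major(st.st_rdev), os.minor(st.st_rdev)


def sysfs_matches_node(sysfs_dev="/sys/class/watchdog/watchdog0/dev",
                       node="/dev/watchdog0"):
    """True if the sysfs entry and the device node describe the same device."""
    return parse_dev_numbers(Path(sysfs_dev).read_text()) == node_numbers(node)
```

Only `parse_dev_numbers` can be tried without watchdog hardware; the other two helpers need to run on the host system, where both paths exist.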
Thanks for all your checks! It looks like the hardware watchdog is enabled in your systemd. By default it’s disabled, so maybe there is some other place where it gets enabled, or maybe it’s enabled by default in new hassio. The good news: there is no reason for you to use an extra add-on.
Not sure about the watchdog being actually used. I’m suffering from random crashes (HA itself somehow works, but the Supervisor and all Addons are dead, Observer can’t be reached) and I always need to do a power cycle to get out of it.
The watchdog was enabled in systemd in this pull request: “Enable watchdog control in systemd” by sbyx, Pull Request #2628, home-assistant/operating-system (GitHub).
As this documentation says:
RebootWatchdogSec=
may be used to configure the hardware watchdog when the system is asked to reboot. It works as a safety net to ensure that the reboot takes place even if a clean reboot attempt times out.
RuntimeWatchdogSec
… will be programmed to automatically reboot the system if it is not contacted within the specified timeout interval.
So, according to your systemd configuration (and the commit you mentioned), it looks like the hardware watchdog is only used during reboots, not at runtime (a strange decision, IMHO; I mean this commit).
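One thing to keep in mind when reading the system.conf output quoted earlier: every line there starts with `#`, and in a systemd config file such lines are comments, not active settings. The values actually in effect are the ones `systemctl show` reports, wherever they were set. A rough Python sketch that filters out the commented lines; `active_settings` is a hypothetical helper, and the sample text is copied from the post above:

```python
def active_settings(conf_text):
    """Return only the KEY=VALUE pairs that are not commented out."""
    settings = {}
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blanks, comments and section headers such as [Manager].
        if not line or line.startswith(("#", ";", "[")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


SAMPLE = """\
#RuntimeWatchdogSec=off
#RuntimeWatchdogPreSec=off
#RuntimeWatchdogPreGovernor=
#RebootWatchdogSec=10min
#KExecWatchdogSec=off
#WatchdogDevice=
"""

print(active_settings(SAMPLE))  # {} : this file sets nothing by itself
```

So the grep output alone does not say whether the watchdog is on or off; it only shows that nothing is overridden in that particular file.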
I also realized only now that the Watchdog add-on, which I have been using for over a year, is now disabled… and can’t start.
I don’t know exactly when it became disabled… but it has to be related to some recent HA updates.
Using Raspberry PI3A+
This is the log output:
2023-12-03 19:13:44 INFO Opening watchdog device
Traceback (most recent call last):
File "/watchdog.py", line 42, in <module>
app.run()
File "/watchdog.py", line 20, in run
self.wdt = watchdog('/dev/watchdog')
OSError: [Errno 16] Resource busy: '/dev/watchdog'
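The traceback boils down to opening the device failing with EBUSY (errno 16). A Linux watchdog device can be held open by only one process at a time, and /dev/watchdog is the legacy name for the first watchdog (watchdog0). So if systemd already holds /dev/watchdog0, as in the lsof output earlier in the thread, a second opener would be expected to get exactly this error. Below is a minimal sketch of how a script could detect that case and stop with a clear message instead of a traceback. It is not the add-on's actual code; `open_watchdog` and its `opener` parameter are invented for illustration.

```python
import errno
import logging


def open_watchdog(path="/dev/watchdog", opener=open):
    """Try to open the watchdog device; return None if another process holds it.

    `opener` is injectable so the logic can be exercised without real hardware.
    """
    try:
        # Unbuffered binary write mode, so each write reaches the driver directly.
        return opener(path, "wb", buffering=0)
    except OSError as exc:
        if exc.errno == errno.EBUSY:
            logging.error("%s is already in use by another process "
                          "(for example systemd); not starting", path)
            return None
        raise  # a missing device or a permission error is a different problem
```

Be careful when trying this on a real machine: once the device is successfully opened, the timer usually starts running, and the hardware will reset the system unless the device is written to regularly.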
I’m having the same issue, with my HA getting stuck over the last few days. I’m not able to find the cause at the moment, but I definitely like the idea of the system rebooting by itself if needed.
When I try to start the add-on I get the same “reply” saying the watchdog is already being used. Did you manage to disable the use of the watchdog, and are you now able to use the add-on?
The way the watchdog is being used now doesn’t restart the rpi4 when needed…
Thanks
Currently, after the update to 2023.12, my RPI is totally unresponsive (but Observer reports no problem) and the watchdog is NOT restarting my RPI.
Now the situation is totally different… after the very latest update (see the versions below),
HA is restarting spontaneously multiple times per day…
So the watchdog is indeed working… and it must be the embedded one, since I removed the add-on, which I had used for quite a long time.
- Core 2023.12.1
- Supervisor 2023.11.6
- Operating System 11.2
- Frontend 20231030.2
I noticed that every time a restart happens, it is because of the following:
23-12-11 15:08:00 ERROR (MainThread) [supervisor.homeassistant.api] Error on call https://172.30.32.1:8123/api/core/state:
23-12-11 15:08:04 ERROR (MainThread) [supervisor.homeassistant.api] Error on call https://172.30.32.1:8123/api/core/state:
23-12-11 15:08:04 ERROR (MainThread) [supervisor.misc.tasks] Watchdog found a problem with Home Assistant API!
23-12-11 15:08:12 INFO (SyncWorker_0) [supervisor.docker.manager] Restarting homeassistant
The situation has changed again…
See the details in my latest post here:
Home Assistant automatic restart for API call error? - #30 by DarthJacks - Home Assistant OS - Home Assistant Community (home-assistant.io)
UPDATE (28/june/2024):
The problem is still there… at least on my “poor” Raspberry PI3A+.
I noticed that after 2 timeout errors on the call to api/core/state, the watchdog restarts Home Assistant.
Maybe it would be appropriate to change either the timeout (maybe it’s too short) or the maximum number of attempts in the Supervisor code, to give HA more time to react and avoid all these restarts, which may not be necessary.
I understand that, from a developer’s point of view, everything should react as it does in theory (on sufficiently powerful hardware). But given that there is a lot of small hardware out there that may be much slower, an option to accept slower reactions in order to avoid useless restarts could be a good idea.
Maybe these values could be made configurable in the UI, so that anyone with slower hardware can tune them, accepting that the system will react more slowly:
- “TimeoutError” in supervisor/supervisor/homeassistant/api.py
- “ASS_WATCHDOG_MAX_API_ATTEMPTS” (currently = 2) in supervisor/supervisor/misc/tasks.py
What do you think?
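To make the suggestion concrete: the log above suggests that a restart is triggered once the API check has failed twice in a row. The following is only an illustrative sketch of such a counter with a tunable threshold, not the Supervisor's real code; `ApiWatchdog`, `max_attempts` and `record` are hypothetical names.

```python
class ApiWatchdog:
    """Counts consecutive failed API checks and signals when to restart."""

    def __init__(self, max_attempts=2):
        self.max_attempts = max_attempts
        self.failures = 0

    def record(self, api_ok):
        """Feed one check result; return True if a restart should be triggered."""
        if api_ok:
            self.failures = 0  # any success clears the streak
            return False
        self.failures += 1
        if self.failures >= self.max_attempts:
            self.failures = 0  # start counting afresh after the restart
            return True
        return False


# With a threshold of 2, the second failure in a row triggers a restart.
strict = ApiWatchdog(max_attempts=2)
print([strict.record(ok) for ok in (False, False)])  # [False, True]

# With a threshold of 5, a slow board gets more chances to answer first.
lenient = ApiWatchdog(max_attempts=5)
print([lenient.record(ok) for ok in (False, False, True)])  # [False, False, False]
```

The trade-off is the one described above: a higher threshold avoids needless restarts on slow hardware, at the price of reacting later when HA is really stuck.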
Hello,
I have a Raspberry Pi 5 (with the official power adapter) running the ‘Home Assistant Operating System’ that now and then hangs to the point that it doesn’t even respond to PING. I found your add-on: thank you for creating it. Unfortunately it doesn’t seem to work. Here are the diagnostics I could think of that might help.
After installation (and system reboot) the watchdog log has this:
2024-11-16 20:51:00 INFO Opening watchdog device
Traceback (most recent call last):
File "/watchdog.py", line 42, in <module>
app.run()
File "/watchdog.py", line 20, in run
self.wdt = watchdog('/dev/watchdog')
OSError: [Errno 16] Resource busy: '/dev/watchdog'
I have the SSH add-on running, but that’s a rather limited shell so I don’t seem to be able to use systemctl but I can see this:
[core-ssh ~]$ ls -l /dev/watch*
crw------- 1 root root 10, 130 Nov 15 21:04 /dev/watchdog
crw------- 1 root root 247, 0 Nov 15 21:04 /dev/watchdog0
[core-ssh ~]$ lsof | grep watch
[core-ssh ~]$ dmesg | grep watch
[ 0.012687] hw-breakpoint: found 6 breakpoint and 4 watchpoint registers.
[ 0.374311] bcm2835-wdt bcm2835-wdt: Broadcom BCM2835 watchdog timer
[ 0.943965] systemd[1]: Using hardware watchdog 'Broadcom BCM2835 Watchdog timer', version 0, device /dev/watchdog0
Anything I can do to fix this or help troubleshoot?
Thank you in advance!