Help debugging/mitigating a recurring restart issue

paoloantinori · May 13, 2023, 8:42am

Hi everyone, I’m looking for ideas on how to debug (or just mitigate) a recurring restart problem that has been bugging me.

I run the latest HAOS on a Raspberry Pi 4 with 4GB of RAM, and Home Assistant restarts every 2-3h on average (though at times the restarts are more frequent).

The logs are not helping: the three usual warnings show up here and there, but nothing that lines up with the restarts.

The only partially useful logs are those of Supervisor, which triggers the restart because the /config endpoint check fails n times in a row.

Having read that this is often related to CPU or RAM spikes, I’ve enabled system metrics and started tracking them with glances+influx+grafana, but I cannot see any unusual patterns.

I’ve also experimented with disabling add-ons and integrations, again without any improvement. (I’ve only now realised that I disabled them from the UI without restarting. Should I have restarted HA altogether?)

At this point I want to see if preventing the restart might help me to pinpoint the source of the problem.

Reading the Supervisor code, I noticed that the restart can be inhibited if the watchdog is disabled for HA.
Indeed, setting it to false via the CLI protects HA from Supervisor restarts.

The only remaining issue seems to be that the flag is not persisted across host reboots.

Does anyone know how to make the setting persistent?
Or do you have other suggestions on how to debug the problem?

paoloantinori · June 24, 2023, 1:46pm

The best mitigation I’ve found so far is to disable the watchdog automatically, so that when HA fails to respond for a short period of time, Supervisor doesn’t kill it immediately:

shell_command:
  disable_watchdog: "curl -sSL -H \"Authorization: Bearer $SUPERVISOR_TOKEN\" -H \"Content-Type: application/json\" -d '{\"watchdog\": false }' http://supervisor/core/options"
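
Since the flag is lost on host reboot, the command has to be re-run each time HA comes up. One way to do that is an automation that fires on Home Assistant start; this is a minimal sketch that assumes the `shell_command.disable_watchdog` service defined above, and the alias is just an illustrative name:

```yaml
automation:
  # Illustrative alias; any name works.
  - alias: "Disable Core watchdog at startup"
    trigger:
      # Fires once each time Home Assistant finishes starting.
      - platform: homeassistant
        event: start
    action:
      # Calls the shell_command defined above to set watchdog to false.
      - service: shell_command.disable_watchdog
```

Note that this only re-applies the setting after HA has started, so the watchdog stays active during startup itself.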

Still hoping to fix this properly, though.