How to debug startup? I get fatal hangs in core startup with no errors

lukeme · February 16, 2024, 10:05pm

I’m looking for a way to see what is hanging my core startup. Any restart (reboot, update, or plain restart) has only about a 30% chance of success. Is there a better way to see where in the process it fails? It seems to hang without writing anything to the log.

The SSH addon, CLI, and Supervisor all work, but the frontend isn’t accessible, and in the CLI “jobs info” shows a “core_start” or “core_restart_after_problem” job.

This is a long post, but I appreciate anyone’s help.

I have tried:

DEBUG log level - creates a lot of output, but it rarely fails at the same place.
Delete the sqlite db - I saw warnings about unfinished sessions because of the hard restarts, so I tried deleting the db. It gets recreated, but the problem persists.
Disable Integrations/Addons - I’m down to a minimal config (not nothing, but down to the ones I’d rather not set up again), and I tried to remove all the REST sensors and anything else that could take a long time fetching data.
Update / Downgrade - the problem has persisted through OS, Core, and Supervisor updates, backup restores, etc.
Safe Mode - I can get there by intentionally breaking the config file, and it seems to always work, but I don’t see anything there that helps with this situation.
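
One way to narrow down the DEBUG output is to compare several failed starts and list which integrations began setup but never reported finishing. The Python sketch below assumes the log contains `[homeassistant.setup] Setting up <domain>` and `[homeassistant.setup] Setup of domain <domain> took ...` lines; those patterns and the function names are assumptions for illustration, so adjust them to whatever the actual log shows.

```python
import re
from collections import Counter

# Assumed message shapes (not taken from the post); adjust to match your log.
SETUP_START = re.compile(r"\[homeassistant\.setup\] Setting up (\w+)")
SETUP_DONE = re.compile(r"\[homeassistant\.setup\] Setup of domain (\w+) took")


def unfinished_setups(log_text: str) -> list[str]:
    """Return domains that logged a setup start but no completion, in log order."""
    started: list[str] = []
    finished: set[str] = set()
    for line in log_text.splitlines():
        done = SETUP_DONE.search(line)
        if done:
            finished.add(done.group(1))
            continue
        start = SETUP_START.search(line)
        if start and start.group(1) not in started:
            started.append(start.group(1))
    return [domain for domain in started if domain not in finished]


def tally_unfinished(logs: list[str]) -> Counter:
    """Count how often each domain is left unfinished across several failed starts."""
    counts: Counter = Counter()
    for text in logs:
        counts.update(unfinished_setups(text))
    return counts
```

Feeding it the text of each failed start's log gives a count per domain. Since several integrations may be setting up concurrently, a domain that is left unfinished in every failed run is a stronger suspect than whichever line happens to be printed last.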

Currently, to restart, I ssh into the machine and run “supervisor restart”, then (in another terminal) “core stop”. A plain core restart doesn’t work because there is already a core job running. I repeat this until I get a successful start.
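
The manual retry loop above could be scripted roughly as follows. This is only a sketch under assumptions: the commands are spelled with an `ha` prefix as they would be typed in an SSH add-on shell, they are run one after the other rather than from two terminals, and `is_up` is a hypothetical caller-supplied check (for example, probing the frontend) that is not defined here. The settle time and attempt limit are arbitrary placeholders.

```python
import subprocess
import time
from typing import Callable, Sequence

# Assumed spellings of the post's "supervisor restart" and "core stop" commands.
RECOVERY_COMMANDS = (
    ("ha", "supervisor", "restart"),
    ("ha", "core", "stop"),
)


def run_cli(cmd: Sequence[str]) -> int:
    """Run one CLI command and return its exit code without raising on failure."""
    return subprocess.run(list(cmd), check=False).returncode


def restart_until_up(
    is_up: Callable[[], bool],
    run: Callable[[Sequence[str]], int] = run_cli,
    max_attempts: int = 10,
    settle_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Repeat the recovery commands until is_up() returns True.

    Returns the attempt number that succeeded, or 0 if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        for cmd in RECOVERY_COMMANDS:
            run(cmd)
        sleep(settle_seconds)  # give core time to either come up or hang
        if is_up():
            return attempt
    return 0
```

Because the frontend can load briefly and then drop the connection, a useful `is_up` would probably need to see the frontend stay reachable for a while rather than respond once.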

The interesting thing is that the frontend loads and starts doing the toast notifications for all the components loading, then loses connection when it fails.

This started on one system around core 2023.08, and then a second system started doing the same thing this week. On the second system there was a crash while I was in dev tools editing a template, and after that it started showing this behavior.

Both are x86 installs of HAOS, i5 CPU, 8GB RAM, SSD.

Here’s a nice fail log. This one just stops while loading “timer”, but sometimes it’s “switch_as_x”, “media_player”, or something else.