Can't update Core or shutdown gracefully

tribbles · November 24, 2024, 5:39pm

A number of issues are taking place.

Home Assistant Core won’t update (2024.11.1 → 2024.11.3). 2024.11.2 wouldn’t update either, but that has now been superseded by 2024.11.3.
Home Assistant will not shut down gracefully. It gets stuck trying to stop/kill the journald service.
systemd[1]: systemd-journald.service: Killing process 204791 (systemd-journal) with signal SIGKILL.
After letting it do that for hours I have to power off the system.
After restarting I am getting the following message on the console screen. It just keeps repeating every few seconds.
systemd-journald[105]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected.

tribbles · November 24, 2024, 5:45pm

Also, after about 12 hours of running, all functionality stops and I need to restart or shut down, which I can’t do gracefully, so I must power off the system.

tribbles · November 24, 2024, 7:43pm

Tried doing the update again and the supervisor log is showing:
ERROR (MainThread) [supervisor.homeassistant.core] Home Assistant has crashed!
CRITICAL (MainThread) [supervisor.homeassistant.core] update failed → rollback!

WallyR · November 25, 2024, 12:28am

Not much to go on, but my guess is a failed SD card.
This is just based on experience with failing media, where writes are usually affected while reads often still work.
It fits the symptoms: HA Core can’t update, and journald can’t be stopped because it is the logging service and needs to write its log files to the media. Journald also doesn’t start up correctly, because on startup it needs to rename the previous log file and create a new one.

tribbles · November 25, 2024, 2:06am

Thanks for the reply.

I have it running at the moment and I will see if things go south again overnight. I made changes to configuration.yaml and automations.yaml without encountering any issues. I am still unable to do the updates but the “Failed to send WATCHDOG=1” error hasn’t returned since the last startup.

nickrout · November 25, 2024, 5:03am

dmesg should reveal a broken SD card.
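
A rough way to cut the dmesg output down to storage trouble, as a sketch only. The function name is made up and the pattern list is an assumption based on common kernel wording, so a clean result does not prove the disk is healthy:

```shell
# Hypothetical helper: keep only kernel log lines that commonly accompany failing storage.
# The patterns are illustrative, not an exhaustive list of kernel messages.
filter_storage_errors() {
  grep -iE 'i/o error|ext4-fs error|remounting filesystem read-only|mmcblk[0-9]+: error'
}
```

Use it as dmesg | filter_storage_errors. No output just means none of these patterns matched.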

tribbles · November 25, 2024, 1:14pm

Ran dmesg and didn’t see anything unusual. Should have mentioned that I am running HA OS VM in Virtualbox.

The system didn’t crash overnight, and automations and devices are still working, but I am still unable to update anything (I now have a number of updates pending).

nickrout · November 25, 2024, 6:20pm

Well it won’t be an SD card then.

Virtualbox - bleech.

retc · January 5, 2025, 2:05pm

I have exactly the same problem since last week.
I run it in a VM on Synology. At first I was on the last version of 2024; yesterday I restored a snapshot and upgraded it to the most recent version (2025.1). After a couple of hours, the web interface of HA is unreachable. The console didn’t show any messages, but when I reboot the VM host (I can’t gracefully shut down either), I get “systemd-journald[94]: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected”.

retc · January 5, 2025, 2:25pm

In addition to my previous post: when I run ha supervisor restart, the command executes, but every other ha supervisor command, like info, gives me the message “system is not ready with state: setup”. It stays in this state.

tribbles · January 5, 2025, 2:46pm

At this point the only way I can do core or supervisor updates is from the CLI using the following commands.

core update

supervisor update

After the update, the system usually becomes unresponsive (I let it sit for about an hour just in case it recovers, but it never does lol) and needs to be manually shut down from the VM controls. After the restart, it shows that it has been updated. I fear that at some point it will stop working altogether, but sadly I just don’t have the time to try and figure out what the problem is.

retc · January 5, 2025, 8:10pm

Core and supervisor are updated to the latest versions, supervisor is healthy (checked it with supervisor info).
But the big questions I have are:

Why the error message?
Why can’t I access the web interface?
Why can’t I shut down HA from the VM controls (like I could before)?

My best guess is that it is a bug related to the last update.
I’m going to restore a snapshot from one month ago to test my guess.

retc · January 7, 2025, 11:22am

“My best guess is that it is a bug, related to the last update.”
I can rule that out for now. I restored a snapshot with the following versions:

Core: 2024.12.0
Supervisor 2024.12.3
OS: 13.2
Frontend: 20241127.4

That used to work ok when I took the snapshot.
Now when I restore it, I get the “Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected” message.
After restoring, HA is reachable and functional for a couple of hours. After that, it’s not reachable any more via the web interface. I can reach the CLI. In there, the “core info” and “supervisor info” look fine.

I can’t get my head around the error message. Could it be something like a Docker error? For instance, an image that can’t be reached or something?

The weird thing is: I would expect that this issue wouldn’t occur in an old snapshot, since it was running fine back then.

retc · January 7, 2025, 11:23am

I also built a new VM with a fresh download of the OVA file from HA.
The same error shows up in the newly created VM.

WallyR · January 7, 2025, 11:44am

Supervisor is typically set to update automatically, so if you have not done anything to stop that, both the new and the old image will end up with the newest Supervisor after being online for a short while.

retc · January 7, 2025, 12:22pm

Glad you mentioned that: I didn’t realise that.
It also led me to a new thought: perhaps it has something to do with Supervisor, and not with HA (or with my VM).

I’ve restored the VM again and now disabled the auto-update of Supervisor, with ha supervisor options --auto-update=false.
I wonder if it still runs without problems when I get home this evening.
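
For anyone wanting to repeat this, a small sketch of the two steps from a shell where the ha CLI is available (for example an SSH add-on session, which is an assumption about your setup). The function name is made up, and the check assumes that ha supervisor info lists an auto_update field, which is worth confirming on your version:

```shell
# Hypothetical wrapper: turn off Supervisor auto-update, then show the resulting setting.
pin_supervisor() {
  # Same option as used in the post above.
  ha supervisor options --auto-update=false || return 1
  # Assumption: the info output contains an "auto_update" line.
  ha supervisor info | grep -i 'auto_update'
}
```

Running pin_supervisor should print the auto_update line if the field is present; if nothing is printed, check the full ha supervisor info output by hand.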

retc · January 7, 2025, 2:28pm

Do you use supervisor? (and if so: which version?)

I’ve disabled the auto-update of supervisor after restoring a previous snapshot (supervisor version 2024.12.3) a couple of hours ago. And by the looks of it, the “Watchdog” error doesn’t occur again.
And also: I can gracefully shut down the HA virtual machine again.