HA hangs multiple times a day. Looks Docker-related

Running HAOS in a VM under Proxmox. This has been running painlessly for years through all the usual HAOS and Core updates. Recently, HA has been either outright hanging, or Mosquitto stops listening (and thus all of my MQTT entities show as ‘unavailable’). Port 1883 isn’t listening and I can’t connect to it remotely (ECONNREFUSED).
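
For the record, this is roughly how I’ve been confirming the broker is down from another machine (the hostname and credentials here are just examples; substitute your own):

```
# Quick TCP check against the broker port (hostname is an example)
nc -zv homeassistant.local 1883

# Or a one-shot subscribe with a 5s timeout (from the mosquitto-clients
# package); when the add-on is wedged this fails with ECONNREFUSED.
# User/password are placeholders.
mosquitto_sub -h homeassistant.local -p 1883 -u mqttuser -P 'secret' -t '#' -C 1 -W 5
```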

After digging in a bit, it appears that Docker itself is hung: I can’t attach to, stop, or kill any containers (like Mosquitto).
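
Roughly what I tried from the HAOS host shell (this assumes the developer SSH on port 22222 is enabled; the container name is the official add-on’s):

```
docker ps | grep mosquitto                               # shows as running...
timeout 10 docker logs --tail 20 addon_core_mosquitto    # ...but this never returns
timeout 10 docker kill addon_core_mosquitto              # and neither does this

# Processes stuck in uninterruptible sleep (D state) usually mean the
# storage or kernel layer under Docker is wedged, not Docker itself.
# (procps syntax; a busybox ps may need plain 'ps' instead)
ps axo pid,stat,comm | awk '$2 ~ /D/'
```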

If I try to shut down the system to reboot, systemd takes forever because it’s waiting for various things to die; a graceful shutdown usually takes half an hour, if it completes at all. If I just tell Proxmox to ‘reset’ the VM, it spins for a while and then returns with “Error: VM 121 qmp command ‘system_reset’ failed - got timeout”, which is annoying, so I end up having to reboot Proxmox.
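
When the GUI reset times out, the fallback I’ve been using on the Proxmox host looks like this (121 is my VM ID from the error above):

```
qm stop 121    # hard stop; bypasses the guest entirely

# If qm hangs too, kill the KVM process directly:
kill -9 "$(cat /var/run/qemu-server/121.pid)"

# Caveat: if that process is in D state (uninterruptible sleep), even
# SIGKILL won't take, and rebooting the Proxmox host is the only way out.
```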

So I understand the Proxmox issues are separate and not relevant here, but it seems like the root of my problem is that since updating HA recently, Docker has been having problems. The rest of my VMs and LXC containers under Proxmox are working fine.

Anyone know what’s going on? I’m not seeing a lot of other complaints, so I’m willing to entertain the notion that this is maybe self-inflicted, but I really don’t think I’ve done anything.

I get the same with Node-RED, and it complains a lot about an HTTP Listen2 in NGINX being deprecated.
The messages are in the add-on logs, not in the core logs.

One thing I would try, since the problem seems to start with Mosquitto, is to stop the add-on and run Mosquitto in a separate LXC.
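
Something like this gets a minimal broker going in a Debian LXC; the user name and paths are just examples:

```
apt update && apt install -y mosquitto mosquitto-clients

# Listen on the LAN and require a password (example user 'ha')
cat > /etc/mosquitto/conf.d/lan.conf <<'EOF'
listener 1883 0.0.0.0
password_file /etc/mosquitto/passwd
EOF
mosquitto_passwd -c /etc/mosquitto/passwd ha
systemctl restart mosquitto
```

Then point the MQTT integration in HA at the LXC’s IP instead of the add-on. If the hangs continue with the add-on stopped, at least your MQTT devices stay up while you debug.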

Unfortunately, the Mosquitto logs just cease with no errors. The last message in the log is a received PUBLISH, and then the log ends at the time Mosquitto stopped running.

I also discovered that despite trying to update to 15.2, I’m actually still running 15.1. Watching the system boot, it appears to fail to boot from the 15.2 slot and then falls back to the 15.1 slot.
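
For anyone following along, this is how I checked which slot I was actually on (the ha CLI from the SSH add-on, plus rauc from the host shell):

```
ha os info    # reports the running OS version and the active boot slot

# On the HAOS host itself: HAOS uses RAUC for the A/B slots, and a
# slot marked bad here would explain the silent fallback to 15.1
rauc status

# Retry the update once the slot state is known
ha os update --version 15.2
```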

I’m probably gonna have to burn this whole thing down and re-create it all from scratch. Seems to happen around every 5 years.

I just realized my Supervisor is dying from the Node-RED error, and that causes other containers to die too.
Not sure if your issue is the same, but if you have Node-RED, then check its logs and maybe try disabling it.
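
From the SSH add-on you can pull the add-on’s own log and stop it. The slug below is the community Node-RED add-on’s; yours may differ, so check the list first:

```
ha addons                          # lists installed add-ons and their slugs
ha addons logs a0d7b954-nodered    # community add-on slug; substitute yours
ha addons stop a0d7b954-nodered
```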

Nope. No ‘Node-RED’.

But I did download a fresh 15.2 image and built a new VM, and it ran for just over a day before hanging solid. Now I have an unkillable KVM process and have to reboot my Proxmox server.

At least now I know 100% that there’s something wrong with 15.2.

I run 15.2 on Proxmox and so do many others.
Maybe you have a faulty hard drive or something?

I have all of my VM and LXC images on the same NAS as my HA images. All of my other containers and VMs work just fine. The NAS is healthy, the network is healthy, etc.
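
For what it’s worth, this is the sanity check I ran on the Proxmox side (the storage name is a placeholder for whatever your NAS mount is called):

```
pvesm status    # every storage should report 'active'

# Spot-check latency on the NAS-backed path (apt install ioping if missing)
ioping -c 10 /mnt/pve/<nas-storage>
```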

Do you have these same logs on your console? As a reminder, this is a freshly created VM from the 15.2 image downloaded the day before yesterday. This is not indicative of a healthy release:

I do not have those messages.
How much RAM and storage have you assigned to your HA VM?

8 GB RAM and a 32 GB disk. I tried bumping the RAM to 16 GB, but it never booted even after waiting 45 minutes. I tried to strace the kvm process and it was wedged hard: not even cycling on a timer or select/poll. The console showed it stuck setting up HAOS swap.
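
This is roughly how I poked at the stuck process on the Proxmox host, for anyone curious (121 is my VM ID again):

```
pid="$(cat /var/run/qemu-server/121.pid)"
grep State "/proc/$pid/status"    # 'D (disk sleep)' = uninterruptible, unkillable

# The kernel stack often names the subsystem it's stuck in (NFS, block layer, ...)
cat "/proc/$pid/stack"
```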

Still sounds like a storage issue.
Try upping the disk to maybe 48 or 64 GB.

At rage? (Probably an autocorrect problem. :smile: )

I’ll try bumping up the disk, sure, but the previous VM ran for years at 4 GB RAM and a 32 GB disk. When I created this one I was feeling generous and gave it 8 GB. After it hung, I tried giving it 16 GB and it didn’t boot, so now it’s back to 8. The disk image is largely empty.

Corrected, but may be needed later if it is not solved. :rofl:

Sometimes HA has issues with more than 8 GB of RAM, so stick with that for now.

It’s been up for two days now with no changes. It previously ran for less than a day at 8 GB/32 GB. Then I tried giving it 16 GB and it failed to boot, so I put it back to 8 GB, and it’s been running for two days. Nothing is different from the previous time, when it ran for less than a day at 8/32.

I don’t trust it but it’s a record for 15.2 for me.

Do you see the messages on the console?

Yup. Still the usual terminated processes, etc.
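
Since the console scrollback dies with the VM, I’ve started streaming the journal to another box so the tail survives the next hang (22222 is the HAOS developer SSH port, which has to be enabled separately; the IP is a placeholder):

```
ssh -p 22222 root@<haos-ip> 'journalctl -f' | tee haos-journal.log
```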

Well, my hopes and dreams were dashed. It’s hung again.

After 20 minutes, it was still trying to shut down systemd-journald.service and “Docker Application Container Engine”.