HA keeps crashing and I have no idea how to troubleshoot the problem

Hi team!

I am really hoping someone can help me troubleshoot why my HA instance keeps dying on me on a weekly basis. I am going to be as detailed as possible because I think it should help narrow down the issue.

The last couple of months, I have been experiencing HA crashes every week or so. My HA instance was running on a VM on a NAS. It would start acting slow a couple of days prior to the crash, and then it would crash completely. While it was broken, the dashboard would load, but everything would have errors and the statuses of entities were unknown. At that point, it wouldn’t let me do a reboot through HA - the only way to fix it was a hard reboot of the NAS. After a reboot, things would work great for a few days, then get slower, and then crash again. Rinse and repeat.

A week ago, I changed to running it via Proxmox on a brand new mini PC. Everything was running great for a few days, and then it again got slower, and it crashed again last night. Except this time, a hard reboot of the mini PC did not bring me back to the normal state. I can log into HA, but my dashboards aren’t loading and everything is slow. I am now trying to reboot into safe mode but it just hangs there at the reboot screen and doesn’t actually reboot.

I know that this is some type of configuration problem and not a hardware issue, but I have no idea how to pinpoint the problem. I tried looking at the system logs and I really don’t know what I’m looking at. Which system log? What do these lines mean? I only see very recent line items, nothing from before 5 AM when I know the latest crash occurred.

Could someone please give me some direction on how to proceed? I read all these stories about people’s HA instances running reliably for a year with no problems… and I want to be one of those people!!

Logs.

Go here:

Scroll down to troubleshooting

Find the entry about getting logs after a crash.


Thank you for your reply. I went to the link you recommended, but I could not add the “SSH and Web Terminal Addon” because my HA basically won’t do anything right now. When I clicked the link to add the add-on, it spins and spins, and eventually takes me back to my broken dashboard.

I tried going to my HA VM terminal in proxmox and typed “journalctl” but it says that is not a valid command. Is there another way I can get this info?

Also are you using custom integrations?

I am. I figure it might be one of them doing something weird. But would love to be able to disable all of them and start trying to narrow it down. I thought that’s what safe mode was supposed to do, but I can’t get into it!

I got it rebooted into safe mode using the command prompt. It is still ridiculously slow and won’t let me install the addon to retrieve logs. Is there anything else I can do? I have a backup on google drive, so I guess I will have to learn how to make a new proxmox VM and install that backup. Then install the addon to retrieve logs before it crashes again?

Why? It is in the UI.

Settings>>system>>logs

Also I would look at memory use of the VM

Those logs in the GUI only show me the most recent things, not from the initial crash.

Here’s the memory usage graph. I’ve allocated 8 GB to the VM. It looks like it’s only using 2 GB.

Whatever’s causing the crash likely shows up in the logs before the failure. Before a reboot, the logs may still exist in config.
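If you do manage to copy a log file off the machine before it gets overwritten, a short script can pull out just the warnings and errors leading up to the crash. This is a minimal Python sketch, assuming each log line starts with a `YYYY-MM-DD HH:MM:SS.mmm LEVEL` prefix, which is how Home Assistant Core log lines usually look. The function name, the 5 AM cutoff, and the file name `saved.log` in the usage comment are placeholders, not anything from your system.

```python
from datetime import datetime

# Levels worth reading first when hunting for the cause of a crash.
LEVELS = ("WARNING", "ERROR", "CRITICAL")


def lines_before(log_lines, cutoff, levels=LEVELS):
    """Return lines at one of `levels` that are stamped earlier than `cutoff`.

    Assumes a "YYYY-MM-DD HH:MM:SS.mmm LEVEL ..." prefix on each entry.
    Lines without that prefix (e.g. traceback continuations) are skipped.
    """
    kept = []
    for line in log_lines:
        parts = line.split(" ", 3)
        if len(parts) < 4:
            continue
        stamp = f"{parts[0]} {parts[1]}"
        try:
            when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f")
        except ValueError:
            continue  # not a timestamped entry
        if when < cutoff and parts[2] in levels:
            kept.append(line.rstrip("\n"))
    return kept


# Usage (hypothetical file name and crash time):
# with open("saved.log", encoding="utf-8") as fh:
#     for entry in lines_before(fh, datetime(2024, 5, 1, 5, 0)):
#         print(entry)
```

The output is usually short enough to paste into a post, which also gets around the attachment limit.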

Same here. Failure happens over time. Watch if usage is increasing over time.

You need to check this stuff every hour to monitor for problems
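Checking by hand every hour gets old fast, so here is a rough Python sketch of the same idea: read `/proc/meminfo`, work out how much memory is in use, and flag a series of readings that only ever goes up. It assumes a Linux machine with Python 3 and the usual `MemTotal` / `MemAvailable` fields. The function names and the 100 MiB threshold are made up for illustration, and where you would run it (the Proxmox host, or a container with a shell) is up to you.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: kiB} dict."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def used_mib(meminfo):
    """Memory in use, in MiB, as total minus available."""
    return (meminfo["MemTotal"] - meminfo["MemAvailable"]) / 1024


def is_climbing(samples, min_rise_mib=100.0):
    """True if usage never drops between samples and rises by at least
    `min_rise_mib` overall (the threshold is an arbitrary assumption)."""
    if len(samples) < 2:
        return False
    non_decreasing = all(b >= a for a, b in zip(samples, samples[1:]))
    return non_decreasing and (samples[-1] - samples[0]) >= min_rise_mib


# Usage: take one reading per hour (e.g. from cron) and keep the history.
# with open("/proc/meminfo") as fh:
#     print(round(used_mib(parse_meminfo(fh.read()))), "MiB in use")
```

A leak shows up as readings that keep rising between restarts. Usage that levels off after startup is normal.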

Thanks. Proxmox shows that over the past week, it didn’t get above 3 GB of RAM usage.

I got the logs from the current failed state in safe mode (where it is not loading custom integrations). I really have no idea what I’m looking at. I would love to post it here so you guys could take a look, but it says I can’t post .txt files and the total length is too long to post :(

Are you using HA OS or a supervised install on Debian? I prefer the supervised install because you can then get in and use all of the Debian bash commands to check things out. I had problems in the past with an integration using up all of the memory, and then the system would start thrashing as it tried to use swap space. The system would run very slowly and then crash a few hours later. I disabled swapping, so if the memory was exhausted the system would just crash and reboot. This stopped the system from thrashing for a few hours first. Since you’re using a VM, you should be able to check the wait state on the host machine. This would tell you if the system was getting hung up because of disk IO. Also, does the host OS seem to be responding slowly as well?
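For the wait-state check on the host, one option is to read `/proc/stat` twice and compare. Below is a minimal Python sketch, assuming a Linux host with Python 3 and the standard aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal). The function names and the five-second gap in the usage comment are my own choices for illustration.

```python
def cpu_times(stat_text):
    """Return the first eight counters of the aggregate 'cpu' line:
    user, nice, system, idle, iowait, irq, softirq, steal."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent waiting on IO between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is iowait


# Usage: sample twice, a few seconds apart.
# import time
# with open("/proc/stat") as fh:
#     first = cpu_times(fh.read())
# time.sleep(5)
# with open("/proc/stat") as fh:
#     second = cpu_times(fh.read())
# print(f"iowait: {iowait_percent(first, second):.1f}%")
```

A value that stays high while HA feels slow would point at disk IO rather than memory.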

I’m using HAOS on the proxmox VM. I don’t think I’m having memory problems because it doesn’t seem to go above 3 GB. But it sure acts like there is no more RAM available because it’s so slow!

I’ve never used proxmox, but I’d be concerned that while you might have allocated 8G, maybe something is preventing it from all being used. Either something associated with proxmox configuration, or the HA VM. If you can access the proxmox command line you should be able to see if the HA system is taxing the physical resources.

Thanks. I can access the proxmox command line, I just have no idea what to do with it.

This sucks. I get so frustrated trying to troubleshoot these types of problems. I’d much rather be troubleshooting automations.

I took a quick look at the proxmox command line tools and they seem to be lacking.
When the HA VM boots, can you get to its command line? Proxmox should allow you to access the VM console. Assuming you get logged in, I’d start by watching top, as it shows you the amount of time the system is spending in various states, i.e. user, system, idle, wait, etc. The wait state will tell you if you have a disk IO problem. You can see if you’re pegging the CPU, and you can see which processes are taking up the CPU time and memory.

The best way to troubleshoot this kind of problem is to start from scratch and work methodically.

Create a new VM and install HAOS. Then slowly every day or two add whatever ADD-ON/HACS you want to use. When you see the problem raise its head you know what to do.

Try my guide to install HAOS on Proxmox. It will always provide a clean install.

https://portal.habitats.tech/Home+Assistant+(HA)/3.+HAOS+-+Install+VM

You also have access to guides such as Proxmox, Traefik, etc.

I did just create a new VM and installed HAOS the exact same way I did before. I shut down the old VM. But when I started the new VM, it doesn’t even start HA… it gives me the same error as when I try to run my old one: “Error returned from Supervisor: System is not ready with state: setup”

What the heck. How is a problem on my old VM affecting the new one??

You lost me on the second half of this. Yes, I can get to the VM console. But I don’t understand what to do after that!

top is a command line command. Run it and you’ll get output something like this:

The third line tells you how the system is using CPU cycles. On a system that is running well I’d expect id (idle) to be in the 90% range most of the time. I’d expect wa (wait) to be close to 0. MiB Mem shows you how much memory the system thinks it has (total), used shows how much it’s using, and avail Mem shows how much memory can still be assigned to processes. I’ve disabled swap, so mine is zero, but you should see how much swap you have allocated. I’d expect with 8G of memory dedicated to this VM that MiB Swap free should be close to the amount that is allocated. If it turns out most of the swap space is being used, that would be your problem. You then see a list of processes on the system. The %CPU and %MEM columns show how much CPU time and memory each process is using.

At the command prompt you simply enter top to run top
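If reading the `%Cpu(s)` line by eye is confusing, the same rule of thumb can be written down as a small Python sketch. It assumes the procps-style format (`1.3 us, 0.7 sy, ... 97.8 id, 0.1 wa`); other versions of top, such as the BusyBox one, print a different layout. The function names are hypothetical. The 90% idle figure comes from the description above, while the 5% cutoff for "close to 0" wait is purely my assumption.

```python
import re


def parse_cpu_line(line):
    """Turn a procps-style '%Cpu(s): 1.3 us, 0.7 sy, ...' line into
    a dict such as {'us': 1.3, 'sy': 0.7, ...}."""
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})\b", line)
    return {name: float(value) for value, name in pairs}


def cpu_warnings(cpu, min_idle=90.0, max_wait=5.0):
    """Flag low idle time or high IO wait. Thresholds are rough guesses."""
    notes = []
    if cpu.get("id", 0.0) < min_idle:
        notes.append(f"idle is only {cpu.get('id', 0.0)}%: the CPU is busy")
    if cpu.get("wa", 0.0) > max_wait:
        notes.append(f"wait is {cpu.get('wa', 0.0)}%: possible disk IO problem")
    return notes


# Usage: paste the third line of top's output.
# line = "%Cpu(s):  1.3 us,  0.7 sy,  0.0 ni, 97.8 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st"
# print(cpu_warnings(parse_cpu_line(line)) or "looks healthy")
```

One reading only describes that moment, so it is worth repeating the check while the system is actually slow.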

Ah, thank you for such a great explanation! Here’s my “top” screenshot. It looks like the hardware is working as expected??