For the past few days I've woken each morning to an unresponsive HA with its memory (and CPU) maxed out. A hard reset is the only way to recover in a reasonable amount of time (as in I've not waited more than a couple of hours for it to resolve itself).
I suspect some routine DB cleanup is occurring that is erroring out. Is there a way to recreate what is going on while at the console so it can be diagnosed?
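For anyone wanting a repro at the console: if the suspect is the recorder's nightly purge, it can be triggered on demand from Developer Tools > Services to see whether it reproduces the spike. A sketch (the values below are just examples; `repack` rewrites the database and is the heavier part of the job):

```yaml
# Developer Tools > Services - run the recorder purge on demand
service: recorder.purge
data:
  keep_days: 10   # example value - match your retention
  repack: true
```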
I’d be more suspicious of non-standard integrations, apps/add-ons, blueprints etc.
For apps/add-ons, they report their memory right in the UI.
I’d start at the top by monitoring/disabling/removing things you’ve added, then add them back one at a time until this starts happening again. A good place to start would be the last thing(s) you added before the problem appeared.
You could start by looking at your log files, and, as mentioned above, if anything points towards a certain integration/add-on (app), try enabling “Debug” logging for it.
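For reference, enabling debug for a single integration via `configuration.yaml` looks roughly like this (`zha` is just a placeholder for whichever integration you suspect):

```yaml
# configuration.yaml - verbose logging for one suspect integration only
logger:
  default: warning
  logs:
    homeassistant.components.zha: debug
```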
Thank you for the replies. I should have added that I checked the logs and see nothing awry during the time frame. I also checked the start of the window where memory goes through the roof and nothing untoward occurs then either.
The box is a Proxmox container with 6 GB of RAM. It has been running fine for the past 5 years or so. The only recent config change I made was to add an entry to the recorder excludes.
Since posting I realised that I was still running the DB in a Postgres 13 cluster, and I have now migrated it to 17. I’ll see how that goes to rule out any DB issue.
I’m running HA Core. I presume I don’t have access to that feature?
Go to Devices & Services, find the Home Assistant Supervisor integration, and then go through each app and enable Memory percent (and optionally CPU percent) in the Sensor section. That’ll give you history for those sensors. Those entities are also available to use in the frontend with cards like history graph, statistics graph, entities, etc.
Aside from that, do you use the VSCode app? It’s created a few meltdowns for me, although it is very uncommon.
Edit: Sorry I missed that you were running a core installation.
This sounds very familiar, as some time back I had problems (HAOS on Proxmox) where HA was unresponsive in the morning, and it turned out I had (my own fault) made the backups very large.
I’m not sure if it is corrupt. I’ll look into checking it is sound.
I don’t think it is obese - I add exclude entries regularly to keep it trim, and that list is what I recently added to. Looking at my logs, this was happening before I added that entry.
Yes, my system is running out of memory and goes into max-swap, max-CPU mode. It still runs, but slowly enough to effectively kill the system. Memory goes from <5% usage to 100% in about 90 minutes starting at 6.30am, so something is spinning.
Nothing in the logs that I can see (they just stop being generated by that point). I managed to run ps aux, which shows the HA Python process as the culprit. Is there any way to see what exactly is taking up the memory inside that process?
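A couple of console ideas for that - note `py-spy` is a third-party sampling profiler, not part of HA, and `<PID>` stands for the HA Python process id you got from `ps`:

```shell
# Top memory consumers, sorted by resident set size (RSS)
ps aux --sort=-rss | head -n 5

# To look *inside* a running Python process without restarting it,
# a sampler such as py-spy (installed separately) can attach:
#   pip install py-spy
#   py-spy dump --pid <PID>   # one-off stack dump of every thread
#   py-spy top --pid <PID>    # live view of the busiest functions
```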
I don’t think so. Settings suggest they run at 5:45, so it is close, but each backup is only 90 MB and takes less than 10 seconds to run manually.
I’ve doubled the RAM to 12 GB, which should give me some time to investigate, but unless I have a way of looking inside the process I’m not sure what I’d actually do! Would disabling integrations be an idea?
And which automations, templates/scripts are involved/running/executed at this time?
I mean, such an exact time clearly points towards something you have done, maybe unintended.
You say your manual backup takes under 10 seconds and is only 90 MB. Was it successful, or did it crash part way through? Is it including the correct files? Can you verify the backup file is intact?
How big is your database? What is the size of your drive and how much free space?
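To put numbers on those questions (the database name `homeassistant` is an assumption - check the `db_url` in your recorder configuration for the real one):

```shell
# Database size inside Postgres (skips quietly if psql isn't on this box)
if command -v psql >/dev/null; then
  psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('homeassistant'));"
fi

# Free space on the filesystem holding the database files
# (point df at your real Postgres data directory, e.g. /var/lib/postgresql)
df -h /
```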
What times do your housekeeping automations start? When do they stop? Do they finish with no errors? They are scheduled for certain times, however you may be able to manually run them to observe behaviour in debug mode.
What apps/integrations are you running? Are they all up to date?
What is your baseline (CPU, disk space, as well as memory) before database housekeeping starts? Is the system already overloaded, or is the housekeeping at fault?
Does one automation finish before the next one kicks in?
Some logs [appropriately </> formatted for forum display] may put some more eyes on your issue and pick up something you may have skipped over.
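On manually running the scheduled housekeeping: an automation can be fired on demand from Developer Tools > Services, conditions and all, so you can watch it in debug mode (the entity id below is a placeholder):

```yaml
# Developer Tools > Services - run a housekeeping automation on demand
service: automation.trigger
target:
  entity_id: automation.nightly_db_purge  # placeholder - use your own
data:
  skip_condition: true
```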
So still a spike, just not one that took out the host (you can see gaps in the earlier history where the host was too overwhelmed to record anything).
Going back in time, there’s always been a bit of above average memory usage at 6.30am every day. It started getting large from the 17th (which was also the first crash).
So this isn’t a new process, just something existing that has suddenly gotten worse.
Is there a way to see when integrations were updated? A log or some such?
/Settings/System/Logs … click the three dots on the right (Raw log) and scroll up to the 17th, or copy that log to your PC and search there.
Also in that view, in the top-right corner, you can find your other logs: apps (add-ons), Supervisor/OS, etc.
PS: You also haven’t stated which system/version etc. you are running, besides Core - which version? And which Postgres?
How are your OS / Docker “feeling”?
And which Python version?
Built-in integrations update during Core updates; for third-party ones, it's when you decide - click their GitHub link (look up release dates), or make notes when you update them.
PS: You have chosen a Core install; it involves some degree of DIY effort on your part.