HomeAssistant Becoming More Unstable

This has been happening with greater regularity where HA just stops. The host is up and I can ping it but it’s totally dead, can’t SSH to it, can’t SAMBA to it and cannot use the web interface - my only recourse is to power off the rPi and let it come up. It’s a big enough concern that I built an ESP device to monitor my HA and reboot it if it fails pings (which I now know is faulty so I’ll do it with API calls instead).
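
One way the API-based check could look, as a rough sketch only. It assumes the ESP runs ESPHome and that Home Assistant connects to it over the ESPHome native API, neither of which is stated above. The pin, the ids `pi_power` and `missed_checks`, and the five-check threshold are all hypothetical.

```yaml
# ESPHome sketch (assumption: the watchdog runs ESPHome; pin and ids are hypothetical)
api:
  reboot_timeout: 0s        # keep the ESP itself from rebooting while HA is down

globals:
  - id: missed_checks       # consecutive checks with no API client connected
    type: int
    restore_value: no
    initial_value: '0'

switch:
  - platform: gpio
    pin: GPIO5              # hypothetical pin driving a relay on the Pi's power feed
    id: pi_power
    restore_mode: ALWAYS_ON

interval:
  - interval: 60s
    then:
      # Count consecutive minutes without a native API connection
      - if:
          condition:
            api.connected:
          then:
            - lambda: 'id(missed_checks) = 0;'
          else:
            - lambda: 'id(missed_checks) += 1;'
      # After 5 missed checks, power-cycle the Pi and start counting again
      - if:
          condition:
            lambda: 'return id(missed_checks) >= 5;'
          then:
            - lambda: 'id(missed_checks) = 0;'
            - switch.turn_off: pi_power
            - delay: 10s
            - switch.turn_on: pi_power
```

Unlike a ping, this only counts the host as alive while Home Assistant (or another API client) actually holds a connection to the ESP, which is closer to "HA is responding" than "the network stack is up".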

I am not sure where to look to find out what causes this sudden stop. Which logs have this information? The main HA log doesn’t have anything outside of the normal internal error reporting, and nothing seems to shine a light on this problem. I’ve checked all the logs under System that seem logical to check, and nothing tells me why HA is essentially crashing.

This has been happening for a few months now. It isn’t predictable, just random and it’s not terribly frequent, but it’s happened a half dozen times in several months and that has me concerned - especially since my house is reliant upon HA and I often take RV trips where I need it to maintain my security systems and vacation “fake lighting”.

Any suggestions on how I can try to track this gremlin down? I’m almost to the point of starting over with HA but that is a MASSIVE undertaking, so much so that I’m avoiding it at all costs.

Are you using a Raspberry Pi? Is there by any chance an SD card in it, with Home Assistant installed on it?

Use the system monitor to keep an eye on your system memory use. You may have a memory leak caused by an automation loop or a third party integration.
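
As a rough sketch of how that could be watched automatically: the automation below raises a notification when memory use stays high. It assumes a memory-use sensor already exists; the entity id `sensor.memory_use_percent`, the 85% threshold, and the 10-minute hold time are hypothetical and would need adjusting to your setup.

```yaml
alias: "Warn on sustained high memory use"
description: ""
trigger:
  # Fires once memory use has stayed above the threshold for 10 minutes
  - platform: numeric_state
    entity_id: sensor.memory_use_percent   # hypothetical entity id
    above: 85
    for:
      minutes: 10
action:
  # Leave a notification in the HA sidebar with the current reading
  - service: persistent_notification.create
    data:
      title: "High memory use"
      message: "Memory use is at {{ states('sensor.memory_use_percent') }}%."
mode: single
```

Graphing the same sensor in its history view over a week or two should also show whether usage climbs steadily before each hang.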

No, I’m on SSD external, no SD card exists at all on my Pi.

Do you mean create sensors for this or is there another system monitor you are referring to?

I would like to help you, but I was not brave enough to use a Pi as my main computer. I use a full desktop computer and built my smart home around it.

That’s the direction I am moving in as well, once I figure out how to get Bluetooth to work on VMware :slight_smile:

Use Debian and a container-based installation. Leave everything else out.

Once everything on my setup was tweaked to high heaven, it worked perfectly, except for the same symptoms you describe. (I have an RPi 4 with *Gig of RAM running a healthy, supported Home Assistant Supervised install, no microSD card in the slot, and a 1TB SSD (Samsung T7).)

It’s like the CPU just decides to go on vacation. CPU usage goes to zero (I think) and it sits there powered on, running very cool (not overheated at all), happy as a clam to be “out of the office”.

I did a lot of research on this. I have heard from forum moderators that this is a known issue with the RPi (not sure if it happens to everyone, but it is known to happen a lot), and I can’t find ANYTHING out of the ordinary in ANY of the logs.

So, I just had to give up and used the below workaround - since it happened to me every 10 days or so, I decided to create two automations that would allow me to reboot the host hardware from within HA (properly so the OS closes all files in use, etc.) twice per week, and adjust when that happens right in my dashboard. The helpers allow me to select the day of the week and the time of that day for each of my 2 reboot automations.

I have them set to run in the middle of the night on different days of the week and it works like a charm. I have A LOT of sensors, integrations and automations, and I also use the RPi as a DNS server to strip out advertisements from our web browsing, and every single time, everything reconnects properly.

(I have pretty good networking equipment, well configured with solid connections everywhere, so a router issue is rare; a reboot is needed maybe once every 6 months or so. I also scheduled my router, in its firmware, to reboot on a certain day and time in the middle of the night once each month, which helps stability as well. You might want to consider that too.)

Also if I were you I would connect your RPI directly to ethernet (rather than WiFi) if possible.

Anyway, here’s an example of one of the automations:

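alias: "Host Reboot (Per Day and Time #1 of 2)"
description: ""
trigger:
  # Fires daily at the time stored in the input_datetime helper
  - platform: time
    at: input_datetime.time_for_weekly_reboot_1_of_2
condition:
  # Continue only if the day selected in the helper matches today's
  # abbreviated weekday name (e.g. "Mon"), as returned by strftime('%a')
  - condition: template
    value_template: >-
      {{ states('input_select.day_of_weekly_reboot_1_of_2') ==
      now().strftime('%a') }}
action:
  # Ask the Supervisor to reboot the host
  - service: hassio.host_reboot
    data: {}
mode: single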
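
The two helpers this automation reads could be defined along these lines. This is only a sketch, assuming YAML-defined helpers (they can also be created from the UI). The entity names are taken from the automation above; the friendly names, and the option labels chosen to match what `strftime('%a')` returns in an English locale, are assumptions.

```yaml
input_select:
  day_of_weekly_reboot_1_of_2:
    name: "Day of weekly reboot (1 of 2)"   # hypothetical friendly name
    # Options must match the abbreviated weekday names from strftime('%a')
    options:
      - Mon
      - Tue
      - Wed
      - Thu
      - Fri
      - Sat
      - Sun

input_datetime:
  time_for_weekly_reboot_1_of_2:
    name: "Time for weekly reboot (1 of 2)"  # hypothetical friendly name
    has_date: false   # time of day only
    has_time: true
```

If the option labels are spelled any other way (for example "Monday"), the template condition never matches and the reboot silently never runs.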

I was experiencing similar symptoms to what you described. I didn’t know it at the time, but my SSD was dying.

PSA: Don’t get Western Digital/Sandisk SSDs: Western Digital promises to release firmware update for failing SanDisk Extreme SSDs | Engadget

That’s not a bad idea, thanks for the history and the automation you set up to help mitigate. I assume since you did this that everything has been good? I’m a little concerned about a reboot causing DB corruption but I suppose I could try to mitigate that as well to only reboot after my nightly backup occurs.

Well that sucks, I do have a WD SSD. What do you use? Since I don’t know the exact cause I cannot rule this out (and I have considered that my drive could be failing as well).

I’ve been hyper diligent with setting up regular backups to my NAS. That will buy you some time until you can migrate off of your drive to confirm whether the ssd is/isn’t the issue.

Great video to follow from Lewis/@EverythingSmartHome:

Current SSD I’m using is a Sabrent brand… but anything other than WD (for now) should be good.

My Samsung T7 has been running for about two years with no issues. It also draws very little power, and it’s the third one I have.

Occasionally I clone the whole thing with the RPi SD Card Copier: boot the Pi from a generic Raspbian SD card with no SSDs connected to the USB ports, then plug both SSDs in, and they show up as SD cards in the SD Card Copier app. (I actually have a Samsung T3 and a Samsung T7, both 1TB. If they were the exact same model, I wouldn’t know which was which in the SD Card Copier, because it only shows the hardware name.) So I have a backup drive I can just plug in, and once it is up and running I can do a full HA restore from within HA (I use the Google backup add-on to make backups constantly; on mine it runs daily).

I think this situation was unusual, but one of the SSDs I bought caused me no end of trouble, and it did in fact have corrupted sectors. Finding that out took a bunch of painful Linux education on how to examine and check low-level Linux/Debian file system stuff, as well as buying some expensive software for managing partition sizes and disk drives for diagnostic purposes. I finally determined it must be the SSD and was able to replace it for free after struggling for three weeks not knowing what the issue was (the store had a one-month exchange policy, and I was lucky to ask for an exchange just before that window closed). A lot of unnecessary education for me!

BTW, the samsung T7 and T3 BOTH come with a great cable that is USB3.0 on one end (for the RPI) and USB-C on the other end for the SSD…

And yes, I set that reboot automation up about three months ago and it has been rock solid.

Note, the host_reboot command is VERY safe, it politely asks the underlying OS to close all open files and databases while HA shuts down all of its processes in an orderly manner, before the RPI power goes off briefly. So it’s the safest way to do it in my opinion, because I believe that command takes care of ensuring nothing is corrupted. I have other daemons running on the RPI host outside of and completely unrelated to HA, and it shuts those down (and a database there as well) very cleanly and in an orderly manner. (Of course you need all of your stuff you need running - HA and outside of HA - to be set up to launch when the RPI is first booted up…)

I believe it doesn’t just reboot the hardware - I think it properly shuts down HA and then as a last step it makes a formal request to the underlying OS to reboot in safe orderly fashion when it is able to. So, it’s not an instant reboot - I haven’t watched it lately, but it does take a couple of minutes. So, go for it! You’ll be happy you did!

(P.S. I am going to add cameras soon, but before I do that, to play it safe, I am going to switch over to more powerful hardware, so hopefully my need for the reboots will go away at that point. However, I would still do it, maybe just not twice a week. A purist would balk at this: “FIND THE ISSUE AND FIX IT!!” But I believe it is better to have an occasional automatic reboot ANYWAY, so everything stays fresh and even minuscule software memory leaks and the like are never an issue.)

Have a look at

home-assistant.log
home-assistant.log.1
home-assistant.log.1.fault

The .1 files are the logs from previous run, so you can always retrace where it went wrong :wink:

Thanks, I did check those already to no avail (and .fault is empty).

I’m going to start with @KruseLuds’s idea of auto reboots to test whether the Pi is the problem before I put a new drive on it. If it still acts up despite the reboots, then I know it could be the drive.

I already do nightly backups to an SMB share so migrating to another drive is relatively easy for me if needed.

Yes.

What version of HASSOS are you running? There is a known issue with v10.0+. Easy solution for me was to downgrade to v9.5. Haven’t had this issue since.

Is this the known issue or is it something else?