HassIO stops responding every so often

benzo · September 25, 2020, 3:56pm

Same problem here; in my case I have had it since January 2020. I reinstalled everything two months ago after backing up my YAML files etc., but the issue is still there. I remember reading about a Foscam stream issue that freezes the system, but I don’t know if that is the case here. The IP is still visible on the local network, but the web GUI and apps are not working.

Raspberry Pi 3 B+

Pelicjan · September 27, 2020, 6:49pm

Same problem. I have an RPi 3B+ with an SSD. HassIO randomly freezes, and there is nothing I can do when it happens. Only ping works.

Recte · September 28, 2020, 7:27pm

I had performance issues, as posted here, which showed up as high CPU loads. These were caused by IO waits. Not something you might expect when using an SSD, but perhaps worth monitoring for a while.

I used a command line sensor, since IO wait is not available by default. You can create the sensor by adding the code below to your configuration.yaml:

command_line:
  - sensor:
      name: CPU IO Wait
      # Reads the "io" percentage from the CPU summary line of top (assumes BusyBox-style top output)
      command: top -b -n1 | grep ^CPU | awk '{printf("%.0f", $10)}'
      scan_interval: 3
      unit_of_measurement: "%"
      value_template: '{{ value }}'

As you can see in the topic, it showed a direct relation with the CPU loads.

Edit: Since 2023.6 the command line sensor syntax has changed. This edit on June 8 '23 replaces the old YAML.

baldfox · September 29, 2020, 6:00am

I too am having this issue on a Celeron NUC / SSD. I am running hassio (you can rename it to whatever you want ;)). Over the last 3-4 months it has been getting more and more unresponsive. Suddenly it’s not accessible, but then everything works again a few minutes later; other times it takes hours. What’s odd is that the main UI is not accessible at all, but if I ask Alexa to turn something on (a Node-RED script that switches on a Tasmota light, for example) it still works. I’m loath to start over and redo it, but I’m thinking this might be the only course of action to take.

benzo · September 29, 2020, 6:44pm

So, TL;DR: is it a hardware issue (limitation)?
You mention using an SSD instead of an SD card, but would the same problems occur with a mechanical HDD, e.g. HA on a VM or in Docker?
Thank you

Recte · September 30, 2020, 9:41am

Yes, it is a limitation of SD card performance. I fixed it by using an SSD (solid-state drive), but I expect you can also use an HDD.

In summary:
My HA froze from time to time, and all connections of the ConBee II stick also became unavailable from time to time. I found out that both issues had one thing in common: a high CPU load (not to be mistaken for the CPU usage percentage) before or at the moment of the issue.

I found out that the high CPU load was caused by a high CPU IO wait (as shown in the images). What basically happens is that your CPU is waiting for the read/write actions to finish. Currently I have a maximum IOWait of 2% (using an SSD), whereas it was at 100% quite often, and for longer periods, when I was using an SD card.

Disclaimer: the reason for the load is/was not necessarily HA itself. Limiting the frequency of logbook writes helped a little. I also use several addons which could very well be the cause of the amount of IO. However, I wanted to keep them, so I decided to increase storage performance instead of lowering the IO.
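Another lever for reducing database writes, different from the write frequency mentioned above, is to tell the recorder to skip entities whose history you do not need. This is only a sketch, assuming the recorder's exclude filter; the domains and entity names below are hypothetical examples and not what was actually configured on this system:

```yaml
# configuration.yaml (sketch; entity and domain names are made-up examples)
recorder:
  exclude:
    domains:
      - automation
    entity_globs:
      - sensor.weather_*
    entities:
      - sensor.date
      - sensor.last_boot
```

Excluded entities still work in automations and the UI; they just no longer produce history or logbook entries on disk.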

So I am not saying my solution is the only, or the only right, solution; it’s basically a tip for troubleshooting.

Recap
The first step I’d advise when troubleshooting an unresponsive HA is to check whether the CPU is obstructed in any way. By default the sensor platform ‘systemmonitor’ gives you all the tools except IOWait. Therefore I created the command line sensor shown above.
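For reference, the load and CPU usage figures can be exposed roughly like this. It is a minimal sketch, assuming the YAML-based systemmonitor configuration that was current in 2020; I believe newer releases set this integration up through the UI instead, so treat the exact syntax as an assumption for your version:

```yaml
# configuration.yaml (sketch; 2020-era systemmonitor syntax)
sensor:
  - platform: systemmonitor
    resources:
      - type: load_1m
      - type: load_5m
      - type: load_15m
      - type: processor_use
      - type: memory_use_percent
```

A load average that stays well above the number of CPU cores (4 on a Raspberry Pi 3B+) while processor_use remains low matches the pattern described here: work is queueing behind IO rather than being limited by the CPU itself.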

baizinger · September 30, 2020, 9:41am

As mentioned above, I had the exact same problem. Moving to a new A2 SD card and adding every integration from scratch (without any snapshots) solved the issue for me.

benzo · September 30, 2020, 12:30pm

There are too many SD card classifications… SDHC II U3 C10 V60 A2

benzo · September 30, 2020, 12:30pm

Thanks again (:

baizinger · October 1, 2020, 7:36am

This is the one I used. But I think starting all over again with my .yaml files is what solved the problem for me.

r33b · October 5, 2020, 9:37am

After the troubles I mentioned above, my system has now been stable for 17 days.

For me it was the deCONZ addon. I moved the ConBee/deCONZ to a separate system. Since then the behavior has not occurred again. Everything is running fine.

So for those who have similar problems, I would suggest trying to narrow the problem down by disabling addons.

benzo · October 5, 2020, 12:34pm

No freezes for 4 days now, without switching to an SSD or an A2 SD card. What I did was delete the Lovelace “picture-elements” cards for the two Foscam cameras; the card example was taken from the official docs (https://www.home-assistant.io/integrations/foscam/). With your IO wait sensor I have been monitoring the situation constantly: so far the spike has reached 76 only once and sometimes 50, but the average is 25. Is an average of 25 a “normal” situation?
Thank you again!

PS The CPU load never went over 25%.

Recte · October 6, 2020, 2:18pm

I am no Linux expert, but as far as I know the IOWait percentage tells you the share of time the CPU sits idle while waiting for outstanding IO (reads/writes) to complete. So what you are basically saying is that your CPU spends 25% of its time waiting for your IO to finish.

As a result, you most likely see a high load (a number, not a percentage) together with a relatively low CPU usage percentage. The reason is that requests keep coming in, but the CPU has to wait for the IO to finish before it can process them.

Simply put, an instruction such as “show lovelace” or “process login” sits in the load queue because the IO isn’t ready and the CPU is waiting.

FYI: my current average IOWait is about 0% since the switch to the SSD. The spikes I see occur about once or twice an hour and hit 2% at most. CPU usage is 3-6%, and my 15m load average over a period of two hours is between 0.64 and 0.89.

Back when I was using the SD card, IOWait was hitting 100%, as you can see in the image, even though I had already reduced the write actions.
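If you want a record of when IO wait stays high, so you can compare it with the times of the freezes, something like the automation below might do. It is only a sketch in the automation YAML syntax of that time: the entity ID sensor.cpu_io_wait is my assumption for the “CPU IO Wait” command line sensor, and the threshold of 50 and the 5-minute duration are arbitrary example values:

```yaml
# configuration.yaml (sketch; entity ID and thresholds are assumptions)
automation:
  - alias: "Flag sustained high IO wait"
    trigger:
      - platform: numeric_state
        entity_id: sensor.cpu_io_wait   # assumed entity ID of the CPU IO Wait sensor
        above: 50                       # example threshold, in percent
        for: "00:05:00"                 # value must stay above the threshold for 5 minutes
    action:
      - service: persistent_notification.create
        data:
          title: "High IO wait"
          message: "CPU IO wait has been above 50% for 5 minutes."
```

Short spikes are ignored because of the for: condition; only sustained waits create a notification.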

I hope this helps.

benzo · October 6, 2020, 6:40pm

Thank you very much

benzo · October 6, 2020, 6:41pm

Hello again, did you test the IO waits with @Recte’s sensor, and the CPU load, before and after changing the SD card?

baizinger · October 8, 2020, 7:19am

Hey. I tracked the CPU load and memory.

- Original SD card: CPU goes up to 75% in one day, then crash
- New SD card with snapshot restore: CPU goes up to 75% in one day, deCONZ stuff is unavailable, then crash
- New SD card with a completely new configuration of addons and YAML, and NO snapshots: CPU peaks at about 25%, no crash since then

My system has been running for 2 weeks without any reboot of the Pi.

In my opinion, something went wrong in an update a while ago (probably deCONZ) and my system never recovered from it.

BamaBlueCollar · October 17, 2020, 5:39pm

I was running into a similar problem with a Pi 3B+ running the system from a USB stick. I was able to eliminate the problem simply by using a MySQL server running on another computer to replace the built-in recorder database.
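A minimal sketch of what such a setup might look like, assuming the recorder's db_url option and a MySQL server reachable on the local network. The user, password, IP address, and database name are placeholders, not values from this thread:

```yaml
# configuration.yaml (sketch; all connection details are placeholders)
recorder:
  db_url: mysql://hass_user:CHANGE_ME@192.168.1.50/homeassistant?charset=utf8mb4
```

This moves the recorder's database writes off the Pi's local storage and onto the other machine.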

jaruba · November 5, 2020, 7:46am

I just went through this whole thread because I have the same issues as most of you for some months now.

What I’ve learned from this thread:

- most use HA on an RPi
- many use an SSD
- many (if not all) seem to be using the deCONZ addon

I also run HA on an RPi 3B+, use an SSD and the deCONZ addon. In my case the system can lock up within about 1-2 days of a restart. I disabled the deCONZ addon and the RPi stopped locking up.

Finding any good logs for the cause of this seems to be an issue for me too. I tend to think this could be a HA recorder or logbook issue somehow…

I’m thinking of modifying the default config to disable the recorder as a test…

jaruba · November 8, 2020, 12:40am

This was clearly a hardware limitation of the RPI…
I (finally) managed to fix my issues by optimising the more intensive integrations.
I set memory_init: 256 in Unifi Controller’s config, disabled query log in AdGuard Home, and tweaked the recorder with these settings:

recorder:
  purge_keep_days: 1    # keep only 1 day of history in the database
  commit_interval: 10   # batch database writes, committing every 10 seconds

Not only does it not hang anymore, all the automations seem a lot faster too.

skynet35 · December 19, 2020, 8:52am

Hi everybody. For one week I had the same problem as you: HA crashed every day, every 24 hours (config: NUC with Proxmox and Home Assistant OS). None of the other VMs crashed, so the problem came from HA.
I removed the most recently installed plugins one by one, and bingo: after removing motionEye, HA stopped crashing. Maybe this can help someone.