Home Assistant dying every day

Hello Community,
I hate generic questions, but I’m starting to get desperate here. My Home Assistant installation now dies in under 24 hours, with random error messages about components or I/O errors.
I have already reinstalled Hass.io several times and restored snapshots. I started on a Pi, changed its power supply, connected it directly to the router instead of via D-LAN, and deactivated all custom components… Now, out of desperation and because I somewhat suspected the SD card, I bought a completely new NUC. Yesterday evening I restored a snapshot, it ran through the night, and this morning there were again hundreds of errors in the log. Now it is dead.
Sometimes it’s the Yeelight lamps, sometimes my Bluetooth thermostat, sometimes MQTT, sometimes Wemo, sometimes the network is allegedly gone, sometimes the filesystem is read-only…

Here are some excerpts from my log (sometimes I can’t even save it before I lose access to it): https://pastebin.com/WFVQjKmm

I’m slowly going crazy here! Does anyone have any idea what is going so wrong here?

Greets,
pattyland

Are you using Hassio?

What is in the system log (on the hassio page)?

Yes, I’m using Hass.io on HassOS, the standard image from https://www.home-assistant.io/hassio/installation/

The system log was one of the first things that stopped working this morning, with a red “Can’t load log”.
I will produce a new one this evening.

Have you tried exorcism? :latin_cross:
J/k we’ll figure it out :heart:

The system log looks pretty fine after another hard reboot: https://pastebin.com/iKNxyKYp

@krash I will do anything! Do you think a little cross might help?
I really miss my automations :frowning:

This is a long shot but if you happen to have the Duckdns hass.io add-on enabled, you should turn that off. I once had really strange issues I couldn’t diagnose. It sounds similar to your issue, where random components would start failing slowly, basically couldn’t make a network connection. I turned off the add-on and haven’t had the issue for nearly a month now.

Thanks for the tip, but I don’t use it. I use OpenVPN to access my Home Assistant on the go.
But maybe your comment will help someone else!

Edit: And it’s getting bad again… After running for 3 hours: https://pastebin.com/xKWAKxMz
Nothing new in the system log for hours.

SSH is already dead, I can’t restart the add-on, and the system log is empty. I can still use the GUI to interact with my devices, but more and more error messages are coming in: https://pastebin.com/A0ykyvmF

How big is your sqlite DB? Do you purge it?
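
If you want to check the size and integrity yourself, here is a minimal Python sketch. It assumes you can reach the database file from a machine with Python installed; the function name `check_sqlite_db` is made up for this example, and the path in the usage note is an assumption based on the default recorder file name.

```python
import os
import sqlite3


def check_sqlite_db(path):
    """Return (size_in_bytes, integrity_messages) for a SQLite file."""
    size = os.path.getsize(path)
    # Open read-only so the check itself never writes to the file.
    conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    # A healthy database yields the single message "ok".
    return size, [row[0] for row in rows]
```

You would call it as `check_sqlite_db("/config/home-assistant_v2.db")` (path assumed). A badly damaged file may raise `sqlite3.DatabaseError` instead of returning a list of problems.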

I deleted it yesterday because it had integrity problems. I think I have been deleting it every month…

You didn’t mention what version of HA you were running on or what hardware it was running on. I’ve noticed a bit of instability in 0.91.2 but I see today 0.91.3 is now available. Might be worth updating.

If you are using an SD card, it may be going bad. As mentioned above, try deleting the DB and the HA log files too.

HA 0.91.3 on a NUC J3455. I deleted the DB, but I’m not sure how to delete log files on HassOS.
Right now I can’t even connect via SSH or SMB :confused:

Sounds like a full disk.
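
To rule that in or out, here is a rough sketch using only the Python standard library. The helper name `disk_usage_percent` is made up for this example, and you would need to point it at whichever mount actually holds your data.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of used space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    # "/" is only a placeholder; use the mount point you care about.
    print(f"{disk_usage_percent('/'):.1f}% used")
```

Anything close to 100% would support the full-disk theory.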

Error writing config for auth: [Errno 30] Read-only file system: '/config/.storage/tmp6nc4b1z8'

Your disk is being remounted in read-only mode; this happens when there’s a filesystem issue.
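
If you can get a shell on the host, one way to confirm this is to look at the mount options. The sketch below parses text in the format of `/proc/mounts` on Linux and lists mount points whose options include `ro`. The function name is made up for this example. Some mounts are read-only by design, so a hit is only a hint.

```python
def read_only_mounts(mounts_text):
    """Return mount points whose option list contains 'ro'.

    Expects text in /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        # Exact match, so "errors=remount-ro" does not count as "ro".
        if "ro" in options:
            result.append(mount_point)
    return result
```

On the machine itself you could feed it `open("/proc/mounts").read()` and see whether the data partition shows up in the result.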

I flashed the image onto my SSD with Etcher. Could this be a problem? Maybe the filesystem did not get resized or something? Or would the sensor show the un-resized size in that case?

That would explain the short time during which it runs fine.

The disk is likely bad; you’ll need a new SD card, or to switch to an HDD/SSD or USB drive.

There is a slightly used 60GB SSD in my NUC that I checked with HDDScan before flashing. Do you think I should run another test?

Honestly, you’d need to check the kernel logs to see fully what’s going on, and to do that you’ll need SSH access to the main OS, or to mount the disk in another instance of Linux somewhere.
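
Once you have the kernel log as text (for example the output of `dmesg` saved to a file), a small filter like this can pull out the lines worth reading first. It is only a sketch: the pattern list is a short, non-exhaustive guess at typical storage-related messages, and the function name is made up for this example.

```python
import re

# Non-exhaustive patterns, chosen for illustration only.
SUSPICIOUS = re.compile(
    r"I/O error|EXT4-fs error|Remounting filesystem read-only",
    re.IGNORECASE,
)


def suspicious_kernel_lines(log_text):
    """Return the kernel log lines that match a storage-related pattern."""
    return [line for line in log_text.splitlines() if SUSPICIOUS.search(line)]
```

If this returns nothing, that does not prove the disk is healthy; it only means none of these particular messages were logged.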

It’s likely your SSD is fine; it’s just that you may have flashed a bad image. What else, if anything, is running on your NUC?