Home Assistant dying every day

pattyland · April 12, 2019, 7:14am

Hello Community,
I hate generic questions, but I’m starting to despair here. My Home Assistant installation is now dying in under 24 hours, with random error messages about components or IO errors.
I have already reinstalled HassIO several times and restored snapshots. I had a Pi, changed its power supply, connected it directly to the router instead of via D-LAN, and deactivated all custom components… Now I have bought a completely new NUC out of desperation and because I somewhat suspected the SD card. Yesterday evening I restored a snapshot; it ran through the night, and this morning there were again hundreds of errors in the log, and now it is dead.
Sometimes it’s the Yeelight lamps, sometimes my Bluetooth thermostat, sometimes MQTT, sometimes Wemo, sometimes the network is allegedly gone, sometimes the filesystem read-only…

Here are some excerpts from my log, sometimes I can’t even save it before I can’t get to it anymore: https://pastebin.com/WFVQjKmm

I’m slowly going crazy here! Does anyone have any idea what is going so wrong here?

Greets,
pattyland

tom_l · April 12, 2019, 8:57am

Are you using Hassio?

What is in the system log (on the hassio page)?

pattyland · April 12, 2019, 9:12am

Yes, I’m using Hass.io on HassOS, the standard image from https://www.home-assistant.io/hassio/installation/

The system log was one of the first things that stopped working this morning, with a red “Can’t load log”.
I will produce a new one this evening

krash · April 12, 2019, 9:19am

Have you tried exorcism?
J/k we’ll figure it out

pattyland · April 12, 2019, 3:46pm

The system log looks pretty fine after another hard reboot: https://pastebin.com/iKNxyKYp

@krash I will do anything! Do you think a little cross might help?
I really miss my automations

dwinnn · April 12, 2019, 4:21pm

This is a long shot but if you happen to have the Duckdns hass.io add-on enabled, you should turn that off. I once had really strange issues I couldn’t diagnose. It sounds similar to your issue, where random components would start failing slowly, basically couldn’t make a network connection. I turned off the add-on and haven’t had the issue for nearly a month now.

pattyland · April 12, 2019, 4:45pm

Thanks for the tip, but I don’t use it. I use OpenVPN to access my Home Assistant on the go.
But maybe your comment will help someone else!

Edit: And it’s getting bad again… After running 3 hours: https://pastebin.com/xKWAKxMz
Nothing new in the system log for hours

pattyland · April 12, 2019, 5:10pm

SSH is already dead, I can’t restart the add-on, and the system log is empty. I can still use the GUI to interact with my devices, but more and more error messages are coming: https://pastebin.com/A0ykyvmF

rdehuyss · April 12, 2019, 5:40pm

How big is your sqlite DB? Do you purge it?

pattyland · April 12, 2019, 6:37pm

I deleted it yesterday because it had integrity problems. I think I deleted it every month…
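
For anyone following along, a rough sketch of how the size and integrity of the recorder database could be checked from Python. It uses SQLite's built-in `PRAGMA integrity_check`; the path `/config/home-assistant_v2.db` in the usage note is an assumption about a default setup, and the helper names `db_size_mb` and `check_integrity` are made up for illustration.

```python
import os
import sqlite3


def db_size_mb(db_path):
    """Return the size of the database file in megabytes."""
    return os.path.getsize(db_path) / (1024 * 1024)


def check_integrity(db_path):
    """Run SQLite's integrity check and return the reported messages.

    A healthy database yields the single message "ok".
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

Calling `check_integrity("/config/home-assistant_v2.db")` (path assumed) should return `["ok"]` for an intact file. Anything else points at corruption, which on its own does not say whether the database or the disk underneath is at fault.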

ConcordGE · April 12, 2019, 6:43pm

You didn’t mention what version of HA you were running on or what hardware it was running on. I’ve noticed a bit of instability in 0.91.2 but I see today 0.91.3 is now available. Might be worth updating.

If using an SD card it may be going bad. As mentioned above try deleting the DB and the HA log files too.

pattyland · April 12, 2019, 7:02pm

HA 0.91.3 on a NUC J3455. I deleted the DB, but I’m not sure how to delete log files on HassOS.
Right now I can’t even connect via SSH or SMB

firstof9 · April 12, 2019, 7:03pm

Sounds like a full disk.
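
A quick way to test the full-disk theory, sketched with Python's standard `shutil.disk_usage`. The path `/` is only a placeholder; on Hass.io the interesting mount point may differ, and `percent_used` is a hypothetical helper name.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Return disk usage as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


if __name__ == "__main__":
    # "/" is a placeholder; point this at the partition holding /config.
    usage = shutil.disk_usage("/")
    print(f"/ is {percent_used(usage.total, usage.used):.1f}% full")
```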

pattyland · April 12, 2019, 7:04pm

firstof9 · April 12, 2019, 7:06pm

Error writing config for auth: [Errno 30] Read-only file system: '/config/.storage/tmp6nc4b1z8'

Your disk is remounting in read only mode, this happens when there’s a filesystem issue.
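
To confirm a read-only remount, one option is to look at the mount options in `/proc/mounts` on a Linux host. The sketch below assumes the standard whitespace-separated format of that file (device, mount point, filesystem type, options); `read_only_mounts` is a name invented for this example.

```python
from pathlib import Path


def read_only_mounts(mounts_text):
    """Return (device, mount_point) pairs whose mount options include 'ro'."""
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip empty or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        if "ro" in options.split(","):
            result.append((device, mount_point))
    return result


if __name__ == "__main__":
    mounts = Path("/proc/mounts")
    if mounts.exists():
        for device, mount_point in read_only_mounts(mounts.read_text()):
            print(f"{device} is mounted read-only at {mount_point}")
```

Some mounts are read-only by design, so the one to watch is the data partition that backs `/config`: if it shows `ro` after having been `rw` at boot, the kernel most likely remounted it after hitting errors.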

pattyland · April 12, 2019, 7:10pm

I flashed the image onto my SSD with Etcher. Could this be a problem? Maybe the filesystem did not get resized or something? Or would the sensor then display the non-resized size?

That would explain the short time where it runs fine

firstof9 · April 12, 2019, 7:17pm

The disk is likely bad; you’ll need a new SD card or switch to an HDD/SSD or USB drive.

pattyland · April 12, 2019, 7:25pm

There is a slightly used 60GB SSD in my NUC that I checked with HDDScan before flashing. Do you think I should run another test?

firstof9 · April 12, 2019, 7:29pm

Honestly you’d need to check the kernel logs for what’s going on fully, and to do that you’ll need SSH access to the main OS or mount the disk in another instance of linux somewhere.
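
Once the kernel log is available (for example, `dmesg` output saved to a file), a small filter like the following could pull out the lines that usually matter. The patterns are assumptions based on typical ext4 and block-layer messages, not an exhaustive list, and `find_disk_errors` is a hypothetical helper.

```python
import re
import sys

# Assumed patterns: typical kernel messages for disk and filesystem trouble.
DISK_ERROR_PATTERNS = re.compile(
    r"I/O error|EXT4-fs error|Remounting filesystem read-only",
    re.IGNORECASE,
)


def find_disk_errors(log_text):
    """Return the log lines that match one of the assumed error patterns."""
    return [
        line for line in log_text.splitlines()
        if DISK_ERROR_PATTERNS.search(line)
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_kernel_log.py saved_dmesg.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for line in find_disk_errors(handle.read()):
            print(line)
```

The first matching line is usually the most useful one, since later errors tend to be consequences of the initial failure.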

ConcordGE · April 12, 2019, 7:50pm

It’s likely your SSD is fine; it’s just that you may have flashed a bad image. What else, if anything, is running on your NUC?