Home Assistant dying every day

I am running Hassio / Docker on a NUC. For the last two days Hassio has just stopped. I restart the container and it is up and running again. Version 0.91.3. Nothing in the logs.

@firstof9 I enabled the debugging SSH for HassOS and have access to dmesg now. I guess I will have to wait for errors; right now it’s looking fine: https://pastebin.com/AdU08pEL

@ConcordGE Better than buying a new SSD! There is nothing running besides HassOS; I simply flashed the image from https://www.home-assistant.io/hassio/installation/
Since HA runs fine for hours, could the image really be that bad?

Keep watching that log, you’ll likely see a filesystem error and it’ll say something about re-mounting the root file system.

That wasn’t as easy as I thought, as even the debugging SSH session gets killed after a short time. But I learned how to record SSH sessions and got the kernel messages: https://pastebin.com/NqTDdSMT
It gets interesting below 3504.054370

I think I/O error, dev sda, sector X looks like a faulty SSD, right? Pretty strange that neither a S.M.A.R.T. test nor a read/write test saw that…
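For anyone who wants to pull such lines out of a recorded session, here is a minimal sketch. It assumes the log contains lines of the form "I/O error, dev sda, sector X" as quoted above; the function name and the sample lines (including their timestamps and sector numbers) are made up for illustration, and the exact prefix in front of "I/O error" varies between kernel versions.

```python
import re

# Matches the "I/O error, dev sda, sector X" pattern quoted in the post.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def find_io_errors(log_text):
    """Return a (device, sector) pair for every block I/O error line."""
    return [(m.group(1), int(m.group(2))) for m in IO_ERROR.finditer(log_text)]


# Hypothetical sample lines, NOT taken from the pastebin.
sample = """\
[    1.000000] I/O error, dev sda, sector 1000
[    2.000000] some unrelated kernel message
[    3.000000] I/O error, dev sda, sector 2000
"""

print(find_io_errors(sample))
```

Errors clustered on one device, as here, point at that drive (or its cable or controller) rather than at the flashed image.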

Here’s the start of your shit show:

[ 3627.974324] exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
[ 3627.981067] failed command: WRITE FPDMA QUEUED
[ 3627.984581] cmd 61/48:00:80:37:26/00:00:00:00:00/40 tag 0 ncq dma 36864 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 3627.991430]  status: { DRDY }
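The "cmd" line can be decoded to see what the drive was asked to do. A rough sketch, assuming the field order the kernel's libata error handler prints (command/feature:nsect:lbal:lbam:lbah, then the high-order bytes, then the device register) and 512-byte logical sectors; the function name is made up for illustration:

```python
import re


def decode_ncq_cmd(line):
    """Decode the 'cmd' line libata prints for a failed NCQ read or write."""
    fields = re.search(r"cmd ([0-9a-f/:]+)", line).group(1)
    regs = [int(x, 16) for x in re.split(r"[/:]", fields)]
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, device) = regs
    # For FPDMA QUEUED commands the sector count sits in the feature registers.
    sectors = (hob_feature << 8) | feature
    # The 48-bit LBA is split over the low and high-order LBA registers.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return {"command": command, "sectors": sectors,
            "bytes": sectors * 512, "lba": lba}


info = decode_ncq_cmd(
    "cmd 61/48:00:80:37:26/00:00:00:00:00/40 tag 0 ncq dma 36864 out")
print(hex(info["command"]), info["sectors"], info["bytes"], hex(info["lba"]))
```

Under those assumptions this is a 72-sector write (72 × 512 = 36864 bytes, matching the "dma 36864 out" in the log) starting at LBA 0x263780, and it timed out rather than returning a media error.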

Are you able to execute smartctl -a /dev/sda and post the output?

There is no smartctl in HassOS, but I booted the NUC from an Ubuntu live stick and ran smartctl there. It doesn’t look that bad, apart from the strange warnings at the beginning: https://pastebin.com/EgD5aSYD

I seem to have firmware 0009, and according to the blog linked from the smartctl output, all versions below 0309 are faulty and cause the drive to become unresponsive? I’m pretty close to those 5184 hours.

Edit: According to this document, 0009 is exactly one version below 0309. Damn, now I need a Windows PC for the Crucial update tool.
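The check described here can be written down as a small sketch. It takes the two figures from the posts above at face value (revisions below 0309 are affected, trouble starts around 5184 power-on hours) and assumes that the four-character revision strings sort in release order; the constant and function names are made up for illustration.

```python
FIXED_REVISION = "0309"  # per the post above: everything below this is faulty
TROUBLE_HOURS = 5184     # power-on hours figure quoted in the post above


def is_affected(revision):
    # Assumption: revisions are fixed-width strings that sort in release order.
    return revision < FIXED_REVISION


def hours_past_threshold(power_on_hours):
    """Positive once the drive has passed the quoted power-on hours figure."""
    return power_on_hours - TROUBLE_HOURS


print(is_affected("0009"))
# 5230 is the power-on lifetime reported in the SMART error log in this thread.
print(hours_past_threshold(5230))
```

With the values from this thread the drive counts as affected and is a couple of days of runtime past the threshold.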

Your smartlog there shows disk write errors.

Error 0 occurred at disk power-on lifetime: 5230 hours (217 days + 22 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 50 20 08 e3 46 e5   at LBA = 0x0546e308 = 88531720

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  ca 00 20 f8 e2 46 e5 00  19d+02:02:50.816  WRITE DMA
  ca 00 20 d8 e2 46 e5 00  19d+02:02:50.816  WRITE DMA
  ca 00 20 d8 e2 46 e5 00  19d+02:02:50.816  WRITE DMA
  ca 00 20 d8 e2 46 e5 00  19d+02:02:50.816  WRITE DMA
  ca 00 20 d8 e2 46 e5 00  19d+02:02:50.816  WRITE DMA
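The LBA in that error entry can be reproduced from the registers, as a sanity check on how to read the table. A minimal sketch, assuming 28-bit LBA addressing where the low nibble of DH carries the top four address bits; the function name is made up for illustration:

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine the SN, CL, CH and DH task-file registers into a 28-bit LBA."""
    # Only the low nibble of DH belongs to the address (bits 27..24).
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Registers after the failed command: SN=08 CL=e3 CH=46 DH=e5
lba = lba28_from_registers(0x08, 0xE3, 0x46, 0xE5)
print(hex(lba), lba)

# The power-on lifetime is reported in hours; split it into days and hours.
days, hours = divmod(5230, 24)
print(days, hours)
```

This gives 0x546e308 (88531720) and 217 days + 22 hours, matching the values smartctl printed.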

Just out of curiosity, what is the working capacity of your SSD?

@firstof9 So you think it cannot be caused by the faulty firmware? Since my NUC doesn’t run Windows and doesn’t have a CD drive, the firmware upgrade is getting complicated.

@ConcordGE I’m not sure what you mean by “working”. How much of the 60GB are usable?

Most of the NUC setups I come across are running Windows so I was guessing you had it partitioned.

I’m guessing at this stage it’s a disk sector issue, so maybe reformat the disk, try reflashing the image, and see what results you get.

If you use the ssh addon, try to leave it off…

Doubting a firmware issue.

You could try what ConcordGE suggested and reformat the drive so the bad sectors get tagged, then re-image the disk. I’ve run into issues previously where the Linux kernel wouldn’t handle SSDs correctly because TRIM support wasn’t there; I would assume HassOS has that baked into the kernel by now, though. I’m not sure what distro it’s based on.

@ConcordGE @firstof9 First of all, thanks for all the help. Is “reformatting” just the five seconds it takes to delete the partition table and create a new partition? HassOS is based on Buildroot, btw.

@gieljnssns I’ve been using the community SSH/Terminal add-on since it existed. Do you mean the official one?

Yes I mean the official one.
When I used that in combination with a generic Linux install, I had some strange issues…

Normally, when Linux is installed, the filesystem gets created and the disk is formatted at that point, routing around bad sectors. Since this is being imaged to the disk, I’m not sure how that will work.

If you’ve nothing that you want to retain on your SSD I’d do a low level format to ensure that it’s completely cleaned using something like this https://superuser.com/questions/203305/how-do-i-perform-a-low-level-format-of-a-sandforce-solid-state-disk/485949#485949

Did your NUC ever have Windows installed on that disk?

My advice: install a real OS like Debian Server or Ubuntu Server and install Hassio on top of it.

Hello everyone, thanks again for your help and tips. After reading the topic “crucial m4 BSOD” again, I decided to try the firmware update before reinstalling everything. Once I had borrowed a desktop computer with a CD burner, the update to 070H was no problem.
And lo and behold: the NUC and Home Assistant have been running without problems for over 24 hours now. I didn’t reformat or do anything else. Really crazy that such a serious error can be in the firmware!

Doesn’t sound too reliable. You should try running HA on a Raspberry Pi 3+

I did, and that was pretty slow, and my SD cards died fast. I was told several times that the combination of Raspberry Pi, SD card and Home Assistant is a bad one because Home Assistant puts a lot of strain on the SD card. And the Pi doesn’t manage the exciting addons like face recognition at all.