Intermittent filesystem errors / best practices?

prosodyspeaks · September 3, 2022, 4:52pm

Hi!

TL;DR:

how do i check the boot drive for hardware faults in linux? (preferably over ssh)
is it ok to manually edit HA / ESPHome / etc config files via ssh? are there limitations to this (eg must i stop HA before modifying its yamls)?
if manual editing is ok, how do i take and KEEP ownership (or at least write permissions) of the /opt folder where all my HA type stuff is located? or should i not? why?
is manually interfering with files / permissions, or something else a noob might do, likely to cause the filesystem to flip to read-only (RO)?
ultimately, what is best practice for manually editing HA files while the system is live?

i’m new to linux, docker and HA (aside from HAOS on a pi), so please bear with me!

i’m running HA via docker-compose on ubuntu-server headless, on an i7 NUC with nvme (boot) + sata (storage) drives.

Every day or two i become unable to issue commands to the NUC via SSH. the filesystem has turned read-only, or i get IO errors like this one:
/usr/bin/dmesg: Input/output error
a reboot (via the power button on the NUC, because sudo reboot fails) fixes all issues - they only occur after a day or so of operation.
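
for reference, a minimal sketch of how the read-only state could be confirmed over ssh while the shell still responds. it assumes the standard linux /proc/mounts layout (device, mount point, fs type, options); the helper name mount_is_ro is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# mount_is_ro MOUNTPOINT [MOUNTS_FILE]
# Hypothetical helper: exits 0 if MOUNTPOINT is listed with the "ro" option.
# MOUNTS_FILE defaults to /proc/mounts (assumed format: device mountpoint fstype options ...).
mount_is_ro() {
    mp="$1"
    file="${2:-/proc/mounts}"
    awk -v mp="$mp" '
        $2 == mp { opts = $4 }      # remember the options of the last matching entry
        END {
            n = split(opts, o, ",")
            for (i = 1; i <= n; i++) if (o[i] == "ro") exit 0
            exit 1                  # not found, or not mounted read-only
        }
    ' "$file"
}

if mount_is_ro /; then
    echo "/ is mounted read-only"
else
    echo "/ is mounted read-write (or could not be checked)"
fi
```

this only reports the current mount options; it says nothing about why the kernel remounted the filesystem.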

HomeAssistant automations, scenes, switches etc continue to function, but i can’t add or edit them (of course, because the underlying filesystem is RO).

searches suggest this can indicate potential / imminent catastrophic disk failure. (worth noting i have nothing i can’t afford to lose on the machine - /opt is backed up to gDrive via duplicati, which also runs under docker-compose.)

i found smartctl and ran it on the sata drive with no faults found, but this is hardly useful because it is the nvme drive i need to check. trying smartctl on the nvme drive yields:

[user]@[server]:/opt$ smartctl -a /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-47-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/nvme0n1 failed: Permission denied

firstly, how do i check the boot drive for errors? my win10 desktop only has one m.2 slot and it’s the boot drive, so i can’t put it in there to test.

i tried to follow this guide and received this error:

/lib/recovery-mode/recovery-menu: line 80: /etc/default/rcs: No such file or directory 
fsck from util-linux 2.37.2 
/dev/mapper/ubuntu--vg-ubuntu--lv is mounted. 
e2fsck: Cannot continue, aborting.

Finished, please press ENTER

from system info in linux recovery i see:

===LVM state ===
Physical Volumes: not ok (BAD)
Volume Groups: ok (good)

i understand that so far this is generic linux support rather than HA specific. however, the NUC is only running home-automation stuff, and i am aware that i’m manually doing things i don’t understand, like changing file ownership* to edit files while HA is running. i don’t know if this is relevant. in any case i’m sure there are best practices, and plenty of pitfalls, that i am unaware of.

*all my HA / ESPHome / etc stuff is in /opt (at the filesystem root), which is owned by the only user. sometimes when i try to edit files i get permissions errors, so i run ssh [user]@[ip] chown [user]:[user] /opt to take ownership and make my edits.
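
a minimal sketch of how the ownership situation under /opt could be inspected before changing anything. chown without -R only changes the directory itself, not the files inside it, so listing what my user does not own may show where the permission errors come from. the helper name list_not_owned is made up for illustration.

```shell
#!/bin/sh
# list_not_owned DIR [USER]
# Hypothetical helper: prints every path under DIR that is NOT owned by USER.
# USER defaults to the user running the script.
list_not_owned() {
    dir="$1"
    user="${2:-$(id -un)}"
    # errors (e.g. unreadable subdirectories) are silenced to keep the output readable
    find "$dir" ! -user "$user" 2>/dev/null
}

# Example: show the first 20 paths under /opt that the current user does not own
list_not_owned /opt | head -n 20
```

this only reads metadata, so it should be safe to run while the containers are up.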

so aside from disk-checking a boot drive, i’d like any advice there is about best practices for editing files on a live HA/docker system, or any insights into what could cause my issues other than a hardware fault on the nvme drive.

fwiw i bought the NUC second-hand quite recently. it has a warranty, so if i can establish a hardware fault i can likely get the drive replaced, although i’m concerned they may wish to replace the whole NUC, which would be a shame because i paid for a grade ‘B’ i5 and received a (cosmetically) immaculate i7.

thanks!

not sure what info would help, but here’s my:
docker-compose.yaml:
HA.config.yaml
dmesg output