Hi!
TL;DR:
- how do i check the bootdrive for hardware fault in linux? (pref by ssh)
- is it ok to manually edit HA / ESPHome / etc config files via ssh? are there limitations to this (eg must i stop HA before modifying its yamls)?
- if manual editing is ok, how do i take and KEEP ownership (or at least write permissions) to the /opt folder where all my HA type stuff is located? or should i not? why?
- is manually interfering with files / permissions, or something else a noob might do, likely to cause the OS to flip into RO?
- ultimately, what is best practice for manually editing HA files while the system is live?
i’m new to linux, docker and HA (aside from HAOS on a pi), so please bear with me!
i’m running HA via docker-compose on ubuntu-server headless, on an i7 NUC with nvme (boot) + sata (storage) drives.
Every day or two i am unable to issue commands to the NUC via SSH. the filesystem has turned read-only, or i get IO errors like this one:
/usr/bin/dmesg: Input/output error
a reboot (via the power button on the NUC, because sudo reboot fails) fixes all issues - they only occur after a day or so of operation.
HomeAssistant automations, scenes, switches etc continue to function but i can’t add or edit them (of course because the underlying filesystem is RO)
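A quick way to confirm the read-only state over ssh, while the session still responds, is to look at the mount options of the root filesystem. This is only a sketch: `mount_mode` is a made-up helper name, and it assumes a mounts table in the usual /proc/mounts layout (device, mount point, type, options).

```shell
#!/bin/sh
# Print "ro" or "rw" for a mount point, reading a table in /proc/mounts format.
# Usage: mount_mode <mountpoint> [mounts-file]   (mounts-file defaults to /proc/mounts)
mount_mode() {
    awk -v mp="$1" '
        $2 == mp {                      # field 2 is the mount point
            n = split($4, opts, ",")    # field 4 is the comma-separated option list
            mode = "rw"
            for (i = 1; i <= n; i++) if (opts[i] == "ro") mode = "ro"
        }
        END { if (mode != "") print mode }   # prints nothing if the mount point is absent
    ' "${2:-/proc/mounts}"
}

# Example: mount_mode /
```

An option such as errors=remount-ro does not count as "ro" here; it only says what the kernel should do when it hits a filesystem error, which would match the behaviour described above.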
searches show this indicates potential / imminent catastrophic hard-disk failure. (worth noting i have nothing i can’t afford to lose on the machine - /opt is backed up to gDrive via duplicati, which also runs via docker-compose)
i found smartctl and ran it on the sata drive with no faults found, but this is hardly useful because it is the nvme drive i need to check. trying smartctl on the nvme drive yields:
[user]@[server]:/opt$ smartctl -a /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-47-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
Smartctl open device: /dev/nvme0n1 failed: Permission denied
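The "Permission denied" above is most likely just smartctl being run without root, since it needs root to open the device. A minimal sketch of re-running it with sudo and keeping only the health fields that matter for a failing drive; `nvme_health_summary` is a made-up helper name, and the field labels are the ones smartctl normally prints for NVMe devices.

```shell
#!/bin/sh
# Filter `smartctl -a` output (read from stdin) down to the NVMe health fields.
nvme_health_summary() {
    grep -E '^(SMART overall-health|Critical Warning:|Percentage Used:|Media and Data Integrity Errors:|Error Information Log Entries:)'
}

# Intended use (needs root):
#   sudo smartctl -a /dev/nvme0n1 | nvme_health_summary
```

A non-zero Critical Warning or a growing count of Media and Data Integrity Errors would be the kind of evidence worth showing to the seller for a warranty claim.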
firstly, how do i check the boot drive for errors? my win10 desktop only has one m.2 slot and it’s the boot drive, so i can’t put it in there to test.
i tried to follow this guide and received this error:
/lib/recovery-mode/recovery-menu: line 80: /etc/default/rcs: No such file or directory
fsck from util-linux 2.37.2
/dev/mapper/ubuntu--vg-ubuntu--lv is mounted.
e2fsck: Cannot continue, aborting.
Finished, please press ENTER
from system info in linux recovery i see:
===LVM state ===
Physical Volumes: not ok (BAD)
Volume Groups: ok (good)
i understand that so far this is generic linux support rather than HA-specific. however, the NUC is only running home-automation stuff, and i am aware that i’m manually doing things i don’t understand, like changing file ownerships* to edit files while HA is running, and i don’t know if this is relevant. in any case i’m sure there are best practices, and plenty of pitfalls, that i am unaware of.
*all my HA / ESPHome / etc stuff is at /opt, which is owned by the only user. sometimes when i try to edit files i get permissions errors, so i run ssh [user]@[ip] chown [user]:[user] /opt to take ownership and make my edits.
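Worth noting: chown without -R only changes the /opt directory itself, not the files inside it, so the permission errors can come back. One possible cause (an assumption, not something confirmed here) is that containers running as root create new files owned by root. A small sketch for seeing which files are affected before changing anything; `not_mine` is a made-up helper name.

```shell
#!/bin/sh
# List everything under a directory that is NOT owned by the current user.
# Usage: not_mine <dir>
not_mine() {
    find "$1" ! -uid "$(id -u)" -print
}

# Example: not_mine /opt
```

If this prints a lot of files owned by root, that would point at the containers rather than at anything done by hand over ssh.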
so aside from disk-checking a boot drive, i’d like any advice there is about best practices for editing files on a live HA/docker system, or any insights into what could cause my issues other than hardware fault on the nvme drive.
fwiw i bought the NUC second-hand quite recently. it has a warranty so if i can establish hardware fault i can likely get the drive replaced, although i’m concerned they may wish to replace the whole NUC which would be a shame because i paid for a grade ‘B’ i5 and received a (cosmetically) immaculate i7
thanks!
not sure what info would help, but here’s my:
docker-compose.yaml:
HA.config.yaml
dmesg output