Yes, power saving is not enabled in my host (/sys/module/usbcore/parameters/autosuspend contains a -1). I guess if that was the issue, anyway, I would see the controller not working after some time, rather than issues at any time, when I try to stop the addon.
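For anyone repeating this check, here is a minimal sketch of reading that parameter programmatically. It assumes the usual Linux behaviour where a negative value of the usbcore `autosuspend` parameter means autosuspend is off by default for newly detected devices; the function name is only illustrative, and individual devices can still override the default through their own power settings.

```python
from pathlib import Path

# Path taken from the post above; the parameter holds a delay in seconds.
AUTOSUSPEND_PATH = Path("/sys/module/usbcore/parameters/autosuspend")


def usb_autosuspend_disabled(path: Path = AUTOSUSPEND_PATH) -> bool:
    """Return True when the default autosuspend delay is negative (autosuspend off)."""
    value = int(path.read_text().strip())
    return value < 0


if __name__ == "__main__":
    if AUTOSUSPEND_PATH.exists():
        state = "disabled" if usb_autosuspend_disabled() else "enabled"
        print(f"USB autosuspend is {state} by default")
    else:
        print("usbcore autosuspend parameter not found on this system")
```

Passing the path as an argument keeps the check easy to try against a copy of the file from another machine.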
And by crashing the system I indeed mean the HAOS in the VM Host (i.e. the whole VM freezes and I need to kill the process)
I don’t agree. The whole point of virtualization is sharing resources. If memory or disk is bad, it shows up all over the place. Just check SMART on the drive and run a RAM check from a live CD/USB.
The main point of troubleshooting is to cross off possible troublemakers until you find the culprit.
My BIOS has support for memtest and I ran it. The RAM is fine. I doubt a RAM issue would only affect one thing, and always the same thing, because of how RAM gets allocated… but it was still worth a try, as you pointed out.
I’ve also run SMART checks on the physical volume behind my file system and it’s healthy, with no errors logged. I have a virtual volume created on top (LVM gives me a lot of flexibility). Again, since the virtual HDs (images) of the VMs are fully allocated in their own files, I think a hard drive issue would not affect two different virtual machines equally, but it was worth a try.
It’s worth noting that this server runs a lot of other things, not only Home Assistant, so I keep it pretty healthy and I think I would have noticed HD or RAM issues.
Follow-up: Looks like it’s not “as easy as it seems”. My system seems to have slowed down a lot since I installed qemu; I’m not sure if that’s down to qemu itself or to something I did wrong (I’m suspicious about how I’m bridging the network). Also, my Home Assistant OS is stuck loading at “A start job is running for HAOS swap”. When it gets there, the VM “pauses” due to an I/O error… so I’m not sure how to proceed here.
I think I’m going to put this on hold until I get my new ZStick and upgrade to the newest firmware there, or even to the 700 series… and if that doesn’t work I’ll come back to trying a different emulator.
I don’t have that many options for this Hard Drive, unfortunately. I just checked that there were no errors logged, no critical errors and that health assessment passed.
I also did a fsck on the file system… that took longer and I don’t have the output, but the SMART output is the following:
root@madre:~# smartctl -l error /dev/nvme0n1p3
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-28-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
root@madre:~# smartctl -H /dev/nvme0n1p3
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-28-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
root@madre:~# smartctl -a /dev/nvme0n1p3
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-28-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: WDC PC SN530 SDBPNPZ-256G-1002
Serial Number: 2135FU474613
Firmware Version: 21106000
PCI Vendor/Subsystem ID: 0x15b7
IEEE OUI Identifier: 0x001b44
Total NVM Capacity: 256.060.514.304 [256 GB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 256.060.514.304 [256 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 001b44 4a49595984
Local Time is: Tue Aug 13 09:40:07 2024 CEST
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x1e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size: 128 Pages
Warning Comp. Temp. Threshold: 80 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Namespace 1 Features (0x02): NA_Fields
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 3.50W 1.80W - 0 0 0 0 0 0
1 + 2.40W 1.60W - 0 0 0 0 0 0
2 + 1.90W 1.50W - 0 0 0 0 0 0
3 - 0.0250W - - 3 3 3 3 3900 11000
4 - 0.0050W - - 4 4 4 4 5000 39000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 2
1 - 4096 0 1
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 41 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 4%
Data Units Read: 3.807.215 [1,94 TB]
Data Units Written: 13.910.073 [7,12 TB]
Host Read Commands: 25.010.185
Host Write Commands: 416.061.161
Controller Busy Time: 2.025
Power Cycles: 111
Power On Hours: 6.408
Unsafe Shutdowns: 63
Media and Data Integrity Errors: 0
Error Information Log Entries: 1
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged
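To make the interesting counters easier to spot in a report like the one above, here is a rough sketch that pulls a few fields out of the text output of `smartctl -a`. It is not part of smartmontools; the function name and the choice of fields are my own, and it assumes the `Label: value` layout shown above. Recent smartctl versions also offer a `--json` option, which is likely a sturdier thing to parse than plain text.

```python
import re


def parse_nvme_health(report: str) -> dict:
    """Extract a few health fields from `smartctl -a` text output for an NVMe drive."""
    fields = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    def as_int(name: str) -> int:
        # Digits are grouped per locale ("6.408" or "6,408"), so keep digits only.
        raw = fields[name].split()[0]
        return int(re.sub(r"\D", "", raw))

    return {
        "passed": fields.get("SMART overall-health self-assessment test result") == "PASSED",
        "critical_warning": int(fields["Critical Warning"], 16),
        "media_errors": as_int("Media and Data Integrity Errors"),
        "error_log_entries": as_int("Error Information Log Entries"),
        "unsafe_shutdowns": as_int("Unsafe Shutdowns"),
        "power_on_hours": as_int("Power On Hours"),
    }


# A few lines copied from the report above.
SAMPLE = """\
SMART overall-health self-assessment test result: PASSED
Critical Warning: 0x00
Power On Hours: 6.408
Unsafe Shutdowns: 63
Media and Data Integrity Errors: 0
Error Information Log Entries: 1
"""

print(parse_nvme_health(SAMPLE))
```

On the report above this gives a passed health check, zero media errors, one error log entry and 63 unsafe shutdowns over 6408 power-on hours.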
I’m running Home Assistant OS 12.4 on top of a VirtualBox VM. My original HAOS image was way earlier (HAOS 7.2) and it’s been getting updated in-place. But I also tried with a fresh 12.4 HAOS image (the one available right now in the HA website) and the issue was exactly the same.
Yup. Basically I would ssh into Home Assistant OS (using the Advanced SSH addon) and from there use docker exec to run bash inside the container, so I would effectively be ssh-ing into Home Assistant itself. Would I find anything there?
Is there another docker container specific to the Zwave JS UI Addon, that I may be able to ssh into and get additional data?
Every addon is a container, yes. But you can see the container logs under Settings/System/Logs, just switch between containers with the dropdown in the top right corner.
I finally got my Zstick Gen5+ and I have migrated from Gen5 to Gen5+, i.e. to firmware 5.2.
After that, I still get the main issue (whenever I try to stop ZWave JS UI, the whole HAOS crashes), but I was able to update ZWave JS UI to the latest version (by setting it not to start on boot, rebooting, and then updating before starting it) and I’m no longer getting the old error about an invalid discovery/type in the s6-overlay (whatever that means).
So at least I got partial success.
Once I was on 5.2 I tried plugging the USB stick directly into my server (I had it connected through a hub because, for some reason, I was having trouble with the Gen5 when plugging it in directly). I had hoped this might fix the HAOS freeze when stopping ZWave JS UI, but it did not help.
Next step is trying to migrate to Gen7 and see if that changes anything.
BTW, something I may never have mentioned: if I unplug the USB stick before stopping the ZWave JS UI addon, it does not crash. So it seems to be related to how ZWave JS UI tries to disconnect from the stick.
It is very unlikely that migrating to a 700 series stick will work any better. At this point, it’s pretty certain the issue lies somewhere with your VM setup. Have you tried setting up a new HAOS VM in QEMU and restoring the HA backup into that?
I did try but did not succeed; I could not get HAOS running on QEMU on Debian for some reason. I think it had to do with how I was creating the network bridge.
I did try a fresh HAOS image on VirtualBox and, even without restoring the backup, I had the same issue.
I think I just solved it by enabling USB 3.0 in VirtualBox (I had it set to USB 2.0 before). I’m pretty sure in the past I had to set it to USB 2.0 for some reason (maybe related to Aeotec ZStick Gen5 vs Gen5+?)
So, to summarise in case someone ever gets here:
Error 1 - After updating ZWave JS UI, the addon will not start
I was getting the following error
s6-rc-compile: fatal: invalid /etc/s6-overlay/s6-rc.d/discovery/type: must be oneshot, longrun, or bundle
s6-rc: fatal: unable to take locks: No such file or directory
s6-linux-init-shutdownd: warning: /run/s6/basedir/scripts/rc.shutdown exited 111
I sorted it out by
Updating Aeotec ZWave Stick to Gen5+ / Firmware 1.2. (Not sure if that may have been solved by some of the Home Assistant Core updates, but I don’t think so)
Error 2 - HAOS crash when trying to stop ZWave JS UI / disconnect from the ZStick
Whenever I tried to stop the ZWave JS UI addon (including any attempt to reboot the system or to update the addon), the whole HAOS would crash, with no errors in any of the logs.
I sorted it out by
Upgrading Aeotec ZStick Gen 5 to Gen5+ (firmware 1.1 to firmware 1.2)
Setting up the USB controller as USB 3.0 on VirtualBox (I had USB 2.0 before, which I think was a requirement at some point, maybe only for firmware 1.1)
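For anyone who prefers the command line to the VirtualBox GUI, the USB controller change can probably be scripted along these lines. This is a sketch under assumptions: `haos` is a hypothetical VM name, `--usbxhci` is the `VBoxManage modifyvm` option as documented for VirtualBox 6.x (newer releases spell it `--usb-xhci`, so check `VBoxManage modifyvm` help on your version), and the VM has to be powered off before the setting can be changed.

```python
import shlex
from typing import List


def xhci_command(vm_name: str, enable: bool = True) -> List[str]:
    """Build the VBoxManage call that toggles the virtual USB 3.0 (xHCI) controller."""
    # Assumption: VirtualBox 6.x option name; adjust if your version differs.
    return ["VBoxManage", "modifyvm", vm_name, "--usbxhci", "on" if enable else "off"]


cmd = xhci_command("haos")  # "haos" is a placeholder VM name
print(shlex.join(cmd))  # prints: VBoxManage modifyvm haos --usbxhci on

# To actually apply it (with the VM powered off), something like:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

The command is only built and printed here, so it can be reviewed before it touches a real VM.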
Hope it helps someone in the future, and thanks everyone for your ideas and suggestions, I don’t think I would have been able to debug without you (I learned a lot in this process).
You may have done it to avoid USB 3 interference on the stick. (USB 3 is known to interfere with Zigbee and some ZWave sticks.) This is where the recommendation to use a short extension or a USB 2 hub to get the stick off the USB 3 bus comes from.