Not sure where to start with this - been all over the place in my head but keep coming back to something in HA. Sorry for the long post but if any of you have the time to read I’d be eternally grateful - I think this one is beyond my ability right now. I also know that due to my setup (virtualised and not on a Pi) that I might get folks blaming that, and it could be the culprit, but I don’t see it somehow.
I have 13 ESPHome devices (mostly Sonoff Basics/Minis plus an RF bridge and an ESP32 TTGO T-Call) and 4 TP-Link Kasa devices connected to my HA via a BT HH6 WiFi AP, plus an RFLink connected via USB and a Conbee stick also connected via USB with approx. 15 Zigbee nodes. The Zigbee side is one two-gang wall switch and a Xiaomi temp/hum sensor in each room. I run HA virtualised (latest version) on ESXi with decent-spec desktop hardware and a certified NIC, with 2GB of memory assigned. I do this so I can scale up the spec easily and do checkpoints & backups using systems I’m familiar with. All my WiFi devices (Kasa and ESPHome) are given static leases by my DHCP server via MAC binding and they’re manually added to HA via the integrations using their static IPs to prevent dropouts.
All in all the setup has been rock solid for months.
Nothing had changed in my setup for a good few weeks and everything was fine, then a couple of weeks ago I started getting an ESPHome device (Sonoff Mini switching a lamp) dropping off HA and going unavailable. Most of the time it was only for a few secs/mins, but it was happening frequently enough to cause problems, with the device not being available when automations needed to fire it. There was no real pattern to it, i.e. it didn’t seem to correlate with times the network was busy streaming etc.
To start with I put it down to a Sonoff gone bad, BUT I have a Shelly 1 that also showed dropouts in the logbook, just not as frequently. This is weird as not a single one of my devices has shown any dropout whatsoever for a long time and then all of a sudden I get two at once. What makes it even stranger is that both these nodes seemed to be dropping out at similar times during the day/night, and those were times when the network was quiet (5-7am, 9-11pm mainly), so not a lot of traffic flowing. They’re also in the same room. The room isn’t far from the router and I have devices much further away (detached garage up the drive) that are 100% reliable.
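One way to test the "similar times" hunch properly would be to bucket the dropout timestamps by hour of day. A minimal Python sketch, assuming the timestamps have been copied out of the logbook by hand; the function name and the sample values are made up for illustration and are not my real data:

```python
from collections import Counter
from datetime import datetime


def dropouts_by_hour(timestamps):
    """Count dropout events per hour of day (0-23) from ISO-format timestamp strings."""
    hours = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    return dict(sorted(hours.items()))


# Made-up sample values for illustration only, not real logbook data.
sample = [
    "2020-05-01T05:12:00",
    "2020-05-01T06:40:00",
    "2020-05-01T21:05:00",
    "2020-05-02T05:55:00",
    "2020-05-02T22:30:00",
]

# Print a crude text histogram, one row per hour that saw a dropout.
for hour, count in dropouts_by_hour(sample).items():
    print(f"{hour:02d}:00  {'#' * count}")
```

If the real data clusters as tightly as it looks by eye, that would point at something scheduled (a neighbour’s kit, a router task) rather than random RF noise.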
Some folks may have seen my other thread on my alarm system which I’ve been working on over time. As part of it I expose the two PIRs of my alarm system to HA via an ESP32 with ESPHome so I can use them for other things but don’t currently have them included in any automations. I have noticed however that if I try to view the history section on the frontend or even if I just look at the state history of one of the PIRs the frontend freezes for up to a minute whilst it tries to process and graph all the on/offs. This makes me wonder if the constant on/offs of the PIRs with the kids running around during Covid are causing some sort of IO bottleneck. That’s the only sluggishness I’ve seen anywhere on the system though.
Whilst the PIR state history lag and the two ESPHome nodes dropping out were on my list to sort, they weren’t the highest priority, so last night I cracked on with setting up the HA side of my alarm system, adding a manual panel and doing a few reboots to see if it would restore the armed state. It didn’t, so I added recorder to my config and rebooted, and then it worked. Being uber-cautious I decided to reboot one more time and things got really interesting. Before I started making changes I did a snapshot in ESXi in case anything went wrong, as I always do.
On boot I lost everything. I mean everything. Almost. Not a single ESPHome or Kasa device would show as connected in the addon or in HA (all unavailable, with the frontend suggesting I remove them), and all my Deconz entities were dead. The only thing still working was the USB connected RFLink, which was still receiving data from my weather station sensor. The odd ESPHome device would show up for a second or two then disappear again, but all in all they were completely unusable. It was like HA/the OS/whatever couldn’t handle all the connections. The UI was still fast and I could browse around without issue or lag. I spent hours trying further reboots (HASSOS, just HA and the full hypervisor) to no avail. I also tried restoring checkpoints/backups of configs and HA versions from weeks ago, plus the one I’d made immediately before starting, all of which I knew were good - nothing worked. At first I thought it was ESPHome but then noticed none of my Zigbee or Kasa devices worked either. The odd thing there was that the Deconz plugin itself still worked - it saw my Conbee stick and let me log in but wouldn’t allow me to control anything or receive data from sensors. RFLink via USB still functioned fine as well. Logs show nothing untoward for HA, HASSOS or the addons.
It got very late and I ended up going to bed, then checked the state history this morning and everything is back, albeit there are still dropouts, the original Sonoff and Shelly being the worst. Nothing was working with Deconz but a reboot sorted that. Looking at the state history for the ESPHome/Kasa devices, they stayed unavailable for 6.5 hours and then eventually started coming back to life at around 6am. Connections are now very unstable but are just about usable.
My initial thoughts are:
- Kids at home for Covid, PIRs going mad, lots of state history = IO bottleneck somewhere. Shouldn’t be. SSD storage. Dying SSD? Other VMs suffering no performance issues though.
- CPU? Nope. Runs at an average of 95MHz single core. Basically idle.
- RAM full/leaking? VM’s got 2GB. Possibly. Nothing in the logs, going to check via HA sensor. Wouldn’t it just page though? If so, SSD storage so surely no issue, unless SSD failing. Also, wouldn’t it affect USB connected devices as well?!
- Wireless AP? No known issues with laptops/phones BUT they’re on the 5GHz link and IOT stuff is all on 2.4GHz. Possibly. Could be interference or dodgy AP. Thing is though, I also lost all Deconz devices. Conbee is connected via USB which explains why I could get to the interface, but doesn’t it communicate via ‘virtual’ network to HA when USB connected? That rules out access point/physical network and makes it either a purely HA API/HA network issue or severe interference on 2.4GHz knocking out all the wifi devices and the Zigbee devices, all at once and at the same time as my reboot. Can’t see it somehow and all devices stayed completely off for 6.5 hours. Too much of a coincidence. Definitely something HA related. Takes me back to the bottleneck theory but the PIRs weren’t seeing anything, it was night.
- HA related issue. How is that possible when restoring snapshots from good configs on previous versions?! This has to be hypervisor/hardware. I’m back to RAM/SSD.
- All the while the USB connected RFLink performed impeccably throughout the saga. If it was Disk/RAM IO, wouldn’t I have lost that as well?!
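To separate "network is broken" from "HA is broken", the devices could be probed directly from a laptop on the same LAN, bypassing HA entirely. A rough Python sketch of that idea; the device names and IP addresses are hypothetical placeholders, 6053 is the ESPHome native API default port, and 9999 for the Kasa devices is an assumption on my part:

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, unreachable hosts and timeouts
        return False


def check_all(devices):
    """Probe every (host, port) pair and return {name: reachable} as a dict."""
    return {name: port_reachable(host, port) for name, (host, port) in devices.items()}


# Hypothetical names and addresses; substitute the real static leases.
DEVICES = {
    "sonoff-mini-lamp": ("192.168.1.50", 6053),  # ESPHome native API default port
    "kasa-plug": ("192.168.1.60", 9999),         # assumed Kasa local control port
}
```

Running `check_all(DEVICES)` during an outage would show whether the nodes still answer on the network. If they do while HA shows them unavailable, the problem sits with the VM or HA rather than the AP.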
How can I clean up my install? It’s been update on top of update since around 0.90, surely this has to get messy. Can I get rid of old data, clean up junk/temp files and carry out some general maintenance to get it running as slick as possible? Not sure where to go next. Don’t fancy clean install. I don’t see my setup as overly extravagant/busy, surely I’ve not worn out/used up the SSD writes? It’s not that old.
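On the "lots of state history" theory, one check would be to count rows per entity in the recorder database to see whether the PIRs really dominate it. A minimal Python sketch, assuming the default SQLite recorder database (normally `home-assistant_v2.db` in the config folder) and assuming its `states` table has an `entity_id` column; the schema may differ between HA versions, and this should only ever be run against a copy of the file:

```python
import sqlite3


def busiest_entities(db_path, limit=10):
    """Return (entity_id, row_count) pairs for the entities with the most state rows."""
    conn = sqlite3.connect(db_path)
    try:
        # Assumes a `states` table with an `entity_id` column (schema is an assumption).
        return conn.execute(
            "SELECT entity_id, COUNT(*) AS n FROM states "
            "GROUP BY entity_id ORDER BY n DESC LIMIT ?",
            (limit,),
        ).fetchall()
    finally:
        conn.close()
```

For example, `busiest_entities("home-assistant_v2_copy.db")` (a hypothetical file name for the copy) would list the ten noisiest entities. If the PIRs top that list by a wide margin, excluding them from recorder would be the obvious first bit of housekeeping.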
Thanks and sorry again for the long post. Real head scratcher.