Whole system gone haywire

mrrodge · April 21, 2020, 9:42am

Not sure where to start with this - been all over the place in my head but keep coming back to something in HA. Sorry for the long post but if any of you have the time to read I’d be eternally grateful - I think this one is beyond my ability right now. I also know that due to my setup (virtualised and not on a Pi) that I might get folks blaming that, and it could be the culprit, but I don’t see it somehow.

I have 13 ESPHome devices (mostly Sonoff Basics/Minis plus an RF bridge and an ESP32 TTGO T-Call) and 4 TP-Link Kasa devices connected to my HA via a BT HH6 Wifi AP, plus an RFlink connected via USB and a Conbee stick also connected via USB with approx. 15 ZigBee nodes. Zigbee is one two-gang wall switch and a Xiaomi temp/hum sensor in each room. I run HA virtualised (latest version) on ESXi with decent spec desktop hardware and a certified NIC. 2GB assigned memory. I do this so I can scale up the spec easily and I can do checkpoints & backups using systems I’m familiar with. All my wifi devices (Kasa and ESPHome) are given static leases by my DHCP server via MAC binding and they’re manually added to HA via the integrations and their static IPs to prevent dropouts.

All in all the setup has been rock solid for months.

Nothing had changed in my setup for a good few weeks and everything was fine, then a couple of weeks ago I started getting an ESPHome device (Sonoff Mini switching a lamp) dropping off HA and going unavailable. Most of the time it was only for a few secs/mins, but it was happening frequently enough to cause problems, with automations unable to fire it when they needed to etc. There was no real pattern to it, i.e. it didn’t seem to correlate with times the network was busy streaming etc.

To start with I put it down to a Sonoff gone bad, BUT - I have a Shelly 1 that also showed dropouts in the logbook but not as frequently. This is weird as not a single one of my devices has shown any dropout whatsoever for a long time and then all of a sudden I get two at once. What makes it even stranger is that both these nodes seemed to be dropping out at similar times during the day/night and the times were when the network was quiet (5-7am, 9-11pm mainly), so not a lot of flowing traffic. They’re also in the same room. The room isn’t far from the router and I have devices much further away (detached garage up the drive) that are 100% reliable.

Some folks may have seen my other thread on my alarm system which I’ve been working on over time. As part of it I expose the two PIRs of my alarm system to HA via an ESP32 with ESPHome so I can use them for other things but don’t currently have them included in any automations. I have noticed however that if I try to view the history section on the frontend or even if I just look at the state history of one of the PIRs the frontend freezes for up to a minute whilst it tries to process and graph all the on/offs. This makes me wonder if the constant on/offs of the PIRs with the kids running around during Covid are causing some sort of IO bottleneck. That’s the only sluggishness I’ve seen anywhere on the system though.

Whilst the PIR state history lag and the two ESPHome nodes dropping out were on my list to sort they weren’t the highest priority though so last night I cracked on setting up the HA side of my alarm system, adding a manual panel and doing a few reboots to see if it would restore armed state. It didn’t so I added recorder to my config and rebooted, then it worked. Being uber-cautious I decided to reboot one more time and things got really interesting. Before I started making changes I did a snapshot in ESXi in case anything went wrong, as I always do.

On boot I lost everything. I mean everything. Almost. Not a single ESPHome or Kasa device would show as connected (all unavailable and frontend suggesting I remove them) in the addon or in HA and all my Deconz entities were dead. The only thing still working was the USB connected RFLink and that was still receiving data from my weather station sensor. The odd ESPHome device would show up for a split second or two then disappear again but all in all they were completely unusable. It was like HA/the OS/whatever couldn’t handle all the connections. The UI was still fast and I could browse around without issue or lag. I spent hours trying further reboots (HASSOS, just HA and full hypervisor) to no avail and tried restoring checkpoints/backups to configs and HA versions I had weeks ago and that I’d made immediately before starting and that I knew were good - nothing worked. At first I thought it was ESPHome but then noticed none of my Zigbee or Kasa devices worked either. Odd thing with that though was that the Deconz plugin still worked - it saw my Conbee stick and let me log in but wouldn’t allow me to control anything or receive data from sensors. RFlink via USB still functioned fine as well. Logs show nothing untoward for HA, HASSOS or the addons.

It got very late and I ended up going to bed, then checked the state history this morning and everything is back, albeit there are still dropouts, the original Sonoff and Shelly being the worst. Nothing working with Deconz but a reboot sorted that. Looking at the state history for the ESPHome/Kasa devices, they stayed unavailable for 6.5 hours and then eventually started coming back to life at around 6am. Connections are now very unstable but are just about usable.

My initial thoughts are:

Kids at home for Covid, PIRs going mad, lots of state history = IO bottleneck somewhere. Shouldn’t be. SSD storage. Dying SSD? Other VMs suffering no performance issues though.
CPU? Nope. Runs at an average of 95MHz single core. Basically idle.
RAM full/leaking? VM’s got 2GB. Possibly. Nothing in the logs, going to check via HA sensor. Wouldn’t it just page though? If so, SSD storage so surely no issue, unless SSD failing. Also, wouldn’t it affect USB connected devices also?!
Wireless AP? No known issues with laptops/phones BUT they’re on the 5GHz link and IOT stuff all on 2.4GHz. Possibly. Could be interference or dodgy AP. Thing is though, also lost all Deconz devices. Conbee is connected via USB which explains why I could get to the interface but doesn’t it communicate via ‘virtual’ network to HA when USB connected? Rules out access point/physical network and makes it a purely HA API/HA network issue or severe interference on 2.4GHz knocking out both all the wifi devices and the zigbee devices, all at once and at the same time as my reboot. Can’t see it somehow and all devices stayed completely off for 6.5 hours. Too much of a coincidence. Definitely something HA related. Takes me back to the bottleneck theory but the PIRs weren’t seeing anything, it was night.
HA related issue. How is that possible when restoring snapshots from good configs on previous versions?! This has to be hypervisor/hardware. I’m back to RAM/SSD.
All the while the USB connected RFLink performed impeccably throughout the saga. If it was Disk/RAM IO, wouldn’t I have lost that as well?!

How can I clean up my install? It’s been update on top of update since around 0.90, surely this has to get messy. Can I get rid of old data, clean up junk/temp files and carry out some general maintenance to get it running as slick as possible? Not sure where to go next. Don’t fancy clean install. I don’t see my setup as overly extravagant/busy, surely I’ve not worn out/used up the SSD writes? It’s not that old.

Thanks and sorry again for the long post. Real head scratcher.

anon43302295 · April 21, 2020, 10:33am

My first thought is wifi AP playing silly buggers. I know you think you’ve ruled that out because deconz went screwy at the same time, but that could (I’m guessing) just be a chain reaction.

If you’re happy that your restoration process is flawless, and a fully working ‘snapshot’ now cannot connect, the balance of probability would be that it’s the communication layer that’s at fault.

Now taking your other theory about excess traffic due to lockdown, but holding it at that point rather than worrying about the writes at HA’s end, this would also mean that the AP is being swamped more than usual, so could be bowing under the extra pressure.

Anyway, that’s where I would start. Even if just to focus on it for elimination purposes. Hope this helps
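For the elimination step, one rough way to separate "HA problem" from "network problem" is to probe the devices over TCP from a machine other than the HA VM (a laptop on the same LAN, say). The sketch below is only illustrative: it assumes the ESPHome native API is on its default port 6053 and that the Kasa devices answer on port 9999 (true for older Kasa firmware, but an assumption here). The function names are made up for the example.

```python
import socket

# Assumed default ports; adjust if your devices are configured differently.
ESPHOME_API_PORT = 6053  # ESPHome native API default
KASA_PORT = 9999         # older TP-Link Kasa local protocol (assumption)


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe(devices, timeout=2.0):
    """Map each device name to whether its (host, port) accepted a connection."""
    return {
        name: can_connect(host, port, timeout)
        for name, (host, port) in devices.items()
    }
```

Calling something like `probe({"lamp": ("192.168.1.50", ESPHOME_API_PORT)})` with your own static leases (the address here is a placeholder) from a second machine should show whether the devices are unreachable for everyone or only for HA. If they fail from the laptop too, the wifi side is the more likely suspect.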

mrrodge · April 21, 2020, 12:20pm

Thanks for the quick reply - and for taking the time to read such a long post.

Been looking at getting rid of the AP for a while - It’s a BT Hub 6 (Smart Hub) and it’s given me a good few years of service with great range and stability but it’s not very configurable. I’ve relieved it of its routing/modem duties and because I’ve turned a lot of bits off it constantly flashes amber, so I actually filled it with black silicone to hide it. It is acting as a switch for an IP camera so has that traffic passing through it as well. Could be a heat thing if it’s ventilated via the slot I filled but you’d expect it to reset if that was the case - I’d have noticed by now (It’s been filled for 12+ months).

I had a dual band TP-Link router (MR200) that I tried before the BT hub and the range was shockingly poor in comparison. I put that down to the lack of MIMO but it put me off risking it with another router as I thought I might end up spending £££ and getting something worse.

Any recommendations?

NK553 · April 21, 2020, 1:30pm

Yeah, this seems like network issues more than HA issues. I had a few connection drops with mine for a while, raising the DTIM Period to 3 (from 1) solved all my Sonoff/ESPhome connection problems (I have 41 of them and no dropoffs at all)
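For anyone unfamiliar with the setting: the DTIM period is counted in beacons, so its effect in wall-clock time depends on the AP's beacon interval. A tiny worked example, assuming the common default beacon interval of 100 TU (an assumption; the actual value on any given AP may differ), where one 802.11 time unit is 1024 microseconds:

```python
TU_MICROSECONDS = 1024  # one 802.11 time unit (TU)


def dtim_interval_ms(dtim_period, beacon_interval_tu=100):
    """Milliseconds between DTIM beacons for a given DTIM period.

    beacon_interval_tu defaults to 100 TU, a common AP default
    (assumed here, check your own AP's setting).
    """
    return dtim_period * beacon_interval_tu * TU_MICROSECONDS / 1000


for period in (1, 3):
    print(f"DTIM {period}: {dtim_interval_ms(period):.1f} ms between DTIM beacons")
```

This only shows what the number means in time terms. It doesn't explain why one value suits a particular device better than another.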

I have 3 Unifi APs covering a ~2,800 sq ft ranch and an acre lot, and about 65 wifi clients.

anon43302295 · April 21, 2020, 1:35pm

The Unifi long range AP is working far better than anything I’ve used before here (three storey town house). That was what was recommended to me when I was last looking. You have to run management software to configure it, but it’s just a docker container so it’s not particularly complicated.

mrrodge · April 21, 2020, 1:58pm

@NK553 @anon43302295, how are your network speeds (best & worst)?

NK553 · April 21, 2020, 2:07pm

About 160 Mbps on wifi, I don’t keep track of it over time so I don’t know the worst. I have gigabit for ethernet portions of my network and the backhaul for the APs.

mrrodge · April 21, 2020, 2:20pm

Thanks. Which APs are you using?

NK553 · April 21, 2020, 2:22pm

Two AP-AC Lites and 1 AP-AC-LR (I have a garden sensor about 200 feet away outside, and that picks it up really well)

anon43302295 · April 21, 2020, 2:32pm

Mine is the AP-AC-LR (just one), and I’m not sure about speeds off the top of my head and I’m on my phone so I can’t check but I’d say Brian’s is a good indication (rather than a flukey exception to the rule).

Edit, just walked in the door and my phone reckons it’s connected at around 400 just now…

So yeah, it’s more than quick enough, but I’d have to do some digging with the laptop to see actual best/worst/averages.

willemsFEB · April 21, 2020, 3:59pm

When you were talking about your 2 wifi switches periodically going offline, it sounded like a problem I was experiencing, but your additional problems don’t seem very related, but who knows:

I have multiple Shelly wifi switches installed in my apartment, and I noticed some of them would become unresponsive whenever I left the house. After a very long time of searching, I found out that the problem was with a presence detection script I wrote using Bluetooth, where I would periodically search for my phone’s MAC address, and if it would find it I’m home.

Turned out that when I’m away (so when my rpi could not find my phone), it was generating enough Bluetooth traffic to disrupt my Shelly switches, as my wifi was operating in the same frequency range as Bluetooth. I fixed it with changing my wifi channel and using a more lightweight bluetooth pinging tool.

Perhaps you are also using Bluetooth in some automation which disrupts your wifi devices?

finity · April 21, 2020, 7:35pm

That would be my vote since they both run at 2.4GHz. And since you had access to the USB part of the zigbee stick I doubt it’s the internal comms to the stick itself.

I don’t know if it’s possible to change your zigbee/AP to different channels but that is something you can try first. Make sure they are as far apart in the spectrum as possible, and as far away as possible from the neighbors’ channels too. In a crowded 2.4GHz space that is definitely a challenge sometimes. Unless you get lucky and none of your neighbors know they can change the channels and they all run on channel 1!

mrrodge · April 22, 2020, 8:34am

Used a Wi-Fi analyser app for Android last night and I think you guys are right. Lockdown congestion! We have a lot of 2.4ghz routers around here (shocked to be honest, detached housing a fair way from neighbours).

When everything was back and stable yesterday I checked my AP and it was using channel 11. My ZigBee setup is also using channel 11, but I’ve learned that Zigbee channel 11 is at the opposite end of the 2.4GHz band from wifi channel 11, so the two were well separated, which is good.

I kept checking HA as the evening went on and sure enough, from about 7pm things started failing again, resulting in every ESPHome device eventually disappearing from the network. ZigBee remained fine though and hasn’t missed a beat.

I checked the analyser app and it seems the ‘smart’ hub switched the channel to channel 1, which clashes directly with the Zigbee channel 11 setup and half the neighbours’ hotspots. This morning it’s still on channel 1 and my phone won’t even connect to it/see it half the time. I also recently bought a HP printer that has a ‘Direct’ wifi connection and that was also using channel 1, so turned that off. I’ve enabled the hotspots on the ESPHome devices as well and when they lose their connection they all seem to advertise on channel 1 as well, so it’s been a downward spiral. I may just turn that off and re-flash as well. ZigBee remained solid throughout all this, so I’m inclined to think that the reboot the other night made it give up its signal and it was unable to get through the noise to re-establish the network.

I do want rid of the BT hub but I’m not sure the Unifi devices are the way for me. I get that the Unifi devices are good - we use them at work and I know they’re solid but I’m wondering if I’ll be better off getting something flashable with DD-WRT and ending up with more spec/bang for the buck. Hardware-wise the Unifi chipsets don’t look different and they’re expensive for lesser MIMO specs than my BT Hub.

One thing’s certain though - the BT ‘Smart’ Hub can’t handle auto channel swapping!

mrrodge · April 22, 2020, 10:00am

Update:

Definitely the culprit. Instantly went from no devices to all of them by forcing the hub to switch back to channel 11. Also swapped the Conbee from 11 to 15 in accordance with this (page 18).
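The channel numbers are confusing because wifi and Zigbee number the same 2.4GHz band differently. A minimal sketch of the arithmetic, assuming nominal widths of 22MHz for a wifi channel and 2MHz for a Zigbee channel (real-world interference doesn't stop dead at those edges, so treat the result as a rough guide):

```python
WIFI_WIDTH_MHZ = 22    # nominal 2.4GHz wifi channel width (assumption)
ZIGBEE_WIDTH_MHZ = 2   # nominal 802.15.4 channel width (assumption)


def wifi_center_mhz(channel):
    """Centre frequency of 2.4GHz wifi channels 1-13."""
    if not 1 <= channel <= 13:
        raise ValueError("wifi channel must be 1-13")
    return 2407 + 5 * channel


def zigbee_center_mhz(channel):
    """Centre frequency of Zigbee (802.15.4) channels 11-26."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee channel must be 11-26")
    return 2405 + 5 * (channel - 11)


def overlaps(wifi_channel, zigbee_channel):
    """True if the nominal wifi and Zigbee channel spans overlap."""
    gap = abs(wifi_center_mhz(wifi_channel) - zigbee_center_mhz(zigbee_channel))
    return gap < (WIFI_WIDTH_MHZ + ZIGBEE_WIDTH_MHZ) / 2


print(overlaps(1, 11))   # wifi 1 (2412MHz) vs Zigbee 11 (2405MHz)
print(overlaps(11, 11))  # wifi 11 (2462MHz) vs Zigbee 11 (2405MHz)
print(overlaps(11, 15))  # wifi 11 (2462MHz) vs Zigbee 15 (2425MHz)
```

With these assumptions only the first pair overlaps: wifi channel 1 sits right on top of Zigbee channel 11, while wifi channel 11 is clear of both Zigbee 11 and Zigbee 15.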

Looking at the default SSIDs on my neighbours’ networks it looks like they’re all on pretty recent hardware so theirs will be switching all over with crappy algorithms as well, so if I manually set a channel the chances are it’ll get crapped on by another at some point so to keep it reliable whatever hardware I get will have to sport a better channel switching algorithm and/or be able to go to overlapping channels (other than 1, 6 and 11). Is this what the Unifi APs do?
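To make "a better channel switching algorithm" concrete, here is a deliberately naive sketch of how one could score channels from a wifi analyser scan. It is not a description of what Unifi or the BT hub actually do. The scan data is invented for illustration, and the scoring rule (sum the received power of every network within 4 channels of the candidate) is an assumption.

```python
def interference_score(candidate, scan):
    """Sum received power in mW of networks overlapping the candidate channel.

    scan is a list of (channel, rssi_dbm) pairs. Two 2.4GHz channels are
    treated as overlapping when they are fewer than 5 channels apart.
    """
    return sum(
        10 ** (rssi_dbm / 10)  # convert dBm to milliwatts
        for channel, rssi_dbm in scan
        if abs(channel - candidate) < 5
    )


def quietest_channel(scan, candidates=(1, 6, 11)):
    """Pick the candidate channel with the lowest interference score."""
    return min(candidates, key=lambda c: interference_score(c, scan))


# Hypothetical scan results: (channel, RSSI in dBm).
example_scan = [(1, -50), (1, -60), (3, -70), (6, -65), (11, -80)]
print(quietest_channel(example_scan))
```

A real AP would also need to consider airtime, non-wifi noise and how often to re-evaluate, which is where cheap auto-channel implementations tend to fall down.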

Thanks for the help and patience, really appreciate it.

mrrodge · April 22, 2020, 12:11pm

@NK553 any idea why raising the DTIM period improved things for you? You’d think lowering it would work, not raising.

NK553 · April 22, 2020, 12:44pm

I have no idea, other than 3 is a normal default and maybe the devices liked that better.

mrrodge · April 22, 2020, 1:51pm

OK thanks!