TL;DR: Sorry for how ranty this is. Jump to the last 2 paragraphs for some ideas that could make it easier to resolve connectivity issues in ZHA.
I’ve been using Zigbee almost since I started using HA in 2022, after falling into the trap of Wi-Fi smart bulbs that depend on a sketchy app and a cloud server in China. I decided to avoid Wi-Fi and switch to a guaranteed local protocol. I discovered Zigbee and fell in love right away.
Unfortunately, one issue keeps coming up despite my best efforts to prevent it: parts of the network, or the entire network, regularly respond extremely slowly or not at all until I restart ZHA. I live in a fairly rural area, have only about 10 Wi-Fi devices, and made sure to select a Zigbee channel with low interference. If it were interference, as many people say it likely is, I wouldn’t expect restarting the integration to work as reliably as it does, and I would expect the issue to sometimes resolve itself, which it rarely does.
The Zigbee network uses a SkyConnect with about 100 devices, including 5 smart plugs acting as routers, distributed around a 2,000 sq ft house. Most devices are Sengled bulbs, so I usually find out something is wrong when a command or automation fails to change a light. I also have a few climate sensors, and I can see that they sometimes stop updating when the bulbs aren’t responding. Unplugging the router that a non-responding device is routed through sometimes resolves it and sometimes doesn’t.
I also set up a small network at my parents’ house with only 11 devices, including one plug router, and it runs into the same issue. It works great for a while, and then out of nowhere something stops responding until ZHA is restarted.
I tried using the network visualization to troubleshoot, but with 100 devices it’s a jumbled mess. I can hardly tell by looking at it which devices aren’t responding or what their routes to the coordinator are. Zooming in and clicking on things just to see a route is slower and more painful than it should be. The topology also takes a very long time to update after a change, even after clicking refresh: it can keep showing a device connected through an unplugged router for potentially hours after the device has reconnected and is working through a different route.
I think it would make much more sense to have a sortable/filterable table that shows each device’s route, signal strength, last communication time, and some metric on communication errors.
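
To make the table idea concrete, here is a rough Python sketch of the kind of data and filtering I have in mind. Everything in it is made up for illustration: `DeviceRow`, its field names, `silent_devices`, `routed_through`, and the sample devices are not anything ZHA actually exposes in this form.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DeviceRow:
    """One row of the proposed table. All field names are hypothetical."""
    name: str
    route: list[str]        # hops from the device to the coordinator, in order
    lqi: int                # link quality indicator, 0 (worst) to 255 (best)
    last_seen: datetime     # time of the last message received from the device
    failed_commands: int    # commands that got no acknowledgement


def silent_devices(rows, now, max_age=timedelta(hours=1)):
    """Devices not heard from within max_age, longest-silent first."""
    stale = [r for r in rows if now - r.last_seen > max_age]
    return sorted(stale, key=lambda r: r.last_seen)


def routed_through(rows, router):
    """Devices whose route passes through the given router."""
    return [r for r in rows if router in r.route]


# Invented sample data, only to show the filtering.
now = datetime(2024, 1, 1, 12, 0)
rows = [
    DeviceRow("Kitchen bulb", ["Plug A", "Coordinator"], 180, now - timedelta(minutes=2), 0),
    DeviceRow("Porch bulb", ["Plug B", "Coordinator"], 60, now - timedelta(hours=5), 12),
    DeviceRow("Attic sensor", ["Plug B", "Coordinator"], 45, now - timedelta(hours=3), 4),
]

for row in silent_devices(rows, now):
    print(row.name, " -> ".join(row.route), row.lqi, row.failed_commands)
```

In this made-up data, both silent devices route through the same plug, which is exactly the kind of pattern that is hard to spot in the graph view but obvious in a sorted list.
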
ZHA could really use better tools to detect connectivity issues, attempt to resolve them automatically, and, if necessary, notify the user of the issue and its most likely causes and solutions.
- If a device is misbehaving, particularly a router not routing, detect it and notify the user.
- If the network is too congested with talkative devices, show where the congestion is and suggest ways to reduce it.
- If interference is causing a lot of problems and another channel would be much better, detect it and present the energy scan data to the user in a notification so they can make a more informed decision without having to download the diagnostics file.
- If a device’s battery is low, notify the user so they don’t find out the hard way. A custom automation is inconvenient because each device has to be added to it, so most people wouldn’t bother.
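
As a rough illustration of the battery point, a single check that scans every battery reading would avoid having to add devices one at a time. This is only a sketch in plain Python: `low_batteries`, the 20% threshold, and the sample readings are my own assumptions, not an existing ZHA or Home Assistant feature.

```python
def low_batteries(levels, threshold=20):
    """Return (name, percent) pairs at or below threshold, lowest first.

    Devices with no reading (None) are skipped rather than reported.
    """
    low = [(name, pct) for name, pct in levels.items()
           if pct is not None and pct <= threshold]
    return sorted(low, key=lambda item: item[1])


# Invented readings; None stands for a device that has not reported a level.
levels = {"Attic sensor": 8, "Door sensor": 55, "Porch sensor": None, "Remote": 19}

for name, percent in low_batteries(levels):
    print(f"Low battery: {name} at {percent}%")
```

A new battery device would be covered as soon as it shows up in the readings, with nothing to configure per device.
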