WTH is ZHA/Zigbee so unstable and hard to troubleshoot?

TL;DR: Sorry for how ranty this is. Jump to the last 2 paragraphs for some ideas that could make it easier to resolve connectivity issues in ZHA.

I’ve been using Zigbee almost since I started using HA in 2022, after falling into the trap that is smart wi-fi bulbs that depend on a sketchy app and cloud server in China. I decided to avoid wi-fi and switch to a guaranteed local protocol. I discovered Zigbee and fell in love right away.

Unfortunately, one issue that keeps coming up despite my best efforts to prevent it is that parts of the network, or the entire network, regularly respond extremely slowly or not at all until I restart ZHA. I live in a fairly rural area and only have about 10 wi-fi devices, and made sure to select a Zigbee channel with low interference. If it were interference, as many people say it likely is, I wouldn’t expect restarting the integration to work as reliably as it does, and I would expect the issue to sometimes resolve itself, which it rarely does.

The Zigbee network is using SkyConnect with about 100 devices including 5 smart plug routers distributed around a 2000 sqft house. Most devices are Sengled bulbs, which is usually how I find out it’s not working when a command or automation fails to change the light. I have a few climate sensors and can see sometimes they stop updating when the bulbs aren’t responding. Sometimes unplugging a router that a non-responding device is going through resolves it, sometimes it doesn’t.

I also set up a small network at my parents with only 11 devices including one plug router and it runs into the same issue. It works great for a while and then out of nowhere something stops responding until it’s restarted.

I tried using the network visualization to troubleshoot, but with 100 devices it’s a jumbled mess. I can hardly tell by looking at it which devices aren’t responding and what their routes are to the coordinator. Zooming in and clicking on things just to see the route is slower and more painful than it should be. The topology also takes a very long time to update after a change, even after clicking refresh: it will continue showing a device connected through an unplugged router potentially hours after the device was reconnected and functioning through a different route.

I think it would make much more sense to have a sortable/filterable table that shows each device’s route, signal strength, last communication time, and some metric on communication errors.
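
To make the table idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `DeviceRow` fields, the sample devices, and the sort order are made up, and this is not how ZHA stores or exposes its data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class DeviceRow:
    """One row of a hypothetical connectivity table (not an existing ZHA structure)."""
    name: str
    route: List[str]          # hops between the device and the coordinator
    lqi: Optional[int]        # link quality indicator, None if unknown
    last_seen: datetime       # time of the last message received from the device
    failed_commands: int      # commands that timed out during the sampling window


def worst_first(rows: List[DeviceRow]) -> List[DeviceRow]:
    """Order rows by most failures, then oldest contact, then weakest link."""
    return sorted(
        rows,
        key=lambda r: (-r.failed_commands, r.last_seen, r.lqi if r.lqi is not None else -1),
    )


def routed_via(rows: List[DeviceRow], router: str) -> List[DeviceRow]:
    """Keep only the devices whose route passes through the given router."""
    return [r for r in rows if router in r.route]


# Made-up sample data, only to show the sorting and filtering.
now = datetime(2024, 12, 1, 12, 0)
rows = [
    DeviceRow("kitchen bulb", ["hall plug"], 120, now - timedelta(minutes=2), 0),
    DeviceRow("porch bulb", ["basement plug"], 40, now - timedelta(hours=3), 7),
    DeviceRow("leak sensor", [], None, now - timedelta(minutes=30), 1),
]

for row in worst_first(rows):
    route = " -> ".join(row.route + ["coordinator"])
    print(f"{row.name:14} {route:30} lqi={row.lqi} failed={row.failed_commands}")
```

Sorting by failures first puts the devices I would want to look at on top, and filtering by router answers "what goes through this plug?" without clicking around a graph.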

ZHA could really use better tools to detect, automatically attempt to resolve, and if necessary notify the user about a connectivity issue and the most likely causes/solutions.

  • If a device is misbehaving, particularly a router not routing, detect it and notify the user.
  • If the network is too congested with talkative devices, show where the congestion is and suggest ways to reduce it.
  • If interference is causing a lot of problems and another channel would be much better, detect it and present the energy scan data to the user in a notification so they can make a more informed decision without having to download the diagnostics file.
  • If a battery is low in a device, notify the user so they don’t find out the hard way (a custom automation is inconvenient since each device needs to be added, most people wouldn’t bother).
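
As a rough illustration of the last point, the check itself is tiny once the readings for all devices are in one place. This is a Python sketch under assumptions: the `needs_battery_attention` helper, the 20% threshold, and the readings are hypothetical, and it does not call any ZHA or Home Assistant API.

```python
from typing import Dict, List, Optional


def needs_battery_attention(
    levels: Dict[str, Optional[float]], threshold: float = 20.0
) -> List[str]:
    """Return names of devices reporting below the threshold or reporting nothing."""
    return sorted(
        name for name, level in levels.items() if level is None or level < threshold
    )


# Made-up readings; None stands for a device with no battery report available.
readings = {"front door": 15.0, "attic climate": 80.0, "basement leak": None}
print(needs_battery_attention(readings))
```

Treating a missing reading as a problem is a deliberate choice here, since a device that stopped reporting is at least as interesting as one that reports a low value.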

Welcome!

I know you have put a lot of sweat equity into ZHA (and I don’t like telling you to spend more coin than you already have). However, from my experience with both ZHA and Zigbee2MQTT (as well as too many other Zigbee systems), I would recommend you stand up a Zigbee2MQTT coordinator that is independent of your HA setup, so that it can be worked on independently of HA. Set up a matrix of solid router devices across your physical space on this new Zigbee network (if I read your post correctly, you have only 5 routers for 100+ devices, which sounds very thin). Then, in stages, move your end-user Zigbee devices to the Zigbee2MQTT network. Make sure to ‘add’ devices physically where they will live, and use the add ‘via’ close-by routers function in Zigbee networks (read up on this if you do not grok what I am saying).

You might even consider setting up a second Zigbee2MQTT production network and dividing your devices between the two. Maybe even a third, depending on your setup. And an additional small Zigbee2MQTT network as a development/testing environment, to test new devices before adding them to production.

Unless there is an only-on-ZHA device in your world, I really think that Zigbee2MQTT offers more tools, community, and development.

Good hunting!

I agree with you regarding visualization - I use ZHA and looking at Zigbee2MQTT, it’s just as useless once you have a large number of devices. What I’d really like to know is are battery devices connecting to routers in the same room. Rooms are something HA knows about, not ZHA/Zigbee2MQTT. I could sort of do this manually with deCONZ because I could drag devices into grouped areas of the screen but it was still a mess.

Every few weeks I have to re-pair a device or two because they’ve dropped off. I have a lot of potential interference from neighbourhood wifi.

@pilot51 You should vote for your own post as well.

Sorry, I forgot to clarify that it’s 100 paired devices but only about half of them are continually powered. 19 battery sensors (8 climate, 6 door, 4 leak, 1 motion), ~25 bulbs, 2 relay routers in the wall, and the 5 plug routers. I have 3 other plugs that I normally leave unplugged because they aren’t used much and one in the basement was being a bad router once. All of the sensors, the relays, and half the plugs are Aqara, the other half are ThirdReality.

I did look at Zigbee2MQTT about a year ago and I wasn’t ready to climb that mountain and rebuild my network with the uncertain hope that it would resolve these issues, let alone that I would be happy with the change in functionality and ease of use. One concern is losing all of the sensor history, especially climate history which I regularly reference. I should give it a shot anyway. If restarting ZHA reliably resolves the issue, albeit temporarily, there’s a good chance the problem is with ZHA and not the network itself.

My understanding is that pairing devices via routers is really only useful if it’s out of range of the coordinator, as once it’s paired the network will eventually self-heal and find what it determines to be optimal routes (though I question some of its choices).

Setting up multiple networks seems a bit overkill and shouldn’t be necessary in a production environment where all I care about is that everything works as promised. I found several posts from people with single networks handling 150+ devices without issue.

I use z2m, not ZHA, so I can only offer general Zigbee advice.

“only about half of them are continually powered”

Does that mean you’re powering those devices only when you need them? Because there are basically two things you shouldn’t do with Zigbee networks: routinely power down devices that act as routers, and run the network without enough routers in the first place.

The ballpark figure I use for routers vs end devices is about 1 to 5, although my current network has about 1 to 3 (24 routers, 73 end devices, 97 in total). I have 0 issues with Zigbee, other than the occasional battery running out.
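
For anyone who wants to run the numbers on their own network, the arithmetic can be sketched in a few lines of Python. The function names are made up for this example, and the 1 to 5 figure is only my ballpark, not a Zigbee rule.

```python
import math


def end_devices_per_router(routers: int, end_devices: int) -> float:
    """How many end devices each router serves on average."""
    return end_devices / routers


def routers_for_ratio(total_devices: int, max_end_devices_per_router: int = 5) -> int:
    """Smallest router count that keeps the average at or below the given ratio."""
    # With r routers out of T devices, (T - r) / r <= k means r >= T / (k + 1).
    return math.ceil(total_devices / (max_end_devices_per_router + 1))


# My network: 24 routers and 73 end devices.
print(round(end_devices_per_router(24, 73), 1))
# Routers a 97-device network would need to stay within 1 to 5.
print(routers_for_ratio(97))
```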

Also, the following is purely anecdotal, but from talking to other HA users that I know, and also from the Homey (don’t buy one) forum, I get the impression that the Silicon Labs Zigbee chip (which is used in the SkyConnect, and also in the Sonoff Dongle Plus-E) isn’t great. I have a Sonoff Dongle Plus-P, using a TI chip, and like I said, no issues.

Aside from the 3 plugs that I have unplugged because I rarely use them, all of the other devices that are normally powered off are end point bulbs. I did have a few bulbs from a different brand that were also routers which I kept powered because of that, but I came to the conclusion that they were causing some of the problems (turning them off sometimes resolved issues with end points not responding) and replaced them with end point bulbs. The routers I do have in the network rarely change, like I add or remove one less than once a month, so that wouldn’t explain the issue which happens almost on a daily basis.

Basically, my suspicion has been that, if it has anything to do with routers, a router that is powered on and in the network is failing to route. The problem then is how to easily and reliably find out if that’s indeed what is happening and, if so, which one it is, without having to pull plugs. That’s also why I don’t just leave all of the plugs connected as I was doing about a year ago, because I see it as more points of potential failure that would be even harder to troubleshoot.

On the other hand, I’ve also had issues with some end point devices not responding when the visualization indicated that it was connected directly to the coordinator. Now that I think of it, I find it strange that devices within a few feet of the coordinator (in the same closet or near the closet) often use a router off in a different room with worse signal quality than the ones going direct.

I originally had the Nortek HUSBZB-1 and replaced it with the SkyConnect in hopes of resolving the issue, making sure to use the USB extension.

If it makes any difference, I’m running HA Container (Docker) 2024.11.3 on a headless Intel NUC 5 with Debian 12.

I know that some brands don’t really mix, although it seems to be specific to which controller is being used. For example, on the Homey forum there’s a lot of people that have issues with the combination Aqara/IKEA, but I have plenty of devices from both brands and they work fine (Aqara devices do tend to stick to the router they initially paired with, and typically don’t jump to another router if that router is better or if the original router is gone from the network).

But like I said already, if you have 100+ devices you probably need at least 20 of those to be routers. Most router devices have a limited number of children they can handle (which includes other routers), just like the coordinator (I think the SL chip in the SkyConnect can handle 32 direct children at most).

FWIW, I’m running both HA and z2m in a Docker container.

That brings me back to my original question. Why is it so hard to determine if it’s caused by a router limitation or something else?

Almost all of the potential causes and solutions I hear are basically guesses with no known way to verify using all the available technical data. I don’t want to spend a bunch of money on devices for the sole purpose of having more routers if I’m not sure that will solve the problem.

And again, not having enough routers doesn’t explain why end devices connected directly to the coordinator (currently just 2) are also sometimes unresponsive. I’m only at 9 direct children.

I would also expect restarting ZHA to make things worse by flooding the network with requests as it reconnects to all of the devices, but the opposite happens. Almost every time, a few seconds after restarting, everything responds lightning fast.

It also doesn’t explain why my parents have the same issue with only 11 devices (6 end bulbs, 2 router bulbs always on but not routing anything, door and leak sensors, 1 router plug specifically to extend range). That’s much easier to troubleshoot since the two devices having the most trouble (door sensor controlling a light) are the only two routing through the plug. The question there is how to get it to stay stable, which I suspect would be the same solution that my own network needs. My dad just gave up on it and went back to using a regular bulb and switch.

Because fixing these sorts of issues is mostly done by interpreting symptoms.

In my experience, bad Zigbee networks, or misbehaving devices, are typically caused by (in order of most likely first):

  1. not enough routers
  2. devices being on the edge of the network, so they have a tendency to “fall off” (closely related to, but not the same as, the “not enough routers” issue)
  3. devices that get stuck to a specific router
  4. devices of different brands that don’t work well together
  5. interference (google “zigbee wifi overlap” to find out which Zigbee channels overlap WiFi channels, and try to set them so they don’t overlap)
  6. people not understanding Zigbee and routinely powering down router devices
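
For point 5, the overlap can also be worked out from the channel center frequencies instead of a chart. The Python sketch below assumes a 22 MHz wide Wi-Fi channel and a roughly 2 MHz wide Zigbee channel; both widths are approximations, real interference also depends on signal strength and traffic, and the function names are made up for the example.

```python
from typing import Iterable, List

ZIGBEE_CHANNELS = range(11, 27)  # 2.4 GHz Zigbee channels 11 to 26


def zigbee_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz Zigbee (IEEE 802.15.4) channel."""
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Center frequency of a 2.4 GHz Wi-Fi channel, valid for channels 1 to 13."""
    return 2407 + 5 * channel


def overlaps(
    zigbee_channel: int,
    wifi_channel: int,
    wifi_width_mhz: float = 22.0,   # assumed Wi-Fi channel width
    zigbee_width_mhz: float = 2.0,  # assumed Zigbee channel width
) -> bool:
    """True when the two channels' frequency ranges intersect."""
    gap = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return gap < (wifi_width_mhz + zigbee_width_mhz) / 2


def clear_zigbee_channels(wifi_channels: Iterable[int]) -> List[int]:
    """Zigbee channels that overlap none of the given Wi-Fi channels."""
    wifi = list(wifi_channels)
    return [z for z in ZIGBEE_CHANNELS if not any(overlaps(z, w) for w in wifi)]


# Wi-Fi on the common non-overlapping channels 1, 6 and 11.
print(clear_zigbee_channels([1, 6, 11]))
```

Pass in the Wi-Fi channels you actually see in a scan, including the neighbours', to get the Zigbee channels that stay clear under these assumptions.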

Nothing in life is certain 🤷 A router device doesn’t cost the world, and it will strengthen your Zigbee network.

See points 2 and 3.

Does ZHA provide any debug logging that might provide insight into what’s going on?

In this case, none of these except point 5 would apply to devices that don’t respond even when connected directly to the coordinator and physically very close to it. For point 5, in theory, even WiFi interference shouldn’t be much of a problem given that those signals would be very strong. And again, I took the steps to prevent interference by researching the overlaps and doing both WiFi and Zigbee scans before selecting the Zigbee channel when I replaced the coordinator and recreated the network.

“Nothing in life is certain 🤷 A router device doesn’t cost the world, and it will strengthen your Zigbee network.”

As I explained, there was no noticeable benefit to using 8 routers instead of the 5 I’m using now and it actually seemed less stable with more routers. Most of the devices that I experience the issue with are within 30 feet of the coordinator with routers between. I see no evidence that it’s a range issue since I’ve had little to no problem pairing and using devices about 60 feet away at the opposite end of the house, connected directly to the coordinator before they switched to a router. I have a router plug all the way over there with an automation that turns it on during off-peak electric rates and it never has issues (I checked history to be sure). I don’t fully understand what the visualization is showing with all the hard-to-follow lines and no legend or tooltips describing what anything means, but it appears to be actively connected to the coordinator as well as to other routers that are also actively connected to the coordinator.

Again, I fail to understand how this could be a router issue if everything responds fast immediately after starting the integration and over time the network becomes slow or unresponsive and does not recover until ZHA is restarted. Did you see that exact same behavior and adding routers solved it?

“See points 2 and 3.”

They definitely weren’t at the edge. As I said, they were within feet of the coordinator. They also weren’t connected to any router unless the topology was outdated or lying to me.

“Does ZHA provide any debug logging that might provide insight into what’s going on?”

The general logs don’t say anything useful other than that commands are timing out, basically the same thing I get in the toast messages when using the app. I’m pretty sure I enabled ZHA debug logging before and it didn’t help, but I enabled it to get you an answer and waited a few days for it to happen again and it never did (typical, it behaves when you debug it). Unfortunately, today HA started acting up and I found that the log file ballooned over 8GB, completely filling the partition. So I’ll have to leave debugging turned off and enable it after the issue shows up.

As much as I’m describing my problem to either support my question or in reply to comments, I’m not really asking for help troubleshooting my problem. I’m kind of tired of asking and getting the same responses about not having enough routers, which clearly aren’t the root cause. People seem to believe with confidence that the solution to their problems would apply to everyone else having vaguely similar connectivity problems.

What I am asking is why ZHA can’t make it easier to troubleshoot, with better detection and reporting of common issues, if not automatic fixes, before the problem is discovered when trying to use a device. Ideally easy enough that even my dad would have an idea what to do to permanently solve the issue without having technical knowledge of how Zigbee works or where to go to find answers. If that can be accomplished, it would help adoption expand beyond the technically inclined like myself. Relying 100% on the user to interpret the symptoms is putting too much trust in the user; many don’t even RTFM, if one is even provided. I’m a software developer and know my way around these things fairly well, and even I’m struggling with it. The main thing working against me here is the time and motivation to get into the weeds. I’d rather this “just work” so I can spend more time on things I enjoy spending time on, but if that’s too much to ask, I would like it to at least tell me a connected device isn’t working right before I have to find out the hard way.

The ability to automate restarting an integration would also help me work around it. I’m surprised that’s not an option while restarting HA is.
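
For what it’s worth, the closest thing I’m aware of is the `homeassistant.reload_config_entry` service, which can be called through the REST API. Below is a minimal Python sketch, assuming a long-lived access token and the config entry ID of the ZHA integration; the URL, token, and entry ID are placeholders, the helper names are made up, and I have not verified that a reload clears the slowdown the same way a manual restart does.

```python
import json
import urllib.request


def build_reload_request(base_url: str, token: str, entry_id: str) -> urllib.request.Request:
    """Build (but do not send) a POST that calls homeassistant.reload_config_entry."""
    url = f"{base_url.rstrip('/')}/api/services/homeassistant/reload_config_entry"
    body = json.dumps({"entry_id": entry_id}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def reload_entry(base_url: str, token: str, entry_id: str, timeout: float = 30.0) -> int:
    """Send the request and return the HTTP status code."""
    request = build_reload_request(base_url, token, entry_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Placeholder values: substitute your own URL, long-lived access token, and entry ID.
request = build_reload_request("http://homeassistant.local:8123", "TOKEN", "ZHA_ENTRY_ID")
print(request.get_method(), request.full_url)
```

Only `build_reload_request` runs here, so nothing is sent; calling `reload_entry` from a cron job or script would be the workaround, at least until the root cause is found.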

I agree with what you posted. I also find it quite hard to pinpoint why my 64-device network with 20+ routers sometimes reacts very slowly and sometimes doesn’t, or how to tell devices not to try routing through a router that broke last month and was replaced.
Somebody made a solution for the last problem in your list, though. It’s a really nice blueprint, limited only by the accuracy of the battery readings that devices report: 🪫 Low Battery Notifications & Actions. No need to add devices manually.

Keep in mind that the way Zigbee works, it is the combination of the Zigbee stack firmware running on your Zigbee Coordinator (a.k.a. controller) and the Zigbee stack firmware running on your Zigbee Routers (a.k.a. repeaters) that does almost all of the Zigbee network meshing automatically, all on its own. The Zigbee Gateway more or less just sends simple commands to the Zigbee Coordinator, asking it to do things like enable joining/pairing mode, send a command like on/off, or make device configuration changes, so there is not a lot of micromanagement that the Zigbee gateway software can or should do.

Yes, if you have not physically optimized your devices, then troubleshooting Zigbee is hard regardless of which Zigbee Gateway you are using. That is, FYI, why I wrote this community guide with recommended best practices and tips on how to proactively avoid or work around known issues. I strongly suggest that everyone, with problems or not, try to follow all the advice there before troubleshooting, as that will at least make it easier to find the root cause later:

I do not, however, know how or if it is possible to automate any of that in ZHA or Home Assistant, as most of it consists of practical things in your environment that you need to take action on, changing or adjusting them in order to optimize the conditions and give your Zigbee network a chance to work its own meshing magic. In essence, you more or less always need to both add many more Zigbee Router devices and move your Zigbee Coordinator away from anything electronic by using a long USB extension cord.

If a device isn’t routing, how would ZHA know? The packet either takes another route to the coordinator or is dropped; either way, it’s not something that the coordinator knows about. Really, all it knows is that a device hasn’t checked in for a defined period of time, in which case it marks it as unavailable.
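
To illustrate how little that check-in signal gives you, here is a minimal Python sketch. The two-hour timeout, the device names, and the parent/child map are all made up, and the `suspect_routers` heuristic is only a guess at what could be inferred from check-in times; it is not something ZHA does, and it relies on parent information that may itself be stale.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def is_available(last_seen: datetime, now: datetime, timeout: timedelta) -> bool:
    """A device counts as available if it checked in within the timeout."""
    return now - last_seen <= timeout


def suspect_routers(
    children: Dict[str, List[str]],
    last_seen: Dict[str, datetime],
    now: datetime,
    timeout: timedelta,
) -> List[str]:
    """Routers that still check in themselves while all of their children have gone quiet."""
    suspects = []
    for router, kids in children.items():
        router_ok = is_available(last_seen[router], now, timeout)
        kids_silent = bool(kids) and not any(
            is_available(last_seen[kid], now, timeout) for kid in kids
        )
        if router_ok and kids_silent:
            suspects.append(router)
    return sorted(suspects)


# Made-up network state for the demonstration.
now = datetime(2024, 12, 1, 12, 0)
timeout = timedelta(hours=2)  # hypothetical value, not a confirmed ZHA default
last_seen = {
    "hall plug": now - timedelta(minutes=5),
    "door sensor": now - timedelta(hours=5),
    "hall bulb": now - timedelta(hours=4),
    "kitchen plug": now - timedelta(minutes=1),
    "kitchen bulb": now - timedelta(minutes=10),
}
children = {"hall plug": ["door sensor", "hall bulb"], "kitchen plug": ["kitchen bulb"]}

print(suspect_routers(children, last_seen, now, timeout))
```

Note that the router in this example looks perfectly healthy by the availability check alone, and battery devices that report rarely would need much longer timeouts to avoid false alarms.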

Also, not something ZHA can do much about. It doesn’t know that a specific device is overloading the mesh; that shows up in the form of dropped packets, which is similar to the above. It might be able to report if it’s receiving a large number of packets from a specific device, but what is a large number? That would depend on the device type and the network. Also, Zigbee devices can send to other endpoints, so it’s possible to flood the network and the coordinator may never see the traffic.
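
One way to sidestep the "what is a large number" question is to compare each device against the rest of the same network rather than against a fixed limit. The Python sketch below does that with a median; the factor of 5 and the sample traffic are arbitrary assumptions, and it only counts packets the coordinator actually receives, so it would miss the device-to-device traffic mentioned above.

```python
from collections import Counter
from statistics import median
from typing import Dict, Iterable


def chatty_devices(senders: Iterable[str], factor: float = 5.0) -> Dict[str, int]:
    """Devices whose packet count exceeds `factor` times the network's median count."""
    counts = Counter(senders)
    if not counts:
        return {}
    typical = median(counts.values())
    return {device: n for device, n in counts.items() if n > factor * typical}


# Made-up sample: one sender name per packet seen by the coordinator.
packets = ["plug"] * 50 + ["bulb 1"] * 4 + ["bulb 2"] * 5 + ["sensor"] * 3
print(chatty_devices(packets))
```

A per-device-type baseline would be fairer than a single median, since a power-monitoring plug is expected to report far more often than a door sensor.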

Not as useful as you would think. If you have a robust mesh, it will show that the Zigbee channel is full when in reality it is just a bunch of Zigbee devices communicating. ZHA used to run an energy scan on startup, but it was commonly misinterpreted.

This can easily be done with an automation today. I wouldn’t bother, though: in my experience, a device will drop off due to a dead battery while still reporting 70%.

I will note, if you have 100 devices and only 5 routers, that’s way too few. I have a similar sized network and as a rough count, 35 routers. Also, some devices are just plain bad, there are routers that will drop packets from other manufacturers, fail to route randomly, etc. There isn’t a silver bullet here, you have to go through the network and identify what’s causing the problem. It very well could be a combination of things, but there isn’t going to be an easy button added to ZHA to do it for you, it’s just not possible.

SkyConnect has a lot of diagnostics entries, many more than the Sonoff dongle (this is one reason I replaced mine). You can check those and see if anything specific is visible when problems occur, and how the restart changes it.

I’m personally more on your side and don’t think the problem is in your network or a lack of routers, given that restarting the coordinator helps.

It takes some more time to dig into the Zigbee world, but I’d still suggest you buy a sniffer device (like a CC2531 USB dongle for around $7). Even without any deep knowledge of IEEE 802.15.4 networking, you might be able to tell the difference in network activity and get more clues about where to look.