WTH is ZHA/Zigbee so unstable and hard to troubleshoot?

Aside from the 3 plugs that I have unplugged because I rarely use them, all of the other devices that are normally powered off are end point bulbs. I did have a few bulbs from a different brand that were also routers which I kept powered because of that, but I came to the conclusion that they were causing some of the problems (turning them off sometimes resolved issues with end points not responding) and replaced them with end point bulbs. The routers I do have in the network rarely change, like I add or remove one less than once a month, so that wouldn’t explain the issue which happens almost on a daily basis.

Basically my suspicion has been, if it has anything to do with routers, is that a router powered on and in the network is failing to be a router. The problem then is how to easily and reliably find out if that’s indeed what is happening and if so, which one it is without having to pull plugs. That’s also why I don’t just leave all of the plugs connected as I was doing about a year ago, because I see it as more points of potential failure that would be even harder to troubleshoot.

On the other hand, I’ve also had issues with some end point devices not responding when the visualization indicated that it was connected directly to the coordinator. Now that I think of it, I find it strange that devices within a few feet of the coordinator (in the same closet or near the closet) often use a router off in a different room with worse signal quality than the ones going direct.

I originally had the Nortek HUSBZB-1 and replaced it with the SkyConnect in hopes of resolving the issue, making sure to use the USB extension.

If it makes any difference, I’m running HA Container (Docker) 2024.11.3 on a headless Intel NUC 5 with Debian 12.

I know that some brands don’t really mix, although it seems to be specific to which controller is being used. For example, on the Homey forum there’s a lot of people that have issues with the combination Aqara/IKEA, but I have plenty of devices from both brands and they work fine (Aqara devices do tend to stick to the router they initially paired with, and typically don’t jump to another router if that router is better or if the original router is gone from the network).

But like I said already, if you have 100+ devices you probably need at least 20 of those to be routers. Most router devices can only handle a limited number of children (which includes other routers), just like the coordinator (I think the SL chip in the SkyConnect can handle 32 direct children at most).

FWIW, I’m running both HA and z2m in a Docker container.

That brings me back to my original question. Why is it so hard to determine if it’s caused by a router limitation or something else?

Almost all of the potential causes and solutions I hear are basically guesses with no known way to verify using all the available technical data. I don’t want to spend a bunch of money on devices for the sole purpose of having more routers if I’m not sure that will solve the problem.

And again, not having enough routers doesn’t explain why end devices connected directly to the coordinator (currently just 2) are also sometimes unresponsive. I’m only at 9 direct children.

I would also expect restarting ZHA to make things worse by flooding the network with requests as it reconnects to all of the devices, but the opposite happens. Almost every time, a few seconds after restarting, everything responds lightning fast.

It also doesn’t explain why my parents have the same issue with only 11 devices (6 end bulbs, 2 router bulbs always on but not routing anything, door and leak sensors, 1 router plug specifically to extend range). That’s much easier to troubleshoot since the two devices having the most trouble (door sensor controlling a light) are the only two routing through the plug. The question there is how to get it to stay stable, which I suspect would be the same solution that my own network needs. My dad just gave up on it and went back to using a regular bulb and switch.

Because fixing these sorts of issues is mostly done by interpreting symptoms.

In my experience, bad Zigbee networks, or misbehaving devices, are typically caused by (in order of most likely first):

  1. not enough routers
  2. devices being on the edge of the network, so they have a tendency to “fall off” (closely related to, but not the same as, the “not enough routers” issue)
  3. devices that get stuck to a specific router
  4. devices of different brands that don’t work well together
  5. interference (google “zigbee wifi overlap” to find out which Zigbee channels overlap WiFi channels, and try to set them so they don’t overlap)
  6. people not understanding Zigbee and routinely powering down router devices
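For point 5, the overlap can be worked out rather than looked up. This is a rough sketch only: the centre-frequency formulas are the standard 2.4 GHz ones, but the channel widths (22 MHz for WiFi, about 2 MHz for Zigbee) are simplifying assumptions, and the function names are made up for illustration.

```python
# Rough overlap check between 2.4 GHz Zigbee and WiFi channels.
# Widths are assumptions: real WiFi may use 20 or 40 MHz channels.
ZIGBEE_WIDTH_MHZ = 2.0


def zigbee_center_mhz(channel: int) -> int:
    """Centre frequency of Zigbee (IEEE 802.15.4) channels 11-26."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4 GHz channels are 11-26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Centre frequency of 2.4 GHz WiFi channels 1-13."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers WiFi channels 1-13")
    return 2407 + 5 * channel


def overlaps(zigbee_channel: int, wifi_channel: int, wifi_width_mhz: float = 22.0) -> bool:
    """True if the two channels' assumed bandwidths intersect."""
    gap = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return gap < (wifi_width_mhz + ZIGBEE_WIDTH_MHZ) / 2


def clear_zigbee_channels(wifi_channels: list[int], wifi_width_mhz: float = 22.0) -> list[int]:
    """Zigbee channels that avoid every listed WiFi channel."""
    return [
        z for z in range(11, 27)
        if not any(overlaps(z, w, wifi_width_mhz) for w in wifi_channels)
    ]


print(clear_zigbee_channels([1, 6, 11]))
```

With WiFi on channels 1, 6 and 11 at the assumed 22 MHz width, this leaves Zigbee channels 15, 20, 25 and 26. Passing a wider `wifi_width_mhz` shrinks that list, and a neighbour's WiFi counts just as much as your own.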

Nothing in life is certain 🤷 A router device doesn’t cost the world, and it will strengthen your Zigbee network.

See points 2 and 3.

Does ZHA provide any debug logging that might provide insight into what’s going on?


In this case, all except point 5 wouldn’t apply for devices not responding even when connected directly to the coordinator and physically very close to it. For point 5, in theory, even WiFi interference shouldn’t be much of a problem given that those signals would be very strong. And again, I took the steps to prevent interference by researching the overlaps and doing both WiFi and Zigbee scans before selecting the Zigbee channel when I replaced the coordinator and recreated the network.

Nothing in life is certain 🤷 A router device doesn’t cost the world, and it will strengthen your Zigbee network.

As I explained, there was no noticeable benefit to using 8 routers instead of the 5 I’m using now and it actually seemed less stable with more routers. Most of the devices that I experience the issue with are within 30 feet of the coordinator with routers between. I see no evidence that it’s a range issue since I’ve had little to no problem pairing and using devices about 60 feet away at the opposite end of the house, connected directly to the coordinator before they switched to a router. I have a router plug all the way over there with an automation that turns it on during off-peak electric rates and it never has issues (I checked history to be sure). I don’t fully understand what the visualization is showing with all the hard-to-follow lines and no legend or tooltips describing what anything means, but it appears to be actively connected to the coordinator as well as to other routers that are also actively connected to the coordinator.

Again, I fail to understand how this could be a router issue if everything responds fast immediately after starting the integration and over time the network becomes slow or unresponsive and does not recover until ZHA is restarted. Did you see that exact same behavior and adding routers solved it?

See points 2 and 3.

They definitely weren’t at the edge. As I said, they were within feet of the coordinator. They also weren’t connected to any router unless the topology was outdated or lying to me.

Does ZHA provide any debug logging that might provide insight into what’s going on?

The general logs don’t say anything useful other than that commands are timing out, basically the same thing I get in the toast messages when using the app. I’m pretty sure I enabled ZHA debug logging before and it didn’t help, but I enabled it to get you an answer and waited a few days for it to happen again and it never did (typical, it behaves when you debug it). Unfortunately, today HA started acting up and I found that the log file ballooned over 8GB, completely filling the partition. So I’ll have to leave debugging turned off and enable it after the issue shows up.

As much as I’m describing my problem to either support my question or in reply to comments, I’m not really asking for help troubleshooting my problem. I’m kind of tired of asking and getting the same responses about not having enough routers, which clearly aren’t the root cause. People seem to believe with confidence that the solution to their problems would apply to everyone else having vaguely similar connectivity problems.

What I am asking is why can’t ZHA make it be easier to troubleshoot with better detection and reporting of common issues, if not automatically fix it, before the problem is discovered when trying to use a device? Ideally easy enough that even my dad would have an idea what to do to permanently solve the issue without having technical knowledge of how Zigbee works or where to go to find answers. If that can be accomplished, it would help adoption expand beyond the technically inclined like myself. Relying 100% on the user to interpret the symptoms is putting too much trust in the user, many don’t even RTFM if one is even provided. I’m a software developer and know my way around these things fairly well and even I’m struggling with it. The main thing working against me here is the time and motivation to get into the weeds. I’d rather this “just work” so I can spend more time on things I enjoy spending time on, but if that’s too much to ask, I would like it to at least tell me a connected device isn’t working right before I have to find out the hard way.

The ability to automate restarting an integration would also help me work around it. I’m surprised that’s not an option while restarting HA is.
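For what it’s worth, here is one possible workaround sketch. As far as I know, Home Assistant has a `homeassistant.reload_config_entry` action that can be called through the REST API, but treat the exact action name and payload as something to verify on your version. The base URL, token and `entry_id` below are placeholders, not real values.

```python
import json
import urllib.request


def build_reload_request(base_url: str, token: str, entry_id: str) -> urllib.request.Request:
    """Build (but do not send) a REST call that reloads one config entry.

    Assumes the `homeassistant.reload_config_entry` action and a long-lived
    access token; entry_id is the config entry ID of the ZHA integration.
    """
    url = f"{base_url.rstrip('/')}/api/services/homeassistant/reload_config_entry"
    body = json.dumps({"entry_id": entry_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def reload_integration(base_url: str, token: str, entry_id: str, timeout: float = 30.0) -> int:
    """Send the request and return the HTTP status code."""
    request = build_reload_request(base_url, token, entry_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder values for illustration only.
    req = build_reload_request("http://homeassistant.local:8123", "YOUR_TOKEN", "YOUR_ENTRY_ID")
    print(req.get_method(), req.full_url)
```

The same action should also be callable from an automation, which would allow a scheduled reload or one triggered by a run of failed commands.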


I agree with what you posted. I also find it quite hard to pinpoint why my 64-device network with 20+ routers sometimes reacts very slowly and sometimes doesn’t, or how to tell devices not to try routing through a router that broke last month and was replaced.
Somebody made a solution for the last problem in your list though, a really nice blueprint whose functionality is only limited by the accuracy of the battery readings that devices report: 🪫 Low Battery Notifications & Actions. No need to add devices manually.


Keep in mind that the way Zigbee works, it is a combination of the Zigbee stack firmware running on your Zigbee Coordinator (a.k.a. controller) and the Zigbee stack firmware running on your Zigbee Routers (a.k.a. repeaters) that does almost all of the Zigbee network meshing automatically, all on their own. The Zigbee Gateway software more or less just sends simple commands to the Zigbee Coordinator, asking it to do things like enable joining/pairing mode, send a command like on/off, or apply device configuration changes, so there is not a lot of micromanagement that the Zigbee Gateway software can or should do.

Yes, if you have not physically optimized your devices then troubleshooting Zigbee is hard regardless of which Zigbee Gateway you are using. FYI, that is why I wrote this community guide with recommended best practices and tips on how to proactively avoid or work around known issues. I strongly suggest that everyone, with problems or not, try to follow all the advice there before troubleshooting, as that will at least make it easier to find the root cause later:

I do not, however, know how or if it is possible to automate any of that in ZHA or Home Assistant, as most of it consists of practical things in your environment that you need to act on, changing or adjusting them to optimize the conditions and give your Zigbee network a chance to work its own meshing magic. In essence, you more or less always need to both add many more Zigbee Router devices and move your Zigbee Coordinator away from anything electronic by using a long USB extension cord.

If a device isn’t routing, how would ZHA know? The packet either takes another route to the coordinator or is dropped, either way, it’s not something that the coordinator knows about. Really all it knows is that a device hasn’t checked in for a defined period of time in which case it marks it as unavailable.
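To make that concrete, the availability check amounts to roughly the following. This is a sketch of the idea, not ZHA’s actual code, and the two timeout values are assumed placeholders.

```python
from datetime import datetime, timedelta

# Assumed thresholds for illustration; check your own ZHA options.
MAINS_TIMEOUT = timedelta(hours=2)
BATTERY_TIMEOUT = timedelta(hours=6)


def is_available(last_seen: datetime, now: datetime, mains_powered: bool) -> bool:
    """A device counts as available if it was heard from recently enough."""
    timeout = MAINS_TIMEOUT if mains_powered else BATTERY_TIMEOUT
    return now - last_seen <= timeout


now = datetime(2024, 12, 30, 17, 0, 0)
print(is_available(now - timedelta(hours=3), now, mains_powered=True))   # False
print(is_available(now - timedelta(hours=3), now, mains_powered=False))  # True
```

Nothing in that check says which path a message took or which router dropped it, which is why a failing router is invisible from this side.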

Also, not something ZHA can do much about. It doesn’t know that a specific device is overloading the mesh; that shows up in the form of dropped packets, which is similar to the above. It might be able to report if it’s receiving a large number of packets from a specific device, but what is a large number? That would depend on the device type and the network. Also, Zigbee devices can send to other endpoints, so it’s possible to flood the network and the coordinator may never see the traffic.

Not as useful as you would think. If you have a robust mesh, it will show that the Zigbee channel is full when in reality it is just a bunch of Zigbee devices communicating. ZHA used to run an energy scan on startup, but it was commonly misinterpreted.

This can easily be done with an automation today. I wouldn’t bother, though; in my experience a device will drop off due to a dead battery while still reporting 70%.

I will note, if you have 100 devices and only 5 routers, that’s way too few. I have a similar sized network and as a rough count, 35 routers. Also, some devices are just plain bad, there are routers that will drop packets from other manufacturers, fail to route randomly, etc. There isn’t a silver bullet here, you have to go through the network and identify what’s causing the problem. It very well could be a combination of things, but there isn’t going to be an easy button added to ZHA to do it for you, it’s just not possible.


SkyConnect has a lot of diagnostic entries, many more than the Sonoff dongle (this is one reason I replaced mine). You can check those to see if anything specific shows up when problems occur and how the restart changes it.

I’m personally more on your side and don’t think the problem is in your network or a lack of routers, given that restarting the coordinator helps.

It takes some more time to dig into the Zigbee world, but I’d still suggest you buy a sniffer device (like a CC2531 USB dongle for around $7). Even without any deep knowledge of IEEE 802.15.4 networking, you might be able to tell the difference in network activity and get more of a clue where to look.

Thanks for the blueprint recommendation! I added it and it’s working great. I was ignoring several battery devices that died or disconnected until I set up the notifications. In fact, most of them had good battery but had to be re-paired, which I’ve seen a few times before, even on my parents’ super simple network. Odd and annoying.

Why would devices try changing to a router that is offline? That’s exactly what you don’t want a self-healing mesh to do. If that happens, it could explain some of what I’m seeing since I haven’t unpaired the routers that I normally leave unplugged.
Since you replaced it, I assume you unpaired it, in which case how do you know that devices are still trying to route through it?

That doesn’t apply to the problem that I find most frustrating, which is requests (e.g. turn on light) being sent to the device and it doesn’t respond. ZHA clearly knows about that because HA usually gives a toast message and logs an error when it happens (most often ZIGBEE_DELIVERY_FAILED), but it gives no useful information to help me troubleshoot and prevent it from happening again.

If a particular router is being overloaded or not working right, I’d think there would be some way to detect which router it is, like pinging devices, seeing if or how fast they respond, and looking for a common route that is not performing. It doesn’t necessarily need to be automated detection. I’d be happy if I could just click a button to ping all of the online devices and see the latency results in the topology view. Though I can’t emphasize enough how badly that view needs to be redesigned for it to be quick and easy to read. A tree instead of a web would be a huge improvement.
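Roughly what I have in mind, as a sketch. It assumes you could somehow get a ping result and the router path for each device, which as far as I can tell ZHA does not expose today, so the input structure and device names here are entirely hypothetical.

```python
from collections import defaultdict
from typing import Optional

# Hypothetical input: device name -> (routers on its path, latency in ms or None on timeout).
PingResults = dict[str, tuple[list[str], Optional[float]]]


def rank_suspect_routers(results: PingResults, slow_ms: float = 1000.0) -> list[tuple[str, float, int]]:
    """Rank routers by the share of slow or failed pings that passed through them.

    Returns (router, bad_ratio, devices_routed) tuples, worst first.
    """
    totals: dict[str, int] = defaultdict(int)
    bad: dict[str, int] = defaultdict(int)
    for path, latency_ms in results.values():
        failed = latency_ms is None or latency_ms > slow_ms
        for router in path:
            totals[router] += 1
            if failed:
                bad[router] += 1
    ranked = [(router, bad[router] / totals[router], totals[router]) for router in totals]
    ranked.sort(key=lambda item: (-item[1], -item[2], item[0]))
    return ranked


example: PingResults = {
    "kitchen_bulb": (["plug_a"], 45.0),
    "hall_bulb": (["plug_b"], None),
    "door_sensor": (["plug_b"], 2300.0),
    "desk_lamp": ([], 30.0),  # direct child of the coordinator
    "porch_bulb": (["plug_a", "plug_b"], None),
}
print(rank_suspect_routers(example))
```

In this made-up example `plug_b` comes out on top because every device routed through it failed or was slow, while `plug_a` only shares one bad path with it. Even a crude ranking like that would tell me which plug to pull first.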

I keep hearing that, but as I keep saying, my experience suggests this is not an issue with too few routers. You didn’t mention that you fixed any problems by adding routers, so you’ve done nothing to convince me.

Can you explain to me how restarting ZHA temporarily fixes a routing/congestion problem every single time and why the few end devices that are directly connected to the coordinator are equally affected?
What about my parents’ network where the two most unreliable end devices are the only two on a router, with good signal?

I can understand some devices being just plain bad. I think the most logical explanation would be if a device is somehow glitching out and saturating the entire network until the coordinator disconnects. Is there a list of known bad devices somewhere? Replacing all of the bulbs in my house with Sengled bulbs was a very expensive investment, so I hope they aren’t one of the baddies.

Interestingly, my network seems to have gotten a lot more stable since my OP. I think I could count on one hand the number of times I’ve had to restart ZHA in the last month. The only thing I really did was replace a couple batteries and re-pair a few end devices that totally stopped connecting. You’d think that would make it worse.

Like this?

2024-12-30 17:08:10.190 DEBUG (MainThread) [zigpy.device] Previously delayed device request is now running, delayed by 101.55s
2024-12-30 17:08:10.191 DEBUG (MainThread) [zigpy.application] Max concurrency (8) reached, delaying request (17 enqueued)
...
2024-12-30 17:08:10.728 DEBUG (MainThread) [bellows.ezsp.protocol] Received command messageSentHandler: {'type': <EmberOutgoingMessageType.OUTGOING_DIRECT: 0>, 'indexOrDestination': 63171, 'apsFrame': EmberApsFrame(profileId=260, clusterId=6, sourceEndpoint=1, destinationEndpoint=1, options=<EmberApsOption.APS_OPTION_RETRY|APS_OPTION_ENABLE_ROUTE_DISCOVERY: 320>, groupId=0, sequence=114), 'messageTag': 219, 'status': <EmberStatus.DELIVERY_FAILED: 102>, 'messageContents': b''}
2024-12-30 17:08:10.728 DEBUG (MainThread) [bellows.zigbee.application] Received messageSentHandler frame with [<EmberOutgoingMessageType.OUTGOING_DIRECT: 0>, 63171, EmberApsFrame(profileId=260, clusterId=6, sourceEndpoint=1, destinationEndpoint=1, options=<EmberApsOption.APS_OPTION_RETRY|APS_OPTION_ENABLE_ROUTE_DISCOVERY: 320>, groupId=0, sequence=114), 219, <EmberStatus.DELIVERY_FAILED: 102>, b'']
2024-12-30 17:08:10.731 DEBUG (MainThread) [zigpy.application] Previously delayed request is now running, delayed by 0.54s
2024-12-30 17:08:10.732 DEBUG (MainThread) [bellows.ezsp.protocol] Sending command  sendUnicast: () {'type': <EmberOutgoingMessageType.OUTGOING_DIRECT: 0>, 'indexOrDestination': 0x700B, 'apsFrame': EmberApsFrame(profileId=0, clusterId=32774, sourceEndpoint=0, destinationEndpoint=0, options=<EmberApsOption.APS_OPTION_RETRY|APS_OPTION_ENABLE_ROUTE_DISCOVERY: 320>, groupId=0, sequence=73), 'messageTag': 223, 'messageContents': b'I\x00\x00\x00\x01\x01'}
2024-12-30 17:08:10.738 DEBUG (bellows.thread_0) [bellows.ash] Sending frame DataFrame(frm_num=1, re_tx=False, ack_num=2, ezsp_frame=b'h\x00\x014\x00\x00\x0bp\x00\x00\x06\x80\x00\x00@\x01\x00\x00I\xdf\x06I\x00\x00\x00\x01\x01') + FLAG
2024-12-30 17:08:10.739 DEBUG (bellows.thread_0) [bellows.ash] Sending data  122a21a9602a15b929944a232a5592099d4e27e232c82e8bfdc66288b69a7e
2024-12-30 17:08:10.757 DEBUG (bellows.thread_0) [bellows.ash] Received data 222aa1a9602a15c403437e
2024-12-30 17:08:10.757 DEBUG (bellows.thread_0) [bellows.ash] Received frame DataFrame(frm_num=2, re_tx=0, ack_num=2, ezsp_frame=b'h\x80\x014\x00\x00v')
2024-12-30 17:08:10.758 DEBUG (bellows.thread_0) [bellows.ash] Sending frame AckFrame(res=0, ncp_ready=0, ack_num=3) + FLAG
2024-12-30 17:08:10.758 DEBUG (bellows.thread_0) [bellows.ash] Sending data  83401b7e
2024-12-30 17:08:10.733 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [139677583763072] Unexpected exception
Traceback (most recent call last):
   [traceback]
zigpy.exceptions.DeliveryError: Failed to deliver message: <sl_Status.ZIGBEE_DELIVERY_FAILED: 3074>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
   [traceback]
zha.exceptions.ZHAException: Failed to send request: Failed to deliver message: <sl_Status.ZIGBEE_DELIVERY_FAILED: 3074>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
   [traceback]
homeassistant.exceptions.HomeAssistantError: Failed to send request: Failed to deliver message: <sl_Status.ZIGBEE_DELIVERY_FAILED: 3074>

I don’t see anything very telling in there beyond the max concurrency reached.
As an Android dev, it looks much like an IOException for a timeout and there’s no point in printing those stacktraces, hence why I removed them since they were just a waste of space.
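Since the raw debug log is too big to read by hand, a small script can at least pull out the few numbers that look relevant. This is a sketch that assumes the log lines keep the same wording as the excerpt above; the regular expressions and the `summarize` helper are my own and would need adjusting if the format changes.

```python
import re
from collections import Counter
from typing import Iterable

DELAY_RE = re.compile(r"delayed by (\d+(?:\.\d+)?)s")
QUEUE_RE = re.compile(r"Max concurrency \((\d+)\) reached, delaying request \((\d+) enqueued\)")
# Only the bellows.ezsp.protocol line is matched so each failure is counted once.
FAILED_RE = re.compile(
    r"Received command messageSentHandler: .*'indexOrDestination': (0x[0-9A-Fa-f]+|\d+).*DELIVERY_FAILED"
)


def parse_nwk(text: str) -> int:
    """Parse a network address printed either as decimal or as 0x-prefixed hex."""
    return int(text, 16) if text.lower().startswith("0x") else int(text)


def summarize(lines: Iterable[str]) -> dict:
    """Collect the worst queue delay, deepest queue, and failures per destination."""
    delays: list[float] = []
    max_enqueued = 0
    failures: Counter = Counter()
    for line in lines:
        if (m := DELAY_RE.search(line)):
            delays.append(float(m.group(1)))
        if (m := QUEUE_RE.search(line)):
            max_enqueued = max(max_enqueued, int(m.group(2)))
        if (m := FAILED_RE.search(line)):
            failures[f"0x{parse_nwk(m.group(1)):04X}"] += 1
    return {
        "max_delay_s": max(delays, default=0.0),
        "max_enqueued": max_enqueued,
        "delivery_failures": dict(failures),
    }


if __name__ == "__main__":
    import sys
    print(summarize(sys.stdin))
```

Run over the excerpt above, it would report a worst delay of 101.55 s, 17 requests enqueued, and one failed delivery to 0xF6C3 (63171 in decimal). A request waiting over 100 seconds in the queue suggests the backlog itself may be part of the problem, though that is a guess from one excerpt.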

I don’t mean they try changing to it. But if the mesh visualization is an accurate representation (which people more knowledgeable than me have said it’s not), devices are still attempting to route through a device that has been unplugged for a while, which makes me think the self-healing part is ineffective. But maybe they just check every now and then if it came back?

Best would probably be if ZHA could have a blacklist of specific devices with known bad firmware and warn the user in the UI and logs that there can be a problem with that specific device due to bad firmware (I know that Z-Wave JS has a similar feature).

You see, the self-healing part is not ineffective in general, but it’s designed so that it is left up to each device to self-heal. The problem with that is that a few specific Zigbee devices/products are infamous for badly written firmware or firmware bugs. Some cannot change route to a different Zigbee Router (like older Aqara and Xiaomi devices, which have a bug that means they need to be manually factory reset and re-paired if you want them to change route). Even worse, some Zigbee Router devices with bad firmware sometimes act as a black hole and do not pass along all messages (like some older Tuya devices). And a few devices with bad firmware spam/flood the Zigbee network with so many messages that they prevent other devices from communicating properly.

For those types of devices with bad firmware there is nothing that the Zigbee Gateway can do to solve or work around the problem; the solution is for the user to re-pair devices that will not reconnect and remove devices that act as black holes or spam/flood the network.

Thanks for the background info @Hedda. Is there a list somewhere of known poorly performing devices?

Unfortunately I do not think so; that is something that each community has had to do on its own.

I read that Home Assistant’s founder had aspirations to put together a database of all IoT devices compatible with Home Assistant core components, so maybe that could contain such info if it ever arrives.

Until then it is up to the ZHA and Zigbee2MQTT, etc. communities to make their own lists.

The most comprehensive user-contributed list I know of is the one maintained by Blakadder: “Database of Zigbee devices compatible with ZHA, Tasmota, Zigbee2MQTT, deCONZ, ZiGate and ioBroker”. That would be the most logical place for me to add such information. Am I correct in assuming that these bad devices would be bad on any of the Zigbee integrations (ZHA, Z2M, etc)? @Blacky would this be something you’d be willing to support?

Yes that is correct, they have the same problem regardless of which Zigbee Gateway you use.

Note that since the issue is a bug in the device firmware, the problem is sometimes only present in a specific firmware version.

Most devices that are infamous for having these types of bugs have never gotten official firmware upgrades, (and many do not even support OTA updates at all), but some might have a new revision come with newer firmware from the factory without getting a new model number, so it can be a gamble buying old stock from a store or buying used hardware if it is one of those models that is known to have buggy firmware.

FYI, I posted a feature request to the zigpy/zha developers for the idea of showing warnings with comments inside the ZHA UI for specific Zigbee devices with known issues, so you guys could add to the list I started there of the infamously known problem products that I have read about and/or used myself:

No, I mean these sensors. I’m not sure if it helps you, though

Are you sure those bulbs are not Zigbee routers?

From what I know, the official spec states that all mains-powered (110/230V) Zigbee devices should be routers.

I am only using Philips Hue lights and all types are routers. There’s only 1-2 models with built in battery that are end points only. All their buttons and sensors are end points as well.

I was running a combination of Z2M and ZHA for over a year, as power plugs with energy metering caused the whole Z2M network to slow down, so I had only my power plugs connected to ZHA.

Recently moved back from Z2M+ZHA to just ZHA, since I switched wall buttons from FriendsOfHue (ZGP) which is not supported by ZHA to the Philips Hue Tap Dial Switch which is just an endpoint device with battery.

That switch was added in ZHA 2024.10.0 release. Before it was added it constantly crashed ZHA, which is why I tried it in the past and moved back to the ZGP buttons with Z2M and all my lights in there.

The thing about specifications is that manufacturers would actually need to follow them, and most don’t bother. There are many mains-powered devices that don’t route.

Here’s an example:
image

That said, I’m actually of the opinion that bulbs shouldn’t route, since they can be easily powered off.