TL;DR: my entire previously working/reliable Zigbee network went offline after restarting HA. It hasn’t come back after 36hr, despite multiple restarts, reboots, and power cycling devices. I’m wondering if anyone more experienced with Zigbee/ZHA has any pointers on understanding/resolving the problem before I have to consider manually rebuilding my ~80 device Zigbee network.
Questions:
- Would `TXStatus.NWK_ROUTE_DISCOVERY_FAILED` errors be a likely explanation for HA not being able to talk to my Zigbee devices?
- What would be likely causes for getting lots of `TXStatus.NWK_ROUTE_DISCOVERY_FAILED` errors?
- What does the “leave/join” option for the ConBee II stick in the deCONZ app actually do? My understanding is that the coordinator (ConBee II) initially distributes a network key to all the devices as they join the Zigbee mesh and that the mesh can survive the coordinator going offline (e.g. when you reboot your HA box). When the coordinator “leaves”, what does that do? If you attempt to rejoin, does that change anything with the network key, or should everything return to as it was if everything is working properly?
Hardware/Software Setup Details:
- HA version: v0.115.5
- PC Host: Ubuntu 18.04
- Zigbee Controller: ConBee II
- HA Zigbee Integration: ZHA
- HA Device Config: using the `/dev/serial/by-id` path, and I’ve confirmed that the symbolic link is still present and pointing to `/dev/ttyACM0`; I do not have a `/dev/ttyACM1` device in the system
- Zigbee Routers: ~40, all Sylvania (LEDVANCE) A19, BR30, and light strip products with current firmware
- Zigbee Endpoints: ~40 with a mix of Philips motion sensors, Samsung multipurpose sensors, a few other sensors, and 2 door locks
- Zigbee Network Status: before this incident, my mesh was well-connected with LQI values of 255 reported for all parent-child links in the `zha-map` neighbors files. Links between routers had varying LQI values, but there was at least one `LQI=255` link between each router and another router. Obviously not the case now.
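For anyone wanting to repeat the serial-path check from the device config item above, here is a minimal sketch. It only assumes the standard Linux `/dev/serial/by-id` directory; the helper name `resolve_serial_links` is made up for illustration and is not part of HA or ZHA.

```python
import glob
import os


def resolve_serial_links(by_id_dir="/dev/serial/by-id"):
    """Map each by-id symlink to (real device path, whether that path exists)."""
    links = {}
    for link in sorted(glob.glob(os.path.join(by_id_dir, "*"))):
        target = os.path.realpath(link)  # follow the symlink to e.g. /dev/ttyACM0
        links[link] = (target, os.path.exists(target))
    return links


if __name__ == "__main__":
    for link, (target, present) in resolve_serial_links().items():
        state = "present" if present else "MISSING"
        print(f"{link} -> {target} ({state})")
```

If the stick's entry resolves to a device node other than the one expected, or is reported as missing, the problem is probably on the USB/serial side rather than in the mesh.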
Symptoms/Timeline:
Two nights ago, I restarted Home Assistant because one of my BR30 bulbs was unresponsive. That has happened occasionally before, and either waiting a few hours or restarting HA has resolved it in the past. When HA restarted, ALL of my Zigbee devices were missing. My `zha-network-card` showed all devices with `Online=false` and `LQI=N/A`. After waiting 15min, it was still the same. In the Logs section of HA, I saw one error: `[zigpy_deconz.zigbee.application] Unexpected transmit confirm for request id [XX], Status: TXStatus.NWK_ROUTE_DISCOVERY_FAILED`. I’ve seen a similar error a few times in the past, but it never seemed to cause any visible issues.
I enabled the debug logging as recommended in the ZHA docs, restarted HA, and saw more error messages similar to the above one:
Error while sending 10 req id frame: TXStatus.NWK_ROUTE_DISCOVERY_FAILED
According to this post, that error code means “An attempt to discover a route has failed due to a reason other than a lack of routing capacity”. I’m not entirely sure what that means in practice, but one thing that sounds plausible is that something is sending lots of Zigbee traffic and starving out other traffic, similar to a DoS attack. I don’t know how to confirm/investigate that with my current toolset, though. I don’t think anything has changed in my RF environment or my WiFi traffic to cause this, but I don’t have a spectrum analyzer, either.
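To get a rough sense of how dominant this error is compared with other transmit statuses, the debug log can be tallied. This is a minimal sketch that assumes statuses appear in the log as `TXStatus.<NAME>`, as in the two messages quoted above. The function name `count_tx_statuses` is made up for illustration, and the log path has to be supplied by you.

```python
import re
import sys
from collections import Counter

# Matches status names such as "TXStatus.NWK_ROUTE_DISCOVERY_FAILED".
TX_STATUS = re.compile(r"TXStatus\.([A-Z0-9_]+)")


def count_tx_statuses(lines):
    """Tally how often each TXStatus value appears in the given log lines."""
    counts = Counter()
    for line in lines:
        counts.update(TX_STATUS.findall(line))
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python tally_tx_status.py /path/to/home-assistant.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for status, n in count_tx_statuses(log).most_common():
            print(f"{n:6d}  {status}")
```

A tally like this only shows which statuses are most frequent. It cannot tell whether the cause is congestion, a channel mismatch, or something else.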
After waiting overnight, it seems the Zigbee network is not totally dead, as a few of my motion sensors appear to have connected and provided HA with some occasional data. One was working reasonably reliably last night, but the 2 others that showed up on the `zha-network-card` weren’t reliably checking in. After restarting HA this morning, only 2 are listed on the `zha-network-card` and neither is transmitting data correctly.
I haven’t tried resetting/re-pairing any devices yet, but I can do that if necessary.
Currently, when I open/close some of my windows that have battery-powered sensors, the sensors still blink the green light that I think means the data was successfully transmitted to their parent. I believe that they show an orange/red light when they can’t talk to their parent or are in pairing mode. That makes me think that the network may still be intact/active, but my ConBee II just can’t talk to it.
Things I’ve tried:
- Restart HA (multiple times): no change
- Gracefully shut down Ubuntu host that runs HA and then restart: no change
- Move ConBee II USB stick to PC and start up deCONZ: the stick is detected and appears to be configured on ch15. No devices shown connected to the ConBee II. No change when the ConBee II is reattached to the HA host (and the host is rebooted).
- Power down, wait for 10 sec, and then power up each of my mains-powered Zigbee devices (routers), one or two at a time: no change
- Leave HA running and all devices plugged in overnight to see if anything changes: a few endpoint devices show intermittent connections in `zha-network-card`
- Use RF scan feature of UniFi wireless access point to scan for 2.4GHz interference: nothing higher than ~-90dBm across all 11 WiFi channels
Things remaining to try:
- Shut down HA Ubuntu host, turn off main breaker to house, power on HA host (it’s on a UPS), and then turn main breaker on - force mesh to come back all at once with HA freshly rebooted. Is that likely to do anything different than rebooting/restarting devices individually like I’ve already done?
- Move WAP channel to ch11 (since the ConBee II seems to be on Zigbee ch15), but things were working fine for months with the current channels, so I’m not expecting a miracle there…
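As a sanity check on the WAP channel idea, the nominal center frequencies can be compared directly. The sketch below uses the standard 2.4GHz channel plans (Zigbee channels 11-26 at 2405 + 5·(k − 11) MHz, WiFi channels 1-13 at 2407 + 5·n MHz). The 20MHz and 2MHz widths are assumed nominal values, and the function names are made up for illustration. Real transmitters leak energy outside these nominal widths, so treat “clear” as approximate.

```python
def zigbee_center_mhz(channel):
    """Center frequency of a 2.4 GHz Zigbee (IEEE 802.15.4) channel, 11-26."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee 2.4 GHz channels run from 11 to 26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel):
    """Center frequency of a 2.4 GHz WiFi channel, 1-13."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch covers WiFi channels 1 to 13")
    return 2407 + 5 * channel


def overlaps(zigbee_channel, wifi_channel, wifi_width_mhz=20.0, zigbee_width_mhz=2.0):
    """True if the nominal channel bands overlap (assumed widths, no skirts)."""
    gap = abs(zigbee_center_mhz(zigbee_channel) - wifi_center_mhz(wifi_channel))
    return gap < (wifi_width_mhz + zigbee_width_mhz) / 2


if __name__ == "__main__":
    for z in (15, 25):
        for w in (1, 6, 11):
            verdict = "overlaps" if overlaps(z, w) else "clear"
            print(f"Zigbee ch{z} ({zigbee_center_mhz(z)} MHz) vs "
                  f"WiFi ch{w} ({wifi_center_mhz(w)} MHz): {verdict}")
```

Under these assumptions, Zigbee ch15 (2425MHz) falls in the gap between WiFi ch1 and ch6, and Zigbee ch25 (2475MHz) sits above WiFi ch11. By this estimate, neither channel overlaps a WiFi channel of 1, 6, or 11.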
One historical note that might be of significance: months ago when initially setting up my devices, I attempted to set up my Zigbee network on ch25. I successfully configured the ConBee II stick to ch25 by plugging it into my PC and setting the channel via the deCONZ app. I connected a few devices to it with deCONZ and things were good. I then moved the stick to my HA host, thinking the channel setting would be retained once I set up the ZHA integration. Since I didn’t do anything with the ZHA YAML config, my understanding is that ZHA probably switched the channel back to the default of 15 when I set up my Zigbee devices via ZHA in HA. On the off-chance that didn’t happen, could my network have been running on ch25 all this time, and then something happened when restarting HA that switched the controller back to 15 while the rest of the network is on ch25? If so, would testing that theory be as simple as using deCONZ to leave the network, switch the channel to 25, and then try to rejoin the network? Or does that process regenerate/reset network keys and such, which would render my ConBee II stick unable to rejoin my existing network on any channel?
Home Assistant and Zigbee in general are still relatively new for me (~5mo), and I’ve learned a lot from the community so far - thanks for that! Hopefully documenting this issue will help someone else out in the future, too. Debugging this is getting beyond my current expertise and my internet sleuthing has not yielded an obvious solution. Thus, I’d really appreciate any insight or suggestions to understand the problem, and better yet, find a solution.