My mesh became a bit unstable recently, after a very long period of good stability. My Sonoff valves often get stuck: they send updates but do not accept commands.
I noticed a few things:
Someone had unplugged one of my TI routers from the network, as well as a Tretakt smart plug, which was also showing as a missing router. I am sure this forced a lot of devices to rebuild their routes. I have put them both back.
I had recently taken the number of devices to 51. I dropped it to 49, as I have seen comments in the past about 50 being a point of instability.
I am running a Dongle-E with EZSP 6.10.3 build 297 as the co-ordinator. I think this is not the latest firmware, but see the point above about stability. I don’t know whether that firmware has a 50 device limit or a 200 device limit, but going to 51 devices might be what really introduced the instability. I have taken two devices out, so now it's back to 49.
Short of throwing the kitchen sink at it (firmware upgrades and so on, which could take hours), I am wondering whether anyone knows how firmware versions relate to EZSP device limits. Many places say I should be able to get to 200 devices. I have new devices in a drawer that I want to add to the network. I also have another dongle and could maybe create two networks if 50 is a problematic threshold.
Given that I have a spare dongle, what is the most seamless way to get to an upgraded co-ordinator? Can I flash the spare with new firmware and then swap them with minimal downtime? …
Thanks, it's all back and stable, and under observation. I used the log file to detect Zigbee send timeouts and reset the offending devices until the timeouts were all gone.
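For anyone wanting to script the same log check, here is a minimal sketch. It assumes the log lines that report a failed send contain the word "timeout" and print the device's IEEE address as eight colon-separated hex bytes. Both are assumptions about the log format, and the sample lines are made up for illustration, so adjust the keyword and pattern to whatever your own log actually shows.

```python
import re
from collections import Counter

# Assumption: IEEE (EUI-64) addresses appear as eight colon-separated hex bytes.
IEEE_RE = re.compile(r"\b(?:[0-9a-fA-F]{2}:){7}[0-9a-fA-F]{2}\b")


def count_timeouts(lines, keyword="timeout"):
    """Count lines mentioning `keyword`, grouped by each IEEE address on the line."""
    counts = Counter()
    needle = keyword.lower()
    for line in lines:
        if needle not in line.lower():
            continue
        # Use a set so an address repeated on one line is counted once.
        for ieee in {m.lower() for m in IEEE_RE.findall(line)}:
            counts[ieee] += 1
    return counts


# Hypothetical log lines, not copied from a real log.
sample = [
    "ERROR failed to send request: timeout for 00:12:4b:00:aa:bb:cc:dd",
    "INFO attribute update from 00:12:4b:00:aa:bb:cc:dd",
    "WARNING delivery Timeout a4:c1:38:00:11:22:33:44",
    "ERROR timeout 00:12:4B:00:AA:BB:CC:DD",
]

for ieee, n in count_timeouts(sample).most_common():
    print(ieee, n)
```

To run it against a real file, pass the open file object instead of `sample`, for example `count_timeouts(open("home-assistant.log", encoding="utf-8"))`. The devices at the top of the list are the ones to reset first.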
I wonder if the co-ordinator somehow got itself tied in knots and was dropping routes back to devices. My observations suggest that in all cases the devices could transmit packets but not receive them.
I think I will prepare an upgraded co-ordinator and follow the radio migration procedure. I will give it a week before I tentatively go back above 50 devices. I expected the mesh to be better at self-healing than what I have seen happen!
Update: I haven’t yet updated the co-ordinator, but I did unplug it, restart it and then restart HA. After going through this process I still had some Sonoff battery TRV devices that were completely asymmetric: they send updates but don’t receive any packets from the mesh. I tried re-pairing them, which partly worked but did not complete. It only completed after I removed their batteries and re-installed them. It makes me think the resiliency issue is really with the firmware on these devices and not the rest of the mesh. It is as if they are in some kind of crashed state until they have been reset and re-paired.
I am going to wait for another occurrence. I am wondering what manual debug options there are. Is there a Zigbee equivalent of “trace route”, so that I can figure out whether I have a router that is misbehaving? It seems, though, that it is the devices themselves that need to be more resilient.
Edit: the network view often shows these devices as green but not connected to anything, which could be related.
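A rough sketch of how the "not connected to anything" check could be done away from the map, assuming you have written down the device names and the links drawn between them by hand. The names and the data layout here are hypothetical; this does not read anything from ZHA itself.

```python
def unlinked_devices(devices, links):
    """Return the devices that do not appear in any link, sorted by name."""
    linked = set()
    for a, b in links:
        linked.add(a)
        linked.add(b)
    return sorted(set(devices) - linked)


# Hypothetical example data, copied by hand from a network map.
devices = ["coordinator", "plug_hall", "trv_kitchen", "trv_bedroom"]
links = [
    ("coordinator", "plug_hall"),
    ("plug_hall", "trv_bedroom"),
]

print(unlinked_devices(devices, links))
```

A device that keeps turning up in this list while still sending updates would be a candidate for the stuck state described above.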
I still haven't updated the co-ordinator, BUT I have become very suspicious of a router on the network.
The router is a Dongle-P (Texas Instruments) flashed with router firmware. I think it is going into some kind of asymmetric state. I removed it from the network and put a Tretakt smart plug in the same location. Things are stable again, but we will see.
I wonder if I could have found it quicker with any of the ZHA debug features.
That was it. One router doing bad things. Now removed.
Not only has the network been stable, but I have also seen a lot of SELF HEALING of any issues, e.g. with radiator valves, which I have not seen before.
So now I need to remove the other one of these routers, an ITEAD Dongle-P running the Z-Stack router firmware. This needs to go.
By the way, the frequency band is 95% congested and there are lots of red connections on the ZHA network map, but it's rock solid and my error logs look better than they ever have. The problem is not interference or signal strength; these devices are built to re-send and retry. The problem was a buggy router. There is something not right about the firmware on that device; maybe it cannot correctly handle the operating environment. Everything else seems OK, including the IKEA Tretakt plugs and Candeo dimmer modules (these are brilliant: no neutral, standard module size, and they are also routers!).