WTH is ZHA/Zigbee so unstable and hard to troubleshoot?

I think I may have found a solution to the entire network going down, or at least something that should reduce the severity of the issue. Time will tell.

It was working mostly fine for a couple of months. Then, around the time I updated HA to 2025.2.0, it started happening again almost daily without any changes to the network. I took another stab at looking for reports about the issue and found this zigpy bellows issue mentioning MAX_MESSAGE_LIMIT_REACHED and how a restart resolved it for a day or two.
That issue was closed in favor of this PR for silabs-firmware-builder to “Increase broadcast and unicast table sizes”.

I had SkyConnect firmware 7.1.1.0 and that PR was merged for 7.4.2.0. I updated to 7.4.4.0 after some difficulty.
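
For anyone wanting to check whether their own firmware should already include that change, here is a minimal sketch of the comparison. The 7.4.2.0 threshold is simply the version mentioned above, and the function names are made up for illustration.

```python
def parse_version(text):
    """Turn a dotted version like '7.4.2.0' into a tuple of ints for numeric comparison."""
    return tuple(int(part) for part in text.split("."))


# Assumption: 7.4.2.0 is the first release with the larger tables, as stated in the post.
FIX_VERSION = parse_version("7.4.2.0")


def has_table_size_fix(installed):
    """Return True if the installed firmware version is at or above the fix version."""
    return parse_version(installed) >= FIX_VERSION


if __name__ == "__main__":
    for version in ("7.1.1.0", "7.4.4.0"):
        print(version, has_table_size_fix(version))
```

Comparing tuples of integers rather than raw strings avoids mistakes such as "7.10" sorting before "7.9".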

The official firmware update instructions were useless to me because they only mention an add-on, which isn’t supported by HA Container, and a web flasher, which only works for devices purchased after Oct 20, 2024 (I purchased mine in Oct 2023).

I found someone mentioning a third option: flashing from a shell using universal-silabs-flasher. After shutting down HA, I kept getting [Errno 13] Permission denied when trying to probe or flash the SkyConnect on the HA system, even after adding it to udev rules and reloading udev. I moved the SkyConnect to my main system and got [Errno 16] Device or resource busy instead. That was apparently caused by the gpsd service being overzealous, so I stopped it with sudo systemctl stop gpsd and was then able to probe and flash.
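
In case it saves someone else the guesswork, here is a rough Python sketch of a pre-flight check for the two errors above, plus a helper that builds the probe and flash command lines. It is only a sketch under some assumptions: the device path and firmware file name are placeholders, the hints only cover the causes I ran into, and the `--device`, `probe`, and `flash --firmware` arguments are the ones I understand universal-silabs-flasher to accept, so check its README before relying on them.

```python
import errno
import os

# Hints based on the causes described in this post; other causes are possible.
HINTS = {
    errno.EACCES: "Permission denied: check device ownership/group and udev rules for this user.",
    errno.EBUSY: "Device busy: another process (gpsd in my case) may be holding the port.",
    errno.ENOENT: "Device not found: check the path and that the stick is plugged in.",
}


def diagnose(err_code):
    """Map an errno value to a troubleshooting hint."""
    return HINTS.get(err_code, f"Unexpected error: {os.strerror(err_code)}")


def check_device(path):
    """Try to open the serial device read/write; return None on success, else a hint."""
    try:
        fd = os.open(path, os.O_RDWR | os.O_NONBLOCK | os.O_NOCTTY)
    except OSError as exc:
        return diagnose(exc.errno)
    os.close(fd)
    return None


def flasher_command(device, action, firmware=None):
    """Build (but do not run) a universal-silabs-flasher command as an argument list."""
    cmd = ["universal-silabs-flasher", "--device", device, action]
    if firmware is not None:
        cmd += ["--firmware", firmware]
    return cmd


if __name__ == "__main__":
    device = "/dev/ttyUSB0"  # placeholder: use your own device path
    problem = check_device(device)
    if problem:
        print(problem)
    else:
        print(" ".join(flasher_command(device, "probe")))
        # "firmware.gbl" is a placeholder file name
        print(" ".join(flasher_command(device, "flash", "firmware.gbl")))
```

The script only prints the commands rather than running them, so nothing touches the stick until you have looked them over.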

I expected one benefit of a Home Assistant branded hub would be that HA could help keep the firmware updated much like it does for other Zigbee devices. This experience proved that’s not the case at all for HA Container. That’s another one for my feature wishlist.