Update on this topic: after being on the verge of ditching Zigbee altogether, I was finally able to get to a stable state - at least for now, one week and counting.
These are all the things I did, in descending order of how much difference I think each one made:
- I used a WiFi Analyzer to discover sources of interference and moved Zigbee routers to alternate locations to avoid them.
- I upgraded the firmware on the Sonoff USB Dongle
- I added Zigbee devices in their final location, using Settings > Devices and Services > Zigbee Dongle “Configure” > “+ Add Device”
- I managed to move the 2.4 GHz channel on my mesh WiFi routers
- I used a Zigbee sniffer to understand in more detail how packets were being routed and dropped
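In case it helps with the channel juggling above, here is a rough Python sketch for checking which Zigbee channels sit clear of a given set of 2.4 GHz WiFi channels. The centre frequencies follow the standard band plans (Zigbee channels 11 to 26, WiFi channels 1 to 13). The channel widths (22 MHz for WiFi, 2 MHz for Zigbee) are simplifying assumptions, and the function names are made up for illustration; none of this comes from HASS or ZHA. It only looks at nominal spectrum overlap, so a WiFi analyzer is still needed to see what is actually on the air.

```python
# Rough check of nominal spectrum overlap between WiFi and Zigbee channels.
# Widths below are assumptions; real-world interference also depends on
# signal strength, traffic and distance.
WIFI_WIDTH_MHZ = 22    # assumed width of a 2.4 GHz WiFi channel
ZIGBEE_WIDTH_MHZ = 2   # assumed width of a Zigbee channel


def zigbee_center_mhz(channel: int) -> int:
    """Centre frequency of a Zigbee channel (11-26) in MHz."""
    if not 11 <= channel <= 26:
        raise ValueError("Zigbee channel must be between 11 and 26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz WiFi channel (1-13) in MHz."""
    if not 1 <= channel <= 13:
        raise ValueError("WiFi channel must be between 1 and 13")
    return 2407 + 5 * channel


def overlaps(wifi_channel: int, zigbee_channel: int) -> bool:
    """True if the two channels share spectrum under the assumed widths."""
    gap = abs(wifi_center_mhz(wifi_channel) - zigbee_center_mhz(zigbee_channel))
    return gap < (WIFI_WIDTH_MHZ + ZIGBEE_WIDTH_MHZ) / 2


def clear_zigbee_channels(wifi_channels: list[int]) -> list[int]:
    """Zigbee channels that overlap none of the given WiFi channels."""
    return [
        z for z in range(11, 27)
        if not any(overlaps(w, z) for w in wifi_channels)
    ]


if __name__ == "__main__":
    # Example: WiFi access points on the common 1 / 6 / 11 layout.
    print(clear_zigbee_channels([1, 6, 11]))
```

With WiFi on channels 1, 6 and 11, this leaves Zigbee channels 15, 20, 25 and 26 as candidates under these assumptions.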
I initially did the following, which didn’t make any difference at all:
- Turned off HASS for more than 60 minutes to make all Zigbee routers reconfigure
- Turned off the Xiaomi gateway for a few days to avoid interference with it (it is now ON again and not causing any trouble)
These things not only didn’t help but also made HASS work worse than before, so I disabled them again after some fruitless tests:
- Installed and enabled ZHA-Toolkit to do things like pinging devices
- Enabled debug logging for several ZHA components
At this point, things are much more stable than before.
I’d be happy to share details on any of the things I tried and learned along the way if it helps anybody.
Thanks everyone for the help.