My ZHA integration occasionally starts showing all of my devices as unavailable. The interval between occurrences is unpredictable and can be anywhere from a couple of hours to several weeks. When it happens, all of my devices are lost at the same time. The only cure I have found so far is to completely power off the server and power it back on (a simple reboot is not sufficient). After that, all of the devices are recognized again.
As a software guy, I know the best advice is to disable everything else and reproduce the problem. That’s not really feasible in this case without losing the use of my HA server. What’s the best way for me to troubleshoot this?
I’m running current HAOS on generic x86-64 hardware. I’m using ZHA and a SkyConnect dongle. I have the Silicon Labs Multiprotocol add-on installed (I once tried to uninstall it to help troubleshoot, but that produced an error that I can’t recall offhand).
I turned on ZHA debug logging recently, and I was “lucky” enough to have the problem happen after less than a day. Alas, the log file is over 500 MB. Even after weeding out a bunch of stuff that I know is not relevant, the log is still over 100 MB.
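In case it helps frame the question, this is roughly how I could cut the log down further: a minimal Python sketch that keeps only lines from the last few minutes before the devices dropped, and only those that are warnings/errors or come from Zigbee-related loggers. It assumes the usual Home Assistant log line layout (timestamp, level, thread name in parentheses, logger name in brackets). The function name `filter_log`, the 10-minute window, and the logger prefixes (`zigpy`, `bellows`, `homeassistant.components.zha`) are my own guesses at what might matter, not something I know to be the right filter.

```python
import re
import sys
from datetime import datetime, timedelta

# Assumed line layout (an assumption about my log, adjust if yours differs):
# "2024-05-01 12:34:56.789 DEBUG (MainThread) [logger.name] message"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ (\w+) \([^)]*\) \[([^\]]+)\]"
)


def filter_log(lines, failure_time, window=timedelta(minutes=10),
               prefixes=("zigpy", "bellows", "homeassistant.components.zha")):
    """Yield lines inside [failure_time - window, failure_time] that are
    WARNING or worse, or whose logger name starts with one of `prefixes`."""
    start = failure_time - window
    keep = False
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            # No timestamp: treat as a continuation (e.g. a traceback) and
            # keep it only if the line it belongs to was kept.
            if keep:
                yield line
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        level, logger = match.group(2), match.group(3)
        in_window = start <= stamp <= failure_time
        relevant = (level in ("WARNING", "ERROR", "CRITICAL")
                    or logger.startswith(prefixes))
        keep = in_window and relevant
        if keep:
            yield line


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python filter_log.py home-assistant.log "2024-05-01 12:00:00"
    failure = datetime.strptime(sys.argv[2], "%Y-%m-%d %H:%M:%S")
    # Stream the file line by line so a 500 MB log never has to fit in memory.
    with open(sys.argv[1], errors="replace") as src:
        sys.stdout.writelines(filter_log(src, failure))
```

The second argument would be the approximate time the devices first showed as unavailable. If the interesting part turns out to be earlier than that, widening `window` is the obvious knob to turn.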
So, what I am looking for is hints about what to look for in that log file. Suggestions?