Trying to set up a highly available Zigbee network

Godmag · January 17, 2025, 4:02pm

Hello,
title says most of it. This is my current setup:

2x Sonoff Dongle Plus with the same firmware, each connected to its own RPi
2x RPi with a virtual IP and a keepalived setup
Zigbee2MQTT on my Hassio instance on an external server accesses the dongle via the virtual IP over TCP, not through physical USB.

The setup with one dongle and one RPi worked well before.

Since there are important sensors running on my network, I wanted to have a fallback dongle, so that if one dongle goes down, the other one on the second RPi takes over. This is all organized with a virtual IP and the keepalived package, which basically monitors a ping to the other RPi.
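A ping only shows that the other RPi is up, not that its dongle is reachable over TCP. As a minimal sketch of a stricter check, the helper below tests whether the serial-over-TCP port accepts connections and maps the result to an exit code; keepalived's script tracking treats exit code 0 as healthy and non-zero as failed. The function names are hypothetical, and the host and port depend on whichever serial bridge you run (the thread does not say). Whether a short probe connection disturbs that bridge is something to verify first.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def health_exit_code(host: str, port: int, timeout: float = 2.0) -> int:
    """Map the port check to a process exit code: 0 = healthy, 1 = failed."""
    return 0 if tcp_port_open(host, port, timeout) else 1
```

A small wrapper script could end with `sys.exit(health_exit_code(host, port))`, using the bridge's actual address, and be referenced from a keepalived `vrrp_script` block.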

I flashed the second dongle with the identical firmware and gave it the IEEE address of my primary dongle as its secondary IEEE address, like this:

python .\cc2538-bsl.py -p COM17 -evw --bootloader-sonoff-usb --ieee-address 00:12:4b:00:1c:dc:ed:b4 .\CC1352P2_CC2652P_other_coordinator_20210708.hex

This is also documented in the Zigbee2MQTT docs: "Copying the ieee address of an adapter".

So far so good. The RPi config and virtual IP work flawlessly, and the backup dongle is detected successfully by Z2M when I power off my primary RPi. BUT I just can’t get it to work properly. The backup dongle won’t react to switches and buttons properly, if at all. Some lights have a long delay; other devices won’t react in any way.

These are the most common errors I get:

[2025-01-17 16:43:03] error: z2m: Publish ‘set’ ‘state’ to ‘Arbeitszimmer Luftreiniger’ failed: ‘Error: ZCL command 0x70ac08fffe9484cb/1 genOnOff.off({}, {timeout:10000,disableResponse:false,disableRecovery:false,disableDefaultResponse:false,direction:0,reservedBits:0,writeUndiv:false}) failed (Timeout - 8714 - 1 - 15 - 6 - 11 after 10000ms)’

[2025-01-17 16:43:03] error: z2m: Publish ‘set’ ‘state’ to ‘Schlafzimmer Luftreiniger’ failed: ‘Error: ZCL command 0xf4b3b1fffea92a71/1 genOnOff.on({}, {timeout:10000,disableResponse:false,disableRecovery:false,disableDefaultResponse:false,direction:0,reservedBits:0,writeUndiv:false}) failed (Data request failed with error: ‘No network route’ (205))’
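To see which devices fail and how after a failover, the error lines can be tallied per device address. This is a rough sketch that assumes log lines shaped like the two above; the function name and the three failure labels are my own, not anything Zigbee2MQTT provides.

```python
import re
from collections import Counter

# Captures the device IEEE address and the text inside the final "failed (...)".
FAILED_ZCL = re.compile(
    r"ZCL command (0x[0-9a-fA-F]+)/\d+ .*\) failed \((.*)\)"
)


def summarize_failures(lines):
    """Count failed ZCL commands per (ieee_address, failure_kind)."""
    counts = Counter()
    for line in lines:
        match = FAILED_ZCL.search(line)
        if not match:
            continue  # not a failed ZCL command line
        ieee, reason = match.group(1), match.group(2)
        if "No network route" in reason:
            kind = "no_route"
        elif reason.startswith("Timeout"):
            kind = "timeout"
        else:
            kind = "other"
        counts[(ieee, kind)] += 1
    return counts
```

Feeding it the lines of a Zigbee2MQTT log file (for example `summarize_failures(open(path, encoding="utf-8"))`) gives a quick list of the addresses worth power cycling first.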

The rest of the logs (activated debug logging) look normal. Motion sensors and power plugs with current sensors will show up normally.

I also tried this only on my primary RPI to rule out a config mistake with the virtual IP/keepalived setup. So I just unplugged the primary dongle and connected the backup dongle, started Z2M but no joy.

Is there something I’m not understanding here? I essentially did a dongle migration. Only difference is I’m trying to use both independently. But here is the really weird thing. I tried to redo the flashing of the dongles and sometimes they would just change their role. The backup dongle would work normally and the primary dongle would cause problems. This happened to me several times.

Hints and help much appreciated.
Thanks

Edit: The hint from Cyberbeni seems to do the trick, although I don’t understand why. Just power cycling one failing device, like a plug, brings the whole Zigbee network back to normal. I will test this further, but I’m hopeful that this is the solution.

jackjourneyman · January 17, 2025, 4:11pm

(post deleted by author)

Godmag · January 17, 2025, 4:13pm

OK, but how does a migration work then? Why does Z2M detect a freshly flashed dongle as the chosen one? Is there a way to replicate this process? In theory there should be, or am I missing something?

Also, if there is really only one coordinator paired, why is the second one detected and working (not well, but still)?

To avoid misunderstandings: I’m not trying to run both coordinators simultaneously. I’m running the primary dongle; when it is not available, the secondary will step in.

Cyberbeni · January 17, 2025, 4:20pm

Did you do this part?

If re-pairing was not required and your devices do not respond; restart some routers by removing them from the mains power for a few seconds.

jackjourneyman · January 17, 2025, 4:30pm

(post deleted by author)

Godmag · January 17, 2025, 4:31pm

Lol, I did this with one device, which helped. Interestingly, all devices work now with the secondary coordinator. BUT there is always this one light that causes problems.
Power cycling is not an option unfortunately. This would defeat the high availability goal.
This is so weird. I don’t understand this behaviour.

Godmag · January 17, 2025, 4:35pm

Was just replying to your deleted question about the backup. I also tried making a backup although that wouldn’t be very practical for my case. It doesn’t make a difference. As you can see in my previous reply it seems to be able to work. But the behaviour is full of random errors.

I have the feeling that this project won’t be successful, but I’d like to understand why.

Godmag · January 17, 2025, 5:07pm

OK, this is very interesting now. The tip about power cycling a device and the Zigbee network coming alive seems to be something I can replicate. It is enough to power cycle just one failing Zigbee plug, for example, and suddenly all other devices start running normally again. Very weird behaviour, but it doesn’t surprise me; Zigbee has always been very lively for me.

If this is really all it needs, then hooking up a Tasmota plug behind the Zigbee plug would be a solution. Very janky, but as long as it works I’m fine with it.
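A minimal sketch of that workaround, assuming the Tasmota plug feeds the Zigbee plug and is reachable over HTTP without a password. It uses Tasmota's web command endpoint (`/cm?cmnd=...`) with the `Power Off` and `Power On` commands; the function names, the host address and the five-second off time are placeholders, not values from this setup.

```python
import time
import urllib.parse
import urllib.request


def tasmota_command_url(host: str, command: str) -> str:
    """Build the URL for Tasmota's HTTP command endpoint."""
    return f"http://{host}/cm?cmnd={urllib.parse.quote(command)}"


def power_cycle(host: str, off_seconds: float = 5.0, timeout: float = 5.0) -> None:
    """Switch the Tasmota relay off, wait off_seconds, then switch it back on."""
    for command, pause in (("Power Off", off_seconds), ("Power On", 0.0)):
        url = tasmota_command_url(host, command)
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()  # Tasmota answers with a small JSON status
        time.sleep(pause)
```

Something like `power_cycle("192.0.2.10")` could then be triggered whenever keepalived switches to the backup RPi, so one router restarts right after the failover.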