Matter stopped working overnight?!

I’ve been using the Matter integration with 10 Matter over Thread devices since the beginning of the year. It worked until last week without any issues.

As Thread Border Routers I use one Apple TV and 6 HomePod minis. As an OpenThread Border Router (OTBR) I use a ConBee II flashed with the Thread firmware. The Apple network is the preferred one, and the OTBR is part of the preferred network.

Last week more and more devices went offline, but only in Home Assistant. Within HomeKit all devices are still working.

It was also no longer possible to add a new device via the Matter integration, neither as a new device nor by sharing it from the Home app.

Via HomeKit (Home App) it’s possible to add new devices without any issues.

I decided to delete all devices as well as the Matter and Thread integrations and their dedicated add-ons, and set everything up from scratch again. But the issue is still the same: I can’t add new devices anymore.

I tested it with three different Home Assistant installations as well as with two different ConBee II sticks as OTBR.

  • HA OS 2024.8.1 as VM in Proxmox → Production
  • HA OS 2024.8.1 as VM in Proxmox → Test System
  • HA OS 2024.8.1 Odroid M1 → Test System

Home Assistant and all Thread Border Routers are in the same VLAN.

Here is the log from the OpenThread Border Router add-on:

-----------------------------------------------------------
 Add-on: OpenThread Border Router
 OpenThread Border Router add-on
-----------------------------------------------------------
 Add-on version: 2.9.1
 You are running the latest version of this add-on.
 System: Home Assistant OS 12.4  (amd64 / qemux86-64)
 Home Assistant Core: 2024.8.1
 Home Assistant Supervisor: 2024.08.0
-----------------------------------------------------------
 Please, share the above information when looking for help
 or support in, e.g., GitHub, forums or the Discord chat.
-----------------------------------------------------------
s6-rc: info: service banner successfully started
s6-rc: info: service universal-silabs-flasher: starting
[23:09:12] INFO: Flashing firmware is disabled
s6-rc: info: service universal-silabs-flasher successfully started
s6-rc: info: service otbr-agent: starting
[23:09:12] INFO: Setup OTBR firewall...
[23:09:12] INFO: Starting otbr-agent...
s6-rc: info: service otbr-agent successfully started
s6-rc: info: service otbr-agent-rest-discovery: starting
s6-rc: info: service otbr-agent-configure: starting
s6-rc: info: service otbr-web: starting
s6-rc: info: service otbr-web successfully started
[23:09:13] INFO: Starting otbr-web...
otbr-web[263]: [INFO]-WEB-----: Running 0.3.0-41474ce-dirty
otbr-web[263]: [INFO]-WEB-----: Border router web started on wpan0
[23:09:13] INFO: Enabling NAT64.
Done
Done
Done
s6-rc: info: service otbr-agent-configure successfully started
[23:09:13] INFO: Successfully sent discovery information to Home Assistant.
s6-rc: info: service otbr-agent-rest-discovery successfully started
s6-rc: info: service legacy-services: starting
s6-rc: info: service legacy-services successfully started
49d.21:38:26.999 [C] P-SpinelDrive-: Software reset co-processor successfully
00:00:00.038 [W] P-Netif-------: Failed to process request#2: No such process
00:00:00.038 [W] P-Netif-------: Failed to process request#6: No such process
00:00:00.329 [W] P-Netif-------: Failed to process request#7: No such process
00:00:00.604 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:00:12.237 [W] P-Netif-------: Successfully added an external route ::/0 in kernel
00:00:12.237 [W] P-Netif-------: Successfully added an external route fd11:93b3:21ea:ffff:0:0::/96 in kernel
00:00:14.706 [W] DuaManager----: Failed to perform next registration: NotFound
00:00:22.020 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:00:22.435 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:00:22.472 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:00:27.719 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:00:30.242 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:00:35.577 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:00:46.980 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:00:47.990 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:00:47.996 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:00:57.245 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:07.833 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:09.514 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:11.561 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:14.724 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:43.509 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:44.340 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:44.741 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:45.178 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:45.579 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:45.987 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:46.384 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:46.808 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:47.471 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:48.010 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:48.040 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:48.420 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:48.455 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:48.853 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:49.434 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:50.901 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:55.281 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:58.283 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:58.555 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:59.412 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:59.911 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:02:00.330 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:02:00.724 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:02:04.011 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:02:05.234 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:02:12.610 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:02:15.049 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
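For anyone who wants to quantify these warnings instead of scrolling through them, here is a minimal sketch that counts the `ChannelAccessFailure` lines per minute of otbr-agent uptime. In OpenThread this error generally means the radio could not get a clear channel for a transmission, so a high rate would point at a busy or interfered 2.4 GHz channel rather than at Matter itself. The function name and the sample text are illustrative only, and the regular expression assumes the exact line format shown in the log above.

```python
import re
from collections import Counter

# Matches otbr-agent lines such as:
# 00:00:22.020 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
# An optional "49d." day prefix is accepted as well.
LINE_RE = re.compile(
    r"^(?:(\d+)d\.)?(\d{2}):(\d{2}):(\d{2})\.\d{3} \[W\] P-RadioSpinel-: "
    r"Handle transmit done failed: ChannelAccessFailure\s*$"
)


def failures_per_minute(log_text: str) -> Counter:
    """Count ChannelAccessFailure warnings per minute of otbr-agent uptime."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            days, hours, minutes, _seconds = match.groups()
            uptime_minute = (int(days or 0) * 24 + int(hours)) * 60 + int(minutes)
            counts[uptime_minute] += 1
    return counts


# A few lines copied from the log above (hypothetical selection).
sample = """\
00:00:00.604 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:00:12.237 [W] P-Netif-------: Successfully added an external route ::/0 in kernel
00:00:22.020 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
00:01:07.833 [W] P-RadioSpinel-: Handle transmit done failed: ChannelAccessFailure
"""

for minute, count in sorted(failures_per_minute(sample).items()):
    print(f"uptime minute {minute}: {count} failure(s)")
```

Pasting the full add-on log into `sample` gives a rough failure rate that can be compared before and after changing the Thread channel or moving the stick.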

Here is the log from the Matter Server add-on:

Add-on: Matter Server
 Matter WebSocket Server for Home Assistant Matter support.
-----------------------------------------------------------
 Add-on version: 6.4.1
 You are running the latest version of this add-on.
 System: Home Assistant OS 12.4  (amd64 / qemux86-64)
 Home Assistant Core: 2024.8.1
 Home Assistant Supervisor: 2024.08.0
-----------------------------------------------------------
 Please, share the above information when looking for help
 or support in, e.g., GitHub, forums or the Discord chat.
-----------------------------------------------------------
s6-rc: info: service banner successfully started
s6-rc: info: service matter-server: starting
[23:09:27] INFO: Starting Matter Server...
s6-rc: info: service matter-server successfully started
s6-rc: info: service legacy-services: starting
s6-rc: info: service legacy-services successfully started
[23:09:28] INFO: Using 'enp0s18' as primary network interface.
[23:09:28] INFO: Successfully send discovery information to Home Assistant.
[1723410569.028440][126:126] CHIP:CTL: Setting attestation nonce to random value
[1723410569.028582][126:126] CHIP:CTL: Setting CSR nonce to random value
[1723410569.029093][126:126] CHIP:DL: ChipLinuxStorage::Init: Using KVS config file: /tmp/chip_kvs
[1723410569.029154][126:126] CHIP:DL: writing settings to file (/tmp/chip_kvs-y3h5RG)
[1723410569.029187][126:126] CHIP:DL: renamed tmp file to file (/tmp/chip_kvs)
[1723410569.029298][126:126] CHIP:DL: ChipLinuxStorage::Init: Using KVS config file: /data/chip_factory.ini
[1723410569.029349][126:126] CHIP:DL: ChipLinuxStorage::Init: Using KVS config file: /data/chip_config.ini
[1723410569.029365][126:126] CHIP:DL: ChipLinuxStorage::Init: Using KVS config file: /data/chip_counters.ini
[1723410569.029423][126:126] CHIP:DL: writing settings to file (/data/chip_counters.ini-Z8N3ZG)
[1723410569.029503][126:126] CHIP:DL: renamed tmp file to file (/data/chip_counters.ini)
[1723410569.029506][126:126] CHIP:DL: NVS set: chip-counters/reboot-count = 12 (0xC)
[1723410569.029666][126:126] CHIP:DL: Got Ethernet interface: enp0s18
[1723410569.029733][126:126] CHIP:DL: Found the primary Ethernet interface:enp0s18
[1723410569.029862][126:126] CHIP:DL: Failed to get WiFi interface
[1723410569.029866][126:126] CHIP:DL: Failed to reset WiFi statistic counts
2024-08-11 23:09:29.030 (MainThread) CHIP_PROGRESS [chip.native.TS] Last Known Good Time: 2023-10-14T01:16:48
2024-08-11 23:09:29.030 (MainThread) CHIP_PROGRESS [chip.native.FP] Fabric index 0x1 was retrieved from storage. Compressed FabricId 0x8FE8BB58D61145F7, FabricId 0x0000000000000002, NodeId 0x000000000001B669, VendorId 0x134B
2024-08-11 23:09:29.031 (MainThread) CHIP_PROGRESS [chip.native.ZCL] Using ZAP configuration...
2024-08-11 23:09:29.032 (MainThread) CHIP_PROGRESS [chip.native.IN] CASE Server enabling CASE session setups
2024-08-11 23:09:29.069 (Dummy-2) CHIP_PROGRESS [chip.native.CTL] Setting attestation nonce to random value
2024-08-11 23:09:29.070 (Dummy-2) CHIP_PROGRESS [chip.native.CTL] Setting CSR nonce to random value
2024-08-11 23:09:29.070 (Dummy-2) CHIP_PROGRESS [chip.native.SPT] Using device attestation PAA trust store path /data/credentials.
2024-08-11 23:09:29.094 (Dummy-2) CHIP_PROGRESS [chip.native.CTL] Generating NOC
2024-08-11 23:09:29.095 (Dummy-2) CHIP_PROGRESS [chip.native.FP] Validating NOC chain
2024-08-11 23:09:29.097 (Dummy-2) CHIP_PROGRESS [chip.native.FP] NOC chain validation successful
2024-08-11 23:09:29.097 (Dummy-2) CHIP_PROGRESS [chip.native.FP] Updated fabric at index: 0x1, Node ID: 0x000000000001B669
2024-08-11 23:09:29.097 (Dummy-2) CHIP_PROGRESS [chip.native.TS] Last Known Good Time: 2023-10-14T01:16:48
2024-08-11 23:09:29.097 (Dummy-2) CHIP_PROGRESS [chip.native.TS] New proposed Last Known Good Time: 2021-01-01T00:00:00
2024-08-11 23:09:29.097 (Dummy-2) CHIP_PROGRESS [chip.native.TS] Retaining current Last Known Good Time
2024-08-11 23:09:29.098 (Dummy-2) CHIP_PROGRESS [chip.native.FP] Metadata for Fabric 0x1 persisted to storage.
2024-08-11 23:09:29.098 (Dummy-2) CHIP_PROGRESS [chip.native.TS] Committing Last Known Good Time to storage: 2023-10-14T01:16:48
2024-08-11 23:09:29.098 (Dummy-2) CHIP_PROGRESS [chip.native.CTL] Joined the fabric at index 1. Fabric ID is 0x0000000000000002 (Compressed Fabric ID: 8FE8BB58D61145F7)
2024-08-11 23:09:29.098 (Dummy-2) CHIP_PROGRESS [chip.native.CTL] *** Missing DeviceAttestationVerifier configuration at DeviceCommissioner init: using global default, consider passing one in CommissionerInitParams.
2024-08-11 23:09:29.098 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] Updating services using commissioning mode 0
2024-08-11 23:09:29.099 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] CHIP minimal mDNS started advertising.
2024-08-11 23:09:29.099 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] Advertise operational node 8FE8BB58D61145F7-000000000001B669
2024-08-11 23:09:29.099 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] CHIP minimal mDNS configured as 'Operational device'; instance name: 8FE8BB58D61145F7-000000000001B669.
2024-08-11 23:09:29.101 (Dummy-2) CHIP_PROGRESS [chip.native.DIS] mDNS service published: _matter._tcp
2024-08-11 23:09:29.101 (Dummy-2) CHIP_PROGRESS [chip.native.SPT] Setting up group data for Fabric Index 1 with Compressed Fabric ID:
2024-08-11 23:09:58.601 (Dummy-2) CHIP_PROGRESS [chip.native.CTL] Setting attestation nonce to random value
2024-08-11 23:09:58.602 (Dummy-2) CHIP_PROGRESS [chip.native.CTL] Setting CSR nonce to random value
2024-08-11 23:09:58.603 (Dummy-2) CHIP_PROGRESS [chip.native.CTL] Starting commissioning discovery over DNS-SD
2024-08-11 23:10:28.604 (Dummy-2) CHIP_ERROR [chip.native.CTL] Discovery timed out
2024-08-11 23:10:28.604 (Dummy-2) CHIP_ERROR [chip.native.ZCL] Secure Pairing Failed
2024-08-11 23:10:28.604 (Dummy-2) WARNING [chip.ChipDeviceCtrl] Failed to establish secure session to device: src/controller/python/ChipDeviceController-ScriptDevicePairingDelegate.cpp:89: CHIP Error 0x00000003: Incorrect state
2024-08-11 23:10:28.605 (MainThread) ERROR [matter_server.server.client_handler] [139740526164176] Error while handling: commission_with_code: Commission with code failed for node 16.
2024-08-11 23:13:41.799 (Dummy-2) CHIP_PROGRESS [chip.native.CTL] Setting attestation nonce to random value
2024-08-11 23:13:41.800 (Dummy-2) CHIP_PROGRESS [chip.native.CTL] Setting CSR nonce to random value
2024-08-11 23:13:41.801 (Dummy-2) CHIP_PROGRESS [chip.native.CTL] Starting commissioning discovery over DNS-SD
2024-08-11 23:14:11.806 (Dummy-2) CHIP_ERROR [chip.native.CTL] Discovery timed out
2024-08-11 23:14:11.806 (Dummy-2) CHIP_ERROR [chip.native.ZCL] Secure Pairing Failed
2024-08-11 23:14:11.807 (Dummy-2) WARNING [chip.ChipDeviceCtrl] Failed to establish secure session to device: src/controller/python/ChipDeviceController-ScriptDevicePairingDelegate.cpp:89: CHIP Error 0x00000003: Incorrect state
2024-08-11 23:14:11.808 (MainThread) ERROR [matter_server.server.client_handler] [139740526164176] Error while handling: commission_with_code: Commission with code failed for node 17.
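One detail in the log above is that each commissioning attempt fails almost exactly 30 seconds after "Starting commissioning discovery over DNS-SD", which suggests the device is never discovered at all rather than failing later during pairing. A minimal sketch for pulling those attempts out of a pasted log is below; the function name is made up for illustration, and it assumes the timestamp format used by the Matter Server add-on log shown here.

```python
import re
from datetime import datetime

# Timestamp prefix as used in the log above, e.g. "2024-08-11 23:09:58.603 "
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def discovery_attempts(log_text: str):
    """Return (start, end, seconds) for each DNS-SD discovery that timed out."""
    attempts = []
    start = None
    for line in log_text.splitlines():
        match = TS_RE.match(line)
        if not match:
            continue
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if "Starting commissioning discovery over DNS-SD" in line:
            start = ts
        elif "Discovery timed out" in line and start is not None:
            attempts.append((start, ts, (ts - start).total_seconds()))
            start = None
    return attempts


# Lines copied from the log above.
sample = """\
2024-08-11 23:09:58.603 (Dummy-2) CHIP_PROGRESS [chip.native.CTL] Starting commissioning discovery over DNS-SD
2024-08-11 23:10:28.604 (Dummy-2) CHIP_ERROR [chip.native.CTL] Discovery timed out
2024-08-11 23:13:41.801 (Dummy-2) CHIP_PROGRESS [chip.native.CTL] Starting commissioning discovery over DNS-SD
2024-08-11 23:14:11.806 (Dummy-2) CHIP_ERROR [chip.native.CTL] Discovery timed out
"""

for start, end, seconds in discovery_attempts(sample):
    print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}: timed out after {seconds:.3f} s")
```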

Any advice on what is going wrong here?

Edit: Nobody else with Matter Server issues?!

Even after rebuilding everything from scratch, devices go offline after a few hours, after one day at the latest. And I always have to restart the Matter server to get all devices online again.

I’ve had the exact same behavior for a couple of months. Restarting the Matter server usually fixes it, but sometimes that is necessary multiple times a day. Really annoying.

Sadly I have no solution for this so far.

Since Matter Server v6.5.1 and iOS 18 on all of my Apple Thread Border Routers, it seems to have become stable again: 2 days without any downtime.

But only if I switch all of my OpenThread Border Routers (ConBee II) off. As soon as one OTBR is online, devices randomly go offline again.

I’m experiencing the same issue: the Apple Home Thread network runs fine, but when I add a SkyConnect (latest Thread-only firmware) to the Thread network via HA (latest version), my Thread devices go offline frequently. I am using mostly Matter devices, but my HomeKit device logs the Thread status, and interestingly it comes online exactly every 30 minutes, only to go offline again within 1-3 minutes.

What Thread Border Router setup are you using? Is that still happening with the Matter Server add-on 6.5.0? There was at least one bigger bugfix (specifically Retry subscription setup if necessary by agners · Pull Request #873 · home-assistant-libs/python-matter-server · GitHub): the bug it fixes caused Home Assistant not to renew subscriptions automatically after errors.

So this seems to be a common theme across several reports: if an OTBR is part of the Thread network, devices seem to become unreachable somehow.

We track this now with Matter over Thread Devices unavailable - Home Assistant Only · Issue #123835 · home-assistant/core · GitHub. So @gato if you can share your setup there, that would be helpful (specifically the information requested by Marcel in this comment).

Hm, 30 minutes is the default timeout for mDNS entries. So it could be related to mDNS somehow.

I currently suspect two possible causes here:

  • mDNS - This could be related to multicast not working correctly, e.g. UniFi has multicast filtering, and there have also been issues with Linux bridges (e.g. as used in Proxmox) and multicast.
  • TREL - Thread Border Routers have a feature called Thread Radio Encapsulation Link, which essentially forwards Thread packets to the nearest Thread border router (since only Thread border routers know how the mesh is set up, this can only be done on a Thread border router). What I suspect here is that the OTBR somehow cannot reach the Apple BRs via TREL.
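To get a rough idea of whether multicast DNS works at all between a machine and the border routers, something like the following sketch could be run from a host in the same VLAN. It sends a single mDNS PTR query from an ephemeral port and collects the addresses that reply. Please treat it as a sketch under assumptions: it only exercises IPv4 multicast (Matter and Thread also depend on IPv6), the `probe` and `build_ptr_query` names are made up here, and `_meshcop._udp.local` is used as the service name on the assumption that the border routers advertise it. A dedicated mDNS browser will give far more detail.

```python
import socket
import struct
import time

MDNS_GROUP = ("224.0.0.251", 5353)  # IPv4 mDNS multicast group and port


def build_ptr_query(service: str) -> bytes:
    """Build a minimal DNS PTR question for e.g. '_meshcop._udp.local'."""
    # Header: id=0, flags=0, one question, no answer/authority/additional records.
    header = struct.pack("!6H", 0, 0, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in service.strip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!2H", 12, 1)  # QTYPE=PTR, QCLASS=IN


def probe(service: str, timeout: float = 3.0) -> set:
    """Send one query and return the source addresses that answered in time."""
    responders = set()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 255)
        sock.sendto(build_ptr_query(service), MDNS_GROUP)
        deadline = time.monotonic() + timeout
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            sock.settimeout(remaining)
            try:
                _data, (addr, _port) = sock.recvfrom(9000)
            except socket.timeout:
                break
            responders.add(addr)
    return responders
```

Calling `probe("_meshcop._udp.local")` should list the border routers that answered; `probe("_matter._tcp.local")` does the same for operational Matter nodes such as the Matter server seen in the log earlier in this thread. An empty result on one host but not on another in the same VLAN would point towards multicast filtering somewhere in between (switch, access point, or the Proxmox bridge).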

I’ve created a test setup with 2x Apple HomePod (with the latest 18.0 firmware) and an OpenThread Border Router (using SkyConnect) here. So far all devices stay available, but I’ll monitor this for at least a couple of days to see if I can reproduce this on my end.

Stefan, thanks a lot. I’ve posted my observations and my setup in the GitHub issue you mentioned above.

Regarding mDNS, I’ve checked the OPNsense firewall logs, and nothing appears to be blocked. I’m running OPNsense, so if there’s anything specific I should look into, please let me know.

As for TREL, you might be onto something. My HomePod Mini and OTBR are on separate floors, connected through an open stairwell where a Nanoleaf Lightstrip (HomeKit-only) acts as a Thread router/leader. Could this device be having issues with TREL, possibly causing problems between the two border routers? The direct connection between the two border routers is very weak (the floor is concrete), so it’ll travel through the stairwell.

I updated all 4 of my HomePod minis plus 1 Apple TV and the Matter Server add-on over the weekend, and ever since then it has all been stable. I haven’t lost a single Matter device and haven’t had to restart the add-on since.

It’s quite amazing: there were so many add-on updates in the past months, each time I hoped it would be fixed, and now all of a sudden all my troubles are gone (though a new one appeared, unrelated to Matter).