Matter over Thread devices go offline every 6 hours on the dot (or?)

I have a strange problem where every six hours, on the dot, at 05:27, 11:27, 17:27, and 23:27, a big chunk of my Matter Server nodes go offline and then slowly come back online after 30-60 minutes.

Well, at least things were that way for about two days, but then the 05:27 and 17:27 matter “storms” moved to 04:22 and 16:22 a day ago. So I have two separate agitators that are operating 12 hours apart.

I can’t think of anything I’ve changed or added that would affect this, but the timing has to be a big clue. Any ideas?

I have HAOS installed on a Beelink Mini PC. I have 101 matter over thread devices (72 Inovelli white switches, 2 Aqara smart locks, 20 Sunricher RGB controllers, and 7 Eve smart plugs). My thread network is managed by 2 Apple TVs and two HomePod minis. All of my HA and Apple software is up to date. I have a Unifi network, and I don’t see any network obstacles.

About 1/2 to 2/3 of the devices go offline and then things rebuild over about 30-60 minutes. So it’s not a complete shutdown of the Matter Server. Nothing is rebooting and no software is updating.

In the logs I see the Matter server losing contact with dozens of devices in a few minutes. Then slowly things come back together.

Any thoughts or clues or any ideas where to look? I will take any advice on this one!

Are you using any routing endpoint devices that act as repeater? If the repeater goes offline, obviously any connected device is lost.

That doesn’t immediately explain the precise timing but possibly a lead worth pursuing.

I am not. Thought of that too.

Are all offline devices of the same type, i.e. the same manufacturer, or do you experience a mixed picture?

I trust that you have already put the Matter server into debug mode and checked the logs?

Can you tell a little bit more about the Beelink PC? Does it run Windows or Linux, and is HA installed as HAOS, in a VM, or in Docker?

Hey, thanks for your help!

The offline devices are of all types. I’ve been recently saving the logs from complete Matter “storms” and I was going to see if AI could spot a pattern. But in my spot checking, the offline devices aren’t of one type.

I did try debug mode and fed the results into AI and looked at them myself and it just looks like devices losing connection and devices coming back online.
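For spotting a pattern without AI, the raw "device went offline" timestamps can be grouped into bursts and the gaps between bursts measured. The following is a minimal Python sketch of that idea. It assumes the timestamps have already been pulled out of the Matter Server log by some other means; the function name `group_into_storms`, the 10-minute gap threshold, and the sample timestamps are all hypothetical and not taken from real logs.

```python
from datetime import datetime, timedelta

def group_into_storms(offline_times, max_gap=timedelta(minutes=10)):
    """Group offline timestamps into bursts ("storms").

    A new storm starts whenever the gap to the previous event exceeds max_gap.
    """
    storms = []
    for t in sorted(offline_times):
        if storms and t - storms[-1][-1] <= max_gap:
            storms[-1].append(t)   # close enough: same storm
        else:
            storms.append([t])     # gap too large: new storm
    return storms

# Hypothetical timestamps, made up for illustration (not from real logs).
events = [
    datetime(2025, 1, 1, 5, 27),
    datetime(2025, 1, 1, 5, 29),
    datetime(2025, 1, 1, 5, 33),
    datetime(2025, 1, 1, 11, 27),
    datetime(2025, 1, 1, 11, 31),
]

storms = group_into_storms(events)
starts = [s[0] for s in storms]
intervals = [b - a for a, b in zip(starts, starts[1:])]

for s in storms:
    print(s[0].strftime("%H:%M"), len(s), "events")
print("intervals between storm starts:", intervals)
```

Comparing the intervals across several days would show whether the 6-hour spacing is exact or slowly shifting.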

The Beelink is running HAOS.

Something is happening every 6 hours, and I see online that I’m not the only one experiencing this issue with this regularity. I had to reboot the other night, and my Matter storms are occurring every six hours on the dot after that reboot time.

Hhm, magic happening on your machine :smiley:

I could think of two further analysis options, unless you have already tried them:

  1. Start HA in safe mode and let it run for 12-13 hours. Then check the various logs for anything of interest.
  2. Deactivate all add-ons except Matter and OTBR and check for disconnects

Mind posting a link?

Thanks for the suggestions!

  1. I can try this when I get back home. I’m actually away until tomorrow night.

  2. I don’t have an OTBR. I’m using only Apple TBRs. However, I did briefly install the ZBT-2 and when it made my network unreliable and crazy, I disconnected it within a day. It was after that that I rebooted my HAOS machine and the problems started. I have since deleted OTBR and the ZBT-2 device. But I do wonder if it’s related.

But I also did something else two weeks before the problems started: I installed the beta firmware for my Inovelli White switches. That involved putting a JSON file in the updates folder of my core matter server and running the updates. I attempted to delete those files and wonder if something went awry there.

Also, my only add-ons are: Advanced SSH & Web Terminal, File Editor, Get HACS, Matter Server, Node-Red (where all of my switch automations are) and Terminal & SSH. I can probably give this a try overnight one night.

Some links from others with the every 6 hour problem:

Thanks!

Thinking further, you could distinguish between internal (HAOS-related) or external factors:

Note the trigger time(s) and restart your system. If trigger times then change, the issue is somewhere in your system. You can repeat the experiment: the first “offline storm” should happen with the same delay after you restarted. If, however, the times are not affected, the culprit is something outside of your HA unit.
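The restart test above can be checked numerically. This is only a sketch under stated assumptions: the helper names `offset_from_schedule` and `follows`, the 5-minute tolerance, and every timestamp below are hypothetical. It tests whether storm start times line up with a 6-hour grid anchored at the restart, or with the grid of the old schedule.

```python
from datetime import datetime, timedelta

PERIOD = timedelta(hours=6)

def offset_from_schedule(anchor, storm_start, period=PERIOD):
    """Distance from storm_start to the nearest anchor + n * period."""
    remainder = (storm_start - anchor) % period
    return min(remainder, period - remainder)

def follows(anchor, storm_starts, tolerance=timedelta(minutes=5)):
    """True if every storm start lies within tolerance of the anchor's grid."""
    return all(offset_from_schedule(anchor, s) <= tolerance for s in storm_starts)

# Hypothetical values for illustration only.
old_anchor = datetime(2025, 1, 1, 5, 27)   # a storm start seen before the restart
restart = datetime(2025, 1, 2, 10, 0)      # when the host was restarted
storms_after = [datetime(2025, 1, 2, 16, 0), datetime(2025, 1, 2, 22, 1)]

print("tracks restart:", follows(restart, storms_after))
print("tracks old schedule:", follows(old_anchor, storms_after))
```

If the storms track the restart and not the old schedule, that points at the HA host; the opposite result points at something external.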

This would point to something in your neighborhood. Thread operates at 2.4 GHz and a strong signal source at this frequency could cause interference. I have no idea what kind of device would operate with such a time pattern - and why.

I understand that you did that weeks before the trouble started. I don’t think that this is related.

Nothing exceptional here. Maybe worth deactivating Node-Red to check if this has any effect.

Hey again!

Trigger times changed when I rebooted the system. It’s now six hours from my last restart. That tells me it’s something in HA.

True that I did the updates weeks before, but I hadn’t rebooted the system until right before the problem started. So maybe something I did with the updates folder didn’t kick in until the restart.

Good idea to stop node red. I think I’ll try that. Though I’m hearing that updating the Matter Server to matter.js has largely taken care of this problem. Don’t love beta software for stuff so important but this is so bad I think I’ll need to go for it when I return home.

Appreciate all the brainstorming you are doing with me!

I’ve updated it on my system. Works flawlessly here.

Have you considered reinstalling a clean HAOS and restoring your data from a backup?

Glad the beta works for you! I’m hoping the same!

I have not considered a full re-install from backup yet. I think I’ll try the beta first.

Also another clue about the 6-hour matter “storms”: they start out big and long-lasting but over time they get smaller and shorter. The one this morning only encompassed a handful of devices and was over in 5 minutes. Then 6-hours later there was no offline activity at all. Don’t know what that points to but it’s a clue.

Weird behavior. You would think that a digital setup, unless impacted by external factors, will always behave in the same manner. Not like the wipers on my Tesla :roll_eyes:

I am curious what the next 6h intervals will show.

After two days of progressively shrinking matter storms, most of which occurred every six hours from restart on the dot, the storms have stopped for about 24 hours now. I’m pretty sure I’ve experienced this before but didn’t think it through. I had several days of stability before after diminishing 6-hour storms, but thought it must’ve been something else I altered. This time I was watching closely and was able to figure out this behavior.

The question is: What happens with the Matter Server or HA in general that causes these regularly timed storms?

I’m home now and plan to move to the new Matter.js today. So maybe these storms will be a thing of the past. Let’s hope that I’m not starting the cycle again of two days or so of Matter chaos!

So how are things looking with the new server version?

The next time I rebooted after I said the matter storms went away, my every-6-hour storms came back.

I still have my every-6-hour issue, but with the new matter.js server, it’s MUCH more bearable. Instead of my matter issues taking about 30-60 minutes to resolve, it only takes around 10 minutes or so, and a lot fewer devices are involved. I’m less panicked, and it’s less disruptive, but still looking for a solution. The ONLY offline moments for any of my devices are during these every-6-hour events.

It looks like something changes the IPv6 link-local addresses of many of my devices every six hours. Different devices every time.
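One way to pin this down would be to snapshot each node's address before and after a storm and diff the two. A minimal Python sketch follows, assuming the node-to-address mappings have been collected by hand or by some other tool; the function name `changed_link_local`, the node labels, and the addresses are hypothetical.

```python
import ipaddress

def changed_link_local(before, after):
    """Return {node: (old, new)} for nodes whose link-local IPv6 address changed."""
    changes = {}
    for node, old in before.items():
        if node not in after:
            continue  # node missing from the second snapshot: nothing to compare
        old_ip = ipaddress.ip_address(old)
        new_ip = ipaddress.ip_address(after[node])
        if old_ip.is_link_local and new_ip.is_link_local and old_ip != new_ip:
            changes[node] = (str(old_ip), str(new_ip))
    return changes

# Hypothetical snapshots, made up for illustration.
before = {"node-12": "fe80::aaaa", "node-40": "fe80::1234"}
after = {"node-12": "fe80::bbbb", "node-40": "fe80::1234"}

changed = changed_link_local(before, after)
print(changed)
```

Keeping the result from each storm would show whether the same nodes are affected repeatedly or a different set each time.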

It’s probably something stupid. :slight_smile:

Good to read that the new version improves the situation. Did you install 8.2.2 or an older version?

Shall we take a look at your Thread mesh? Open the Matter Server web interface, top-right you have a new view Nodes|Thread|Wifi. Post the Thread picture please.

The interesting thing is to know if you have one or more Thread border routers in your mesh. If there is more than one, check if the failing devices are all bound to the same one.

Ok, attached is my Thread mesh. Not sure how helpful this image will be. I’m not sure mesh quality is the issue. Between my matter “storms” things are very stable. However, every six hours something disturbs the mesh somehow and the weakest devices farthest from the Apple TBRs pop offline and then come back. I’m actually a bit surprised that I have any distant areas because I have devices evenly spaced throughout the house.

I did an experiment today to “reset” the mesh. I shut down the HA host and then all of the Apple TBRs. Waited 30 minutes and turned on the TBRs. Waited 30 more and turned on the HA host. Things settled quickly enough, but then the matter storm came just when expected. So it didn’t change anything.

Also, for some reason, 20 minutes after that, I had another big storm that lasted about an hour before everything came back online. It has been stable since.

I’m about to get to my next every-6-hour storm. Will update you how it goes!

The mesh shows three routers. Two are really central, one is at the “periphery” only.

Could one of the Apple units be causing the trouble? If I were to test this, I would start with HA only and check if the storms still happen. If not, bring one of the Apples up and check again.

Four routers. Two of them (HomePods) are in an outbuilding and handle just a few devices. It’s far enough out to be a bit unstable out there without them.

Are you saying let a non-Apple device handle the Apple Thread network? I’m not sure how that would happen without breaking things.

My evening storm happened on time. I think I mentioned the start time drifts about a minute or so each time. The first offline device earlier was at 15:57. Tonight’s was at 21:58. I suspect tonight’s storm will start at 03:59 or so. The drift is a clue, but I’m not sure what it tells me.
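The drift arithmetic can be written out explicitly. This is a small Python sketch using the two start times quoted above; the calendar date is a placeholder, and it assumes the interval stays constant from one storm to the next.

```python
from datetime import datetime, timedelta

# Only the clock times come from the post; the date is a placeholder.
first = datetime(2025, 1, 1, 15, 57)
second = datetime(2025, 1, 1, 21, 58)

interval = second - first                 # observed spacing between storms
drift = interval - timedelta(hours=6)     # how much longer than exactly 6 h
next_start = second + interval            # prediction if the spacing holds

print(interval, drift, next_start.strftime("%H:%M"))
# prints: 6:01:00 0:01:00 03:59
```

So the observed period is 6 hours 1 minute rather than exactly 6 hours, which matches the 03:59 guess.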

No, I am saying that maybe one of the other border routers is triggering your storm.

Suppose one of the Apple routers stops providing Matter service. Your mesh will immediately start to reorganize. Now the Apple router comes back, and reorganization starts again.

That’s why I am suggesting to test if any of the other routers is responsible. If yes, you take a closer look at it (logs, update, etc)