HA disconnects from Mosquitto every X minutes

A couple of days ago, I noticed that all entities coming from Zigbee devices become unavailable every X minutes (anywhere from a few minutes to a few hours) and then get their values back after ~10 seconds. This happens to all such entities at the same time and has been going on for the past 3-4 days. I don’t think I changed or upgraded anything in the days before I noticed it.

I have HA (was 2024.10, upgraded to 2024.11.2 yesterday), Zigbee2MQTT and Mosquitto, all running in Docker containers. Digging into the z2m logs, I noticed this:

[2024-11-16 22:47:13] debug:     z2m:mqtt: Received MQTT message on 'homeassistant/status' with data 'offline'
[2024-11-16 22:47:29] debug:     z2m:mqtt: Received MQTT message on 'homeassistant/status' with data 'online'

Then, in the Mosquitto logs, I found this:

2024-11-17 14:28:38: Received PINGREQ from 5vZtSyPpf2TLqj2qyKZZUb
2024-11-17 14:28:38: Sending PINGRESP to 5vZtSyPpf2TLqj2qyKZZUb
2024-11-17 14:28:44: Received PINGREQ from mqttjs_13d8b3d3
2024-11-17 14:28:44: Sending PINGRESP to mqttjs_13d8b3d3
2024-11-17 14:29:44: Received PINGREQ from mqttjs_13d8b3d3
2024-11-17 14:29:44: Sending PINGRESP to mqttjs_13d8b3d3
2024-11-17 14:30:08: Client 5vZtSyPpf2TLqj2qyKZZUb has exceeded timeout, disconnecting.
2024-11-17 14:30:08: Sending PUBLISH to mqttjs_13d8b3d3 (d0, q0, r0, m0, 'homeassistant/status', ... (7 bytes))
2024-11-17 14:30:18: New connection from 172.17.0.1:42324 on port 1883.
2024-11-17 14:30:18: New client connected from 172.17.0.1:42324 as 5vZtSyPpf2TLqj2qyKZZUb (p2, c1, k60).
2024-11-17 14:30:18: Will message specified (7 bytes) (r0, q0).
2024-11-17 14:30:18:     homeassistant/status

As I understand it, every client sends PINGREQ to Mosquitto to keep its connection alive, and that is what happens most of the time. I can see two clients, 5vZtSyPpf2TLqj2qyKZZUb and mqttjs_13d8b3d3, of which I assume the first is HA and the second is z2m.

We can see that at some point the HA client stops sending the ping packet, and once the keep-alive timeout has passed (I have the default 60 seconds in both HA and Mosquitto), Mosquitto disconnects the client. 10 seconds after that, the client opens the connection again.
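To put a number on that, here is a minimal Python sketch that measures how long a client was silent before Mosquitto dropped it. It assumes the log format shown above; the function name `silence_before_timeout` and the regular expressions are my own, not part of any Mosquitto tooling. It only looks at PINGREQ lines, so any other traffic from the client is ignored.

```python
import re
from datetime import datetime

# Patterns for the Mosquitto log lines quoted in this thread (assumed format).
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): (.*)$")
PING = re.compile(r"^Received PINGREQ from (\S+)$")
TIMEOUT = re.compile(r"^Client (\S+) has exceeded timeout, disconnecting\.$")


def silence_before_timeout(log_text):
    """Return {client_id: seconds between its last PINGREQ and its timeout}."""
    last_ping = {}
    gaps = {}
    for raw in log_text.splitlines():
        m = LINE.match(raw.strip())
        if not m:
            continue  # skip continuation lines such as the will topic
        stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        message = m.group(2)
        ping = PING.match(message)
        if ping:
            last_ping[ping.group(1)] = stamp
            continue
        timeout = TIMEOUT.match(message)
        if timeout and timeout.group(1) in last_ping:
            client = timeout.group(1)
            gaps[client] = (stamp - last_ping[client]).total_seconds()
    return gaps


sample = """\
2024-11-17 14:28:38: Received PINGREQ from 5vZtSyPpf2TLqj2qyKZZUb
2024-11-17 14:28:44: Received PINGREQ from mqttjs_13d8b3d3
2024-11-17 14:29:44: Received PINGREQ from mqttjs_13d8b3d3
2024-11-17 14:30:08: Client 5vZtSyPpf2TLqj2qyKZZUb has exceeded timeout, disconnecting.
"""
print(silence_before_timeout(sample))  # {'5vZtSyPpf2TLqj2qyKZZUb': 90.0}
```

The 90 seconds is one and a half times the 60-second keep-alive, which matches the grace period the MQTT specification gives a broker before it must drop a silent client. So the broker side seems to be behaving as designed, and the question is why HA skipped the ping it should have sent at around 14:29:38.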

Does anyone have any idea where to look next and how to find out why this is happening?

I looked at the CPU and memory usage in HA, and there’s a small spike in the CPU usage whenever this happens, but even then it does not go above ~10%.

I’m running the system on a Raspberry Pi 4, with an SSD.

I also found this similar scenario, but in my case I could not relate the moments when Mosquitto saves its database to the moments the entities become unavailable.

Maybe try increasing the timeout value or lowering the keep-alive value (if that is possible).
TCP/IP networks do not guarantee packet delivery, so you might miss one or two pings that simply get lost.
There can also be delays on the sending and receiving hosts, so packets might take a lot longer to make the round trip.
These two things often occur at the same time, which tends to make the outcome more extreme.

I tried changing the interval at which Home Assistant sends the ping to 30 seconds, then 15 (the lowest allowed value), but neither helped.

Here’s a recent log:

2024-11-18 19:27:41: Received PINGREQ from 75l9qb6qzn9p2wDbWJINkD
2024-11-18 19:27:41: Sending PINGRESP to 75l9qb6qzn9p2wDbWJINkD
2024-11-18 19:27:44: Received PINGREQ from mqttjs_13d8b3d3
2024-11-18 19:27:44: Sending PINGRESP to mqttjs_13d8b3d3
2024-11-18 19:27:56: Received PINGREQ from 75l9qb6qzn9p2wDbWJINkD
2024-11-18 19:27:56: Sending PINGRESP to 75l9qb6qzn9p2wDbWJINkD
2024-11-18 19:28:23: Client 75l9qb6qzn9p2wDbWJINkD has exceeded timeout, disconnecting.
2024-11-18 19:28:23: Sending PUBLISH to mqttjs_13d8b3d3 (d0, q0, r0, m0, 'homeassistant/status', ... (7 bytes))
2024-11-18 19:28:33: New connection from 172.17.0.1:35774 on port 1883.
2024-11-18 19:28:33: New client connected from 172.17.0.1:35774 as 75l9qb6qzn9p2wDbWJINkD (p2, c1, k15).
2024-11-18 19:28:33: Will message specified (7 bytes) (r0, q0).
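For anyone reading these logs, here is a small Python sketch that pulls the connection parameters out of the "New client connected" line. Reading `p` as the protocol version, `c` as the clean-session flag and `k` as the keep-alive in seconds is my interpretation of the Mosquitto output, and `parse_connect` and `broker_deadline` are hypothetical names. The deadline is the 1.5x keep-alive grace period from the MQTT specification.

```python
import re

# Assumed layout of Mosquitto's connect line: (p<protocol>, c<clean>, k<keepalive>)
CONNECT = re.compile(
    r"New client connected from (?P<addr>\S+) as (?P<client>\S+) "
    r"\(p(?P<proto>\d+), c(?P<clean>\d+), k(?P<keepalive>\d+)\)"
)


def parse_connect(line):
    """Return the connection parameters from a connect line, or None."""
    m = CONNECT.search(line)
    if m is None:
        return None
    keepalive = int(m.group("keepalive"))
    return {
        "client": m.group("client"),
        "address": m.group("addr"),
        "protocol": int(m.group("proto")),
        "clean": int(m.group("clean")),
        "keepalive": keepalive,
        # Seconds of silence after which the broker may drop the client.
        "broker_deadline": keepalive * 1.5,
    }


line = (
    "2024-11-18 19:28:33: New client connected from 172.17.0.1:35774 "
    "as 75l9qb6qzn9p2wDbWJINkD (p2, c1, k15)."
)
info = parse_connect(line)
print(info["keepalive"], info["broker_deadline"])  # 15 22.5
```

The `k15` confirms that the 15-second setting was picked up by HA. In the log above, the last PINGREQ is at 19:27:56 and the disconnect at 19:28:23, a gap of 27 seconds. That is past the 22.5-second deadline, presumably because the broker only checks for expired clients at intervals.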

Maybe a ping does nothing to keep the connection alive.

I updated the value to 120 seconds last night, but it still happens, although much less often (every 2-3 hours).