Losing connection to Mosquitto (?) (kinda) every night (kinda) at the same hour

My HA works fine and has been running rock solid on a Raspberry Pi 3 for about 4 years now.

HA runs as a Docker container, plus this script (GitHub - tribut/homeassistant-docker-venv: Run Home Assistant as non-root using the official docker image) so it runs as a non-root user.
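(For context, a rough sketch of how the HA container itself is launched; paths and names are examples, not my exact command:)

# official image, host networking, config bind-mounted from the Pi
docker run -d --name homeassistant --restart unless-stopped --network host \
  -v /home/pi/homeassistant:/config -e TZ=Europe/Rome \
  ghcr.io/home-assistant/home-assistant:stable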

The OS is Raspbian, with Mosquitto installed as a service.

Other docker containers are:

  • postgres (hosting the database that HA uses)
  • cloudflare tunnel
  • nodered

For some months now, I've been suffering from a very strange problem: often at 4:20 (or very close to that time) my HA loses its connection to Mosquitto for a few seconds.
The result is that a lot of entities (zigbee2mqtt entities…) go from their current status → unavailable → back to the same status.
This is not causing serious trouble, since I don't have automations that are fooled by this fault, but I would really like to understand what's going on.

The problem occurred on the following days:

  • 07/08
  • 08/08
  • 09/08
  • 28/08
  • 7/09
  • 12/10
  • 14/10
  • 17/10
  • 19/10
  • 6/11
  • 7/11
  • 10/11
  • 11/11
  • 12/11
  • 15/11
  • 18/11

I can't find any recurring pattern/event that could be a clue to the culprit.
What did I do on 07/08?
Checking my git history, I see that, apart from a bunch of automations and scripts, I switched to version 2022.8.1.
Anyway, I usually keep my system up to date, so I'm now running 2022.11.2 (and I have used 2022.9 and 2022.10 too).

What I have tried so far:

  • update Raspbian
  • update HA
  • set a static IP address (it used to be DHCP'd)
  • disable the Fritz!Box integration
  • keep my cameras off

but now, I’m running out of ideas.

I will post a bunch of logs now, hoping that someone can help me understand what's going on here!!

Logs are from today's occurrence, which happened at around 04:18.

Home Assistant

Logs

2022-11-18 04:15:13.344 WARNING (MainThread) [homeassistant.components.media_player] Updating samsungtv media_player took longer than the scheduled update interval 0:00:10

2022-11-18 04:18:05.458 WARNING (Thread-8) [pychromecast.socket_client] [Bagno(192.168.1.73):8009] Heartbeat timeout, resetting connection

2022-11-18 04:18:05.863 WARNING (Thread-4 (_thread_main)) [homeassistant.components.mqtt.client] Disconnected from MQTT server 127.0.0.1:1883 (16)

2022-11-18 04:18:05.899 WARNING (Thread-9) [pychromecast.socket_client] [CC Casa(192.168.1.73):32051] Heartbeat timeout, resetting connection

2022-11-18 04:18:05.907 WARNING (Thread-7) [pychromecast.socket_client] [Cucina(192.168.1.190):8009] Heartbeat timeout, resetting connection

2022-11-18 04:18:06.156 ERROR (Thread-8) [pychromecast.socket_client] [Bagno([::ffff:c0a8:149]):8009] Failed to connect to service ServiceInfo(type='host', data=('[::ffff:c0a8:149]', 8009)), retrying in 5.0s

2022-11-18 04:18:06.361 WARNING (MainThread) [homeassistant.components.media_player] Updating samsungtv media_player took longer than the scheduled update interval 0:00:10

2022-11-18 04:18:06.382 WARNING (Thread-4 (_thread_main)) [homeassistant.components.mqtt.client] Disconnected from MQTT server 127.0.0.1:1883 (16)

2022-11-18 04:18:07.574 ERROR (MainThread) [homeassistant.components.template.light] Received invalid light is_on state: unavailable for entity light.ga_lampadina_comodino_sara. Expected: on, off, true, false

2022-11-18 04:18:07.681 ERROR (MainThread) [homeassistant.components.template.light] Received invalid light is_on state: unavailable for entity light.ga_lampadina_camera. Expected: on, off, true, false

2022-11-18 04:18:07.765 WARNING (MainThread) [homeassistant.helpers.entity] Update of media_player.tv_soggiorno is taking over 10 seconds

2022-11-18 04:18:07.768 WARNING (MainThread) [homeassistant.components.media_player] Updating samsungtv media_player took longer than the scheduled update interval 0:00:10

2022-11-18 04:18:07.772 WARNING (MainThread) [homeassistant.components.sensor] Updating energy sensor took longer than the scheduled update interval 0:00:30

2022-11-18 04:18:07.779 WARNING (MainThread) [homeassistant.components.light] Updating xiaomi_miio light took longer than the scheduled update interval 0:00:30

2022-11-18 04:18:07.784 WARNING (MainThread) [homeassistant.components.light] Updating xiaomi_miio light took longer than the scheduled update interval 0:00:30

2022-11-18 04:18:07.788 WARNING (MainThread) [homeassistant.components.light] Updating xiaomi_miio light took longer than the scheduled update interval 0:00:30

2022-11-18 04:18:07.791 WARNING (MainThread) [homeassistant.components.media_player] Updating spotify media_player took longer than the scheduled update interval 0:00:30

2022-11-18 04:18:08.004 WARNING (MainThread) [homeassistant.components.automation.z2m_last_update] Z2M: last update: Already running

2022-11-18 04:18:09.097 WARNING (MainThread) [homeassistant.components.automation.z2m_last_update] Z2M: last update: Already running

2022-11-18 04:18:09.102 WARNING (MainThread) [homeassistant.components.automation.z2m_last_update] Z2M: last update: Already running

2022-11-18 04:18:12.922 ERROR (MainThread) [homeassistant.components.websocket_api.http.connection] [1856442056] Client unable to keep up with pending messages. Stayed over 512 for 5 seconds. The system’s load is too high or an integration is misbehaving

Journalctl

Logs

nov 18 04:15:13 rpicasina3 containerd[498]: time="2022-11-18T04:15:12.737954896+01:00" level=error msg="get state for 1f560b6d7c76163b531ec37775422be980e26a6bf0fbef34be8cbb4f04e753e6" error="context deadline exceeded: unknown"
nov 18 04:15:14 rpicasina3 containerd[498]: time="2022-11-18T04:15:13.558128870+01:00" level=warning msg="unknown status" status=0
nov 18 04:15:14 rpicasina3 CRON[29098]: pam_unix(cron:session): session opened for user root by (uid=0)
nov 18 04:15:14 rpicasina3 CRON[29099]: pam_unix(cron:session): session opened for user pi by (uid=0)
nov 18 04:15:14 rpicasina3 CRON[29103]: (pi) CMD (/home/pi/bin/duia_myip)
nov 18 04:15:14 rpicasina3 CRON[29102]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
nov 18 04:15:14 rpicasina3 CRON[29098]: pam_unix(cron:session): session closed for user root
nov 18 04:15:19 rpicasina3 CRON[29099]: (CRON) info (No MTA installed, discarding output)
nov 18 04:15:19 rpicasina3 CRON[29099]: pam_unix(cron:session): session closed for user pi
nov 18 04:15:44 rpicasina3 systemd[1]: run-docker-runtime\x2drunc-moby-1f560b6d7c76163b531ec37775422be980e26a6bf0fbef34be8cbb4f04e753e6-runc.HOqdaV.mount: Succeeded.
nov 18 04:16:47 rpicasina3 systemd[1]: run-docker-runtime\x2drunc-moby-1f560b6d7c76163b531ec37775422be980e26a6bf0fbef34be8cbb4f04e753e6-runc.8IybrG.mount: Succeeded.
nov 18 04:17:09 rpicasina3 CRON[29201]: pam_unix(cron:session): session opened for user root by (uid=0)
nov 18 04:17:09 rpicasina3 CRON[29202]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
nov 18 04:17:10 rpicasina3 CRON[29201]: pam_unix(cron:session): session closed for user root
nov 18 04:18:57 rpicasina3 systemd[1]: run-docker-runtime\x2drunc-moby-1f560b6d7c76163b531ec37775422be980e26a6bf0fbef34be8cbb4f04e753e6-runc.XpighZ.mount: Succeeded.
nov 18 04:20:03 rpicasina3 CRON[29347]: pam_unix(cron:session): session opened for user pi by (uid=0)
nov 18 04:20:03 rpicasina3 CRON[29349]: (pi) CMD (/home/pi/bin/duia_myip)
nov 18 04:20:05 rpicasina3 CRON[29347]: (CRON) info (No MTA installed, discarding output)
nov 18 04:20:05 rpicasina3 CRON[29347]: pam_unix(cron:session): session closed for user pi
nov 18 04:25:02 rpicasina3 CRON[29628]: pam_unix(cron:session): session opened for user root by (uid=0)
nov 18 04:25:02 rpicasina3 CRON[29629]: pam_unix(cron:session): session opened for user pi by (uid=0)
nov 18 04:25:02 rpicasina3 CRON[29631]: (pi) CMD (/home/pi/bin/duia_myip)
nov 18 04:25:02 rpicasina3 CRON[29630]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
nov 18 04:25:02 rpicasina3 CRON[29628]: pam_unix(cron:session): session closed for user root
nov 18 04:25:04 rpicasina3 CRON[29629]: (CRON) info (No MTA installed, discarding output)

Please note that:

error="context deadline exceeded: unknown"

seems to be a coincidence: it happened only tonight and never on any of the other days listed above.

Processor and memory

Mosquitto
Please note that no other client disconnects at the same time.

Logs

1668735014: Saving in-memory database to /var/lib/mosquitto/mosquitto.db.
1668736815: Saving in-memory database to /var/lib/mosquitto/mosquitto.db.
1668738616: Saving in-memory database to /var/lib/mosquitto/mosquitto.db.
1668740417: Saving in-memory database to /var/lib/mosquitto/mosquitto.db.
1668741482: Client 1YwNHsbb9Q4dZ7x0ODLOKr has exceeded timeout, disconnecting.
1668741482: Socket error on client 1YwNHsbb9Q4dZ7x0ODLOKr, disconnecting.
1668741487: New connection from 127.0.0.1 on port 1883.
1668741487: New client connected from 127.0.0.1 as 1YwNHsbb9Q4dZ7x0ODLOKr (c1, k60, u'rositaha').
1668742218: Saving in-memory database to /var/lib/mosquitto/mosquitto.db.
1668744019: Saving in-memory database to /var/lib/mosquitto/mosquitto.db.
1668745820: Saving in-memory database to /var/lib/mosquitto/mosquitto.db.
1668747621: Saving in-memory database to /var/lib/mosquitto/mosquitto.db.
1668749422: Saving in-memory database to /var/lib/mosquitto/mosquitto.db.

Postgres
no logs recorded on 18/11

Conclusion: I don't know where else to look. The consistent time seems to point to a scheduled job, but I can't find it.
Can you help me?

You've got yourself an interesting, but hard-to-troubleshoot, problem.
No one can help you :smiley:

After 4 years, I could see a USB controller or the SD card thinking of retiring.

Regarding 4:20 and scheduled jobs:
it could be related to /home/pi/bin/duia_myip.

This should get you started searching (run as root, so every user's crontab is readable):

for user in $(cut -f1 -d: /etc/passwd); do crontab -u "$user" -l && echo "$user"; done

and systemctl list-timers + ls /etc/cron.*/*
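A slightly broader sweep (a sketch; the paths are the usual Raspbian ones and may not all exist, and the time window is just an example matching today's incident):

# system-wide schedules and timers
sudo cat /etc/crontab /etc/anacrontab 2>/dev/null
sudo ls /etc/cron.d/ /etc/cron.hourly/ /etc/cron.daily/ /etc/cron.weekly/
systemctl list-timers --all
# everything the journal recorded around the incident
journalctl --since "2022-11-18 04:10" --until "2022-11-18 04:25"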

Or containerd is having a hiccup.

This is around the time that database maintenance (purging) is performed, 04:16 from memory.
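One way to confirm it in the logs (a sketch, assuming the logger integration is enabled, HA is reachable on localhost:8123, and a long-lived access token is in $HA_TOKEN): raise the recorder's log level so the nightly purge shows up in home-assistant.log.

# bump the recorder component to debug logging via the logger.set_level service
curl -s -X POST \
  -H "Authorization: Bearer $HA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"homeassistant.components.recorder": "debug"}' \
  http://localhost:8123/api/services/logger/set_level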

Dear @tom_l ,

Can I see that info in the HA logs?

If the culprit is the purging, what can I do?

@tom_l , you are the winner here.

Triggering a purge from HA, I get into the same situation.
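(For reference, a minimal sketch of triggering the same purge over the REST API, assuming a long-lived access token in $HA_TOKEN and HA on localhost:8123; keep_days is just an example value:)

# manually trigger the recorder purge to reproduce the MQTT disconnect
curl -s -X POST \
  -H "Authorization: Bearer $HA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"keep_days": 10, "repack": false}' \
  http://localhost:8123/api/services/recorder/purge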

Need to investigate further, now :slight_smile:

Thanks to you all!

After some initial tests, I found out that the database user needs to be owner and superuser to perform the purge task without errors.
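A sketch of the grants I mean, with example names (adjust the container, database, and user to your own setup):

# make the HA database user the owner and a superuser (example names)
docker exec -it postgres psql -U postgres \
  -c "ALTER DATABASE homeassistant OWNER TO hauser;" \
  -c "ALTER USER hauser WITH SUPERUSER;"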

That seemed to fix my original problem. It had been working for 6 days, but tonight the problem reappeared!
No errors are present in the database log, so I can exclude a problem on the database side.

Can you please help me with this? Or should I proceed and open a bug on the proper tracker?