MQTT integration breaks most of HAOS after 12 hours. Power cycle fixes it

I recently added the MQTT integration and set up a few clients that report a single JSON value every 10 seconds. Everything works for about 12 hours, but then things start breaking:

• Any cards showing historical data stop working.
• SSH, File Editor, Terminal, System Logs, and more are inaccessible.
• The UI won’t let me reboot the device.
Automations still work—including an MQTT-triggered lock.

I’m running HAOS on a 2TB NVMe (via Geekworm NVMe/PoE hat) with an 8GB Raspberry Pi 5.

Has anyone experienced something similar, or does anyone have debugging tips?
Config file:

mqtt:
  - sensor:
      name: "Pihole CPU Temp"
      unique_id: "pihole_cpu_temp"
      state_topic: "linux/pi/pihole/sensors"
      value_template: "{{ value_json.cpu_temp }}"
      unit_of_measurement: "°C"
      device_class: temperature
      state_class: measurement
  - sensor:
      name: "Mirror CPU Temp"
      unique_id: "mirror_cpu_temp"
      state_topic: "linux/pi/mirror/sensors"
      value_template: "{{ value_json.cpu_temp }}"
      unit_of_measurement: "°C"
      device_class: temperature
      state_class: measurement
  - sensor:
      name: "Pihole 2 CPU Temp"
      unique_id: "pihole2_cpu_temp"
      state_topic: "linux/pi/pihole2/sensors"
      value_template: "{{ value_json.cpu_temp }}"
      unit_of_measurement: "°C"
      device_class: temperature
      state_class: measurement
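
For reference, each value_template above expects a JSON object with a cpu_temp key on the sensor's state topic. The client code isn't shown in this thread, so the following is only a minimal sketch of what such a payload could look like. It assumes the client reads a raw value in millidegrees Celsius (as Linux sysfs thermal zones report it); the function name build_payload is made up for illustration.

```python
import json


def build_payload(millidegrees):
    """Turn a raw reading in millidegrees Celsius into the JSON payload.

    The "cpu_temp" key matches value_json.cpu_temp in the config above.
    The millidegree input is an assumption about how the client reads
    the temperature, not something stated in the thread.
    """
    return json.dumps({"cpu_temp": round(millidegrees / 1000, 1)})
```

A client would publish the returned string to its state topic (for example linux/pi/pihole/sensors) every 10 seconds.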

It seems strange to me that the MQTT integration would be the cause.
What do the logs say?
Check both homeassistant.log and homeassistant.log.1.

The .1 file is the log from before the reboot/crash, so it may tell you something about what happened leading up to the event.
The other file is the log for the current session; the startup section in particular can tell you what HA sees as general issues with your setup.
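
If the logs are long, a small script can pull out just the warnings and errors. This is a rough sketch, not an official tool: it assumes the line format seen in Home Assistant logs (timestamp, level, thread, logger, message), and the function name filter_log_lines is made up for illustration.

```python
import re

# Matches lines shaped like:
# 2025-02-27 21:39:04.292 INFO (MainThread) [some.logger.name] message text
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[A-Z]+) \((?P<thread>[^)]*)\) "
    r"\[(?P<logger>[^\]]*)\] (?P<msg>.*)$"
)


def filter_log_lines(lines, levels=("WARNING", "ERROR", "CRITICAL")):
    """Return (timestamp, level, logger, message) tuples for the given levels.

    Lines that do not match the expected format (tracebacks, add-on
    output and so on) are skipped.
    """
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group("level") in levels:
            hits.append(
                (
                    match.group("ts"),
                    match.group("level"),
                    match.group("logger"),
                    match.group("msg"),
                )
            )
    return hits
```

To use it, open the log file and pass the file object in, e.g. filter_log_lines(open("homeassistant.log.1", encoding="utf-8", errors="replace")). Adjust the path to wherever your log actually lives.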

Also check your PoE power supply to see what standard it is offering.

Thanks for the tips. It’s still a little unclear to me what happened, but it may indeed be an issue with power delivery or more likely the NVMe drive. It’s connected to a PoE++ port on a USW Pro HD 24 switch.
Last lines in homeassistant.log.1

2025-02-27 21:39:04.292 INFO (MainThread) [homeassistant.components.automation.mirror_off_with_entry_1] Mirror off with Entry 1: Running automation actions
2025-02-27 21:39:04.292 INFO (MainThread) [homeassistant.components.automation.mirror_off_with_entry_1] Mirror off with Entry 1: Executing step call service
2025-02-28 01:28:53.919 INFO (MainThread) [custom_components.hacs] Loading known repositories

Next, in homeassistant.log

2025-02-28 01:33:20.863 INFO (MainThread) [supervisor.auth] Auth request from 'core_mosquitto' for 'mqttuser'
2025-02-28 01:33:21.159 INFO (MainThread) [supervisor.auth] Successful login for 'mqttuser'
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
cont-init: info: running /etc/cont-init.d/udev.sh
[17:04:57] INFO: Using udev information from host
cont-init: info: /etc/cont-init.d/udev.sh exited 0
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
services-up: info: copying legacy longrun supervisor (no readiness notification)
services-up: info: copying legacy longrun watchdog (no readiness notification)
[17:04:57] INFO: Starting local supervisor watchdog...
s6-rc: info: service legacy-services successfully started
2025-02-28 17:04:58.166 INFO (MainThread) [__main__] Initializing Supervisor setup

I ran supervisor repair and it repaired several containers and add-ons:

Repairing for add-on: core_samba
Repairing for add-on: a0d7b954_ssh
Repairing for add-on: core_mosquitto
Repairing for add-on: cb646a50_get
Repairing for add-on: core_configurator
Fix stale container on hassio network
Fix stale container on host network

If it happens again, I’ll try ditching the NVMe drive and just let it run on the microSD card. It’s still weird that it only started happening after adding MQTT.

With modern storage options like NVMe, SSD, M.2, flash and so on, it is the write process that causes hardware damage.
You will often see storage media where all the data can still be read, but nothing on the media can be written or changed.
So it is not strange that installing MQTT could have made it happen. It could have been any write operation, even a DB or log file write.
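
One quick way to check for the "readable but not writable" state is to try writing a small file and forcing it to disk. The following is only a sketch: the function name can_write is made up, and a passing result does not prove the drive is healthy, only that this one write went through.

```python
import os
import tempfile


def can_write(directory):
    """Try to create, write, and fsync a small temporary file in `directory`.

    Returns True if the write reached the storage device without an
    OS-level error, False otherwise (read-only filesystem, I/O error,
    missing directory, no permission and so on).
    """
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"write test")
            handle.flush()
            os.fsync(handle.fileno())  # push the data past the OS cache
        return True
    except OSError:
        return False
```

For example, can_write("/config") would test the Home Assistant config directory; the path is an assumption and depends on where you run the script.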

Hello cdrtaggart,

This sounds weird.
There is no core MQTT integration. Is it a custom one? Link?

So the add-on then?