Regular system failures

Hi all,

I have been using HA for years. Occasionally an update made my system hang, but normally HA runs stably for at least a few months. For the past few weeks, however, the system has been hanging frequently.

I am using the latest versions:

  • Core 2024.9.2
  • Supervisor 2024.09.1
  • Operating System 13.0
  • Frontend 20240909.1

on an rpi3-64. The disk is an SSD. No expansion cards.

Addons

  • zigbee2mqtt
  • Mosquitto broker
  • File Editor

And HACS

  • Raspberry Pi 1-Wire via sysbus
  • Local Tuya
  • edl-21

I have set up the developer SSH access. When the system hangs, it stops responding entirely: no ping, no SSH access on port 22222.

The journalctl output gives me no hints, so I am asking for help. Do you see anything in my logs? What should my next debugging steps be?

Here are some hangs (screenshot Bildschirmfoto-vom-2024-09-22-17-53-20, hosted at ImgBB):

journalctl --list-boots

IDX BOOT ID                          FIRST ENTRY                 LAST ENTRY                 
 -6 002e703fee05467f952ba3b18ecb05bd Sat 2024-08-31 17:57:30 UTC Sat 2024-08-31 22:14:27 UTC
 -5 27160367a1bc4938abe2fe6fd5c746e6 Sat 2024-08-31 23:13:04 UTC Mon 2024-09-02 07:24:38 UTC
 -4 55da6fa0c0d94ab88ffc9ec0e4bf9a2f Tue 2024-09-03 20:37:25 UTC Sat 2024-09-07 00:54:21 UTC
 -3 c31329f3f4e3414d84a6db2f1278dc7e Tue 2024-09-10 12:20:21 UTC Thu 2024-09-12 23:51:48 UTC
 -2 8070471fd0ea45d89114d641a6593464 Fri 2024-09-13 20:41:17 UTC Fri 2024-09-13 21:05:40 UTC 
 -1 66f927354939411eacf8f5f159655576 Sat 2024-09-21 16:27:46 UTC Sun 2024-09-22 06:50:01 UTC (power was switched off)
  0 876291d0d4a246b28b5d260c80113ced Sun 2024-09-22 06:55:16 UTC Sun 2024-09-22 15:46:51 UTC
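The boot list above already hints at how long the system was down each time: the distance between the LAST ENTRY of one boot and the FIRST ENTRY of the next. Here is a minimal Python sketch that computes those gaps from the pasted text. The function names `parse_boots` and `downtime_gaps` are made up for illustration. The numbers are only approximate, because a Raspberry Pi 3 has no battery-backed clock and the first entry of a boot may carry a restored timestamp instead of the real one.

```python
from datetime import datetime

# Output of `journalctl --list-boots`, copied from the post above.
BOOTS = """\
IDX BOOT ID                          FIRST ENTRY                 LAST ENTRY
 -6 002e703fee05467f952ba3b18ecb05bd Sat 2024-08-31 17:57:30 UTC Sat 2024-08-31 22:14:27 UTC
 -5 27160367a1bc4938abe2fe6fd5c746e6 Sat 2024-08-31 23:13:04 UTC Mon 2024-09-02 07:24:38 UTC
 -4 55da6fa0c0d94ab88ffc9ec0e4bf9a2f Tue 2024-09-03 20:37:25 UTC Sat 2024-09-07 00:54:21 UTC
 -3 c31329f3f4e3414d84a6db2f1278dc7e Tue 2024-09-10 12:20:21 UTC Thu 2024-09-12 23:51:48 UTC
 -2 8070471fd0ea45d89114d641a6593464 Fri 2024-09-13 20:41:17 UTC Fri 2024-09-13 21:05:40 UTC
 -1 66f927354939411eacf8f5f159655576 Sat 2024-09-21 16:27:46 UTC Sun 2024-09-22 06:50:01 UTC (power was switched off)
  0 876291d0d4a246b28b5d260c80113ced Sun 2024-09-22 06:55:16 UTC Sun 2024-09-22 15:46:51 UTC
"""

FMT = "%Y-%m-%d %H:%M:%S"


def parse_boots(text):
    """Return a list of (index, first_entry, last_entry) tuples."""
    boots = []
    for line in text.splitlines():
        parts = line.split()
        # Expected columns: IDX ID Day Date Time TZ Day Date Time TZ [comment]
        if len(parts) < 10 or not parts[0].lstrip("-").isdigit():
            continue  # skips the header line and anything malformed
        first = datetime.strptime(f"{parts[3]} {parts[4]}", FMT)
        last = datetime.strptime(f"{parts[7]} {parts[8]}", FMT)
        boots.append((int(parts[0]), first, last))
    return boots


def downtime_gaps(boots):
    """Return (prev_index, next_index, gap) for each pair of consecutive boots."""
    return [
        (prev[0], nxt[0], nxt[1] - prev[2])
        for prev, nxt in zip(boots, boots[1:])
    ]


if __name__ == "__main__":
    for prev_idx, next_idx, gap in downtime_gaps(parse_boots(BOOTS)):
        print(f"boot {prev_idx} -> boot {next_idx}: no journal for {gap}")
```

With the list from this post, the longest stretch without any journal entries is the one between boot -2 and boot -1 (a little under eight days), so a `--since` query that falls inside that window can only return entries from a later boot.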

2024-09-17 16:35 (14:35 UTC) System Hang

journalctl --since "2024-09-17 16:00:00"

Sep 21 16:27:46 homeassistant systemd-time-wait-sync[502]: adjtime state 5 status 40 time Sat 2024-09-21 16:27:46.985550 UTC
Sep 21 16:27:46 homeassistant systemd-timesyncd[511]: System clock time unset or jumped backwards, restored from recorded time
Sep 22 06:48:32 homeassistant systemd-time-wait-sync[502]: adjtime state 0 status 2000 time Sun 2024-09-22 06:48:32.608025 UTC
Sep 21 16:27:46 homeassistant systemd-resolved[394]: Clock change detected. Flushing caches.
Sep 21 16:27:46 homeassistant systemd[1]: Started Network Time Synchronization.
Sep 22 06:48:32 homeassistant systemd-timesyncd[511]: Contacted time server 192.168.178.1:123 (192.168.178.1).
Sep 22 06:48:32 homeassistant systemd-timesyncd[511]: Initial clock synchronization to Sun 2024-09-22 06:48:32.607762 UTC.
Sep 22 06:48:32 homeassistant systemd-resolved[394]: Clock change detected. Flushing caches.
Sep 22 06:48:32 homeassistant systemd[1]: Finished Wait Until Kernel Time Synchronized.
Sep 22 06:48:32 homeassistant systemd[1]: Reached target System Time Set.
Sep 22 06:48:32 homeassistant systemd[1]: Reached target System Time Synchronized.
Sep 22 06:48:32 homeassistant systemd[1]: Started Discard unused filesystem blocks once a week.
Sep 22 06:48:32 homeassistant systemd[1]: Remove Bluetooth cache entries was skipped because of an unmet condition check (Cond
Sep 22 06:48:32 homeassistant systemd[1]: Reached target Timer Units.
Sep 22 06:48:32 homeassistant systemd[1]: Starting HassOS AppArmor...
Sep 22 06:48:32 homeassistant audit[525]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="hassio-sup
Sep 22 06:48:32 homeassistant kernel: audit: type=1400 audit(1726987712.725:14): apparmor="STATUS" operation="profile_load" pr
Sep 22 06:48:32 homeassistant kernel: audit: type=1400 audit(1726987712.725:14): apparmor="STATUS" operation="profile_load" pr
Sep 22 06:48:32 homeassistant kernel: audit: type=1400 audit(1726987712.725:14): apparmor="STATUS" operation="profile_load" pr
Sep 22 06:48:32 homeassistant kernel: audit: type=1300 audit(1726987712.725:14): arch=c00000b7 syscall=64 success=yes exit=393
Sep 22 06:48:32 homeassistant audit[525]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="hassio-sup
Sep 22 06:48:32 homeassistant audit[525]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="hassio-sup
Sep 22 06:48:32 homeassistant audit[525]: SYSCALL arch=c00000b7 syscall=64 success=yes exit=39339 a0=6 a1=55a4ce2e00 a2=99ab a
Sep 22 06:48:32 homeassistant audit: PROCTITLE proctitle=61707061726D6F725F706172736572002D72002D57002D4C002F6D6E742F646174612
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.736280145Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.736692645Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.737048999Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant systemd[1]: Finished HassOS AppArmor.
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.746264676Z" level=info msg="skip loading plugin \"io.
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.746476656Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.747805093Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.750385458Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.752646603Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.752997333Z" level=info msg="metadata content store po
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.834947333Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.835271291Z" level=info msg="loading plugin \"io.conta
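The hang times in this post are written in local time with the UTC value in brackets, while the journal itself is stamped in UTC. Here is a small Python sketch for building a `--since` argument in UTC from a local time. It assumes a fixed offset of UTC+2, as implied by "16:35 (14:35 UTC)". The helper names and the `minutes_before` parameter are hypothetical. Recent systemd versions should accept a trailing `UTC` in the timestamp, but please verify that on your system.

```python
from datetime import datetime, timedelta, timezone

# Assumption: local time is UTC+2, as implied by "16:35 (14:35 UTC)".
LOCAL = timezone(timedelta(hours=2))


def since_utc(local_str, minutes_before=30):
    """Convert 'YYYY-MM-DD HH:MM' local time to a UTC string, starting earlier."""
    local = datetime.strptime(local_str, "%Y-%m-%d %H:%M").replace(tzinfo=LOCAL)
    start = (local - timedelta(minutes=minutes_before)).astimezone(timezone.utc)
    return start.strftime("%Y-%m-%d %H:%M:%S UTC")


def journalctl_command(local_str, minutes_before=30):
    """Build the journalctl command line as a string (it is not executed here)."""
    return f'journalctl --since "{since_utc(local_str, minutes_before)}"'


if __name__ == "__main__":
    print(journalctl_command("2024-09-17 16:35"))
```

Starting the window a little before the hang keeps the last messages written before the freeze inside the output.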


2024-09-18 13:10 (11:10 UTC) System Hang

journalctl --since "2024-09-18 11:00:00"

Sep 21 16:27:46 homeassistant systemd-time-wait-sync[502]: adjtime state 5 status 40 time Sat 2024-09-21 16:27:46.985550 UTC
Sep 21 16:27:46 homeassistant systemd-timesyncd[511]: System clock time unset or jumped backwards, restored from recorded time
Sep 22 06:48:32 homeassistant systemd-time-wait-sync[502]: adjtime state 0 status 2000 time Sun 2024-09-22 06:48:32.608025 UTC
Sep 21 16:27:46 homeassistant systemd-resolved[394]: Clock change detected. Flushing caches.
Sep 21 16:27:46 homeassistant systemd[1]: Started Network Time Synchronization.
Sep 22 06:48:32 homeassistant systemd-timesyncd[511]: Contacted time server 192.168.178.1:123 (192.168.178.1).
Sep 22 06:48:32 homeassistant systemd-timesyncd[511]: Initial clock synchronization to Sun 2024-09-22 06:48:32.607762 UTC.
Sep 22 06:48:32 homeassistant systemd-resolved[394]: Clock change detected. Flushing caches.
Sep 22 06:48:32 homeassistant systemd[1]: Finished Wait Until Kernel Time Synchronized.
Sep 22 06:48:32 homeassistant systemd[1]: Reached target System Time Set.
Sep 22 06:48:32 homeassistant systemd[1]: Reached target System Time Synchronized.
Sep 22 06:48:32 homeassistant systemd[1]: Started Discard unused filesystem blocks once a week.
Sep 22 06:48:32 homeassistant systemd[1]: Remove Bluetooth cache entries was skipped because of an unmet condition check (Cond
Sep 22 06:48:32 homeassistant systemd[1]: Reached target Timer Units.
Sep 22 06:48:32 homeassistant systemd[1]: Starting HassOS AppArmor...
Sep 22 06:48:32 homeassistant audit[525]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="hassio-sup
Sep 22 06:48:32 homeassistant kernel: audit: type=1400 audit(1726987712.725:14): apparmor="STATUS" operation="profile_load" pr
Sep 22 06:48:32 homeassistant kernel: audit: type=1400 audit(1726987712.725:14): apparmor="STATUS" operation="profile_load" pr
Sep 22 06:48:32 homeassistant kernel: audit: type=1400 audit(1726987712.725:14): apparmor="STATUS" operation="profile_load" pr
Sep 22 06:48:32 homeassistant kernel: audit: type=1300 audit(1726987712.725:14): arch=c00000b7 syscall=64 success=yes exit=393
Sep 22 06:48:32 homeassistant audit[525]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="hassio-sup
Sep 22 06:48:32 homeassistant audit[525]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="hassio-sup
Sep 22 06:48:32 homeassistant audit[525]: SYSCALL arch=c00000b7 syscall=64 success=yes exit=39339 a0=6 a1=55a4ce2e00 a2=99ab a
Sep 22 06:48:32 homeassistant audit: PROCTITLE proctitle=61707061726D6F725F706172736572002D72002D57002D4C002F6D6E742F646174612
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.736280145Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.736692645Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.737048999Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant systemd[1]: Finished HassOS AppArmor.
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.746264676Z" level=info msg="skip loading plugin \"io.
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.746476656Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.747805093Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.750385458Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.752646603Z" level=info msg="loading plugin \"io.conta

2024-09-21 18:27 (16:27 UTC) System Hang

journalctl --since "2024-09-21 18:00:00"

Sep 21 16:27:46 homeassistant systemd-timesyncd[511]: System clock time unset or jumped backwards, restored from recorded time
Sep 22 06:48:32 homeassistant systemd-time-wait-sync[502]: adjtime state 0 status 2000 time Sun 2024-09-22 06:48:32.608025 UTC
Sep 21 16:27:46 homeassistant systemd-resolved[394]: Clock change detected. Flushing caches.
Sep 21 16:27:46 homeassistant systemd[1]: Started Network Time Synchronization.
Sep 22 06:48:32 homeassistant systemd-timesyncd[511]: Contacted time server 192.168.178.1:123 (192.168.178.1).
Sep 22 06:48:32 homeassistant systemd-timesyncd[511]: Initial clock synchronization to Sun 2024-09-22 06:48:32.607762 UTC.
Sep 22 06:48:32 homeassistant systemd-resolved[394]: Clock change detected. Flushing caches.
Sep 22 06:48:32 homeassistant systemd[1]: Finished Wait Until Kernel Time Synchronized.
Sep 22 06:48:32 homeassistant systemd[1]: Reached target System Time Set.
Sep 22 06:48:32 homeassistant systemd[1]: Reached target System Time Synchronized.
Sep 22 06:48:32 homeassistant systemd[1]: Started Discard unused filesystem blocks once a week.
Sep 22 06:48:32 homeassistant systemd[1]: Remove Bluetooth cache entries was skipped because of an unmet condition check (Cond
Sep 22 06:48:32 homeassistant systemd[1]: Reached target Timer Units.
Sep 22 06:48:32 homeassistant systemd[1]: Starting HassOS AppArmor...
Sep 22 06:48:32 homeassistant audit[525]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="hassio-sup
Sep 22 06:48:32 homeassistant kernel: audit: type=1400 audit(1726987712.725:14): apparmor="STATUS" operation="profile_load" pr
Sep 22 06:48:32 homeassistant kernel: audit: type=1400 audit(1726987712.725:14): apparmor="STATUS" operation="profile_load" pr
Sep 22 06:48:32 homeassistant kernel: audit: type=1400 audit(1726987712.725:14): apparmor="STATUS" operation="profile_load" pr
Sep 22 06:48:32 homeassistant kernel: audit: type=1300 audit(1726987712.725:14): arch=c00000b7 syscall=64 success=yes exit=393
Sep 22 06:48:32 homeassistant audit[525]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="hassio-sup
Sep 22 06:48:32 homeassistant audit[525]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="hassio-sup
Sep 22 06:48:32 homeassistant audit[525]: SYSCALL arch=c00000b7 syscall=64 success=yes exit=39339 a0=6 a1=55a4ce2e00 a2=99ab a
Sep 22 06:48:32 homeassistant audit: PROCTITLE proctitle=61707061726D6F725F706172736572002D72002D57002D4C002F6D6E742F646174612
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.736280145Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.736692645Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant containerd[476]: time="2024-09-22T06:48:32.737048999Z" level=info msg="loading plugin \"io.conta
Sep 22 06:48:32 homeassistant systemd[1]: Finished HassOS AppArmor.

Run diagnostics on your SSD. I know you said you have no SD, but an SSD can still fail.

Also check the temperature of the CPU.

There are several posts about RPi3s in relation to the latest updates.
It looks like the 1 GB RAM limit on the RPi3 is right at the tipping point of what HA can run on.
The RPi3 has had a warning in the installation docs for some time, and HA is becoming more demanding, as are many integrations and the ways people use HA.
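Both suggestions (CPU temperature and memory pressure) can be checked from standard Linux files. Here is a minimal Python sketch, assuming the usual paths `/proc/meminfo` and `/sys/class/thermal/thermal_zone0/temp`. Python may not be available in the HAOS host shell, so the parsing functions take plain text and also work on output copied from the device with `cat`. The function names are made up for illustration.

```python
from pathlib import Path


def parse_meminfo(text):
    """Return the numeric fields of /proc/meminfo as a dict (values in kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_fraction(info):
    """Share of total RAM the kernel estimates is available for new work."""
    return info["MemAvailable"] / info["MemTotal"]


def parse_cpu_temp(text):
    """The thermal zone file holds millidegrees Celsius; return degrees."""
    return int(text.strip()) / 1000.0


if __name__ == "__main__":
    mem_path = Path("/proc/meminfo")
    temp_path = Path("/sys/class/thermal/thermal_zone0/temp")
    if mem_path.exists():
        info = parse_meminfo(mem_path.read_text())
        print(f"RAM available: {available_fraction(info):.0%}")
    if temp_path.exists():
        print(f"CPU temperature: {parse_cpu_temp(temp_path.read_text()):.1f} C")
```

Logging these two values every few minutes and looking at the last readings before a hang would show whether the system was running out of memory or overheating at that point.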


Okay, I will export my HA with the backup function and set up a clean install on a new SSD to check whether there is a problem with the disk. If that does not help, I will switch to an RPI4.

Thanks for your responses

Okay. I did a fresh installation on my RPI3 with a new SSD, imported the backup, and reconfigured my /mnt/boot/config.txt. The system was online again, but it restarted roughly every 10 hours. Zigbee was working.

So I followed the second piece of advice, set up a fresh RPI4 system, and imported the backup. The system runs smoother, but now zigbee2mqtt shows all devices and no switch is working.

The only messages I got (with zigbee2mqtt in debug mode) are:

[18:15:06] INFO: Starting Zigbee2MQTT...
Starting Zigbee2MQTT without watchdog.
[2024-10-13 18:16:13] error: 	z2m: Failed to set permit join to false (APS TIMEOUT)
[2024-10-13 18:16:25] error: 	z2m: Publish 'set' 'state' to 'Kellerflur Licht' failed: 'Error: ZCL command 0xec1bbdfffe7bb9d8/1 genOnOff.on({}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"reservedBits":0,"writeUndiv":false}) failed (no response received (4) Error: waiting for response TIMEOUT)'
[2024-10-13 18:16:25] error: 	z2m: Publish 'set' 'state' to 'Waschraum Licht' failed: 'Error: ZCL command 0xec1bbdfffead12b7/1 genOnOff.off({}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"reservedBits":0,"writeUndiv":false}) failed (no response received (3) Error: waiting for response TIMEOUT)'

I re-checked the write permission on the serial device, and it succeeded:

test -w /dev/ttyACM0 && echo success || echo failure
success

Any advice?

The Z2M problem was solved by using a USB extension cable. There may have been interference from USB3. I will now monitor the system uptime.

So the resolution was to upgrade to an RPI4; the SSD may also have been broken. When switching to a Raspberry Pi 4, also use a USB extension cable for the Zigbee adapter.