Fresh Install Pi5 - HA Core Stability + Pinned CPUs

Core = 2024.5.5
Supervisor = 2024.05.01
OS = 12.3
Hardware = Raspberry Pi 5 8GB, Waveshare PoE HAT
USB = None

I’ve been running HA for a number of years now, alternating between running it in a container and running it in a VM. As I’m moving to a new home shortly, I decided to start fresh on an 8GB Pi 5, using the Waveshare PoE HAT. The original plan was to use a 64GB SanDisk Endurance card that I had, with the DB on a separate system. However, even with a completely fresh install (no add-ons besides the default Matter, no integrations besides the defaults), HA Core would restart every “x” amount of time (where x was completely random).

Power draw reported by the switch was ~5.1 watts at idle, with some spikes to 7.9 watts. Temperatures were consistently in the mid-40s (°C). And of course, there was no consistent error in the logs to indicate where the problem might lie. Still, I did my due diligence: I removed the HAT and used the official PSU, same behaviour. I removed the board from the case (for better airflow), same behaviour. I replaced the SD card with a new Samsung, same behaviour. I also enabled “debug” in the configuration.yaml file.

Frankly, I was prepared to throw in the towel when I decided to try once again: fresh install, no USB devices, running off the SD card using the PoE HAT, and with debug enabled in configuration.yaml. The behaviour returned, but I saw two repeated errors in the host log:

homeassistant kernel: get_swap_device: Bad swap offset entry 00100000
homeassistant systemd-journald[11040]: Missed X kernel messages

Suspecting it might be an issue with swap (despite it not being used), and after confirming via Google, I disabled swap. I will note here that this wasn’t meant to be a “fix”, as I suspected that HA requires a swap to exist even if it isn’t used. This was rather a “nothing else has helped, let’s see what happens”. What happened was surprising: the system has now been up for the past 22 hours and seems to be working just fine. The only things that don’t appear to be working are that the only two logs being generated are Core and Host, and the kernel error above is still repeated ad nauseam in the host log.

I installed System Monitor as well as Advanced Terminal to get better visibility into what’s happening. htop is showing one core completely pinned and a second core at about 50% usage, but I can’t see any indication of who or what the culprit is. If I were to hazard a guess, it’s probably Supervisor trying to restart Core (despite Core functioning just fine)… but I have no idea.
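In case it helps anyone retrace this, the quickest way I know to look for the culprit from a terminal is to sort processes by CPU; the `docker stats` variant is only an option if you have host shell access:

```shell
# Sort processes by CPU usage to look for whatever is pinning the
# core that htop shows at 100% (header row plus top five offenders).
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 6

# From the HA OS host shell (debug/console access), a per-container
# view can separate Supervisor from Core:
#   docker stats --no-stream
```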

Anyway, I know this is a long post, but has anybody seen this before? Are there any other steps I can try?

Thanks in advance!

EDIT - Rereading this, I want to clarify that the system was always “up”. The stability issues were that Supervisor was restarting core every “x” amount of time. The system itself never unexpectedly rebooted.

I haven’t seen this before, but what could have happened is that the swap file did not get created correctly. We do initialize the swap file on first startup and write all zeros into it to make sure that the file system actually allocates all the space (if you are interested in the details, see the mkswap man page, which recommends dd on ext4). Anyhow, what could have happened is that in one of the boot attempts, that dd did not complete.

What you can try is simply deleting the swap file by running the following command in the OS shell:

# rm /mnt/data/swapfile

The next boot will take a bit longer as the swapfile will get generated again. See if that helps to get rid of the “Bad swap offset entry” messages.
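For the curious, the initialization described above can be sketched roughly like this; the path and size below are illustrative, not the exact commands HA OS runs:

```shell
# Rough sketch of the swap file initialization described above: dd
# writes real zeros so ext4 allocates every block (no sparse holes),
# as the mkswap(8) man page recommends. Illustrative path and size.
SWAPFILE=/tmp/swapfile-demo    # on HA OS the real file is /mnt/data/swapfile
SIZE_MB=4

dd if=/dev/zero of="$SWAPFILE" bs=1M count="$SIZE_MB" status=none
chmod 0600 "$SWAPFILE"

# On the real system (as root) the file is then formatted and enabled.
# If dd is interrupted mid-write, the resulting swap area can be left
# corrupt, which would match the "Bad swap offset entry" messages:
#   mkswap "$SWAPFILE" && swapon "$SWAPFILE"
```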

Thank you for the reply @agners , I appreciate the help. To give an update: I got the idea that perhaps the official Raspberry Pi Imager was not writing the image correctly (along the same lines as your thinking about swap not being created correctly). I downloaded the latest HA OS image and used Balena Etcher to write the card.

Essentially the same behaviour, though I did notice during creation of the HA instance, at the “Please Wait” screen, that it gave an “unexpected EOF” error before hanging for 10-ish minutes, after which it finally came up.

Once it came up, the previous behaviour returned, that is, Core being restarted by Supervisor every “x” minutes. Thinking that perhaps it was an issue with the Watchdog being too sensitive, I then disabled the Watchdog for Core and was able to get ~9 hours of “stability” before Core crashed and I needed to restart the system.

The next step: during my googling, I saw people with a similar issue a few years ago that was related to having “Trusted Proxies” set up in the configuration.yaml file. As this was something I had in my VM install of HA (which has no issues), I decided to give it a whirl. Surprisingly (at least initially), this changed the behaviour slightly. I now see the following in the Supervisor logs:

2024-05-29 07:28:19.003 ERROR (MainThread) [supervisor.homeassistant.api] Timeout on call http://172.30.32.1:8123/api/services.
2024-05-29 07:28:19.003 ERROR (MainThread) [supervisor.api.proxy] Error on API for request services

But no restart attempt. Then the system eventually hung and rebooted on its own this time. Obviously not ideal, but at least it seems that the changes I’m making are altering the behaviour of the system. I will also try your suggestion regarding swap and see if that does anything (either good or bad). That said, my gut feeling (for what it’s worth) is that this is something related to the Pi 5 specifically. Either a minor revision in hardware (due to supply chain), or something else. But with the board being relatively new, and relatively new to official HA support, I’m one of the lucky ones to experience it.
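For reference, the trusted-proxies setup I’m referring to lives under the `http:` block of configuration.yaml; the addresses below are placeholders, not my actual values:

```yaml
# configuration.yaml -- only relevant when HA sits behind a reverse proxy.
http:
  use_x_forwarded_for: true
  trusted_proxies:
    - 192.168.1.10        # placeholder: reverse proxy address
    - 172.30.33.0/24      # placeholder: e.g. an add-on network range
```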

Finally (for now), for shits and giggles, I may try a backup from my working VM onto the Pi. Just to see if that (as weird as it would be) brings stability. I’ll keep the community posted as I move through this fun experience :slight_smile:

EDIT - @agners, apologies, but how do I drop down to the OS Shell? Is the only way via Debugging the Home Assistant Operating System | Home Assistant Developer Docs ?

~Spritz

Apologies, replying to myself. I was able to pull logs that cover ~the last 24 hours for both the home assistant docker and the hassio container. I’ve not yet had a chance to really look, as I need to go pick up the kids, but I’m posting now in case someone has a chance before I do.

Hassio Log
Home Assistant Container Log

~Spritz

Again, bumping myself. Murphy’s law: I had decided to throw in the towel on this, as the troubleshooting has been one step forward and two steps back. The last thing I did was enable debug mode again in the config and set “asyncio debug” to true… after a few hours of the normal abnormal behaviour (e.g. random Core restarts), the system has now been up for two days straight… because of course it has.
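For anyone retracing this, the debug logging I mean goes in configuration.yaml; treat this as a sketch, since I’m writing the keys from memory and they may differ slightly between HA versions:

```yaml
# configuration.yaml -- sketch of the debug logging mentioned above
# (keys from memory; verify against the logger integration docs).
logger:
  default: warning
  logs:
    asyncio: debug             # surfaces the slow-callback warnings below
    homeassistant.core: debug
```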

That said, nothing of interest is showing in the supervisor logs, but Core is showing:

Executing <Handle BaseSelectorEventLoop._sock_connect_cb(<Future finis...events.py:448>, <socket.socke...30.32.2', 80)>, ('172.30.32.2', 80)) created at /usr/local/lib/python3.12/asyncio/selector_events.py:310> took 0.237 seconds
Executing <TimerHandle when=42342.052261476 _run_async_call_action(<HomeAssistant RUNNING>, <Job None Has...0x7f7819ea20>>) at /usr/src/homeassistant/homeassistant/helpers/event.py:1500 created at /usr/src/homeassistant/homeassistant/helpers/event.py:1547> took 0.262 seconds
Executing <Task pending name='dhcp discovery' coro=<NetworkWatcher.async_discover() running at /usr/src/homeassistant/homeassistant/components/dhcp/__init__.py:334> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.12/asyncio/base_events.py:448> cb=[set.remove()] created at /usr/src/homeassistant/homeassistant/util/async_.py:40> took 0.274 seconds
Executing <TimerHandle when=93645.70074549 _run_async_call_action(<HomeAssistant RUNNING>, <Job None Has...0x7f7819ea20>>) at /usr/src/homeassistant/homeassistant/helpers/event.py:1500 created at /usr/src/homeassistant/homeassistant/helpers/event.py:1547> took 0.289 seconds
Executing <Handle _SelectorSocketTransport._read_ready() created at /usr/local/lib/python3.12/asyncio/selector_events.py:276> took 0.262 seconds

Meanwhile, the host logs are showing what appear to be some interesting errors, though I have no idea what they mean:

2024-06-02 11:09:43.615 homeassistant systemd[1]: containerd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
2024-06-02 11:09:43.616 homeassistant dockerd[617]: time="2024-06-02T11:09:43.613902460Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = error reading from server: EOF" module=libcontainerd namespace=moby
2024-06-02 11:09:43.616 homeassistant dockerd[617]: time="2024-06-02T11:09:43.613972663Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=moby
2024-06-02 11:09:43.616 homeassistant dockerd[617]: time="2024-06-02T11:09:43.613848294Z" level=error msg="Failed to get event" error="rpc error: code = Unavailable desc = error reading from server: EOF" module=libcontainerd namespace=plugins.moby
2024-06-02 11:09:43.616 homeassistant dockerd[617]: time="2024-06-02T11:09:43.614463826Z" level=info msg="Waiting for containerd to be ready to restart event processing" module=libcontainerd namespace=plugins.moby
2024-06-02 11:09:43.617 homeassistant systemd[1]: containerd.service: Failed with result 'exit-code'.
2024-06-02 11:09:43.617 homeassistant systemd[1]: containerd.service: Unit process 902 (containerd-shim) remains running after unit stopped.
2024-06-02 11:09:43.618 homeassistant systemd[1]: containerd.service: Unit process 1036 (containerd-shim) remains running after unit stopped.
2024-06-02 11:09:43.619 homeassistant systemd[1]: containerd.service: Unit process 1274 (containerd-shim) remains running after unit stopped.
2024-06-02 11:09:43.620 homeassistant systemd[1]: containerd.service: Unit process 1439 (containerd-shim) remains running after unit stopped.
2024-06-02 11:09:43.621 homeassistant systemd[1]: containerd.service: Unit process 1596 (containerd-shim) remains running after unit stopped.
2024-06-02 11:09:43.622 homeassistant systemd[1]: containerd.service: Unit process 1709 (containerd-shim) remains running after unit stopped.
2024-06-02 11:09:43.622 homeassistant systemd[1]: containerd.service: Unit process 2180 (containerd-shim) remains running after unit stopped.
2024-06-02 11:09:43.623 homeassistant systemd[1]: containerd.service: Unit process 2269 (containerd-shim) remains running after unit stopped.
2024-06-02 11:09:43.624 homeassistant systemd[1]: containerd.service: Unit process 2720 (containerd-shim) remains running after unit stopped.
2024-06-02 11:09:43.624 homeassistant systemd[1]: containerd.service: Unit process 2996 (containerd-shim) remains running after unit stopped.
2024-06-02 11:09:43.625 homeassistant systemd[1]: containerd.service: Unit process 3140 (containerd-shim) remains running after unit stopped.
2024-06-02 11:09:43.626 homeassistant systemd[1]: containerd.service: Consumed 26.936s CPU time.
2024-06-02 11:09:48.859 homeassistant systemd[1]: containerd.service: Scheduled restart job, restart counter is at 3.
2024-06-02 11:09:48.859 homeassistant systemd[1]: containerd.service: Found left-over process 902 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-02 11:09:48.859 homeassistant systemd[1]: containerd.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-02 11:09:48.859 homeassistant systemd[1]: containerd.service: Found left-over process 1036 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-02 11:09:48.859 homeassistant systemd[1]: containerd.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-02 11:09:48.859 homeassistant systemd[1]: containerd.service: Found left-over process 1274 (containerd-shim) in control group while starting unit. Ignoring.

Again, this is essentially a blank system: a fresh install onto a new SD card. The only add-ons installed are “File Editor” and “Advanced SSH & Terminal”. Any thoughts, guidance, prayers or laughter would be appreciated.

Thanks!

~Spritz