Raspberry Pi becomes unstable after some hours of running

semagarcia · May 21, 2024, 1:12pm

I’m trying to understand what could be going wrong with my Raspberry Pi and the Docker container that hosts Home Assistant, because I’m a bit stuck on this problem.

Once the RPi has been running for 15-20 hours (under low load, never high), the Home Assistant instance is still reachable, but parts of it start to fail, such as some .js files failing to load (which makes HA a bit unstable):

As the screenshot shows, I can get into HA, but some requests fail, such as those for some custom components (the sankey-chart component fails because it does not exist, since its JS file is missing).

Also, some Shelly devices started to fail:

I’ve followed the instructions at this link (gen1 devices), but with no success. I left this comment there in case it gives more useful info (I don’t know whether the two problems are related, or even whether one is a consequence of the other).

I’ve tried to connect through SSH, but the RPi returns the following error:

kex_exchange_identification: read: Connection reset by peer
Connection reset by 192.168.XXX.XXX port 22

My first question is: WTF?? Why does the RPi reject my SSH connection attempt at this point, if I can still get into HA?

If I reboot the RPi manually, all of these errors disappear and everything seems to work OK again (HA, SSH, etc.), until the errors come back some hours later.

The RPi is connected over ethernet.

Does anyone have a hint or an idea why this problem is happening?

Thanks in advance!!

MaxK · May 21, 2024, 5:49pm

Could you please provide some more details on hardware (Pi4? SSD, etc.)?

How is HA installed (HA OS)?

And what version of Home Assistant Core, Supervisor, etc.

ShadowFist · May 21, 2024, 6:05pm

Follow up to @MaxK’s excellent questions. What other docker containers are you running on that pi?

semagarcia · May 21, 2024, 6:11pm

Thanks for your interest @MaxK

The hardware is an RPi4 with a 120GB SSD, connected over Ethernet through a switch to the router. The RPi OS is Debian GNU/Linux 11 (bullseye), installed on the SSD.

HA is installed as a Docker container, updated to the latest release (2024.5.4). The Docker version is v24.0.5.

Replying to @ShadowFist, the other containers are these (the same as a few months ago):

These are my current HACS components:

If you need more details, please, don’t hesitate to ask for them.

Thanks in advance!

ShadowFist · May 21, 2024, 6:16pm

I know next to nothing when it comes to Docker, but is there a way you can monitor the Pi’s resources (RAM, CPU and temperature mostly) to check if something is causing the Pi to throttle?
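For reference, here is a minimal sketch of such a check in Python. It assumes the `vcgencmd` tool that ships with Raspberry Pi OS is also present on this Debian install (it may not be), and that the CPU temperature is exposed at the usual `thermal_zone0` path. The function names are only illustrative.

```python
import subprocess

# Bit meanings as documented for `vcgencmd get_throttled`.
THROTTLE_FLAGS = {
    0: "under-voltage detected now",
    1: "ARM frequency capped now",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred since boot",
    17: "ARM frequency capping has occurred since boot",
    18: "throttling has occurred since boot",
    19: "soft temperature limit has occurred since boot",
}


def decode_throttled(output: str) -> list[str]:
    """Turn a line such as 'throttled=0x50005' into readable flag names."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [name for bit, name in sorted(THROTTLE_FLAGS.items()) if value & (1 << bit)]


def read_cpu_temp_c(path: str = "/sys/class/thermal/thermal_zone0/temp") -> float:
    """Read the CPU temperature; the kernel reports millidegrees Celsius."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0


def read_throttled() -> list[str]:
    """Ask the firmware for the throttle state (assumes vcgencmd is installed)."""
    result = subprocess.run(
        ["vcgencmd", "get_throttled"], capture_output=True, text=True, check=True
    )
    return decode_throttled(result.stdout)
```

An empty list from `read_throttled()` would mean no under-voltage or thermal throttling has been flagged since boot. Any "since boot" entry would point towards the power supply or cooling rather than Home Assistant itself.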

semagarcia · May 21, 2024, 6:29pm

Those values are within ranges that look acceptable to me.

The strangest thing is that the RPi rejects the SSH connection while it is clearly still reachable, because I can log into HA. Leaving that aside, the next question is why some components of HA itself (mainly HACS components) fail and cannot be loaded, which stops several sections and/or some dashboards from loading.
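One way to see how far an SSH connection gets is to open a plain TCP connection and wait for the server's identification banner, which an SSH server normally sends first. The sketch below is only an illustration of that idea. `probe_banner` is a hypothetical helper, and the address in the usage note is the masked one from the error message above.

```python
import socket


def probe_banner(host: str, port: int, timeout: float = 5.0) -> str:
    """Connect to host:port and report what the server does before we send anything."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = sock.recv(256)
    except ConnectionResetError:
        # TCP connected, but the peer reset the connection before any banner arrived.
        return "reset"
    except socket.timeout:
        # Port is open but silent (normal for HTTP, which waits for a request).
        return "timeout"
    except OSError as exc:
        # Refused, unreachable, and similar failures.
        return f"error: {exc}"
    if not data:
        return "closed without banner"
    return data.decode("ascii", errors="replace").strip()
```

Calling `probe_banner("192.168.XXX.XXX", 22)` with the real address should return something starting with `SSH-2.0-` on a healthy host. A result of `reset` or `closed without banner` would mean the TCP connection is accepted and then dropped before the SSH daemon answers, which matches the `kex_exchange_identification` message and suggests the problem is on the host side rather than in the network.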

MaxK · May 21, 2024, 6:30pm

Releases up to 2024.5.4 have included changes to stop misbehaving integrations (integrations that are not thread safe). Here is a link to track down integration issues:

ShadowFist · May 21, 2024, 6:35pm

I’m honestly surprised a Pi4 can run influxdb, grafana and code server without running out of steam, but I’m in no position to contradict you.

Follow Max’s advice above. If that doesn’t help, disable 50% of your custom components at a time, until you narrow down the culprit.
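The halving approach is a plain bisection. As a rough sketch, assuming there is exactly one culprit and that you can tell whether the fault shows up with a given set enabled (here the hypothetical `is_broken` callback, which in practice means restarting HA with only those components and waiting), it looks like this:

```python
from typing import Callable, Sequence


def find_culprit(
    components: Sequence[str],
    is_broken: Callable[[Sequence[str]], bool],
) -> str:
    """Narrow down a single misbehaving component by halving the enabled set."""
    candidates = list(components)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        # If the fault shows with only this half enabled, the culprit is in it;
        # otherwise it must be in the remaining half.
        if is_broken(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]
```

With N custom components this needs about log2(N) test rounds instead of N. Since each round here means waiting 15-20 hours for the fault, that difference matters. If more than one component is at fault, the single-culprit assumption breaks and the result will only identify one of them at best.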

semagarcia · May 21, 2024, 6:50pm

Since I’ve restarted the RPi, I can now access the history data (until it gets stuck again). This is the CPU & RAM data:

I’ll read the info provided by @MaxK and look into it.

Thanks to both!!

ShadowFist · May 21, 2024, 6:52pm

Again, I know next to nothing about docker, but are you sure those are the stats for the base hardware and not just the HA docker container?

semagarcia · May 21, 2024, 7:12pm

Yes, because I’ve checked them against the data from the top command on the host (or at least, I guess those values are correct).

Nick4 · May 23, 2024, 9:48pm

Am I seeing this right: you are running all this on a RPi with 4GB?!

nickrout · May 24, 2024, 6:47am

It doesn’t seem pegged out at that point. Make sure you keep the ssh connection open and check the stats when it is going wrong.

Also check iotop, which shows whether IO is being overloaded.

dmesg will show whether the OS is seeing anything unusual, like a failing disk.
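If the SSH session cannot be kept alive until the failure, an alternative is to record the stats on the Pi itself and read them after the reboot. The following is a minimal sketch, not a tested monitoring tool. The log path, the interval and the choice of fields are assumptions.

```python
import os
import time


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse the content of /proc/meminfo into {field: value} (values in kiB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def snapshot_line(meminfo_text: str, load: tuple[float, float, float]) -> str:
    """Format one timestamped log line from memory info and load averages."""
    mem = parse_meminfo(meminfo_text)
    return (
        f"{time.strftime('%Y-%m-%d %H:%M:%S')} "
        f"load1={load[0]:.2f} "
        f"mem_available_kib={mem.get('MemAvailable', -1)} "
        f"swap_free_kib={mem.get('SwapFree', -1)}"
    )


def run(log_path: str = "/var/log/pi-health.log", interval_s: int = 60) -> None:
    """Append a snapshot every interval_s seconds until the process is stopped."""
    while True:
        with open("/proc/meminfo") as f:
            line = snapshot_line(f.read(), os.getloadavg())
        with open(log_path, "a") as log:
            log.write(line + "\n")
            log.flush()
            os.fsync(log.fileno())  # push to disk so the line survives a hard reset
        time.sleep(interval_s)
```

Started with `run()` from a service or a detached shell on the host (not inside a container), this leaves a trail showing whether memory or load drifts in the hours before SSH stops responding.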

semagarcia · May 24, 2024, 7:21pm

I can’t check anything, because when the SSH connection breaks, the current SSH session closes, and if I try to reconnect, the error message mentioned above appears. I’m thinking of connecting a keyboard and a monitor so I can get info when the SSH connection breaks, because otherwise I can’t connect at all.

I’ll try to dig deeper into this mystery this weekend.

Thanks in advance

nickrout · May 24, 2024, 9:15pm

Yes, a monitor and keyboard would be best.