How to detect/recover from HA node that has lost LAN connectivity?

Guff666 · January 2, 2025, 1:01pm

I have an HA node at a remote site, running on a Pi4 and connected to a wired Ethernet LAN. Occasionally, the node becomes inaccessible. The LAN itself is OK, because I can connect to the remote router, but the router cannot ping the HA node. My only solution right now is to cycle the power using a Tapo switch on the Pi’s PSU. Not ideal.

This drastic solution makes it difficult to debug what’s going on because the systemd-journal gets corrupted - I can see this using ssh to the HAOS backend. It is quite possible that the node is still running - some sensor data is still being logged - but it has lost LAN connectivity.

I’ve considered adding an automation to detect loss of network connectivity, but simply restarting HA is unlikely to be sufficient. I’d need to reboot the Pi.

Are there any other solutions before I go down the route of an external ESPHome device that pings the node and resets it if it loses connectivity?
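
To make the idea concrete, here is a minimal sketch of the watchdog logic such an external device would run, written in Python rather than as ESPHome config. It assumes a Linux-style `ping` command, and `power_cycle` is a hypothetical callback standing in for whatever toggles the power (for example, the Tapo switch). The thresholds are illustrative, not recommendations.

```python
import subprocess
import time
from typing import Callable, Optional


def host_is_up(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo using the system ping (Linux flags); True on a reply."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watchdog(
    probe: Callable[[], bool],
    power_cycle: Callable[[], None],
    max_failures: int = 5,
    interval_s: float = 60.0,
    cooldown_s: float = 300.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call power_cycle() after max_failures consecutive failed probes.

    Returns the number of power cycles triggered. max_checks=None runs forever.
    """
    failures = 0
    resets = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0  # any success clears the failure streak
        else:
            failures += 1
            if failures >= max_failures:
                power_cycle()
                resets += 1
                failures = 0
                sleep(cooldown_s)  # give the node time to boot before probing again
                continue
        sleep(interval_s)
    return resets


# Example wiring (address and callback are placeholders):
# watchdog(lambda: host_is_up("192.0.2.10"), power_cycle=my_toggle_function)
```

Requiring several consecutive failures, plus a cooldown after each reset, avoids power-cycling the node on a single dropped ping or while it is still booting.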

NathanCu · January 2, 2025, 1:19pm

Cycling the power with the Tapo switch and resetting the node from an external ESPHome device are essentially the same thing. The difference is whether you get feedback from your local LAN before you do it. So an ESP, or a commercial solution that does exactly that, will probably be where you head.

The real answer, however, is to troubleshoot the Pi and find out why it’s crashing, which is the most likely reason it’s losing connectivity. Sure, resetting the PSU would recover that, but log.1 should still be on the box. If it’s an older Pi with low memory running a current build, I’d also venture a guess that it’s running out of memory.

WallyR · January 3, 2025, 6:54am

Extract the home-assistant.log.1 file from the config directory.
It is the log from the previous run.

My guess is that your PSU is just at the limit, so your HA will lock up at times.

Guff666 · January 4, 2025, 9:27am

Thanks for the feedback, both of you.

HA itself is not crashing. Sensor readings are still being logged, and a closer inspection shows that the HA log is being updated right to the point where I powered off the Pi4.
The PSU is not the source of the problem. A: (as above) HA is not crashing, B: other devices fed from the same PSU continue to work.
According to the stats collected by the Glances add-on, the Pi4 is not working hard or getting close to any resource limits, including CPU temperature. In any case, the Argon40 case has amazing passive and active cooling.

The frustration is not being able to redirect the systemd-journal to another machine that might tell me more. On Linux, I’d simply add a remote destination for all log traffic. I can’t see how to do that on HAOS.

For now, I’ve added an input_binary that is controlled by the results of a ping to the router. If the problem recurs, I can look at the history of that input_binary to see whether LAN connectivity was lost, and when.
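
As a rough illustration of how that history could be read afterwards, the sketch below reduces a series of timestamped ping results to outage windows, which can then be compared with the time of the power cycle. The `(timestamp, reachable)` sample format is an assumption made for the example, not the format HA exports.

```python
from datetime import datetime
from typing import Iterable, List, Optional, Tuple

Sample = Tuple[datetime, bool]  # (time of ping, router reachable?)
Outage = Tuple[datetime, Optional[datetime]]  # (first failed ping, first good ping after)


def outages(samples: Iterable[Sample]) -> List[Outage]:
    """Collapse consecutive failed pings into (start, end) outage windows.

    An outage still in progress at the end of the data has end=None.
    Samples are assumed to be in chronological order.
    """
    result: List[Outage] = []
    down_since: Optional[datetime] = None
    for ts, reachable in samples:
        if not reachable and down_since is None:
            down_since = ts  # connectivity just dropped
        elif reachable and down_since is not None:
            result.append((down_since, ts))  # connectivity restored
            down_since = None
    if down_since is not None:
        result.append((down_since, None))
    return result
```

If the last window has no end time and begins shortly before the node stopped answering the router, that would support the theory that the LAN link dropped while HA itself kept running.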

WallyR · January 5, 2025, 4:56am

The PSU of the Raspberry is rated just at the limit, so you will often get partial crashes where some parts of the hardware crash and others do not.

Guff666 · January 6, 2025, 11:36am

That’s an interesting point, @WallyR. Could you expand on it? Do you mean the unit providing the 5V, or the internal circuitry on the Pi?
In this particular case, the unit supplying power is capable of supplying 10A.

NathanCu · January 6, 2025, 12:13pm

The Pi is hard-limited to USB amperage (3 A) on the USB side of the voltage regulator. Check the forum and the internet; it’s quite a well-documented condition.

Unless you modified your Pi to inject voltage there, it easily exhausts the power available on the USB bus (one SSD plus one other bus-powered device is enough). It doesn’t matter if your supply can provide a constant N amps.

Use a powered USB hub to avoid low-power conditions on the USB bus.

Therefore, the two questions to ask first when it’s a Pi:

Are you using a powered USB hub? And are you using an SD card (the other most common failure mode)? Then a close third: are you out of RAM?

WallyR · January 6, 2025, 1:11pm

Some also say the input side is limited, so when the USB limit is exceeded, it does not have to be the USB device that fails. It can also be any internal device, such as the WiFi, Bluetooth, SD card reader or whatever else is on board.
If the overstep is not too big, the CPU might continue operating, but in a condition with an uncertain outcome.

Guff666 · January 6, 2025, 4:37pm

Thanks.
Yes, the Pi is connected to a powered hub, and
No, it’s not using an SD card. It uses an M.2 SSD.

I’m not seeing anything to suggest there is a RAM shortage, or indeed any other shortage. Glances reports that memory is running at about 46% whenever I’ve looked at it.

Since posting about this, the crash frequency has increased and I’m beginning to think it’s an SSD problem.
I’m waiting for a replacement SSD to arrive, at which point I’ll rebuild the node and restore from the last successful nightly backup. That will entail a site visit though, which will probably not be before Thursday at the earliest.

Guff666 · January 6, 2025, 4:38pm

Makes you wonder if it’s such a good base for a standalone device then. Maybe I should be looking at different hardware.

WallyR · January 6, 2025, 4:45pm

I switched from an RPi4 with 4 GB RAM to an Intel NUC 13 with an i3 CPU, 16 GB RAM and a couple of 1 TB SSDs.

The hardware components have no trouble working together in such a setup, so SSDs just work, the USB ports are powerful enough to handle everything (and there are lots of them), and the network card is better.

The RAM, CPU and SSD sizes are for running VMs in Proxmox on it, with HAOS as one of them.

I could not have been more satisfied with the setup. No hardware issues, and it is fast and stable.

Guff666 · January 6, 2025, 5:57pm

I run something similar for my home environment, also on Proxmox.
It just seems overkill for a remote node running just HA and a small collection of add-ons. I’d prefer to KISS with HAOS.
However, system reliability and availability are key, so I may have to go down a similar route.

WallyR · January 6, 2025, 6:06pm

I used a laptop for a short while between the RPi and the NUC.
It had a fanless Pentium N3000 CPU and 8 GB RAM. The monitor was broken, so it was extremely cheap.
You need to check that you can get into the BIOS and change settings on an external monitor, which is sometimes not the case on laptops.
The benefit of the laptop was that the battery acted as a UPS, and the battery charge, charging state and so on took just a few command-line commands to include in HA for shutdown automations.
Check also that the laptop can be set to boot on power connect, so a power failure will make it reboot at power restoration.
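
As a hedged illustration of the battery readings mentioned above (not the exact commands I used), many Linux laptops expose them as small text files under `/sys/class/power_supply`. The `BAT0` name and the 20% threshold below are assumptions; the battery directory differs between machines.

```python
from pathlib import Path

# Assumed location; on some laptops the directory is BAT1 or similar.
DEFAULT_BATTERY = Path("/sys/class/power_supply/BAT0")


def read_battery(base: Path = DEFAULT_BATTERY) -> dict:
    """Read charge percentage and charging state from the sysfs battery files."""
    capacity = int((base / "capacity").read_text().strip())
    status = (base / "status").read_text().strip()  # e.g. "Charging", "Discharging", "Full"
    return {"capacity_percent": capacity, "status": status}


def should_shut_down(battery: dict, threshold_percent: int = 20) -> bool:
    """True when running on battery and the charge is at or below the threshold."""
    return (
        battery["status"] == "Discharging"
        and battery["capacity_percent"] <= threshold_percent
    )
```

The two values can be published to HA as sensors, and a shutdown automation can then act on the same condition that `should_shut_down` checks.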

The NUC was only needed because of my requirement for more than one large SSD for a music server.