Random uncommanded HAOS restarts - how to debug?

I’ve recently been seeing sporadic uncommanded reboots of my HAOS, mostly at night, and I’m wondering where to start debugging them. I’m an IT guy with 25 years of Linux experience, but I’m not familiar with the architecture of HAOS.

System is an RPI4, 8GB model. I install pretty much every update immediately by hand. No auto updates are enabled. The SSL cert is updated from my main server and delivered to HAOS, but that process runs daily at 00:05. I’m seeing those random reboots usually between 02:30 and 05:00, so the cert delivery isn’t the cause. All stats, like memory, temperature, etc. are centrally monitored and all look good, so no overheating, no memory issues, etc.
Power is supplied through a separate PSU at 5.1V which is UPS backed. Those stats are also monitored and look good.

What acts up seems to be either the NodeRED integration crashing every now and then or the whole OS going down and rebooting.

On the console, which logs should I look at and where do I find them?

Hi @sgofferj

When I used to run HA on a RPi4, random restarts were usually due to the Pi trying to draw more power than my power supply could supply. This would often happen when it was doing something intensive in the night - like a backup.

This may not be your issue but could be worth a look.

HA recorder database purge runs nightly at 04:12, so I would start there. Could be a corrupted or locked db.

Also, if you have automated backups set up, those usually run overnight, so check for disk space issues or, if the destination is remote, connection issues.

It’s a 10A PSU, so I doubt that. Monitoring also doesn’t show anything there.

Mh, interesting. I’ll have a look at the recorder DB. Disk space locally is ~440GB free out of 500GB. USB connected SSD.

I wouldn’t exclude power issues just yet. You could plug in a 100A PSU into your pi - the USB power output is still limited & probably not sufficient for your SSD during heavy load.

AH! Ok, I’ll get that power thingy custom integration and monitor.

Apparently I installed the power supply checker integration already :smiley: Looks all good, no bad values around the reboot time.

I’d be checking the specifications on power draw of the SSD and its enclosure, and the supply current capacity of the USB port it is plugged into. For instance, USB 2.0 is rated for 500mA current draw. Under intense activity, your SSD might draw more, so check the specs and allow a margin for error as it ages. Maybe a Y cable from your 10A 5.1V supply could be spliced into the drive to make sure it has enough grunt. I would also format the SSD to ensure there is no data corruption from low write voltage levels and possible wear-levelling activity during suspect conditions. Consider all data on there potentially corrupt if the power supply has been marginal.

Have you checked for compatibility issues where certain SSD controllers and USB 3.0 ports do not work well together? This is often solved by moving them to USB 2.0 ports, but then the current supply becomes an issue to check. Running your SSD at 5.1V shouldn’t be a drama. Again, check the spec sheets for tolerances.

Is your power supply checker software sensing voltage, current draw or both? Just voltage only tells part of the story, especially in your case where an abundance of current can mask brief voltage dips that slip through in between sensor polling slots.
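To put a rough number on the "dips slip through between polls" point, here is a minimal sketch. It assumes the sensor takes an instantaneous reading at a fixed interval with a random phase relative to the dip, which is a simplification; the function name and the example durations are made up for illustration.

```python
def detection_probability(dip_ms: float, poll_interval_ms: float) -> float:
    """Chance that at least one sample lands inside a single voltage dip.

    Assumes instantaneous samples taken every poll_interval_ms, with the
    dip starting at a uniformly random point in the polling cycle.
    """
    if dip_ms <= 0 or poll_interval_ms <= 0:
        raise ValueError("durations must be positive")
    # A dip longer than the polling interval is always caught.
    return min(1.0, dip_ms / poll_interval_ms)


# Hypothetical dip lengths and polling intervals, just to show the scale.
for dip_ms in (5, 50, 500):
    for poll_s in (1, 30, 60):
        p = detection_probability(dip_ms, poll_s * 1000)
        print(f"dip {dip_ms:>3} ms, poll every {poll_s:>2} s: {p:.4%} chance of seeing it")
```

With polling in the range of tens of seconds, a brownout of a few milliseconds is very unlikely to show up in the graph, so a clean voltage history does not rule one out.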

Power supply rated at 10 amps! How well is it regulated? What is the ripple? Switched or buck/boost analog regulation? Is the output adequately filtered by big fat capacitors that aren’t bulging? Do you have long cable runs to the equipment that may pick up induced interference? Try an unpolarised 0.1uF, 50 volt capacitor at both ends to slurp up any stray spikes and buzz.

Interesting that the majority of writers who reply suspect the PSU…

To my understanding, the HA integration only listens to the RPI4’s low power alarm. My own monitoring measures the voltage at the RPI input via Y-cable.
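For reference, the low-power alarm is what the Pi firmware exposes through `vcgencmd get_throttled`. The sketch below decodes that bitmask using the bit meanings from the Raspberry Pi documentation. Whether `vcgencmd` is reachable from an HAOS shell is an assumption I have not verified, so the function only takes the output string and does not call the tool itself.

```python
# Bit positions as documented for `vcgencmd get_throttled`.
THROTTLE_FLAGS = {
    0: "under-voltage detected (now)",
    1: "ARM frequency capped (now)",
    2: "currently throttled",
    3: "soft temperature limit active (now)",
    16: "under-voltage has occurred since boot",
    17: "ARM frequency capping has occurred since boot",
    18: "throttling has occurred since boot",
    19: "soft temperature limit has occurred since boot",
}


def decode_throttled(output: str) -> list:
    """Turn a string like 'throttled=0x50005' into readable flag names."""
    value = int(output.strip().split("=")[-1], 16)
    return [label for bit, label in THROTTLE_FLAGS.items() if value & (1 << bit)]


# Example value only, not taken from the system in this thread.
for flag in decode_throttled("throttled=0x50005"):
    print(flag)
```

Note that the "since boot" bits are cleared by a reboot, so after an uncommanded restart they cannot tell you what happened just before it. They would need to be polled and shipped to the monitoring server while the system is still up.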

I checked and the PSU is only rated for 6.5A but the HA RPI4 is the only “customer”. It’s a Mean Well HDR 60-5. Ripple is specified with 80mVp-p. I don’t have the equipment to verify that, though.

The SSD is connected to one of the RPI4’s USB3 ports. It’s an ADATA SP900 128GB. According to the specs, it pulls 0.5W idle and 0.9W active, i.e. 100mA and 180mA respectively at 5V. That should be well within the RPI4’s capabilities.
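The arithmetic behind those numbers, as a small sketch. The 0.5W and 0.9W figures are the datasheet values quoted above; the 900mA per-port figure is the USB 3.0 limit, and the 1200mA combined USB budget for the Pi 4 is the value I recall from the Raspberry Pi documentation, so treat both limits as assumptions to verify. The USB-SATA bridge overhead and short peak currents are not included.

```python
SUPPLY_V = 5.0           # nominal USB bus voltage
USB3_PORT_MA = 900       # assumed per-port limit for USB 3.0
PI4_TOTAL_USB_MA = 1200  # assumed combined budget for all Pi 4 USB ports


def current_ma(power_w: float, voltage_v: float = SUPPLY_V) -> float:
    """Convert a power figure in watts to current in milliamps (I = P / V)."""
    return power_w / voltage_v * 1000


for state, watts in (("idle", 0.5), ("active", 0.9)):
    ma = current_ma(watts)
    print(f"{state}: {ma:.0f} mA, "
          f"{USB3_PORT_MA - ma:.0f} mA below the per-port limit, "
          f"{PI4_TOTAL_USB_MA - ma:.0f} mA below the total USB budget")
```

Datasheet averages say nothing about brief spikes, which is why measuring the actual current between the SSD and the Pi is still worthwhile.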

I’ll still set up some current measuring between the SSD and the RPI4.

Edit:
Compatibility issues shouldn’t be a thing. The system has been running like that for a couple of years without problems. Before you ask: SMART monitoring is active and shows no problems.

Edit2:
After some checking of the Zabbix (monitoring software) event log, it looks like docker went down on 2025-12-22 at 03:09:43. Then follow a few alarms about containers being down (which makes sense if docker has died) and then I see a “System has been restarted (uptime <10m)” at 03:19:01.
That looks like a software issue but I still haven’t found out where the logs are hidden on HAOS, other than the GUI reader.
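A quick sanity check on that timeline, as a sketch. The two timestamps are the ones from the Zabbix event log above, and the scheduled jobs are the ones mentioned in this thread (cert delivery at 00:05, recorder purge at 04:12). The 15-minute window is an arbitrary choice for illustration.

```python
from datetime import datetime

docker_down = datetime(2025, 12, 22, 3, 9, 43)
restart_alert = datetime(2025, 12, 22, 3, 19, 1)

# Scheduled jobs named in the thread, as HH:MM local time.
scheduled_jobs = {"cert delivery": "00:05", "recorder purge": "04:12"}


def minutes_from_incident(incident: datetime, hhmm: str) -> float:
    """Signed distance in minutes from the incident to a job on the same day."""
    hour, minute = map(int, hhmm.split(":"))
    job = incident.replace(hour=hour, minute=minute, second=0, microsecond=0)
    return (job - incident).total_seconds() / 60


gap = restart_alert - docker_down
print(f"docker down -> restart alert: {gap}")

WINDOW_MIN = 15  # arbitrary "close enough to be suspicious" window
for name, hhmm in scheduled_jobs.items():
    delta = minutes_from_incident(docker_down, hhmm)
    verdict = "inside" if abs(delta) <= WINDOW_MIN else "outside"
    print(f"{name} at {hhmm}: {delta:+.0f} min from incident ({verdict} the window)")
```

For this particular incident, neither scheduled job falls near the time docker went down, so at least this reboot does not line up with the nightly purge. The alert time only bounds the reboot loosely, since it depends on how often Zabbix polls uptime.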

Good. If you cannot trust your hardware and power platforms then you are sunk. You seem to have considered all those power supply issues.

Do you have a memory leak in one of your docker containers or integrations?

Tracking down intermittent problems is hard. Any docker supervisors sitting above HomeAssistant that may be able to narrow it down to which instance/container is rogue and dying?

Best guide for that

PS: Seeing many people with repeatedly huge logfiles, I assume that the devs have implemented some kind/level of “rotation”, even in their journald, to fit the basic hardware of the Green and Yellow devices.

Well, it’s a plain HAOS installation. I have installed the Zabbix Agent 2 add-on which reports system health data to my Zabbix server and they look all good. Overall mem usage never goes above 20-22%, CPU usage outside of updates or ESPHome compiling new firmwares around 10%. And nothing indicates any specific container misbehaving. At the moment it just looks like the docker daemon just dies for whatever reason…

I think I’ll also set up a port mirror on my switch and capture every incoming and outgoing packet with Wireshark. Maybe there is some Zero Day DoS crap which can trigger something via the web ports…

THERE IS A PROMTAIL ADDON?!

That link is really helpful, thanks! I do have a central logserver running Loki, so I’ll set that up real quick and that should help me diagnose a lot!

BTW: Why would HAOS log in journal format if journalctl isn’t available for the platform…? :man_facepalming:

Well, unfortunately, the add-on is not maintained and the last update was in 2022… sigh

Found logspout. That should solve my problems with log access. Now I get all HAOS logs into Loki and can grep the living daylights out of the logs next time something dies :slight_smile:

What makes you say that? I was using journalctl yesterday on HAOS.
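If you can get at journalctl, `journalctl -b -1 -o json --no-pager` dumps the previous boot as one JSON object per line (this needs the journal from that boot to still be on disk). A minimal sketch for pulling out the serious entries follows. The field names are standard journald fields, the function name is made up, and the sample records are hypothetical stand-ins, not real HAOS output.

```python
import json


def serious_entries(json_lines, max_priority=3):
    """Return (unit, message) for entries at syslog priority err (3) or worse."""
    hits = []
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        try:
            priority = int(entry.get("PRIORITY", 6))
        except (TypeError, ValueError):
            continue  # skip records with an unusable priority field
        if priority <= max_priority:
            unit = entry.get("_SYSTEMD_UNIT") or entry.get("SYSLOG_IDENTIFIER") or "?"
            hits.append((unit, str(entry.get("MESSAGE"))))
    return hits


# Hypothetical sample records in the shape `journalctl -o json` produces.
sample = [
    '{"PRIORITY": "6", "_SYSTEMD_UNIT": "example.service", "MESSAGE": "started"}',
    '{"PRIORITY": "3", "_SYSTEMD_UNIT": "example.service", "MESSAGE": "something failed"}',
]
for unit, message in serious_entries(sample):
    print(f"{unit}: {message}")
```

Run it wherever the exported lines end up; the last few error-level entries before the end of the previous boot are usually the interesting ones.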

The reason we keep looking at the power.

The +5V rail on the Pi is hard limited to USB amperage. It doesn’t matter how much current you have available; the power rail itself is current limited for some boneheaded reason, so you can absolutely starve the box when you get into a borderline overcurrent situation. It basically browns out for a second and boom. A big 100A bench supply doesn’t matter. No extra current for you.

Get those aux loads off the PSU by putting them on their OWN powered USB hub and take them out of the power calculation. Best bet is a USB2 hub for your dongles (if you have any) to avoid USB3 interference on the bus. You can use a USB3 one for that HDD. I had to run my Pi like that for 6 months or it would randomly die for no good reason. I’m still not convinced your problem isn’t power.

Is it dumb? Yes. Should you have to? No. But it was the only way I could run stable. Pull the disk off the USB power rail. (Or my favorite choice, and what I did when it got too much: get a new NUC for Christmas.)

The post linked by @boheme61 a few posts up…