Help! My HA is restarting once a day

Hello,

my HA is restarting once a day and I don’t know how to debug it. Where do I start?

Thank you and kind regards,
Pete

First thing is to give the users here enough info to be able to help. Read this.

Then look at the troubleshooting section here.

This is what @Arh says…

It would help to know your specs and what device you use; otherwise people can’t really help you much.

But generally 3 things that happened to me:

  1. CPU heat dissipation not good enough → CPU gets too hot and device turns off automatically/reboots to prevent damage
  2. RAM Memory Leak → Integrations mess something up
  3. Faulty PSU → interrupts power of your server and makes it power-cycle

With 1 and 2 you can use System Monitor Integration to gather information and get hints/clues.

In my example I created a history graph, and according to my graphs everything seems fine.

OK, thank you for the quick feedback. I’m sorry for posing the question so badly. I don’t know what level of information I can provide without overwhelming the potential readers.

I use a PCengines APU2 with 8 GB of RAM. I have a second APU2 (same make and model) that has been running pfSense for over 9 years without any instability.
Both are installed in the same rackmount enclosure with a single PSU supplying them both, which rules out that angle (again, the other APU2 is rock solid). I had Debian running before I recently went the HAOS direction, and I restored from a backup of an installation that was running in a VM before. Both Debian and the VM were rock solid, so I’m uncertain whether it’s a config issue.

The only thing that I could think of out of the ordinary is that I initially installed HAOS on an MSATA SSD with 16 GB storage, which ran full when I restored the VM backup (see above). I then used the “move data disk” function towards a 128 GB Samsung SSD. I don’t know if this could be the issue?

I have System Monitor running, but I’m not sure what to look out for. I suspect the graphs go up during a restart, but I can’t say for certain whether they are the cause or the effect. I don’t see anything that stands out as unusually high (memory, CPU or thermals).

For the RAM memory leak - how would that materialize? What would I need to watch out for?

HA Core Logs only show some startup issues, as far as I can tell.
HA was restarted this morning at 9:28, and the Supervisor logs show this (just copied and pasted a couple of lines):

2026-04-28 09:17:58.448 WARNING (SyncWorker_3) [supervisor.addons.validate] App config 'arch' uses deprecated values ['armhf', 'armv7', 'i386']. Please report this to the maintainer of ha-sip
2026-04-28 09:17:58.822 INFO (MainThread) [supervisor.store] Loading apps from store: 100 all - 0 new - 0 remove
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
cont-init: info: running /etc/cont-init.d/udev.sh
[07:25:47] INFO: Using udev information from host
cont-init: info: /etc/cont-init.d/udev.sh exited 0
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
services-up: info: copying legacy longrun supervisor (no readiness notification)
services-up: info: copying legacy longrun watchdog (no readiness notification)
s6-rc: info: service legacy-services successfully started
[07:25:48] INFO: Starting local supervisor watchdog...
2026-04-28 07:25:54.055 INFO (MainThread) [__main__] Initializing Supervisor setup
2026-04-28 07:25:54.418 INFO (MainThread) [supervisor.coresys] Setting up coresys for machine: generic-x86-64
2026-04-28 09:25:54.436 INFO (MainThread) [supervisor.docker.supervisor] Attaching to Supervisor ghcr.io/home-assistant/amd64-hassio-supervisor with version 2026.04.0
2026-04-28 09:25:54.865 INFO (MainThread) [supervisor.resolution.evaluate] Starting system evaluation with state initialize
2026-04-28 09:25:54.875 INFO (MainThread) [supervisor.resolution.evaluate] System evaluation complete
2026-04-28 09:25:54.877 INFO (MainThread) [__main__] Setting up Supervisor

Host logs also seem ‘fine’:

2026-04-28 08:25:56.720 homeassistant systemd[1]: Starting Network Manager Script Dispatcher Service...
2026-04-28 08:25:56.798 homeassistant systemd[1]: Started Network Manager Script Dispatcher Service.
2026-04-28 08:26:06.832 homeassistant systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
2026-04-28 09:25:56.677 homeassistant NetworkManager[406]: <info>  [1777368356.6772] dhcp4 (enp1s0): state changed new lease, address=192.168.4.30
2026-04-28 09:25:56.719 homeassistant systemd[1]: Starting Network Manager Script Dispatcher Service...
2026-04-28 09:25:56.811 homeassistant systemd[1]: Started Network Manager Script Dispatcher Service.
2026-04-28 09:26:06.847 homeassistant systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
2026-04-28 09:36:16.292 homeassistant systemd[1]: Starting Hostname Service...

Regarding the memory leak: RAM usage grows constantly until it reaches 100%, IIRC. So setting up System Monitor will help you get an idea of whether that is the issue or not.
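To make "grows constantly" a bit more concrete, here is a minimal Python sketch. It assumes you have exported the System Monitor memory-use history as (timestamp, percent used) pairs; the function names and the sample numbers are made up for illustration and are not part of Home Assistant.

```python
from datetime import datetime, timedelta


def memory_trend(samples):
    """Least-squares slope of memory use, in percentage points per hour.

    samples: list of (datetime, percent_used) tuples, in any order.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    samples = sorted(samples)
    t0 = samples[0][0]
    xs = [(t - t0).total_seconds() / 3600 for t, _ in samples]
    ys = [p for _, p in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / var_x


def hours_until_full(samples):
    """Rough time until 100% if the trend continues; None if use is not rising."""
    slope = memory_trend(samples)
    if slope <= 0:
        return None
    latest_percent = max(samples)[1]  # value of the most recent sample
    return (100 - latest_percent) / slope


if __name__ == "__main__":
    # Made-up data: memory use climbing 2 points per hour, starting at 20%.
    start = datetime(2026, 4, 28, 0, 0)
    fake = [(start + timedelta(hours=h), 20 + 2 * h) for h in range(10)]
    print("slope (points/hour):", memory_trend(fake))
    print("hours until 100%:", hours_until_full(fake))
```

A steady positive slope across a whole uptime period would point towards a leak. A flat line that only dips around the restart, like the graphs described in this thread, would not.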

There can be many causes: “faulty”/crappy custom integrations, faulty add-ons, bad integrations. It really is hard to determine and a PITA to debug. BUT it is doable.

A good starting point is to think about which integrations you have added over time (assuming the system was ever stable), so you can test by deactivating the integrations that might be causing the memory leak.


Does this look out of the ordinary to you?

The downward peaks (troughs) are around the time the system restarts.
Before I migrated to hardware (while it was still inside the VM), everything ran fine. I don’t think I added integrations after that.
Shouldn’t any log capture a restart or something?

What strikes me as curious is that the System Monitor “time since last system start” (loosely translated) is so constant. Exactly 18 hours, 12 minutes and 10-20 seconds:

It’s very suspicious that it is restarting/resetting after pretty much the same amount of time…what do CPU and HDD utilization look like?

I don’t think there will be any logs about a restart if your system resets…have you looked into the other logs?
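Since a hard reset leaves no "restarting" message, one thing you can look for instead is a long silence between timestamps. Below is a minimal Python sketch that finds the longest gap in a pasted log. It assumes every relevant line starts with a timestamp in the form `2026-04-28 08:25:56.720`, as in the host log excerpt above; `largest_gap` is a hypothetical helper name, not a Home Assistant tool.

```python
import re
from datetime import datetime, timedelta

# Matches a leading "YYYY-MM-DD HH:MM:SS.mmm " timestamp.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) ")


def largest_gap(lines):
    """Return (gap, before, after) for the longest silence, or None."""
    stamps = []
    for line in lines:
        match = TS.match(line)
        if match:  # lines without a leading timestamp are skipped
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    stamps.sort()
    if len(stamps) < 2:
        return None
    return max((later - earlier, earlier, later)
               for earlier, later in zip(stamps, stamps[1:]))


if __name__ == "__main__":
    # Four lines taken from the host log excerpt in this thread.
    sample = [
        "2026-04-28 08:25:56.720 homeassistant systemd[1]: Starting Network Manager Script Dispatcher Service...",
        "2026-04-28 08:25:56.798 homeassistant systemd[1]: Started Network Manager Script Dispatcher Service.",
        "2026-04-28 08:26:06.832 homeassistant systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.",
        "2026-04-28 09:25:56.677 homeassistant NetworkManager[406]: <info>  [1777368356.6772] dhcp4 (enp1s0): state changed new lease, address=192.168.4.30",
    ]
    gap, before, after = largest_gap(sample)
    print(f"longest silence: {gap} (from {before} to {after})")
    assert gap > timedelta(minutes=59)
```

A long gap is only a hint, because a quiet system also produces them. Also be careful when mixing sources: the Supervisor excerpt above shows both 07:25 and 09:25 stamps for what looks like the same startup, so its lines may not all use the same time zone.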

I’ve looked into all of them, but I’m not sure what to look out for. Host and Supervisor logs around the reboot time (9:28) are above.
HA Core logs begin at 9:26, right around the reboot time.
Is there a way to access the old logs before that time?

Kind regards, Pete

Which metrics for cpu and hdd should I post here exactly… there are a couple?

I mean any indicator, but for me it is sensor.processor_use.

There are several, and they mostly show the same thing, but they might give you some insights…
How much CPU load do you have? Can you make sure that CPU heat is not an issue?

Also, you can browse the logs if you dig deeper…

With this, maybe you can also find out whether your system restarts at all…because I still don’t know whether just the operating system (your HA OS) or your whole bare metal goes to Valhalla…

Which integrations are you using, and is there a way to “start over” on a fresh install and check whether:

  1. The issue persists
  2. The issue happens if you re-add your integrations by hand (without the restore - yes, it is going to be a pain in the ass)
  3. Your device has some issues, maybe?

But since it goes to hell after almost the same amount of time every time, I think something must be putting the device under heavy load, which then creates issues. Maybe a backup/copy job?
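If you note down the start times over a few days, you can check how regular the restarts really are. Here is a minimal Python sketch; the function names, the 60 second tolerance and the start times in the example are all made up for illustration.

```python
from datetime import datetime, timedelta


def restart_intervals(boot_times):
    """Seconds between consecutive system starts."""
    ordered = sorted(boot_times)
    return [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]


def looks_regular(boot_times, tolerance_s=60):
    """True if all intervals between starts agree to within tolerance_s seconds."""
    intervals = restart_intervals(boot_times)
    if len(intervals) < 2:
        return False  # not enough data to compare intervals
    return max(intervals) - min(intervals) <= tolerance_s


if __name__ == "__main__":
    # Made-up start times, roughly 18 h 12 min apart.
    first = datetime(2026, 4, 26, 9, 0, 0)
    boots = [
        first,
        first + timedelta(hours=18, minutes=12, seconds=10),
        first + timedelta(hours=36, minutes=24, seconds=28),
    ]
    print("intervals (s):", restart_intervals(boots))
    print("regular:", looks_regular(boots))
```

If the intervals match to within seconds, that would suggest something counting time since boot. A job that runs at a fixed time of day should instead make the restarts land at the same clock time.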