My previously super stable HA setup has started regularly restarting due to the main HA process getting a SIGSEGV.
Host OS is CentOS 10 ARM64 with all the latest updates. 4 CPU cores, 16 GB RAM, SSD storage.
HA is running in Docker with no resource limits applied. The issue started happening earlier today (on HA 2024.10.4); I updated to 2025.11.1 but the issue persists. All I have done today is edit a few existing automations. I’ve checked and double-checked them for any obvious problems but can’t see any, and in any case HA should not crash just because an automation contains an error. There are no obvious errors in the container log; the first sign of trouble is:
[19:00:31] INFO: Home Assistant Core finish process exit code 256
[19:00:31] INFO: Home Assistant Core finish process received signal 11
s6-rc: info: service legacy-services: stopping
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
services-up: info: copying legacy longrun home-assistant (no readiness notification)
s6-rc: info: service legacy-services successfully started
The host is not running low on memory or CPU, and disk space is absolutely fine too. I’m really not sure how I can diagnose this in the absence of anything useful in the logs.
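The only concrete idea I have so far is to make the next crash leave something behind: enable Python’s built-in fault handler and allow core dumps for the container. In docker compose terms it would look roughly like this (an untested sketch; the config path is a placeholder and I haven’t yet verified that the environment variable survives the s6 init scripts to reach the python3 process):

services:
  homeassistant:
    image: ghcr.io/home-assistant/home-assistant:stable
    environment:
      # makes CPython dump a Python-level traceback if it receives SIGSEGV
      - PYTHONFAULTHANDLER=1
    ulimits:
      # remove the core dump size limit inside the container
      core: -1
    volumes:
      - /path/to/config:/config
    restart: unless-stopped

Where any core file actually ends up is controlled by the host kernel’s core_pattern, so that would need to point somewhere writable as well.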
The odds of this are not high, but seeing segfaults on a previously stable system, and with software that is as stable as HA, is consistent with faulty RAM. But it would be odd to see the problem only show up in one application.
I agree with dominic.
Signal 11 is usually the kernel shutting the process down because of a memory request it cannot satisfy.
It can be due to bugs, but it can also be due to out-of-memory issues.
The 16 GB is just one limit on the memory.
Other limits include being unable to allocate a large enough contiguous block, or simply being unable to store more data on the stack or the heap.
You should check your automations and make sure there are no loops, or at least that there is no buildup of data in the loops.
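A related thing to check is each automation’s mode, since queued or parallel runs can pile up if an automation triggers faster than its actions finish. For example (placeholder names, not your actual automations):

- alias: Example automation
  # "single" (the default) ignores a new trigger while a run is still in
  # progress, so runs cannot accumulate; "queued" or "parallel" can build
  # up pending runs if the action takes longer than the trigger interval
  mode: single
  trigger:
    - platform: state
      entity_id: binary_sensor.example_sensor
      to: "on"
  action:
    - delay: "00:00:30"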
@d921 @WallyR So the docker ‘host’ is itself a VM. It is running under Parallels Desktop Professional on a Mac mini M4 Pro with 14 cores and 48 GB RAM. The macOS system and the VM are themselves completely stable (they are running lots of other stuff 24/7), and there is plenty of available memory at both the macOS and VM levels. I’m monitoring the virtual size and RSS of the HA core python3 process and it remains stable (with only the expected relatively small fluctuations) right up to the point when things go south.

A SEGV is typically caused by a process trying to access memory outside of its address space (most often through a NULL pointer returned from a previous memory allocation which wasn’t properly checked at the time). It can also be caused if the malloc arena gets corrupted by bad pointer use. AFAIK this should not be possible in a Python application unless (a) there is a bug in Python itself or (b) there is a bug in native code that Python is calling. I could be wrong though (I’m not a Python expert).
Interestingly, I have narrowed the issue down to the 4 automations that I modified yesterday. I made similar changes to all of them: I added some trigger ids and added some ‘Choose’ options using ‘triggered by’. The code in those options is the same as the code under the other ‘Choose’ options. There are no (obvious) loops, nor do the automations gather/accumulate data. If I disable those 4 automations then the system remains stable. If I have any of them enabled then the restarts occur, sometimes after a few minutes, sometimes after an hour or so (and anything in between).
Here’s the YAML for one of them (they are all very similar):
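(Entity and service names below are placeholders rather than my real ones; the overall shape is what matters:)

alias: Example automation 1                       # placeholder name
trigger:
  - platform: time_pattern
    hours: "*"
    minutes: "/2"
    id: every_2_minutes
  - platform: state
    entity_id: input_boolean.example_override     # placeholder entity
    to: "on"
    id: override_on
action:
  - choose:
      - conditions:
          - condition: trigger
            id: every_2_minutes
        sequence:
          - service: switch.turn_on               # placeholder action
            target:
              entity_id: switch.example_device
      - conditions:
          - condition: trigger
            id: override_on
        sequence:
          - service: switch.turn_off
            target:
              entity_id: switch.example_device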
I am uncertain whether this definition makes it trigger at each even minute during the hour, or whether it is still just a 2-minute interval from when the cronjob first started.
If it is the latter, then the hours key could be left out; that hours key is the only thing that seemed a bit odd to me.
Someone else might see some other interesting thing though.
Nothing to do with the question at hand, but I would consider using Docker Desktop or something similar to run the containers ‘closer’ to the Mac hardware. There are many layers of virtualization at play here.
Rick’s approach for troubleshooting sounds good btw.
@afsy_d
Well yes and no. I gave this a lot of thought and investigation before building out my infra.
On macOS ARM, Docker Desktop uses lightweight VMs to run containers.
macOS ARM virtualisation is pretty efficient and Parallels Desktop is built on that and is also pretty efficient (much more so than VirtualBox for example).
Docker Desktop has some significant restrictions compared to running docker natively under Linux. The biggest one is the lack of ‘host’ networking (though there are others).
Docker on Linux is not virtualised; it is more ‘compartmentalised’ using namespaces and cgroups etc., so no extra virtualisation layer is involved.
I need both macOS and Linux hosts for (many) reasons other than running HA (and other containers).
My setup gives me lots of flexibility with low overhead and great performance. It also gives me 2 levels of redundancy, as I can move specific containers to an identical VM running on my second (identical) macOS server. If need be I can move the entire VM to the other macOS server.
@mekaneck I have looked extensively at the automation traces and nothing stands out. Everything seems normal and then the main HA python process just SEGVs. Today I (a) updated my HA to 2025.11.2 and (b) spent some time ‘polishing’ one of the automations and that seemed to avoid the issue. I applied the same changes to a second one and that also seems okay. I’ll run just those 2 for 24 hours and if things are still good then I’ll make the same changes to the other two and re-enable those. Nothing I changed should make any real difference and it is quite concerning that HA can just SEGV for no discernible reason.
Except it really doesn’t. This is not a problem people commonly report. For sure, there is some (probably minuscule) chance that your automation code somehow put HA in a state that tripped it over some python bug that caused the process to segfault, but to me that chance seems a lot lower than a problem (quite possibly a bug) somewhere in the virtualization stack.
My strong suspicion is that the correlation with your automations is a red herring.
@afsy_d Well the only way to know that would be to run both setups identically in parallel for a few months. Not really possible / practical. I’ve been running with this setup for 2 years and HA has been rock solid (along with all the other stuff I run) until this issue occurred. Personally, based on my experience (40+ years a software engineer) I doubt that removing the full Linux VM, and replacing it with a trimmed down Linux VM (Docker Desktop) would make much difference to either stability or performance, even if it were possible (which it isn’t for me). But it is all conjecture of course.
@d921 The correlation may be a red herring (though it was 100% reproducible: disable the automations, no issue; enable them and the issue comes back within a relatively short time). Suggesting that there is more likely some incredibly subtle bug in the virtualisation stack, as opposed to in Python and its dependencies, seems a little far-fetched (though certainly not totally impossible). FWIW, I run a bunch of other Python stuff both directly in the Linux VM and also in other containers in the VM with zero issues, so there is no fundamental issue with Python in this environment. Anyhow, my ‘polishing’ (or maybe the upgrade to 2025.11.2) seems to be bearing fruit, so I will observe and react accordingly.
I’ve sadly been tripped up over the years by many low-level bugs that behaved in similar ways, i.e. with a reproducible correlation to some higher-level aspect. Over time, you learn that the nature of those kinds of low-level bugs is that they are surprisingly susceptible to exactly that sort of pattern.
FWIW, the reason I’m more inclined to think it’s the virtualization stack is that python and home assistant run closer to “bare-metal” (a term I hate but which is often just too convenient) without causing segfaults on a VERY large number of systems. By definition, it’s a lot more than the number of systems running those binaries on top of a more deeply virtualized configuration like yours.
Sure, it’s even possible that it’s a python bug that only manifests in a virtualized environment like this; but if I had to chase this down (and thank goodness I don’t), I wouldn’t start with python or hass.
You’re the one seeing this in action, so your own gut feeling is probably the way to go. Happy hunting. If you make additional observations, all of us are happy to try to help. And if you solve it, I hope you share your findings.
After polishing/slightly refactoring those 4 automations, and the update to 2025.11.2, things are now back to being completely stable. Sadly I still have no idea exactly what caused the issue. My suspicion (but with no hard evidence) is that something in the pattern of those automations tripped over some corner case in HA. Either the refactoring, or perhaps some changes/fixes in 2025.11.2, eliminated the issue. It’s frustrating not to have been able to get to the bottom of it, but at least I’m back to rock solid stability again now, which is the main thing.