HA Docker (2025.11.1) regularly restarting due to SIGSEGV

My previously super stable HA setup has started regularly restarting due to the main HA process getting a SIGSEGV.

Host OS is CentOS 10 ARM64 with all the latest updates. 4 CPU cores, 16 GB RAM, SSD storage.

HA is running in Docker. No resource limits applied. The issue started happening earlier today (HA 2024.10.4). I updated my HA to 2025.11.1 but the issue persists. All I have done today is edit a few existing automations. I’ve checked and double-checked them for any obvious issues but can’t see any. And anyway, HA should not crash if there is some error in an automation. There are no obvious errors in the container log; the first sign of trouble is:

[19:00:31] INFO: Home Assistant Core finish process exit code 256
[19:00:31] INFO: Home Assistant Core finish process received signal 11
s6-rc: info: service legacy-services: stopping
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
services-up: info: copying legacy longrun home-assistant (no readiness notification)
s6-rc: info: service legacy-services successfully started

The host is not running low on memory or CPU power. Disk space is absolutely fine too. I’m really not sure how to diagnose this in the absence of anything useful in the logs.

Are you seeing issues with other containers?

The odds of this are not high, but seeing segfaults on a previously stable system, and with software that is as stable as HA, is consistent with faulty RAM. But it would be odd to see the problem only show up in one application.

I agree with dominic.
Signal 11 is usually the kernel shutting the process down because it requested resources it could not get.
It can be due to bugs, but it can also be due to out-of-memory issues.
The 16 GB of RAM is just one limit on memory.
Other failures can come from being unable to allocate a large enough block of contiguous memory, or simply from being unable to store more data on the stack or the heap.
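
Just to illustrate that there is more than one limit in play, here is a contrived sketch (nothing to do with HA’s code, Linux only): cap the process’s own address space well below the host’s physical RAM and a single large allocation fails long before the 16 GB is anywhere near used up.

import resource

# Sketch only: limit this process to ~1 GB of virtual address space,
# far below the host's 16 GB of physical RAM.
resource.setrlimit(resource.RLIMIT_AS, (1024**3, 1024**3))

try:
    block = bytearray(4 * 1024**3)   # ask for one contiguous 4 GB block
except MemoryError:
    print("allocation refused even though the host has plenty of free RAM")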

You should check your automations and make sure there are no loops, or at least that there is no build-up of data in any loops.

@d921 @WallyR So the docker ‘host’ is itself a VM. It is running under Parallels Desktop Professional on a Mac mini M4 Pro with 14 cores and 48 GB RAM. The macOS system and the VM are themselves completely stable (they run lots of other stuff 24/7), and there is plenty of available memory at both the macOS and VM levels. I’m monitoring the virtual size and RSS of the HA core python3 process and it remains stable (with only the expected, relatively small fluctuations) right up to the point when things go south.

A SEGV is typically caused by a process trying to access memory outside its address space (most often through a NULL pointer returned from an earlier memory allocation that wasn’t checked at the time). It can also be caused if the malloc arena gets corrupted by bad pointer use. AFAIK this should not be possible in a Python application unless (a) there is a bug in Python itself or (b) there is a bug in native code that Python is calling. I could be wrong though (I’m not a Python expert).
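
To make that concrete, here is a deliberately contrived sketch (nothing to do with HA’s actual code): a bug in pure Python surfaces as an exception and the interpreter carries on, whereas a bad pointer dereference in native code reached from Python takes the whole process down with signal 11, exactly like the log above.

import ctypes
import faulthandler

faulthandler.enable()   # if a fatal signal arrives, dump the Python tracebacks first

# A pure-Python mistake is caught and reported; the interpreter survives.
try:
    {}["missing"]
except KeyError:
    print("pure Python bug -> exception, process keeps running")

# A NULL-pointer read inside native code kills the process with SIGSEGV.
ctypes.string_at(0)     # read a C string from address 0 -> signal 11

(Running the container with PYTHONFAULTHANDLER=1 might at least show which thread was executing when the crash hits; I haven’t tried that yet.)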

Interestingly, I have narrowed the issue down to the 4 automations that I modified yesterday. I made similar changes to all of them: I added some trigger ids and added some ‘Choose’ options using ‘triggered by’. The code in those options is the same as the code under other ‘Choose’ options. There are no (obvious) loops, nor do the automations gather/accumulate data. If I disable those 4 automations the system remains stable. If I have any of them enabled, the restarts occur; sometimes after a few minutes, sometimes after an hour or so (and anything in between).

Here’s the YAML for one of them (they are all very similar):

alias: "LIGHTING: Hall Lamp control"
description: ""
triggers:
  - trigger: state
    entity_id:
      - input_boolean.hall_lighting_auto
  - trigger: state
    entity_id:
      - sensor.hall_occupancy
    id: hldelaystart
    from: "on"
    to: "off"
  - trigger: state
    entity_id:
      - timer.hall_occupancy_delay
    id: hldelayend
    from: active
    to: idle
  - trigger: state
    entity_id:
      - sensor.hall_occupancy
  - trigger: state
    entity_id:
      - sensor.hall_illuminance
  - trigger: state
    entity_id:
      - sensor.number_of_people_at_home
  - trigger: state
    entity_id:
      - timer.hall_lamp_on
  - trigger: state
    entity_id:
      - timer.hall_lamp_off
  - trigger: time_pattern
    hours: "*"
    minutes: /2
    enabled: true
conditions:
  - condition: state
    entity_id: input_boolean.hall_lighting_auto
    state: "on"
  - condition: not
    conditions:
      - condition: or
        conditions:
          - condition: state
            entity_id: sensor.hall_occupancy
            state: unavailable
          - condition: state
            entity_id: sensor.hall_occupancy
            state: unknown
          - condition: state
            entity_id: sensor.hall_illuminance
            state: unavailable
          - condition: state
            entity_id: sensor.hall_illuminance
            state: unknown
          - condition: state
            entity_id: binary_sensor.switch_hall_lamp
            state: unknown
          - condition: state
            entity_id: binary_sensor.switch_hall_lamp
            state: unavailable
actions:
  - choose:
      - conditions:
          - condition: trigger
            id:
              - hldelaystart
        sequence:
          - action: timer.start
            metadata: {}
            data: {}
            target:
              entity_id: timer.hall_occupancy_delay
      - conditions:
          - condition: trigger
            id:
              - hldelayend
          - condition: or
            conditions:
              - condition: time
                after: "22:15:00"
              - condition: time
                before: "06:30:00"
          - condition: state
            entity_id: sensor.hall_occupancy
            state: "off"
          - condition: or
            conditions:
              - condition: state
                entity_id: binary_sensor.switch_hall_lamp
                state: "on"
              - condition: state
                entity_id: input_boolean.hall_lamp_interlock
                state: "on"
        sequence:
          - action: input_boolean.turn_off
            metadata: {}
            data: {}
            target:
              entity_id: input_boolean.hall_lamp_interlock
          - action: switch.turn_off
            metadata: {}
            data: {}
            target:
              entity_id: switch.hall_lamp_switch
          - action: timer.start
            metadata: {}
            data: {}
            target:
              entity_id:
                - timer.hall_lamp_off
      - conditions:
          - condition: state
            entity_id: sensor.hall_occupancy
            state: "on"
          - condition: state
            entity_id: timer.hall_occupancy_delay
            state: active
          - condition: or
            conditions:
              - condition: state
                entity_id: input_boolean.hall_lamp_interlock
                state: "on"
              - condition: state
                entity_id: binary_sensor.switch_hall_lamp
                state: "on"
        sequence:
          - action: timer.cancel
            metadata: {}
            data: {}
            target:
              entity_id: timer.hall_occupancy_delay
          - action: timer.start
            metadata: {}
            data: {}
            target:
              entity_id: timer.hall_occupancy_delay
      - conditions:
          - condition: time
            after: "06:29:00"
            before: "22:16:00"
        sequence:
          - choose:
              - conditions:
                  - condition: numeric_state
                    entity_id: sensor.hall_illuminance
                    below: sensor.hall_lamp_on_threshold
                  - condition: numeric_state
                    entity_id: sensor.number_of_people_at_home
                    above: 0
                  - condition: state
                    entity_id: timer.hall_lamp_off
                    state: idle
                  - condition: or
                    conditions:
                      - condition: state
                        entity_id: binary_sensor.switch_hall_lamp
                        state: "off"
                      - condition: state
                        entity_id: input_boolean.hall_lamp_interlock
                        state: "off"
                sequence:
                  - action: input_boolean.turn_on
                    metadata: {}
                    data: {}
                    target:
                      entity_id: input_boolean.hall_lamp_interlock
                  - action: switch.turn_on
                    metadata: {}
                    data: {}
                    target:
                      entity_id: switch.hall_lamp_switch
                  - action: timer.start
                    metadata: {}
                    data: {}
                    target:
                      entity_id: timer.hall_lamp_on
              - conditions:
                  - condition: numeric_state
                    entity_id: sensor.hall_illuminance
                    above: sensor.hall_lamp_off_threshold
                  - condition: state
                    entity_id: timer.hall_lamp_on
                    state: idle
                  - condition: or
                    conditions:
                      - condition: state
                        entity_id: binary_sensor.switch_hall_lamp
                        state: "on"
                      - condition: state
                        entity_id: input_boolean.hall_lamp_interlock
                        state: "on"
                sequence:
                  - action: input_boolean.turn_off
                    metadata: {}
                    data: {}
                    target:
                      entity_id: input_boolean.hall_lamp_interlock
                  - action: switch.turn_off
                    metadata: {}
                    data: {}
                    target:
                      entity_id: switch.hall_lamp_switch
                  - action: timer.start
                    metadata: {}
                    data: {}
                    target:
                      entity_id:
                        - timer.hall_lamp_off
      - conditions:
          - condition: or
            conditions:
              - condition: time
                after: "22:15:00"
              - condition: time
                before: "06:30:00"
        sequence:
          - choose:
              - conditions:
                  - condition: state
                    entity_id: sensor.hall_occupancy
                    state: "on"
                  - condition: numeric_state
                    entity_id: sensor.hall_illuminance
                    below: sensor.hall_lamp_on_threshold
                  - condition: or
                    conditions:
                      - condition: state
                        entity_id: binary_sensor.switch_hall_lamp
                        state: "off"
                      - condition: state
                        entity_id: input_boolean.hall_lamp_interlock
                        state: "off"
                sequence:
                  - action: input_boolean.turn_on
                    metadata: {}
                    data: {}
                    target:
                      entity_id: input_boolean.hall_lamp_interlock
                  - action: switch.turn_on
                    metadata: {}
                    data: {}
                    target:
                      entity_id: switch.hall_lamp_switch
                  - action: timer.start
                    metadata: {}
                    data: {}
                    target:
                      entity_id:
                        - timer.hall_lamp_on
              - conditions:
                  - condition: state
                    entity_id: sensor.hall_occupancy
                    state: "off"
                  - condition: state
                    entity_id: timer.hall_occupancy_delay
                    state:
                      - idle
                  - condition: or
                    conditions:
                      - condition: state
                        entity_id: binary_sensor.switch_hall_lamp
                        state: "on"
                      - condition: state
                        entity_id: input_boolean.hall_lamp_interlock
                        state: "on"
                sequence:
                  - action: input_boolean.turn_off
                    metadata: {}
                    data: {}
                    target:
                      entity_id: input_boolean.hall_lamp_interlock
                  - action: switch.turn_off
                    metadata: {}
                    data: {}
                    target:
                      entity_id: switch.hall_lamp_switch
                  - action: timer.start
                    metadata: {}
                    data: {}
                    target:
                      entity_id:
                        - timer.hall_lamp_off
mode: single

Does anything leap out at you?

I see nothing special in your automation.

I only lingered a bit on this part:

  - trigger: time_pattern
    hours: "*"
    minutes: /2

I am uncertain whether this definition makes it trigger at every even minute of the hour, or whether it is just a 2-minute interval from whenever the trigger first started.
If it is the latter, the hours key could be left out; that hours key is the only thing that looked a bit odd to me.
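
If it works the way divisor matching usually does (just my assumption, I have not checked HA’s source), the /2 would be compared against the wall-clock minute, roughly like this sketch, and the hours key would add nothing:

from datetime import datetime

def minute_pattern_matches(pattern: str, now: datetime) -> bool:
    # Rough sketch of typical "/N" divisor matching; not HA's actual code.
    if pattern == "*":
        return True
    if pattern.startswith("/"):
        return now.minute % int(pattern[1:]) == 0   # aligned to the clock: :00, :02, :04, ...
    return now.minute == int(pattern)

print(minute_pattern_matches("/2", datetime(2025, 11, 8, 19, 4)))   # True, minute 4
print(minute_pattern_matches("/2", datetime(2025, 11, 8, 19, 5)))   # False, minute 5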

Someone else might see some other interesting thing though. :slight_smile:

Can you download and share the last trace prior to your system becoming unstable?

I would disable 3 of the 4 automations, wait until the system becomes unstable, and then share the last trace of the remaining active automation.

Nothing to do with the question at hand, but I would consider using Docker Desktop or something similar to run the containers ‘closer’ to the Mac hardware. There are many layers of virtualization at play here.

Rick’s approach for troubleshooting sounds good btw.

@afsy_d
Well yes and no. I gave this a lot of thought and investigation before building out my infra.

  1. On macOS ARM, Docker Desktop uses lightweight VMs to run containers.

  2. macOS ARM virtualisation is pretty efficient, and Parallels Desktop is built on top of it and is similarly efficient (much more so than VirtualBox, for example).

  3. Docker Desktop has some significant restrictions compared to running docker natively under Linux. The biggest one is the lack of ‘host’ networking (though there are others).

  4. Docker on Linux is not virtualised; it is ‘compartmentalised’ using cgroups, namespaces, etc., so no extra virtualisation layer is involved (see the sketch after this list).

  5. I need both macOS and Linux hosts for (many) reasons other than running HA (and other containers).
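
As a quick illustration of point 4 (just a sketch you can run inside any Linux container, nothing HA-specific): the containerised process is an ordinary process on the host kernel, simply placed into its own cgroups/namespaces, with no guest kernel underneath it.

from pathlib import Path

# Inside a container this shows which cgroup the process was placed in...
print(Path("/proc/self/cgroup").read_text().strip())

# ...and the kernel version is the VM's own kernel, because a container
# does not bring a separate guest kernel with it.
print(Path("/proc/version").read_text().strip())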

My setup gives me lots of flexibility with low overhead and great performance. It also gives me 2 levels of redundancy, as I can move specific containers to an identical VM running on my second (identical) macOS server. If need be, I can move the entire VM to the other macOS server.

@mekaneck I have looked extensively at the automation traces and nothing stands out: everything seems normal and then the main HA python process just SEGVs. Today I (a) updated my HA to 2025.11.2 and (b) spent some time ‘polishing’ one of the automations, and that seems to have avoided the issue. I applied the same changes to a second one and that also seems okay. I’ll run just those 2 for 24 hours and, if things are still good, I’ll make the same changes to the other two and re-enable those. Nothing I changed should make any real difference, and it is quite concerning that HA can just SEGV for no discernible reason.

Except it really doesn’t. This is not a problem people commonly report. For sure, there is some (probably minuscule) chance that your automation code somehow put HA into a state that tripped some Python bug and caused the process to segfault, but to me that chance seems a lot lower than a problem (quite possibly a bug) somewhere in the virtualization stack.

My strong suspicion is that the correlation with your automations is a red herring.

OK, it seems like it was well thought through. I am still curious whether it would run more stably directly on the hardware.

@afsy_d Well, the only way to know that would be to run both setups identically in parallel for a few months, which is not really possible or practical. I’ve been running with this setup for 2 years and HA has been rock solid (along with all the other stuff I run) until this issue occurred. Personally, based on my experience (40+ years as a software engineer), I doubt that removing the full Linux VM and replacing it with a trimmed-down one (Docker Desktop) would make much difference to either stability or performance, even if it were possible (which it isn’t for me). But it is all conjecture of course.

@d921 The correlation may be a red herring, though it was 100% reproducible: disable the automations and there is no issue; enable them and the issue comes along within a relatively short time. Suggesting that some incredibly subtle bug in the virtualisation stack is more likely than one in Python and its dependencies seems a little far-fetched (though certainly not totally impossible). FWIW, I run a bunch of other Python stuff, both directly in the Linux VM and in other containers in the VM, with zero issues, so there is no fundamental issue with Python in this environment. Anyhow, my ‘polishing’ (or maybe the upgrade to 2025.11.2) seems to be bearing fruit, so I will observe and react accordingly.

I’ve sadly been tripped up over the years by many low-level bugs that behaved in similar ways, i.e. with a reproducible correlation to some higher-level aspect. Over time you learn that low-level bugs of that kind are surprisingly prone to producing patterns like that.

FWIW, the reason I’m more inclined to think it’s the virtualization stack is that python and home assistant run closer to “bare-metal” (a term I hate but which is often just too convenient) on a VERY large number of systems without causing segfaults. By definition, that is a lot more systems than the number running those binaries on top of a more deeply virtualized configuration like yours.

Sure, it’s even possible that it’s a python bug that only manifests in a virtualized environment like this; but if I had to chase this down (and thank goodness I don’t), I wouldn’t start with python or hass.

You’re the one seeing this in action, so your own gut feeling is probably the way to go. Happy hunting. If you make additional observations, all of us are happy to try to help. And if you solve it, I hope you share your findings.

So just to wrap this up…

After polishing/slightly refactoring those 4 automations, and the update to 2025.11.2, things are now back to being completely stable. Sadly I still have no idea exactly what caused the issue. My suspicion (but with no hard evidence) is that something in the pattern of those automations tripped over some corner case in HA. Either the refactoring, or perhaps some changes/fixes in 2025.11.2, eliminated the issue. It’s frustrating not to have been able to get to the bottom of it, but at least I’m back to rock solid stability again now, which is the main thing.