What is Home Assistant's concurrency model?

fuatakgun · September 20, 2024, 8:02pm

These things cannot run concurrently, given that there is one thread executing them, right? Unless one of them yields, it runs to completion before anything else gets a turn.

What the user perceives is different from what is happening behind the scenes during execution.

Anyway, I am telling you how it works, not what is right or wrong. Feel free to test and share your findings with us.

petro · September 20, 2024, 9:19pm

Yep, it creates (and destroys) worker threads when it needs them with a limit of 64.

One of the automations will hit it first. It will be a race condition for which one gets it.

Try it out yourself. Very easy test.

```yaml
a_test:
  variables:
    name: A
  # The anchor lets b_test reuse exactly the same steps.
  sequence: &sequence
  # Check the flag, then set it: two separate steps.
  - if: "{{ is_state('input_boolean.mutex', 'off') }}"
    then:
    - action: input_boolean.turn_on
      target:
        entity_id: input_boolean.mutex
    - action: persistent_notification.create
      data:
        title: "{{ name }}"
        message: "{{ name }} got there first."

b_test:
  variables:
    name: B
  sequence: *sequence

mutex_test:
  sequence:
  # Start both scripts in parallel and see which one claims the flag.
  - parallel:
    - action: script.turn_on
      target:
        entity_id: script.a_test
    - action: script.turn_on
      target:
        entity_id: script.b_test
```

I ran this 10 times. A won every time; B never produced a notification.

You can see in the trace that A completed the if statement. And B (because it lost the race) never had a positive hit.

ChrisJ60 · September 21, 2024, 12:23pm

Well, I am not a Python programmer (and have no intention of learning now after 40+ years of C and Java). This kind of information is very important and should be spelt out clearly in the documentation. It should not require someone to study the HA source code to find out.

ChrisJ60 · September 21, 2024, 12:28pm

That’s somewhat encouraging. Sadly from experience race conditions can be very hard to provoke/find. If there were no failures after 100,000 iterations it would be more comforting (though still not definitive; I’ve seen race conditions that only show up once after months of continuous running).

It seems no one can make an authoritative statement on this topic, which really should be spelt out clearly in the documentation, as it is very important.

petro · September 21, 2024, 12:43pm

We are making authoritative statements; you aren’t accepting them and are being doubtful.

If you want to run a 100,000-iteration test, by all means do it, so that you’ll accept the answers and we can move on.

```yaml
a_test:
  variables:
    name: A
  sequence: &sequence
  - if: "{{ is_state('input_boolean.mutex', 'off') }}"
    then:
    - action: input_boolean.turn_on
      target:
        entity_id: input_boolean.mutex
    # Log the round number and the winner.
    - action: system_log.write
      data:
        message: "{{ index }}: {{ name }} got there first."

b_test:
  variables:
    name: B
  sequence: *sequence

mutex_test:
  sequence:
  - repeat:
      until:
      - condition: template
        value_template: "{{ repeat.index == 100000 }}"
      sequence:
      # Reset the flag before each round.
      - action: homeassistant.turn_off
        target:
          entity_id: input_boolean.mutex
      - parallel:
        - action: script.turn_on
          target:
            entity_id: script.a_test
          data:
            variables:
              index: "{{ repeat.index }}"
        - action: script.turn_on
          target:
            entity_id: script.b_test
          data:
            variables:
              index: "{{ repeat.index }}"
```

arturpragacz · September 21, 2024, 7:07pm

He is right to be doubtful.

You can run the automation however many times you want; it is no proof whatsoever.

In order to prove there is no race condition, you have to actually prove there is no race condition. Hand waving is no proof.

arturpragacz · September 21, 2024, 7:08pm

If the documentation doesn’t state that something is atomic, then you should not assume it is atomic. So it actually is specified by not being specified.

If you want to know whether the current implementation happens to behave in some specific way, you have to look at the source code, like I already told you.

nickrout · September 22, 2024, 3:48am

So hard that no one seems to have found one yet.

mirekmal · September 22, 2024, 9:10am

Well, actually, some time ago I was hit by a race condition in 2 of my automations. I do not recall exactly what the case was, but I spent countless hours troubleshooting before I realized what was causing the issues. Since then I always consider whether there might be a race condition for any new automation I create, and I either combine it with another automation that uses the same trigger or add a safety helper that does not allow the race to occur.
One thing I want to say: in my case there was a ~30/70 chance of the automation failing due to the race, not 1/100,000… so it is possible.
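
A minimal sketch of the "combine it with the other automation that uses the same trigger" idea, assuming two automations that used to fire on the same motion sensor. All entity IDs and the alias are hypothetical. Putting both sets of steps into one sequence fixes their order within a run.

```yaml
# Hypothetical combined automation; entity IDs are placeholders.
- alias: Living room lighting, combined (hypothetical)
  mode: queued   # overlapping triggers are run one after another
  trigger:
  - platform: state
    entity_id: binary_sensor.hypothetical_living_room_motion
    to: "on"
  action:
  # Step that used to live in the first automation.
  - action: scene.turn_on
    target:
      entity_id: scene.hypothetical_whole_room
  # Step that used to live in the second automation.
  # Within one run it now always executes after the step above.
  - action: persistent_notification.create
    data:
      title: Living room
      message: Scene applied.
```

This only orders steps inside this one automation. Other automations or scripts that touch the same entities are not covered by it.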

PeteRage · September 22, 2024, 9:28am

It does not. An input boolean will not work, as the checking of the value and setting the value are two operations that do not happen atomically. If you really need this you could write an integration that exposes a “mutex” with appropriate actions.

The automation modes and script modes do work and can enforce sequencing if all the logic is within the same automation / scripts.
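
One way to read "all the logic is within the same script" is sketched below: the check and the change live in a single script that runs in `queued` mode, and every caller goes through it. The script name `claim_mutex` and the field `name` are hypothetical, and `input_boolean.mutex` is borrowed from the test earlier in the thread. This is a sketch under those assumptions, not a statement about how HA implements the modes internally.

```yaml
# Hypothetical script: the only place where the flag is checked and set.
claim_mutex:
  mode: queued   # a new call waits until earlier runs have finished
  max: 10        # limit on running plus queued runs
  fields:
    name:
      description: Label of the caller, used only in the log message.
  sequence:
  - if: "{{ is_state('input_boolean.mutex', 'off') }}"
    then:
    - action: input_boolean.turn_on
      target:
        entity_id: input_boolean.mutex
    - action: system_log.write
      data:
        message: "{{ name }} got there first."
```

A caller would then use the script as an action. Calling `script.claim_mutex` directly waits for the run to finish, whereas `script.turn_on` starts it and moves on:

```yaml
- action: script.claim_mutex
  data:
    name: A
```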

ChrisJ60 · September 22, 2024, 9:49am

Thanks for confirming that it is possible (as I surmised). What kind of safety helper do you use, since I am struggling to think of something that is 100% guaranteed (like an OS mutex) in the context of how HA works?

Mariusthvdb · September 22, 2024, 12:48pm

Besides the theoretical exercise (which I follow with interest), is there in fact an actual issue you’ve run into and need to solve?

I mean, if you could provide that real-life example, HA devs might be willing to have a look and see what could or needs to be done.

mirekmal · September 22, 2024, 2:50pm

As I recall (I later rewrote these automations from the ground up), one of the automations had higher priority for me, so I used an input_boolean helper to indicate that it had started (and cleared it at the end). Then in the second automation I added a small delay of 0.5s as the first step, and in the second step a condition that checks whether this helper is set and continues only after it is cleared.
I was using these automations to control the lighting in my living room, where I had ‘2.5’ different modes: one mode with manually set whole-room scenes, and a second with different subarea scenes triggered by motion and presence sensors. On top of that I could manually switch different lights on/off, which in turn prevented some of the automations from being disabled. This second case plus manual light control was causing some issues, as it could lead to conflicting scenes being activated, depending on the helpers' statuses.
That was a bit of weird logic, I know, but that was when I had just started with HA… later these automations were simplified and combined into a single automation running all steps in the proper order.
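
The helper-plus-delay arrangement described above might look roughly like this. It is a reconstruction from the description only: the original automations are not shown in the thread, so the aliases, the trigger, and every entity ID here are hypothetical.

```yaml
# Higher-priority automation: raise the flag, do the work, lower the flag.
- alias: Priority scene (hypothetical)
  trigger:
  - platform: state
    entity_id: binary_sensor.hypothetical_living_room_motion
    to: "on"
  action:
  - action: input_boolean.turn_on
    target:
      entity_id: input_boolean.hypothetical_priority_running
  - action: scene.turn_on
    target:
      entity_id: scene.hypothetical_whole_room
  - action: input_boolean.turn_off
    target:
      entity_id: input_boolean.hypothetical_priority_running

# Lower-priority automation: give the other one a head start, then wait.
- alias: Secondary scene (hypothetical)
  trigger:
  - platform: state
    entity_id: binary_sensor.hypothetical_living_room_motion
    to: "on"
  action:
  - delay:
      milliseconds: 500
  # Continue only once the flag is cleared; give up after 30 seconds.
  - wait_template: "{{ is_state('input_boolean.hypothetical_priority_running', 'off') }}"
    timeout: "00:00:30"
    continue_on_timeout: false
  - action: scene.turn_on
    target:
      entity_id: scene.hypothetical_subarea
```

The 0.5s delay makes it likely that the priority automation has set the flag before the second one checks it. It narrows the window but does not guarantee the ordering.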

ChrisJ60 · September 22, 2024, 2:50pm

@Mariusthvdb My HA setup currently has 181 automations, 27 scripts and 294 helpers. Many of these things are just for monitoring/alerting (so not an issue) but I also have mechanisms implemented for active control of:

Hot Water - immersion heater and gas boiler.

Individual room heating - via presence based control.

Powerwall export mode - solar or everything.

Powerwall automatic backup reserve adjustment.

Powerwall charging control based on if my EV is charging or not.

Powerwall automatic on/off grid control - to optimise use versus export of solar power based on multiple factors.

Powerwall off-peak charging control.

These each involve multiple automations with different (but potentially overlapping in some cases) triggers and scripts which examine many sensors (many are shared between different automations and scripts) and modify the state of multiple entities, many of which are again shared between (i.e. can be modified by) several different automations and scripts.

This mostly works very well but occasionally I see unusual/unexpected behaviour. Sometimes this is due to bugs in my logic/code but occasionally it is hard to pinpoint an obvious cause. Given my setup, the potential for race conditions is theoretically significant, so I was trying to understand how likely this might actually be based on HA’s architecture and automation/script execution concurrency model and hence whether I need to guard against it myself (and if so how). Sadly the answer seems to be ‘no one can really say’, though it does seem that HA is not immune to race conditions in user code.

I’ve looked into trying to reduce the number of automations/scripts but the downside is a significant increase in complexity (and hence an increased scope for errors and also more difficult to maintain).

Based on my 40+ years experience as a software engineer working on complex high concurrency systems I am only too aware of the havoc that race conditions and incorrect mutual exclusion control can cause and how hard it can be to track them down.

I guess I shall just have to hope that the remaining occasional glitches are bugs in my code and not due to race conditions between ‘concurrently executing’ automations/scripts.

As to what could be done, a great first step would be to provide detailed information, in the documentation, as to what isolation/concurrency controls/guarantees are, or are not, provided by HA in respect of concurrent execution of different automations and scripts (the modes for multiple instances of the same automation or script seem to be fairly well documented).

If that detail reveals that race conditions are possible then a second step would be to provide detailed coding guidelines for best practice to avoid (or at least minimise) the possibility of concurrency issues and maybe also for HA to provide some mechanism to help avoid them, such as mutexes for example, though I do foresee potential issues with allowing user level mutexes in the context of HA’s asyncio implementation and execution paradigm.

My hope is that the HA devs could provide details and explanations to show that race conditions in user code are simply not possible (but I am not overly hopeful).

arturpragacz · September 22, 2024, 4:55pm

That is obviously not physically possible. As long as automations are concurrent, there is always the possibility of a race condition by definition.

ChrisJ60 · September 22, 2024, 5:11pm

Yes of course. What I meant is that the HA devs might explain that HA takes care to prevent race conditions behind the scenes (which is possible, though non-trivial, and would likely hinder performance and scalability). I do not think that it does anything specific in this regard (hence my comment about not being hopeful).

Mariusthvdb · September 22, 2024, 5:49pm

Well, tbh, this is not really a particular issue, but more a philosophical description of ifs and would be’s. You seem to have a very modest install, so no reason to suspect anything special there.

I mean, it’s not until you provide an actual case (as in yaml automation/script etc etc) where you suspect some concurrency issue that we can have a go at analyzing it.

ChrisJ60 · September 23, 2024, 10:18am

Yes, for sure. As I originally said, at the moment I have no more than some suspicions/concerns. I certainly don’t expect folks here to look at a bunch of complex scripts/code and try to debug possible race conditions for me without any evidence that one exists.

Let’s go back to my original questions and the reason for posing them…

So far I have created a ‘modest’ number of helpers, scripts and automations. My knowledge and understanding of HA increased as I went through that process. Some of the automations/scripts are moderately complex (from my perspective) and there are, in some cases, potential interactions between different automations/scripts based on entities they examine/modify and it is definitely possible that in some scenarios some of these automations/scripts might be running ‘concurrently’ (to whatever degree HA permits, or does not permit).

Nonetheless, things are mostly working well but occasionally I see unexpected behaviour and need to figure out what causes it (in order to fix it). Mostly I (eventually) find bugs in my code/logic but occasionally I cannot find a cause (it may still be a bug in my code of course). Based on my experience of concurrent systems I wondered if race conditions might potentially occur in HA user code (between different scripts and/or automations executing ‘concurrently’) so I went looking online (HA doc, search etc.) for any details or references but surprisingly there is virtually nothing out there relating to race conditions in HA user code (as opposed to race conditions within HA code itself or user code race conditions due to HA bugs). This is very surprising, given how important a topic this is. There is a nod towards it in the relatively new ‘execution mode’ feature for individual automations/scripts but the issue of race conditions between different automations/scripts seems largely to be ignored.

So, I posed my original questions to try to understand if race conditions are something that can occur in HA user code or not, and if they can occur how do people mitigate them. A fairly simple question which should be clearly covered in the HA docs (even if it is just to say that HA takes care of it and you don’t need to worry about it [though I don’t believe that to be the case]). I was expecting some clear and authoritative answers but instead I got, well, just look at the thread…

Seriously, does no one know the answer to this? Not even the HA devs?

petro · September 23, 2024, 10:53am

A clear definitive answer is: Yes, race conditions will occur. To avoid them, we’ve used input_booleans or used a template condition that checks the last time this (or other) automations triggered.
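
A template condition of that kind could be sketched as follows, assuming the check is against another automation's `last_triggered` attribute and a 5-second quiet period. The alias, both entity IDs, and the 5-second value are hypothetical.

```yaml
# Hypothetical automation that holds back if another one fired recently.
- alias: Guarded automation (hypothetical)
  trigger:
  - platform: state
    entity_id: sensor.hypothetical_solar_power
  condition:
  - condition: template
    # Pass if the other automation has never triggered,
    # or last triggered more than 5 seconds ago.
    value_template: >-
      {% set last = state_attr('automation.hypothetical_other_automation', 'last_triggered') %}
      {{ last is none or (now() - last) > timedelta(seconds=5) }}
  action:
  - action: system_log.write
    data:
      message: "Other automation has been quiet for 5 s, continuing."
```

Like the input_boolean approach, this is a check followed by a separate action, so it reduces the chance of overlap rather than ruling it out.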

ChrisJ60 · September 23, 2024, 11:29am

Okay good. Race conditions can exist. Next question is how can using an input_boolean as a ‘mutex substitute’ resolve this? Unless HA is doing clever stuff under the covers, testing and then setting an input_boolean is not atomic and so is subject to the same potential for race conditions as anything else. Mutexes, and their lower level equivalents, require special implementation, including an atomic test and set operator, in order to function correctly. Will putting the testing and setting of the input_boolean into a script with the ‘single’ attribute be sufficient (assuming all test and set operations for the mutex are done only via that script)?