Automation that causes undesired never-ending server restart cycle

louavzzniibmodjork · November 20, 2023, 11:03am

Point of post: To answer the question “Is there something wrong deep within Home Assistant, some protections that can be added, or am I just doing something dumb?”

I’ve got an automation, or an automation/integration combo, that appears to send Home Assistant into a cycle of restarts that is difficult to recover from.

Integration:

Automation:

alias: Update OpenUV
description: ""
trigger:
  - platform: time_pattern
    minutes: /10
condition:
  - condition: sun
    after: sunrise
    before: sunset
    before_offset: "+00:45:00"
  - condition: template
    value_template: >-
      {% set updates_per_day = 30 %}

      {% set automation_last_triggered = state_attr('automation.update_openuv',
      'last_triggered') %}

      {% set next_setting_as_today = state_attr("sun.sun", "next_setting") |
      as_timestamp | timestamp_custom("%H:%M") | today_at %}

      {% set next_rising_as_today = state_attr("sun.sun", "next_rising") |
      as_timestamp | timestamp_custom("%H:%M") | today_at %}

      {% set hours_to_update = (next_setting_as_today - next_rising_as_today) |
      today_at | as_timestamp | timestamp_custom("%H") | int + 1 %}

      {% set updates_per_hour = updates_per_day / hours_to_update %}

      {% set minutes_per_update = (60 / updates_per_hour) | int + 1 %}

      {{
        automation_last_triggered == None
        or (
          now() - state_attr('automation.update_openuv', 'last_triggered')
        ) >= timedelta(hours = 0, minutes = minutes_per_update)
      }}
action:
  - service: homeassistant.update_entity
    target:
      entity_id:
        - sensor.openuv_current_uv_index
    data: {}
mode: single

Causes this:

Well, once I was finally able to disable the script and had rebooted numerous times, the above history settled down. No other changes were made aside from the disable and reset (after having to do this multiple times while debugging, I narrowed it down to this).

The above script is similar to the one in the docs, except that it attempts to approximate the length of the current day’s daylight hours and split up the quota to fit within that timespan. This is different from the example in the docs which takes the longest daylight hours and uses that throughout the year.
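To make the arithmetic in that template condition concrete, here is a rough sketch in plain Python (not Home Assistant template code). The function name `minutes_per_update` and the sample sunrise/sunset times are made up for illustration. The steps mirror the template: take the whole hours of daylight plus one, spread the 30 updates across them, and pad the resulting gap by a minute.

```python
from datetime import date, datetime, time


def minutes_per_update(sunrise: time, sunset: time, updates_per_day: int = 30) -> int:
    """Mirror the template's arithmetic for the minimum gap between updates.

    Hypothetical helper for illustration only; it is not part of Home Assistant.
    """
    day = date(2023, 11, 20)  # arbitrary date, only the times of day matter
    daylight = datetime.combine(day, sunset) - datetime.combine(day, sunrise)
    # The template keeps only the hour component of the difference, then adds 1.
    hours_to_update = daylight.seconds // 3600 + 1
    updates_per_hour = updates_per_day / hours_to_update
    # int() truncates like the Jinja `int` filter; the + 1 pads the gap.
    return int(60 / updates_per_hour) + 1


# Sample values (assumed): sunrise 07:00, sunset 16:30, so 9h30m of daylight.
print(minutes_per_update(time(7, 0), time(16, 30)))  # 21
```

With these sample times the condition would let an update through at most once every 21 minutes, so only every second or third 10-minute trigger would pass.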

Aside from restarting the Home Assistant server every 10 minutes (at the beginning) and then every 2 minutes at its most frequent, there are a number of factors which make this difficult to recover from.

Disabling the Integration doesn’t stop the restarts.

When the server comes back up, simply disabling the automation doesn’t work. It appears to be disabled, but then the server restarts and the automation is enabled again.

I think the way to get around this is to disable the automation and then quickly “Shut down the server” from the restart menu options. It didn’t work the first time for me, and I had to repeat it to get the timing right.

Disabling parts of the automation does stay set between resets, but doesn’t prevent the next server restart from happening.

Another compounding factor is that every time the server restarts it uses up 1 more of the quota. When the quota has been used up, the entities are no longer considered to be provided by the Integration. This results in additional errors in the automation, where it says that it can’t update “sensor.openuv_current_uv_index” because it doesn’t exist. I don’t think this contributes to the above issue, since the quota resets every day and there will be a period of time where the entity does exist.

I haven’t counted (or checked the quota) to know if the drop from 10-minute restarts to 2-minute restarts is related to the entity no longer existing.

In the action I tried automation_last_triggered as a variable and inline to see if there was a difference and both caused server restarts.

I wouldn’t have expected this automation to have such a profound impact on the server.

Is there something wrong deep within Home Assistant, some protections that can be added, or am I just doing something dumb?

boheme61 · November 20, 2023, 12:28pm

In what way is it “similar” ?

You are “setting and calculating” (every /10 min), whereas the example is just “getting/comparing” the automation’s last_updated_state.

Have you tried that template in the template editor, every /10 min?

~~I think you have to call that script, in the action, and have the script call the service:~~
~~homeassistant.update_entity~~

EDIT: or maybe use a “helper”

On the other hand, I think maybe you are “overdoing it”. If you live in the northern or southern part of the world, the daylight varies more between winter and summer. A maximum of 50 requests a day is mentioned; with a 10 min interval that covers about 8 hours (not sufficient for about 6 months of the year), which may be why they also show 20 min as an example (16+ hours a day).
But if your “goal” is to max the “output” (even though you look at the result 5 times a day), then calculate “daylight minutes / 50” from sunrise to sunset and use that “time interval” (variable_timedelta) in the condition for the service call.
You only need to “get” the update time interval once a day (not every 10 min).
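Those numbers can be checked with a small plain-Python sketch (not Home Assistant code). The function names are made up for illustration, and the 50 requests a day is the figure mentioned above.

```python
def coverage_hours(requests_per_day: int, interval_minutes: float) -> float:
    """Hours covered if one request is made every interval_minutes."""
    return requests_per_day * interval_minutes / 60


def interval_for_daylight(daylight_minutes: float, requests_per_day: int) -> float:
    """Minutes between requests so the daily quota spans the whole daylight period."""
    return daylight_minutes / requests_per_day


print(round(coverage_hours(50, 10), 1))   # 8.3
print(round(coverage_hours(50, 20), 1))   # 16.7
print(interval_for_daylight(9 * 60, 50))  # 10.8
```

So for a 9-hour winter day, spending the whole quota would mean one request roughly every 11 minutes, and that interval only needs to be recalculated once per day.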

tom_l · November 20, 2023, 2:42pm

I can’t explain the behaviour you are seeing in your history strip chart. Even if your condition was somehow erroneous and allowed all triggers through, it still should not trigger more often than every 10 minutes with this trigger:

Did you edit it?

Was it originally seconds: /10 ?
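As a rough illustration of why that distinction matters, here is a simplified plain-Python model of a “/10” pattern. It assumes that a “/N” value matches whenever the field is divisible by N and that an unset seconds field defaults to 0. The function name `fires_per_hour` is made up, and this is not how Home Assistant implements the trigger.

```python
from typing import Optional


def fires_per_hour(minutes_step: Optional[int] = None,
                   seconds_step: Optional[int] = None) -> int:
    """Count matching instants in one hour for a simplified "/N" time pattern.

    Assumptions: an unset minutes field matches every minute, and an unset
    seconds field only matches second 0.
    """
    count = 0
    for minute in range(60):
        for second in range(60):
            minute_ok = minutes_step is None or minute % minutes_step == 0
            if seconds_step is None:
                second_ok = second == 0
            else:
                second_ok = second % seconds_step == 0
            if minute_ok and second_ok:
                count += 1
    return count


print(fires_per_hour(minutes_step=10))  # 6
print(fires_per_hour(seconds_step=10))  # 360
```

Under this model, a mistaken seconds: /10 would evaluate the conditions 60 times more often than minutes: /10.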

petro · November 20, 2023, 2:49pm

Are you just trying to trigger on 30 even increments between rising and setting?

louavzzniibmodjork · November 20, 2023, 2:50pm

@tom_l Sorry, I wasn’t explicit about that history strip. That is showing the server uptime. So every time the server restarts it creates a new color.

So even though the automation doesn’t run any actions, the server is still restarting.

This leads me to believe that it’s the evaluation of the condition that is causing the server to restart.

louavzzniibmodjork · November 20, 2023, 2:55pm

@petro That is what I was trying to accomplish with this state of the automation. If this had been successful, instead of putting the server into a restart cycle, then I would have iterated on the solution.

The next version of the Integration is bringing in changes that will cause further iteration, but I wanted to do something with this version.

I just didn’t expect HA to react this way.

NathanCu · November 20, 2023, 2:59pm

I see indication of a boot loop and I see what may be the trigger, but I do not see any logs to help you determine what is actually crashing…

What does the log say (the full log; in the UI it will only have entries since the last restart)? We’re all in the dark guessing…

If you’re out of bootloop land, grab a log please…

petro · November 20, 2023, 3:17pm

Ok, well you’re going to run into issues using next_rising and next_setting. Do this instead.

Create an input_select helper with the name Update OpenUV.
Use the following automation

alias: Set Schedule for OpenUV
trigger:
- platform: time
  at: "00:00:00"
- platform: homeassistant
  event: start
action:
- service: input_select.set_options
  target:
    entity_id: input_select.update_openuv
  data:
    options: >
      {# change updates for the number of updates between sunrise and sunset #}
      {% set updates = 30 %}
      {# change include_sunset to True to include the sunset as the last trigger time #}
      {% set include_sunset = False %}
      {% set setting = (state_attr("sun.sun", "next_setting") | as_datetime | as_local).replace(day=now().day) %}
      {% set rising = (state_attr("sun.sun", "next_rising") | as_datetime | as_local).replace(day=now().day) %}
      {% set ns = namespace(items=[]) %}
      {% set inc = (setting - rising) / (updates - 1 if include_sunset else updates) %}
      {% for i in range(updates) %}
        {% set ns.items = ns.items + [ (rising + i * inc).strftime("%H:%M") ] %}
      {% endfor %}
      {{ ns.items }}

Add the following automation

alias: Update OpenUV
description: ""
trigger:
  - platform: template
    value_template: "{{ now().strftime('%H:%M') in state_attr('input_select.update_openuv', 'options') }}"
action:
  - service: homeassistant.update_entity
    target:
      entity_id:
        - sensor.openuv_current_uv_index
    data: {}
mode: single

The first automation schedules the times throughout the day at which the second automation will run. The scheduled times are calculated at midnight, when rising and setting are correct for the current day.

If by chance you restart, the rising and setting may be slightly different depending on when you restart. However, you’ll still get a schedule that is “close” to the times you want.

The second automation is what will run at each period.

Going this route removes all issues you’d have from the other route.
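For anyone who wants to see what the options template works out to, here is a rough plain-Python sketch of the same calculation (not Home Assistant template code). The function name `build_schedule` and the sample sunrise/sunset times are made up for illustration.

```python
from datetime import datetime
from typing import List


def build_schedule(rising: datetime, setting: datetime, updates: int = 30,
                   include_sunset: bool = False) -> List[str]:
    """Return `updates` evenly spaced "HH:MM" strings starting at sunrise."""
    # With include_sunset the last entry lands on sunset, so there is one fewer gap.
    steps = updates - 1 if include_sunset else updates
    inc = (setting - rising) / steps
    return [(rising + i * inc).strftime("%H:%M") for i in range(updates)]


# Sample values (assumed): sunrise 07:00, sunset 17:00 on an arbitrary day.
rising = datetime(2023, 11, 20, 7, 0)
setting = datetime(2023, 11, 20, 17, 0)
schedule = build_schedule(rising, setting)
print(len(schedule), schedule[0], schedule[1], schedule[-1])  # 30 07:00 07:20 16:40
```

The entries are formatted as "HH:MM" so that the current time, formatted the same way, can be matched against the list once a minute.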

123 · November 20, 2023, 3:31pm

Anyone see why this automation is the alleged cause of the problem?

It’s designed to fire every 10 minutes, no more, no less.
Its Sun Condition limits its actions to (approx.) daylight hours.
Its Template Condition doesn’t appear to contain anything the Jinja processor can’t easily digest. It’s a linear sequence of simple calculations (no control structures like if or for).
Its sole action is straightforward and doesn’t influence its own trigger.

It might be useful to review the automation’s most recent traces, especially the times when each trace was produced.

petro · November 20, 2023, 3:33pm

Just to explain some things: when the sun component passes the sunrise or sunset, it calculates the next sunrise or sunset, which slightly changes the overall calculation. If you’re looking for the most accurate result, you want to use today’s sunrise and sunset only, which means you can only run the calculation before today’s sunrise.
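A rough plain-Python sketch of that effect, with made-up times (the real values depend on location and date). The helper `next_event` is a simplified stand-in for the next_rising and next_setting attributes, not Home Assistant code.

```python
from datetime import datetime


def next_event(now: datetime, today_event: datetime, tomorrow_event: datetime) -> datetime:
    """Simplified stand-in for a `next_*` attribute: today's event until it passes."""
    return today_event if now < today_event else tomorrow_event


# Made-up sample times; tomorrow's events are shifted by a minute or two.
rise_today, set_today = datetime(2023, 11, 20, 7, 30), datetime(2023, 11, 20, 16, 15)
rise_tomorrow, set_tomorrow = datetime(2023, 11, 21, 7, 32), datetime(2023, 11, 21, 16, 14)

for now in (datetime(2023, 11, 20, 0, 0), datetime(2023, 11, 20, 12, 0)):
    rising = next_event(now, rise_today, rise_tomorrow)
    setting = next_event(now, set_today, set_tomorrow)
    # Compare times of day only, as a "coerce to today" approach does.
    daylight = (datetime.combine(now.date(), setting.time())
                - datetime.combine(now.date(), rising.time()))
    print(now.strftime("%H:%M"), daylight)
# 00:00 8:45:00
# 12:00 8:43:00
```

At midnight both values belong to the same day. By noon the rising time already refers to tomorrow, so the computed daylight drifts by a couple of minutes.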

petro · November 20, 2023, 3:35pm

I’d wager the issue lies with the updating of the entity and mode single, but that remains to be seen. The template he’s using is odd and will behave oddly when it crosses the transitions (although I don’t think that will cause HA to crash).

louavzzniibmodjork · November 20, 2023, 3:38pm

@NathanCu Thanks for your input. Yeah, I’ve been in the dark trying to figure this out.

I am out of bootloop land now, and I’ve installed the add-on, but it refuses to start. It shows CPU activity after I hit the start button, but then immediately stops, and refreshing the page just says it isn’t started. And, of course, I can’t access the logs to see why this is happening.

Looking at logs is quite an important task, I would have hoped that there was a builtin way of accessing this information.

I’ll need to set aside some time to try and gain access to the logs. That time is not right now, unfortunately.

Are you willing to share your possible reasons on what could be wrong with the automation that would cause HA to enter a bootloop, or would you prefer to wait for the logs?

Thanks, again.

123 · November 20, 2023, 3:45pm

I agree the revised sunrise/sunset times will be altered by a minute or three. However I don’t see how that small change would cause the kind of catastrophic failure that’s been described.

It uses a simple Time Pattern Trigger so single mode ought to be fine … unless the action takes more than 10 minutes to complete which seems highly unlikely.

The nature of the failure suggests that something is taxing the system to the extent it can no longer process additional tasks. I did this once by mistakenly putting a repeat into an endless loop (without any delay within the loop) and, yes, it was a bear to regain control of the system. However, I don’t see anything in that Template Condition that is capable of this result.

NathanCu · November 20, 2023, 3:49pm

I will not speculate without logs.

I’m with @123 on possibility but I’ve chased too many ghosts.

Without data you’re guessing, correlation is not causation.

louavzzniibmodjork · November 20, 2023, 3:50pm

@123 Thanks for pushing the question of why!

Of note is that the automation says that it was last triggered 16 hours ago, which is well before I finally managed to get the bootloops to stop. The automation is now disabled, so won’t be triggered now. My assumption here is that the time the automation is triggered is updated after the conditions are evaluated and that’s why it wasn’t being updated.

Unfortunately I can’t access the traces because the server has rebooted.

From memory, looking at earlier traces when I could, there wasn’t anything helpful in them. Either they would stop at the condition, or they would complain that the entity wasn’t available (because I was over the quota).

The various tabs of the trace didn’t show anything that stood out to me; no resolved variables, etc. Copying the variables to the template renderer showed what I expected the values to be.

I tried both the variables: section of the yaml, and the template code that I posted in this thread. I don’t remember if the variables: caused the bootloop, but it obviously wasn’t working correctly to cause me to move to the template.

123 · November 20, 2023, 3:57pm

That seems nominal and serves as evidence that there’s nothing inherently wrong with the automation.

The sensor’s unavailable status should be explored. The OpenUV integration marks it as unavailable when you have exceeded your daily allowance of polling inquiries?

I don’t know how Home Assistant (and/or the OpenUV integration) handles an attempt to update the state of an unavailable entity. I assume gracefully (along with an error or warning message) but maybe not.

EDIT

Pure Speculation:

Imagine that the OpenUV integration doesn’t like being instructed to update a sensor’s value after you have exceeded your daily quota. It proceeds to malfunction and overload Home Assistant. It’s cleared after a restart but misbehaves again the moment when the automation tells it to update the sensor. Basically, the automation looks like the culprit but it’s actually the integration.

louavzzniibmodjork · November 20, 2023, 4:01pm

@petro If you’re able to describe what is “odd” about the template, that would help me learn.

Understood about the sunrise/sunset changing based on the time of day, which is why the template coerces the times with today_at, to approximate the time closely enough for what was supposed to be a “small iteration” of the automation. The next version of the Integration will provide additional information that will change the calculations anyway. This thread isn’t intended to discuss improvements, but to question why the bootloops happen.

When you say transitions, do you mean the sunrise/set transitions? The bootloops in the history grab are well before sunrise.

louavzzniibmodjork · November 20, 2023, 4:03pm

@NathanCu Fair enough. If you do think of something where I can do some chasing (in addition to the logs) then please do let me know.

louavzzniibmodjork · November 20, 2023, 4:13pm

@123

I have reenabled the Integration in order to get that screenshot. (to make sure there wasn’t a difference when the Integration is enabled/disabled.)

I guess we will find out soon if it’s the integration. But note that disabling the Integration hasn’t fixed the issue previously. It was only after disabling the Integration and then disabling the automation that HA would stop the bootloop.

Update:

Oh, and the fact that the bootloop happens when I have a full quota at the start of the new day.

So far no reboots.

123 · November 20, 2023, 4:22pm

What is responsible for the ‘forced update’ in those messages? The integration or your automation?

If it’s your automation then why has the integration not created those entities yet … unless (speculation) it’s designed not to if it detects you’ve exceeded your daily quota.