Automation time not working correctly

chrisfairbrother · February 3, 2023, 4:41pm

I tried your value_template
automation time 10:00:00 trace 10:03:30
much better, but not as quick as it should be.

I have created a new HA on an old intel pc using a generic X86-84.
added my shelly 2 lamp and one automation
automation time 11:00:00 trace 11:00:00
no delay at all. This proves there is nothing wrong (delay) with the shelly 2 device.
It does boot very quickly compared to my old mac mini.
I am now suspecting the mac mini is underpowered?

I built a new VM for HA.
added my shelly 2 lamp and one automation
automation time 12:00:00 trace 12:02:15 using value_template
Looked at the paravirtualization on the VM. This was set to Default. for Linux guests, it is recommended to use KVM. Now changed, but still about 2 minutes delay.

On my existing VM
changed the paravirtualization to KVM
automation time 16:05:00 trace 16:05:20 using value_template; much better.

The cpu load and memory usage graphs on the mac mini show about 15% cpu and 20% memory usage.
So its not working very hard, even when the automation is triggered.
The memory usage of the HA (settings->system:>hardware) show
Screenshot 2023-02-03 at 16.05.22

still nothing looks like a problem.

I will test again tonight at sunset and 22:15 and report back.

pnbruckner · February 3, 2023, 7:32pm

That is my suspicion as well. And the VM seems to be making it worse, especially when not using some hardware-based virtualization acceleration. (Besides KVM itself, there’s also the enabling of the hardware-based acceleration. On an x86 PC, that’s done in the BIOS. Not sure for a mac or macmini.)

As I mentioned earlier, for the template trigger that uses the now() function, it gets re-evaluated every minute. But it’s important to note that exactly when that happens (and, hence, when it might possibly “fire”) is based on the monotonic clock, not the real-time clock. So, it is, at least somewhat, affected by the same theoretical problem. Also, there are probably a lot of things in the system that are being run every minute, as well, so they can kind of “stack up”, further delaying the template trigger, and possibly making it jitter over time. Still, the reason I suggested it is, if there is some sort of drift going on, I would think the affect would be less for something scheduled every minute (which compares against the real-time clock) as opposed to something that is scheduled far into the future (like once a day.)

Regarding the “clock drift” custom component I alluded to earlier, I think I have it in a form that you (or anyone) could try. I made some changes to how it works, so I had to purge the history I had collected, but for the short time I’ve been running the latest version on my system it looks like this:

The scale is msec, so the drift so far is within +/- 1.5 µsec. Since I started this experiment, I think I’ve seen it within +/- 2 µsec. The interesting thing is to see if it stays pretty much constant (within a small range) or continues to mostly slope up or down. So far, I’ve only seen it stay pretty much constant, no long-term drift.

If you know how to add a custom repository via HACS, it should be pretty easy for you to give it a try on your system(s). It could help prove or disprove the theory I’ve suggested.

tom_l · February 4, 2023, 1:38am

Do you have a link to your clock drift integration Phil?

I’d like to try it.

pnbruckner · February 4, 2023, 4:55am

Yep, here it is:

I haven’t created any official releases yet, just the one commit so far.

@chrisfairbrother I’ve been trying to think what could account for minutes, or hours, of delay like you said you’ve seen. Honestly, it’s really hard to imagine.

There are ~~three~~ four things I can think of that can cause a time trigger to be delayed (as opposed to completely missed, some of which we’ve already discussed):

The event loop’s time reference (which, AFAIK, is basically time.monotonic()) running too slowly.
The hardware (and/or the VM) is just way too underpowered to run the system, or is feeding bad clocks to the system.
Some asyncio code running in your system that is not behaving correctly. (Basically, asyncio is a cooperative multi-tasking system, and if a task runs too long before await'ing something (e.g., it calls a callback that runs too long), then it can prevent the event loop from properly running things at the correct time.)

EDIT:

The underlying OS’s clock changing, e.g., being adjusted by NTP. Normally, though, NTP is supposed to adjust the clock slowly if it’s off, and then tweak it from time to time to keep it accurate. It’s not normally supposed to jump a significant amount. And, BTW, DST changing shouldn’t cause this. Yes, the time will jump an hour at a DST switch, but I’ve seen recent changes in HA’s code base to try to handle this better. Even so, that’s only twice a year, not every day.

My clock drift custom integration is an attempt to see if #1 is happening.

Regarding #3, it’s possible to run HA in a debug mode, which will enable debug in asyncio. This should generate warnings if any piece of code is misbehaving (basically taking more than 0.1 seconds between task switch points.) If you can run HA from the command line, then: hass --debug. If not, you can also enable asyncio debug by setting the environment variable PYTHONDEVMODE to a non-empty string. I’m not sure which can work for you since you’re using the HA OS.

In any case, I’m really struggling to understand how you could be experiencing minutes of delay for a time trigger, even considering all of those potential causes.

chrisfairbrother · February 5, 2023, 12:26pm

lights on at sunset and lights off at 10:15 - nothing not even hours later. and my notification of my garage door is open is not working now!

mac mini virtualization acceleration
“Basically, you need to ensure your software is updated to latest version . VT-x is enabled by default, but there’s no “setting” or command that can be issued to turn it on.”

debugging:- the only way I found to turn on the debugging was:-
on the command line
su options --debug=true
su reload

I assume extra logs will be added to home-assistant.log?

I tried the garage door open automation this morning,
2023-02-05 11:39:05.043 ERROR (MainThread) [homeassistant.components.mobile_app.notify] Requested entity was not found. This message is generated externally to Home Assistant.

I will test your drift sensor

I suppose my existing HA could have a problem? It seems to be getting worse.
I will be the first to admit my Mac-mini is slow. I don’t get the coloured wheel (waiting to do something)
It only uses 15 watts, is silent and all my temperature gauges in HA work fine. It only has problems with automations.

I will work on the new HA VM I built the other day, Using only 3 automations and see what happens.

chrisfairbrother · February 5, 2023, 6:32pm

Stooped my existing HA
working on the new installation of HA as a VM. Trying to prove its a timing error and not something in my existing HA
added only :-
my test_relay, shelly 2 lamp, garage door sensor, terminal+ssh, hacs and your clock_drift (many thanks)
and 3 automations
garage door sensor → message to iphone is spot on
test relay on and off within a few milliseconds.
Looking good
added

clock_drift:
logger:
to configuration.yaml, restart.

started testing at 17:13 by 17:51 the clock drift is over a second.
sampling at about 10 minutes although your script = “INTERVAL = timedelta(minutes=5)”?

test_relay on at 17:40 trace Executed: 5 February 2023 at 17:42:52 nearly 3 minutes
test_relay off at 17:45 trace Executed: 5 February 2023 at 17:50:54 nearly 6 minutes

test_relay on at 18:10 trace Executed: 5 February 2023 at 18:10:49 nearly 1 minute
test_relay off at 18:15 trace Executed: 5 February 2023 at 18:22:48 nearly 8 minutes
really weird times!
your clock_drift now shows -2 seconds at 18:28
nothing relevant in the home-assistant.log

pnbruckner · February 5, 2023, 10:41pm

At 17:51, if I’m reading it correctly (i.e., commas vs periods, I am in the US!), it looks like the drift was 1129 seconds (remember, the scale is milliseconds), which is almost 19 minutes, or 1/2 of the total time it was running.

When it samples depends on the same infrastructure as the time trigger, so the fact that the sampling is delayed is just further proof of the problem.

However, that shouldn’t affect the drift calculation, which is based on two sets of readings, each of the two clocks. The first “reference” set is captured shortly after startup, and the second set each time it samples. For each clock it calculates how much time elapsed between the reference and the sample.

(The elapsed time calculated from time.time() should be indicative of how much wall time actually elapsed, possibly affected by NTP clock adjustments, whereas the elapsed time calculated from hass.loop.time() is an indication of how much time HA’s scheduler thought should have elapsed.)

All being perfect the two clocks should show the same amount of time elapsed. The drift is then the difference between these two elapsed times. Positive values mean the scheduler is running too fast (time scheduled events happen too soon), whereas negative values mean the scheduler is running too slowly (time scheduled events happen late.)

After you’ve run for a while, does the drift sensor graph continue to slope in the same direction and at about the same rate? Or does it sometimes jump up and down, too? I’d be interested to see a longer run of that graph.

Clearly, though, something is wrong with the underlying OS clock that hass.loop.time() is based on (which is generally time.monotonic().)

pnbruckner · February 5, 2023, 10:52pm

Currently the integration writes INFO messages to home-assistant.log. To see them you may have to set:

logger:
  default: info

or

logger:
  default: warning
  logs:
    custom_components.clock_drift: info

Also, INFO messages don’t normally show up on the UI log page. You have to go look at the file itself, or click one of the buttons in the UI to download or load the entire log.

E.g., on my system, the last message was:

2023-02-05 16:46:16.600 INFO (MainThread) [custom_components.clock_drift] time_delta = 6610.037360668, loop_time_delta = 6610.037359718, drift = -0.000000950

The times in the log are in seconds. The first two numbers are the reference → sample elapsed time calculations, and the last is the difference, which is the drift.

tom_l · February 5, 2023, 10:53pm

Just FYI my clock drift is averaging about 0.0005ms (0.5us).

pnbruckner · February 5, 2023, 11:00pm

It’s important to note whether or not it is relatively stable, or continues to slope and/or shows big (like seconds, minutes, etc.) jumps, over a period of time. The “steady state” value, so to speak, isn’t very meaningful, or at least, probably isn’t (especially values in the microseconds. That’s just “noise.”)

EDIT:

Oh, and BTW, big spikes (say milliseconds, and where it goes back to the steady state value), could be indicative of poorly behaving code, e.g., that might be doing I/O in the event loop. Any jump 0.1 sec or more should also generate an asyncio warning if its debug mode is enabled.

tom_l · February 5, 2023, 11:02pm

Yeah, it has only been running for 9 hours but looks pretty flat on average.

chrisfairbrother · February 6, 2023, 9:49am

todays drift, 8.3 hours

pnbruckner · February 6, 2023, 2:54pm

So clearly hass.loop.time() (which, again, I believe is just calling time.monotonic()), is running at only half the rate it should be, meaning any scheduled time event will happen twice as far into the future as it should.

The question now is, why? Is it the macmini’s OS? VirtualBox? Or HA OS?

You should be able to narrow down the culprit by running a small Python script at each level, starting with the macmini OS. E.g., you could run something like this:

$ python3
Python 3.10.8 (main, Oct 12 2022, 19:14:26) [GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> def test():
...     t0 = time.time()
...     m0 = time.monotonic()
...     while time.time() < t0 + 10:
...         pass
...     t1 = time.time()
...     m1 = time.monotonic()
...     td = t1 - t0
...     md = m1 - m0
...     print(f"td: {td:0.9f}, md: {md:0.9f}, diff: {md - td:0.9f}, ratio: {md/td:0.3f}")
...
>>> test()
td: 10.000000477, md: 10.000000452, diff: -0.000000025, ratio: 1.000

The ratio should be one. Try it natively on the macmini. If that’s not one, then that’s the problem. If it is, try to run it at a HA OS command line. I’m guessing you’ll see a ratio of 0.5. If so, that confirms the problem. But is it the HA OS, or VirtualBox. To answer that question, maybe try creating a new VM, install some easy OS (say, Ubuntu), and then try again in a terminal window in that guest OS.

If the ratio is correct natively on the macmini, but it’s wrong in HA OS (and ok in another VBox guest), then I’m starting to wonder if they did something funky in the HA OS that’s not working correctly in this environment (maybe ARM vs Intel???)

EDIT:

Here’s the code you can copy directly into a .py file (e.g., clock_test.py), and then run from the command line (e.g., “python3 clock_test.py”):

import time
def test():
    t0 = time.time()
    m0 = time.monotonic()
    while time.time() < t0 + 10:
        pass
    t1 = time.time()
    m1 = time.monotonic()
    td = t1 - t0
    md = m1 - m0
    print(f"td: {td:0.9f}, md: {md:0.9f}, diff: {md - td:0.9f}, ratio: {md/td:0.3f}")
if __name__ == "__main__":
    test()

I just tried it on Ubuntu running on VMware Workstation Player running on Windows 10 on an x86 laptop:

$ python3 clock_test.py 
td: 10.000000238, md: 10.000000391, diff: 0.000000153, ratio: 1.000

chrisfairbrother · February 6, 2023, 5:38pm

macmini
td: 10.000000000, md 10.000110955, diff: 0.000110955, ratio: 1.000


ok. I have created test.py in config/python_scripts, and added python_script: to configuration.yaml.
As I am new to HA, how do I run it please Phil ?
tried python ./test.py
and #!/srv/homeassistant/bin/python test.py

pnbruckner · February 6, 2023, 5:53pm

Ok, looks like the macmini & its OS aren’t the source of the problem.

Once you’ve got that created, you can invoke it via the SERVICES page. Look for python_script.test.

Unfortunately, although you can use time.time(), you can’t use time.monotonic() or hass.loop.time() from that context. (I actually tried when I was first creating the clock_drift script.) That context only allows a “safe”, “sandboxed” subset of Python.

Also, the script above isn’t designed to be used in that context, either. For that, it would look more like this:

def test():
    t0 = time.time()
    m0 = time.monotonic()
    while time.time() < t0 + 10:
        pass
    t1 = time.time()
    m1 = time.monotonic()
    td = t1 - t0
    md = m1 - m0
    logger.info(
        "td: %0.9f, md: %0.9f, diff: %0.9f, ratio: %0.3f",
        td, md, md - td, md/td,
    )
test()

You’d have to be able to do this from a HA OS command line. I’ve only tried using HA OS in a VM once, and not for very long, so I’m not even sure this is possible.

I think more enlightening would be creating another, generic Linux VM on VirtualBox, and trying it there. Although it’s possible the problem is due to something HA is doing, it’s more likely something caused by VBox or HA OS (or the Python implementation/environment that HA OS creates, and I suspect trying this in HA OS will just provide the same results as what you’ve already seen within HA via my clock_drift integration. But, I guess, you never know until you try it.)

chrisfairbrother · February 6, 2023, 6:32pm

ok thanks Phil, Downloading ubuntu. Will post results tomorrow

pnbruckner · February 8, 2023, 3:12pm

You’re keeping us in suspense! Were you able to try it yet on Ubuntu on VBox? If so, what were the results?

chrisfairbrother · February 13, 2023, 1:47pm

Sorry Phil, I have had visitors up until this morning. I have installed Ubuntu. it took ages. the result was
td: 10.000000238, md: 9.999999003, diff: -0.000001235, ratio: 1.000
It still a ratio of 1 but quite a drift after 10 seconds…

I have now installed a test version of HA on my newer iMac as a virtual machine. It runs ok; the timing for my test relay is within less than a second. I will add everything I had on the mac mini version today and test the sunset tonight.
Chris

pnbruckner · February 13, 2023, 5:38pm

No problem. Just curious what the ultimate cause is, in case it’s something in how HA (or probably more likely HA OS) is configured or is run on a Mac.

Why do you say it was quite a drift? After 10 seconds it was only off by about a microsecond. Can’t ask for much more than that. (That is basically in the noise.)

So, it looks like macmini native is ok. Ubuntu on macmini on VBox is ok. It’s only HA / HA OS on macmini on VBox that isn’t working correctly. My best guess at this point is it’s not HA, so it must be HA OS, or at least some portion of the environment that the HA OS image creates to run HA within. I’m starting to wonder if maybe the core team has modified the Python run-time environment such that time.monotonic() isn’t working correctly on macmini/VBox…

chrisfairbrother · February 14, 2023, 10:16am

Yes you are correct. The drift is ok (I was rushing to reply to your post). My new HA vm on my Imac is still running very good. I will continue to use and test it. The Imac is much faster than the mac mini. This is really noticeable when compiling code using esphome. I will keep the HA on the mac mini to see if the cause can be found.