WTH stopping service is too slow

Cougar · August 20, 2020, 4:59pm

When docker stop or docker restart sends SIGTERM only main thread stops but workers still continue running. The container stops after 10 sec timeout when docker sends SIGKILL. Basically, the container restart takes 10 sec more because of this behavior. There should be some way to ask all threads to exit immediately.

Edit: the internal restart is even slower. Docker at least kills the process with all its threads in 10 sec while restart from HA takes about 60 sec to finish

2020-08-20 22:29:40 INFO (MainThread) [homeassistant.components.script.restart_home_assistant] Restart Home Assistant: Running script 
2020-08-20 22:29:40 INFO (MainThread) [homeassistant.components.script.restart_home_assistant] Restart Home Assistant: Executing step call service 
…
2020-08-20 22:29:42 INFO (MainThread) [homeassistant.components.stream] Stopped stream workers 
2020-08-20 22:29:42 INFO (MainThread) [homeassistant.components.websocket_api.http.connection.140395086521536] Connection closed by client 
…
Home Assistant attempting to restart. 
Restarting Home Assistant 
2020-08-20 22:30:48 INFO (MainThread) [homeassistant.setup] Setup of domain logger took 0.0 seconds 
…

bdraco · August 20, 2020, 11:07pm

I just timed it on mine, and takes about 3-4 seconds to shutdown.

2020-08-20 23:02:37 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event homeassistant_stop[L]>
2020-08-20 23:02:40 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event homeassistant_final_write[L]>
2020-08-20 23:02:40 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event homeassistant_close[L]>

It would be helpful to see where the time is spent on your instance in order to see if there is an opportunity to make it faster.

Can you turn on debug logging for homeassistant.core, restart, and see if it is taking a while for the homeassistant_stop, homeassistant_final_write or homeassistant_close events ?

logger:
  default: warning
  logs:
     homeassistant.core: debug

Cougar · August 21, 2020, 5:02pm

Thanks for your assistance. I tried different things and was able to stop core faster but there is still some unknown delay in it. Following is the description what I did.

I see that after restart event only the websocket stream to the browser stops immediately but I still see most sensors sending data to the MainThread. I inititated restart via homeassistant.restart service. I removed some integrations (like tracker) to keep log for myself a little bit shorter.

I think this is normal that between restart service call and homeassistant_stop event it takes couple of seconds if there are multiple events in the queue that still needs to be processed.

I often see long time between homeassistant_stop and homeassistant_final_write and random state_changed events during this time. This is where most time goes.

Here it takes 28 sec while first 3 sec goes from service call to homeassistant_stop event.

2020-08-21 08:26:24 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event call_service[L] domain=homeassistant, service=restart, service_data=> 

… many Bus:Handling <Event state_changed[L] from other sensors …

2020-08-21 08:26:27 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event homeassistant_stop[L] 

2020-08-21 08:26:35 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event state_changed[L] entity_id=sensor.influxdb_bme280_temperature, old_state=<state sensor.influxdb_bme280_temperature=19.4; unit_of_measurement=°C, friendly_name=Temperatuur väljas @ 2020-08-21T11:25:35.262676+03:00>, new_state=<state sensor.influxdb_bme280_temperature=19.6; unit_of_measurement=°C, friendly_name=Temperatuur väljas @ 2020-08-21T11:26:35.285844+03:00>> 
2020-08-21 08:26:38 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event state_changed[L] entity_id=binary_sensor.iss, old_state=<state binary_sensor.iss=off; number_of_people_in_space=3, next_rise=2020-08-21T11:58:30+03:00, long=-106.4767, lat=-36.5395, friendly_name=ISS, device_class=visible @ 2020-08-21T11:25:36.779790+03:00>, new_state=<state binary_sensor.iss=off; number_of_people_in_space=3, next_rise=2020-08-21T11:58:30+03:00, long=-102.9810, lat=-33.9159, friendly_name=ISS, device_class=visible @ 2020-08-21T11:25:36.779790+03:00>> 

2020-08-21 08:26:52 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event homeassistant_final_write[L] 
2020-08-21 08:26:52 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event homeassistant_close[L] 
Home Assistant attempting to restart.

In another run the homeassistant_stop happens almost immediately but then it still takes 22 sec before homeassistant_final_write gets fired:

2020-08-21 08:29:53 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event call_service[L] domain=homeassistant, service=restart, service_data=> 
2020-08-21 08:29:56 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event homeassistant_stop[L] 
2020-08-21 08:30:00 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event state_changed[L] entity_id=sensor.time_utc, old_state=<state sensor.time_utc=08:29; friendly_name=Time (UTC), icon=mdi:clock @ 2020-08-21T11:29:00.179148+03:00>, new_state=<state sensor.time_utc=08:30; friendly_name=Time (UTC), icon=mdi:clock @ 2020-08-21T11:30:00.180158+03:00>> 
2020-08-21 08:30:00 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event state_changed[L] entity_id=sensor.time, old_state=<state sensor.time=11:29; friendly_name=Time, icon=mdi:clock @ 2020-08-21T11:29:00.179555+03:00>, new_state=<state sensor.time=11:30; friendly_name=Time, icon=mdi:clock @ 2020-08-21T11:30:00.180555+03:00>> 
2020-08-21 08:30:00 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event state_changed[L] entity_id=sensor.date_time_utc, old_state=<state sensor.date_time_utc=2020-08-21, 08:29; friendly_name=Date & Time (UTC), icon=mdi:calendar-clock @ 2020-08-21T11:29:00.179856+03:00>, new_state=<state sensor.date_time_utc=2020-08-21, 08:30; friendly_name=Date & Time (UTC), icon=mdi:calendar-clock @ 2020-08-21T11:30:00.181839+03:00>> 
2020-08-21 08:30:00 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event state_changed[L] entity_id=sensor.date_time_iso, old_state=<state sensor.date_time_iso=2020-08-21T11:29:00; friendly_name=Date & Time (ISO), icon=mdi:calendar-clock @ 2020-08-21T11:29:00.180137+03:00>, new_state=<state sensor.date_time_iso=2020-08-21T11:30:00; friendly_name=Date & Time (ISO), icon=mdi:calendar-clock @ 2020-08-21T11:30:00.182265+03:00>> 
2020-08-21 08:30:00 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event state_changed[L] entity_id=sensor.last_power_update, old_state=<state sensor.last_power_update=3; unit_of_measurement=sec, friendly_name=Last power update from CC128 via MQTT, icon=mdi:refresh @ 2020-08-21T11:28:00.181929+03:00>, new_state=<state sensor.last_power_update=9; unit_of_measurement=sec, friendly_name=Last power update from CC128 via MQTT, icon=mdi:refresh @ 2020-08-21T11:30:00.189377+03:00>> 
2020-08-21 08:30:18 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event homeassistant_final_write[L] 
2020-08-21 08:30:18 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event homeassistant_close[L] 
Home Assistant attempting to restart.

Then I disabled time_date sensor (because I almost every time see its events after reload) and now log is almost empty but still half of minute goes to int homeassistant_stop event

2020-08-21 08:41:34 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event call_service[L] domain=homeassistant, service=restart, service_data=> 
2020-08-21 08:41:37 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event homeassistant_stop[L] 
2020-08-21 08:41:37 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event state_changed[L] entity_id=sensor.cc128_sensor_3185_ch1_power, old_state=<state sensor.cc128_sensor_3185_ch1_power=676; unit_of_measurement=W, friendly_name=1. faasi hetkevõimsus, device_class=power @ 2020-08-21T11:41:31.593262+03:00>, new_state=<state sensor.cc128_sensor_3185_ch1_power=663; unit_of_measurement=W, friendly_name=1. faasi hetkevõimsus, device_class=power @ 2020-08-21T11:41:37.546394+03:00>> 
2020-08-21 08:42:06 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event homeassistant_final_write[L] 
2020-08-21 08:42:06 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event homeassistant_close[L] 
Home Assistant attempting to restart.

I changed logging for task waiting in core.py to show waiting tasks every second (by default only tasks takeng more than 60 sec will be logged).

@@ -423,12 +423,14 @@ class HomeAssistant:
     async def _await_and_log_pending(self, pending: Iterable[Awaitable[Any]]) -> None:
         """Await and log tasks that take a long time."""
         wait_time = 0
         while pending:
-            _, pending = await asyncio.wait(pending, timeout=BLOCK_LOG_TIMEOUT)
+            _, pending = await asyncio.wait(pending, timeout=1)
             if not pending:
                 return
-            wait_time += BLOCK_LOG_TIMEOUT
+            wait_time += 1
             for task in pending:
                 _LOGGER.debug("Waited %s seconds for task: %s", wait_time, task)

And there was only one task that took more to 2 sec to finish and the last logline was following:

2020-08-21 09:30:17 DEBUG (MainThread) [homeassistant.core] Waited 24 seconds for task: <Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/local/lib/python3.8/asyncio/futures.py:360, _wait.<locals>._on_completion() at /usr/local/lib/python3.8/asyncio/tasks.py:507]>

Here I’m a little bit stuck now. I don’t know how to figure out what task is this one, that is waiting there so long. Any hints, how to find that?

bdraco · August 21, 2020, 5:59pm

Here are all the core integrations that wait for STOP.

https://paste.ubuntu.com/p/WkxpzyQkS8/

You could try disabling them one by one until it is fast.

I’d start with zwave and stream

bdraco · August 21, 2020, 6:59pm

@Cougar

This commit will record what happens during stop. After cherry-picking it in, restart twice you will get a stop.cprof file.

pyprof2calltree -i stop.cprof -o callgrind.out.111100. (you may have to pip install pyprof2calltree)
Copy it down an open it with qcachegrind

To install qcachegrind on a mac

brew update

brew install graphviz

brew install qcachegrind

launch the mac app using something like

open /usr/local/Cellar/qcachegrind/18.04.3_1/qcachegrind.app

load the generated callgrind.out.XXXXXX file

You’ll end up with something like this. In my case emulated_hue takes a while to shutdown (I’ve already made it faster but maybe I can find a way to speed it up some more)

bdraco · August 21, 2020, 9:07pm

emulated_hue now shuts down almost immediately

I also found that DataUpdateCoordinator would still do updates after the stop event. Something like this should resolve that

diff --git a/homeassistant/helpers/update_coordinator.py b/homeassistant/helpers/update_coordinator.py
index 7b7e6af4d6..35bbddc89b 100644
--- a/homeassistant/helpers/update_coordinator.py
+++ b/homeassistant/helpers/update_coordinator.py
@@ -138,6 +138,9 @@ class DataUpdateCoordinator(Generic[T]):
 
         self._debounced_refresh.async_cancel()
 
+        if self.hass.is_stopping:
+            return
+
         try:
             start = monotonic()
             self.data = await self._async_update_data()

Cougar · August 22, 2020, 2:50pm

Hi,

I my case, I don’t see (or understand) any delay in python code. Looks like some timeout is set for the select.epoll and this is not canceled properly. Now the question is, is this some module/integration that is waiting for something. Or could it be something in MainThread?

I uploaded it also to github if you’d like to check it yourself.

bdraco · August 22, 2020, 10:59pm

You can ignore select and the poll as this is effectively the idle time in the event loop.

It looks like its entity updates and template entities being re-rendered. We don’t stop tracking template entities ever in 0.114, but that is fixed in 0.115 already and being tracked here

The fix I posted above for DataUpdateCoordinator likely fixes the bulk of integrations that are still doing updates after the stop event. I’ll work on turning that into a PR.

Cougar · August 23, 2020, 2:30pm

I got some more ideas now and after half day of debugging, I finally found something. I did a lot of tests (every waited at least one minute to let system fully load all components and become stable). Sometimes stage 1 took 3-4 sec, sometimes it took 20 or more sec to finish.

_await_and_log_pending() shows last waiting tasks always <Future pending cb=[_chain_future.<locals>._call_check_cancel() at /usr/local/lib/python3.8/asyncio/futures.py:360, _wait.<locals>._on_completion() at /usr/local/lib/python3.8/asyncio/tasks.py:507]>. This however doesn’t tell me much as it is just a piece of async library itself.

In every profile I took, almost all time has been passed just in select call. This is the problem with async code where no part of program run but it still stays “blocked” outside of it and takes time. It should be some component that fires async call with timeout but doesn’t listen EVENT_HOMEASSISTANT_STOP event or doesn’t stop itself before some timeout.

After a lot of debugging and straces, I finally found the culprit. Matrix thread was almost always the last one that exited just before stage 2 started.

Here is the log about how connection with matrix.org has been used and how the thread exit after it read the socket. The whole restart took 30 sec:

… here the self._sync() in matrix will be called
10211 16:56:30.736086 write(6, "2020-08-23 16:56:30 ERROR (Thread-3) [matrix_client.client] XXX ENTER self._sync()\n", 83 <unfinished ...>
… socket changed to be blocking here and read from server is started
10226 16:56:30.736129 ioctl(9, FIONBIO, [0] <unfinished ...>
10226 16:56:30.736332 read(9,  <unfinished ...>

… restart event
10211 16:56:32.928265 write(2, "2020-08-23 16:56:32 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event call_service[L]: domain=homeassistant, service=restart, service_data=>", 150 <unfinished ...>
… exit stage 1 starts
10211 16:56:36.667704 write(2, "2020-08-23 16:56:36 WARNING (MainThread) [homeassistant.core] XXX async_stop start STAGE 1", 90 <unfinished ...>
… everything else is done but we wait for Matrix thread another 24 sec until here we get data from server
10226 16:57:00.823733 <... read resumed>"\27\3\3\3\212", 5) = 5
10211 16:57:00.825589 write(2, "2020-08-23 16:57:00 ERROR (Thread-3) [matrix_client.client] XXX EXIT self._sync()", 81 <unfinished ...>
… now self.should_listen get checked and loop terminates
10226 16:57:00.829448 +++ exited with 0 +++
… exit stage 1 ends just after that
10211 16:57:00.837447 write(2, "2020-08-23 16:57:00 WARNING (MainThread) [homeassistant.core] XXX async_stop start STAGE 2", 90) = 90
… and HA is ready to restart
210   16:57:00.910010 write(2, "Home Assistant attempting to restart.\n", 38 <unfinished ...>

Looks like the problem is in stop_client() in matrix component which calls stop_listener_thread() in Matrix Client SDK. This function sets self.should_listen = False but the check of this flag happens after every 30 sec only and for me it seems that nothing can stop it inside this 30 sec timeout. When I reload HA just after self._sync(timeout_ms) call, it takes the most time and always happens after this call returns.

Without matrix integration, restart become very snappy:

2020-08-23 15:57:02 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event call_service[L] domain=homeassistant, service=restart, service_data=> 
…
2020-08-23 15:57:06 DEBUG (MainThread) [homeassistant.core] Bus:Handling <Event homeassistant_close[L] 
Home Assistant attempting to restart.

Looks like it’s time to switch to matrix-nio as suggested here by the original matrix-python-sdk maintainer by himself.

Thanks a lot @bdraco for all tips and hints how to debug it! I have written Python quite a lot but so far haven’t had chance to debug any async code so deeply