BUG: Recorder stops logging events

squirtbrnr · October 31, 2017, 4:22pm

For the past couple of days I have to restart my HA docker container to get the recorder working again.

My HA is running in a docker container on my Synology. HA is recording to a MariaDB database running on my Synology. My Recorder, Logbook, and History are configured properly and have been working as I see past entries in the Logbook and History pages as well as the state history for entities. My homeassistant.log does not show any errors in it. My database is accessible via phpMyAdmin using the user I setup for HA and it shows past entries in the events table. However, the recorder just stops and no further events are logged after a certain point until I restart HA.

Unfortunately I don’t have a GitHub account so I can’t post a formal bug report (no I’m not going to create one right now) but I wanted to put some feelers out and see if it’s just me, or maybe a bug. I am running the latest HA docker container 0.56.2. I just recently updated my Synology to the latest DSM release (6.1.3-15152-Update 8). All packages are up to date. My recorder database is about 635MB.

Here’s my YAML config:

recorder:
  db_url: !secret mysql_db
  purge_interval: 1
  purge_keep_days: 28
  exclude:
    entities:
      - sensor.date
      - sensor.time
      - sensor.time_utc
      - sensor.pws_weather_1d
      - sensor.pws_weather_1n
      - sensor.pws_weather_2d
      - sensor.pws_weather_2n
      - sensor.pws_weather_3d
      - sensor.pws_weather_3n
    domains:
      - camera

One commonality I see is it has happened either immediately or within 2-3 hours after I run a Z Wave network heal. I have one scheduled to run at 3am everyday (and recorder stopped writing to the database around 6:30am), but I think it also happened after I manually initiated one yesterday afternoon. It may not be related, but that’s what I was doing when it happened both times.

squirtbrnr · November 16, 2017, 10:03pm

so this happened again. and in fact it has happened a few other times between when I reported it and now. I restarted HA this morning cause I was configuring a couple components and trying clean up a few things. I restarted it around 7:20am and by 8am it had stopped updating history on the history page and for components. however this time when I restarted HA, all the history was there and repopulated correctly. So it was recording to the database, but HA was failing to retrieve the recorded data. There were a few python errors in my log file.

Where should I put/upload the files, who wants to look at them, anyone else seeing or has seen this issue, does anyone care about the history as much as I do?

squirtbrnr · November 17, 2017, 2:10pm

and yet again! recording died sometime before midnight. I did a restart of my docker container, everything was working fine and within 1 hour of restarting the container the recorder stopped logging. I can access my database just fine using the same credentials as HA. I can browse the database and see it just stopped recording entries.

Someone open a bug report for me. I’ll gladly provide documentation and logs.

EDIT: Here’s the error I just got in my homeassistant.log

2017-11-17 14:59:09 ERROR (Recorder) [homeassistant.components.recorder.util] Error executing query: (_mysql_exceptions.OperationalError) (1206, 'The total number of locks exceeds the lock table size') [SQL: 'DELETE FROM events WHERE events.time_fired < %s'] [parameters: (datetime.datetime(2017, 11, 3, 14, 53, 40, 921474, tzinfo=<UTC>),)]

And again, it’s not recording. This time within 30 minutes of restarting.

EDIT 2: Every 30-60 minutes after restart, recorder stops recording. Just happened again and even after restarting my entire system (synology, docker, and MariaDB). Plain and simple, something in the recorder is broken

PeteB · November 17, 2017, 4:48pm

@squirtbrnr, from the log, it looks like Recorder is running a “purge” when the error is generated. When a purge happens, the Recorder first deletes old state records, then deletes old events records and finally recovers the database space. It looks like during/after the event deletions that a lock error is raised for some reason.I am not sure if this is the source of the problem or rather a symptom.

Can you let me know if you are still on version 0.56.2 as there have been a few changes to the Recorder component over the last month or two. If you now have a newer versions of HA (v 0.57 or later) then Recorder will check at start-up if a “purge” is overdue and will schedule it within the first hour. This may be why you are seeing consistently after a restart. Again, I dont think this is the source of the problem but rather a symptom that might help us track down the root-cause.

Three things that might help diagnose:

could you increase the logging level associated with the Recorder component so that we can see if there are other warnings being generated by amending configuration.yaml to include the following:

logger:
  default: warning
  logs:
    homeassistant.components.recorder: debug

if you are on a new HA version then you can trigger a manual purge with the “Developer Tools” on the front page of the web interface. Selecting the “Services” icon. Select “recorder” in the Domain field, “purge” in the service field and type {"keep_days": 28} into the Service Data field. If you could try this immediately after a HA restart then it might help determine if there is a Recorder problem.
I am not familiar with your setup and MariaDb but can you try deleting a record via the phpMyAdmin interface following a docker reboot but before starting up HA. This might help uncover if there is an underlying issue with the database unrelated to HA.

I have done some work with the Recorder component and happy to try and help diagnose but I’m no sql expert so we may need other assistance here.

PeteB · November 17, 2017, 5:01pm

@squirtbrnr, looking at github issues, you might want to check out the following with an associated fix coming in 0.58.

squirtbrnr · November 17, 2017, 5:29pm

Thanks @PeteB. I am on 0.57.2. I dropped the entire database through phpMyAdmin and started fresh. It seems to be logging to the database, but absolutely nothing is changing in state history for components or on the history page. It’s usually not until I restart my HA docker that the data sometimes will show up in history, otherwise it gets dropped completely and is gone.

I had my logger set to “info”, so it was missing all the debug messages. I will set it to debug and see what it puts in the log.

I did try a manual purge using the services tab and that’s when the error in the log I posted above showed up.

The first Github link you posted sounds almost identical to my issue. As you can see in my original post I have the purge_interval and purge_keep_days setup. I didn’t start seeing this issue consistently until I reached the purge_keep_days number. coincidentally, my database has entries that are older than the purge_keep_days. It seems like every time it tries to run a purge (within an hour of HA starting) it fails and the recorder stops logging data.

PeteB · November 17, 2017, 5:53pm

My read of @milanvo’s proposed fix is that if an error occurs during the last stage of purging (i.e. insufficient temporary disk space to compact the database) then this can cause subsequent disruption to recording. It looks like his fix introduces some error handling to log the root-cause of the problem and to more gracefully exit. His fix has been merged and looks to be coming in the next release (0.58). It may be worth holding out to upgrade and also exploring whether there could be any disk space issues on the Synology.

squirtbrnr · November 17, 2017, 5:57pm

No disk space issues on the synology. plenty of space left (6.8TB). I don’t have any quotas setup for any of the users for HA or MariaDB. I’ll hold out this weekend and try 0.58 when it’s released. In the meantime I guess I’ll just disable recorder/history.

milanvo · November 21, 2017, 2:45pm

Hi,

my fix #10405 mentioned by PeteB is probably not related to your issue - it adds exception handling to vacuum operation related to SQLite specifically. Your error:
(_mysql_exceptions.OperationalError) (1206, 'The total number of locks exceeds the lock table size')
on purge is probably caused by MySQL operation on a large number of rows - it’s true for purge high number of records. Try look at this and change MySQL parameters:

https://major.io/2010/02/16/mysql-the-total-number-of-locks-exceeds-the-lock-table-size-2/

squirtbrnr · November 21, 2017, 3:12pm

That may have been the cause of the error, but not the root cause of the problem.
The problem was when it came time to purge records, the command failed and recorder stopped recording data. After upgrading to HA 0.58 I have not had the error again and the recorder database is being managed properly. It’s recording data and purging data older than my purge_keep_days: 28. I’m not inclined to change any default InnoDB settings unless I see the error again.

squirtbrnr · November 21, 2017, 3:29pm

I spoke too soon. this morning after HA had been running for almost 3 days with no issues, it stopped recording again around 5:15am local time (11:15 UTC). Recorder and history data from then till now is just gone. No errors in the HA log.

EDIT: 1 error right at the time it stopped recording. This error lines up with the last entry in the recorder database.

2017-11-21 11:15:17 ERROR (Recorder) [homeassistant.components.recorder.util] Error executing query: Object of type 'function' is not JSON serializable

milanvo · November 22, 2017, 12:26pm

Unfortunately too general error Looks like invalid state or event data from a component that cannot be converted and stored to DB. Is there any related log or exception around this time ? What logger level do you have ? Consider to increase it to info to catch wrong component.

Exception handling in recorder should be improved anyway.

Now I see sensor.pws... excludes in your config. WUnderground is known to present invalid states. This looks very similar:

I have WIP pull request to improve error handling in WUnder., need some time to finish it. If you are interested, I can provide current script I am using at my system to test.

squirtbrnr · November 22, 2017, 1:56pm

yes, I do have a few errors in the WUnderground sensors around this same time. Those errors seem to happen in spurts. Most of the errors I have seen have a note about an invalid template or invalid data. My logger is configurable from the frontend using select lists and an automation. currently I am suppressing everything except for debug and higher on the recorder component.

milanvo · November 22, 2017, 2:33pm

This is WUnder. with fixed error handling. You can try changed code as custom component:
https://raw.githubusercontent.com/milanvo/home-assistant/0446e1810fef2bb933c32fe6ce005d11f510df52/homeassistant/components/sensor/wunderground.py

Just copy to .homeassistant/custom_components/sensor/wunderground.py and restart HA

squirtbrnr · November 22, 2017, 3:10pm

Yeah I run HA in a docker container. I’ll just wait till the PR is merged and released.

milanvo · November 26, 2017, 10:02pm

PR is merged, should be included in next HA version

raccettura · December 26, 2017, 9:00pm

Hm, I’m having what might be a similar issue. I just turned on debugging to help get more data. Several times now HA just stops recording events until I restart.

This time it was 5AM… nothing in the log to suggest a cause. Just no more recording.

squirtbrnr · December 26, 2017, 9:58pm

What version of HA are you on? Are you using the built in SQLite DB or an external DB? when you get a log that shows errors, post it with your config.

since the wunderground error handling was fixed about a month ago, I haven’t had any more issues with the recorder stopping. However, I did have another issue like I posted above where there were too many records in the query while purging and it got hung up on that. The fix for that was to change the innodb_buffer_pool_size in my.cnf for my MariaDB database as posted in post 9.

raccettura · December 26, 2017, 10:35pm

Running dev right now. Been happening since 0.59.

I’m using external mySQL.