How does the recorder.purge service work?

I think I found the cause of the problem I started facing recently with displaying history and events not being properly recorded into the DB (described elsewhere). The issues started a few days ago, and I just realised that I had reached the configured purge limit for my recorder DB: after the HA DB schema was updated to improve performance, I started with a fresh MariaDB instance and configured it to hold 31 days of history. That limit was reached just a few days ago, exactly when my problems started. Indeed, when I look at the performance data of the DB engine on my Synology NAS, I see a huge spike in IOPS and volume utilisation right at the moment (around 4:00 in the morning) when I start to observe the issues. I can only guess it is caused by the purge operation; in the days before the limit was reached there was nothing like that. I can interrupt the process by restarting HA… which is probably not good for the DB, but after the restart neither HA nor the DB is loaded any more, so system performance and the speed of displaying historical data return to normal - until the next morning.
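For reference, my recorder setup is roughly the sketch below (host, credentials and database name are placeholders, not my real values):

```yaml
# configuration.yaml - recorder pointed at an external MariaDB instance
recorder:
  # placeholder host/credentials - the real ones point at the NAS
  db_url: mysql://homeassistant:password@192.168.1.10:3306/homeassistant?charset=utf8
  # keep 31 days of history before purging older rows
  purge_keep_days: 31
```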
So this leads me to the question: how does the recorder.purge service work? Is it a simple SQL query sent to the database engine to delete obsolete records, with the DB engine handling the request on its own? In that scenario I should not observe such issues, since I assume the problem lies in the combined HA and DB performance during the purge, so restarting HA should not heal the situation. The other possible scenario is that HA somehow handles this operation until it is completed, and interrupting it stops the process; that would be in line with my observations - performance is restored, old records are not deleted from the DB, and the issue reappears early in the morning when the next purge cycle starts.
Also, as a follow-up question: will forcing the purge down to, let's say, 7 days of history heal the situation? Right now the purge runs for a few hours (or at least that is how long I see the spike in CPU and IO consumption). Will a shorter history help to remediate this?

Anecdotally based on a sample size of one (me) - yes.

I keep 7 days of data in MariaDB and have not seen any issues like this.

One thing you might want to try is a repack and purge. See:

Thanks @tom_l . Indeed, with lots of entities (1600+) and 31 days of history the database is large… I have a dedicated volume on my NAS running on mirrored SSDs and it reaches ~20GB, even though I have already excluded the majority of frequently updating sensors. Until I reached the 31-day limit, so while no purge was performed, the performance of this setup was flawless! Loading the default history was instantaneous and loading, let's say, a week of history from the past took up to 15 seconds. The problems started with the purge.
Now, what I did to try to reduce the history was to manually run recorder.purge down to 7 days WITH the repack option… so this explains why it has already been running for several hours (9 hours as we speak). But this peak in utilisation was also observed during the regular maintenance runs, so I suspected that repack: true is the default option.
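For reference, the manual call I ran looked roughly like the sketch below (fired from Developer Tools → Services; treat the exact values as an illustration):

```yaml
# one-off purge down to 7 days, with an explicit repack of the database
service: recorder.purge
data:
  keep_days: 7
  repack: true
```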
So, as the history is barely usable at the moment and I have no clue how long the purge might take to complete (not interrupting it this time), I started to think about resetting the DB entirely and starting from scratch. A side effect of the system unresponsiveness is a flood of error messages in the log; after one day I already have 334,000 entries (!), mainly warnings from frequently updating SNMP sensors (update took longer than the scheduled update interval) that cannot be written to the DB (I think).

I don’t think it is. After the recent optimisations I did not see the expected reduction in the size of my database until I performed a manual purge with repack set to true.

bdraco confirmed that the optimisation only marks the data as deleted and to purge it you have to repack.

How attached to your data are you?

If it’s not that important to you it might be easier to copy your MariaDB config to a text file. Delete the addon. Reinstall it and paste back your config. Then restarting would probably be wise.

Start with a clean slate.

Well, I’ll let the process go until tomorrow morning; if it is not completed I’ll probably reset the setup. If it completes, I’ll test the performance and see what difference going down from 31 to 7 days makes. Regarding the history… I’ll recover all I need within 4 days max - that is what I use for my irrigation system, and in the meantime it is not a problem to run it manually if needed. I do use some longer-term history to observe trends for fine-tuning some of my system parameters, so that might be a loss… but I can always consider Influx + Grafana if really needed.
I’m not using the add-on btw, I have MariaDB running on a Synology NAS, so I’ll just delete/reinstall it on the NAS to start fresh, with no need to change anything in HA other than shutting it down for the period of the DB reinitialisation.

Yeah I use influx/grafana for long term data. It seems to be quite a bit more space efficient.

OK, so it seems the purge did not finish overnight, so I stopped everything, reset the DB and started over with a fresh history. Obviously, so far so good.
This time I took a slightly different approach; instead of excluding the troublesome entities that generate lots of events, I started with everything excluded and only added the entities that I want to present in graphs or need historical data for in automations (a sketch of that config is below). This should also significantly reduce the amount of data written to the DB.
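The allowlist-style recorder config looks roughly like this (the entity ids are just examples, not my real list; as far as I understand, with only include specified everything else is skipped):

```yaml
# record only an explicit allowlist of entities
recorder:
  include:
    entities:
      # example entity ids - the real list is much longer
      - sensor.outside_temperature
      - sensor.inside_temperature
      - binary_sensor.daylight
```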
I noticed, however, some strange behaviour of a few sensors; it looks like their historical data is stored outside of the recorder DB. Here is an example: a graph comparing inside and outside temperatures over the past 5 days against the background of the day/night sensor:
[Screenshot 2020-08-08 at 3.47.44: inside vs. outside temperature over the past 5 days, with the day/night sensor as background]
While the temperatures are shown properly, with only a few hours of recorded changes, the day/night sensor shows its full history over the past few days - history that was deleted along with the database! This is a template sensor, created as a conversion of the deCONZ daylight sensor to 1 and 0 states, so it is definitely not taking its data from any other source…
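For context, the day/night template sensor is roughly the sketch below (the deCONZ entity id and attribute are from memory, so treat them as assumptions):

```yaml
# legacy template sensor converting the deCONZ daylight sensor to 1/0
sensor:
  - platform: template
    sensors:
      day_night:
        friendly_name: "Day / Night"
        # entity id and attribute assumed - adjust to the actual deCONZ daylight sensor
        value_template: >-
          {{ 1 if state_attr('sensor.daylight', 'daylight') else 0 }}
```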
Similarly, I noticed that all Netatmo Weather Station sensors are also showing their full history :open_mouth:
OK, now I need to wait 7 days to see how the purge will work under these new conditions and also start planning for Influx/Grafana…