Recorder Retention Period By Entity

Your original post is from Jan 2021. The database has been significantly optimized since then, and I’d expect the same data to be nearly an order of magnitude smaller to store.

I wouldn’t expect new users to ever have to configure excludes at all, and the defaults should be just fine unless there is an integration that is misbehaving. At least from the databases I’ve received from users with scale problems, the issue has almost always been a specific custom integration that is overloading the system, and the large database is only a symptom of that wider problem. If a user routinely needs to configure specific purge intervals, I think we are solving the wrong problem.

I’m oversimplifying, but the current purge query takes the top rows off the database and deletes them until it reaches the time frame where it stops removing them. That makes the query very efficient and avoids overloading the system during a purge. When you start adding more conditions to scan for, you end up doing multiple queries to find the rows to delete. Right now the system searches for the rows once and always removes everything it finds, except for the last query, which may or may not return enough rows to delete, and then you are done. By adding multiple options you end up searching over rows that you are not going to delete since they are excluded (the index will reduce the number of rows that actually need to be searched), but you still end up with significantly more rows being searched every time a purge runs.
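To make that concrete, a batched time-based purge boils down to something like the sketch below. This is a rough illustration, not the actual recorder query; column names such as last_updated_ts are assumptions based on the current schema.

-- Repeatedly delete the oldest batch of rows until nothing older
-- than the cutoff remains. Scanning in primary-key order keeps
-- each pass cheap and avoids long locks during the purge.
DELETE FROM states
WHERE state_id IN (
    SELECT state_id
    FROM states
    WHERE last_updated_ts < :purge_before  -- bound to now minus keep_days
    ORDER BY state_id
    LIMIT 1000                             -- small batches limit load
);
-- Per-entity retention rules would mean repeating a search like this
-- once per rule, scanning over rows that are never going to be deleted.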

I’ve experienced that. In my case I wouldn’t characterize it as ‘misbehaving’ but ‘overly zealous’. :slightly_smiling_face:

I have several HomeKit-compatible devices (ecobee Switch+) that report motion, temperature, and illuminance. All changes are reported promptly, which is great, but because illuminance is reported with three decimal places, it produces a firehose of data (and for all practical purposes, I don’t need to know if the light level changed by a mere fraction).

A long time ago, I asked the developer to consider modifying HomeKit Controller to round illuminance down to the nearest integer value (the request remains pending). Anyway, what I have done with every new release of Home Assistant is patch the integration to round illuminance (using a script that modifies the code in the container). That helps reduce the reporting of illuminance changes at the source and keeps the sensor’s history to a reasonable size.

If I didn’t patch the code, I too would be looking for a way to selectively purge history for the illuminance sensors on a more frequent basis than 10 days (like every two days, or even daily).

I’ll keep patching the integration, but it’s good to know that there’ll be a recorder.purge_entities service call.


EDIT

Full Disclosure: It’s been so long since I started patching the integration that I no longer recall if illuminance is reported with three or two decimal places. Either way, IMHO it doesn’t need any decimal places, because fractions of a lux aren’t useful.

How many state changes is it generating per day without the patch?

Thank you for the thoughtful reply, and all the database improvements.

Assuming you are correct that the default values are now adequate for the majority of users, and given that there are options for more advanced users to tweak things even more, I’d be OK withdrawing this FR.

[One year later: I think the above may have been premature. I still see regular reports in this forum of problems with oversized databases and general frustration with the one-size-fits-all keep_days. There are still spammy entities, and for some entities, the entire list of attributes is saved, as a new record, each time any one of them changes. So even a so-called “binary” entity can rapidly bloat the database. The long-term, archival data is still stored in the same database as the live, dynamic data, which is a less-than-ideal design. I suppose I should be glad that people are still finding, and voting for, this FR. But I wish they didn’t have to.]

Honestly, I no longer recall because it’s been almost three years since I started patching the integration (FR to report value as integer was posted in 2020).

My recorder’s purge period is 5 days. I queried the database for just one light sensor and it reported ~1000 records, so let’s say about 200 per day. I find that perfectly acceptable (this is with the patched integration that reports values as integers, not floats).

I can remove the patch, but if you don’t mind, I’d rather wait until recorder.purge_entities becomes available, so I can easily discard the excess records after a day or two of testing.

Sounds good. States rows should be at least 26 bytes smaller per row in 2023.4.x as well, so that will help a bit as long as the attributes are not constantly changing.

At least on my large production database (default settings), the minimum savings is ~54MB. The actual savings was closer to 180MB, because that figure is only the minimum savings per row and the indices are smaller as well.

sqlite> select count(state_id) from states;
2211850
sqlite> select count(state_id) * 26 from states ;
57508386
sqlite> select count(state_id) * 26 / 1024 / 1024 from states ;
54

I haven’t read through everything in this post, but I have some other thoughts. I did write up a simple script to throw away every second sampling point for a specific list of entities. If I get the time I will polish it and release it as a stand-alone command-line tool.
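The core of it is a decimation query; roughly the sketch below, assuming the post-2023 schema where entity IDs live in a states_meta table. The entity name is made up, back up the database first, and note that this ignores details such as the old_state_id links between consecutive rows, which a polished tool would need to repair.

-- Number one entity's states in time order, then delete the
-- even-numbered rows, keeping every second sampling point.
DELETE FROM states
WHERE state_id IN (
    SELECT state_id FROM (
        SELECT state_id,
               ROW_NUMBER() OVER (ORDER BY last_updated_ts) AS rn
        FROM states
        WHERE metadata_id = (
            SELECT metadata_id FROM states_meta
            WHERE entity_id = 'sensor.example_illuminance'  -- hypothetical
        )
    ) AS numbered
    WHERE numbered.rn % 2 = 0
);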


Can you please post the script, @tobixen?

@tobixen I’m also interested in your script :grinning:

Another rather simple way to reduce the database size somewhat would be to drop all the no-longer-used columns after they’ve been replaced: null columns still use up some storage in each row (*). For example, in the states table, 8 out of the 15 columns are never populated in my database: entity_id, attributes, event_id, last_changed, last_updated, context_id, context_user_id and context_parent_id. Dropping these unused columns from my states table (with 5.2 million rows in it) frees up almost 40MB of data, and most likely even more disk space, because a smaller row size can increase page-fill efficiency. And states is not the only table with columns that are always null.

(*) Storing a null value requires 1 byte per column per row, according to https://stackoverflow.com/questions/14934488/is-disk-space-consumed-when-storing-null-data
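If you want to experiment with this, do it on a copy of the database rather than the live one, since future schema migrations may expect those columns to exist. SQLite 3.35+ supports dropping columns directly; a minimal sketch:

-- Any index that still covers a column must be dropped first,
-- otherwise DROP COLUMN fails.
ALTER TABLE states DROP COLUMN context_user_id;
ALTER TABLE states DROP COLUMN context_parent_id;
-- VACUUM rewrites the file so the reclaimed space shows up on disk.
VACUUM;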

Completely agree. I need to store data for central heating over the winter. For the rest of the data I don’t care much - 10 days or even less is probably enough.

Please let’s get this retention period per entity done.

Thanks

Yes, I would love to see this in the future HA version!

I guess this would be the easiest way to set it up:


I think I would like to see per-entity days retention implemented.

My personal preference would be to be able to bulk apply recorder strategies to entities.

Temperature strategy…

max_days: 10
precision: 1
rounding: 'half'

I implement this already with template sensors for precision and rounding, and with a combination of labels and scripts that purge entities at midnight.
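For the purge part, something like the automation below around the recorder.purge_entities service does the trick; a minimal sketch (the sensor name is made up, and keep_days on purge_entities needs a reasonably recent Home Assistant release):

automation:
  - alias: "Nightly purge of spammy sensors"
    trigger:
      - platform: time
        at: "00:00:00"
    action:
      - service: recorder.purge_entities
        target:
          entity_id:
            - sensor.example_illuminance  # hypothetical spammy entity
        data:
          keep_days: 2  # keep only the last two days for these entities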
