Your original post is from Jan 2021. The database has been significantly optimized since then, and I’d expect the same data would now be nearly an order of magnitude smaller to store.
I wouldn’t expect new users to ever have to configure excludes at all, and the defaults should be just fine unless there is an integration that is misbehaving. At least from the databases I’ve received from users with scale problems, the issue has almost always been a specific custom integration that is overloading the system; the large database is only a symptom of a wider problem. If a user routinely needs to configure specific purge intervals, I think we are solving the wrong problem.
I’m oversimplifying, but the current purge query takes the top rows off the database and deletes them until it reaches the time frame where it should stop removing them. That makes the query very efficient and avoids overloading the system during a purge. When you start adding more conditions to scan for, you end up doing multiple queries to find the rows to delete. Right now the system searches for the rows once and always removes everything it finds, except for the last query, which may or may not return enough rows to delete, and then you are done. By adding multiple options you end up searching over rows that you are not going to delete because they are excluded (the index will reduce the number of rows that actually need to be searched), but you still end up with significantly more rows being searched every time a purge runs.
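To make the shape of that concrete, here is a rough sketch of a batched purge, not the actual recorder query; the last_updated_ts column name, the cutoff value and the batch size of 4000 are only placeholders:

delete from states
where state_id in (
    select state_id
    from states
    where last_updated_ts < 1680000000.0  -- purge cutoff (keep_days ago)
    order by state_id
    limit 4000                            -- one batch per pass
);
-- Repeat until the subquery returns fewer than 4000 rows. Every batch except
-- the last is deleted in full, which is what keeps the purge cheap; adding
-- per-entity exclusions would turn each pass into a filtered scan instead.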
I’ve experienced that. In my case I wouldn’t characterize it as ‘misbehaving’ but ‘overly zealous’.
I have several HomeKit-compatible devices (ecobee Switch+) that report motion, temperature, and illuminance. All changes are reported promptly, which is great, but because illuminance is reported with three decimal places, it produces a firehose of data (and for all practical purposes, I don’t need to know if the light level changed by a mere fraction).
A long time ago, I asked the developer to consider modifying HomeKit Controller to round down illuminance to the nearest integer value (the request remains pending). Anyway, what I have done, with every new release of Home Assistant, is patch the integration to round illuminance (using a script to modify the code in the container). That helps reduce the reporting of illuminance changes at the source and keeps the sensor’s history to a reasonable size.
If I didn’t patch the code, I too would be looking for a way to selectively purge history for the illuminance sensors on a shorter interval than 10 days (like every two days or even daily).
I’ll keep patching the integration, but it’s good to know that there’ll be a recorder.purge_entities service call.
EDIT
Full Disclosure: It’s been so long since I started patching the integration that I no longer recall if illuminance is reported with three or two decimal places. Either way, IMHO it doesn’t need any decimal places, because fractions of a lux aren’t useful.
Thank you for the thoughtful reply, and all the database improvements.
Assuming you are correct that the default values are now adequate for the majority of users, and given that there are options for more advanced users to tweak things even more, I’d be OK withdrawing this FR.
[One year later: I think the above may have been premature. I still see regular reports in this forum of problems with oversized databases and general frustration with the one-size-fits-all keep_days. There are still spammy entities, and for some entities, the entire list of attributes is saved, as a new record, each time any one of them changes. So even a so-called “binary” entity can rapidly bloat the database. The long-term, archival data is still stored in the same database as the live, dynamic data, which is a less-than-ideal design. I suppose I should be glad that people are still finding, and voting for, this FR. But I wish they didn’t have to.]
Honestly, I no longer recall because it’s been almost three years since I started patching the integration (FR to report value as integer was posted in 2020).
My recorder’s purge period is 5 days. I queried the database for just one light sensor and it reported ~1000 records, so let’s say about 200 per day. I find that perfectly acceptable (this is with the patched integration that reports values as integers, not floats).
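For anyone who wants to check their own database, a query along these lines does it (sensor.porch_illuminance is just a placeholder; on older schemas entity_id is a column on states, while on 2023.4+ it lives in states_meta):

-- older schema
select count(*) from states where entity_id = 'sensor.porch_illuminance';
-- 2023.4+ schema
select count(*)
from states join states_meta using (metadata_id)
where states_meta.entity_id = 'sensor.porch_illuminance';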
I can remove the patch, but if you don’t mind, I’ll wait until recorder.purge_entities becomes available so I can easily discard the excess records after a day or two of testing.
Sounds good. States rows should be at least 26 bytes smaller per row in 2023.4.x as well, so that will help a bit as long as the attributes are not constantly changing.
At least on my large production database (default settings), the minimum savings is ~54MB. The actual savings was closer to 180MB, because that’s only the minimum per-row savings and the indices are smaller as well.
sqlite> select count(state_id) from states;
2211850
sqlite> select count(state_id) * 26 from states ;
57508386
sqlite> select count(state_id) * 26 / 1024 / 1024 from states ;
54
I haven’t read through everything in this post, but I have some other thoughts. I did write up a simple script to throw away every second sampling point for a specific list of entities. If I get the time I will polish it and release it as a stand-alone command-line tool.
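The idea, expressed as SQL rather than the actual script, is roughly this (a sketch only; it assumes the post-2023.4 schema with states_meta, needs SQLite 3.25+ for window functions, sensor.example_illuminance is a placeholder, and Home Assistant should be stopped while it runs):

delete from states
where state_id in (
    select state_id from (
        select s.state_id,
               row_number() over (
                   partition by s.metadata_id
                   order by s.last_updated_ts
               ) as rn
        from states as s
        join states_meta as m on m.metadata_id = s.metadata_id
        where m.entity_id in ('sensor.example_illuminance')
    )
    where rn % 2 = 0  -- drop every second sample
);
-- Note: this can leave old_state_id references pointing at removed rows, so
-- treat it as a rough offline cleanup and back up the database first.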
Another rather simple way to reduce the database size somewhat would be to drop all no-longer-used columns after they’ve been replaced: null columns still use up some storage in each row (*). For example, in my database 8 of the 15 columns in the states table are never populated: entity_id, attributes, event_id, last_changed, last_updated, context_id, context_user_id and context_parent_id. Dropping these unused columns from my states table (with 5.2 million rows in it) frees up almost 40MB of data, and most likely even more disk space, because a smaller row size can increase page fill efficiency. And states is not the only table that has such columns that are always null.
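For reference, the statements in question are plain ALTER TABLE … DROP COLUMN, which needs SQLite 3.35 or newer; back up first, drop any index that still covers a column before dropping it, and be aware that a future recorder migration may expect the columns to exist:

alter table states drop column context_id;
alter table states drop column context_user_id;
alter table states drop column context_parent_id;
-- ...and so on for the other always-null columns, provided nothing else in
-- the schema (an index or foreign key) still references them.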
Completely agree. I need to store data for central heating over the winter. For the rest of the data I don’t care much; 10 days or even less is probably enough.
Please let’s get this retention period per entity done.
I think I would like to see a retention period in days implemented.
My personal preference would be to be able to bulk apply recorder strategies to entities.
Temperature strategy…
max_days: 10
precision: 1
rounding: 'half'
I implement this already with template sensors for precision and rounding, and with a combination of labels and scripts that purge entities at midnight.
So I know I’m late to the party, but my backups are becoming a problem as my MariaDB is full of long-term stats. The actual DB is 370MB, but of that, 300MB is long-term stats… my HA is connected to a super slow Internet connection, so backups take ages to complete.
Most of the LT stats data is junk: for example, I have an insane number of temperature and humidity readings, specifically 25,000 records for each sensor. That’s 3 years of HA data, but I only need a week or a month at most. 99% of the data is unwanted junk… The recorder config is set to 2 days of history, but obviously that changes nothing in the LT space.
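For what it’s worth, a query along these lines shows which statistics carry the most rows (assuming the standard recorder schema where statistics.metadata_id references statistics_meta.id; it works the same on MariaDB and SQLite):

select sm.statistic_id, count(*) as row_count
from statistics as s
join statistics_meta as sm on s.metadata_id = sm.id
group by sm.statistic_id
order by row_count desc
limit 20;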
Technically all I want is my energy stats…
I really need a way to shed this dead load…is there a way to do this?
As an aside, and worse still, all this data also exists in InfluxDB (so I can get it if I need it), but here too there is no way to purge entities I no longer want or even no longer have…everything just grows and grows…it’s completely out of control!
Where integrations define a stats class you don’t want, you can override it in your customize.yaml by setting it to an empty string. You’ll then get a “fix issue” notice under the statistics tab, and one of the options will be to remove the data. Where you have defined a stats class for sensors yourself, remove it and fix the resulting stats issue as mentioned.
However, it still merrily generates long-term statistics… I think I need more elaboration from @parautenbach as to exactly what he means. Override it how? Where?
It’s not necessarily the case, for example, that I never want temperature statistics, but I certainly don’t want long-term stats for all of them. So excluding a whole class isn’t necessarily the answer.
I have half a mind to ditch some integrations and then re-add them but I don’t know if that would clear the data…and to be fair it isn’t really a sustainable approach.
Seriously, you are posting in a feature request for retention periods by entity, and this is meant for the native SQLite DB.
And for some reason you have both MariaDB AND InfluxDB.
Please open a separate topic for your “issues”.
Besides, you might be better off just using “include”, as I mentioned, since that is basically what you are describing.
However, you should still not post your custom integration “issues” in an FR for native components.