History data in DB should be reduced

leonidostrovski · April 10, 2023, 12:26am

Some sensors make too much data for DB.
For example, I get memory free status every 5 seconds.
If I need to know what the status of free memory was before 24 hours in resolution 5s? Definitely not. It is enough for me to know what a resolution of five minutes.

In this case, it is possible to reduce ~17k rows in 24h to 300 rows in 24h (~-98%).

Same issue in the case of the binary sensor.
In the case of old data, all I need to know is when the sensor status changes (new status != old status).

And the question is whether there are any ready-made tools or scripts for it.

nickrout · April 10, 2023, 2:32am

Every 15sec actually.

leonidostrovski · April 10, 2023, 2:40am

Thank you.
If I show the history chart for last week, it will be impossible to see points at a resolution of 15 seconds.
In the case of a 15-second resolution, it will be ~40,300 points for one week only…

EdwardTFN · April 10, 2023, 3:01am

What is the problem(s) you are trying to solve?

Too many data stored is making your system slow?
Are you getting out of space?
Are your backups too big?
Are your plots taking too long to show?
Are you having problems reading the information from your plots?
…

By the way, are you aware of the options you can set in your Recorder?

leonidostrovski · April 10, 2023, 3:29am

In all cases, the response is “YES”
My DB increase rate is ~0.5GB/24h
As a result, I am limited to keeping history data for only a few days.
I want to have more time.

I am familiar with Recorder.

In the Recorder I can set what type of data to store.
In addition, I can run a script to purge data by type and time.

I do not know any way to get a partial deletion of historical data.

We can think of this idea like video file compression removes details that the human eye cannot see. I cannot see in the history chart millions of points. Possible remove 99% of points and keep 99.9% of really important data

nickrout · April 10, 2023, 3:50am

There is a HACS integration for a RRD database which does just that.

leonidostrovski · April 10, 2023, 3:56am

I will check it.
Thank you.

koying · April 10, 2023, 11:51am

Are you aware of the fact that “long-term statistics” are already aggregated on a 5m and 1h cycle, so basically fullfiling your FR?
With those, and imho, there is no reason to keep detailed history more than a couple of days.

HeyImAlex · April 10, 2023, 12:38pm

That’s not the job of a general purpose short term history DB like the one used by HA, because it doesn’t know what you consider important data. In fact it is very important for it to store the exact raw data it receives and not alter or distort it. You may not see the details at a larger time range, but you certainly see them on closer ranges. And these can be very important to find spikes, anomalies, etc.

Your problem is not the database, it’s that some of your sensors are too chatty. If these sensors don’t allow a user defined update interval, then you can do it manually using an automation that will take the sensor data, process it in some way (average, local min/max, spike detection, whatever you need) and output the result at a lower frequency or only for changes you consider relevant. That result will then be stored in the DB rather than the original sensor. You may be able to use some pre-existing filter components for that.

LTS won’t store non-numerical sensors. That may or may not be a problem with your specific setup, but I like having my binary sensors available for a month or so.

koying · April 10, 2023, 12:42pm

That’s correct. But, in relation to this FR, I don’t really see how you can aggregate non-numerical data so that’s probably another debate

HeyImAlex · April 10, 2023, 1:01pm

I think this would depend on the type of non-numerical data. There’s no general one fits all solution for these of course. But depending on what your sensors represent, and how you use this data, you may be able to filter out state changes that would be deemed unimportant for longer term storage.

For example, your yard (binary) motion sensors would only store state changes while you’re away. Because those are the state changes you would want to look into maybe a month later. The motion sensor state changes happening while you’re home may be useful for short term debugging, but are just noise for longer term storage. Prioritizing data for permanent or longer term storage like this can be done using automations, but of course all that is entirely custom and very much dependent on your setup. There’s nothing HA can really do to generalize this.

szastan · April 10, 2023, 5:53pm

Hi, this would require us to change current recorder to RRD one. In my case, using MariaDB which is being used for some extra scripts this is not an option although what RRD does is exactly the feature that @leonidostrovski asks for and what would also suit me. Summarizing, it would be great if recorder integration would allow for setting own time range that averages to one data point with possibility for multiple of such conditions, no matter what engine is being used. for example:

for data newer than a week - no change
for data between week and a month, average each minute data points to one record
for data older than one to three months, average each hour data points to one record

… and so on.

leonidostrovski · April 12, 2023, 11:41am

Many thanks
It appears that the RRD could help, but it is not part of the existing database. The RRD makes the system more difficult. As I understand it create additional db.

I want an integrated system that I can customize in the same way I customize Recorder. For example, I might set to store sensor X but not sensor Y, and I might also set to remove all “identical” (row n = row n+1) rows for binary sensors from the history older than 1 day. Or, calculate the mean for a time range of 1 hour, write the result to db, and delate all rows used for the mean calculation.

EdwardTFN · April 12, 2023, 3:04pm

But currently a new state is stored only when the state changes, no? So you shouldn’t have that many cases where row n = row n-1.

By the way, don’t forget to vote for your own request.

bmaehr · August 20, 2024, 11:12pm

I think the inverted solution is more helpful: First store every few seconds and after a day or two weeks drop the unnecessary lines.
In fact I can think about implementing such a job myself.