Large statistics database (1,9G, 40% of total DB)

Everywhere I read this table should not consume much space, and therefore is never cleaned-up. Long term history is stored forever. (Which is nice)

I have a quite large (and messy?) setup and the table is now 1.9 GB in size (that’s not small right, of the 4,8 in total). It contains 12.8 Million records. What is wrong here?

The states table was even worse but I started cleaning a lot, excluding lot from the recorder, disabled many entities, reduced update rates, etc… This table is now reduced from 4.4 GB (14.5M records) to 1.9GB (3.5 M records). I’m still cleaning and the recorder history/purge settings will clean this tables over time.

I removed many entities from the developer tools > statistics that didn’t have any states anymore, old, removed, etc. I can;t clean anything else from here.
However, the amount of records in this table is only slightly reduced from 13M to 12.8M.

How can I trim this statistics table (without removing important data)? Is there a way to figure out if there are some specific values overflowing this table?

I keep reading this table should be very small in size, 1,9 GB or about 40% of the total DB size is not small. So, I assume something is wrong?

I used this to count the amount of line of each entity.
There are currently 1190 unique entities.

SELECT 
  statistics.metadata_id,
  statistics_meta.id,
  statistics_meta.statistic_id, 
  COUNT(*)
FROM statistics, statistics_meta
WHERE statistics.metadata_id = statistics_meta.id
GROUP BY
    statistics_meta.statistic_id
ORDER BY
    COUNT(*) DESC;

some entities have 27K entries in this table, while other have only 3-5K.
27K hourly entries is about 3Y, which might be possible and match the introduction of statistics. So, not sure if this is strange or just normal.

What I do see, is still some old/removed statistics. For example some linkquality entities that I disabled (and “fixed”) in the statistics page. However, the values are still in the table.

Did you change your recorder purge keep days?

The 5 minute LTS data is not deleted until after that time.

I recently had mine taking up over 80% of my database.

OK, I admit I’m a bit aggressive about keeping my recorder database trim, and this was just after a purge. So it was an extreme example. But you’re not wrong that this is bad design.

To add insult to injury, there’s no “purge” function for the long-term statistics tables. Personally, I don’t use these data. Forcing users to incur the storage and read/write overhead this process requires flies in the face of the whole “local control” idea of HA. If I have other ways to track and analyze my long-term data, I shouldn’t be forced to store and retain unused and unwanted data.

The best workaround I’ve found is to hammer the DB directly. By that I mean, while I have HA shut down for an update anyway, I’ll run some SQL to manually do a purge of LTS data.

DELETE FROM statistics WHERE start_ts < (CAST(strftime('%s', 'now', '-4 days') AS FLOAT));

DELETE FROM statistics_short_term WHERE start_ts < (CAST(strftime('%s', 'now', '-4 days') AS FLOAT));

DELETE FROM statistics_meta
WHERE id NOT IN (SELECT DISTINCT metadata_id FROM statistics)
  AND id NOT IN (SELECT DISTINCT metadata_id FROM statistics_short_term);

Change the DELETE FROM to SELECT * FROM if you just want to see what’s going to be deleted before running the DELETE. These lines hard-code a 4-day retention period, but you could change that.

That last SQL statement cleans up any “orphaned” data from the statistics_meta table, which may be what you’re really after.

Another workaround you may want to try is to change the state_class of the offending entities. I’m still not 100% clear on how best to do that, or any unintended consequences of doing that. Bur presumably it could eliminate any unwanted LTS data altogether.

1 Like

Not recently, the purge_keep_days = 7
So I understand that all optimizations I make (reduce update frequencies, disable entities, etc), will have their full effect after 7 days, which is fine. Hence, see the current difference in the states table.

This is a screenshot ~1 weeks ago:

This is the current overview:

The issue I face is that the statistics table can;t be reduced.

I was thinking of such an approach, but this is removing it all, right?

I can’t find any “offending” entities, some are just 3 years old.
I still find some of the data valuable (energy stats for example)

But there are many uninteresting values. A toggle would be nice, or opt-in for interesting values.

DELETE FROM statistics_meta
WHERE id NOT IN (SELECT DISTINCT metadata_id FROM statistics)
  AND id NOT IN (SELECT DISTINCT metadata_id FROM statistics_short_term);

Interesting, however is this not only removing the meta data? The bulk of the info is in the statistics table, right?

Yes, I don’t use LTS at all, so removing it all (or, in my example, all but the last 4 days) is what I want.

I agree there should be a toggle for each entity, indicating whether or not to bother keeping LTS for that entity. I suspect that the vast majority of them would be a firm “no.” Of course excluding them from Recorder altogether would work, but then you wouldn’t even have data back to your purge_days setting, which isn’t appropriate for all entities.

As I mentioned, there is also the option to change the state_class, which would presumably “toggle” LTS off for that entity, but I haven’t had time to go down that rabbit hole.

While we’re on the subject, not all entities require the same purge_days, either. Some things I might want to keep around for a few days, some for a month, a year, or whatever. I’ve recommended setting a purge_days for each entity, but that’s gone nowhere.

Right. This was just a clean-up to remove “orphaned” data from entities which no longer exist. I only included it because your first post mentioned cleaning up unused data.

You can probably use customize to set the state_class to none.

I cannot believe that LTS can be 1.9 GB.
How many sensors with state_class do you have?
And how many months does your LTS include?
Having these two values, we can estimate a possible size.

Once the 5 minute data is purged there is only hourly data. That’s 24 samples per day. The samples contain one each of average, min and max. Assuming double precision floating point (64bit):

24 x 3 x 64 = 4.608kb per day per entity, or 0.576kB per day per entity, kept forever. This equates to 205kB per year per entity i.e. bugger all.

For the 5 minute data (only kept for seven days) it is 288 times this amount, or ~166kB per entity per day. Kept for 7 days that’s about 1MB per entity for the whole week. Still not a lot.

That’s for the measurement state class. For the total device classes only a count is kept. So 3 times less.

EDIT: fixed math

1 Like

Are you sure?
64 bit = 8 byte, so it is 24 * 3 * 8=576 byte. About 0,5 Kbyte. Per entity. Per day.

Hah yeah. Still drinking my coffee. :man_facepalming:

Will update.

And I do not remember what year LTS was incorporated. Assume it was 2021, and LTS is kept for 1000 days.
Then we have 0,5Kbyte * 1000 = 0.5 Mbyte approx. For 1000 days.
How many entities are needed to occupy 1GB - it is 2000 sensors approx.
Well, may be someone really have such a number of sensors with state_class))))

There will be other factors like database table indexing as overhead but it won’t be much.

Here’s my database for the last two years:

It’s a bit difficult to actually plot a trend due to me adding and removing things all the time and the optimisations that have occurred. Also this includes event data. I keep 7 days and exclude a lot of stuff I don’t need from the recorder and have about 750 sensors.

alarm_control_panel: 1
automation: 299
binary_sensor: 231
button: 168
calendar: 4
camera: 14
climate: 3
conversation: 1
device_tracker: 2
event: 2
fan: 3
group: 1
image: 5
input_boolean: 18
input_datetime: 12
input_number: 30
input_select: 19
input_text: 1
light: 62
lock: 11
media_player: 15
number: 86
person: 1
remote: 4
schedule: 7
script: 246
select: 92
sensor: 756
stt: 1
sun: 1
switch: 112
tag: 7
todo: 5
tts: 1
update: 96
vacuum: 2
weather: 2
zone: 6

My DB is about 1.5 GB with about 3500 entities with 30 days.
It used to be larger - up to 10GB - and it happened several times only because of “sys_error_log”. Then I excluded this event.

I have about 1190 entities covered by the LTS.
(Or at least unique metadata_id’s)
1.9GB (for only the statistics table) is a fact, so that is about 1,5MB per item. (doesn’t make sense)

(Total DB size = 4,5. No issues with auto purging “states” and “short_term stats”, understand that part. Cleaning up entities will reduce size eventually)

SELECT 
  statistics.metadata_id,
  statistics_meta.id,
  statistics_meta.statistic_id, 
  COUNT(*)
FROM statistics, statistics_meta
WHERE statistics.metadata_id = statistics_meta.id
GROUP BY
    statistics_meta.statistic_id
ORDER BY
    COUNT(*) DESC;

The ones with highest number of entries have 24478 records, about 1020 days.
(So, nothing strange, right?).

About data calculations: There are are a created_ts and start_ts for every record (2 more double;s).
Still, even if all where 300kB/per year, (some will be less, total device class), and for sure not all exist for the full three years.
I should have 1200 (entities) X 300kB x 3 yeas = 1 GB (Still not nothing, but okay) How is it still 1.9GB in size then?

I expect many old records that are never removed.
For example, I’m now started cleaning, disabling entities like a temperature signal reported by a door sensor (which is not accurate, so useless). Currently this value is also in LTS.
(Checked and I’m able to retrieve value from the whole year).
When I disable the entity, it’s immediately gone from the list. Developer Tools > Statistics >
There is no repair/fix, it’s just gone.
However, even after reboot, when I search directly in the statistics tables, I still find all ¬20K+ records.

On my quest to reduce DB size, another question:

I excluded some useless values from the recorder.
(I did not disable them, as the current value is still useful)

entity_globs:
  - sensor.*_battery
  - sensor.*_signal_strength

However, these lead to two different results…???

This entity is excluded from being recorded:
image

Which is true, no “fix”. Will be here as a reminder.

However, for the battery:
image

Now, there is a “fix” to remove the old values.

Why is this different???

Not sure, just a speculation:
Perhaps the 1st entity was excluded BEFORE any LTS data were stored?

Not sure… you might be right.
After LTS was introduced more and more device classes where added (afaik) to be compatible.
But both signal signal_strength and battery I excluded recently.

I first believed it had anything to do with different entity types (like the ones are marked as “diagnostic”. But that’s not the case. I removed what I can by “fix”.

But okay, now I’m stuck with many “This entity is excluded from being recorded.”, most of the are _battery.