How to keep your recorder database size under control

So just create an explicit list of includes?
I guess I could look at my Grafana and make sure everything I’ve graphed gets included.

Yes, it works well and the management is not near as bad as people in this thread make it out to be. I touch my setup maybe twice a year at most. Most things I add at this stage in my system don’t require history. And if they do, I just have to add the specific entities to the file. Yes it would be nice to have it automated, but it’s really not saving any time by going that route. Maybe in 5 years the accumulated time I spend adding to the include list will be greater than the hour(s) it might take to create an automation.

Thanks. I’ll get on to it.
Once done, do you know how I can purge the InfluxDB database to make the file size shrink?

Sorry, not familiar with Influxdb

Include is good for storing only what you need.
Exclude is good for a test setup to find possible issues with your setup.

Thank you! That did it!

1 Like

Good solution:

Also read these comments starting from Ping binary sensor flooding DB by "round_trip_time" attributes · Issue #111531 · home-assistant/core · GitHub

1 Like

Yes, that’s a big step in the right direction!

I’m really surprised this isn’t a standard. Why spam the attributes tables with every change? Many (most?) of them make absolutely no sense to repeat, over and over, often in hundreds of records. Unit of Measurement? Friendly Name? Device Class? State Class? Icon? The list goes on.

Because all these repetitive data are stored as JSON, both the field name and value are spelled out. In ASCII. In every. single. record.

I can only assume this wasn’t the original intent of these tables. Nobody would intentionally design a database this way. Perhaps this situation somehow evolved without the integration developers really understanding the impact.

Also, even with same state & unchanged attributes there could be same consecutive records - with only different last_updated.

1 Like

This is fab post thank you.

Does anyone know how one would exclude media from the backup please? In particular I would like to exclude frigate media from what gets backed up

Thank you

Pieter, how do I run this query against my home assistant?

It would depend on your installation method. I run core, so I can access my DB via a terminal on the same host. If you use HAOS, I think there’s an add-on you can install to get access to the SQLite DB, but I’m not sure, to be honest.

ok thank you

I don’t think you can delete records with the add-on. Anyway, I just copy the .db file to my laptop and run DB Browser for SQLite there. Fast, easy and I don’t mess with the live copy. If I want to do any DB maintenance, like deleting data, I do this while I have HA shutdown for an update or whatever anyway. And always keep a backup copy before making any changes, of course.

1 Like

A couple days ago I had a MariaDB of 55GB.

Now I just managed to get my SQLite db to 175MB thanks to this thread. Omfg. And this is not even diving deep. Wow.

Obv my initial setting of 720 days to be retained wasn’t right. I did not know about long term statistics, so I wanted to keep everything. After learning that and purging down I went down to 1GB, and then I migrated to SQLite as I learned how it was improved last year from HA team.

Then I arrived to this thread, and tbh I wasn’t expecting much. I have only focused on states and disabling the worst entities I found in the states table. Got it down to 500MB first, and now I disabled a few more entities. I did not expect it to go down to 175MB, really. I could carry on, but tbh I am more than happy as it is.

Many thanks for the amazing guide.

What I wanted to ask is exactly how is this table helpful. While the states table served me to purge the heavy entities, what can I do knowing that the state_attributes is the heavyweight here? Is there something I can do to trim it down?

Same for the events table. I can see a few culprits, but no clue how to attack them.

This is something I’ve been exploring, too. It would seem that some entities have a very large number of “attributes” associated with them. Apparently every time the state changes, a new state_attributes record is also stored. Or maybe it’s just when any of the attributes is also changed, I’m not clear on that yet.

The point is, these attributes (which are stored in an extremely verbose JSON format, in which includes the field names - all of them - and values) are repeated over and over again in every single record. This can be result in some pretty intense database spam.

So, look for entities which store lots of attributes. I’ve found browsing the Developer Tools / States page is a good starting point. There’s also an SQL query here which can help. I think you’ll be shocked at what you find.

1 Like

Thanks!

Re the attributes, yeah, I could see some are very lengthy. But again, is there a way of telling the recorder to store the value without attributes or with selected attributes? Or is the only solution to filter the entity again completely?

Thanks for those new queries. The first one about states is a simpler version about the one in the guide really. It didn’t show anything new.

And the other one shows the same but in the long term statistics table. Which I don’t think it’s too useful. In that table all entities are reduced to 1 entry per hour AFAIK. Which means the ones with more entries are your oldest entities (in my case it matches perfectly, also with the number of hours my HA has been running). That’s 24 new entries per entity per day. And these are never purged by HA, since their impact is very low.

But again, that’s adding very little to your db compared to the numbers from the short-term data, which might be registering every few seconds.

Also leaving here the tables query in MB as I find it more descriptive than bytes.

SELECT
  SUM(pgsize) / (1024 * 1024) AS MB,
  name
FROM dbstat
GROUP BY name
ORDER BY MB DESC

Edit: I think you guys will like this query I created with some help from ChatGPT.
This will tell you which entities are the heavyweights in state_attributes exclusively, and how much in MB they use.

The 2 notification count sensors on my 2 phones are taking 15MB themselves. You know where they’re going.

After filtering a few of the culprits, 180MB down to 115MB now. :smiley: This is SO addictive.

SELECT
    ROUND(SUM(LENGTH(shared_attrs) / (1024.0 * 1024.0)), 2) AS attrs_size_mb,
    ROUND((SUM(LENGTH(shared_attrs) / (1024.0 * 1024.0)) * 100) / (SELECT SUM(LENGTH(shared_attrs) / (1024.0 * 1024.0)) FROM state_attributes), 2) AS size_pct,    
    COUNT(*) AS cnt,
    COUNT(*) * 100 / (SELECT COUNT(*) FROM state_attributes) AS cnt_pct,
    states_meta.entity_id    
FROM 
    state_attributes
INNER JOIN 
    states ON state_attributes.attributes_id = states.attributes_id
INNER JOIN 
    states_meta ON states.metadata_id = states_meta.metadata_id
GROUP BY 
    states_meta.entity_id
ORDER BY 
    attrs_size_mb DESC;

In a word, no. For attribute-heavy entities (and there are a lot of them) the best we can do is exclude the entity, then create a template to hold only the value you’d like to keep.

Be careful when defining your templates, too. Giving them things like unit_of_measurement, friendly_name or icon adds those as attributes. So you can end up with the same problem.

Remember, these attributes are stored as JSON strings with each variable name and value spelled out in ASCII text. So while you think you’re only storing a binary or numeric value, you’re actually storing a long, multi-value string with every state change.

Anyway, here’s the SQL I use to see the heavy hitters in the state_attributes table. I can’t claim authorship:

SELECT
  COUNT(state_id) AS cnt,
  COUNT(state_id) * 100 / (
    SELECT
      COUNT(state_id)
    FROM
      states
  ) AS cnt_pct,
  SUM(
    LENGTH(state_attributes.shared_attrs)
  ) AS bytes,
  SUM(
    LENGTH(state_attributes.shared_attrs)
  ) * 100 / (
    SELECT
      SUM(
        LENGTH(state_attributes.shared_attrs)
      )
    FROM
      states
      JOIN state_attributes ON states.attributes_id = state_attributes.attributes_id
  ) AS bytes_pct,
  states_meta.entity_id
FROM
  states
LEFT JOIN state_attributes ON states.attributes_id = state_attributes.attributes_id
LEFT JOIN states_meta ON states.metadata_id = states_meta.metadata_id
GROUP BY
  states.metadata_id, states_meta.entity_id
ORDER BY
  cnt DESC;
2 Likes

No prob. As I said, I’m more than happy now. Just checked your query but I prefer mine tbh, as it’s in MB and sorted for size, not count.

Also noticed my db went down once again, this time down to 80MB. Not even sure why this time, but it must be one of the previous changes hadn’t made an effect yet. 55GB to 80MB. Not even in my wildest dreams.

So my database is suddenly 10GB. Trying to get some stats from the dbstat seems impossible. Nothing happens when I try to pull data from it via SQLite Web. I get no errors. Is this a timeout? If so, what can I do? Can I purge some data to get head over water again?