How to keep your recorder database size under control

Wingnut · March 2, 2024, 4:20pm

So just create an explicit list of includes?
I guess I could look at my Grafana and make sure everything I’ve graphed gets included.

petro · March 2, 2024, 4:23pm

Yes, it works well, and the management is not nearly as bad as people in this thread make it out to be. I touch my setup maybe twice a year at most. Most things I add at this stage in my system don’t require history. And if they do, I just have to add the specific entities to the file. Yes, it would be nice to have it automated, but automating it really wouldn’t save any time. Maybe in 5 years the accumulated time I spend adding to the include list will be greater than the hour(s) it might take to create an automation.

Wingnut · March 2, 2024, 4:29pm

Thanks. I’ll get on to it.
Once done, do you know how I can purge the InfluxDB database to make the file size shrink?

petro · March 2, 2024, 4:32pm

Sorry, I’m not familiar with InfluxDB.

Ildar_Gabdullin · March 2, 2024, 7:19pm

Include is good for storing only what you need.
Exclude is good for a test setup to find possible issues with your setup.

Alfamonk · March 2, 2024, 7:36pm

Thank you! That did it!

Ildar_Gabdullin · March 3, 2024, 4:38pm

Good solution:

github.com/home-assistant/core

Move Ping binary sensor attributes to sensor entities

home-assistant:dev ← jpbede:ping-attrs-to-entities

opened 10:18PM - 01 Mar 24 UTC by jpbede (+366 -22)

Proposed change: Move the attributes of the binary sensor of the Ping integration to sensor entities. The entities are disabled by default and can be enabled if needed. This way we don't store data that is not needed or wanted by the user. The attributes are kept for backward compatibility and can be removed with version 2024.7. I don't actually know what the path for deprecation would look like, as I don't know of any way we can determine attribute usage.

Type of change: new feature (which adds functionality to an existing integration); deprecation (breaking change to happen in the future).

Additional information: fixes #111531. Link to documentation pull request: https://github.com/home-assistant/home-assistant.io/pull/31705

Also read the comments in Ping binary sensor flooding DB by "round_trip_time" attributes · Issue #111531 · home-assistant/core · GitHub

CaptTom · March 3, 2024, 6:34pm

Yes, that’s a big step in the right direction!

I’m really surprised this isn’t a standard. Why spam the attributes tables with every change? Many (most?) of them make absolutely no sense to repeat, over and over, often in hundreds of records. Unit of Measurement? Friendly Name? Device Class? State Class? Icon? The list goes on.

Because all these repetitive data are stored as JSON, both the field name and value are spelled out. In ASCII. In every. single. record.

I can only assume this wasn’t the original intent of these tables. Nobody would intentionally design a database this way. Perhaps this situation somehow evolved without the integration developers really understanding the impact.

Ildar_Gabdullin · March 3, 2024, 7:11pm

Also, even when the state and attributes are unchanged, there can be identical consecutive records that differ only in last_updated.

danuw · March 7, 2024, 4:10pm

This is a fab post, thank you.

Does anyone know how one would exclude media from the backup, please? In particular, I would like to exclude Frigate media from what gets backed up.

Thank you

JohnSchols · March 12, 2024, 11:50am

Pieter, how do I run this query against my home assistant?

parautenbach · March 12, 2024, 12:09pm

It would depend on your installation method. I run core, so I can access my DB via a terminal on the same host. If you use HAOS, I think there’s an add-on you can install to get access to the SQLite DB, but I’m not sure, to be honest.

JohnSchols · March 12, 2024, 12:15pm

ok thank you

CaptTom · March 12, 2024, 12:45pm

I don’t think you can delete records with the add-on. Anyway, I just copy the .db file to my laptop and run DB Browser for SQLite there. Fast, easy and I don’t mess with the live copy. If I want to do any DB maintenance, like deleting data, I do this while I have HA shut down for an update or whatever anyway. And always keep a backup copy before making any changes, of course.
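For anyone who would rather script the "inspect a copy" approach than use a GUI, here is a minimal Python sketch. It assumes you have already copied the database file off the HA host. The function name `table_row_counts` is made up for illustration, and the file is opened read-only so the copy cannot be changed by accident.

```python
import sqlite3
from pathlib import Path


def table_row_counts(db_path: str) -> dict[str, int]:
    """Return {table_name: row_count} for a SQLite file opened read-only."""
    # mode=ro makes SQLite refuse writes on this connection.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # List user tables, skipping SQLite's internal sqlite_* tables.
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
            )
        ]
        # Names come straight from sqlite_master, so quoting them is enough here.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        conn.close()
```

Calling it on your copied file (for example `table_row_counts("copy-of-my-ha.db")`, a placeholder name) gives a quick first look at which tables hold the most rows before you dig in with the SQL queries from this thread.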

Chaoscontrol · March 15, 2024, 10:49pm

A couple days ago I had a MariaDB of 55GB.

Now I just managed to get my SQLite db to 175MB thanks to this thread. Omfg. And this is not even diving deep. Wow.

Obv my initial setting of 720 days to be retained wasn’t right. I did not know about long term statistics, so I wanted to keep everything. After learning about them and purging, I got down to 1GB, and then I migrated to SQLite once I learned how the HA team had improved it last year.

Then I arrived at this thread, and tbh I wasn’t expecting much. I have only focused on states and disabling the worst entities I found in the states table. Got it down to 500MB first, and now I disabled a few more entities. I did not expect it to go down to 175MB, really. I could carry on, but tbh I am more than happy as it is.

Many thanks for the amazing guide.

What I wanted to ask is how exactly this table is helpful. While the states table helped me purge the heavy entities, what can I do knowing that state_attributes is the heavyweight here? Is there something I can do to trim it down?

Same for the events table. I can see a few culprits, but no clue how to attack them.

CaptTom · March 16, 2024, 1:46am

This is something I’ve been exploring, too. It would seem that some entities have a very large number of “attributes” associated with them. Apparently every time the state changes, a new state_attributes record is also stored. Or maybe it’s just when any of the attributes is also changed, I’m not clear on that yet.

The point is, these attributes (which are stored in an extremely verbose JSON format, which includes the field names - all of them - and values) are repeated over and over again in every single record. This can result in some pretty intense database spam.

So, look for entities which store lots of attributes. I’ve found browsing the Developer Tools / States page is a good starting point. There’s also an SQL query here which can help. I think you’ll be shocked at what you find.

Chaoscontrol · March 16, 2024, 8:32am

Thanks!

Re the attributes, yeah, I could see some are very lengthy. But again, is there a way of telling the recorder to store the value without attributes or with selected attributes? Or is the only solution to filter the entity again completely?

Thanks for those new queries. The first one about states is really a simpler version of the one in the guide. It didn’t show anything new.

And the other one shows the same but for the long term statistics table, which I don’t think is too useful. In that table all entities are reduced to 1 entry per hour AFAIK. Which means the ones with more entries are your oldest entities (in my case it matches perfectly, also with the number of hours my HA has been running). That’s 24 new entries per entity per day. And these are never purged by HA, since their impact is very low.

But again, that’s adding very little to your db compared to the numbers from the short-term data, which might be registering every few seconds.
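To put rough numbers on that comparison, here is a small back-of-the-envelope sketch in Python. The 10-second update interval is just an assumed example, not a measurement from any real sensor, and `rows_per_day` is a made-up helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rows_per_day(update_interval_s: float) -> int:
    """Rows written per day by one entity that records every update_interval_s seconds."""
    return int(SECONDS_PER_DAY // update_interval_s)


# Long-term statistics: one entry per hour per entity (per the post above).
long_term = rows_per_day(3600)
# Short-term states: hypothetical sensor that changes every 10 seconds.
short_term = rows_per_day(10)

print(long_term, short_term, short_term // long_term)  # prints: 24 8640 360
```

So under that assumption a single chatty sensor writes as many state rows in one day as its long-term statistics accumulate in about a year, which is why the short-term tables are the place to look.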

Also leaving here the tables query in MB as I find it more descriptive than bytes.

SELECT
  SUM(pgsize) / (1024 * 1024) AS MB,
  name
FROM dbstat
GROUP BY name
ORDER BY MB DESC

Edit: I think you guys will like this query I created with some help from ChatGPT.
This will tell you which entities are the heavyweights in state_attributes exclusively, and how much in MB they use.

The 2 notification count sensors on my 2 phones are taking 15MB themselves. You know where they’re going.

After filtering a few of the culprits, 180MB down to 115MB now. This is SO addictive.

SELECT
    ROUND(SUM(LENGTH(shared_attrs) / (1024.0 * 1024.0)), 2) AS attrs_size_mb,
    ROUND((SUM(LENGTH(shared_attrs) / (1024.0 * 1024.0)) * 100) / (SELECT SUM(LENGTH(shared_attrs) / (1024.0 * 1024.0)) FROM state_attributes), 2) AS size_pct,    
    COUNT(*) AS cnt,
    COUNT(*) * 100 / (SELECT COUNT(*) FROM state_attributes) AS cnt_pct,
    states_meta.entity_id    
FROM 
    state_attributes
INNER JOIN 
    states ON state_attributes.attributes_id = states.attributes_id
INNER JOIN 
    states_meta ON states.metadata_id = states_meta.metadata_id
GROUP BY 
    states_meta.entity_id
ORDER BY 
    attrs_size_mb DESC;
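If you want to see how this kind of ranking behaves before running it on a real database, here is a toy Python sketch. It builds an in-memory SQLite database using only the table and column names that appear in the queries in this thread (the real recorder schema has many more columns), fills it with made-up rows, and runs a simplified version of the size ranking.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down stand-ins for the recorder tables; only the columns the query needs.
conn.executescript("""
CREATE TABLE states_meta (metadata_id INTEGER PRIMARY KEY, entity_id TEXT);
CREATE TABLE state_attributes (attributes_id INTEGER PRIMARY KEY, shared_attrs TEXT);
CREATE TABLE states (state_id INTEGER PRIMARY KEY, metadata_id INTEGER, attributes_id INTEGER);
""")
# Made-up entities: one with few but bulky attribute rows, one with many small ones.
conn.executemany(
    "INSERT INTO states_meta VALUES (?, ?)",
    [(1, "sensor.phone_notifications"), (2, "binary_sensor.door")],
)
conn.executemany(
    "INSERT INTO state_attributes VALUES (?, ?)",
    [
        (1, '{"title":"' + "a" * 60 + '"}'),
        (2, '{"title":"' + "b" * 80 + '"}'),
        (3, '{"device_class":"door"}'),
    ],
)
# The door's three states all point at the same attributes row (attributes_id 3).
conn.executemany(
    "INSERT INTO states VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 1, 2), (3, 2, 3), (4, 2, 3), (5, 2, 3)],
)

# LENGTH() on TEXT counts characters, which equals bytes only for ASCII data.
QUERY = """
SELECT states_meta.entity_id,
       COUNT(*) AS cnt,
       SUM(LENGTH(state_attributes.shared_attrs)) AS attr_chars
FROM states
JOIN state_attributes ON states.attributes_id = state_attributes.attributes_id
JOIN states_meta ON states.metadata_id = states_meta.metadata_id
GROUP BY states_meta.entity_id
ORDER BY attr_chars DESC
"""
ranking = conn.execute(QUERY).fetchall()
for entity_id, cnt, attr_chars in ranking:
    print(entity_id, cnt, attr_chars)
```

In this toy data the phone sensor comes out on top by size even though the door sensor has more rows. Note also that a join like this counts a shared attributes row once for every state that references it, so the totals are best read as a relative ranking, not an exact on-disk size.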

CaptTom · March 16, 2024, 12:29pm

In a word, no. For attribute-heavy entities (and there are a lot of them) the best we can do is exclude the entity, then create a template to hold only the value you’d like to keep.

Be careful when defining your templates, too. Giving them things like unit_of_measurement, friendly_name or icon adds those as attributes. So you can end up with the same problem.

Remember, these attributes are stored as JSON strings with each variable name and value spelled out in ASCII text. So while you think you’re only storing a binary or numeric value, you’re actually storing a long, multi-value string with every state change.
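To make that concrete, here is a small Python sketch comparing the size of a state value with the size of its attributes once serialized as JSON. The attribute set below is a made-up example of a typical temperature sensor, and compact UTF-8 JSON is an assumption about the serialization, not a statement about exactly how the recorder writes it.

```python
import json


def stored_sizes(state: str, attributes: dict) -> tuple[int, int]:
    """Return (state_bytes, attributes_json_bytes) for one state change."""
    state_bytes = len(state.encode("utf-8"))
    # Assumption: compact JSON, UTF-8 encoded; every key name is spelled out.
    attrs_json = json.dumps(attributes, ensure_ascii=False, separators=(",", ":"))
    return state_bytes, len(attrs_json.encode("utf-8"))


# Hypothetical attributes for a temperature sensor.
example_attributes = {
    "unit_of_measurement": "°C",
    "device_class": "temperature",
    "state_class": "measurement",
    "friendly_name": "Living Room Temperature",
    "icon": "mdi:thermometer",
}

state_bytes, attrs_bytes = stored_sizes("21.5", example_attributes)
print(f"state: {state_bytes} bytes, attributes: {attrs_bytes} bytes")
```

Even with only five ordinary attributes, the JSON is many times larger than the value you actually care about, which is the reason to keep template sensors as bare as possible.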

Anyway, here’s the SQL I use to see the heavy hitters in the state_attributes table. I can’t claim authorship:

SELECT
  COUNT(state_id) AS cnt,
  COUNT(state_id) * 100 / (
    SELECT
      COUNT(state_id)
    FROM
      states
  ) AS cnt_pct,
  SUM(
    LENGTH(state_attributes.shared_attrs)
  ) AS bytes,
  SUM(
    LENGTH(state_attributes.shared_attrs)
  ) * 100 / (
    SELECT
      SUM(
        LENGTH(state_attributes.shared_attrs)
      )
    FROM
      states
      JOIN state_attributes ON states.attributes_id = state_attributes.attributes_id
  ) AS bytes_pct,
  states_meta.entity_id
FROM
  states
LEFT JOIN state_attributes ON states.attributes_id = state_attributes.attributes_id
LEFT JOIN states_meta ON states.metadata_id = states_meta.metadata_id
GROUP BY
  states.metadata_id, states_meta.entity_id
ORDER BY
  cnt DESC;

Chaoscontrol · March 16, 2024, 1:07pm

No prob. As I said, I’m more than happy now. Just checked your query but I prefer mine tbh, as it’s in MB and sorted by size, not count.

Also noticed my db went down once again, this time down to 80MB. Not even sure why this time, but one of the previous changes must not have taken effect yet. 55GB to 80MB. Not even in my wildest dreams.

krossykross · April 1, 2024, 10:32am

So my database is suddenly 10GB. Trying to get some stats from dbstat seems impossible. Nothing happens when I try to pull data from it via SQLite Web. I get no errors. Is this a timeout? If so, what can I do? Can I purge some data to get my head above water again?