How to keep your recorder database size under control

Yes, this is normal behavior. For a while now the entity_id has been stored in the states_meta table and is referenced from the states table by metadata_id (it should be the last column in a row). This saves a lot of space, since only a small integer is stored instead of a text string that is roughly 5-20 times larger.
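
If you want to see the entity names again while browsing, here is a minimal sketch of the join (assuming the current default schema):

-- Resolve metadata_id back to entity_id for the newest state rows
SELECT states_meta.entity_id, states.state, states.last_updated_ts
FROM states
JOIN states_meta ON states.metadata_id = states_meta.metadata_id
ORDER BY states.last_updated_ts DESC
LIMIT 20;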

Thanks a lot for the reply.
Here is why I asked this question:

Due to some issues with processing Jinja templates in a custom card, I observed this chain reaction:
huge CPU load → rapidly growing DB
The DB size jumped from ~1GB to ~8GB within a few hours.

Why these templates & card caused this huge CPU load is another issue.
Why this huge CPU load caused constant rapid flooding of the DB is also another issue (I have no idea; maybe some integration, stock or custom, reacted in this abnormal way).

Calling the “purge+repack” service reduced the DB size to ~7GB.
This is still abnormal. So I am trying to:
– find out why the DB size increased,
– find out why the DB size was not reduced by the “purge+repack” service (issue on GitHub),
– find out how to reduce the DB myself, since the “standard purge service” failed.

What I did:

  1. Opened the DB file in DB Browser for SQLite.
  2. Ran a query posted here.
  3. It gave me a table which I pasted into Excel and sorted by the “bytes” column:
    [screenshot: query results sorted by the “bytes” column]
    where the 1st row is a calculated sum (MB), i.e. all entities occupy ~49MB in the DB.

I assumed that the “bytes” column stands for the “size of stored data”, since the OP did not add any description in the first post.

There is a difference between 49MB & 7GB, so I wonder what could occupy all this space.

For me this script is very useful, as it focuses on the size of the entities in the states table rather than the number of entries. However, as it gives me the same error others have seen, I wonder if anyone has managed to fix it up? I see there are fixes for the scripts that count the number of entries in the states and stats tables, but my SQL is not good enough to do this for this one :slight_smile:

I am also interested to know how I can get the size of the data stored for events.
When Python goes crazy in my setup, I see tons of error messages in the Log.
The DB then grows very fast.
And I think the DB grows due to data stored for events:


What are these records for?
How can I get the size of this data?
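
As a starting point, maybe something along these lines would show it per event type (a sketch only; it assumes the recent schema where events reference event_types and event_data - older databases keep event_type directly on the events table):

-- Rough per-event-type count and payload size of shared event data
SELECT
  event_types.event_type,
  COUNT(events.event_id) AS cnt,
  SUM(LENGTH(event_data.shared_data)) AS bytes
FROM events
LEFT JOIN event_data ON events.data_id = event_data.data_id
LEFT JOIN event_types ON events.event_type_id = event_types.event_type_id
GROUP BY event_types.event_type
ORDER BY bytes DESC;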

Fixed:

-- Per-entity row count and attribute payload size in the states table,
-- plus each entity's share (%) of total rows and total attribute bytes.
SELECT
  COUNT(state_id) AS cnt,
  COUNT(state_id) * 100 / (
    SELECT
      COUNT(state_id)
    FROM
      states
  ) AS cnt_pct,
  SUM(
    LENGTH(state_attributes.shared_attrs)
  ) AS bytes,
  SUM(
    LENGTH(state_attributes.shared_attrs)
  ) * 100 / (
    SELECT
      SUM(
        LENGTH(state_attributes.shared_attrs)
      )
    FROM
      states
      JOIN state_attributes ON states.attributes_id = state_attributes.attributes_id
  ) AS bytes_pct,
  states_meta.entity_id
FROM
  states
LEFT JOIN state_attributes ON states.attributes_id = state_attributes.attributes_id
LEFT JOIN states_meta ON states.metadata_id = states_meta.metadata_id
GROUP BY
  states.metadata_id, states_meta.entity_id
ORDER BY
  cnt DESC;

(Tested on Postgres.)

1 Like

Thank you for posting that! I’d been meaning to get around to something like that myself.

I learned something very interesting from that SQL statement. There’s one entity which is taking up 36% of this table, according to the output of this statement. It’s binary_sensor.internet_connection.

Huh? I figured a binary sensor really only needs one bit! But no. When I look at it in Developer Tools / States, it shows a bunch of attributes. So “binary sensor” is really a misnomer. It can pack a lot of attribute data.

I also assumed that this sensor’s state should change only very rarely. My ISP is normally connected, sometimes for weeks on end. But with each ping, a bunch of “round trip time” attributes change. So this one sensor is creating 28% of the state attribute records in my database!

If this is working the way I think it is, I need to look beyond just how many times a state changes, to how many attributes are attached to it and how often they change.

For me, this opens up a whole new dimension to Recorder database management. I figured I’d mention this and see what everyone else is experiencing.

1 Like

You’re welcome!

I also picked up on the ping sensors. I still need to work through everything else, but I also want to find the sensors with lots of attributes that also cause a lot of updates.

Yeah, the ping integration keeps all round trip time attribute changes in the database, and there’s no configuration option to change that. In my configuration, I create an equivalent template binary_sensor that only mirrors the ping binary_sensor value and does not include the ever changing RTT attributes. Then I exclude the ping binary_sensor from recorder and include the template binary_sensor, so I still have a history of when the ping fails.
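
Roughly like this (a sketch; the template sensor name is just an example, and with an exclude-only recorder config the new template sensor is recorded by default):

template:
  - binary_sensor:
      - name: internet_connection_state
        device_class: connectivity
        state: "{{ is_state('binary_sensor.internet_connection', 'on') }}"

recorder:
  exclude:
    entities:
      - binary_sensor.internet_connection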

In the past, all media_player attributes were also recorded in the database. This was problematic because it included every media position update and made the database huge if you were e.g. streaming media all day. There is a mechanism now for integrations to exclude specific attributes from recorder, but this must be implemented in the integration code itself.

It’s a bit annoying that the attribute history can’t be shown via the default history graph card or via the History page even though it’s stored in the database, though luckily mini-graph-card supports showing attributes.

2 Likes

Thanks again, I already had making a template for just the state, and excluding the source entity, on my schedule for today. You beat me to it! I wish HA had hired a DBA instead of a UI designer.

Update: That worked. After excluding the offending entity and two others from the top-ten list, I did a purge of those entities, then a full purge and repack. Sure enough, the database shrunk by 39.5%, about the amount the SQL results showed those three entities to be using.

It seems crazy to have to create a binary sensor template to mirror a binary sensor, but it works.

1 Like

There’s one more crazy trick that you may wish to consider. If you did want some RTT history, but not at the same frequency as the ping check, then you can set up a template sensor that is triggered by a time pattern that mirrors the relevant attribute. Obviously this will increase the database size, but not at the same rate that it did before.
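
Something along these lines (a sketch; the round_trip_time_avg attribute name and the 5-minute pattern are just examples):

template:
  - trigger:
      - platform: time_pattern
        minutes: "/5"
    sensor:
      - name: internet_rtt_sampled
        unit_of_measurement: ms
        state: "{{ state_attr('binary_sensor.internet_connection', 'round_trip_time_avg') }}"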

2 Likes

An alternative way is to use a “device_tracker” for ping instead of the binary_sensor.
The device_tracker entity does not contain these “trip” attributes.
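
For reference, a minimal sketch of the ping device_tracker config (the host name and IP are placeholders):

device_tracker:
  - platform: ping
    hosts:
      internet_gateway: 192.168.1.1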

There are many integrations which create frequently changing attributes, like:

  1. Open Hardware Monitor - creates plenty of useless attributes (issue).
  2. Composite device tracker - creates “last_seen” attribute inside “device_tracker” entity (issue).
  3. Similarly:
  • “last_time_reachable” in ASUSWRT - the attribute was excluded, but the DB still stores the same state every 30 sec (issue)
  • “updated” in starline - here it is even worse, the integration consecutively writes records with the same state & attributes (issue).
  4. iPhoneDetect platform - writes records with the same states & attributes (issue).

Also, here is an issue for ping binary_sensor.

So, useless writing to the DB is not considered a bad thing by many platforms…

1 Like

Yeah, I was quite surprised when I saw all the garbage in the state_attributes table. For example, the speedtest integration has lots of records, each with a long JSON string containing all kinds of static data like class, server_name, server_country, server_id, unit_of_measurement, device_class, icon and friendly_name.

Likewise with climate entities, which record every possible thermostat option and value, well beyond anything I’d care about like the actual and set temperatures.

It would seem that lots and lots of static data is recorded, over and over again, with each state change. Doesn’t seem like very good database design to me.

I am not a DB expert.
Consider this entity with ONE attribute:

template:
  - sensor:
      - name: xxx
        state: ...
        attributes:
          attr: ...

Here is a table of 10 changes:
[screenshot: 10 state changes of the entity]

All possible attribute combinations are stored in the “state_attributes” table:
[screenshot: rows of the “state_attributes” table]

Then the “states” table contains an “attributes_id” referencing the corresponding set of attribute values:
[screenshot: “states” table rows with their “attributes_id” values]

That was an example with ONE attribute.
If an entity contains MANY attributes and ALL of them are static, then the “state_attributes” table will have ONE row, and the “states” table will have rows with non-repeating states:
[screenshot: “states” table rows for an entity whose attributes never change]

So, this DB structure saves some space by storing only unique attribute sets.
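
You can check this for a specific entity; here is a rough sketch of a query (sensor.xxx is a placeholder, and it assumes the recent schema with states_meta):

-- How many state rows point at each stored attribute set for one entity
SELECT state_attributes.attributes_id,
       state_attributes.shared_attrs,
       COUNT(states.state_id) AS rows_using_it
FROM states
JOIN state_attributes ON states.attributes_id = state_attributes.attributes_id
JOIN states_meta ON states.metadata_id = states_meta.metadata_id
WHERE states_meta.entity_id = 'sensor.xxx'
GROUP BY state_attributes.attributes_id, state_attributes.shared_attrs
ORDER BY rows_using_it DESC;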

In fact, even in this case there are examples of “many rows for the same states & attributes” - here, here.

If an entity has MANY changing attributes like this one:
[screenshot: an entity with many frequently changing attributes]
then the “state_attributes” table will have an entry for EACH unique combination of attribute values.
I guess storing these attributes becomes an additional load on the DB engine (not to mention the space).
That is why it is better to exclude such entities from the Recorder.

Even excluding an attribute from the recorder causes excessive writing to the DB if that attribute keeps changing.

I think you’ve got it.

Yes, it would make sense to save the static attributes once, and let each state change reference that one record in the states_attributes table.

What seems to be happening, however, is that integrations are loading dynamic data into the attributes. For example, each ping has four different round_trip_time values associated with it. They’re virtually never all going to be the same. Thus for each state change, a new record is saved to the state_attributes table, containing both the static and dynamic attributes.

Poking around the database I see a lot of entities like this. Now that I know what to look for, I have been looking through the attributes in Developer Tools / States.
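
To spot such entities in bulk, here is a rough sketch of a query that compares the number of state rows with the number of distinct attribute sets per entity (again assuming the recent schema):

SELECT states_meta.entity_id,
       COUNT(states.state_id) AS state_rows,
       COUNT(DISTINCT states.attributes_id) AS distinct_attr_sets
FROM states
LEFT JOIN states_meta ON states.metadata_id = states_meta.metadata_id
GROUP BY states_meta.entity_id
ORDER BY distinct_attr_sets DESC;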

And even excluding these “round_trip_time” attributes from the recorder will not help - every successful ping will still create a new record in the DB. See the attached issue above.
Solutions:
– change this behaviour in Recorder (do not create a new record when only excluded attrs change)
– exclude these attributes from the entity
– exclude the whole entity from the recorder & replace it with a template sensor

I’m trying to get my head round all of this before I start fiddling with includes/excludes.
My situation is that I’ve been running HA for 4 years now on a generic x86-64 NUC. My ‘Partial Backups’ (just excluding Media) are now 17Gb and, with my dismal upload speed, the time to get them onto my Google Drive is about 48hrs. Hence me wanting to get this size down.
The system has grown over these 4 years and I seem to have 3 databases running, namely the default(?), MariaDB and InfluxDB. I think I installed these latter 2 for the Traccar and Grafana Add-Ons.
Looking at the backup with 7-Zip shows me the biggest components are influxdb.tar.gz at 15Gb, mariadb.tar.gz at 2Gb and homeassistant.tar.gz at 85Mb.
Now, I’ve loaded SQLite and it only shows 1 database at 81Mb; presumably the default homeassistant db. I tried running the queries in the OP but they didn’t work. Anyway this is small fry. Looking at stuff in the History page, nothing is older than the default ‘recorder’ 10 days.
So looking at the Influx add-on page, I can get it to show me graphs of particular units with certain entities selected, but I can’t get it to display anything that might tell me what’s consuming the most space. Maybe it can’t, or maybe I just don’t know how - any help there appreciated.
Also, I already had the phpMyAdmin add-on so I started looking at that with my limited understanding of it. It shows what I’m guessing to be a number of databases, with the biggest, as far as I can tell, being labeled ‘homeassistant’ (I think I named the mariadb that). Browsing into it shows the tables, the largest of which is ‘states’ at 5Gb. Browsing into that shows the rows, but each page of 25 takes over a minute to display! It has 18M rows. Again I have no idea how I can get it to show me the ‘big hitters’. Any ideas?
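
(Maybe pasting something like this into the phpMyAdmin SQL tab would list the biggest per-entity row counts? A sketch only; it assumes the newer schema where entity names live in states_meta.)

SELECT states_meta.entity_id,
       COUNT(*) AS cnt
FROM states
JOIN states_meta ON states.metadata_id = states_meta.metadata_id
GROUP BY states_meta.entity_id
ORDER BY cnt DESC
LIMIT 20;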
Traccar is also listed as a database but it’s only 279Mb. All the rest don’t hold much if anything.
So, I’m really not much further forward in trying to get my head round what includes/excludes I need to put in my Influxdb yaml section.
Any help/guidance would be appreciated. Thanks

In my opinion, you’re wasting your time trying to find the heavy hitters. From day 1 I’ve treated my history as “Keep the things you want with history, discard all the rest”.

For me that is:

  • Lights/Switches for the house.
  • Binary sensors (doors & windows).
  • Energy sensors for the whole house.
  • Temperature & humidity for the house.
  • Device Trackers for people in the house.

All other things are not important. And if you have some that are, just add those. At the end of the day, I have over 3K entities, with less than 200 being tracked in history.

This enables me to have 40 days of history, and LTS for the energy tab. My database sits around 1 gig.

The downside of this route is that you have to manage new things you want to track when you add them. Or there are sometimes situations where you wanted history during an event but don’t have it. That’s only happened a few times, and I simply add the entity to the recorder.
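
In configuration terms that looks roughly like this (a sketch; the 40 days and the include lists are just examples matching what I described above):

recorder:
  purge_keep_days: 40
  include:
    domains:
      - light
      - switch
      - binary_sensor
      - device_tracker
    entity_globs:
      - sensor.*_temperature
      - sensor.*_humidity
      - sensor.*_energy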

How do you guys manage to make the configuration.yaml with those asterisks? If I use something like:

  exclude:
    entities:
      - binary_sensor.nvr_motion*

so that all my motion sensors are excluded (names like binary_sensor.nvr_motion_1, binary_sensor.nvr_motion_2, binary_sensor.nvr_motion_3, binary_sensor.nvr_motion_4).

But… if I use that *, I get a configuration error and my HA doesn’t start correctly anymore. Am I missing something?

I think you have to use the globs thing…

exclude:
  entity_globs:
    - binary_sensor.nvr_motion*
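
i.e. nested under the recorder section it would look roughly like this (a sketch):

recorder:
  exclude:
    entity_globs:
      - binary_sensor.nvr_motion*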
1 Like