How to keep your recorder database size under control

I don’t think the statistics table is the problem.
When I execute that query it shows me that I have 46 results.
The highest count is 22888, but indeed I don’t know the space utilization.
But I don’t know where to search further.
Maybe someone can adapt the statistics query so it shows also the space utilization?

When you add this to your customize.yaml

sensor.xxxxxx:
  state_class: none

You can delete them in the statistics dashboard

2 Likes

To get a proper offline copy of your HA database, the following procedure can be used:

Make a full backup of your HA system using Settings - System - Backups - CREATE BACKUP (select full). When finished, click on the created backup and select the three dots - Download backup
In Save as-dialog, select some place to do the analysis.

On Windows, with 7z installed, ‘Extract files…’, creates a directory with several .tar.gz-files
The one of interest is homeassistant.tar.gz.
‘Extract Here’ produces homeassistant.tar and then ‘Extract files…’ creates a directory homeassistant where the subdirectory data has (among others) home-assistant_v2.db, which is the SQLite database.

The analyzer sqlite3_analyzer.exe is available for download from SQLite Download Page
and documented at The sqlite3_analyzer.exe Utility Program

Running sqlite3_analyzer creates a detailed study of the database
sqlite3_analyzer home-assistant_v2.db > analyzis

1 Like

If I try the SQL’s to get the heavy hitters states I often get a error in phpMyAdmin. Possibel timeout because the SQl is running to long on my system.

Error:
Error in Processing Request
Error Text: error (rejected)
It seems that the connection to the server has been lost. Please check your network connectivity and server status.

Everybody know how I could expand the runntime to get a result for the SQL:

SELECT
  COUNT(state_id) AS cnt,
  COUNT(state_id) * 100 / (
    SELECT
      COUNT(state_id)
    FROM
      states
  ) AS cnt_pct,
  SUM(
    LENGTH(state_attributes.shared_attrs)
  ) AS bytes,
  SUM(
    LENGTH(state_attributes.shared_attrs)
  ) * 100 / (
    SELECT
      SUM(
        LENGTH(state_attributes.shared_attrs)
      )
    FROM
      states
      JOIN state_attributes ON states.attributes_id = state_attributes.attributes_id
  ) AS bytes_pct,
  states_meta.entity_id
FROM
  states
LEFT JOIN state_attributes ON states.attributes_id = state_attributes.attributes_id
LEFT JOIN states_meta ON states.metadata_id = states_meta.metadata_id
GROUP BY
  states.metadata_id, states_meta.entity_id
ORDER BY
  cnt DESC;

Thanks

This is the sensor I have created that gives the size of my MariaDB

# Sensor part
# https://www.home-assistant.io/components/sensor.sql/


sql:
  - name: "Database size"
    db_url: !secret recorder_db_url
    query: >
      SELECT table_schema "database", Round(Sum(data_length + index_length) / 1024 / 1024, 1) "value" 
      FROM information_schema.tables 
      WHERE table_schema="homeassistant" 
      GROUP BY table_schema;
    column: "value"
    unit_of_measurement: "MB"

For anyone struggling with database size - I developed an addon which shows you data about database usage. You can read about it here.

By the way, it looks like there is no correct query to check shared attributes size in this topic. I was struggling with it for a long time, and this is my best effort:

select attr2entity.entity_id, sum(length(a.shared_attrs))/1024.0/1024.0 sum from (select distinct state_attributes.attributes_id, states_meta.entity_id from state_attributes, states, states_meta
where state_attributes.attributes_id=states.attributes_id and states_meta.metadata_id=states.metadata_id) attr2entity, state_attributes a where a.attributes_id=attr2entity.attributes_id group by attr2entity.entity_id order by sum desc limit 10
4 Likes

is it possible to purge from recorder by event_type?
I only see purge by entity, or entity global or domain as a Service Purge.

BTW: my top top event_type is call_service… but I don´t know if it’s safe to exclude that for recorder. What do you think ? what would I lose?

1 Like

Yeah thats a heavy query. Do you really need to be calculating percentages?
For example:
Your query:
– Result: 4631 rows returned in 627745ms

vs

SELECT 
  COUNT(state_id) AS cnt, 
   SUM(
    LENGTH(state_attributes.shared_attrs)
  ) AS bytes, 
  states_meta.entity_id 
FROM 
  states 
LEFT JOIN state_attributes ON (states.attributes_id=state_attributes.attributes_id)
LEFT JOIN states_meta ON (states.metadata_id=states_meta.metadata_id)
GROUP BY 
  states.metadata_id 
ORDER BY 
  bytes DESC

– Result: 4631 rows returned in 239724ms

2 Likes

hello.
my database in mariadb is very very big (more thant 60GB).
Now im searching for the entitis thant create more records in the entitys table and excluding the no neccesarys. and ok its slow procedure but works.

but i have a little problem with the state_attributes table.
this table have 6GB of size.
And i dont know how to purge it.
any advice???

thanks

1 Like

david, have you found a solution? i have a similar problem. thanks

hello!.
yes and not.
jejeje

I remove a lot of sensors, to decrease the database (now is 26gb).
I remove manually the table events (because the size was more than 6gb).

after this in developer options in stadistics, i find a lot of problems and when i push solve button, i push delete.

After this I used the purge service of the recorder with the 2 options active (aply filter and repack).

and i dont know what of this things make the magic, but now my state_attributes table size is 2.8GB
is not perfect. but is the 50% of previous size (6gb)

Way back in February of '22 I made a few posts here about the way the statistics table was bloating my database. Now, I know some of you use statistics. If so this isn’t for you. But if you aren’t really interested in keeping those data forever, and you care about keeping your database lean, I have an update.

Some time ago, the statistics table schema was updated to stop using the start field, formatted as a date field, and use the start_ts field, defined as a timestamp format.

So, for example, if you wanted to purge the statistics table, keeping just the past 4 days of data, a new query was required:

DELETE FROM statistics WHERE start_ts < (CAST(strftime('%s', 'now', '-4 days') AS FLOAT));

It took me a while to get around to updating this query, and I found that when I ran it this time, I deleted 183,245 rows. My database shrank from 19,092KB to 3,544KB. That’s a savings of about 81%. Wow. I’m glad I keep my long-term statistics elsewhere.

Just posting here in case anyone else wants to try this.

1 Like

I am not a DB expert.
What I am thinking is:

  1. If “statistics” means “long-term statistics” (LTS) - then these data are supposed to be kept forever. There is also so called “short-term statistics” (5 minutes data) - they are purged like other data after a purge interval.
  2. If some entity is excluded from Recorder but still has LTS in DB - it is possible to delete this old LTS data (old = because the entity is excluded from Recorder, thus LTS are not stored since the moment when it was excluded) from Dev tools - Statistics (this is not correct, see below). Similar - about LTS for removed entities.
  3. It was mentioned several times that LTS do not occupy plenty of space in DB. For me - the only reason to delete LTS from DB is a practical reason “I simply do not need LTS for some entity” (not to mention cases in pt. 2 above). So, if I do not need LTS for some entity - I do not set a corr. state_class for this entity (for template sensors), or I set “state_class: none” via “customize” (for sensors provided by other integrations). As a result - DB contains only needed LTS for particular entities.

I think the SQL-query posted above is very useful:

  1. For educational purpose for non-experts like me.
  2. For a possible practical case: assume you have some sensor wrongly configured, and there is wrong LTS for this sensor in DB; now you re-configured this sensor - and then purged old wrong LTS. But for this case the script should be re-written to delete LTS for a particular entity.

“Supposed” by who? I never asked for these data. I got by fine for a couple of years before the statistics tables were added to the DB. I haven’t found any way to inhibit collecting the data, short of excluding the entities from recording all data, short- or long-term.

Yes, I’ve done that wherever I could. Unless I’m missing something, the “Delete” option only appears for missing entities. It would be great if we could delete all the entries there.

When I ran the DELETE query, above, I recovered 81% of the space my database was consuming. That seems like plenty to me. Again, since I don’t use these data it was 100% wasted space.

You are right.
You can NOT delete LTS for entities excluded from Recorder.
My mistake because I asked in GitHub about this possibility several times - but got no positive answer.

81% of whole DB or only “statistics” table?
Assume your DB was 1GB - then will you get same 81% after purging old statistics?
I heard many times that DB is growing up mainly because of not-LTS data.
Cannot check it myself since do not know SQL.

If you do not need LTS - then you need to set state_class correspondingly.
And yes - a user currently seems to be OBLIGED to have statistics - many integrations add “state_class” for sensors. This is a default behavior. You are not asked “switch on LTS” when adding a config entry.
Probably a global switch “allow LTS” could be added - but Dev team does not think it is needed… And users’ opinion is not always meaningful for them. They may say “LTS data need so little of space, no practical reason to disable storing it”. And in this part they could be right: see above.

I know. I was shocked, too. But it really did shrink that much. In other words, the vast majority of my Recorder database was “wasted” (to me) long-term data.

I maintain a very lean database. I exclude everything I don’t need to keep, and I set my purge days to four. I thought some others following this thread might be interested, but I recognize that many HA users aren’t.

In my defense, I’ve been running HA on an RPi with an SD card. Performance, storage and, most importantly, disk writes are at a premium. This is the hardware which the HA documentation recommended when I first started. So although I understand that developers always want to develop on the latest and greatest hardware, I think I represent a non-trivial portion of the user community.

When I was doing development, I always made it a point to develop on the lowest-performing platform that my users were using. It’s all about respect for your users and pride in your work.

I think my experience suggests maybe they’re not right about this, at least, not in all cases.

As for changing state_class, that’s on my list to research and learn how to do, assuming it doesn’t mess up anything else I need. But frankly it’s easier for me to copy and paste one line of SQL as part of the routine database maintenance I do during my update procedure, so it hasn’t made it to the top of my priority list yet.

Please do not say it. It is absolutely normal to expect a smooth work of SW which is declared to be working on your RPi.
As a former developer for QNX and embedded systems, I absolutely hate a trend of making a SW which requires more and more resources w/o a real need.

Hard to believe, so I repeat my question:
Assume you have 10 days, 1000 entities and a database of 1GB.
What will be an amount of LTS?
If it is 800MB - ok, you are right.

You can estimate LTS size in another way:
Asssume you know a size of one record. There are 24 records a day. So you can estimate a size of LTS for one entity for a day.

But in this case HA will spend resources to process LTS and write it to DB.
So, it is better to change a state_class.

I did play around with state_class a few weeks back as I saw no need to collect useless data. The following works but you do have to remove and reconfigure the device for the change to take. I now add as default in my yaml.

sensor:
  - platform: wifi_signal
    name: "WiFi Signal"
    update_interval: 60s
    state_class: ''
1 Like

Don’t you get a warning for ''? I had to set mine to None.