Why are the Hass.io full snapshots growing in size?

Hi there!

I’ve started a new Hass.io server roughly 6 weeks ago. At the beginning each full snapshot I took was only a few megabytes in size but as the time passes, the full snapshots are larger and larger. This morning I took one and it was 452 MB, then a few hours later a next full snapshot is 463 MB. In this time frame I added a few changes to my configuration.yaml and to my automation.yaml, but nothing major really. So why are the snapshots growing in size? Are they cluttered with some log files? How can I check this and how can I make the snapshots small again?

Thanks!

Probably the db growing in size?

But what the hack can make the DB so large and how can I trim it back?
Is every value of every sensor being recorded all the time and kept? If so, can an expiration be specified or something?

Look at the documentation for the recorder :

recorder:
  purge_keep_days: 2
  exclude:
    domains:
      - persistent_notification
      - automation
#      - weblink
      - updater
    entities:
      - sun.sun # Don't record sun data    
  

The history of you devices changing state, events etc. Check the size of the file
home-assistant_v2.db in your config directory.

Check out the documentation for the recorder, there it is explained how to exclude entities, purge interval etc. You can also have it in a separate db and exclude it from your snapshots.

Yepp, that’s it! the home-assistant_v2.db is huge.

I’ll check the docs, I hope I can make it work like I had in my old home automation system (Domoticz), which kept the values of sensors in detail for a day, it kept less details for previous days, even less details for previous months, but still enough to be able to show nice daily/monthly/yearly graphs for the sensors (and not eat up the disk space)

Thank you for your help!

Well, I read the recorder docs… If I understand correctly, recorded values are purged by default after 1 day (purge_interval) and then discarded by default after another 10 days (purge_keep_days). This worries me on multiple levels. First, I would have hoped that sensor values are kept forever to be able to see graphs for years back. If values are thrown out daily, then there’s no hope for that. Second, if my DB is 4.5 GiB in size with just 1+10 days of data, then something is really “wrong”. What the hack produces so much data? Is there any way to find out what is taking up the space exactly?

I have the answer!

To find out what is producing so much data, you should install the SQLite Web add-on, which allows you to see the data in your SQLite db belonging to Home Assistant. In case you’re not using the default SQLite db, but another one, find an add-on for visualizing that.

  1. Start the SQLite Web add-on and open it’s web UI.
  2. Click the “states” table and go to the “Query” tab.
  3. Execute the following query:
    SELECT "entity_id", COUNT("state") as "StateCount" FROM states GROUP BY "entity_id" ORDER BY "StateCount" DESC
    You might have to wait for a while for the results to show up, depending on the size of your DB.

The query will return a list of your entitites with the count of their states, sorted by the number of states
of each entity in descending order. This will give you some clarity over which entities are generating the most states.

Here’s what I got:

To get the counts of states per entities only after a given moment in time:

SELECT "entity_id", COUNT("state") as "StateCount" FROM states WHERE "last_updated" >= '2019-12-20 08:30:00.000000' GROUP BY "entity_id" ORDER BY "StateCount" DESC

(in this example we’re querying states newer than 2019-12-20 08:30)

How you can act based on the obtained information to reduce the number of states is a different story and I will write about that in a separate post.

2 Likes

In my case particularly clearly the most states are produced by my Z-Wave devices. The question is why they are producing this ton of data. Might be because of the polling that I had to set up in order to get timely updates from the power sensors. Perhaps I can configure the polling to only work for some of the Z-Wave devices. Or I might have to exclude them from history (See [https://www.home-assistant.io/integrations/recorder/](the recorder documentation)).

But there’s still something that I don’t quite understand. I have roughly 30 entities which have over 1.000 states (between 1.000 and 71.000), the rest is insignificant. That’s about 30 * 36.000 = 1.080.000 states. So I have just a little over one million states stored. How does that produce a DB which is 2.2 GiB in size? 2.2 GiB dvided by 1 million is roughly 2.200 bytes per state. Is really Home Assistant using 2200 bytes to store a single state of an entity?!

Did you ever find the answer to this question (i.e., what specifically is causing the large data storage)? I’d be interested to know.

BTW, many many thanks for your earlier postings on this (in this thread) – it’s been really helpful!

Nope, unfortunately I never figured it out. I just eliminated the sensors generating many states from the history using the recorder settings. But to this day I don’t understand why the DB can grow so large.

Bummer. I’m just finding my sea legs with HA, and I’m starting to grapple with the very same issues you raised in your OP. I wonder if there is some garbage collection that isn’t happening (but should be)? Also, what the size (more or less) of your db after you eliminated the offending sensors?

You can set the number of days to keep the history. I reduced that to 7 days, I think. My DB is a few hundred megs now. My personal opinion is that this may just be one of HA’s areas which are not so well designed and thought through. But I may be wrong. I would definitely like to keep data of hundreds of sensors for at least a year and still have a reasonable DB size. But I couldn’t figure out if that can be achieved today. From what I’ve seen, it cannot.

The simple solution is to not include the DB in your snapshots. My DB is huge as I keep a month of data, but my snapshots aren’t too bad since the DB is excluded.

DB size in itself is a problem regardless of the snaphots becausr it wears out the SD card, especially smaller ones.

I run an SSD

How do you exclude DB in snapshots?

You can’t if you use the sqllitedb. But if you have the mariadb addon installed (which it looks like you do) and use that for recorder then just uncheck the box for mariadb

1 Like

Just my two cents…

Wasn’t Grafana / Influx ment more for ‘long’ term storage, while sql takes care of ‘short’ term data ??

Not completely true.

If you put the DB in a select folder you can exclude that folder from the Snapshot using the Google Drive backup add-on.