How to keep your recorder database size under control

Thanks for your response!

I pretty much do the very same thing as you, but my problem is how to keep the simple on/off state in the database. If you exclude switch.xyz, you lose the state (on/off) - don't you?

As I wrote: “I want to keep the state (on/off) of the plugs in my database”

You could probably create a binary_sensor for that as well, but how is the question? :wink:

Hmmm. I haven’t actually done that. But according to this documentation it’s possible:

Look at the section “Switch as a Sensor.” I think that’s the closest to what you want.


Thanks @CaptTom!

binary_sensor:
  - platform: template
    sensors:
      m_heat1:
        device_class: plug
        value_template: "{{ is_state('switch.m_heat1', 'on') }}"

It works :wink:


Many people have used InfluxDB for long-term storage of many values from many sensors. So if you really care about the history but don't want to bloat the SQLite database, you can look into that. Or you can replace the SQLite database with a robust MariaDB or Postgres instance (probably running on another, more powerful machine with more reliable storage).
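
For reference, pointing the recorder at an external MariaDB server is a single db_url option. This is just a sketch; the host, credentials and database name below are placeholders, not values from this thread:

recorder:
  # Placeholder credentials, host and database name - replace with your own.
  db_url: mysql://homeassistant:YOUR_PASSWORD@192.168.1.10/homeassistant?charset=utf8mb4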

Thanks for your effort to educate us!

I have a couple of observations which make me a bit confused. I first excluded the database size sensor (in two quite similar setups: HAOS on a NUC). Then I added it again with a half-hour scan interval.

  - platform: filesize
    file_paths:
      - /config/home-assistant_v2.db
    scan_interval: 1800

In one system I see "stairs" of a couple of hundred KB every 1800 seconds.


In the other one I see just a "flat" line for the size.

So two questions:

  1. Why do I see these sensor updates for only part of the day in one of the systems?

  2. Overall I would expect the size of the database to change a bit all day long - yet the other one has been sitting at 51.54 MB for around two days. And at least after every purge I should see a smaller database - I don't see that either.

And my recorder configuration is:

recorder:
  purge_keep_days: 3
  commit_interval: 2

TL;DR (my guess): the file size only shrinks when you run a purge with repack: true. So it seems normal that it kept growing until 04:12 AM and then stayed mostly stable for the following days.


It seems the file size slowly increased throughout the day up to a maximum of 43 MB, and then at around 4 AM it stopped growing. Meanwhile, on the second graph it is stable and never shrinks.

I’m not 100% sure, but I think I know what happens. The recorder integration auto-purges the database at 04:12. This explains why the database stopped growing around that time.

Additionally, I think this "auto-purge" is a simple DELETE FROM statement that removes rows from the table. I also think database systems (such as SQLite) rarely shrink their file sizes unless explicitly told to do so. Let's imagine how the rows are stored internally by the database. It certainly has some index structures pointing to byte offsets where the actual user data is stored. If someone deletes a few rows, the system will remove those from the index, and possibly write zeroes over the old data. However, for performance, the system will not move the remaining rows (as that would mean rewriting a lot of data and a lot of metadata); instead it will just leave the deleted space as "empty" and available for later reuse. Getting rid of this empty space requires rewriting many rows and indexes (essentially as much work as re-populating the entire table from scratch).
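
To illustrate the idea with SQLite (this is only an illustration, not the recorder's actual internal queries, and the column name assumes the current schema's epoch-based last_updated_ts):

-- Deleting rows only marks their pages as free inside the .db file:
DELETE FROM states WHERE last_updated_ts < 1700000000;  -- some cutoff timestamp
-- The file itself only shrinks when the whole database is rewritten:
VACUUM;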

Thus, in order for the database file size to shrink, a slower clean-up procedure has to be executed. Since this can cause slowdowns (especially for larger databases) and a lot of I/O, it is not executed automatically. You can still ask for it using repack: true:

When using SQLite or PostgreSQL this will rewrite the entire database. When using MySQL or MariaDB it will optimize or recreate the events and states tables. This is a heavy operation that can cause slowdowns and increased disk space usage while it runs. Only supported by SQLite, PostgreSQL, MySQL and MariaDB.
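
For reference, a manual purge-with-repack service call might look like the sketch below (run it from Developer Tools → Services or a script; keep_days is just an example value):

service: recorder.purge
data:
  keep_days: 3     # example value - use whatever matches your setup
  repack: true     # rewrite the database so the file actually shrinks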

If my understanding is correct, then I'm curious to know how the graph looked on the following days. I assume the file size was growing on the first day (the "stairs" you mentioned), and then stabilized around that size for the next days (possibly with occasional growth near the end of each day).


Thanks again!

Actually, the graph had those "stairs" for a couple of days until it stabilized completely. So what you explained makes sense!

Hi there,

first of all, sorry for creating a new reply to this old topic…
I have changed my recorder to a MariaDB running on a different server than HomeAssistant.
Now, after a few days, the DB has grown to a bit over 1 GB of data. This shouldn't be an issue for MariaDB yet, but honestly speaking, I don't need much history for most of my sensors and integrations.

Sure, I could use "purge" to clean up the database, but since the Energy Dashboard uses the database for its statistics as well, I am not sure whether purge will have an impact on those statistics, too.

I want to keep as much data as possible for the energy dashboard (long-term statistics)…
Is there a way to specify which statistics can be deleted and which cannot?
For example, there are many sensors in the database (in the statistics and statistics_short_term tables) which I don't really need there…

So in short, I do want to optimize the database / recorder options, but I am a bit unsure what is required and what isn't (mostly in terms of the energy dashboard and just a few other sensors).

I think the first post in this thread explains it pretty well. IMHO the first five or so should be part of the HA “Getting Started” documentation. And if you don’t want to mess with SQL, there’s always the “History” page which can give you an idea of which entities are spamming the database unnecessarily and should be excluded.

Assuming you’ve read those and excluded the things you don’t want to keep, that’s about all HA can do for you. It’s downright embarrassing that a great system like HA has such a poorly designed database, and no way to specify retention times for individual entities. You either keep them all for the full keep_days or exclude them. Obviously this is a pretty blunt instrument.

I did post a Feature Request requesting retention periods by entity. Feel free to up-vote if you agree.


@CaptTom’s response is pretty accurate. In summary, @CChris, you have a few options:

  1. Set a short purge_keep_days that applies to all entities. (This may impact the history of the other entities you want to record for a long time.)
  2. Leave purge_keep_days long enough and accept your database will become very large. (Shouldn’t be an issue, except for the feeling that you are wasting storage space that could be used for something else.)
  3. Leave purge_keep_days long enough and filter other entities, so those won’t have any history.
  4. Leave purge_keep_days long enough and add a cronjob or a script somewhere that will periodically run a DELETE FROM … WHERE … SQL query on your database, cleaning up older rows for entities you don't care about as much. (If you do this, please post your query; a rough sketch is shown after this list.)
  5. Keep the recorder database small, and use a second storage for long-term history, such as InfluxDB, Graphite, Prometheus. (Those solutions are optimized for time-series data with lots of writes, but setting them up is more complicated than just a single database.)
  6. Hope that CaptTom's feature request will eventually be implemented. (Or help implement it yourself.)
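
The sketch mentioned in option 4 could look something like this. It assumes the older recorder schema, where the states table still has entity_id and last_updated columns (newer schemas key rows by metadata_id and keep entity_id in states_meta), and the entity name is just a placeholder:

-- MariaDB/MySQL sketch: drop rows older than 30 days for one noisy entity.
DELETE FROM states
WHERE entity_id = 'sensor.some_noisy_sensor'
  AND last_updated < NOW() - INTERVAL 30 DAY;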

All of these solutions have some drawbacks, and unfortunately there is no simple solution that would work for most people. :slightly_frowning_face: I'm currently using solution 1; I can't do solution 2; I thought about solution 5, but that means more maintenance and more moving parts that can fail. I'm willing to adopt solution 4, but so far I haven't invested time in it (also because long-term history is currently just a "nice-to-have" feature for me).


all right, thanks @denilsonsa and @CaptTom.
so my thoughts are correct that purge will also remove the statistics for the energy dashboard… :frowning:

then I will have to consider some additional methods - aside from Home Assistant - to do the job…
I am already using InfluxDB for some measurements, but I want to keep the states in the energy dashboard, which isn't possible with InfluxDB as far as I know…

I use a sort of hybrid solution, starting with @denilsonsa’s Option 1.

I’ve excluded everything I don’t care about (biggest impact on DB size) and set purge_keep_days to 7. I could probably go to 4 or 5 without really missing anything.

But I also record some long-term data to flat text files using notify: / platform: file. I log every start/stop of my heating system’s boiler (helps me estimate fuel used) and I log daily and monthly run-time data for the various heating and cooling zones.
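
As a sketch of that setup (the notifier name, file path and boiler entity below are hypothetical, not taken from the posts above):

notify:
  - platform: file
    name: boiler_log
    filename: /config/boiler_log.txt
    timestamp: true                      # prefix every line with a timestamp

automation:
  - alias: "Log boiler start/stop to flat file"
    trigger:
      - platform: state
        entity_id: binary_sensor.boiler  # hypothetical boiler run sensor
        from: "off"
        to: "on"
      - platform: state
        entity_id: binary_sensor.boiler
        from: "on"
        to: "off"
    action:
      - service: notify.boiler_log
        data:
          message: "Boiler {{ trigger.to_state.state }}"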

I figure that I’m going to want to analyze and manipulate the data in a bunch of different ways that a database (especially, a poorly-designed one) just isn’t ideal for.

thanks a lot :slight_smile:
I will probably keep the data I want within the database.

Having it in my MariaDB would also allow me to access the data with anything other than Home Assistant…
But first, I need to decide what data I really need and what can be deleted… this will probably take some time :smiley:

Basically, I just wanted to know if purge has an effect on the statistics for the energy dashboard (which isn't really mentioned in the documentation -> it probably should be [?]).

This will also allow me to define the exclusion list more specifically, since I know which sensors I have assigned to the dashboard, etc.

Right. This should be a step in the “Getting Started” section of the HA documentation!

Here’s a tip: Exclude any entities which have a bunch of attributes. Use a template to pull out just the one(s) you need, and let Recorder save just those to the database.
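
A sketch of that tip, assuming a hypothetical attribute-heavy weather.home entity (adjust the entity and attribute names to your setup):

# Pull out just the attribute you care about into its own lightweight entity...
template:
  - sensor:
      - name: "Outdoor temperature"
        unit_of_measurement: "°C"
        state: "{{ state_attr('weather.home', 'temperature') }}"

# ...and stop recording the attribute-heavy source entity.
recorder:
  exclude:
    entities:
      - weather.home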

Another tip: don't spend too much time trying to exclude everything you don't need. Focus on the heavy hitters which update often or with a lot of data. You can query the database and use COUNT to find the worst offenders, or even use the History page. Then decide which of them to keep and which to eliminate.
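
For example, a count-per-entity query might look like the sketch below. The join is only needed on newer database schemas (2023.4+), where the states table no longer stores entity_id directly:

-- Newer schemas: entity_id lives in states_meta.
SELECT m.entity_id, COUNT(*) AS row_count
FROM states s
JOIN states_meta m ON s.metadata_id = m.metadata_id
GROUP BY m.entity_id
ORDER BY row_count DESC
LIMIT 20;

-- Older schemas: states still has an entity_id column.
-- SELECT entity_id, COUNT(*) AS row_count FROM states
-- GROUP BY entity_id ORDER BY row_count DESC LIMIT 20;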


Wow, great info!

But does it affect the energy database? I want to keep that info for at least a year.
I'm scared to use purge and lose the full energy database.
I excluded a lot of things; I just want to keep all the info from the energy dashboard.

I'm new here, btw :slight_smile:

CChris asked almost the same thing. Check that post in this thread and the few replies to it. In summary, there is no simple and easy solution.

Additionally, as my personal opinion, if you want long-term storage you should probably use a storage system optimized for large amounts of time-based data. Such systems have predictable storage size, and they can reduce the granularity of the data over time (e.g. you probably don't need sub-minute resolution for data from one year ago; a number for every minute, or every 5 minutes, or every half hour, is usually enough). They also won't save all the extra attributes that the states table has. All of this makes for a highly efficient solution (for reading, for writing, and in total size). The downside is having to configure it yourself. (Well, you can use a HAOS add-on, but that will store the data on the SD card (assuming you're running on a Raspberry Pi), which will generate a lot of writes and can make the card fail sooner than you might expect.)


I did some research

Actually, there is a way to host InfluxDB on another PC and use that for long-term storage; you can then save it wherever you want, even on remote storage, and the same goes for Grafana.

Right now I have changed the recorder to record only what I want: I use include only, and this saves me a lot of space.

It was already 700 MB after 9 weeks while using exclude… include is better, and you know exactly what you need to record to see long-term data without using InfluxDB and Grafana, etc.
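
An include-only recorder setup is basically a short allow-list. Here is a sketch; the domains and entity names below are hypothetical:

recorder:
  purge_keep_days: 7                     # example value
  include:
    domains:
      - light
      - switch
    entities:
      - sensor.energy_consumption        # hypothetical entity names
      - binary_sensor.front_door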

Thanks very much for this great write-up. I've been using HA for a while now and have noticed the DB grows to the order of several GB, but I have never known why. It slows my Pi 4 down to a crawl, and I have things like the lights coming on after I've already entered and left the room :slight_smile:
Now I can fix it, so thanks very much.

Thanks for the good guide; I followed it except for digging through the database with SQL.
But I still have problems understanding a few things.

This is my recorder.yaml, followed by an exclude part:

purge_keep_days: 3
commit_interval: 10
exclude:
.......

According to the docs, by default there is an auto-purge each night.
When I look at the history of the database, it grows only each night, by a couple of MB, with no change at all during the day:

This does not make sense to me, as there is hardly anything happening each night between 2 and 4 AM, and there's also not much in the log. On the other hand, during the day a lot is of course happening, which is also saved in the history.
Does the auto-purge only delete content while keeping the database file the same size, so that during the day the empty part is simply filled back up, and only at night, shortly before the next purge, does it happen to grow slightly?
The only way I have ever seen the database shrink so far was by manually executing the purge service with the repack option, but nowhere did I read that the database generally only shrinks with a repack.


I've been seeing this growth every night in the wee hours, too. This is new behavior from the past release or two.
I tried running a manual purge and repack, but that doesn't change the size or stop it from growing a bit each night. It hasn't become unmanageable yet, but it's got my attention.
I tried running a manual purge and re-pack, but that doesn’t change the size or stop it from growing a bit each night. It hasn’t been unmanageable yet, but it’s got my attention.