How to reduce your database size and extend the life of your SD card

Looks good, need to test it. I just remember @tom_l having issues creating a reliable DB size sensor for InfluxDB.

I had the same question as karansoi, even though I've had the InfluxDB and Grafana add-ons installed for a long time. The answer to getting the sensor for MariaDB, at least for me, was at SQL/MariaDB size sensor in MB?

So this is what it looks like in my sensors.yaml file:

  - platform: sql
    db_url: !secret mariadb_url
    queries:
      - name: MariaDB size
        query: 'SELECT table_schema "database", Round(Sum(data_length + index_length) / POWER(1024,2), 1) "value" FROM information_schema.tables WHERE table_schema="homeassistant" GROUP BY table_schema;'
        column: 'value'
        unit_of_measurement: MB
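
If you are on PostgreSQL instead of MariaDB, an analogous sensor can be built on pg_database_size. This is only a sketch; the database name homeassistant and the postgres_url secret are assumptions you would adapt to your own setup:

  - platform: sql
    db_url: !secret postgres_url
    queries:
      - name: PostgreSQL size
        query: "SELECT ROUND(pg_database_size('homeassistant')/1024.0/1024.0, 1) AS value;"
        column: 'value'
        unit_of_measurement: MB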

Hi,
I'm coming from ioBroker because I really like some things about Home Assistant :smiley:, but not everything. My biggest problem is logging. In ioBroker I can set for each entity when its value should be logged, e.g. only every 30 seconds or only when the value changes.

Edit: I see that I can maybe configure MQTT so that the value is not updated so often. I'll test that tomorrow.
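
In case it helps anyone reading along: Zigbee2MQTT has a per-device debounce option that merges rapid successive reports before publishing. A minimal sketch (the IEEE address is a placeholder); this goes into Zigbee2MQTT's own configuration.yaml, not Home Assistant's:

devices:
  '0x00158d0001234567':
    friendly_name: energy_sensor
    debounce: 30   # merge messages that arrive in quick succession (seconds)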

Here is an example:
I have a device (e.g. a computer) which is on between 09:00 and 12:00 and usually consumes exactly 20 watts. For a short time the consumption goes up to 100 watts. The device is connected to a power meter, which constantly reports the current value via Zigbee to HA.

Outside those hours, the recorder writes the value "0" into the database on every update. Between 09:00 and 12:00 it does the same with "20" or occasionally "100". This generates thousands of rows of redundant information. For me it would be enough if the 0 watts were stored once at 00:00, then "20" at 09:00, "100" at 09:20, "20" again at 09:23, "100" at 10:00, …, and "0" at 12:00 until the device is switched on again and consumes more than 0 watts. This information is very important to me, which is why I do not want to turn off logging completely. But if HA wrote to the database only on such value changes, the database would also be significantly smaller.

Since I have a lot of such sensors and would like to log over long periods of time (up to a year), I would end up with several TB of data.

Is there a solution for this at HA? :slightly_smiling_face:

Maybe the problem is also the attribute "last_seen", which changes every time? :thinking:
-->
{"current": 0, "energy": 39.29, "last_seen": "2020-12-27T23:29:49+01:00", "linkquality": 75, "power": 0, "state": "ON", "voltage": 228, "unit_of_measurement": "W", "friendly_name": "energy_sensor", "icon": "mdi:flash"}

{"current": 0, "energy": 39.29, "last_seen": "2020-12-27T23:29:59+01:00", "linkquality": 75, "power": 0, "state": "ON", "voltage": 228, "unit_of_measurement": "W", "friendly_name": "energy_sensor", "icon": "mdi:flash"}

For long-term data I recommend InfluxDB, a time-series database made exactly for this purpose. Coupled with Grafana you can get some nice graphs/dashboards. I wrote a little guide about my config with PostgreSQL for short-term data and InfluxDB for long-term data, integrated into Home Assistant, here.
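
To give a rough idea of what such a split looks like in configuration.yaml (just a sketch, not the exact config from the guide; the host, credentials and the include list are placeholders):

recorder:
  db_url: postgresql://hass:password@localhost/homeassistant   # short-term history
  purge_keep_days: 7

influxdb:                                                      # long-term history
  host: localhost
  database: homeassistant
  include:
    domains:
      - sensor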


Thanks, I will look at your guide today. With ioBroker I also used InfluxDB. However, through HA I became aware of TimescaleDB ( https://community.home-assistant.io/t/home-assistant-add-on-postgresql-timescaledb/198176 ). Since I wanted to switch from SQLite to MariaDB or Postgres anyway, this way I only need one database instead of two. Do you see anything that speaks against TimescaleDB? The arguments for TimescaleDB actually sounded pretty good.

You could use this as well; I'm not familiar with TimescaleDB.
I like my solution because long-term and short-term data are separated and can easily be managed independently. Also, the HA recorder DB often gets messed up, with the only way out being to delete the DB (or knowing enough about databases to fix it yourself), which would be disastrous if everything were stored in one DB.

Well, of course as the author of the TimescaleDB add-on I'm slightly biased, but to answer the question: yes, I have both the recorder and the long-term storage on TimescaleDB and it performs great!
I actually created the add-on because I did not want to have multiple database systems running like SQLite/MariaDB/InfluxDB… So a one-size-fits-all solution was very welcome.

BTW, you have separate databases for the recorder and LTSS. And (to address the concern about corruption): my SQLite DB got corrupted once in a while, but on TimescaleDB I never had any problems.
That said, you should not want to run ANY database with heavy writes on an SD card. It will get corrupted one day due to wear.


Ok, then that’s great!

Luckily I haven't been using an SD card for a long time now :slight_smile:


Yes, I also use an SSD - the switch was a very good decision :slight_smile: The perceived performance gain was significantly higher than between successive Pi generations.

Unfortunately, with Mosquitto MQTT or Zigbee2MQTT I could not reduce the update rate without losing valuable information (the sensors' last_seen).

I'll try LTSS tomorrow. Does TimescaleDB perhaps store the data more compactly? Or is the advantage only in the faster generation of graphs?

I have spent the last few days intensively looking into TimescaleDB and LTSS. I’m almost convinced too, but I still have a few open questions (maybe you can take a look at Pinned: Additional Documentation for Timescaledb Integration · Issue #1 · Expaso/hassos-addons · GitHub).

Since the topic here is about storage space, I would like to briefly summarize my findings of the past days:

  • Out of the box, TimescaleDB (with LTSS) consumes barely less storage than PostgreSQL (with Recorder), at most about 10 to 20 % less.
  • (According to a Google search:) InfluxDB, on the other hand, consumes much less storage (about 4 % of TimescaleDB), and VictoriaMetrics (also a time-series DBMS) consumes about 2 % of the storage.
  • On the other hand, InfluxDB (and VictoriaMetrics) consume much more RAM, which is a scarce resource on the Pi.
  • However, TimescaleDB (since version 1.5) supports compressing data after ingestion at rates of up to 96 %. According to my own tests, it's even over 98 %! (i.e. slightly more than VictoriaMetrics).

Currently I generate about 150 MB of sensor data per day. Since I have not yet integrated all sensors, it will eventually be more than 500 MB, so close to 200 GB per year. That would be far too much. By compressing the data, I should end up with less than 5 GB per year, which would be completely fine.
Regarding performance, the question is of course how long a query over a period of one year will take with compressed data. But from what I have read, this is not a problem.

Briefly what I have done:

  • Changed LTSS so that the chunk interval is only 12 h (< 100 MB) instead of 30 days (see also the note after the results below).
  • SQL query to set up compression:

ALTER TABLE ltss SET (
  timescaledb.compress,
  timescaledb.compress_orderby = 'id, time DESC',
  timescaledb.compress_segmentby = 'entity_id'
);

  • SQL query to manually compress a single chunk for testing purposes:

SELECT compress_chunk('_timescaledb_internal._hyper_1_1_chunk');

  • SQL query to display the compression ratio:

SELECT *
FROM timescaledb_information.compressed_chunk_stats;

Result:

"ltss"  "_timescaledb_internal._hyper_1_1_chunk"  "Compressed"
uncompressed: 54 MB + 53 MB + 8192 bytes = 106 MB total
compressed:   88 kB + 32 kB + 1624 kB    = 1744 kB total

The 106 MB are uncompressed in total and the 1744 KB are compressed in total.

However, I do not recommend adopting my SQL statements haphazardly. It is better to take the time and read up a bit in the Timescale Documentation | TimescaleDB API reference.
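
Two more calls from that API reference that may be useful here (sketches, assuming TimescaleDB 2.x and the ltss hypertable): set_chunk_time_interval changes the interval for newly created chunks without patching LTSS itself, and add_compression_policy compresses older chunks automatically instead of calling compress_chunk by hand:

-- apply a 12 h interval to chunks created from now on
SELECT set_chunk_time_interval('ltss', INTERVAL '12 hours');

-- automatically compress chunks older than 7 days
SELECT add_compression_policy('ltss', INTERVAL '7 days');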

Of course, my sensor data is highly compressible because it is very repetitive. But the bottom line is that TimescaleDB can compete with InfluxDB and other time-series DBMSs, and I don't need to worry further about avoiding unnecessary data. In addition, TimescaleDB should compress even better in future versions.
I could also thin the data out by deleting rows that are less than about 10 seconds apart.


Hi @bingo5,

I have answered your questions in the mentioned issue.

First off, thanks for taking the time to test-drive the TimescaleDB compression and post your findings! Very useful!

To save storage space, you basically have 3 options:

1) Compress the data. You have just proven that it can even beat Influx.

Pros:

  • Very easy
  • Instant space savings.

Cons:

  • Compressed data cannot be updated. Normally not a problem (because Influx cannot update at all!), but should you have the need, you must first decompress the chunk, update it, and then re-compress it again.
  • A little bit of overhead (depending on the type of query; some are even faster!) when querying compressed data.

2) Remove data you don't need, like wiping the attributes column to NULL (rough SQL sketches for options 2 and 3 follow after this list)

Pros:

  • Space saving
  • Easy to do

Cons:

  • Should you need that data in the future, bad luck.

3) Downsample your data, using Continuous Aggregates

Pros:

  • Saves a LOT of space.
  • Will speed up queries over large timespans tremendously.

Cons:

  • No more access to all measurements, only the chosen aggregate values, like min/max/avg/mean within the chosen timeframe (5 minutes/hour/day/month/etc.).
  • You have to think through your data needs upfront, because you are basically replacing a large set of old data with one record containing aggregated values. There is no way to get your original data back if you deleted it to save space.
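
To make options 2 and 3 a bit more concrete, here are rough sketches against the LTSS table (the view name, the time windows and the numeric-state filter are my own assumptions; adapt and test before running anything):

-- Option 2 (sketch): wipe attributes of rows older than 30 days (irreversible,
-- and only works on chunks that are not compressed)
UPDATE ltss SET attributes = '{}' WHERE time < now() - INTERVAL '30 days';

-- Option 3 (sketch): a 5-minute continuous aggregate over numeric states;
-- the crude regex filter keeps non-numeric states out of the cast
CREATE MATERIALIZED VIEW ltss_5min
WITH (timescaledb.continuous) AS
SELECT entity_id,
       time_bucket('5 minutes', time) AS bucket,
       min(state::float) AS min_value,
       avg(state::float) AS avg_value,
       max(state::float) AS max_value
FROM ltss
WHERE state ~ '^-?[0-9.]+$'
GROUP BY entity_id, time_bucket('5 minutes', time);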

Now, you could even combine all 3 methods in a mix-and-match style if you like, depending on your hunger for data.
Maybe start with compression first, and create some continuous aggregates to speed up queries spanning thousands of records.
That way, you have a nice balance between performance and not losing any data.

With that in place, you could start dropping old chunks of data to save space if needed.
Clearing the attributes column is more difficult with compression in place, but also possible (you would have to decompress the affected chunks first).
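
Dropping old chunks is a one-liner, sketched here for the ltss hypertable with TimescaleDB 2.x syntax (this permanently deletes raw data older than the given interval):

SELECT drop_chunks('ltss', INTERVAL '1 year');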

The good news: lots of options, freedom everywhere :slight_smile:

First, thanks again for your support (also on GitHub). I really like the mix of self-learning and your support :slight_smile:

My approach was originally to avoid writing unnecessary data to the database in the first place, by only taking states that have a certain difference from the previous state.

But this is not supported by Recorder/LTSS. Because of the good compression and the other approaches you mentioned (very nice overview!), this is not necessary at all. The good performance of HA/LTSS and PostgreSQL/TimescaleDB on the small Pi 4 is also surprising. I'll see how it performs when more sensors are added. As another idea, I could adapt Zigbee2MQTT to transmit less/rarer sensor information to HA.
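
If you go that route, Zigbee2MQTT's per-device filtered_attributes option can strip attributes you don't need from the published payload. A sketch (placeholder IEEE address; only do this for attributes you can really live without):

devices:
  '0x00158d0001234567':
    friendly_name: energy_sensor
    filtered_attributes:
      - linkquality
      - voltage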

Either way - there are quite a few possibilities and therefore I am sure that HA and TimescaleDB are very future-proof for me --> yes, freedom (and fun) everywhere :wink:

Next I'll add the extra column for numeric states and read up on Continuous Aggregates.


Many thanks for the guide! Very useful, and my DB was reduced by half right away :slight_smile:
I have just one question though:
Can we combine the domains in the exclude section with entity_globs in the include section? When I tried to do it, the entities matching the globs were not included in the recorder. My config example:

recorder:
  purge_keep_days: 14
  db_url: !secret db_url
  exclude:
    domains:
      - sensor
    entities:
      - binary_sensor.shelly1l_84cca8ad7bf2_overheating
  include:
    domains:
      - binary_sensor
    entity_globs:
      - sensor.shenzhen_*_power*
    entities:
      - sensor.shenzhen_neo_electronics_co_ltd_power_plug_12a_energy

Everything works except that the entities matching the listed patterns are not recorded. Am I doing something wrong, or is it not possible the way I want?

Thank you all for this guide. I was running with a 3 GB DB with only 3 days of history. This helped me identify the main culprits that were logging every 15 seconds (do I really need every stat from every container every 15 seconds when I already have InfluxDB doing the same???!).

So now the purge has run and I'm at the end of my 3-day cycle, I've gone from 3 GB to 360 MB. I'll now extend up to 10 days of history.

It’s in Spanish, but I guess you can follow it: some analysis to optimize database size:

https://www.pacienciadigital.com/base-de-datos-sqlite-home-assistant/

Old post, yeah I know but…

My DB size jumped today and I am not sure why.


I’m usually fairly conscientious when it comes to keeping my recorder entities under control.

I'm using SQLite to do some analysis and my states don't seem to be a huge problem. I can't, however, get any data using the events query posted above.


SELECT 
    JSON_EXTRACT(event_data, '$.entity_id') AS entity_id,
    COUNT(*) AS cnt
FROM events
WHERE
    JSON_EXTRACT(event_data, '$.entity_id') IS NOT NULL
GROUP BY
    JSON_EXTRACT(event_data, '$.entity_id')
ORDER BY
    COUNT(*) DESC;

Maybe (probably?) the database schema has changed since this guide was written. Does anyone know what the query should look like?

Thanks.


Yeah, this guide is old and doesn't quite seem to work anymore. I ran into the issue that the notifier script no longer generates output like the guide shows. It is just a mess :frowning:

For MariaDB, I had to update the query to:

SELECT 
    states_meta.entity_id,
    COUNT(*) AS cnt
FROM states JOIN states_meta
ON states.metadata_id = states_meta.metadata_id
GROUP BY
    states_meta.entity_id
ORDER BY
    COUNT(*) DESC;
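
For the events side on the newer schema, a hedged equivalent is to group by event type via the event_types table, since state_changed events no longer carry the entity_id in event_data:

SELECT 
    event_types.event_type,
    COUNT(*) AS cnt
FROM events JOIN event_types
ON events.event_type_id = event_types.event_type_id
GROUP BY
    event_types.event_type
ORDER BY
    COUNT(*) DESC;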

In my case, I was able to find the culprit of my rapid DB size bloat, and it was a media_player device responsible for more than 90 % of the table entries… I excluded those entities, then took a backup, and decided to purge those records from the states table:

DELETE QUICK FROM `states` WHERE metadata_id IN (SELECT metadata_id FROM states_meta WHERE entity_id LIKE 'media_player%')

That works with the default SQLite database, too.
