InfluxDB - Setup to Compress Data older than 6 Months / 2 Years

I’m glad it works for you; it’s more a workaround for the missing time clause in continuous queries. But it runs really stably if you can set up the cronjob, and it saves 80% of the space.
So until anybody finds a better method …

Thanks for this great tutorial on how to manage the InfluxDB size.
I’m running on Hass OS and it works there as well. Really nice: about 400 MB of data became 15 MB. I’m not fully convinced yet that it is complete, so I haven’t changed the retention of the raw data yet…

Instead of a cron job, I used a shell command that can be called from automations:

  • Installed the community SSH & Web Terminal add-on with protection mode off
  • Configured SSH keys for passwordless login (see the tutorial SSH'ing from a command line sensor or shell command - #75 by arkoko)
  • Created your shell script ‘compress_influxdb.sh’ (which updates InfluxDB) in the HA /config folder
  • Needed to add a sudo before the docker commands in the shell script (related to Hass OS?)
  • Created a shell command in configuration.yaml:
shell_command:
  compress_influxdb: ssh <username>@homeassistant -i /config/.ssh/id_rsa 'sh /config/compress_influxdb.sh'

Really happy with your tutorial!

Looks really good, and it seems to be a reliable way of using it with Hass OS.
When using Docker you normally give your user access to Docker with usermod, something like this:

sudo usermod -aG docker youruser

Without it you have to sudo the commands; I think this is the reason. But it works the same way either way.
Much of the uncompressed data comes from attributes: if your sensor has attributes, they are stored as well. Maybe they can be used by Grafana, but I did not find a way and never used the attributes. If you store a weather sensor like DWD, a lot of attributes are stored every time the sensor changes.
I read in another thread that you can exclude attributes. I now use VictoriaMetrics for long-term storage, but since it uses the InfluxDB integration, it should work the same way:
Add this to your influxdb config entry:

  tags_attributes:
    - friendly_name
    - unit_of_measurement
  ignore_attributes:
    - icon
    - source
    - options
    - editable
    - min
    - max
    - step
    - mode
    - marker_type
    - preset_modes
    - supported_features
    - supported_color_modes
    - effect_list
    - attribution
    - assumed_state
    - state_open
    - state_closed
    - writable
    - stateExtra
    - event
    - friendly_name
    - device_class
    - state_class
    - ip_address
    - device_file
    - unit_of_measurement
    - unitOfMeasure

This should drastically reduce the amount of data written to the autogen policy. When compressing the data, these attributes are deleted anyway and only the mean value is written.
Maybe it helps…

@madface thanks for the extra insights. The attributes not being copied probably explains the factor-30 compression. I also never used them when analysing long-term trends, so that is perfectly fine.

Can you write a tutorial on how to make it work on Hass OS?
I don’t know how to create those retention policies; I don’t have Docker.

Hi @ralong
I’m running on Hass OS, and all the info in this thread is applicable for setting up the retention policies. I only approached the automation slightly differently (so here is an additional tutorial for that part):

  • I added a sudo before docker in influx_query.sh (‘sudo docker …’ instead of ‘docker …’):
#!/bin/sh
sudo docker exec -t addon_a0d7b954_influxdb influx -execute 'SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "homeassistant"."y2_5m".:MEASUREMENT FROM "homeassistant"."autogen"./.*/ WHERE time < now() -26w and time > now() -26w6h GROUP BY time(5m),*' -precision rfc3339  -username '<username>' -password '<password>' -database 'homeassistant'
sudo docker exec -t addon_a0d7b954_influxdb influx -execute 'SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "homeassistant"."inf_15m".:MEASUREMENT FROM "homeassistant"."y2_5m"./.*/ WHERE time < now() -104w and time > now() - 104w6h GROUP BY time(15m),*' -precision rfc3339 -username '<username>' -password '<password>' -database 'homeassistant'
  • I created a shell_command so that influx_query.sh can be called by any automation (instead of using crontab for that):
shell_command:
  compress_influxdb: ssh <username>@homeassistant -i /config/.ssh/id_rsa 'sh /config/influx_query.sh'

Example of my automation that triggers every 30 minutes:

alias: Update InfluxDB
description: ""
trigger:
  - platform: time_pattern
    minutes: /30
condition: []
action:
  - service: shell_command.compress_influxdb
    data: {}
mode: single

Note: the SSH commands only work if you have installed keys for passwordless access (see the tutorial SSH’ing from a command line sensor or shell command - #75 by arkoko).


For the homeassistant.autogen retention policy should it be infinite or a shorter value like 6 months?

For this example it should be set to 6 months and one day. Then for the first 6 months you have all the data you stored; after that it is aggregated into 5-minute intervals, and data older than 2 years is aggregated into 15-minute intervals. Before setting up this retention policy, double-check that everything is working, or you will lose all data older than 6 months!
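For reference, here is a sketch of the InfluxQL statements (run inside the influx CLI) that would set up the retention scheme described in this thread. The policy names y2_5m and inf_15m come from the thread; the exact day counts and the DEFAULT flag are assumptions you should adapt to your own setup:

```sql
-- Keep raw data for roughly 6 months and one day, then let InfluxDB expire it.
-- Only shorten autogen AFTER verifying the downsampled policies are being filled!
ALTER RETENTION POLICY "autogen" ON "homeassistant" DURATION 183d REPLICATION 1 DEFAULT

-- 5-minute means, kept until the data is 2 years old
CREATE RETENTION POLICY "y2_5m" ON "homeassistant" DURATION 104w REPLICATION 1

-- 15-minute means, kept forever
CREATE RETENTION POLICY "inf_15m" ON "homeassistant" DURATION INF REPLICATION 1
```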

If I understood the documentation correctly, your CQs didn’t work because they only run against new data.
The WHERE clause is ignored in CQs, so if you want disjoint RPs, you need to do it manually.
In fact, the best practice is to have your RPs cover the period from now back to “what you want”. The downside is that parts of the period are covered multiple times, but it’s not much data. On the plus side, you can simplify your queries a lot.

Hi Mike

Can I ask you a noob influx DB query question?

Before limiting my raw data retention policy, I intend to automate a backup of my influx database to my PC first.

  • So in Home Assistant I export the Database to the share
  • On the PC I import the database from the share into a temp database tmpdb

So far so good, but I fail doing the last step:

  • only add the new data from tmpdb to the homeassistant DB.

I noticed that overlapping data simply duplicates.
This is the query:
SELECT * INTO "homeassistant"."autogen".:MEASUREMENT FROM "tmpdb"."autogen"./.*/
Same result with this variation (additional GROUP BY):
SELECT * INTO "homeassistant".autogen.:MEASUREMENT FROM "tmpdb".autogen./.*/ GROUP BY *

If I run that query twice, the destination database homeassistant more than doubles in size.

I thought duplicated data points would be overwritten. Maybe you have a suggestion (I’m very unfamiliar with database queries): a different way to perform the query, or some merging/cleanup of duplicate data?

Thanks in advance!

Best Eric

> Every time it runs it will copy all data between 6 months and 6 months and 6 hours old to the retention policy “y2_5m”, grouping it by 5 minutes, and data between 24 months and 24 months and 6 hours old to the policy “inf_15m”, grouping it by 15 minutes.

Hey Mike, why the 6-hour overlap?

@dmartens
I did this for safety: if the script does not run due to maintenance or something like that, you will not lose the data. I looked in the database and did not see double data points for the overlapping ranges, so I thought it would be a good idea.

@erkr
Good question; I think this is even beyond my database knowledge. As I created this a year ago, I looked in the DB after executing it a few times and did not see any double data points.


Since the script ran every hour, there would have to be 5 overlapping points. What I don’t know is whether there aren’t any, or whether they are just not shown.
Can you test with a similar SELECT statement, just to check whether you see double data points? Maybe you have to change the measurement to something you have.
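A check along those lines could look like this (measurement “W” is just an example from this thread; substitute one of your own):

```sql
-- Raw points around the 26-week boundary that the script has processed
-- several times; duplicate timestamps would show up as repeated rows.
SELECT value FROM "homeassistant"."y2_5m"."W" WHERE time > now() - 27w AND time < now() - 26w
```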

EDIT:
I dived a little deeper into it, but I don’t understand it. Per the docs, SELECT ... INTO with the same measurement, timestamp, etc. should replace the values, so it should not make the database grow. I made a test setup, but I see the same problem as you: the database grows by exactly its own size with every new SELECT INTO.
But if you count the data points, they stay the same. Just try:

select count(value) from "W"

The number of data points stays the same regardless of how often you SELECT INTO. So I really don’t know where the problem lies at the moment. I know InfluxDB compacts the database every few days; maybe it shrinks again over time?


Thanks for looking deeper into it. Glad you can confirm my issue. I didn’t have time myself to investigate further. I’ll report back if I get any further.

Oh dear, all that hassle. I am the guy from the initial post, and I can confirm that it works. In my thread I see you try a lot of additional stuff, like starting your continuous queries with “2y_5m-inf_15min” CREATE CONTINUOUS QUERY and so on. You name your queries “2y_5m-inf_15min”, but later you refer to them as “FROM homeassistant.“2y_5m”./.*/ WHERE time < now()”.

Give it another try and take a look at the logs/error messages the InfluxDB query gives back.

In the end it should look like this:

Why not use

CREATE CONTINUOUS QUERY "y2_5m" ON "homeassistant" BEGIN SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "homeassistant"."y2_5m".:MEASUREMENT FROM "homeassistant"."autogen"./.*/ WHERE time < now() -26w and time > now() -26w6h GROUP BY time(5m),* END
CREATE CONTINUOUS QUERY "inf_15m" ON "homeassistant" BEGIN SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "homeassistant"."inf_15m".:MEASUREMENT FROM "homeassistant"."autogen"./.*/ WHERE time < now() -104w GROUP BY time(15m),* END

? No crontab required anymore, AFAIK.

@dingausmwald
I don’t work with InfluxDB anymore (I use VictoriaMetrics), so I can’t double-check this. I updated my first post so that others use your method (and only fall back to the cronjob way if it fails for them like it did for me); I know it is the cleanest way, and my original intention was to use a CQ.

@dMopp
Give it a try; if this works for you, there is no better choice. But unfortunately it didn’t work for me, so I made this workaround.
I gave up when I read that the time clause is ignored in continuous queries (InfluxQL Continuous Queries | InfluxDB OSS 1.8 Documentation).

Hey @madface,
I tried these CQs and can confirm that the WHERE clause is not taken into account. I took a look at VictoriaMetrics, but it seems the downsampling feature is not available (Enterprise version only), so I am curious what benefits you get from VictoriaMetrics over InfluxDB?
I really think downsampling is a must for long-term storage, not only regarding DB size, but also for “cleaning up” the data after a while…
Thanks!


I read a lot about VM before I switched: it should have a smaller resource footprint and be faster, and I am always willing to try something new; that was my motivation for switching.
I agree about the downsampling; my hope is that the team behind VM will at some point bring this feature to the community edition too, or that there will be a way to do it manually from time to time.
But the database footprint of VM is really small. Right now I have 337 time series stored (I have an include-only strategy for long-term data and store only what I think could be useful), and my DB grows about 3 MB a month.

So in 10 years it would be about 400 MB, nothing I would worry about, and who knows what the next hot thing to use will be by then.

(FYI, I switched around the beginning of 2023, and the InfluxDB import was about 20 MB.)


Thanks a lot for your answer! Can I ask how you migrated your data from InfluxDB into VM?

There is a tool from VictoriaMetrics for migrating, called vmctl. You can download it from their GitHub page.
If you have access to the host system (in my case I have a supervised install on Debian), you can migrate all the data with a single command (assuming the InfluxDB and VM add-ons are installed on the machine you run the command on):

./vmctl-prod influx --influx-user homeassistant --influx-password mysupersecretpw --influx-database homeassistant

If you run it from a different machine (for example, you run the add-ons on HAOS and do not have access to the host system), you have to add the flags with the addresses of the databases.
When done, you simply have to change the settings for the InfluxDB integration in your configuration.yaml to the new address (the add-on documentation describes how it has to look).
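For illustration, the changed configuration.yaml entry might look roughly like this; the hostname is hypothetical (take the real one from the add-on documentation), and 8428 is the default VictoriaMetrics port:

```yaml
influxdb:
  api_version: 1
  host: a0d7b954-victoria-metrics   # hypothetical add-on hostname, check the add-on docs
  port: 8428                        # default VictoriaMetrics ingestion port
  database: homeassistant
```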

For a quick test of the database size of InfluxDB vs. VM with your data, you could migrate the data and check the resulting size in VM (I think nothing in your setup would change when doing this).
I use a shell command to record the size of the VM data:

/usr/bin/du -s /usr/share/hassio/share/victoria-metrics-data | cut -f -1 > /usr/share/hassio/homeassistant/victoriametrics_db_size

Again, I think you need access to the host system, or at least the SSH add-on with protection mode disabled, and you have to check what the paths to the VM data are; you should find them in the Home Assistant share folder.
And as always, please make A BACKUP before doing anything with databases.
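If you want that size as a sensor in Home Assistant, a hypothetical command_line sensor reading the file written by the command above could look like this (assuming /usr/share/hassio/homeassistant maps to /config inside HA):

```yaml
sensor:
  - platform: command_line
    name: VictoriaMetrics DB size
    command: "cat /config/victoriametrics_db_size"
    unit_of_measurement: "kB"   # du -s reports 1K blocks by default
    scan_interval: 3600         # refresh once an hour
```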