InfluxDB - Setup to Compress Data older than 6 Months / 2 Years

Inspired by this thread https://community.home-assistant.io/t/influxdb-1-8-setup-one-continuous-query-for-whole-database/292242/4 , where I found the idea really good, but it did not work for me.

EDIT:
The author of the above thread also replied and said it's working. I cannot double check this since I don't use InfluxDB anymore. To avoid all the hassle with cronjobs or anything else, you might want to try that first. If it works for you, it is the cleanest way and the way it should be done: just 2 continuous queries and Influx does everything for you. If it had worked for me I would have done it this way as well, but I could not find my mistake.
If it also doesn’t work for you, read on :wink:

The problem seems to be that the continuous queries mentioned there cannot have a WHERE time clause, which makes them useless for this case (see InfluxQL Continuous Queries | InfluxDB OSS v1 Documentation, from my research).

So I decided to find another way. Maybe it helps someone who has the same goal.

I want to do the following. Data in InfluxDB older than 6 months but newer than 2 years should be compressed to one entry per 5 minutes. Data older than 2 years is compressed to one entry per 15 minutes. Data younger than 6 months is the raw data you save in your InfluxDB.

BEFORE DOING ANYTHING IN YOUR DATABASE - MAKE A BACKUP. I’M NOT RESPONSIBLE FOR ANY DATA LOSS.

Thread for Backing up the Database:
https://community.home-assistant.io/t/howto-migrate-influxdb-data-to-another-server-on-hassio/164405

So let's start. The database I used in InfluxDB is homeassistant. I am running Home Assistant Supervised on a NUC on Debian Bullseye; this tutorial is written for that kind of installation. It should be nearly the same on a Raspberry Pi. In some places you have to insert your own username and password.

I assume you already have a database and want it to be converted to the above.

Log in to your Linux machine:


ssh your_ip_of_your_machine

Log in to your InfluxDB container:


docker exec -it addon_a0d7b954_influxdb influx -precision rfc3339 -username 'influxdb_username' -password 'influxdb_password' -database 'homeassistant'
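Once you are in the influx shell, you can optionally confirm that the homeassistant database exists before touching anything:


SHOW DATABASES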

Here create 2 retention policies:


CREATE RETENTION POLICY "y2_5m" ON "homeassistant" DURATION 104w2d REPLICATION 1

CREATE RETENTION POLICY "inf_15m" ON "homeassistant" DURATION INF REPLICATION 1
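To double check that both policies were created with the durations you expect, you can list them (an optional sanity check):


SHOW RETENTION POLICIES ON "homeassistant"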

Now copy and group the data from your database:

This copies all database entries older than 6 months but newer than 24 months to the retention policy “y2_5m”, grouping them to one entry per 5 minutes. Data older than 24 months is grouped by 15 minutes into the policy “inf_15m”.


SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "homeassistant"."y2_5m".:MEASUREMENT FROM "homeassistant"."autogen"./.*/ WHERE time < now() -26w and time > now() -104w GROUP BY time(5m),*

SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "homeassistant"."inf_15m".:MEASUREMENT FROM "homeassistant"."autogen"./.*/ WHERE time < now() -104w GROUP BY time(15m),*
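To verify that the backfill actually wrote something, you can count the points in the new policies (replace "W" with one of your own measurements):


SELECT count(value) FROM "homeassistant"."y2_5m"."W"

SELECT count(value) FROM "homeassistant"."inf_15m"."W"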

Since the continuous query didn't work, at least for me, I am doing this with a shell script.

Create the script:


nano influx_query.sh

Put the following content into it:

Every time it runs, it will copy all data between 6 months and 6 months + 6 hours old to the retention policy “y2_5m”, again grouping it by 5 minutes, and all data between 24 months and 24 months + 6 hours old to the policy “inf_15m”, grouping it by 15 minutes.


#!/bin/sh

docker exec -t addon_a0d7b954_influxdb influx -execute 'SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "homeassistant"."y2_5m".:MEASUREMENT FROM "homeassistant"."autogen"./.*/ WHERE time < now() -26w and time > now() -26w6h GROUP BY time(5m),*' -precision rfc3339 -username 'your_username' -password 'your_password' -database 'homeassistant'

docker exec -t addon_a0d7b954_influxdb influx -execute 'SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "homeassistant"."inf_15m".:MEASUREMENT FROM "homeassistant"."y2_5m"./.*/ WHERE time < now() -104w and time > now() - 104w6h GROUP BY time(15m),*' -precision rfc3339 -username 'your_username' -password 'your_password' -database 'homeassistant'
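Save the script and make it executable, otherwise cron cannot run it directly (this assumes you created it in your home directory, as in the cron line below):


chmod +x /home/yourusername/influx_query.sh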

To execute this once per hour, for example, use a cronjob:

run:


crontab -e

and in the last line insert:


0 * * * * /home/yourusername/influx_query.sh >/dev/null 2>&1

If you want a logfile, replace '>/dev/null 2>&1' with '>> /home/yourusername/log_influx_query.txt 2>&1'.
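
With logging enabled the line would look like this (same hourly schedule, just appending the output to a logfile in your home directory):


0 * * * * /home/yourusername/influx_query.sh >> /home/yourusername/log_influx_query.txt 2>&1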

If you don't have access to the host, Eric wrote an addendum on how to set this up without crontab (see post 14).

When you have finished this, you should have a retention policy “y2_5m” which holds the data from 6-24 months with an aggregation of 5 minutes, and a policy “inf_15m” which holds the data older than 24 months (up to infinity) grouped by 15 minutes. The raw data stays in your autogen policy, which is not altered at this point.

This example works for me and hopefully for you. I have the time ranges overlapping a little bit so as not to lose data if the machine is down for a short time.

When you have double checked that everything is working fine, you can set the retention policy for autogen to 6 months and 1 day or something like that, so the raw data is deleted after that period.
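
If you want to try that, the command would look something like this (a sketch: 26w1d is roughly 6 months and one day, and the DEFAULT keyword keeps autogen as the default policy Home Assistant writes to):


ALTER RETENTION POLICY "autogen" ON "homeassistant" DURATION 26w1d REPLICATION 1 DEFAULT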


Wow, I appreciate the write up and referral from the other post. This is exactly what I need with my 20 GB database…

Mainly filled with 3s interval power/solar information. That perfectly fits into the retention schedule you have provided.

But, and it is a big but… I am no DBA at all and have no experience at all with messing with my influxdb at this level. Yes I can query it, delete records with InfluxDB Studio and do some basic cleanup. But making changes in the docker container with cronjobs :roll_eyes:.

Is there any way to achieve the same by using a GUI :sweat_smile:

Great idea,

Still trying to wrap my head around this and how I could get this done in my VirtualBox install…

@madface Which version of InfluxDB are you using?

I'm using the official add-on in version 4.5 (I think this is the current one), don't know which InfluxDB version is behind it…

@madface How can I use it with Home Assistant Operating System?

Since I use a Debian install and then installed the HA container version, I don't know exactly. You would need to SSH into the Debian system HA is running on. If you use the SSH add-on this does not work, since you get into a container but not the host.
There are ways to get to the host, but this is untested for me:

If you get this to work you can create the cronjob which does the database actions according to my tutorial.

This worked really well. It was a bit to get my head around, and my cron job is playing up, but I will work that out.

The 3 different colours are the 6-month, 2-year and INF ones; I've only had the plug a little over 2 years.

DB down from 7.5 GB to 1.8 GB

Thanks buddy

I'm glad it works for you. It's more a workaround for the missing time clause in the continuous queries, but it runs really stably if you can set up the cronjob, and it's saving 80% of the space.
So until anybody finds a better method …

Thanks for this great tutorial on how to manage the InfluxDB size.
I'm running on Hass OS and it works there as well. Really nice. Ca. 400 MB of data became 15 MB. Not really convinced yet that it is complete, so I haven't changed the retention of the raw data yet…

I mainly used a shell command that can be called from automations instead of a cron job:

  • Installed the community SSH & Web Addon running with protection mode off
  • configured SSH with keys for logon without passwords (see tutorial SSH'ing from a command line sensor or shell command - #75 by arkoko)
  • created your shell script ‘compress_influxdb.sh’ to update influx in the HA /config folder.
  • Needed to add a sudo before the docker commands in the shell script (related to Hass OS?)
  • Created a shell command in configuration.yaml:
shell_command:
  compress_influxdb: ssh <username>@homeassistant -i /config/.ssh/id_rsa 'sh /config/compress_influxdb.sh'

Really happy with your tutorial :slight_smile:

Looks really good; seems to be a reliable way to use it with Hass OS.
When using Docker you normally give your user the rights to Docker with usermod, with something like this:

sudo usermod -aG docker youruser

Without it you have to sudo the commands; I think this is the reason. But then it works the same way.
Much of the uncompressed data comes from the attributes; if your sensor has attributes, they are stored as well. Maybe they can be used by Grafana, but I did not find a way and never used the attributes. If you store a weather sensor like DWD, a lot of attributes get stored every time the sensor changes.
I read in another thread that you can exclude attributes. I now use VictoriaMetrics for the long-term storage, but since it uses the Influx integration it should work the same way:
Add this to your influxdb config entry:

  tags_attributes:
    - friendly_name
    - unit_of_measurement
  ignore_attributes:
    - icon
    - source
    - options
    - editable
    - min
    - max
    - step
    - mode
    - marker_type
    - preset_modes
    - supported_features
    - supported_color_modes
    - effect_list
    - attribution
    - assumed_state
    - state_open
    - state_closed
    - writable
    - stateExtra
    - event
    - friendly_name
    - device_class
    - state_class
    - ip_address
    - device_file
    - unit_of_measurement
    - unitOfMeasure

This should greatly reduce the amount of data written to the autogen policy. When compressing the data, these attributes are dropped anyway and only the mean value is written.
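For reference, these options go under the influxdb: entry in your configuration.yaml. A minimal sketch (host, port and credentials are just placeholders here and will differ for your setup, and the ignore list is shortened):

influxdb:
  host: a0d7b954-influxdb
  port: 8086
  database: homeassistant
  username: your_username
  password: your_password
  tags_attributes:
    - friendly_name
  ignore_attributes:
    - icon
    - device_class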
Maybe it helps…

@madface thanks for the extra insights. The attributes not being copied probably explains the factor 30 compression. I also never used them in analysing long term trends, so that is perfectly fine.

Can you write a tutorial on how to make it work on Hass OS?
I don't know how to create those retention policies; I don't have Docker.

Hi @ralong
I'm running on Hass OS and all the info in this thread is applicable to set up the retention policies. I only approached the automation slightly differently (so an additional tutorial for that part):

  • I added a sudo before docker in influx_query.sh (‘sudo docker …’ instead of ‘docker …’):
#!/bin/sh
sudo docker exec -t addon_a0d7b954_influxdb influx -execute 'SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "homeassistant"."y2_5m".:MEASUREMENT FROM "homeassistant"."autogen"./.*/ WHERE time < now() -26w and time > now() -26w6h GROUP BY time(5m),*' -precision rfc3339  -username '<username>' -password '<password>' -database 'homeassistant'
sudo docker exec -t addon_a0d7b954_influxdb influx -execute 'SELECT mean(value) AS value,mean(crit) AS crit,mean(warn) AS warn INTO "homeassistant"."inf_15m".:MEASUREMENT FROM "homeassistant"."y2_5m"./.*/ WHERE time < now() -104w and time > now() - 104w6h GROUP BY time(15m),*' -precision rfc3339 -username '<username>' -password '<password>' -database 'homeassistant'
  • I created a shell_command so that influx_query.sh can be called from any automation (instead of using crontab for that):
shell_command:
  compress_influxdb: ssh <username>@homeassistant -i /config/.ssh/id_rsa 'sh /config/influx_query.sh'

Example of my automation that triggers every 30 minutes:

alias: Update InfluxDB
description: ""
trigger:
  - platform: time_pattern
    minutes: /30
condition: []
action:
  - service: shell_command.compress_influxdb
    data: {}
mode: single

Note: the SSH commands only work if you have installed keys for passwordless access (see tutorial SSH'ing from a command line sensor or shell command - #75 by arkoko).


For the homeassistant.autogen retention policy should it be infinite or a shorter value like 6 months?

For this example it should be set to 6 months and one day. So for the first 6 months you have all the data you stored, after that it is aggregated to one entry per 5 minutes, and for data older than 2 years it is aggregated to one entry per 15 minutes. Before setting up this retention policy, double check that everything is working, or you will lose all data older than 6 months!

If I understood the documentation correctly, your CQs didn't work because they only run against new data.
The WHERE clause is ignored in CQs, so if you want disjoint RPs, you need to do it manually.
In fact, the best practice is to have your RPs cover the period from now back to “what you want”. The downside is that parts of the period are covered multiple times, but it's not much data. On the plus side, you can simplify your queries a lot.

Hi Mike

Can I ask you a noob influx DB query question?

Before limiting my raw data retention policy, I intend to automate a backup of my influx database to my PC first.

  • So in Home Assistant I export the Database to the share
  • On the PC I import the database from the share into a temp database tmpdb

So far so good, but I fail doing the last step:

  • only add the new data from the tmpdb to the homeassistant DB.

I noticed that overlapping data simply duplicates.
This is the query:
SELECT * INTO "homeassistant"."autogen".:MEASUREMENT FROM "tmpdb"."autogen"./.*/
Same result with this variation (addition group by):
SELECT * INTO "homeassistant".autogen.:MEASUREMENT FROM "tmpdb".autogen./.*/ GROUP BY *

If I run that query twice, the destination database homeassistant more than doubles in size.

I thought duplicated data points would be overwritten. Maybe you have a suggestion (I'm very unfamiliar with database queries): a different way to perform the query, or some merging / cleanup of duplicate data.

Thanks in advance!

Best Eric

Every time it runs, it will copy all data between 6 months and 6 months + 6 hours old to the retention policy “y2_5m”, again grouping it by 5 minutes, and all data between 24 months and 24 months + 6 hours old to the policy “inf_15m”, grouping it by 15 minutes.

Hey Mike, why the 6h timeframe?

@dmartens
I did this for safety: if the script does not run due to maintenance or something like that, you will not lose the data. I looked in the database and did not see double data points for the overlapping ranges, so I thought it would be a good idea.

@erkr
Good question, I think this is even beyond my database knowledge. As I created this a year ago, I looked into the DB after executing it a few times and did not see any double data points.


Since the script ran every hour, there would have to be 5 overlapping points. What I don't know is whether there aren't any, or whether they are just not shown.
Can you test with a select statement like mine in the picture, just to check if you see double data points? You may have to replace the measurement with something you have.

EDIT:
I dived a little deeper into it, but I don't understand it. Per the docs, SELECT INTO with the same measurement, timestamp etc. should replace the values, so it should not make the database grow. I made a test setup, but I see the same problem as you: the database grows by exactly its own size with every new SELECT INTO.
But if you count the data points, they stay the same. Just try

select count(value) from "W"

And the number of data points stays the same, independent of how often you SELECT INTO. So I really don't know where the problem is at the moment. I know InfluxDB compacts the database every few days; maybe it shrinks over time?
