InfluxDB 2 data retention - my approach

Inspired by this thread, I decided to do a write-up of my InfluxDB v2 settings and tasks.

InfluxDB is running on an external VPS to preserve the local SD card. It runs as a Docker container with docker-compose, is published through a Traefik reverse proxy, and uses a Let’s Encrypt certificate automatically requested by Traefik.
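For reference, the compose file looks roughly like this. It is a trimmed sketch: the image tag, volume paths, and especially the Traefik entrypoint and certificate-resolver names are assumptions from my setup and will differ in yours:

# docker-compose.yml (sketch) - InfluxDB 2 behind Traefik with Let’s Encrypt
services:
  influxdb:
    image: influxdb:2.7
    restart: unless-stopped
    volumes:
      - ./influxdb-data:/var/lib/influxdb2      # time series data
      - ./influxdb-config:/etc/influxdb2        # server configuration
    labels:
      - "traefik.enable=true"
      # hostname used by Home Assistant (influx.example.com above)
      - "traefik.http.routers.influxdb.rule=Host(`influx.example.com`)"
      - "traefik.http.routers.influxdb.entrypoints=websecure"        # assumption: TLS entrypoint name
      - "traefik.http.routers.influxdb.tls.certresolver=letsencrypt" # assumption: resolver name
      - "traefik.http.services.influxdb.loadbalancer.server.port=8086"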

Data gathering
I am not sure which data I want to retain in long-term storage, so I decided to store all data, except for the metrics I am sure I don’t need or that make little sense to keep, into a bucket called homeassistant.

This is the resulting influxdb configuration in Home Assistant’s configuration.yaml:

influxdb:
  api_version:         2
  host:                influx.example.com
  port:                443
  organization:        12345abcde      # hexadecimal org-id
  bucket:              homeassistant
  token:               !secret influxtoken
  measurement_attr:    unit_of_measurement
  default_measurement: units
  tags:
    source: HA
  tags_attributes:
    - friendly_name
  exclude:
    domains:
      - persistent_notification
      - automation
      - device_tracker
      - group
      - scene
      - schedy_room
      - script
      - update
      - alert
      - camera
      - remote
  • This configuration will store all data in measurements (the InfluxDB term for something like a table) named after the corresponding unit (°C, kWh, V, A, etc.), as the sample query below illustrates.
  • When an entity does not specify a unit, the measurement will be named “units”.
  • The entities’ attributes are written to InfluxDB as well. I am not really interested in that data because it is mostly irrelevant and the types are not well defined. Since there seems to be no proper include/exclude mechanism for attributes, we will discard them later.
  • The bucket homeassistant and the API token need to be created before activating the configuration. This is quite self-explanatory in the InfluxDB UI.
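Once data starts flowing, a quick sanity check in the Data Explorer’s Script Editor could look something like this ("°C" is just an example measurement, any unit from your setup works):

from(bucket: "homeassistant")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "°C")  // measurements are named after the unit
  |> filter(fn: (r) => r["_field"] == "value")     // the state value; attributes end up as other fields
  |> filter(fn: (r) => r["source"] == "HA")        // the extra tag from the configuration above
  |> limit(n: 10)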

Downsampling data
So far, all data is stored in a single bucket. InfluxDB’s concept for downsampling is to run regular tasks that copy aggregated data into additional buckets.

For now, this is my retention plan:

  • Keep the detailed data in the bucket homeassistant for 30 days
  • Drop the attributes when copying into the aggregation buckets; only the entity’s value field is kept.
  • Keep min, max and mean values for the aggregation periods
  • Keep 5-minute aggregations (I have not decided on the retention period yet)
  • Keep 15-minute aggregations (I have not decided on the retention period yet)
  • Keep 60-minute aggregations (I have not decided on the retention period yet)

The next step is to create the following buckets via the InfluxDB UI:

  • homeassistant_5m
  • homeassistant_15m
  • homeassistant_1h

Then the tasks that copy data into those buckets can be created in the UI.

This is the task that fills the 1h bucket. For the other ones, the interval and the target bucket names need to be adjusted; a sketch of the 5-minute variant follows after the task.

option task = {name: "homeassistant_1h", every: 1h}

data =
    from(bucket: "homeassistant")
        |> range(start: -task.every)
        |> filter(fn: (r) => r["_field"] == "value")

data
    |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
    |> set(key: "agg_type", value: "mean")
    |> to(bucket: "homeassistant_1h", org: "my_Org")

data
    |> aggregateWindow(every: 1h, fn: max, createEmpty: false)
    |> set(key: "agg_type", value: "max")
    |> to(bucket: "homeassistant_1h", org: "my_Org")

data
    |> aggregateWindow(every: 1h, fn: min, createEmpty: false)
    |> set(key: "agg_type", value: "min")
    |> to(bucket: "homeassistant_1h", org: "my_Org")
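
For completeness, this is how the 5-minute variant starts; only the schedule, the window size and the target bucket change, and the max and min blocks are adjusted the same way:

option task = {name: "homeassistant_5m", every: 5m}

data =
    from(bucket: "homeassistant")
        |> range(start: -task.every)
        |> filter(fn: (r) => r["_field"] == "value")

data
    |> aggregateWindow(every: 5m, fn: mean, createEmpty: false)
    |> set(key: "agg_type", value: "mean")
    |> to(bucket: "homeassistant_5m", org: "my_Org")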

Downsampling older data
I added the downsampling tasks a few months after I started collecting data in the homeassistant bucket, so there is existing data that needs to be aggregated outside of the tasks before the retention period of the source bucket can be set.

This can be achieved with the following queries, executed in the Data Explorer.
Here are the queries for the 1h bucket:

from(bucket: "homeassistant")
  |> range(start: 2022-06-03T00:00:00Z, stop: 2022-08-27T11:50:00Z)
  |> filter(fn: (r) => r["_field"] == "value")
  |> aggregateWindow(every: 1h, fn: max, createEmpty: false)
  |> set(key: "agg_type", value: "max")
  |> to(bucket: "homeassistant_1h", org: "my_Org", tagColumns: ["agg_type", "domain", "entity_id", "friendly_name", "source"])

from(bucket: "homeassistant")
  |> range(start: 2022-06-03T00:00:00Z, stop: 2022-08-27T11:50:00Z)
  |> filter(fn: (r) => r["_field"] == "value")
  |> aggregateWindow(every: 1h, fn: min, createEmpty: false)
  |> set(key: "agg_type", value: "min")
  |> to(bucket: "homeassistant_1h", org: "my_Org", tagColumns: ["agg_type", "domain", "entity_id", "friendly_name", "source"])

from(bucket: "homeassistant")
  |> range(start: 2022-06-03T00:00:00Z, stop: 2022-08-27T11:50:00Z)
  |> filter(fn: (r) => r["_field"] == "value")
  |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
  |> set(key: "agg_type", value: "mean")
  |> to(bucket: "homeassistant_1h", org: "my_Org", tagColumns: ["agg_type", "domain", "entity_id", "friendly_name", "source"])
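
To check that the backfill landed, a rough count per measurement in the target bucket can be run in the Data Explorer (the time range matches the one used above):

from(bucket: "homeassistant_1h")
  |> range(start: 2022-06-03T00:00:00Z, stop: 2022-08-27T11:50:00Z)
  |> filter(fn: (r) => r["agg_type"] == "mean")  // one of the three agg_type series
  |> group(columns: ["_measurement"])            // one table per unit (°C, kWh, ...)
  |> count()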

Edits

  • Spelling and grammar
  • Use aggregateWindow instead of simple min(), max(), mean() functions in tasks. Before, mean() wasn’t working correctly.

First, thank you for marking my post and leading me here. This is exactly what I, and I think many others, have been searching for. When I was reminded of my thread yesterday, I saw that this is the way to do the job of the discontinued continuous queries. But I struggled at the beginning, as I had never dealt with a bucket before.

I have this kind of setup in my config.yaml:

influxdb:
  host: 10.10.11.88
  port: 8086
  database: homeassistant
  username: homeassistant
  password: !secret influxdb
  max_retries: 3
  default_measurement: state
...

Straightforward, as the add-on docs say. Now, is there a way to change the setup without starting from scratch?

EDIT: If I understand correctly, this only works with InfluxDB v2, and there is no HA add-on for i386 or amd64 (to run on my NUC)?

With this one, I can help: A bucket is basically a database: it stores time series data with tags, fields and values. It has a single “retention period” after which data is deleted from the bucket. API tokens can have permissions on buckets.
Users, databases and retention policies still exist in a compatibility layer, but I am no expert on this.

About the add-on: I guess once the developer is ready, or needs to switch, they will provide a migration path.

This is definitely on my list of things to do.

At the moment I keep everything for 2 years. It would be nice to keep detailed data for 90 days, hourly (min/max/mean) for 12 months, and daily (min/max/mean) for 5 years.

However, I too am still looking for a guide for those running InfluxDB as a HAOS add-on.


Thanks a lot, I have just set up an InfluxDB v2 with a similar configuration, and I will definitely look into your downsampling approach for long-term storage.

I have one question to which I couldn’t really find a good answer elsewhere, so maybe you can help here:
What is the advantage/difference of using default_measurement: units vs. default_measurement: state (which appears in most of the (older?) InfluxDB documentation)?

I think it doesn’t really matter. default_measurement just sets the name of the measurement (table) the value is stored in when the entity does not set a unit.
Meaning: °C, W, %, etc. each get their own measurement (aka table) within InfluxDB, and when no unit is set, that measurement will be called units or state.
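
If you want to check what actually ended up where, the measurements in the bucket can be listed with a short query like this:

import "influxdata/influxdb/schema"

// lists all measurement names (°C, W, %, units/state, ...) in the bucket
schema.measurements(bucket: "homeassistant")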

@elfrinjo how is this working for you so far? I’m trying to accomplish a similar thing: store measurements like °C, kWh, V, A and downsample that data, since the InfluxDB can grow quite a lot. I’m currently excluding all battery-related things.
The way I understand this: the tasks constantly fill the 5/15/60-minute buckets, and as they start to grow you downsample the data to free up some space. But do you do that on the main bucket too, or do you set a retention policy on the main bucket?

So far, storage and downsampling work as expected.
However, I am currently evaluating whether the built-in long-term statistics are sufficient for me. The reason is simply to cut down on complexity.

It works like this:
Home Assistant fills a single bucket with every measurement it produces. Next, there are three jobs, running every 5/15/60 minutes, that copy the min/mean/max of the respective period to new buckets. The bucket with the detailed data has a retention period of 30 days; the others currently do not have a retention period set.

Glad to hear that. I might ping you if I have issues, as I’m aiming to store only sensor data inside InfluxDB, mainly temperature and power-related measurements (V, A, W, etc.), and to downsample it a bit as time goes by so that 1, 2, 3+ years of stats stay available just in case; the Unraid server has plenty of disk space. Cheers for that last bit of info: 30 days on the main bucket.

This is great information. I’m using InfluxDB 1.x. Does anyone have the equivalent DB statements to replicate this setup on 1.x?

@elfrinjo How are you using the downsampled data, though? I see two issues in my case, using Grafana to visualize my InfluxDB 2 data (at 73 GB for about 1 year):

  1. I don’t want to change the Flux queries I created in Grafana over many hours. Ideally, I would like to change the above-mentioned downsampling tasks and queries to something that lets me keep my queries, e.g.:
 import "timezone"
 option location = timezone.location(name: "Europe/Berlin")
 from(bucket: "homeassistant")
   |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
   |> filter(fn: (r) => r["_measurement"] == "W")
   |> filter(fn: (r) => r["_field"] == "value")
   |> filter(fn: (r) => r["entity_id"] == "shellyem3_3494547575cf_channel_a_power" or r["entity_id"] == "shellyem3_3494547575cf_channel_b_power" or r["entity_id"] == "shellyem3_3494547575cf_channel_c_power")
   |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)
   |> map(fn: (r) => ({r with _value: r._value * 24.0}))
   |> timeShift(duration: -24h)

For me, the solution seems to be to only create mean aggregates, so I don’t have to filter out the other two in my preexisting queries (though see the alternative sketch below).
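
An alternative I am considering (untested sketch, relying on the agg_type tag that the tasks above set): point the existing query at the downsampled bucket and keep only the mean series, e.g.:

from(bucket: "homeassistant_1h")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r["_measurement"] == "W")
  |> filter(fn: (r) => r["_field"] == "value")
  |> filter(fn: (r) => r["agg_type"] == "mean")  // drop the min/max series written by the tasks
  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)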

  2. Grafana needs to access different buckets instead of only one when using downsampling. This is not trivial; solutions for setting Grafana up to use different buckets depending on the requested time frame are explained here: Grafana: Dynamic Retentions (InfluxDB) and here: Collectd to InfluxDB v2 with downsampling using Tasks | cloud-infra.engineer

As setting up the tasks and Grafana properly is quite complicated and might potentially lead to data loss, while storage keeps getting cheaper, I’m seriously considering replacing the 512 GB SSD with a 4 TB one, resizing the container + partition on my Proxmox server, and setting the retention policy for the main Home Assistant bucket to 10 years. It’s an annoying issue, as there are no easy solutions, and once the InfluxDB volume is full, you start losing the incoming new data.

Is there a good way to find entities/domains from Home Assistant which take up a lot of space or which can typically be excluded?
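
One thing I might try myself (untested sketch) is counting points per entity over a recent window and sorting by the result:

from(bucket: "homeassistant")
  |> range(start: -7d)
  |> filter(fn: (r) => r["_field"] == "value")
  |> group(columns: ["entity_id"])           // one table per entity
  |> count()                                 // points per entity in the window
  |> group()                                 // merge into a single table so it can be sorted
  |> sort(columns: ["_value"], desc: true)
  |> limit(n: 20)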


@fhb ^^^^^ Exactly this. I never got around to downsampling in InfluxDB, because it’s quite overwhelming and having historical data in different buckets kind of defeats the purpose of it. Also, for some reason I am unable to create a task from the OP’s script; it shows “invalid flux script” (yes, I did modify the bucket names and org names according to my config).

The following will work:

data =
    from(bucket: "homeassistant")
        |> filter(fn: (r) => r["_field"] == "value")

data
    |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
    |> set(key: "agg_type", value: "mean")
    |> to(bucket: "homeassistant_downsampled_to_1h", org: "home")

data
    |> aggregateWindow(every: 1h, fn: max, createEmpty: false)
    |> set(key: "agg_type", value: "max")
    |> to(bucket: "homeassistant_downsampled_to_1h", org: "home")

data
    |> aggregateWindow(every: 1h, fn: min, createEmpty: false)
    |> set(key: "agg_type", value: "min")
    |> to(bucket: "homeassistant_downsampled_to_1h", org: "home")



You have to remove the “option task” line (the first one) when you’re using the UI on the left to create the task… I had the same error!
