Decreasing Influxdb size by removing measurements

exrx · November 15, 2022, 10:27am

I use InfluxDB for long-term measurement storage. When I started using it, I did not choose which measurements to store. I don’t want to apply a retention policy to the measurements I care about (i.e. the temperature and humidity of my house); I want to keep those indefinitely. However, I would like to get rid of uninteresting measurements, such as sensor battery levels.

How do I delete specific values from the database in such a way that InfluxDB’s disk usage is reduced?
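
One possible approach, sketched under assumptions: InfluxDB 1.x with InfluxQL, and the default Home Assistant schema, where the measurement name is the unit of measurement (for example `%`) and each sensor is identified by an `entity_id` tag. The helper names and the entity name below are hypothetical. `DROP SERIES` removes the matching series from the data and the index; the disk space may only be reclaimed once the affected shards are compacted.

```python
def quote_ident(name: str) -> str:
    """Quote an InfluxQL identifier (measurement name or tag key)."""
    return '"' + name.replace("\\", "\\\\").replace('"', '\\"') + '"'


def quote_str(value: str) -> str:
    """Quote an InfluxQL string literal (tag value)."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def drop_series_statements(measurement, entity_ids):
    """Build one DROP SERIES statement per entity (hypothetical helper).

    Assumes every sensor is tagged with "entity_id", as in the default
    Home Assistant InfluxDB schema.
    """
    return [
        f"DROP SERIES FROM {quote_ident(measurement)} "
        f"WHERE {quote_ident('entity_id')} = {quote_str(entity_id)}"
        for entity_id in entity_ids
    ]


def run_statements(client, statements):
    """Send each statement through an object with a query() method.

    With the influxdb package this would be an InfluxDBClient; statements
    that modify data are sent with method="POST".
    """
    for statement in statements:
        client.query(statement, method="POST")
```

For example, `drop_series_statements("%", ["hall_sensor_battery"])` produces a single statement that removes the battery series of that (made-up) sensor while leaving other `%` series such as humidity untouched. Running `SHOW SERIES` first is a cheap way to check which series a statement would match.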

johnbull · February 19, 2023, 9:09pm

Bumping this…
Even though I started with quite strict rules for what to save, the database size grows pretty fast.
It would be very nice to have a GUI function to simply drop entities that are no longer of interest.

Another function that would be highly appreciated is a more customizable data retention policy… via the GUI, for us beginners…
I read about someone who managed to create retention policies that drop excessive old data (though I never understood whether the description was for a “normal” installation of Home Assistant on a Pi/NUC etc. or for a container installation).
However, what he described was:
Data from now to X months: Keep as is
Data from X months to Y months: Save one data point per 5 min
Data from Y months to eternity: Save one data point per 15 min
Something like this would be brilliant to be able to configure easily, preferably via the InfluxDB add-on GUI, or if someone could write a very clear description of how to configure it.
Preferably with X and Y configurable, as well as the data point spacing for each interval.
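
A rough sketch of how such a tiered scheme might be expressed in InfluxDB 1.x, using retention policies plus continuous queries. It only generates the InfluxQL statements. The policy names, the `Tier` class and the durations are hypothetical, and since InfluxQL durations have no month unit, X and Y are given in weeks.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str      # retention policy name (hypothetical, e.g. "rp_5m")
    interval: str  # spacing of downsampled points, e.g. "5m"
    duration: str  # how long this tier keeps data, e.g. "104w" or "INF"


def tier_statements(database, source_rp, raw_duration, tiers):
    """Build InfluxQL for a raw tier plus downsampled tiers.

    Every continuous query reads from the raw retention policy
    (source_rp) and writes mean values into its own tier.
    """
    statements = []
    for tier in tiers:
        statements.append(
            f'CREATE RETENTION POLICY "{tier.name}" ON "{database}" '
            f"DURATION {tier.duration} REPLICATION 1"
        )
        statements.append(
            f'CREATE CONTINUOUS QUERY "cq_{tier.name}" ON "{database}" BEGIN '
            f'SELECT mean(*) INTO "{database}"."{tier.name}".:MEASUREMENT '
            f'FROM "{database}"."{source_rp}"./.*/ '
            f"GROUP BY time({tier.interval}), * END"
        )
    # Shortening the raw policy deletes older raw data, so it comes last
    # and should only be run once the tiers have been backfilled.
    statements.append(
        f'ALTER RETENTION POLICY "{source_rp}" ON "{database}" '
        f"DURATION {raw_duration}"
    )
    return statements


if __name__ == "__main__":
    plan = tier_statements(
        "homeassistant",
        "autogen",
        "26w",                            # X: keep raw data about 6 months
        [
            Tier("rp_5m", "5m", "104w"),  # Y: keep 5-minute data about 2 years
            Tier("rp_15m", "15m", "INF"), # keep 15-minute data forever
        ],
    )
    print("\n".join(plan))
```

Continuous queries only process new data as it arrives, so existing history would still need a one-off backfill into each tier before the raw policy is shortened.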

tom_l · February 20, 2023, 1:24am

I have not tried this yet but it is on my list of things to do. Here are some links you may find useful:

https://alex3305.github.io/home-assistant-docs/add-ons/influxdb-downsampling/

https://community.home-assistant.io/t/influxdb-1-8-setup-one-continuous-query-for-whole-database/292242

https://youtu.be/0xt8T5bIw-4

pove · December 30, 2024, 5:30pm

The first link, about InfluxDB downsampling, was very useful for me. I have a lot of data and I want it grouped into 15-minute intervals. The queries to backfill my data fail if I run them for more than one week at a time, and each week takes about one minute. I have almost 3 years of records, so I created this AppDaemon app:

import time

import hassapi as hass
from influxdb import InfluxDBClient

class InfluxDBQueryAutomation(hass.Hass):

    def initialize(self):
        # InfluxDB connection details
        self.host = "influxhost"
        self.port = 8086
        self.username = "username"
        self.password = "password"
        self.database = "homeassistant"

        # Connect to InfluxDB
        self.client = InfluxDBClient(host=self.host, port=self.port, username=self.username, password=self.password)
        self.client.switch_database(self.database)

        # Time range configuration
        self.weeks_to_process = 150  # Number of 1-week periods to process
        self.start_weeks_ago = 150  # Starting point in weeks ago
        self.interval_weeks = 1     # Interval in weeks

        # Start the process to generate queries
        self.run_queries()

    def run_queries(self):
        """Generate and execute queries with a 1-second delay between them."""
        for i in range(self.weeks_to_process):
            start_week = self.start_weeks_ago - (i * self.interval_weeks)
            end_week = start_week - self.interval_weeks
            query = f"""
            SELECT mean(*) INTO "homeassistant"."infinite".:MEASUREMENT 
            FROM "homeassistant"."autogen"./.*/ 
            WHERE time > now() - {start_week}w AND time < now() - {end_week}w 
            GROUP BY time(15m), * FILL(previous)
            """
            self.log(f"Executing query for weeks {start_week} to {end_week}")
            
            try:
                # Execute query and wait for the response
                result = self.client.query(query)
                
                # Check if query returned successfully
                if result:
                    self.log(f"Query for weeks {start_week} to {end_week} executed successfully.")
                else:
                    self.log(f"Query for weeks {start_week} to {end_week} returned no result or failed.")

            except Exception as e:
                self.log(f"Error executing query for weeks {start_week} to {end_week}: {e}")

            # Delay of 1 second before the next query
            time.sleep(1)

        self.log("All queries executed.")

I just had to add the influxdb Python package to the AppDaemon configuration.
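
The weekly chunking can also be written with absolute timestamps, so the window boundaries stay fixed even though each query takes about a minute to run. This is a minimal sketch that only builds the query text, assuming the same database and retention policy names as the app above; `week_windows` and `backfill_query` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone


def week_windows(end, weeks, step_weeks=1):
    """Yield (start, stop) pairs, oldest first, covering `weeks` weeks before `end`."""
    start = end - timedelta(weeks=weeks)
    step = timedelta(weeks=step_weeks)
    while start < end:
        stop = min(start + step, end)
        yield start, stop
        start = stop


def backfill_query(start, stop):
    """Build the downsampling query for one window.

    Both datetimes are assumed to be in UTC, because the timestamps
    are formatted with a literal "Z" suffix.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (
        'SELECT mean(*) INTO "homeassistant"."infinite".:MEASUREMENT '
        'FROM "homeassistant"."autogen"./.*/ '
        f"WHERE time >= '{start.strftime(fmt)}' AND time < '{stop.strftime(fmt)}' "
        "GROUP BY time(15m), * FILL(previous)"
    )


if __name__ == "__main__":
    # Truncate "now" to midnight so windows align with the 15-minute buckets.
    now = datetime.now(timezone.utc).replace(hour=0, minute=0, second=0, microsecond=0)
    for start, stop in week_windows(now, weeks=3):
        print(backfill_query(start, stop))
```

Each window starts exactly where the previous one stops (`>=` on the lower bound, `<` on the upper bound), so no point is skipped or processed twice.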

pove · December 31, 2024, 3:58pm

I have discovered a downside of downsampling the data: InfluxDB adds a prefix to the grouped fields. In this case, all fields in the infinite RP now start with “mean_”, so in Grafana you cannot easily switch between retention policies (default/infinite).

GitHub issue in influxdata/influxdb: “Empty prefix alias for wildcards aggregate functions”, opened 20 Sep 2016 by steverweber (labels: area/queries, kind/enhancement, flux/triaged, 1.x)

Moving and downsampling data between databases or retention policies should allow us to maintain the field names. I tried ``` SELECT last(*) AS "" INTO "mfcf_vmware"."autogen".:measurement FROM /^vmware_.*/ WHERE time >= '2016-01-01T00:00:00Z' GROUP BY time(30m), "host" ``` but sadly because the prefix key is empty it fallbacks to "last_" in this case. https://github.com/influxdata/influxdb/pull/7009 @jsternberg would adding this be simple? @gunnaraasen should something different be done so it uses backreferences?
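
A workaround sometimes used for the prefix problem is to list the fields explicitly and alias each one back to its own name instead of using a wildcard such as `mean(*)`. The sketch below only builds the query text. The field name `value` is an assumption about the Home Assistant schema, and `aliased_select` and `downsample_query` are hypothetical helpers.

```python
def aliased_select(fields, func="mean"):
    """Build a SELECT list that keeps the original field names.

    aliased_select(["value"]) gives: mean("value") AS "value"
    """
    return ", ".join(f'{func}("{field}") AS "{field}"' for field in fields)


def downsample_query(fields, interval="15m", since="1w"):
    """Build a downsampling query without the automatic "mean_" prefix.

    Database and retention policy names follow the AppDaemon app above;
    `since` is the look-back window for the WHERE clause.
    """
    return (
        f"SELECT {aliased_select(fields)} "
        'INTO "homeassistant"."infinite".:MEASUREMENT '
        'FROM "homeassistant"."autogen"./.*/ '
        f"WHERE time > now() - {since} "
        f"GROUP BY time({interval}), *"
    )


if __name__ == "__main__":
    print(downsample_query(["value"]))
```

The trade-off is that every field to be kept has to be named, so fields that are not listed are not copied into the downsampled retention policy.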