InfluxDB Cardinality on Home Assistant: Diagnose, Clean Up, and Monitor

InfluxDB Cardinality on Home Assistant: Diagnose, Clean Up, and Monitor

A practical writeup for anyone running the Community Add-on InfluxDB alongside the Home Assistant influxdb integration who has either seen a max-values-per-tag warning or wants visibility into what their database is growing into. Covers the cardinality vs retention distinction, the exclude pattern that actually fixes it, the cleanup query, and a fully HA-native monitoring layer using rest: and template: sensors.

Throughout this guide:

  • home_assistant stands in for whatever you named your InfluxDB database in the integration. Substitute your real value.
  • <INFLUX_HOST> is the host/IP your add-on is reachable at, typically the HA host on port 8086.
  • ha_stats is a recommended dedicated read-only user; create it before wiring the sensors.

The warning that starts the journey

In the InfluxDB add-on log you eventually see a line like:

lvl=warn msg="max-values-per-tag limit may be exceeded soon"
  service=store perc=100% n=220004 max=220000
  db_instance=home_assistant measurement=mi tag=entity_id

Decoded: in your database, one measurement has accumulated 220,004 distinct values for the entity_id tag, against a configured ceiling of 220,000. The default is 100,000. If the count crosses the ceiling, InfluxDB starts dropping any write that would introduce a new entity_id value in that measurement. Existing series keep writing.

This is not a generic disk-space problem. It is tag cardinality: the number of unique tag values in the index. It is also a problem only a handful of integrations can cause at scale, because Home Assistant entities normally have stable IDs.

What the limit actually measures

InfluxDB indexes every unique combination of measurement plus tag set as a separate series. The max-values-per-tag ceiling enforces a cap on how many distinct values one tag may hold inside one measurement. The HA influxdb integration writes the entity's entity_id as a tag and the value's unit as the measurement name. So when this warning fires, it is telling you that one unit class has accumulated more unique entities than InfluxDB is comfortable indexing.

Two things make this happen:

  1. An integration that mints a new entity_id per event rather than reusing a stable one.
  2. An infinite retention policy that never lets old index entries age out.

The well-known offender is the geo_location family of feed integrations. Lightning detectors (Blitzortung-style), USGS earthquakes, NSW rural fire, GDACS, and similar create one geo_location entity per detected event with a UUID-style suffix in the entity_id. The entity itself is short-lived in HA's state machine, but every entity_id is permanently indexed by InfluxDB.

Why a 24-hour retention does not fix this

There is a tempting line of reasoning: "I will set a 24-hour retention on lightning so it stops accumulating." It does not work, for two reasons that are worth being precise about.

First, retention policies are database-level, not measurement-level. The HA integration writes every measurement to the database's default retention policy. There is no per-measurement or per-entity routing in the integration. To target a 24-hour policy at one measurement you would need a separate writer outside the integration.

Second, and more importantly, a rolling DELETE only bounds volume, not cardinality. The InfluxDB v1 documentation is explicit: DELETE removes points but does not drop the series from the index, while DROP DATABASE, DROP MEASUREMENT, and DROP SERIES are the only operations that reduce series cardinality.

There is one exception that is genuinely useful: a short retention policy with short shard duration does reduce cardinality, because when a shard expires it is dropped whole and the series it held leave the index with it. But routing only one measurement to that policy requires a custom write path. Through the HA integration, you cannot.

The cleaner solution is to stop logging the offending entities to InfluxDB in the first place.

Find the source

Get an interactive InfluxQL session. Either Chronograf's Data Explorer (the InfluxDB add-on's web UI), the influx CLI inside the add-on container, or curl against the HTTP API. List the worst measurements first:

SHOW SERIES EXACT CARDINALITY ON "home_assistant"

You will get one row per measurement with a count. Sort descending. One measurement at hundreds of thousands while everything else is in the hundreds is the signature of unique-id churn.

Then look at the entity IDs themselves:

SHOW TAG VALUES ON "home_assistant" FROM "<measurement>"
  WITH KEY = "entity_id" LIMIT 60

Patterns to recognize:

  • Hex or UUID suffixes (lightning_strike_0123abcd..., IEEE addresses for unnamed devices).
  • Incrementing numeric suffixes (device_2, device_3, ...) from repeated re-pairing.
  • Timestamp fragments embedded in the ID.

Exclude at the integration, not at the recorder

Once you have the pattern, exclude it on the InfluxDB side. Be specific. Do not blanket-exclude a whole domain unless every entity in that domain is junk.

influxdb:
  # ... your existing connection block ...
  exclude:
    entity_globs:
      - "*.lightning_strike_*"

The glob matches on full entity_ids, so *.lightning_strike_* targets geo_location.lightning_strike_<hex> precisely while leaving every other geo_location entity, every other sensor, and every aggregate counter intact. It is domain-agnostic, which means you do not have to be right about the domain.

A common mistake worth avoiding: do not also add this to recorder: exclude. The recorder (MariaDB or SQLite) is what feeds Home Assistant's frontend, including the Map card's hours_to_show trail. If you want strikes to render on a map at all, the recorder is the source. The InfluxDB integration is write-only; nothing in the HA UI reads from InfluxDB. Excluding from InfluxDB has no effect on any HA visualization, only on long-term metrics export.

After the YAML change, restart HA. The integration does not have a config reload.

If you are concerned about recorder growth from high-churn entities, the lever is recorder.purge_entities on a schedule with a short keep_days, not a recorder exclude. The recorder's purge is time-based, so storms wash through; only the metadata churn lingers, and that is a slower problem.

Clean up the index

Excluding stops new strikes from being written. The existing 200k+ entries are still in the index until you remove them. The right tool is DROP SERIES with a tag regex, scoped to the offending measurement:

DROP SERIES FROM "<measurement>" WHERE "entity_id" =~ /^lightning_strike_/

This drops every series in that measurement whose entity_id starts with the pattern, keeping any legitimate companion sensors (a distance or counter sensor with a fixed entity_id is unaffected).

Notes on the cleanup:

  • DROP SERIES is index-heavy at hundreds of thousands of series and triggers a TSM compaction. Run it in a maintenance window.
  • After it completes, restart the InfluxDB add-on so the in-memory index rebuilds and the max-values-per-tag counter settles. There are upstream reports that the counter can lag without an index rebuild.
  • Once the cleanup lands, lower max-values-per-tag back to the default (100,000) in the add-on config so the limit is a meaningful guardrail again. Do not disable it (= 0). Disabling silences the warning at the cost of unbounded index growth in RAM.

Expected impact: total series count drops by roughly the number you removed, immediately. Disk size does not drop immediately. InfluxDB only reclaims space during TSM compaction, sometimes hours later. And for unique-id-per-event data, most of the cost is in the index, not in data volume, so do not expect a dramatic disk shrink. Cardinality is the metric that proves the cleanup landed.

Permanent monitoring with native HA sensors

The lesson of the cardinality blowup is that you want to see this kind of accumulation building, not stumble on it when the limit fires. Here is a monitoring layer that lives entirely in HA, no shell scripts, no cron jobs, secrets handled through !secret.

Prerequisites

Create a dedicated read-only InfluxDB user with access to the internal monitoring database and your HA database:

CREATE USER "ha_stats" WITH PASSWORD 'long_random_value'
GRANT READ ON "_internal" TO "ha_stats"
GRANT READ ON "home_assistant" TO "ha_stats"
SHOW GRANTS FOR "ha_stats"

SHOW GRANTS is the proof check. Do not reuse the integration's write user for monitoring; least privilege belongs here.

In secrets.yaml:

influx_stats_user: ha_stats
influx_stats_pass: paste_password

Size and basic stats (REST sensor against _internal)

This bundle pulls four numbers in a single HTTP request: total on-disk bytes (summed across shards), series and measurement counts, and TSM file footprint and count. _internal.monitor.* is InfluxDB's own self-monitoring database; the fields used here are standard.

rest:
  - resource: "http://<INFLUX_HOST>:8086/query"
    method: GET
    scan_interval: 300
    timeout: 30
    username: !secret influx_stats_user
    password: !secret influx_stats_pass
    params:
      db: _internal
      q: >-
        SELECT last("diskBytes") FROM "_internal"."monitor"."shard"
        WHERE "database" = 'home_assistant' AND time > now() - 5m GROUP BY "id";
        SELECT last("numSeries"), last("numMeasurements") FROM "_internal"."monitor"."database"
        WHERE "database" = 'home_assistant' AND time > now() - 5m;
        SELECT last("diskBytes"), last("numFiles") FROM "_internal"."monitor"."tsm1_filestore"
        WHERE "database" = 'home_assistant' AND time > now() - 5m GROUP BY "path"
    sensor:
      - name: "InfluxDB Disk Size"
        unique_id: influxdb_disk_size
        device_class: data_size
        unit_of_measurement: MB
        state_class: measurement
        value_template: >-
          {% set ns = namespace(t=0) %}
          {% for s in value_json.results[0].series | default([]) %}
            {% set v = s['values'][0][1] %}
            {% if v is number %}{% set ns.t = ns.t + v %}{% endif %}
          {% endfor %}
          {{ (ns.t / 1048576) | round(1) }}

      - name: "InfluxDB Series Count"
        unique_id: influxdb_series_count
        unit_of_measurement: series
        state_class: measurement
        value_template: >-
          {% set s = (value_json.results[1].series | default([{}]))[0] %}
          {% set cols = s.get('columns', []) %}
          {% set vals = (s.get('values', [[]]))[0] %}
          {{ (vals[cols.index('last')] if 'last' in cols else 0) | int(0) }}

      - name: "InfluxDB Measurements Count"
        unique_id: influxdb_measurements_count
        unit_of_measurement: measurements
        state_class: measurement
        value_template: >-
          {% set s = (value_json.results[1].series | default([{}]))[0] %}
          {% set cols = s.get('columns', []) %}
          {% set vals = (s.get('values', [[]]))[0] %}
          {{ (vals[cols.index('last_1')] if 'last_1' in cols else 0) | int(0) }}

      - name: "InfluxDB TSM Disk"
        unique_id: influxdb_tsm_disk
        device_class: data_size
        unit_of_measurement: MB
        state_class: measurement
        value_template: >-
          {% set ns = namespace(t=0) %}
          {% for s in value_json.results[2].series | default([]) %}
            {% set v = s['values'][0][1] %}
            {% if v is number %}{% set ns.t = ns.t + v %}{% endif %}
          {% endfor %}
          {{ (ns.t / 1048576) | round(1) }}

      - name: "InfluxDB TSM Files"
        unique_id: influxdb_tsm_files
        unit_of_measurement: files
        state_class: measurement
        value_template: >-
          {% set ns = namespace(t=0) %}
          {% for s in value_json.results[2].series | default([]) %}
            {% set v = s['values'][0][2] %}
            {% if v is number %}{% set ns.t = ns.t + v %}{% endif %}
          {% endfor %}
          {{ ns.t | int(0) }}

Two notes:

  • The disk-size sum uses last("diskBytes") ... GROUP BY "id" and aggregates client-side. The influxdb platform sensor cannot reproduce this; it has no GROUP BY and applies a single aggregator. Using group_function: sum instead inflates the value by however many monitor samples fell in the time window (one every ten seconds). The rest: approach is the structurally correct one.
  • device_class: data_size does not auto-rescale the displayed unit. The number you compute is what you see. Bytes is technically faithful but unreadable, so divide to MB or GB in value_template.

Cardinality view (separate REST entry, slower poll)

The cardinality query walks the index and is expensive while you still have a bomb in there. Split it onto its own REST entry with a longer timeout and a daily poll. After cleanup it is fast again.

  - resource: "http://<INFLUX_HOST>:8086/query"
    method: GET
    scan_interval: 86400
    timeout: 60
    username: !secret influx_stats_user
    password: !secret influx_stats_pass
    params:
      db: home_assistant
      q: 'SHOW SERIES EXACT CARDINALITY ON "home_assistant"'
    sensor:
      - name: "InfluxDB Top Cardinality"
        unique_id: influxdb_top_cardinality
        unit_of_measurement: series
        state_class: measurement
        value_template: >-
          {% set ns = namespace(top=0) %}
          {% for s in value_json.results[0].series | default([]) %}
            {% set c = s['values'][0][0] | int(0) %}
            {% if c > ns.top %}{% set ns.top = c %}{% endif %}
          {% endfor %}
          {{ ns.top }}
        json_attributes_path: "$.results[0]"
        json_attributes:
          - series

State is the largest single-measurement cardinality. The series attribute holds the full raw response (alphabetical, as InfluxDB returns it) for template sensors to sort.

Top writer view (cross-measurement entity activity)

InfluxDB does not expose per-series disk size. The closest cheap proxy for "which entities are costing me the most" is recent point count, grouped by entity_id, across all measurements. The query is time-bounded so it scales with recent points, not with historical cardinality.

  - resource: "http://<INFLUX_HOST>:8086/query"
    method: GET
    scan_interval: 3600
    timeout: 30
    username: !secret influx_stats_user
    password: !secret influx_stats_pass
    params:
      db: home_assistant
      q: 'SELECT count("value") FROM /.*/ WHERE time > now() - 1h GROUP BY "entity_id"'
    sensor:
      - name: "InfluxDB Top Writer Points 1h"
        unique_id: influxdb_top_writer_points_1h
        unit_of_measurement: points
        state_class: measurement
        value_template: >-
          {% set series = value_json.results[0].series | default([]) %}
          {% set ns = namespace(top=0) %}
          {% for s in series %}
            {% set c = s['values'][0][1] | int(0) %}
            {% if c > ns.top %}{% set ns.top = c %}{% endif %}
          {% endfor %}
          {{ ns.top }}
        json_attributes_path: "$.results[0]"
        json_attributes:
          - series

count("value") catches every point the HA integration writes, regardless of state type. The 1-hour window keeps the scan light. Stretch to 6h or 24h for trend data once your index is clean.

Template sensors and outlier alert

These derive the human-friendly views: top measurement name, sorted top-N lists, and a breakdown of every measurement ranked by series count.

template:
  - sensor:
      - name: "InfluxDB Top Measurement"
        unique_id: influxdb_top_measurement
        state: >-
          {% set series = state_attr('sensor.influxdb_top_cardinality', 'series') or [] %}
          {% set ns = namespace(name='unknown', count=0) %}
          {% for s in series %}
            {% set c = s['values'][0][0] | int(0) %}
            {% if c > ns.count %}
              {% set ns.count = c %}
              {% set ns.name = s['name'] %}
            {% endif %}
          {% endfor %}
          {{ ns.name }}
        attributes:
          top_count: >-
            {% set series = state_attr('sensor.influxdb_top_cardinality', 'series') or [] %}
            {% set ns = namespace(count=0) %}
            {% for s in series %}
              {% set c = s['values'][0][0] | int(0) %}
              {% if c > ns.count %}{% set ns.count = c %}{% endif %}
            {% endfor %}
            {{ ns.count }}
          top10: >-
            {% set series = state_attr('sensor.influxdb_top_cardinality', 'series') or [] %}
            {% set ns = namespace(pairs=[]) %}
            {% for s in series %}
              {% set ns.pairs = ns.pairs + [{'measurement': s['name'], 'series': s['values'][0][0] | int(0)}] %}
            {% endfor %}
            {{ (ns.pairs | sort(attribute='series', reverse=True))[:10] | list }}

      - name: "InfluxDB Cardinality Breakdown"
        unique_id: influxdb_cardinality_breakdown
        state: >-
          {% set series = state_attr('sensor.influxdb_top_cardinality', 'series') or [] %}
          {{ series | length }}
        unit_of_measurement: measurements
        state_class: measurement
        attributes:
          breakdown: >-
            {% set series = state_attr('sensor.influxdb_top_cardinality', 'series') or [] %}
            {% set ns = namespace(pairs=[]) %}
            {% for s in series %}
              {% set ns.pairs = ns.pairs + [{'measurement': s['name'], 'series': s['values'][0][0] | int(0)}] %}
            {% endfor %}
            {{ ns.pairs | sort(attribute='series', reverse=True) | list }}

      - name: "InfluxDB Top Writer"
        unique_id: influxdb_top_writer
        state: >-
          {% set series = state_attr('sensor.influxdb_top_writer_points_1h', 'series') or [] %}
          {% set ns = namespace(name='unknown', count=0) %}
          {% for s in series %}
            {% set c = s['values'][0][1] | int(0) %}
            {% if c > ns.count %}
              {% set ns.count = c %}
              {% set ns.name = s['tags']['entity_id'] %}
            {% endif %}
          {% endfor %}
          {{ ns.name }}
        attributes:
          measurement: >-
            {% set series = state_attr('sensor.influxdb_top_writer_points_1h', 'series') or [] %}
            {% set ns = namespace(meas='', count=0) %}
            {% for s in series %}
              {% set c = s['values'][0][1] | int(0) %}
              {% if c > ns.count %}{% set ns.count = c %}{% set ns.meas = s['name'] %}{% endif %}
            {% endfor %}
            {{ ns.meas }}
          top20: >-
            {% set series = state_attr('sensor.influxdb_top_writer_points_1h', 'series') or [] %}
            {% set ns = namespace(pairs=[]) %}
            {% for s in series %}
              {% set ns.pairs = ns.pairs + [{
                'entity_id': s['tags']['entity_id'],
                'measurement': s['name'],
                'points': s['values'][0][1] | int(0)
              }] %}
            {% endfor %}
            {{ (ns.pairs | sort(attribute='points', reverse=True))[:20] | list }}

  - binary_sensor:
      - name: "InfluxDB Cardinality Outlier"
        unique_id: influxdb_cardinality_outlier
        device_class: problem
        state: "{{ states('sensor.influxdb_top_cardinality') | int(0) > 5000 }}"
        attributes:
          worst_measurement: "{{ states('sensor.influxdb_top_measurement') }}"

The 5000 threshold on the outlier alert is the lever to tune. Pick a number a comfortable margin above your current legitimate top measurement and revisit as the system grows.

Final entity inventory

If everything is in place you should have eleven entities:

  • Five size and health sensors: influxdb_disk_size, influxdb_series_count, influxdb_measurements_count, influxdb_tsm_disk, influxdb_tsm_files.
  • Three cardinality sensors: influxdb_top_cardinality (raw input), influxdb_top_measurement (sorted top with top10), influxdb_cardinality_breakdown (full sorted list).
  • Two writer sensors: influxdb_top_writer_points_1h (raw input), influxdb_top_writer (top entity with top20).
  • One outlier alert: binary_sensor.influxdb_cardinality_outlier.

Pitfalls observed along the way

A few things that catch most people building this:

  • The default REST sensor timeout is 10 seconds. The SHOW SERIES EXACT CARDINALITY query against a database with a cardinality bomb can run well past that; raise timeout accordingly, or split that query onto its own slower-polling REST entry as shown above.
  • homeassistant.update_entity takes its target under data:, not target:. Forcing a refresh while debugging needs the right key.
  • The influxdb platform sensor is not interchangeable with rest: for these queries. It has no GROUP BY, returns a single scalar, and cannot expose the JSON response to other sensors. Use it for simple "one number from one measurement" cases; use rest: for everything in this guide.
  • Excluding entities from the InfluxDB integration is independent of the recorder exclude. Touch only what you need to. The integration is write-only; the recorder feeds the UI.
  • Bytes is a faithful unit but unreadable. device_class: data_size does not auto-rescale. Divide in value_template and set the unit to MB or GB explicitly.

Closing

Cardinality blowups in InfluxDB on Home Assistant almost always come from one or two integrations that mint unique entity_ids per event. The right fix is to exclude them at the integration layer, drop their series from the index, and put cheap monitoring in place to catch the next one before the warning fires. None of that needs a shell script, a watcher container, or anything external to HA. The setup above keeps secrets in secrets.yaml, configuration in version-controllable YAML, and everything in scope of regular HA backups.

A complex issue that seems to be a workaround for a design flaw.

Agreed. I haven't wrapped my head around why they designed the data storage that way but my gut is the lighting strike distance from Blitzortung could be organised differently.

But I also hold that this and many other addons should by default create health monitoring and alerting functions out of the box. For example this and other apps should raise a problem indication that is hooked into a repair ticket. This should be added as a consideration for gold or platinum status.

Hopefully this can help someone in the future with similar error without the days of research.