InfluxDB Cardinality on Home Assistant: Diagnose, Clean Up, and Monitor
A practical writeup for anyone running the Community Add-on InfluxDB alongside the Home Assistant influxdb integration who has either seen a max-values-per-tag warning or wants visibility into what their database is growing into. Covers the cardinality vs retention distinction, the exclude pattern that actually fixes it, the cleanup query, and a fully HA-native monitoring layer using rest: and template: sensors.
Throughout this guide:
home_assistantstands in for whatever you named your InfluxDB database in the integration. Substitute your real value.<INFLUX_HOST>is the host/IP your add-on is reachable at, typically the HA host on port 8086.ha_statsis a recommended dedicated read-only user; create it before wiring the sensors.
The warning that starts the journey
In the InfluxDB add-on log you eventually see a line like:
lvl=warn msg="max-values-per-tag limit may be exceeded soon"
service=store perc=100% n=220004 max=220000
db_instance=home_assistant measurement=mi tag=entity_id
Decoded: in your database, one measurement has accumulated 220,004 distinct values for the entity_id tag, against a configured ceiling of 220,000. The default is 100,000. If the count crosses the ceiling, InfluxDB starts dropping any write that would introduce a new entity_id value in that measurement. Existing series keep writing.
This is not a generic disk-space problem. It is tag cardinality: the number of unique tag values in the index. It is also a problem only a handful of integrations can cause at scale, because Home Assistant entities normally have stable IDs.
What the limit actually measures
InfluxDB indexes every unique combination of measurement plus tag set as a separate series. The max-values-per-tag ceiling enforces a cap on how many distinct values one tag may hold inside one measurement. The HA influxdb integration writes the entity's entity_id as a tag and the value's unit as the measurement name. So when this warning fires, it is telling you that one unit class has accumulated more unique entities than InfluxDB is comfortable indexing.
Two things make this happen:
- An integration that mints a new
entity_idper event rather than reusing a stable one. - An infinite retention policy that never lets old index entries age out.
The well-known offender is the geo_location family of feed integrations. Lightning detectors (Blitzortung-style), USGS earthquakes, NSW rural fire, GDACS, and similar create one geo_location entity per detected event with a UUID-style suffix in the entity_id. The entity itself is short-lived in HA's state machine, but every entity_id is permanently indexed by InfluxDB.
Why a 24-hour retention does not fix this
There is a tempting line of reasoning: "I will set a 24-hour retention on lightning so it stops accumulating." It does not work, for two reasons that are worth being precise about.
First, retention policies are database-level, not measurement-level. The HA integration writes every measurement to the database's default retention policy. There is no per-measurement or per-entity routing in the integration. To target a 24-hour policy at one measurement you would need a separate writer outside the integration.
Second, and more importantly, a rolling DELETE only bounds volume, not cardinality. The InfluxDB v1 documentation is explicit: DELETE removes points but does not drop the series from the index, while DROP DATABASE, DROP MEASUREMENT, and DROP SERIES are the only operations that reduce series cardinality.
There is one exception that is genuinely useful: a short retention policy with short shard duration does reduce cardinality, because when a shard expires it is dropped whole and the series it held leave the index with it. But routing only one measurement to that policy requires a custom write path. Through the HA integration, you cannot.
The cleaner solution is to stop logging the offending entities to InfluxDB in the first place.
Find the source
Get an interactive InfluxQL session. Either Chronograf's Data Explorer (the InfluxDB add-on's web UI), the influx CLI inside the add-on container, or curl against the HTTP API. List the worst measurements first:
SHOW SERIES EXACT CARDINALITY ON "home_assistant"
You will get one row per measurement with a count. Sort descending. One measurement at hundreds of thousands while everything else is in the hundreds is the signature of unique-id churn.
Then look at the entity IDs themselves:
SHOW TAG VALUES ON "home_assistant" FROM "<measurement>"
WITH KEY = "entity_id" LIMIT 60
Patterns to recognize:
- Hex or UUID suffixes (
lightning_strike_0123abcd..., IEEE addresses for unnamed devices). - Incrementing numeric suffixes (
device_2,device_3, ...) from repeated re-pairing. - Timestamp fragments embedded in the ID.
Exclude at the integration, not at the recorder
Once you have the pattern, exclude it on the InfluxDB side. Be specific. Do not blanket-exclude a whole domain unless every entity in that domain is junk.
influxdb:
# ... your existing connection block ...
exclude:
entity_globs:
- "*.lightning_strike_*"
The glob matches on full entity_ids, so *.lightning_strike_* targets geo_location.lightning_strike_<hex> precisely while leaving every other geo_location entity, every other sensor, and every aggregate counter intact. It is domain-agnostic, which means you do not have to be right about the domain.
A common mistake worth avoiding: do not also add this to recorder: exclude. The recorder (MariaDB or SQLite) is what feeds Home Assistant's frontend, including the Map card's hours_to_show trail. If you want strikes to render on a map at all, the recorder is the source. The InfluxDB integration is write-only; nothing in the HA UI reads from InfluxDB. Excluding from InfluxDB has no effect on any HA visualization, only on long-term metrics export.
After the YAML change, restart HA. The integration does not have a config reload.
If you are concerned about recorder growth from high-churn entities, the lever is
recorder.purge_entitieson a schedule with a shortkeep_days, not a recorder exclude. The recorder's purge is time-based, so storms wash through; only the metadata churn lingers, and that is a slower problem.
Clean up the index
Excluding stops new strikes from being written. The existing 200k+ entries are still in the index until you remove them. The right tool is DROP SERIES with a tag regex, scoped to the offending measurement:
DROP SERIES FROM "<measurement>" WHERE "entity_id" =~ /^lightning_strike_/
This drops every series in that measurement whose entity_id starts with the pattern, keeping any legitimate companion sensors (a distance or counter sensor with a fixed entity_id is unaffected).
Notes on the cleanup:
DROP SERIESis index-heavy at hundreds of thousands of series and triggers a TSM compaction. Run it in a maintenance window.- After it completes, restart the InfluxDB add-on so the in-memory index rebuilds and the
max-values-per-tagcounter settles. There are upstream reports that the counter can lag without an index rebuild. - Once the cleanup lands, lower
max-values-per-tagback to the default (100,000) in the add-on config so the limit is a meaningful guardrail again. Do not disable it (= 0). Disabling silences the warning at the cost of unbounded index growth in RAM.
Expected impact: total series count drops by roughly the number you removed, immediately. Disk size does not drop immediately. InfluxDB only reclaims space during TSM compaction, sometimes hours later. And for unique-id-per-event data, most of the cost is in the index, not in data volume, so do not expect a dramatic disk shrink. Cardinality is the metric that proves the cleanup landed.
Permanent monitoring with native HA sensors
The lesson of the cardinality blowup is that you want to see this kind of accumulation building, not stumble on it when the limit fires. Here is a monitoring layer that lives entirely in HA, no shell scripts, no cron jobs, secrets handled through !secret.
Prerequisites
Create a dedicated read-only InfluxDB user with access to the internal monitoring database and your HA database:
CREATE USER "ha_stats" WITH PASSWORD 'long_random_value'
GRANT READ ON "_internal" TO "ha_stats"
GRANT READ ON "home_assistant" TO "ha_stats"
SHOW GRANTS FOR "ha_stats"
SHOW GRANTS is the proof check. Do not reuse the integration's write user for monitoring; least privilege belongs here.
In secrets.yaml:
influx_stats_user: ha_stats
influx_stats_pass: paste_password
Size and basic stats (REST sensor against _internal)
This bundle pulls four numbers in a single HTTP request: total on-disk bytes (summed across shards), series and measurement counts, and TSM file footprint and count. _internal.monitor.* is InfluxDB's own self-monitoring database; the fields used here are standard.
rest:
- resource: "http://<INFLUX_HOST>:8086/query"
method: GET
scan_interval: 300
timeout: 30
username: !secret influx_stats_user
password: !secret influx_stats_pass
params:
db: _internal
q: >-
SELECT last("diskBytes") FROM "_internal"."monitor"."shard"
WHERE "database" = 'home_assistant' AND time > now() - 5m GROUP BY "id";
SELECT last("numSeries"), last("numMeasurements") FROM "_internal"."monitor"."database"
WHERE "database" = 'home_assistant' AND time > now() - 5m;
SELECT last("diskBytes"), last("numFiles") FROM "_internal"."monitor"."tsm1_filestore"
WHERE "database" = 'home_assistant' AND time > now() - 5m GROUP BY "path"
sensor:
- name: "InfluxDB Disk Size"
unique_id: influxdb_disk_size
device_class: data_size
unit_of_measurement: MB
state_class: measurement
value_template: >-
{% set ns = namespace(t=0) %}
{% for s in value_json.results[0].series | default([]) %}
{% set v = s['values'][0][1] %}
{% if v is number %}{% set ns.t = ns.t + v %}{% endif %}
{% endfor %}
{{ (ns.t / 1048576) | round(1) }}
- name: "InfluxDB Series Count"
unique_id: influxdb_series_count
unit_of_measurement: series
state_class: measurement
value_template: >-
{% set s = (value_json.results[1].series | default([{}]))[0] %}
{% set cols = s.get('columns', []) %}
{% set vals = (s.get('values', [[]]))[0] %}
{{ (vals[cols.index('last')] if 'last' in cols else 0) | int(0) }}
- name: "InfluxDB Measurements Count"
unique_id: influxdb_measurements_count
unit_of_measurement: measurements
state_class: measurement
value_template: >-
{% set s = (value_json.results[1].series | default([{}]))[0] %}
{% set cols = s.get('columns', []) %}
{% set vals = (s.get('values', [[]]))[0] %}
{{ (vals[cols.index('last_1')] if 'last_1' in cols else 0) | int(0) }}
- name: "InfluxDB TSM Disk"
unique_id: influxdb_tsm_disk
device_class: data_size
unit_of_measurement: MB
state_class: measurement
value_template: >-
{% set ns = namespace(t=0) %}
{% for s in value_json.results[2].series | default([]) %}
{% set v = s['values'][0][1] %}
{% if v is number %}{% set ns.t = ns.t + v %}{% endif %}
{% endfor %}
{{ (ns.t / 1048576) | round(1) }}
- name: "InfluxDB TSM Files"
unique_id: influxdb_tsm_files
unit_of_measurement: files
state_class: measurement
value_template: >-
{% set ns = namespace(t=0) %}
{% for s in value_json.results[2].series | default([]) %}
{% set v = s['values'][0][2] %}
{% if v is number %}{% set ns.t = ns.t + v %}{% endif %}
{% endfor %}
{{ ns.t | int(0) }}
Two notes:
- The disk-size sum uses
last("diskBytes") ... GROUP BY "id"and aggregates client-side. Theinfluxdbplatform sensor cannot reproduce this; it has noGROUP BYand applies a single aggregator. Usinggroup_function: suminstead inflates the value by however many monitor samples fell in the time window (one every ten seconds). Therest:approach is the structurally correct one. device_class: data_sizedoes not auto-rescale the displayed unit. The number you compute is what you see. Bytes is technically faithful but unreadable, so divide to MB or GB invalue_template.
Cardinality view (separate REST entry, slower poll)
The cardinality query walks the index and is expensive while you still have a bomb in there. Split it onto its own REST entry with a longer timeout and a daily poll. After cleanup it is fast again.
- resource: "http://<INFLUX_HOST>:8086/query"
method: GET
scan_interval: 86400
timeout: 60
username: !secret influx_stats_user
password: !secret influx_stats_pass
params:
db: home_assistant
q: 'SHOW SERIES EXACT CARDINALITY ON "home_assistant"'
sensor:
- name: "InfluxDB Top Cardinality"
unique_id: influxdb_top_cardinality
unit_of_measurement: series
state_class: measurement
value_template: >-
{% set ns = namespace(top=0) %}
{% for s in value_json.results[0].series | default([]) %}
{% set c = s['values'][0][0] | int(0) %}
{% if c > ns.top %}{% set ns.top = c %}{% endif %}
{% endfor %}
{{ ns.top }}
json_attributes_path: "$.results[0]"
json_attributes:
- series
State is the largest single-measurement cardinality. The series attribute holds the full raw response (alphabetical, as InfluxDB returns it) for template sensors to sort.
Top writer view (cross-measurement entity activity)
InfluxDB does not expose per-series disk size. The closest cheap proxy for "which entities are costing me the most" is recent point count, grouped by entity_id, across all measurements. The query is time-bounded so it scales with recent points, not with historical cardinality.
- resource: "http://<INFLUX_HOST>:8086/query"
method: GET
scan_interval: 3600
timeout: 30
username: !secret influx_stats_user
password: !secret influx_stats_pass
params:
db: home_assistant
q: 'SELECT count("value") FROM /.*/ WHERE time > now() - 1h GROUP BY "entity_id"'
sensor:
- name: "InfluxDB Top Writer Points 1h"
unique_id: influxdb_top_writer_points_1h
unit_of_measurement: points
state_class: measurement
value_template: >-
{% set series = value_json.results[0].series | default([]) %}
{% set ns = namespace(top=0) %}
{% for s in series %}
{% set c = s['values'][0][1] | int(0) %}
{% if c > ns.top %}{% set ns.top = c %}{% endif %}
{% endfor %}
{{ ns.top }}
json_attributes_path: "$.results[0]"
json_attributes:
- series
count("value") catches every point the HA integration writes, regardless of state type. The 1-hour window keeps the scan light. Stretch to 6h or 24h for trend data once your index is clean.
Template sensors and outlier alert
These derive the human-friendly views: top measurement name, sorted top-N lists, and a breakdown of every measurement ranked by series count.
template:
- sensor:
- name: "InfluxDB Top Measurement"
unique_id: influxdb_top_measurement
state: >-
{% set series = state_attr('sensor.influxdb_top_cardinality', 'series') or [] %}
{% set ns = namespace(name='unknown', count=0) %}
{% for s in series %}
{% set c = s['values'][0][0] | int(0) %}
{% if c > ns.count %}
{% set ns.count = c %}
{% set ns.name = s['name'] %}
{% endif %}
{% endfor %}
{{ ns.name }}
attributes:
top_count: >-
{% set series = state_attr('sensor.influxdb_top_cardinality', 'series') or [] %}
{% set ns = namespace(count=0) %}
{% for s in series %}
{% set c = s['values'][0][0] | int(0) %}
{% if c > ns.count %}{% set ns.count = c %}{% endif %}
{% endfor %}
{{ ns.count }}
top10: >-
{% set series = state_attr('sensor.influxdb_top_cardinality', 'series') or [] %}
{% set ns = namespace(pairs=[]) %}
{% for s in series %}
{% set ns.pairs = ns.pairs + [{'measurement': s['name'], 'series': s['values'][0][0] | int(0)}] %}
{% endfor %}
{{ (ns.pairs | sort(attribute='series', reverse=True))[:10] | list }}
- name: "InfluxDB Cardinality Breakdown"
unique_id: influxdb_cardinality_breakdown
state: >-
{% set series = state_attr('sensor.influxdb_top_cardinality', 'series') or [] %}
{{ series | length }}
unit_of_measurement: measurements
state_class: measurement
attributes:
breakdown: >-
{% set series = state_attr('sensor.influxdb_top_cardinality', 'series') or [] %}
{% set ns = namespace(pairs=[]) %}
{% for s in series %}
{% set ns.pairs = ns.pairs + [{'measurement': s['name'], 'series': s['values'][0][0] | int(0)}] %}
{% endfor %}
{{ ns.pairs | sort(attribute='series', reverse=True) | list }}
- name: "InfluxDB Top Writer"
unique_id: influxdb_top_writer
state: >-
{% set series = state_attr('sensor.influxdb_top_writer_points_1h', 'series') or [] %}
{% set ns = namespace(name='unknown', count=0) %}
{% for s in series %}
{% set c = s['values'][0][1] | int(0) %}
{% if c > ns.count %}
{% set ns.count = c %}
{% set ns.name = s['tags']['entity_id'] %}
{% endif %}
{% endfor %}
{{ ns.name }}
attributes:
measurement: >-
{% set series = state_attr('sensor.influxdb_top_writer_points_1h', 'series') or [] %}
{% set ns = namespace(meas='', count=0) %}
{% for s in series %}
{% set c = s['values'][0][1] | int(0) %}
{% if c > ns.count %}{% set ns.count = c %}{% set ns.meas = s['name'] %}{% endif %}
{% endfor %}
{{ ns.meas }}
top20: >-
{% set series = state_attr('sensor.influxdb_top_writer_points_1h', 'series') or [] %}
{% set ns = namespace(pairs=[]) %}
{% for s in series %}
{% set ns.pairs = ns.pairs + [{
'entity_id': s['tags']['entity_id'],
'measurement': s['name'],
'points': s['values'][0][1] | int(0)
}] %}
{% endfor %}
{{ (ns.pairs | sort(attribute='points', reverse=True))[:20] | list }}
- binary_sensor:
- name: "InfluxDB Cardinality Outlier"
unique_id: influxdb_cardinality_outlier
device_class: problem
state: "{{ states('sensor.influxdb_top_cardinality') | int(0) > 5000 }}"
attributes:
worst_measurement: "{{ states('sensor.influxdb_top_measurement') }}"
The 5000 threshold on the outlier alert is the lever to tune. Pick a number a comfortable margin above your current legitimate top measurement and revisit as the system grows.
Final entity inventory
If everything is in place you should have eleven entities:
- Five size and health sensors:
influxdb_disk_size,influxdb_series_count,influxdb_measurements_count,influxdb_tsm_disk,influxdb_tsm_files. - Three cardinality sensors:
influxdb_top_cardinality(raw input),influxdb_top_measurement(sorted top withtop10),influxdb_cardinality_breakdown(full sorted list). - Two writer sensors:
influxdb_top_writer_points_1h(raw input),influxdb_top_writer(top entity withtop20). - One outlier alert:
binary_sensor.influxdb_cardinality_outlier.
Pitfalls observed along the way
A few things that catch most people building this:
- The default REST sensor timeout is 10 seconds. The
SHOW SERIES EXACT CARDINALITYquery against a database with a cardinality bomb can run well past that; raisetimeoutaccordingly, or split that query onto its own slower-polling REST entry as shown above. homeassistant.update_entitytakes its target underdata:, nottarget:. Forcing a refresh while debugging needs the right key.- The
influxdbplatform sensor is not interchangeable withrest:for these queries. It has noGROUP BY, returns a single scalar, and cannot expose the JSON response to other sensors. Use it for simple "one number from one measurement" cases; userest:for everything in this guide. - Excluding entities from the InfluxDB integration is independent of the recorder exclude. Touch only what you need to. The integration is write-only; the recorder feeds the UI.
- Bytes is a faithful unit but unreadable.
device_class: data_sizedoes not auto-rescale. Divide invalue_templateand set the unit to MB or GB explicitly.
Closing
Cardinality blowups in InfluxDB on Home Assistant almost always come from one or two integrations that mint unique entity_ids per event. The right fix is to exclude them at the integration layer, drop their series from the index, and put cheap monitoring in place to catch the next one before the warning fires. None of that needs a shell script, a watcher container, or anything external to HA. The setup above keeps secrets in secrets.yaml, configuration in version-controllable YAML, and everything in scope of regular HA backups.