ha-archive-search: search and Markdown export for archived Home Assistant versions

Hi everyone,

A few days ago I posted ha-state-archive, an infrastructure-side archival and audit pipeline for Home Assistant snapshots.

While building and using that archive corpus, I ended up developing a second companion project focused on historical exploration and search:

ha-archive-search

Repository:


What it does

ha-archive-search is a multi-platform search engine operating on archived Home Assistant versions stored outside HA itself.

The project currently provides:

  • bounded recursive filesystem search;
  • version-aware traversal (--latest, --version, --all-versions);
  • compact or context search modes;
  • documentation filtering;
  • Markdown export;
  • lightweight Flask web interface;
  • Docker deployment;
  • LAN/VPN browser access from desktop or mobile devices.

The search corpus is the archive structure produced by ha-state-archive, but the project itself is filesystem-oriented and does not depend on Home Assistant internals at runtime.
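To make "bounded recursive filesystem search" concrete, here is a minimal sketch of the idea. It is not the project's actual code: the function name `search_versions`, the `Hit` record, the `max_depth` parameter and the list of file suffixes are all assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Iterator, NamedTuple, Tuple


class Hit(NamedTuple):
    version: str   # snapshot directory name
    path: str      # file path relative to the snapshot directory
    line_no: int   # 1-based line number
    text: str      # matching line, stripped


def search_versions(
    root: Path,
    needle: str,
    max_depth: int = 4,  # hypothetical bound, not a documented default
    suffixes: Tuple[str, ...] = (".yaml", ".yml", ".json"),
) -> Iterator[Hit]:
    """Yield lines containing `needle`, descending at most `max_depth` levels per snapshot."""
    for version_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for dirpath, dirnames, filenames in os.walk(version_dir):
            dirnames.sort()  # deterministic traversal order
            depth = len(Path(dirpath).relative_to(version_dir).parts)
            if depth >= max_depth:
                dirnames[:] = []  # prune in place: os.walk will not descend further
            for name in sorted(filenames):
                if not name.endswith(suffixes):
                    continue
                file_path = Path(dirpath) / name
                try:
                    content = file_path.read_text(encoding="utf-8", errors="replace")
                except OSError:
                    continue  # unreadable file: skip rather than abort the search
                for line_no, line in enumerate(content.splitlines(), start=1):
                    if needle in line:
                        rel = file_path.relative_to(version_dir).as_posix()
                        yield Hit(version_dir.name, rel, line_no, line.strip())
```

Nothing here touches Home Assistant: the function only reads directories and text files, which is why a read-only mount is enough.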


Why I built it

Once you accumulate months or years of Home Assistant snapshots, searching historical configurations manually becomes painful.

I wanted something able to answer questions like:

  • “When did this entity first appear?”
  • “Which version introduced this automation?”
  • “What changed between these periods?”
  • “Where was this helper referenced historically?”
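A question like "When did this entity first appear?" reduces to scanning snapshots in chronological order and stopping at the first match. The sketch below is only an illustration under two assumptions: snapshot directory names begin with a sortable timestamp, so lexicographic order is chronological, and a plain substring match is good enough. The function name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def first_version_containing(root: Path, needle: str) -> Optional[str]:
    """Return the name of the oldest snapshot whose YAML files mention `needle`, or None."""
    # Assumption: names start with YYYY-MM-DD_HH-MM, so sorted() yields oldest first.
    for version_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for yaml_file in sorted(version_dir.rglob("*.yaml")):
            try:
                text = yaml_file.read_text(encoding="utf-8", errors="replace")
            except OSError:
                continue  # skip unreadable files
            if needle in text:
                return version_dir.name
    return None
```

Reversing the sort order gives the last snapshot that still mentions the entity, which helps when tracking down a removal.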

The Markdown export is especially useful for:

  • incident investigation;
  • historical analysis;
  • sharing findings;
  • external tooling and LLM workflows.

Philosophy

The project follows the same philosophy as ha-state-archive:

  • Home Assistant → real-time automation and operational decisions
  • External infrastructure → archival, audit, search and historical analysis

The goal is not to replace Home Assistant functionality, but to complement it with long-term infrastructure tooling.

Feedback welcome — especially from people running large or long-lived Home Assistant installations.

A quick update on ha-archive-search.

Since the initial release, the project has evolved significantly (v0.3.x to v0.4.0), both technically and conceptually.

One important clarification emerged through real-world usage:

I originally presented the project mostly as a historical exploration / archaeology tool for Home Assistant backups.

That description is technically correct — but ultimately too narrow.

In practice, I mainly use ha-archive-search as a portable infrastructure-side search engine for Home Assistant configurations.

The archive/snapshot layer is mostly the storage source.

The actual day-to-day use cases are things like:

  • finding where an entity is referenced;
  • auditing automations, templates and dashboards;
  • exporting structured Markdown search results;
  • searching safely outside the HA runtime;
  • feeding audit and AI-assisted workflows;
  • performing large-scale configuration inspection across external backups.

The project intentionally remains:

  • filesystem-based;
  • dependency-light;
  • non-indexed;
  • infrastructure-oriented;
  • non-realtime.

Recent releases also introduced:

  • structured Markdown export;
  • typed internal rendering models;
  • renderer golden tests;
  • compact protocol formalization (compact v1);
  • compatibility validation and drift detection;
  • Docker health monitoring;
  • doctrine validation tooling.

The goal is increasingly to provide a stable, portable and machine-consumable search layer for Home Assistant configuration corpora.

Current examples (screenshots):

  • compact search UI;
  • structured Markdown export (compact mode);
  • structured Markdown export (context mode).

The context export workflow is becoming especially useful for:

  • audit reviews;
  • documentation workflows;
  • configuration inspection;
  • AI-assisted analysis;
  • portable technical reports outside Home Assistant runtime.
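For readers wondering what a context-mode export involves, the following sketch collects a few lines around each match into a typed record and renders the records as Markdown. The `ContextHit` fields, the `radius` parameter and the output layout are assumptions for illustration and do not describe the project's real export format.

```python
from dataclasses import dataclass
from typing import List, Sequence

FENCE = "`" * 3  # Markdown code fence, built here so this snippet can itself be quoted


@dataclass(frozen=True)
class ContextHit:
    version: str       # snapshot directory name
    path: str          # file path inside the snapshot
    line_no: int       # 1-based line number of the match
    before: List[str]  # up to `radius` lines preceding the match
    match: str         # the matching line
    after: List[str]   # up to `radius` lines following the match


def extract_context(version: str, path: str, lines: Sequence[str],
                    needle: str, radius: int = 2) -> List[ContextHit]:
    """Collect every line containing `needle` together with its surrounding lines."""
    hits = []
    for i, line in enumerate(lines):
        if needle in line:
            before = list(lines[max(0, i - radius):i])
            after = list(lines[i + 1:i + 1 + radius])
            hits.append(ContextHit(version, path, i + 1, before, line, after))
    return hits


def render_markdown(query: str, hits: Sequence[ContextHit]) -> str:
    """Render hits as one Markdown section per match, with the context in a YAML fence."""
    out = [f"# Search results for `{query}`", ""]
    for hit in hits:
        out += [f"## {hit.version}", "", f"`{hit.path}`, line {hit.line_no}", ""]
        out += [FENCE + "yaml", *hit.before, hit.match, *hit.after, FENCE, ""]
    return "\n".join(out)
```

Keeping extraction and rendering separate is what makes golden tests practical: the renderer can be checked against a stored expected Markdown file without touching the filesystem.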

Thanks for the update. It's not clear to me how this would be deployed and put into service. Can you elaborate?
What are the requirements to use this system? I see it's a separate docker container, but I'm not sure how it expects to get access to data, and what generates the data it consumes.

Thanks, very good question — and I probably need to make this clearer in the README.

ha-archive-search is not a Home Assistant add-on and does not collect data from HA directly.

It is a separate Docker container that searches an existing filesystem tree containing Home Assistant configuration snapshots/backups.

The typical model is:

Home Assistant
  ↓ regular backups / snapshots / copied config versions
NAS or server filesystem
  ↓ mounted read-only into the container
ha-archive-search
  ↓
search UI + Markdown export

So the tool does not generate the data it consumes — it expects the data to already exist as directories on disk.

In my setup, Home Assistant backups are regularly stored on a NAS. The container mounts that archive directory read-only:

volumes:
  - /volume1/Backups_HA/ha_backup_timeline/versions:/versions:ro

Inside /versions, the tool expects one directory per archived snapshot:

/versions/
  2026-05-18_02-30_Automatic_backup_2026.5.2_e53ab870/
    configuration.yaml
    automations.yaml
    lovelace/
    scripts/
    ...
  2026-05-19_13-13_Automatic_backup_2026.5.2_2835eb69/
    configuration.yaml
    ...

No database, no indexer, no HA integration, no MQTT feed, no runtime connection to Home Assistant.
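Because the layout is just directories, a consumer can derive metadata from the names alone. The sketch below parses names shaped like the two examples above. The pattern is inferred from those two names only, and the meaning of the trailing eight hexadecimal characters is an assumption, so treat it as illustrative rather than as the tool's actual naming contract.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional

# Inferred shape: <YYYY-MM-DD>_<HH-MM>_<label>_<8 hex chars>
_NAME_RE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2}_\d{2}-\d{2})_(?P<label>.+)_(?P<suffix>[0-9a-f]{8})$"
)


class SnapshotName(NamedTuple):
    taken_at: datetime  # timestamp encoded at the start of the directory name
    label: str          # free-form middle part, e.g. "Automatic_backup_2026.5.2"
    suffix: str         # trailing 8 hex characters (assumed to identify the backup)


def parse_snapshot_name(name: str) -> Optional[SnapshotName]:
    """Parse a snapshot directory name; return None if it does not fit the inferred shape."""
    m = _NAME_RE.match(name)
    if m is None:
        return None
    try:
        taken_at = datetime.strptime(m["stamp"], "%Y-%m-%d_%H-%M")
    except ValueError:
        return None  # digits matched, but they do not form a valid date or time
    return SnapshotName(taken_at, m["label"], m["suffix"])
```

Returning `None` for names that do not fit lets a caller skip stray directories instead of failing on them.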

Requirements:

  • Docker
  • A directory containing extracted Home Assistant configuration snapshots
  • That directory mounted into the container (read-only recommended)
  • LAN/VPN access to the web UI

The system is intentionally infrastructure-side and offline from HA runtime — closer to a portable enhanced grep/search interface for HA configuration corpora than to a live HA integration.

So, to be clear, it doesn't support the native backup format used by Home Assistant? I believe these are some type of encrypted tar file.

Correct — and that is actually intentional.

ha-archive-search does not parse the native HA backup format directly. It operates downstream of a dedicated extraction and archival layer.

That extraction layer is ha-state-archive.

The full pipeline looks roughly like this:

[ Home Assistant ]
        │
        │  native encrypted backup (.tar)
        ▼
  Ingestion / archival layer
  (ha-state-archive)
  - decryption
  - extraction
  - extracted snapshot materialization
  - stabilization
        │
        │  immutable extracted versions
        ▼
  versions/
  ├── 2026-05-18_02-30_Automatic_backup_.../
  │     configuration.yaml
  │     automations.yaml
  │     ...
  └── 2026-05-19_13-13_Automatic_backup_.../
        ...
        │
        ▼
  [ ha-archive-search ]
  - corpus search
  - Markdown export
  - web UI
  - audit-oriented inspection

The separation is deliberate.

The ingestion layer is the only component that ever handles the encryption password. Everything downstream — audit, diff, retention, search and export — operates on immutable extracted trees with no access to secrets.

ha-state-archive handles:

  • encrypted .tar ingestion
  • decryption and extraction
  • creation of immutable extracted snapshots
  • stabilization checks before downstream processing
  • retention and quarantine workflows
  • infrastructure-side audit/diff processing
  • MQTT supervision back into Home Assistant

It does not rewrite or normalize Home Assistant YAML content; it only prepares stable extracted snapshot directories for downstream tools.

ha-archive-search then mounts the resulting versions/ directory read-only and provides structured search and export on top of it.

Both projects are designed to run infrastructure-side, typically on a NAS (I use Synology DSM), independently from the Home Assistant runtime itself.

So today:

  • direct native backup ingestion in ha-archive-search: no
  • searching extracted/versioned HA corpora: yes

The separation keeps the search layer intentionally lightweight, portable and dependency-light.