In-memory DB with cloud backup for analytics

I use an RPi 4 with an SD card, so I’ve been worried about wearing it out. But I also wanted to keep a lot of data for analysis/data-science projects.

I put together a custom component:

This lets me copy my recorder data to Google BigQuery, which has a decent free tier and is cheap even past that. With that in place, I switched the recorder over to an in-memory SQLite database. Now nothing hits the SD card, but I still get to keep the data in a good analytics platform.
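Switching the recorder to an in-memory SQLite database is just the recorder’s normal db_url option, roughly like this:

recorder:
  # keep all recorder data in RAM; it is lost on restart, which is why it gets copied to BigQuery
  db_url: "sqlite:///:memory:"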

The “bigquery_backup” component just looks at the last record it copied to BQ, grabs any records from the recorder that are newer than that, and pushes them to BQ. The backup is triggered by a service call, so it’s easy to run from an automation. That way I can run the backup every 15 minutes (or have it trigger on any Home Assistant event) and it just copies in whatever changed recently. I’m hopeful this means I rarely, if ever, lose any of the data from the in-memory database.
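An automation along these lines handles the 15-minute schedule (a sketch only: the service name bigquery_backup.backup is assumed here, so check which service your copy of the component actually registers):

automation:
  - alias: "Copy new recorder rows to BigQuery"
    trigger:
      # fire every 15 minutes
      - platform: time_pattern
        minutes: "/15"
    action:
      # service name assumed for illustration
      - service: bigquery_backup.backup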

I’ve just been getting started on actually using the data:


@mryckman, this sounds like a great solution for putting the HA database in memory while still keeping the data persistent. For free.

After three months of use, can you comment on how well this works for you, and offer any advice for new users so we can set it up correctly from scratch? To get an idea of how well this fits within the BQ free tier, could you share e.g. how fast your BQ dataset grows (relative to the number of sensors / amount of data you upload), and any other limits we need to be careful not to exceed?

Could you also provide an example bigquery_backup YAML configuration? E.g. typical values for dataset, state_table, event_table, etc. And what does the service call you use to trigger the backup look like?

Thanks…

Great question, and good timing. Just a day or two ago, I updated HA, and when the component loaded again it hit an error trying to install the google-cloud-bigquery Python package. I’m not sure why yet; I haven’t really had a chance to dig into it.

In any case, as for the rest of it, it’s been great honestly. Here are some stats:

House/Situation

  • Three months of data (November 15 to February 15)
  • ~550 entities (about 250 of them have fewer than 100 state changes over the three months, and about 100 have only 1 or 2; some have tens of thousands)
  • I make no effort to reduce the data sent to BQ; the component just looks at the last record it uploaded and pulls every record created after that directly from the recorder
  • An automation calls the service to upload changes every 15 minutes

Data Generated

  • State Table:
    • 1.2 million records
    • 622 MB
  • Events Table:
    • 1.325 million records
    • 130 MB

Summary

  • Averaging ~1,850 state records per entity, per month. Some entities have a lot, some have very few, but that’s the average
  • Averaging ~2,040 event records per entity, per month
  • BQ’s free tier currently gives 10 GB of free storage and 1 TB of query processing per month. Given that I’m currently just under 1 GB of storage and haven’t done a ton of querying (it would take a lot to hit 1 TB anyway), I don’t think this has actually cost me anything yet, and if it does, the cost should ramp up very slowly - maybe I’ll get a bill for 1 or 2 cents.

I’ve actually been very happy with the setup. Obviously this most recent error is annoying and I’ve got to track it down, but that’s just a matter of getting HA to install the package, so I don’t think of it as really related to the approach/component itself. I set up a few visuals in Data Studio to view the data, but I haven’t used them very much.

It is true that I haven’t used it much. If that holds, maybe I’d eventually cut it off, but I doubt it. In BQ there are techniques you can use so that your queries don’t scan as much data (mainly partitioning), and at some point, paying the storage cost on ~2.5 GB a year just isn’t that onerous. But we’ll see, I suppose.

My YAML section looks like this:

bigquery_backup:
  key_file: bq_service_account.json
  project_id: (redacted)
  dataset: HOMEASSISTANT
  state_table: states
  event_table: events

In GCP, you create a project right away, and the project ID is globally unique to you. From there, in BQ, you add a dataset and can name it anything you want; I kept it simple and called mine HOMEASSISTANT. The key file is just whatever you name the JSON key for a service account, which you download from GCP; you can find tutorials online for that piece. Once you have the key file downloaded, you just need to put it in your HA config folder so the component can find it.
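For a one-off manual test before wiring up the automation, the same call can be made from Developer Tools → Services, roughly like this (again, the exact service name depends on what the component registers):

service: bigquery_backup.backup
data: {}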


I just wrote a post in Spanish about how to extract some meaning from all that data, going directly from SQLite into Jupyter notebooks:

https://www.pacienciadigital.com/analisis-de-consumo-de-energia-con-home-assistant/
