Data detective

Today @WillKoehrsen and I have released a simple package for parsing data from the home-assistant database. Right now it does basic queries and plotting, but also allows prediction using Facebook's Prophet library. This is a WIP but any feedback gratefully received!
Cheers
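
For anyone curious what the Prophet part looks like in practice, here's a minimal sketch of the kind of forecast it enables, assuming you've already pulled one sensor's history into a pandas DataFrame (the data and column names below are made up):

```python
import numpy as np
import pandas as pd
from fbprophet import Prophet  # pip install fbprophet

# Made-up stand-in for one sensor's history - in practice you'd pull this
# from the home-assistant database first.
readings = pd.DataFrame({
    "last_changed": pd.date_range("2018-01-01", periods=500, freq="H"),
    "state": 20 + 0.1 * np.random.randn(500).cumsum(),
})

# Prophet expects the columns to be named 'ds' (timestamp) and 'y' (value).
df = readings.rename(columns={"last_changed": "ds", "state": "y"})

model = Prophet()
model.fit(df)

# Forecast 48 hours ahead and plot the trend plus prediction interval.
future = model.make_future_dataframe(periods=48, freq="H")
forecast = model.predict(future)
model.plot(forecast)
```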

8 Likes

Released v0.2, which is a significant refactor for ease of use and maintainability.
Cheers

v0.3 adds the ability to plot binary sensors. Cheers to @WillKoehrsen for his hard work on this one.

OK, detective is now on PyPI: https://pypi.org/project/HASS-data-detective/
Just pip install HASS-data-detective, then see the example notebook here
Cheers

Nice project.
Too bad that the DB from HA would grow too big to be useful if you let it run for years; that's why I created my own sensor registration.

If DB size is an issue, there are plenty of solutions. Cloud storage is one. I've a Synology NAS, so no issues with storage. Once @frenck gets the Jupyter Hass.io add-on out, this library will be of more interest, even for plotting and exploring data captured relatively recently.

Storage isn't the problem :wink:
A very large DB for HA does slow it down at some point.
And a Hass.io add-on can never make anything more interesting for me.

But your ideas do :wink:

Can you advise at what size threshold a slowdown is observed?

It also depends on the device, I guess.
I stopped using a DB back when we couldn't yet configure what goes into it, and when I was still on an RPi.
And I must say I didn't try other DB engines, which would also be a big influence.

At that time it was really noticeable after a few days of data recording (which was sometimes terribly high, up to 20 MB).
Later on I noticed that some of my sensors started to race at certain moments, causing the big DB growth.
But I only noticed that once I started splitting out the data.

In general you can say that writing to a DB isn't going to be a problem; that mostly takes the same amount of time.
But querying takes longer with larger amounts of records.
HA puts everything in one table.

Then just some simple calculations:
one temp sensor that reads the temperature only once every 5 minutes will have about 100,000 records in a year.
And that's just one sensor that doesn't update very often.

In bigger places with lots of sensors, motion detectors, devices, etc. you can easily get up to a hundred million records a year. And all in one table!
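
As a quick sanity check on those numbers (the entity counts are just assumptions for illustration):

```python
# Back-of-the-envelope check of the record counts above.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def records_per_year(update_interval_seconds):
    """Rows one entity writes per year at a fixed update interval."""
    return SECONDS_PER_YEAR // update_interval_seconds

print(records_per_year(5 * 60))    # one temp sensor every 5 min -> ~105,000 rows/year

# A busier house: say 200 entities averaging one update a minute
# (the entity count is just an assumption for illustration).
print(200 * records_per_year(60))  # ~105 million rows/year, all in one table
```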

Now let's say I have another sensor based on the last time some sensor changed, or you simply want to create an automation that turns off the lights 5 minutes after motion is detected.
You can only do that by querying the DB, unless you take the data out and save it somewhere else.

Querying millions of records on a big server won't be a big problem, but you don't want to do that on a DB saved on an SD card in an RPi.
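
To make that concrete, this is roughly the kind of lookup you end up running against the recorder's single states table. The entity name and DB path are assumptions, and the exact schema depends on your Home Assistant version:

```python
from sqlalchemy import create_engine, text

# The DB path is an assumption - point it at your own recorder database
# (a MySQL/PostgreSQL URL works just as well).
engine = create_engine(
    "sqlite:////home/homeassistant/.homeassistant/home-assistant_v2.db"
)

# When did this (hypothetical) motion sensor last change state?
query = text("""
    SELECT state, last_changed
    FROM states
    WHERE entity_id = :entity
    ORDER BY last_changed DESC
    LIMIT 1
""")

with engine.connect() as conn:
    row = conn.execute(query, {"entity": "binary_sensor.kitchen_motion"}).fetchone()
    print(row)
```

On a table with tens of millions of rows, that sort of query is exactly where the slowdown shows up, especially on SD-card storage.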

I think you should consider another way to save data over time.

The way I would choose for data collection in that case is something like this:

  1. At a certain time (preferably during the night), move the data that HA collected in the last 24 hours to a separate DB.
  2. Split that data up into tables per entity.
  3. Use the data detective to inspect that DB, not the HA DB.

The big advantage is that you can move that data to a better place than the RPi SD card, but most of all the HA DB doesn't need to grow that much and won't slow automations down over time.
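
A rough sketch of that kind of nightly export, assuming SQLite on both ends and pandas doing the copying (paths and table naming are just placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Paths are placeholders - the archive could live on a NAS, another machine, etc.
ha_db = create_engine(
    "sqlite:////home/homeassistant/.homeassistant/home-assistant_v2.db"
)
archive_db = create_engine("sqlite:////mnt/nas/ha_archive.db")

# 1. Pull the last 24 hours of states out of the recorder's single table
#    (SQLite date syntax; adjust for MySQL/PostgreSQL).
df = pd.read_sql(
    "SELECT entity_id, state, last_changed "
    "FROM states "
    "WHERE last_changed >= datetime('now', '-1 day')",
    ha_db,
)

# 2. Write one table per entity in the archive database.
for entity_id, group in df.groupby("entity_id"):
    table = entity_id.replace(".", "_")  # sensor.kitchen_temp -> sensor_kitchen_temp
    group.to_sql(table, archive_db, if_exists="append", index=False)

# 3. Point the data detective (or any notebook) at ha_archive.db instead
#    of the live recorder database.
```

Run something like that from cron each night and the live HA database can stay small.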

I hope this makes sense to you and that it might inspire you to get to really good data collection, because that isn't really there in HA until now (or at least I haven't seen it). I really want to have it myself too, so I've been collecting for over 2 years now (without a DB, so I can't inspect my data the way I want to until I create something like I described, or someone else does :wink: )

In the recorder component you can specify which entities are recorded to the db using the include config variable, so that’s one option.

I know that that's an option nowadays.
Still, that results in millions of records worth saving every year.

But you're right, you can filter out tons of useless stuff with that.

You can easily start calculating.
Count the number of devices and sensors you find worthwhile to look back at
(I think even in a small setup you will easily get to 40),
then see how many records you get in one year:
40 * 100,000 = 4,000,000
That's about the minimum you will get.
But if you're interested in reviewing data, one year is just the start.
So after 5 years you have 20 million records with the best optimised, smallest setup.

In a big environment you can easily make that 10 or 20 times as much.
How much time do you think a DB needs to search through 200 million records?

Yes, you would need to get smart about the sampling rate. Applying Nyquist's theorem, we want to sample at double the rate of the signal of interest. So if, for example, I am sampling a tide reading with a periodicity of 12 hours, I would only need to sample every 6 hours. This is just a toy example but it illustrates the point - the number of samples you want to keep is related to what you are studying. And if I were interested in readings that change quite quickly, I wouldn't need a year's worth of data to do useful analysis. In practice I would use the history_stats sensor or similar to sample at the interval I'm interested in, then record its value to the db.
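
To make the downsampling concrete, here's a small pandas sketch on made-up tide readings; the same idea applies to thinning any already-recorded series down to the interval of interest:

```python
import numpy as np
import pandas as pd

# Made-up tide readings logged every 5 minutes for a month.
index = pd.date_range("2018-06-01", periods=30 * 24 * 12, freq="5min")
tide = pd.Series(np.sin(np.arange(len(index)) * 2 * np.pi / 144), index=index)

# The signal of interest has a ~12 hour period, so samples every 6 hours
# are enough to capture it - a 72x reduction in stored rows.
downsampled = tide.resample("6H").mean()
print(len(tide), "->", len(downsampled))
```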

That is nice, if you know now what you will be interested in in 2 years.
But for example, I am now interested in checking out movement in the kitchen, to predict movement and make automations based on that prediction.

2 years ago I just saw movement as a sensor to use to switch something on or off.
If I'd had the ability to create such a history stat 2 years ago, I would also have needed to know what I want to do with it now, to make sure that my data wouldn't grow too much.

But however you look at it, it doesn't make sense to keep statistical data inside a DB that is unoptimised and used to automate stuff.
Certainly not in an HA environment where things are evolving quickly and where data collection is far from optimal.

I completely agree it's impossible to know ahead of time what data you might need later, and at what interval. I expect that for most analytics to establish baseline behaviour, a few weeks to a few months of data at most is required. That will be a useful benchmark to establish.

I also agree that the HA production database isn't the way to go if you are serious about storing and analysing lots of data. A better approach would be to periodically copy data to a more long-term storage solution.
Cheers

1 Like

Is there a more up-to-date install doc for this? I've got Jupyter installed and it loads and lets me pull things up, but nothing I try works. There are syntax errors in the scripts and other problems. Some install instructions make it seem like it needs to be in a virtual environment of its own, but then some of the scripts seem to want it to be in the same virtual environment as Home Assistant. I may not be the sharpest tool in the shed, but I'm not completely lost around systems and Linux either. Any help would be appreciated.

@turboc sorry, I just saw this. Did you make any progress? My advice is to use the Hass.io add-on.

So I did a quick check to see how well my house insulation is doing this winter, and the data detective made it easy to verify that my bedroom and outdoor temperatures are not correlated, indicating that my home's insulation works well :fire:
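
For anyone who wants to run the same check, it boils down to a correlation between the two temperature series once they're in a DataFrame. A sketch with made-up data and hypothetical column names:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings for a week - swap in your own sensor data
# pulled from the recorder database.
index = pd.date_range("2018-12-01", periods=7 * 24, freq="H")
temps = pd.DataFrame({
    "bedroom_temp": 20 + 0.5 * np.random.randn(len(index)),
    "outdoor_temp": 5
    + 5 * np.sin(np.arange(len(index)) * 2 * np.pi / 24)
    + np.random.randn(len(index)),
}, index=index)

# A Pearson correlation near zero suggests the bedroom temperature
# isn't simply tracking the outdoor one.
print(temps.corr())
```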

Or that you've got the heating running :wink:

3 Likes

I am just getting into this, but have a slightly different use case. The provided notebooks work fine on the SQLite DB. However, I only store 3 days of data in there. I am also storing a lot of data in InfluxDB. Is there a way to use the data detective with InfluxDB? I tried replacing the value of db in the Getting Started notebook with an InfluxDB client, but kept running into methods not being available, etc.

Is it at all possible, @robmarkcole?

I don't think so; InfluxDB uses its own query language and is incompatible with SQLAlchemy. You could migrate data InfluxDB >> TimescaleDB and query that.
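
If anyone wants to try that route, here's a rough sketch of the one-off copy involved (host, database names, the measurement name, and credentials are all assumptions):

```python
import pandas as pd
from influxdb import InfluxDBClient   # pip install influxdb
from sqlalchemy import create_engine  # TimescaleDB speaks plain PostgreSQL (needs psycopg2)

influx = InfluxDBClient(host="localhost", port=8086, database="home_assistant")
pg = create_engine("postgresql://user:password@localhost:5432/ha_archive")

# Pull one measurement out of InfluxDB (the measurement name is a placeholder)...
points = influx.query('SELECT * FROM "temperature"').get_points()
df = pd.DataFrame(list(points))

# ...and append it to an ordinary SQL table that SQLAlchemy can reach.
df.to_sql("temperature_readings", pg, if_exists="append", index=False)
```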