Data science with Home Assistant

Yeah, the alternative is to use some other way to label your data. That’s pretty inconvenient for a lot of sensors, though, and I’m afraid I’ll forget to do it, so the labels won’t be perfect.

I just realised I can’t simply parse the HA config to get the observations, and that I have no way of linking the observations in the configuration to the observations in the attributes of the sensor, so this will actually need some groundwork first (probably by simply exposing the id # in the attributes).

This is awesome. A ‘best practices’ guide for medium- and long-term data storage would be a vital companion to these efforts and would open the doors to full usage of the work that’s being done here.

I’d really appreciate any advice, which I think would also serve other new users like me, on how to set up a long-term data storage solution… there are a lot of scattered opinions in the forums here, but I don’t think there’s a general recommendation with a guided setup (hardware, off-loading the DB onto a NAS, HA, choice of DB format/type, etc.).

(my own particular trial-and-error-based experience so far below)

Thanks for putting the work and time into this!


E.g., a new user’s experience that I’d like to avoid repeating: 3 months / 1.7 GB of SQLite data ‘loss’ and possible DB ‘corruption’.

To my dismay last night, while I was trying to figure out why the sensors section of my configuration.yaml file wasn’t working with the Mosquitto broker add-on that was communicating with an ESP32 running MicroPython, my setup either froze or crashed.

On forced reboot, my home-assistant_v2.db was protesting with ‘DB image malformed’ errors in home-assistant.log (fixes used by others in the HA forum wouldn’t work because sqlite3 reported DB integrity errors about non-unique event IDs around the chronological middle of the database’s entries).

So I had to delete the home-assistant_v2.db database (after making a separate copy) to make Hass.io function again. I can analyze the copy separately, but obviously I lose the ability to use the data within Hass.io itself, along with some of the capabilities provided by the work being done here.

Avoiding this scenario particularly applies to ML-based efforts that I’d like to work on in the future:

  • Predicting user-desired automations: I’d like to try to create a rudimentary recommender system and anomaly detection system
    (like ‘hey dude, it’s late for you even for a Friday night, and I think you’re likely to fall asleep soon, since there’s less and less motion detected and given your general sleep habits… did you forget that you left appliance X on like a dummy? Can I turn that off for you now?’)

I thought I had things sufficiently covered power- and storage-wise (RPi4 & 1 TB HDD Pi Drive; the SD card is used for boot only, ‘old school’ style; both separately powered with sufficiently high amperage; and Hass installed in Docker)… but perhaps not?

Automation is irresistibly intriguing. Even more intriguing is the idea of privately collecting one’s own ‘organic’ data, which makes for far more interesting datasets for my own data science education than the standard generated/simulated textbook ones.

Hi @carver
I’ve also experienced SQLite DB corruption when moving the files around; basically, I wouldn’t recommend this format if you intend to do data science. I myself am running Home Assistant and Postgres in Docker on a Synology, and this has been rock solid. If you have a Pi 4 you could always install Hass.io with the MariaDB and JupyterLab add-ons. You can always put a large SD card in there if you think you are going to be generating serious volumes of data. Alternatively, just run Raspbian on the Pi and set up your stack using Docker.
Let me know what works out
Cheers

Hi @robmarkcole ,

Thanks for the response here!

Definitely good to know, as it saves me trial-and-error time :slight_smile:

Sorry, the post below is a bit long in asking for advice, but I hope to make a response easier by providing the relevant info up front. I’ve been feeling pretty enthusiastic about the idea of having a private life-data logger, data science, + home automation setup for about a year now, and I see a good possibility of getting something up and running in reasonable time by utilizing the work that’s been put into HA + your work here with the data science setup!

Thanks again, and I’ll definitely post my experience with the results here in case it helps out.

So I’m wondering if you might be so kind as to advise, considering the use case below, on whether I might be expecting a bit too much from my use of an RPi4?

(if I understand correctly it still has some inherent compromises that allow for its low price… e.g., its I/O lane shared between the Ethernet and USB ports comes to mind)

General aim: use either HA or Hass.io for:

  • A long-term data logger for everything I can throw at it home-data-wise + maybe the grocery inventory add-on, etc.
  • A computer with the data science capabilities and future ML capabilities being worked on here (plus what I’d like to eventually take a stab at programming myself, ML-wise, for fun and for my own education)
  • And of course Home Automation

Current setup attempting to achieve this objective:

  • RPi4 & 1 TB USB HDD running Hass (‘alternative Debian install’ in Docker on Raspbian, booted from the SD card with the bulk of Raspbian moved to the USB HDD), which presently handles an assortment of devices I’ve bought on sale over the past ~6 months:
  • a Z-Wave stick
  • Z-Wave energy-usage-reporting Zooz devices (i.e., chatty devices)
  • Z-Wave sensors and outlets
  • the deCONZ integration, pointing to a separate RPi running deCONZ (to reduce the load a bit) with Zigbee bulbs, outlets, and sensors
  • a couple of misc. hardware devices HA provides integrations for
  • HA’s HomeKit bridge/connector for all of the above devices, plus presence detection to add person ID and home/away presence as a feature/input variable
  • a future MQTT broker for MicroPython/CircuitPython-based boards (once I figure out why my sensor section of the YAML caused the crash)
  • future Node-RED

Would you recommend that, instead of the RPi4 setup, I try one of the following for the main running instance (while perhaps using the RPi4 for development/testing):
* a NAS setup such as yours
* a NUC (used or new)
* a cheap small PC

I’ve been thinking about purchasing one or the other anyway for experimentation, as new and used prices have come down quite a bit for both, and I believe there’s a new Intel NUC out that will bring prices for the preceding models down as well.
Potential trade-off: I’ve also read that Docker has a bit of a ‘learning curve’ (a misnomer, haha, but I use it in the popular sense rather than implying a rapid learning rate) / required time investment, particularly with respect to networking.

Once again, sorry this is long, but I definitely appreciate any advice and will of course share the results!

Well, I would say that you can use an RPi for longer-term projects, but there is always the risk of SD card corruption. You might be lucky and not have any issues, but then again you might be unlucky. I think a good cost tradeoff is a NUC, although I have never used one myself. A NAS is really a storage solution and doesn’t have much compute (in my case, anyway). Re: Docker, it is definitely worth learning.
Note that you can always back up data to the cloud, see here

Thanks!
Hmm, interesting. It looks like I might do well to look into a NAS or NUC setup and learning Docker, then… The NAS does seem able to do quite a bit, though, since you’re apparently able to do some ML-related image processing with it as well.

The Google Cloud tutorial looks pretty cool. I think I’ll definitely give that a try for selected data like temperature and humidity and other data that I don’t consider exploitable marketing-wise… I’m really interested in the ‘Big 3’ platforms, and besides the fun factor, having some experience with them is an added benefit for job skills.

(I’ve been drawn to the local-only effort because, in the U.S. at least, anti-discrimination laws and the concept of an individual’s right to basic privacy are, in my opinion, taken as a big joke by prospective employers, and I wouldn’t at all be surprised if bulk purchasing of marketing profiles is a technique used to illegally screen out applicants whom algorithms have deemed likely to have, e.g., health problems that can be gleaned from sold ‘smart home’ device data showing how well people sleep, how often they have to get up at night, etc.)

Nice HA + MicroPython & CircuitPython tutorials (among others), btw :slight_smile:

On my NAS I am running HA & Postgres only, so I’m not really pushing it. I’ve done image processing mostly on an RPi or even in the cloud. Another good reason for keeping it local is no nasty surprise bills.
For the job market, learning AWS is probably the best ROI, although I also like GCP. Overall, learning Python has been my best decision job-wise; I am working for an IoT startup on the back of my HA/Python experience, so I’m very glad I put the time in!


@robmarkcole
So after reviewing some differences between MariaDB and PostgreSQL, I decided on PostgreSQL.
Unfortunately, there isn’t an add-on for it yet in Hass.io… so I’ve looked at the documentation for:
HA’s recorder component, the PostgreSQL Docker image, and Docker in general.

I’m still having some trouble understanding how I can install PostgreSQL alongside my Hass installed in Docker (alternative Linux installation) such that:

  • snapshots created in the HA front end will also back up the database, as they do now with the default home-assistant_v2.db (so whenever I need to restore a snapshot, the database will not contain future events, invalid entity IDs, etc.)
  • ideally it would use Unix sockets instead of TCP for efficiency (and, as a result, not needing to specify a password in the recorder’s DB URL is nice too)
  • it is handled by the Hass.io supervisor so that it’s available before Home Assistant starts up, along with anything else the supervisor usually does for other containers

Do you have any advice as to how to make this happen and make it work with the data science work you’ve implemented?

I’ve looked all over the forums, but I can only find bits and pieces…

So far the best I have is:
sudo docker run \
  --name <postgresContainerName> \
  -e POSTGRES_PASSWORD=<postgresRootUsersPassword> \
  -v <PathToVolumeOnHost>:/var/lib/postgresql/data \
  -p 5432:5432 \
  -d postgres
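Once that container is up, a quick sanity check that it’s reachable from Python might look like this (a sketch only; it assumes SQLAlchemy and the psycopg2 driver are installed, and the password placeholder is the same one passed to docker above):

from sqlalchemy import create_engine, text

# Sketch: confirm the Postgres container is reachable before pointing
# HA's recorder at it. The password placeholder matches the docker command above.
engine = create_engine(
    "postgresql://postgres:<postgresRootUsersPassword>@localhost:5432/postgres"
)
with engine.connect() as conn:
    print(conn.execute(text("SELECT version()")).scalar())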

To begin with, I’m not really sure what to put for the host path <PathToVolumeOnHost>.

Would appreciate any advice on this so I can get up and running with data science here. Thanks!

Hi @carver,
You are interested in more advanced topics than I have experience with, re: backups etc. Specifically regarding mounting existing Postgres data, I don’t do this either.
I will suggest checking out LTSS, as this has some advantages for data science, such as server-side aggregation operations.
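To illustrate the server-side aggregation idea, a query like the sketch below pushes hourly averaging into the database instead of doing it in pandas (hedged: it assumes LTSS’s default ltss table in TimescaleDB/Postgres, and the connection string and entity id are hypothetical placeholders):

import pandas as pd
from sqlalchemy import create_engine

# Sketch: hourly means computed server-side by TimescaleDB via LTSS's table.
# Connection string and entity id are hypothetical placeholders.
engine = create_engine("postgresql://user:password@nas.local:5432/homeassistant")

query = """
SELECT time_bucket('1 hour', time) AS hour,
       avg(state::float) AS mean_temp
FROM ltss
WHERE entity_id = 'sensor.bedroom_temperature'
  AND state NOT IN ('unknown', 'unavailable')
GROUP BY hour
ORDER BY hour
"""
print(pd.read_sql(query, engine).head())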

I just used the data detective to answer a question for my wife: is our bedroom the correct temperature for our incoming baby? Babies require temperatures between 16 and 20 degrees Celsius, with the guidance that it is better to be on the cooler side, as babies can easily overheat. Using the detective I was able to easily calculate our night-time temperature mean as 16.5 degrees Celsius, satisfying the temperature requirement.

And a histogram confirms temperatures are mostly in the desired range, only occasionally dipping below 16 degrees, so I might increase the heating set point ever so slightly.
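For anyone wanting to reproduce this kind of check, the calculation boils down to something like the sketch below (assumptions: a default SQLite recorder DB at the usual path and a hypothetical sensor.bedroom_temperature entity id; also note the recorder stores timestamps in UTC, so shift the night window if needed):

import pandas as pd
from sqlalchemy import create_engine

# Sketch: night-time temperature mean + histogram from the recorder DB.
# DB path and entity id are assumptions - adjust both for your setup.
engine = create_engine("sqlite:////config/home-assistant_v2.db")
query = """
SELECT state, last_changed FROM states
WHERE entity_id = 'sensor.bedroom_temperature'
  AND state NOT IN ('unknown', 'unavailable')
"""
df = pd.read_sql(query, engine, parse_dates=["last_changed"])
df["state"] = pd.to_numeric(df["state"], errors="coerce")
df = df.dropna().set_index("last_changed").sort_index()

night = df.between_time("22:00", "07:00")  # timestamps are UTC in the DB
print("Night-time mean: %.1f degC" % night["state"].mean())
night["state"].hist(bins=30)  # requires matplotlib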

Hopefully this will reassure my wife…!


You might be interested in this article.

It would be nice to see a “prediction timer” for when a desired temperature will be achieved in a room, based on the heat source in the room and the set temperature on the thermostat.


That is an interesting suggestion. I got an email update about my smart thermostat: it will now ‘detect’ open windows based on whether heating a room is taking longer than normal.


I am no programmer, but I hope someone who knows a bit of coding wants to take the generic_thermostat to the next level.

I use generic_thermostat in all the rooms to set the proper temperature, distributing the air to the adjacent rooms with fans.

Now I use automations and scenes to get the “smart” thermostat functionality, but this logic should be part of the thermostat.

The thermostat should be based on the outdoor temperature and the thermal resistance of the walls (and the U-value of the windows), plus the heat source in the room, to reach and maintain a desired set temperature.
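As a rough sketch of the physics being described (illustrative numbers only, not measurements from a real house):

# Steady-state heat balance for one room: Q = U * A * (T_in - T_out), in watts.
def heat_loss_w(u_value, area_m2, t_in, t_out):
    return u_value * area_m2 * (t_in - t_out)

t_in, t_out = 21.0, -5.0  # desired indoor / current outdoor temperature (degC)

# Hypothetical envelope: 12 m2 of wall (U=0.3) plus 2 m2 of window (U=1.2).
loss = heat_loss_w(0.3, 12, t_in, t_out) + heat_loss_w(1.2, 2, t_in, t_out)
print("Heater must supply about %.0f W to hold %.1f degC" % (loss, t_in))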

Now I set a scene based on the time of day and on the persons at home:

  • To get a desired temperature when we get out of bed
  • A set temperature for when you get to work (it should be able to increase to the desired temp within 1 hour)
  • A set temperature for sleeping
  • A long-term minimum vacation temperature

If you want tips on inexpensive Zigbee sensors, I can recommend the “Xiaomi Mi Aqara Smart Air Pressure Temperature Humidity Environment Sensor”.

When I put them next to each other, the difference in temperature is less than 0.02 degrees. Air pressure is also very similar. Humidity can show a bigger difference, but for my application (getting a notification when the laundry is dry) it’s good enough :-)

So the novelty of the algorithm would be in the inclusion of the presence and activity of people (e.g., sleeping). Accurately detecting people and their activity has been almost impossible until very recently, and even now it remains very challenging, IMO. However, their inclusion in a thermostat algorithm would be straightforward. This post is probably more suitable for the thermostat thread, as there is not really any data science element.
Cheers

I think the first challenge is definitely a data science task.

Home Assistant’s climate integration needs a “thermal capacity sensor” that can predict the amount of time it will take to raise or lower the temperature in a given room with a given heating/cooling source.
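A naive first stab at such a prediction would be to fit the recent heating rate and extrapolate to the set point (a sketch on hypothetical readings; real rooms heat nonlinearly as they approach equilibrium, so treat this as a starting point only):

import numpy as np

# Hypothetical temperature samples taken while the heater runs.
minutes = np.array([0, 5, 10, 15, 20])
temps = np.array([17.0, 17.6, 18.1, 18.7, 19.2])  # degC
target = 21.0

rate, _ = np.polyfit(minutes, temps, 1)  # slope in degC per minute
eta = (target - temps[-1]) / rate
print("Heating at %.2f degC/min -> roughly %.0f more minutes to %.1f degC"
      % (rate, eta, target))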

The climate control service should also have a derivative sensor that will detect whether the temperature is climbing or falling.

A derivative sensor was just added: https://www.home-assistant.io/integrations/derivative/
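For analysis outside HA, the same idea is a couple of lines in pandas (a sketch on made-up data):

import pandas as pd

idx = pd.date_range("2020-01-01 06:00", periods=5, freq="10min")
temps = pd.Series([17.0, 17.4, 17.9, 18.1, 18.0], index=idx)

# degC per minute; positive = climbing, negative = falling.
rate = temps.diff() / (temps.index.to_series().diff().dt.total_seconds() / 60)
print(rate)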

I was thinking about how to tag false positives/negatives/other data categories. I’m not sure I understand your approach above, but perhaps it would be possible to use template sensors? You would put the main sensor value/state of interest into the value field and then maybe use an input_select or similar to populate attributes as the tag. It could reset to a default after x minutes, or when you tell it to. Then all the data is in one entity. I don’t know how this appears in the database (I haven’t tinkered yet), but hopefully it would be on one row, and so well shaped for analysis? Does that make sense?

I’m using a somewhat similar approach to gather images for training OpenCV. When I see hot air balloons outside I say “hey Google, I see balloons”, and Home Assistant starts taking image snapshots from a camera at intervals for 10 minutes and writes them to my “positive” directory (@robmarkcole, you might be interested in this image collection technique). The light on my Xiaomi hub goes on/off to remind me to check after 10 minutes whether the balloons are still there.
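If the tag does land in the state’s attributes as hoped, pulling it back out for analysis could look something like this (a sketch; the entity id and the ‘tag’ attribute name are hypothetical, and it assumes attributes are stored as a JSON string in the states table, as the recorder does):

import json
import pandas as pd
from sqlalchemy import create_engine

# Sketch: recover a label stored as a state attribute from the recorder DB.
engine = create_engine("sqlite:////config/home-assistant_v2.db")
df = pd.read_sql(
    "SELECT state, attributes, last_changed FROM states "
    "WHERE entity_id = 'sensor.labelled_motion'",  # hypothetical entity
    engine,
)
df["tag"] = df["attributes"].apply(lambda a: json.loads(a or "{}").get("tag"))
print(df[["last_changed", "state", "tag"]].head())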

Hi @Mahko_Mahko
I am intrigued by your use case - are you monitoring weather balloons or something? How does OpenCV fit in?
Re: ground truth, I was previously using a Hue remote to log (manually) when I was going to bed and getting up.
Cheers

I don’t think the technical part of labelling is the hard part; what you suggest, for instance, would work fine.
After the original post I got myself an IKEA button and created input_booleans similar to what you suggest. This works, but for the Bayesian sensors I have it really won’t help much, so I’m still thinking about my original plan.

During this corona mess I have started on the project, just to get the basics in order. I now have a basic Python program running that communicates with HASS via REST and can request the history of all sensors. So the basics are there, but I still have to start on the hard part.
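For anyone wanting the same starting point, the history request boils down to something like this (a sketch using HA’s documented /api/history/period REST endpoint; the host, token, and entity id are placeholders):

import requests

BASE = "http://homeassistant.local:8123"  # placeholder host
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"  # created under your HA user profile

resp = requests.get(
    f"{BASE}/api/history/period/2020-04-01T00:00:00+00:00",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"filter_entity_id": "sensor.bedroom_temperature"},  # placeholder
)
resp.raise_for_status()

# The response is a list of lists of state dicts, one inner list per entity.
for states in resp.json():
    for s in states[:3]:
        print(s["last_changed"], s["state"])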

I still don’t know how to effectively get the configuration and potentially provide back a new configuration with adjusted probabilities. Obviously I could just parse the existing configuration, but it seems a bit silly not to somehow use Home Assistant for that.
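If parsing the YAML directly turns out to be the pragmatic route after all, a minimal sketch might look like this (hedged: it ignores HA’s custom !include/!secret tags rather than resolving them, which only works for Bayesian sensors defined inline under the top-level binary_sensor key):

import yaml

# HA YAML uses custom tags (!include, !secret); ignore them rather than
# resolve them, which is fine for platforms defined inline.
def ignore_unknown(loader, tag_suffix, node):
    return None

yaml.SafeLoader.add_multi_constructor("!", ignore_unknown)

with open("configuration.yaml") as f:
    config = yaml.safe_load(f)

for sensor in config.get("binary_sensor") or []:
    if isinstance(sensor, dict) and sensor.get("platform") == "bayesian":
        print(sensor.get("name"), "prior:", sensor.get("prior"))
        for obs in sensor.get("observations", []):
            print("  ", obs)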

I’m also weak on the presentation side. I have a rudimentary Streamlit web interface for now, but as an end product I would really prefer a Home Assistant panel.

Hi @robmarkcole, it’s really a variation of the approach you helped me with. I used the OpenCV component to detect whether my blinds were up or down. As for the ‘balloons’: hot air balloons rise on my skyline in Melbourne, and I just want to know when they are out there so I can have a look (they are quite nice). I thought I would train OpenCV with images from the actual context/background rather than from off the web. Ideally I could automate iterations of model training based on feedback about false positives/negatives (maybe you have done this in another image processing platform?). That is probably a discussion for other threads, I guess.

I’m also working on timelapse automation, so I might detect their presence, create a timelapse, and then play it back on one of my screens when I rise (the balloons typically appear at sunrise). You could probably mash together something similar for bird watching ;)