I’m Anja, a 25-year-old Computer Science student at Hochschule München, and I’m currently working on my master’s thesis, “Machine Learning with Home Assistant”.
In detail, the thesis is about developing and training an intelligent model which can - based on the analysis of data - predict automations and other features to improve the daily life of household members.
To train and test the model, I need real-life data from home automation systems. And that’s where you come in - to train my model, I need the Home Assistant data saved in your database. This will mainly be sensor data - room temperature, light cycles, etc.
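If you are unsure what such an export might look like, here is a rough sketch for the default SQLite recorder database. The file path and the classic `states` schema are assumptions on my side; table and column names differ between Home Assistant versions, so treat this as illustration rather than a finished export script:

```python
# Rough sketch: export sensor history from the default SQLite recorder
# database (home-assistant_v2.db) to a CSV file.
# The path and the classic states schema are assumptions; they may
# differ between Home Assistant versions.
import sqlite3
import pandas as pd

conn = sqlite3.connect("home-assistant_v2.db")  # path is an assumption
query = """
    SELECT entity_id, state, last_updated
    FROM states
    WHERE entity_id LIKE 'sensor.%'
"""
df = pd.read_sql_query(query, conn)
df.to_csv("my_home_data.csv", index=False)
conn.close()
```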
The data you send me will only be used locally to train my model - it will not be shared or used for any other purpose, and it will be deleted after the master’s thesis is done.
I really appreciate your help and contribution!
If you would like to participate but are not comfortable uploading your data via Google Forms, please contact me via e-mail: [email protected]
If you’re further interested in the thesis, I will happily keep you updated here, and you can also share your ideas and suggestions if you have some. If you have any questions about the project, please feel free to ask!
I’m interested in this, however I’m wondering how you would use data from multiple installations to train a model. Each home is unique: it has its own set of rooms, devices, and entities. The behaviour is also different in each home. Some people work from home, some people go to work; people ultimately control their homes differently.
How will you handle this potentially exponential number of configurations and setups?
My thought behind collecting data from different installations was exactly the diversity you mention. It gives me the possibility to train the model on different aspects and features, such as different sensors, and to test which patterns it can recognize and which do not make sense. The patterns that can be found in most of the setups seem the most interesting ones to solve problems for, since they affect most of the community.
Furthermore, another approach will be to make predictions about household members (number, family or single, with/without children), which is only possible if I have diverse sets of data.
Hello @Anschke. Unfortunately, I recently deleted most of my long-term data due to a change of PC :-/
Which time period would be helpful for you to train a model?
If you are interested, I can also set up access to my Postgres DB for you, for real-time data during the period of your thesis.
Just send me a PM if you are interested - in German as well, if you prefer.
I did not test it with MariaDB, but according to the Home Assistant Data Detective documentation it is supported - you may just have to alter the way you address the DB via the connection URL. I linked the relevant part of the documentation in the Jupyter Notebook. If you need further help, please feel free to message me!
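Untested on my side, but the change should be limited to the connection URL, since Data Detective uses SQLAlchemy-style URLs. A minimal sketch, assuming the pymysql driver is installed; host, user, password, and database name are placeholders:

```python
# Untested sketch: pointing HASS Data Detective at MariaDB instead of
# the default SQLite file. Only the SQLAlchemy connection URL changes.
from detective.core import HassDatabase

# SQLite (the default):
# db = HassDatabase("sqlite:////config/home-assistant_v2.db")

# MariaDB/MySQL (requires the pymysql driver: pip install pymysql;
# credentials below are placeholders):
db = HassDatabase("mysql+pymysql://user:password@localhost/homeassistant")
```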
Thank you!
I’d also appreciate a negative data set. Maybe you could alter the file name of the .csv so that I can recognize the negative one, with a _neg suffix or something similar.
Thank you!
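For context, a hypothetical sketch of how I could pick up such a naming convention when loading the submissions - the folder name and the label column are assumptions on my side, not a fixed format:

```python
# Hypothetical sketch: treat any submitted CSV whose file name ends in
# "_neg" as a negative example while loading.
from pathlib import Path
import pandas as pd

frames = []
for csv_path in Path("submissions").glob("*.csv"):  # folder is an assumption
    df = pd.read_csv(csv_path)
    df["is_negative"] = csv_path.stem.endswith("_neg")
    frames.append(df)

data = pd.concat(frames, ignore_index=True)
```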
Totally forgot that I had filtered out most of my entities in the recorder. I have just enabled it for all entities and will let it run for 10 days, then share the data with you.
I am so jealous! I was going to take a data science mini class (about six months of intensive data science classes) and was going to do this VERY same data analysis. I wasn’t going to do it for all the people, just my own darn self. I think this is a better idea though, because you will have a much better data set to pull from. You should look into open-sourcing the data once you have stripped it of anything that can be used to identify people. This could be a small treasure trove of information that could be used to help create some form of machine-learned automation creation within HA!

My end goal was to work with someone to help create something for HA, but life got in the way and I couldn’t attend the class. Plus it was really more than I could afford, so I had to wait on it. I just got my new setup going; I’ll wait a week, then revisit this and post my data for you to mull over (it won’t be much, I am starting over pretty much from scratch).
I second this; perhaps people here can say whether their data can be anonymized and reused. I think big companies have access to a huge amount of data due to the way they collect everything, but there are not so many options left for the little guys.
IMHO it’s not always about having as much data as possible, but about having the right data. You don’t just need Big Data, you need Thick Data. The quality of the data for answering the respective question is at least as important as the quantity.
At my employer (also a “big player” in its segment in Europe), I see again and again that people work under the assumption that we can just collect everything somehow and then look for meaningful patterns afterwards. In reality, however, this rarely if ever works. Usually you have to start from a hypothesis and generate the specific data needed to falsify it.
Hmm, interesting project. But as far as I can see, big chunks of raw data are pretty much useless without additional contextual metadata. How are you supposed to know what the data from my sensor.temp_54fac6 represents? Is it my bedroom temperature, my living room temperature, or the temperature in my garden shed? It’s even worse for motion and door sensors, because they’re binary: did I open the door to the bathroom or the door to the basement? Add to that the vastly different layouts of people’s homes. If you want to train a model to correlate human behavior with sensor data, you need meaningful, correlated data to begin with, which means you need access to that kind of metadata. Otherwise you’ll just train your model on noise.
Just curious how you’re planning to manage this.
I can barely train a model on my own data, let alone on other people’s data. But I don’t want to get in Anja’s way. I think I will get statistical data from other places; there is plenty of IoT data freely available online, for example: https://thingspeak.com/channels/public and https://dweet.io/see
But I agree there is a need for quality data that can easily be used to cross-correlate patterns.
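For anyone who wants to experiment, a minimal sketch of pulling recent readings from a public ThingSpeak channel via its REST feeds endpoint. The channel ID is a placeholder - substitute any channel from the public list linked above:

```python
# Illustrative sketch: fetch recent readings from a public ThingSpeak
# channel via the REST API. CHANNEL_ID is a placeholder.
import requests

CHANNEL_ID = 12345  # placeholder: pick any public channel ID
url = f"https://api.thingspeak.com/channels/{CHANNEL_ID}/feeds.json"
resp = requests.get(url, params={"results": 100}, timeout=10)
resp.raise_for_status()

feeds = resp.json()["feeds"]  # list of timestamped field readings
for entry in feeds[:5]:
    print(entry["created_at"], entry.get("field1"))
```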
When you extract the database, it includes the friendly names of the entities. From the data I have received so far, I can see that almost all users renamed their sensors with meaningful names, e.g. sensor.temp_Bedroom, sensor.door_fridge, or something similar. There might be some exceptions where people prefer the “complicated” numerical names - I’ll see how it works out.
But you’re right, data preparation before training will probably consume a huge amount of time.
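To illustrate one such preparation step, here is a rough, untested sketch that maps entity IDs to the friendly names stored in the JSON attributes column. It assumes the classic recorder schema and the default SQLite path; details vary between Home Assistant versions:

```python
# Sketch of one data-preparation step: map entity_ids to the user-given
# friendly names stored in the JSON "attributes" column of the classic
# states table. Schema and path are assumptions.
import json
import sqlite3
import pandas as pd

conn = sqlite3.connect("home-assistant_v2.db")  # path is an assumption
df = pd.read_sql_query("SELECT entity_id, attributes FROM states", conn)
conn.close()

df["friendly_name"] = df["attributes"].map(
    lambda a: json.loads(a).get("friendly_name") if a else None
)
# One row per entity, e.g. sensor.temp_54fac6 -> "Bedroom Temperature"
names = df.drop_duplicates("entity_id")[["entity_id", "friendly_name"]]
print(names.head())
```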
I’m still working on the project! I hope I will have some time around Christmas to give a proper, detailed update on my work.
If everything goes according to plan, I will finish the project around February 21. I may publish the repo after the thesis is completed and has passed the whole university process.