Smart Home Dataset for Machine Learning Project

Hi everyone,

I am doing a programming/machine learning project on smart homes, and I need a lot of data about device statuses and context such as temperature. In the end, I want to be able to detect anomalies to improve the security of the smart home.

For this to work I need a large dataset from different users, so you could really help me out by uploading your Home Assistant .db file here on Dropbox, especially if you have been using Home Assistant for a while.
https://www.dropbox.com/request/ugRDcbZSm8R2pi6sgvGU

This .db file contains all the logs that are displayed in the History tab of Home Assistant, e.g. when a light is turned on.
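If you want a quick look at what is in the file before uploading it, here is a minimal sketch in Python. It only relies on the file being a SQLite database and lists every table with its row count. The in-memory database and its `states` table are a made-up stand-in so the snippet runs on its own, and the file path in the comment is an assumption about where your copy lives.

```python
import sqlite3


def summarize_tables(conn):
    """Return {table_name: row_count} for every table in the database."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    # The names come from sqlite_master itself, so quoting them is enough here.
    return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names}


# Stand-in database (hypothetical content) so the sketch is self-contained.
# For a real file, use something like sqlite3.connect("home-assistant_v2.db").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE states (entity_id TEXT, state TEXT)")
conn.executemany("INSERT INTO states VALUES (?, ?)",
                 [("light.kitchen", "on"), ("light.kitchen", "off")])

print(summarize_tables(conn))
```

The table names and columns you see in your own file depend on your Home Assistant version, so treat the output as a starting point for exploring rather than a fixed layout.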

Thank you very much in advance:)

You can download the .db file easily when you have the File Editor plugin installed:

And what will you offer in return?
Is there anything confidential in the db?

Hi, since this is just a hobby project, I can't pay anything. But I guess I can give back to the community by providing the combined dataset in the end? Maybe other people want to experiment with similar things that would be useful, such as learning automations automatically from user interaction instead of having to program them.

The db file contains the event log of your devices, like when a light got turned on and so on, so normally I would not consider it confidential, but I guess that depends on what kind of devices you have linked to Home Assistant.

You can view and query exactly what is inside the db file by pasting the file in here: https://inloop.github.io/sqlite-viewer/.
Mine looks something like this:

How can I send the file to you?

Hi Cao Hoa,
you can put it here: https://www.dropbox.com/request/ugRDcbZSm8R2pi6sgvGU

My file cannot be downloaded
Do you have any other way?

Do you have Home Assistant installed on a Raspberry Pi or inside a VM (Virtualbox/VMWare) on your PC?

I installed it on an Intel NUC running Ubuntu Server.
And I have a backup.


I have tried to send it to you.

Hey, I have made some space, 2.25 GB is free now. How large is the file?

Have you received the file I sent you yet?

If you use GPS tracking via GPSLogger or the HA apps, the database will contain your logged GPS coordinates. The DB could also contain things like API keys, depending on the integration and how it stores attributes. TTS phrases will be in the DB. Point is that the DB could contain significant personal information.
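One way to check your own file for location data before sharing it is sketched below. It assumes a `states` table with `entity_id` and `attributes` (JSON text) columns, which is how I understand the older recorder schema; newer versions store attributes differently, so the query may need adjusting. The key list and the sample rows are made up for illustration.

```python
import json
import sqlite3

# Attribute names that hint at location data (illustrative, not exhaustive).
SENSITIVE_KEYS = {"latitude", "longitude", "gps_accuracy"}


def entities_with_location(conn):
    """Return sorted entity IDs whose attributes JSON has a sensitive key."""
    flagged = set()
    for entity_id, attributes in conn.execute(
            "SELECT entity_id, attributes FROM states"):
        try:
            attrs = json.loads(attributes or "{}")
        except json.JSONDecodeError:
            continue  # skip rows whose attributes are not valid JSON
        if isinstance(attrs, dict) and SENSITIVE_KEYS.intersection(attrs):
            flagged.add(entity_id)
    return sorted(flagged)


# Stand-in database with made-up rows so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE states (entity_id TEXT, state TEXT, attributes TEXT)")
conn.executemany("INSERT INTO states VALUES (?, ?, ?)", [
    ("device_tracker.phone", "home",
     json.dumps({"latitude": 0.0, "longitude": 0.0})),
    ("light.kitchen", "on", json.dumps({"brightness": 200})),
])

print(entities_with_location(conn))
```

This only looks at attribute names. It would not catch things like TTS phrases or keys stored as plain state values, so a manual look through the file is still a good idea.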

I am not sure about the API keys. I have several configurations (such as SSH and the Google Smart Home integration), and when I checked, all the API keys/secrets were stored in the yaml file and not inside the DB. It makes sense that location data might be inside the DB if you installed a GPS logger. I had not thought about that in advance, and some people might be using one.

Anyway thank you Aaron for the helpful note.

Yes, thank you very much, Cao!

Do you need more db file data?
I have over 4 GB of new data.

Have you finished the project yet?
Can I help you?

Hi Cao,

So I had a simple HA setup with a thermostat, smart lights, door and window sensors and a couple of other things. With HA, everything was stored in the .db file on the Raspberry Pi. After 2 months of passively collecting data, I tested it over the course of a day and wrote some Python scripts.

What I did was simulate unusual behavior, i.e. things that were not observed in the training data, to see if a model, e.g. a simple One-class SVM, would pick up on it. This included someone opening the door when I am not at home, flickering lights, lights turning on in the middle of the night (as this did not happen in the training data), and making the temperature sensor of the thermostat measure a higher temperature than usual, basically simulating a fire or something. It also included more subtle things, like opening the window when it is cold outside while the heating is on. Obviously, you can hard-code everything, like a rule that turns off the heating when it recognizes that the windows are open, but as I explained, I wanted to see if a machine learning model can recognize something like this automatically.

It worked partly, but mostly on time-related things. My model was too stupid to recognize more subtle things like the heating scenario. The FP rate was also kind of high, around 2%. I hoped that with a larger dataset from diverse users the model would be smarter and generalize better. On the other hand, the data was hard to integrate, as your dataset and mine only have a small intersection of the same devices, and the usage patterns are probably very different as well.

Anyway, those first results showed that it worked quite nicely in some cases, but overall it is not robust enough and can probably only work very well with lots and lots of data. After that I got carried away with other things.
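To make the idea of "things that were not observed in the training data" concrete, here is a rough sketch. It is not the One-class SVM I used, just a much simpler stand-in: it remembers every (entity, state, hour of day) combination seen during training and flags anything else. All entity names, timestamps and class names are made up for illustration.

```python
from datetime import datetime


def to_feature(entity_id, state, timestamp):
    """Reduce an event to (entity, state, hour of day)."""
    return (entity_id, state, timestamp.hour)


class SeenBeforeDetector:
    """Flags events whose feature tuple never occurred in the training data."""

    def __init__(self):
        self.seen = set()

    def fit(self, events):
        self.seen = {to_feature(*event) for event in events}
        return self

    def is_anomaly(self, event):
        return to_feature(*event) not in self.seen


# Made-up training events: (entity_id, state, timestamp).
training = [
    ("light.bedroom", "on", datetime(2021, 3, 1, 19, 5)),
    ("light.bedroom", "off", datetime(2021, 3, 1, 23, 10)),
    ("binary_sensor.front_door", "on", datetime(2021, 3, 1, 8, 0)),
]
detector = SeenBeforeDetector().fit(training)

# Light on in the evening was seen before; light on at 3 am was not.
print(detector.is_anomaly(("light.bedroom", "on", datetime(2021, 3, 2, 19, 40))))  # False
print(detector.is_anomaly(("light.bedroom", "on", datetime(2021, 3, 2, 3, 15))))   # True
```

Because each event is looked at in isolation, a sketch like this can only catch time-related oddities. A combination such as an open window with the heating on would need features that describe several devices at once.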


I have a new 4 GB data file, can you use it?
I am also using Frigate to identify people, cars, dogs, cats and other things. You can write in event form.
I am using this with Double Take for face detection and recognition.
When a trusted face is detected, the event or action goes on to trigger an automation.
Can you share the settings with me?
I'll run it on my system and let you know the results.
