Hello, I am just starting to tinker with Home Assistant and I have an architectural question that maybe you folks can help me with.
The setup
I have a bunch of Pi Zeros with cameras/IP cameras that expose MJPEG/RTSP streams. I have Home Assistant Core running in a venv on an RPi 4. I am able to hook up the cameras using the MJPEG/generic camera components and display the streams in the UI as cards. It’s nice and works well.
What I want to do
I want to run (a suite of) ML algos on the different video streams for object/face detection, in such a way that:
- The processing of the images happens on a different host than the one running Home Assistant. This is to allow the processing to be carried out on a separate, more performant machine.
- The output from the processing application is: (1) the processed streams, with e.g. detection boxes drawn in, to be displayed in the Home Assistant UI, and (2) structured information, e.g. JSON, that can be used to fire events/create entities in Home Assistant (see the payload sketch after this list).
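To make (2) concrete, this is roughly the kind of structured output I have in mind. It is only a sketch: the webhook id, camera entity and person label are made up, and it assumes the processing server pushes each detection to a Home Assistant webhook trigger.

```python
# Sketch: processing server pushes one detection to an HA webhook trigger.
# The webhook id, camera entity and person label are placeholders.
import time
import requests

HA_WEBHOOK_URL = "http://homeassistant.local:8123/api/webhook/face_detection"  # hypothetical webhook id

detection = {
    "camera": "camera.study",        # HA entity of the source stream
    "person": "person_a",            # label returned by the recognition model
    "confidence": 0.93,
    "box": [120, 80, 260, 240],      # x1, y1, x2, y2 in pixels
    "timestamp": time.time(),
}

# Webhook triggers in HA accept a plain POST with a JSON body (no auth token needed),
# so the server side stays very simple.
requests.post(HA_WEBHOOK_URL, json=detection, timeout=5)
```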
My first goal would be to use this to do presence detection using image recognition. Example application: I go to the study in the evening and HA detects that I am in there (and not my wife) and starts playing my relaxing Spotify playlist. Or it shows me the news from the newspaper I read (which is different from my wife’s) on a wall screen. Or, or… you get the idea. For this, any device-based system (Bluetooth, etc.) won’t work because I don’t carry beacons/phones etc. with me inside the house.
Some rambling thoughts on this
I was thinking of creating a custom component that would function as a proxy. It would be configured from HA with the entities of the cameras whose feeds are to be processed and the IP address of the image-processing server running the image-processing app (which I would write). The component would then transmit this configuration data (e.g. the IPs/types of the original streams) to the server, which would start processing the streams and expose the processed streams at new addresses. These would be returned to the component, which would create new camera entities for inclusion in the UI. Something similar (in my mind, maybe I’m wrong) to the current proxy camera platform, with the difference that the processing of the images is executed on a different host (maybe even in the cloud). A rough sketch of what the server side could look like is below.
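Just to illustrate the idea, not a real implementation: this assumes OpenCV and Flask on the processing server, uses a stock Haar cascade as a stand-in for whatever detection model ends up being used, and the source stream URL and port are invented.

```python
# Sketch of the processing server: pull one source stream, draw detection
# boxes, and re-expose the result as an MJPEG stream HA can point a camera at.
import cv2
from flask import Flask, Response

SOURCE_URL = "rtsp://pi-zero-study:8554/stream"   # hypothetical source stream
app = Flask(__name__)
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

def annotated_frames(url):
    """Yield JPEG-encoded frames with face-detection boxes drawn on them."""
    capture = cv2.VideoCapture(url)
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.3, 5):
            cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        ok, jpeg = cv2.imencode(".jpg", frame)
        if not ok:
            continue
        # multipart/x-mixed-replace is what MJPEG clients (including HA) expect
        yield (b"--frame\r\nContent-Type: image/jpeg\r\n\r\n"
               + jpeg.tobytes() + b"\r\n")

@app.route("/processed/study")
def processed_stream():
    return Response(annotated_frames(SOURCE_URL),
                    mimetype="multipart/x-mixed-replace; boundary=frame")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

The HA side could then point an MJPEG camera at http://<processing-server>:5000/processed/study, either configured by hand or by the custom component described above.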
Then there is the business of using the extracted information for automation. Assuming that the algorithm detects the face of person A in the study camera stream at time T, I would like the HA component to update the corresponding person entity to reflect this information, so that automations can be configured based on it. Example: the camera stream detects my face in the study; this info is pushed to the HA component; this updates my person entity status to ‘in the study’, and automations are triggered based on this. On this point I am still vague, because the person entity seems to be very much modelled around device trackers for home/not_home, while I would like to represent a more granular state, e.g. home/sitting_room, in order to trigger the aforementioned automations. A sketch of one way I could imagine pushing such a room-level state into HA follows.
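For example, one option (sidestepping the person entity entirely) would be for the processing server to set a plain sensor per person via the HA REST API, and trigger automations on that. Again just a sketch: the entity id, host and long-lived access token are placeholders.

```python
# Sketch: set a room-level presence sensor in HA via the REST API.
# Host, token and entity id are placeholders for illustration only.
import requests

HA_URL = "http://homeassistant.local:8123"
TOKEN = "LONG_LIVED_ACCESS_TOKEN"   # created from the HA user profile page
HEADERS = {"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"}

def report_presence(person: str, room: str) -> None:
    """Set e.g. sensor.presence_person_a to 'study' so automations can trigger on it."""
    entity_id = f"sensor.presence_{person}"
    requests.post(
        f"{HA_URL}/api/states/{entity_id}",
        headers=HEADERS,
        json={"state": room, "attributes": {"source": "camera_face_detection"}},
        timeout=5,
    )

report_presence("person_a", "study")
```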
So, any suggestions/advice/pointers on this from you more experienced people would be helpful.
Regards.