Combined FFmpeg, openCV, dlib and SciKit into one face recognition component using CUDA

So… I hacked the ffmpeg camera and dlib components (I could create a custom component, but that is for later) to drastically improve facial recognition performance.

As I shared here:

The code is on GitHub and I am currently running it on Hass 0.111.4.

It required compiling both FFmpeg and then openCV from source to support the latest CUDA, but I found that to be optional: the openCV DNN model downsizes pictures to 300x300 and is fast enough to run on a CPU.
I am currently processing a constant stream at 10 fps, and my CPU and GPU utilization have increased by only 6% and 2% respectively.

The changes are:

  1. Make the FFmpeg component maintain a constant stream in the background from which frames can be pulled, instead of grabbing a snapshot every 10 s and starting a new stream every time.
  2. Only pick up and decode the frame needed for processing.
  3. Use the openCV DNN (deep neural network) model for face detection instead of dlib’s HOG or CNN.
  4. Use the dlib encoding to encode the previously detected faces.
  5. Use an SVM (machine learning) classifier to determine and recognize the dlib encodings, instead of the Euclidean distance used in the original component (see the sketch after this list).
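
To illustrate items 3-5, here is a minimal sketch of what such a pipeline can look like. This is not the actual component code; the model file names and the trained classifier "clf" are placeholders.

    # Rough sketch of the detect -> encode -> classify flow described above.
    # NOT the actual component code; model files and "clf" are placeholders.
    import cv2
    import numpy as np
    import face_recognition  # dlib-based encodings

    # openCV DNN face detector (resnet10 SSD, Caffe weights)
    net = cv2.dnn.readNetFromCaffe("deploy.prototxt",
                                   "res10_300x300_ssd_iter_140000.caffemodel")

    def recognize(frame_bgr, clf, min_conf=0.6):
        h, w = frame_bgr.shape[:2]
        blob = cv2.dnn.blobFromImage(cv2.resize(frame_bgr, (300, 300)), 1.0,
                                     (300, 300), (104.0, 177.0, 123.0))
        net.setInput(blob)
        detections = net.forward()                     # shape (1, 1, N, 7)

        rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        names = []
        for i in range(detections.shape[2]):
            if detections[0, 0, i, 2] < min_conf:      # confidence filter
                continue
            x1, y1, x2, y2 = (detections[0, 0, i, 3:7] * np.array([w, h, w, h])).astype(int)
            # face_recognition expects (top, right, bottom, left) boxes
            encs = face_recognition.face_encodings(rgb, [(y1, x2, y2, x1)])
            if encs:
                names.append(clf.predict([encs[0]])[0])  # SVM label, e.g. "unknown"
        return names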

Would it be possible to get a how-to to set all of this up?

If there is enough interest, I would actually create a custom component to avoid having to go through a long setup tutorial…
Indeed, it is not just about copying the code, since there are a couple of things to install as well.

Looks excellent and I may get time to try and implement it when I get home from holiday. Thank you for sharing.

Installation procedure:

Go into your configuration.yaml and follow this to set up your camera.
Note that the input can only be a URL (rtsp or http).

and follow this page to set up the dlib component,

setting the source to be your ffmpeg camera.

You should be able to see your dlib component in Lovelace.

FFmpeg, if you want GPU video decoding:

mkdir -p source/
cd source
git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
cd nv-codec-headers && sudo make install && cd -
git clone https://github.com/FFmpeg/FFmpeg.git
cd FFmpeg
./configure --enable-cuda-nvcc --enable-cuvid --enable-nvenc --enable-nonfree --enable-libnpp --enable-pic --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64 --extra-ldexeflags=-pie
make
sudo make install
sudo mv ffmpeg /usr/bin
cd ..

dlib, if you want to play with CUDA, compile it with CUDA support: http://dlib.net/compile.html

git clone https://github.com/davisking/dlib.git
cd dlib
git submodule init
git submodule update
mkdir build
cd build
cmake -D DLIB_USE_CUDA=1 -D USE_AVX_INSTRUCTIONS=1 ../
cmake --build . --config Release   # build the library (per dlib's compile instructions)
cd ../
python setup.py install

Now the mods:

Activate your venv and run this:

pip install git+https://github.com/rafale77/face_recognition_models
pip install git+https://github.com/rafale77/face_recognition
pip install opencv-python-headless
pip install opencv-contrib-python-headless
pip install scikit-learn
cd ..
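
Optionally, a quick sanity check from within the venv to confirm everything imports (just a convenience snippet, not part of the original instructions):

    import cv2, dlib, sklearn, face_recognition

    print("openCV:", cv2.__version__)
    print("dlib:", dlib.__version__, "| CUDA:", dlib.DLIB_USE_CUDA)
    print("scikit-learn:", sklearn.__version__)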

If you want to use the GPU, build openCV with CUDA:

git clone https://github.com/opencv/opencv.git
git clone https://github.com/opencv/opencv_contrib.git   # needed for OPENCV_EXTRA_MODULES_PATH below
cd opencv
mkdir build
cd build
cmake -D CMAKE_BUILD_TYPE=RELEASE \
-D CMAKE_INSTALL_PREFIX=/usr/local \
-D INSTALL_PYTHON_EXAMPLES=OFF \
-D INSTALL_C_EXAMPLES=OFF \
-D OPENCV_ENABLE_NONFREE=ON \
-D WITH_CUDA=ON \
-D WITH_CUDNN=ON \
-D WITH_CAFFE=ON \
-D WITH_NVCUVID=ON \
-D OPENCV_DNN_CUDA=ON \
-D ENABLE_FAST_MATH=ON \
-D CUDA_FAST_MATH=ON \
-D CUDA_ARCH_BIN=7.5 \
-D WITH_CUBLAS=ON \
-D OPENCV_EXTRA_MODULES_PATH=~/source/opencv_contrib/modules \
-D HAVE_opencv_python3=ON \
-D PYTHON_EXECUTABLE=/usr/bin/python3 \
-D BUILD_NEW_PYTHON_SUPPORT=ON \
-D PYTHON2_EXECUTABLE=/usr/bin/python \
-D CMAKE_CUDA_FLAGS="-lineinfo --use_fast_math -rdc=true -lcudadevrt" \
-D BUILD_EXAMPLES=OFF ..
make
sudo make install

Change “CUDA_ARCH_BIN” to the value corresponding to your GPU; 7.5 is for the Turing GPUs.

OK, now everything is set.

Find the ffmpeg component within your venv:

homeassistant/lib/python3.7/site-packages/ffmpeg/

and replace the camera.py file with this one:

Comment out line 19 if you do not use a GPU.
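
Conceptually, the replacement keeps one stream open and only decodes the frame it needs, along the lines of this sketch (not the actual camera.py; the RTSP URL is a placeholder):

    import cv2

    cap = cv2.VideoCapture("rtsp://user:pass@camera/stream")  # opened once, kept open

    def latest_frame():
        # grab() advances the stream without decoding; retrieve() decodes only
        # the frame we actually keep.
        for _ in range(5):
            cap.grab()
        ok, frame = cap.retrieve()
        return frame if ok else None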

Find the dlib face identify component:

homeassistant/lib/python3.7/site-packages/dlib_face_identify/

and replace the image_processing.py with this one

Note that line 84 is where you set the folder your pictures will go in; you can change this depending on your setup (described below).

You are done with installation! Now the setup.

Create a folder where you will put the pictures used to train the model. In my example it is in my .homeassistant configuration folder:

~/.homeassistant/recogface/faces/

Create one folder for each person you want recognized, plus one “unknown” folder, and copy in all the pictures you want. I recommend at least 20 pics in each.
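
For reference, the folder layout maps to training data roughly like this (an illustrative sketch, not the component's exact code):

    import os
    import face_recognition
    from sklearn import svm

    faces_dir = os.path.expanduser("~/.homeassistant/recogface/faces/")
    encodings, labels = [], []

    for person in os.listdir(faces_dir):        # one sub-folder per person + "unknown"
        person_dir = os.path.join(faces_dir, person)
        if not os.path.isdir(person_dir):
            continue
        for pic in os.listdir(person_dir):
            image = face_recognition.load_image_file(os.path.join(person_dir, pic))
            encs = face_recognition.face_encodings(image)
            if len(encs) == 1:                  # skip pictures with 0 or >1 faces
                encodings.append(encs[0])
                labels.append(person)

    clf = svm.SVC(gamma="scale", probability=True)
    clf.fit(encodings, labels)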

Restart Home Assistant.


Installation procedure without GPU:

Go into your configuration.yaml and follow this to set up your camera.

Note that the input can only be a URL (rtsp or http).

and follow this page to set up the dlib component,

setting the source to be your ffmpeg camera.
You should be able to see your dlib component in Lovelace.

Now the mods:
Activate your venv and run this:

pip install git+https://github.com/rafale77/face_recognition_models
pip install git+https://github.com/rafale77/face_recognition
pip install opencv-python-headless
pip install opencv-contrib-python-headless
pip install scikit-learn

OK, now everything is set.

Find the ffmpeg component within your venv:
homeassistant/lib/python3.7/site-packages/ffmpeg/
and replace the camera.py file with this one:

Comment out line 19.

Find the dlib face identify component:
homeassistant/lib/python3.7/site-packages/dlib_face_identify/
and replace the image_processing.py with this one

Note that line 84 is where you set the folder your pictures will go in; you can change this depending on your setup (described below).

You are done with installation! Now the setup.
Create a folder where you will put the pictures used to train the model. In my example it is in my .homeassistant configuration folder:
~/.homeassistant/recogface/faces/
Create one folder for each person you want recognized, plus one “unknown” folder, and copy in all the pictures you want. I recommend at least 20 pics in each.

Restart Home Assistant.


I made some further improvements to the code, so it now saves the classification training: it only has to run once, instead of every time you restart Home Assistant. To retrain, in case you changed your picture set, just remove the pretrained classifier by deleting the model.joblib file in the homeassistant folder.
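
In other words, the behaviour is roughly this (a sketch only; the exact path and the train_classifier() helper are illustrative, not the component's real names):

    import os
    from joblib import dump, load

    model_path = os.path.expanduser("~/.homeassistant/model.joblib")

    if os.path.isfile(model_path):
        clf = load(model_path)       # reuse the saved classifier
    else:
        clf = train_classifier()     # hypothetical training step (see the folder sketch above)
        dump(clf, model_path)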

Would like to try it. Could you please create a custom component?

I will have to learn how to, and it is a good opportunity for me to look into it.
I have actually been updating my repo and have evolved the face detection to test a variety of models.
I discovered some oddities when training my SVM using openCV’s DNN face detection model on my face dataset. I am not sure if it is my dataset, but it seems to find too many faces in many of my pictures while seemingly never having any problems in real life. It is a resnet10 backbone run as a Caffe model.
I trained my SVM using dlib’s CNN model and have now tested training with retinaface… yielding the same result, so I switched to retinaface on a resnet50 backbone for my production system. It uses a lot more GPU and system RAM at the same input resolution, while the CPU and GPU load are about the same.
I will next test arcface to replace dlib and the scikit classifier to see how it works, and, good or bad, will create a custom component with the best models for detection/alignment/recognition.

So far for face detection (a short sketch of how each detector is invoked follows the list):

  • dlib HOG: fast on CPU but not very sensitive.

  • dlib CNN MMOD: requires a GPU (way too slow on CPU); much better than HOG, but like HOG it needs a relatively large face to see it.

  • openCV DNN resnet10: much better than the two dlib detectors, runs reasonably on CPU and is very fast on GPU. It detects much smaller faces and faces at an angle. Its face box is also larger than dlib CNN’s, which enables using dlib’s large encoding model (68 points vs 5 points).

  • retinaface resnet50: slower than the openCV DNN model on GPU, but that could be because I am running it in fp32 vs. fp16 for openCV; I am not sure it could run on a CPU. Much more accurate at the same input resolution and a little less prone to false positives.
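
For completeness, here is roughly how each of the dlib/openCV detectors above is invoked (model file names and the test image are placeholders; this is an illustration, not my component code):

    import cv2
    import dlib

    hog = dlib.get_frontal_face_detector()
    mmod = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
    net = cv2.dnn.readNetFromCaffe("deploy.prototxt",
                                   "res10_300x300_ssd_iter_140000.caffemodel")

    img = cv2.imread("test.jpg")
    rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    hog_faces = hog(rgb, 1)          # dlib HOG: fast on CPU, less sensitive
    mmod_faces = mmod(rgb, 1)        # dlib CNN MMOD: practical only on a GPU
    blob = cv2.dnn.blobFromImage(cv2.resize(img, (300, 300)), 1.0,
                                 (300, 300), (104.0, 177.0, 123.0))
    net.setInput(blob)
    dnn_faces = net.forward()        # resnet10 SSD output; filter rows by confidence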

Retinaface can also be run on a mobilenet backbone, which is much faster (likely faster than openCV’s DNN) but loses a lot of accuracy. Another one to consider: Centerface, which is both faster and more accurate than retinaface on mobilenet, but less accurate than the retinaface resnet model. I will first test retinaface in half precision and then see whether I want to test centerface.

The combination of retinaface and arcface appears to be the best of the best in accuracy from my research, so I will likely move to that if the resource load is not excessive.

A quick update on this… I have successfully implemented retinaface+arcface using pytorch as the framework and integrated it into Home Assistant.

Basically, my new dlib component no longer makes use of dlib or SciKit at all.
It uses openCV to decode the image from the video stream instead of Pillow/ffmpeg, a resnet50 retinaface model on pytorch for detection and alignment instead of all the options I tried (dlib MMOD/openCV DNN), and a resnet50 arcface model on pytorch for face encoding/recognition, with a cosine distance instead of dlib’s Euclidean distance (or my later SVM classifier) for the determination.
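
The cosine-distance determination amounts to something like this (illustrative only; the names and threshold are not the component's actual values):

    import torch
    import torch.nn.functional as F

    def match(embedding, known_embeddings, known_names, threshold=0.5):
        # embedding: [512]; known_embeddings: [N, 512]; both L2-normalised
        sims = F.cosine_similarity(embedding.unsqueeze(0), known_embeddings, dim=1)
        best = int(torch.argmax(sims))
        return known_names[best] if sims[best] >= threshold else "unknown"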

Obviously, the resnet50 networks are much more demanding in resources than the previous resnet10/resnet34 for openCV/dlib, but I found this to be significantly more sensitive and accurate than the other models. In terms of speed, it is second only to the openCV DNN detector + dlib encoder + SVM classifier at the lowest setting. The difference between the two is an extra 1 GB of system RAM, 2 GB of VRAM, 20% of CPU and 10% of GPU.

The changes I made to Home Assistant go a little deeper than this component, in particular in how the video stream is managed, so it is a bit hard to make it a custom component. I would recommend just trying my fork of Home Assistant. It also requires downloading the models from here and here and putting them in the Home Assistant config directory inside a “model/” folder.

Hi,
I’m interested in trying out your work on RetinaFace/ArcFace. Do you have more detailed instructions? For the “downloading the models from here and here” part, I am not sure what actually needs to be downloaded.

Thanks!

Hi,

Yes, I realize that I didn’t leave very detailed instructions. My fork of Home Assistant has actually evolved quite a bit, and I have upgraded pytorch as well as a number of models.
I am still using a combination of retinaface and arcface models, but have since been running them with the pytorch just-in-time compiler (JIT) to make them even faster.

The missing trick is to download the pre-trained model files saved in JITted form and store them in a model/ folder inside your .homeassistant/ configuration folder.
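
Loading a JITted model from that folder looks roughly like this (the file name is a placeholder; the fork may use different names):

    import os
    import torch

    model_dir = os.path.expanduser("~/.homeassistant/model/")
    arcface = torch.jit.load(os.path.join(model_dir, "arcface.pt"), map_location="cuda")
    arcface.eval()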

As a prerequisite, you would have to install or compile openCV to use my modified/improved ffmpeg component, and then configure your various camera streams.

The arcface pretrained model can be downloaded from here. I will find some time next week to share the retinaface model as well and complete the instructions.

Hey Rafale, how’s life :sweat_smile:. Any chance to make this “click-and-play”?

Can’t complain at the moment… I am out toasting on an island in the middle of the Pacific Ocean, which is why it will be a week before I can get the instructions completed and uploaded. “Click and play” will be difficult on this one, given the number of camera stream protocols one has to set up and the couple of per-camera sensitivity parameters you would have to adjust, but I will do my best to document it properly. I will have to update my fork to 2021.4.5.

I have updated the readme with much more detailed instructions. Let me know if you have any questions.

Hi, thanks for the more detailed information :slight_smile:
I’m a bit of a beginner at this, so it’s quite possible I’m in over my head with this.
I’m trying to get this to run in a CPU-only setup, as I don’t have a GPU.
I’ve got an existing Docker setup for pytorch-serve that I’m using to play around with your code. I’m not using the HA stuff, as with a few mods I’m able to run image_processing.py directly from Python. I’ve also made mods to set the “device” to “cpu” instead of cuda.

The problem I’ve run into is face training. I’ve got, oh, around 25 pictures/images of my face that I tried to train with. All of them come back with the error “can't be used for training”. I dug into this a little. The good news is that landmarks were successfully determined for all the training images, and each was cropped to 112x112 showing my face. The failure comes in image_processing.py, where it checks the length of the embeddings produced by arcface:

    emb = self.face_detector.detect_align(pic, img, priors)[0]
    if len(emb) == 1:
        embs.append(emb)
    else:
        _LOGGER.error(person_img + " can't be used for training")

When I print out the length that was actually computed, all my pictures/images used in training have an embedding length of 512.

Just wanted to run this by you to see if you thought the embedding length was indeed supposed to be 1.
Thanks!

Thanks for the feedback. I should indeed think about enabling a CPU-only option, if only for the stream decoding. I will look into it for the next version.
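
For what it’s worth, on the model side a CPU fallback would essentially be a device switch like this (a sketch, not the component’s actual code):

    import torch

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = torch.jit.load("model/arcface.pt", map_location=device).to(device).eval()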

If you look at the code you quoted, the embedding variable (emb) should only contain the first element of the table returned by the detect function, so yes, its length is supposed to be 1. If it is more than 1, it would mean the detector is detecting more than one face. I don’t believe the detector is finding 512 faces every time, so there must be something wrong elsewhere. There is likely a dimensional error in that function call. Have you made modifications elsewhere?

No particular modifications to speak of, other than those mentioned above.
I’ll dig around some more, but it will be a few days before I can get back to it.

Yeah, it would be great if you could get a CPU-only version running!

I wanted to report some of my findings regarding training and the length-of-512 issue, to see if you had any ideas.

I’m using a single image (with a single face) to train for now, just to get an idea of the dimensioning.

  1. After cv2.imread-ing the image: Python length 1280, NumPy shape (1280, 960, 3).

  2. After deriving landmarks and executing warpAffine, face_img: Python length 112, NumPy shape (112, 112, 3). I can actually write this image to disk, and it indeed found a single face.

  3. After running face_preprocessing, faces_prep: Python length 1 (a single tensor), tensor size torch.Size([1, 3, 112, 112]).

  4. Arc model output arc_out: Python length 1 (a single tensor), tensor size torch.Size([1, 512]).

  5. After L2 normalization, l2_norm(arc_out): Python length 1 (a single tensor), tensor size torch.Size([1, 512]). This tensor is what gets returned (see next).

  6. emb = self.face_detector.detect_align(pic, img, priors)[0]. Here, emb has Python length 512, tensor size torch.Size([512]). This is of course where it fails.

Continuing from the above, I repeated the test with a single image that this time contained two faces:

  1. After cv2.imread-ing the image: Python length 960, shape (960, 1280, 3).

  2. After deriving landmarks and executing warpAffine, face_img: Python length 112, NumPy shape (112, 112, 3). This then repeated as it found a second face, again with Python length 112, NumPy shape (112, 112, 3).

  3. After running face_preprocessing, faces_prep: Python length 2, tensor size torch.Size([2, 3, 112, 112]).

  4. Arc model output arc_out: Python length 2, tensor size torch.Size([2, 512]).

  5. After L2 normalization, l2_norm(arc_out): Python length 2, tensor size torch.Size([2, 512]). This tensor is what gets returned (see next).

  6. emb = self.face_detector.detect_align(pic, img, priors)[0]. Here, emb has Python length 512, tensor size torch.Size([512]).

It seems like the dimensioning information is there to differentiate detection of a single face versus multiple faces.
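
Assuming detect_align really does return the [N, 512] embedding tensor, as the shape traces above suggest, a shape-based check along these lines would distinguish one face from several (just a sketch of the idea, not a confirmed fix):

    emb_all = self.face_detector.detect_align(pic, img, priors)   # assumed shape [N, 512]
    if emb_all.dim() == 2 and emb_all.shape[0] == 1:              # exactly one face found
        embs.append(emb_all[0])                                   # the single [512] embedding
    else:
        _LOGGER.error(person_img + " can't be used for training")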