We can continue the discussion here. I find your results very intriguing. I have been working on compiling OpenCL support into my OpenCV Docker image so I can make use of the Transparent API to speed things up. I am trying to see how far I can push the Intel chipset before I decide to get either the Coral stick or a dedicated GPU. I am of the view that OpenCV could be optimised even further once Intel VA can be compiled with it.
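For anyone else experimenting with the Transparent API, this is roughly what it looks like once OpenCV is built with OpenCL: wrap frames in a `cv2.UMat` and supported operations get offloaded automatically. A minimal sketch (the stream URL is just a placeholder):

```python
import cv2

# The T-API kicks in once OpenCL is available and enabled.
if cv2.ocl.haveOpenCL():
    cv2.ocl.setUseOpenCL(True)

cap = cv2.VideoCapture("rtsp://camera/stream")  # placeholder URL
ok, frame = cap.read()
if ok:
    # Wrapping the frame in a UMat lets OpenCV offload supported
    # operations (resize, color conversion, filtering...) to the GPU.
    u = cv2.UMat(frame)
    u = cv2.resize(u, (1280, 720))
    u = cv2.cvtColor(u, cv2.COLOR_BGR2GRAY)
    result = u.get()  # copy back to a regular numpy array when needed
```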
So I can provide some updates on this: I fixed what I thought was an OpenCV bug with a CUDA memory issue, and it turned out to be my own ignorance. I had to run the inference functions within an async wrapper to prevent memory conflicts, and I now have fully GPU-supported stream management and inference with OpenCV. I have also switched to a TensorFlow-framework MobileNetV3 model, which is much easier to code and… tada…
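For reference, the async wrapper is nothing fancy. A minimal sketch of the idea, assuming a blocking `infer(frame)` call (names here are illustrative): serialize all inference through a single-worker executor so concurrent streams can never touch the CUDA context at the same time.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# One worker means inference calls are serialized: no two streams
# can hit the GPU/CUDA context concurrently.
_infer_executor = ThreadPoolExecutor(max_workers=1)

def infer(frame):
    """Blocking OpenCV DNN inference - placeholder for the real call."""
    ...

async def async_infer(frame):
    loop = asyncio.get_running_loop()
    # Hand the blocking call to the dedicated executor so the event
    # loop (and the other streams) keep running while the GPU works.
    return await loop.run_in_executor(_infer_executor, infer, frame)
```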
I am now getting much lower CPU loading: basically only 3.5% per stream and 0% for inference.
On the GPU, the decoder load is 6% per stream and the inference adds an average of 3%.
Much better than watsor_gpu using FFmpeg, which was at 30% (vs. 3.5% now) on the CPU for decoding (still with the GPU) and 20 W per inference stream (vs. 0 W currently) for my 4 streams.
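If it helps anyone, the GPU-side decode path in OpenCV looks roughly like this. It's a sketch assuming an OpenCV build with CUDA and NVIDIA's video codec support (NVCUVID), and the stream URL is a placeholder:

```python
import cv2

# Requires an OpenCV build with CUDA + NVCUVID support.
reader = cv2.cudacodec.createVideoReader("rtsp://camera/stream")

while True:
    ok, gpu_frame = reader.nextFrame()  # frame stays in GPU memory (GpuMat)
    if not ok:
        break
    # Downstream processing can stay on the GPU too, e.g.:
    gray = cv2.cuda.cvtColor(gpu_frame, cv2.COLOR_BGRA2GRAY)
```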
The code is here, and you would have to download the MobileNetV3 model from the OpenCV GitHub site:
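In the meantime, the core of it is loading the model into OpenCV's DNN module with the CUDA backend, roughly like this (the filenames and the 320x320 input size are placeholders; match whatever the model zoo entry ships):

```python
import cv2

# Placeholder filenames: use the actual frozen graph/config from the model zoo.
net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb", "graph.pbtxt")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

def infer(frame):
    # Input size is an assumption; match what the model was exported with.
    blob = cv2.dnn.blobFromImage(frame, size=(320, 320), swapRB=True)
    net.setInput(blob)
    return net.forward()
```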
This is pretty amazing stuff, and the difference in efficiency compared to FFmpeg still surprises me. I will study your changes and see how I can integrate them into mine, though I don't have a dedicated GPU, so I'm not able to achieve that level of minimal CPU usage.
So I got to my final setup with all streams running, and I am now running the YOLOv4 model on OpenCV.
Comparing the 4 streams on which I had person detection set up on Watsor to my current setup, keeping in mind also that YOLO is more accurate than the Inception V2 model I previously used:
Watsor/FFmpeg/Inception V2/GPU:
Stream decoding: 15% CPU load per stream when downsized to 720p, 40% CPU load per stream at full size. ~2% GPU decoder load per stream.
Object detection: added ~10% CPU load per stream and 20 W consumption per stream on the GPU. The model uses a 300x300 input.
Home Assistant/OpenCV/YOLOv4/GPU:
Stream decoding: 3.5% CPU load per stream at full size. ~2% GPU decoder load per stream.
Object detection: added 8% CPU load per stream and 0.5 W consumption per stream on the GPU. The model uses a 608x608 input.
My old Watsor/FFmpeg setup was loading my CPU at 60% + 40% = 100% (one full thread) and added 80 W to my GPU power draw.
My new setup with OpenCV loads my CPU at 14% + 32% = 46% with under 2 W of GPU power draw, and again it is more accurate.
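For anyone who wants to reproduce the YOLOv4 side, OpenCV's detection-model wrapper keeps it compact. A minimal sketch, with placeholder paths for the weights and config (the 608x608 input matches the numbers above):

```python
import cv2

net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")  # placeholder paths
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

model = cv2.dnn_DetectionModel(net)
# 608x608 input, pixels scaled to [0,1], BGR->RGB swap as YOLO expects.
model.setInputParams(size=(608, 608), scale=1 / 255.0, swapRB=True)

def detect(frame, conf_threshold=0.4, nms_threshold=0.4):
    # Returns class ids, confidences, and boxes with NMS already applied.
    class_ids, confidences, boxes = model.detect(frame, conf_threshold, nms_threshold)
    return class_ids, confidences, boxes
```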
Well, not so final after all. I made some further updates today:
I added one more camera stream, for a total of 6 streams managed by OpenCV, and am running facial and object recognition, in both cases with my own modified components.
With the additional camera, Home Assistant CPU utilization went up to 76%, and I started getting bothered:
Before, I had a baseline of 8% + 11% (facial recognition stream) + the 46% I described in the previous post = 66%, so the extra stream only added about 10%, which is not inconsistent. But then I looked at the CPU load average from top and noticed it had skyrocketed to 3~3.5… I am only running 2 CPU threads on my VM, which means the CPU was overwhelmed, and that is why my GPU used so little additional power: it was waiting for the CPU to feed it.
Long story short, I looked at the Home Assistant code and found that the image processing component pulls its frames from the camera component in JPEG format, and the camera component streams MJPEG. That all makes sense for displaying in the UI, but it makes no sense for image processing. This is what each image frame goes through:
Camera stream in H.264 or H.265 format -> decoded by FFmpeg or OpenCV on the GPU -> raw -> converted by the CPU to a NumPy array -> encoded by the CPU to JPEG -> converted to bytes by the CPU -> decoded by the CPU back to a NumPy array for processing.
It’s pretty crazy. I therefore modified the camera component, the FFmpeg component, and image processing to add a “get raw image” function, and modified my dlib and OpenCV image processing components to eliminate the extra decoding steps. This is what it looks like now:
Camera stream in H.264 or H.265 -> decoded by the GPU to raw -> converted by the CPU to a NumPy array -> processing.
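To make the difference concrete, here is the round trip being eliminated, in plain OpenCV terms (the processing function and stream URL are stand-ins):

```python
import cv2
import numpy as np

def process(image):
    """Stand-in for the dlib/OpenCV image processing step."""
    ...

cap = cv2.VideoCapture("rtsp://camera/stream")  # placeholder URL
ok, frame = cap.read()

# Old path: raw frame -> JPEG bytes -> back to a NumPy array.
# Both the encode and the decode burn CPU for nothing.
ok, jpeg = cv2.imencode(".jpg", frame)
data = jpeg.tobytes()
decoded = cv2.imdecode(np.frombuffer(data, dtype=np.uint8), cv2.IMREAD_COLOR)
process(decoded)

# New "get raw image" path: hand the decoded frame straight to processing.
process(frame)
```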
These mods dropped my Home Assistant CPU utilization from 76% to 37% and the CPU load average from 3~3.5 to 0.6~1… but now my GPU is being fed a lot more frames, and its power consumption went up by 25 W. Still less than a third of Watsor, and with an additional stream…
Summary:
I will probably post these code change suggestions as GitHub pull requests.
These results are amazing. Unfortunately, I still don’t get this level of utilisation, as I have yet to figure out the GPU stuff on Intel in OpenCV.
But I’m glad you got this working; now I know I have work to do to get the most out of my system. I do use raw frames already, so I don’t need to stress over the JPEG compression.
Thanks for all of this.