Local realtime person detection for RTSP cameras

Also, does anyone know if the take_frame parameter still works? I’ve been trying it under ffmpeg: at the same level as fps, but I get an invalid field exception. I guess it would be useful for cams like Wyze that can’t throttle fps.

Thanks.

I’ve been looking at the same thing, and it kinda makes sense why it may not be there anymore. Though I did have some questions about how the fps parameter works underneath.

So the problem with setting an fps parameter or a take_frame parameter comes down to how the h264 stream is set up. Bear in mind, I am by no means an expert here, but some cameras expose an iframe interval setting in their h264 configuration (some may not let you configure it). This matters because h264 stores a full iframe and then only the differences between frames to construct the frames in between iframes.

As an example, if you set your iframe interval to 20 and your fps is 20, then every 20th frame will be a new iframe. This is great for saving maximum bandwidth, but it means that if you want to decode frame 15 of that stream, you need to decode the first frame (the iframe) and then the next 14 frames in sequence to build that 15th frame.

So if you say from the client side “I only want 5 fps” but the camera’s frame rate is 20 and the iframe interval is 20, you have to decode essentially the entire stream anyway: to get 5 frames per second as output, you still have to decode all 20 frames because of the iframe interval.

Funnily enough, I have been experimenting with this recently as a way of using my high-resolution stream as the source for object detection, as in a few cases I have missed hits because the object was quite far away (think wide-angle CCTV cameras). The low-res stream doesn’t have enough resolution to detect objects that are far away, but the high-resolution stream does. The problem is I don’t really want to decode the entire high-res stream at full frame rate, as it absolutely slaughters the GPU decoder (Intel Quick Sync) trying to decode 12 cameras at full 2K resolution, compared to using the low-res streams.

I was actually about to write a full post asking for more information about how the fps flag is handled internally, so this is probably a good time. My thought process was: can I use an iframe interval of, say, 4 with a camera frame rate of 20, and then set the fps parameter in Frigate to 5 to effectively get my 5 fps decode rate without the GPU/CPU hit? ffmpeg/Frigate would then only need to decode one full frame every 4 frames to give me a motion detection stream at 5 fps, while the full stream gets recorded at 20 fps in BlueIris (I am using Node-RED to watch Frigate’s object detection MQTT stream and tell BlueIris when to record).


The fps parameter is very simple. It just passes a parameter to ffmpeg to specify the output frame rate. It replaced take_frame because it is more flexible and drops frames further upstream.
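For anyone searching later, here is a minimal sketch of where fps sits in a recent (0.8-style) config, if I have the layout right; the camera name and stream URL are placeholders:

    cameras:
      front:                                            # placeholder camera name
        ffmpeg:
          inputs:
            - path: rtsp://user:pass@camera-ip:554/sub  # placeholder URL
              roles:
                - detect
        width: 640
        height: 360
        fps: 5      # passed through to ffmpeg to drop the output to 5 fps before detection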

Your understanding of h264 is correct as far as I know. You can tell ffmpeg to skip decoding for everything except iframes with the skip_frame parameter. This should dramatically reduce the resources required to decode, because it just ignores the differential frames in between. My Dahua cameras do not let me set an iframe interval lower than the fps, so that would cap my detection rate at 1 fps. Also keep in mind that more iframes mean less efficient compression, so your mp4 files will be larger. I wish there were cameras that supported raw uncompressed image data.
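To make that concrete, a rough sketch of what the skip_frame approach could look like, assuming input_args can be overridden at the camera level; note that overriding input_args replaces the defaults, so the usual RTSP/timestamp flags would need to be carried over as well:

    cameras:
      front:                                             # placeholder camera name
        ffmpeg:
          input_args:
            - -skip_frame                                # decode keyframes (iframes) only
            - nokey
            # ...plus the default input args, which this override replaces
          inputs:
            - path: rtsp://user:pass@camera-ip:554/main  # placeholder URL
              roles:
                - detect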


Cheers for the detailed response; I will have a bit more of a play tomorrow when I have more time. You are absolutely correct about the image quality: lowering the iframe interval vastly decreased the quality of the image. For example, moving from an iframe interval of 20 at a framerate of 20 to an iframe interval of 4 at a framerate of 20 needed 4-5x the bandwidth to get the same image quality. So it is definitely a trade-off, but worth investigating to find a balance between being able to process the motion and being able to store it, as ultimately I can post-process the stored data down to a smaller file size during idle times.

Another area I had been playing with is H264 SVC, which from my understanding is meant to do away with the need for multiple streams at lower resolutions: you have a single stream with layers that add resolution or quality. Getting clear documentation from the CCTV vendors on what exactly this means is proving to be a bit of a pain, as is making sure it is actually supported by the Quick Sync hardware decoding, since there would be no point if I lose hardware acceleration. From the details I have seen, vendor support is very much a mixed bag, with some implementing it only for framerate, but it is interesting to play with.

Complexity generally increases the amount of processing and CPU usage. My gut says H264 SVC is a dead end.

Since this is a MONSTER thread and I really would like some actual help, I will open a separate thread and ask if I can please get some help on-topic there…

Possibly. From what I have found so far it is not so much a CPU overhead as a bandwidth overhead, so it is a larger stream from the start rather than, say, two separate streams. Granted, it still relies on both the camera supporting it (dubious at best) and the decoder properly understanding it, which is the next main issue: how to effectively tell the ffmpeg/decoder process what to pull from the stream. It is something that in theory my cameras support, and ffmpeg also in theory has some level of support for. Whether it yields anything useful, only time will tell; at the very least I can manipulate the iframe intervals to get things working well in the current form, then play with SVC and see if it brings any benefits.

Hi @claymen,

I’m trying to follow your explanation. Your suggestion is to reduce the iframe interval to the minimum the camera allows while keeping a higher frame rate, is that correct?

Quick question for those who have the Coral USB accelerator: what do you typically see for inference speeds? Mine is currently between 80 and 150 ms as shown on the debug page. I was expecting it to be much faster than that, because regular CPU detection (NUC, 8th gen i5) seems to have consistent speeds between 50 and 80 ms. I get the following lines on startup, so I am sure that the Coral device is being utilized.

detector.coral INFO : Starting detection process: 37
frigate.edgetpu INFO : Attempting to load TPU as usb
frigate.edgetpu INFO : TPU found
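For context, those lines are consistent with the detector block in my config, which looks roughly like this (the detector name is just what I called it):

    detectors:
      coral:
        type: edgetpu
        device: usb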

Are you running in a VM?

Yes. I’m running Ubuntu 20.10 as my host OS and HASSOS in a KVM setup through virt-manager.

Does a VM matter? Maybe I missed something in the documentation. Protection mode is set to off.

Edit: Just wanted to add how much I appreciate all the work you put into this project, blakeblackshear. This is easily my most used integration in Home Assistant.

There are industrial machine vision cameras that will give you a raw stream, e.g. FLIR GigE or Basler GigE. These are more expensive than a security camera and you need to provide your own lens and housing.

It’s almost certainly because you are running in a VM. I would recommend trying to run directly on the host with docker and comparing speeds. The documentation definitely advises against running in a VM.


This is the export:

[{"id":"90672d10.e9562","type":"mqtt in","z":"cc80fdd9.36949","name":"","topic":"frigate/porch/person/snapshot","qos":"2","datatype":"auto","broker":"da7b9fe8.ef7f","x":260,"y":180,"wires":[["9bf4f457.8e5e08"]]},{"id":"2006a8fa.a2a618","type":"file","z":"cc80fdd9.36949","name":"","filename":"","appendNewline":false,"createDir":true,"overwriteFile":"false","encoding":"none","x":1000,"y":180,"wires":[[]]},{"id":"9bf4f457.8e5e08","type":"change","z":"cc80fdd9.36949","name":"","rules":[{"t":"set","p":"filename","pt":"msg","to":"","tot":"date"}],"action":"","property":"","from":"","to":"","reg":false,"x":600,"y":180,"wires":[["94353c5f.fa451"]]},{"id":"94353c5f.fa451","type":"function","z":"cc80fdd9.36949","name":"","func":"msg.filename = \"/config/frigatesnaps/\" + msg.filename + \".jpg\";\nreturn msg;","outputs":1,"noerr":0,"initialize":"","finalize":"","x":830,"y":180,"wires":[["2006a8fa.a2a618"]]},{"id":"da7b9fe8.ef7f","type":"mqtt-broker","name":"mqtt","broker":"192.168.1.180","port":"1883","clientid":"","usetls":false,"compatmode":false,"keepalive":"60","cleansession":true,"birthTopic":"","birthQos":"0","birthPayload":"","closeTopic":"","closeQos":"0","closePayload":"","willTopic":"","willQos":"0","willPayload":""}]

It can’t hurt to try this, I guess, but there has to be a more elegant solution (this is just a quick test method; I’m new to Node-RED myself).

I added the set msg.filename step to give each image a unique name rather than having the file overwritten each time. The frigatesnaps folder from the test now has hundreds of person images from my porch camera.

Hi, just installed Frigate, looks awesome!

But I would like to get some insight into how these values work:

    objects:
      track:
        - person
      filters:
        person:
          min_area: 5000
          max_area: 100000
          min_score: 0.5
          threshold: 0.6

Like, I have a camera up under the roof pointing towards the gate.

          roles:
            - detect

… are set on the substream of the cam, with

    width: 640
    height: 360

When testing, I got the confirmation/best image when the person was almost at the building, not when they were further away. How do min_area/max_area work in combination with the resolution of the substream?

The images show a header with person: 82% 12054; what is the 12054 value?

The attached image shows an example. It never detects and scores a person when the person is outside the green frame I added manually, like when they are coming in from the gate. It finally scores when the person comes in from the side and is just about to leave the image at the lower left or lower right…

So, basically, how can I make Frigate detect the person a bit further away?
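From reading the docs, my understanding (which may be wrong) is that min_area/max_area are compared against the bounding box area in pixels of the detect stream, and that the 12054 in the header is that area. At 640x360 a person near the gate may simply fall below min_area: 5000, so I was thinking of trying something like this (the numbers are just a guess to illustrate the idea, not tuned values):

    objects:
      track:
        - person
      filters:
        person:
          min_area: 1500      # lower threshold so smaller (more distant) people are not filtered out
          max_area: 100000
          min_score: 0.5
          threshold: 0.6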

Sort of; it depends on what your goal is. The issue I have is that decoding 12 x 2688x1520 streams is a lot of processing overhead, especially as motion detection then has to run on what is effectively 240 fps of higher-resolution video data. I prefer having a higher frame rate for CCTV recording, as it has yielded better footage when there have been incidents in the street; I was able to pull plates off a moving car because there were more frames to pick from than there would have been at a lower rate.

In the past this has been fine, as the actual motion detection has run off the secondary stream, which is a much lower resolution 640x360. But with the AI detection I have found that it doesn’t do as well picking up people who are further away, due to far fewer pixels to work from. Feeding it all 12 streams at full resolution absolutely hammers the processor even with Quick Sync, as there is still motion detection that has to occur.

The problem is you can’t simply run the 20 fps stream at 5 fps if your iframe interval is quite high, which is often the case to reduce bandwidth or to improve image quality at a given bandwidth. In my example of a 20 fps stream with an iframe interval of 20, there are 19 frames that all rely on that iframe, so you can’t just decode 5 fps: the iframe plus the 19 frames after it have to be processed one after the other. This is how h264 and other codecs reduce the size of the video data, by effectively storing the incremental changes between each frame and the iframe.

So the theory is that you can reduce your iframe interval, in my case going from an interval of 20 at 20 fps to an interval of 4 at 20 fps, and then tell ffmpeg to decode only iframes, which results in an effective 5 fps to process. This means the motion/AI detector has far less work to deal with, 60 fps total across the 12 cameras. When it does detect a person in those frames, I can then use MQTT with Node-RED to trigger BlueIris to start recording at the full frame rate (direct to disc).

In a perfect world, the camera would let me run two high-res streams at different frame rates: one for motion detection at a lower frame rate, and one at the higher/full rate. But on the cameras I have, that isn’t an option, though it may be possible with SVC, which has its own set of complications.

Bear in mind this is a very simplified view of it; this article has more info and some images which help to explain it:
https://www.securityinfowatch.com/video-surveillance/article/21124160/real-words-or-buzzwords-h264-and-iframes-pframes-and-bframes-part-2


You may be having the same issue I am, and my understanding is that there just aren’t enough pixels at that distance for it to reliably determine it’s a person. I found that after swapping to the primary high-resolution stream it would instantly pick up people at a similar sort of distance, hence why my line of thinking has been to throw more resolution at looking for people.

Have you tested running it against the high resolution stream?
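If it helps, this is roughly the shape of what I have been testing: assign the detect role to the full-resolution main stream and set width/height to match it. The camera name, URL, and resolution are placeholders, and decoding the full-resolution stream is exactly what hammers the GPU, as discussed above:

    cameras:
      gate:                                              # placeholder camera name
        ffmpeg:
          inputs:
            - path: rtsp://user:pass@camera-ip:554/main  # full-resolution main stream
              roles:
                - detect
        width: 2688     # must match the resolution of the stream used for detect
        height: 1520
        fps: 5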

After reading your comment again and looking at the documentation, maybe I need to clarify. I installed Frigate as an add-on, so inside my HA VM it is set up as a docker container. Is it OK for Home Assistant to be installed as a VM with Frigate as a docker container in HA without impacting Coral inference speed?

It will work, but you are running a docker container inside a VM that way. I suspect that is impacting the performance of the Coral. I would try running the frigate docker container on the host directly and comparing.
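Something along these lines would do it, as a rough sketch; the paths, password, and image tag are placeholders to adjust for your platform, and the USB Coral needs /dev/bus/usb passed through:

    version: "3.9"
    services:
      frigate:
        image: blakeblackshear/frigate:stable-amd64   # placeholder tag, pick the one for your platform
        devices:
          - /dev/bus/usb:/dev/bus/usb                 # USB Coral passthrough
        volumes:
          - /path/to/config.yml:/config/config.yml:ro
          - /path/to/media:/media/frigate
        ports:
          - "5000:5000"
        environment:
          FRIGATE_RTSP_PASSWORD: "changeme"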

Ah, I see now. I’ll give that a shot and compare performance then.