Local realtime person detection for RTSP cameras

Yep, and it works perfectly with my other Unifi cameras, across multiple zones.

Huh. I'm not really sure then. Maybe the Dafang implementation is buggy?

Yeah, I'll have to check that GitHub repo and see if there are any discussions on the frame rate. Either way, hopefully your PR solves the problem :smiley:

I added a Ko-fi link for anyone who wants to contribute to the project because several of you asked for it. Thanks for your support.

@cjackson234 @scstraus @Kyle


I seem to be getting notifications for very small regions when my min_person_area is something absurd like 2500.

Is this conditional backwards? I'm not sure if I'm understanding it correctly, but wouldn't it want to be flipped to only exit the loop and continue if it's not smaller?

Also, is there a chance the detected area could be added to the label here, to help with debugging person sizes?

The conditional looks right. It should continue at that point to avoid reporting a person that is too small. Are you sure you have the minimum set on the correct region?
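
For reference, the intent of the check is roughly the following (a minimal sketch with illustrative names, not the actual Frigate code): skip any detection whose bounding-box area is below the region's minimum, and report the rest.

```python
# Minimal sketch of the intended size filter (illustrative, not the actual
# Frigate source). Detections smaller than the region's minimum area are
# skipped so only sufficiently large people get reported.

def filter_detections(detections, min_person_area=2500):
    """Keep only detections whose bounding-box area meets the minimum."""
    kept = []
    for det in detections:
        area = (det['xmax'] - det['xmin']) * (det['ymax'] - det['ymin'])
        if area < min_person_area:
            continue  # too small: skip it instead of reporting it
        kept.append(det)
    return kept

# A 40x50 box (area 2000) is dropped; a 60x60 box (area 3600) is kept.
print(filter_detections([
    {'xmin': 0, 'ymin': 0, 'xmax': 40, 'ymax': 50},
    {'xmin': 0, 'ymin': 0, 'xmax': 60, 'ymax': 60},
]))
```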

Adding the person area to the label should be simple enough. It will help me keep track of that request if you create a github issue for it.
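
Until that lands, here is a rough idea of what including the area in the drawn label could look like (an OpenCV sketch with made-up variable names, not the project's actual drawing code):

```python
import cv2
import numpy as np

def draw_detection(frame, det, score):
    """Draw a bounding box whose label includes the box area, as a debugging aid."""
    xmin, ymin, xmax, ymax = det['xmin'], det['ymin'], det['xmax'], det['ymax']
    area = (xmax - xmin) * (ymax - ymin)
    label = f"person {score:.0%} area={area}"
    cv2.rectangle(frame, (xmin, ymin), (xmax, ymax), (0, 255, 0), 2)
    cv2.putText(frame, label, (xmin, max(ymin - 5, 10)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    return frame

# Example on a blank frame.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
draw_detection(frame, {'xmin': 100, 'ymin': 120, 'xmax': 180, 'ymax': 300}, 0.87)
```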

Thanks! I've submitted 2 issues. One for the enhancement, and one for the issue I'm encountering. Sorry for adding so many lately.

It has my config listed, which I believe is correct? https://github.com/blakeblackshear/frigate/issues/43

What is "Unable to grab frame" indicative of? I was able to solve my queue full problem, but now, after about 15-60 seconds, I get a blast of about 25 lines of "Unable to grab frame". Is it an FPS problem or a bitrate problem?

That means OpenCV returned an error code for a single frame. It could be anything, but as long as it clears up on its own, I would just ignore it. Probably just a few corrupt frames.

Hmm, well it usually results in it terminating the capture process and restarting on that RTSP feed. Now I'm wondering if it's just a poor-quality feed coming from the Wyze camera.

That would be my guess. It's doing exactly what it should when the RTSP feed is failing. After a certain number of failed frames, it establishes a completely new capture process.
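
In rough terms, the recovery logic is: count consecutive read failures and rebuild the capture once they pass a threshold. A simplified single-process sketch (the real project restarts a separate capture process, and the threshold and URL here are assumptions):

```python
import time
import cv2

RTSP_URL = "rtsp://camera.local/stream"  # placeholder URL
MAX_FAILED_FRAMES = 25                   # threshold is an assumption

def capture_loop(url=RTSP_URL):
    cap = cv2.VideoCapture(url)
    failed = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            failed += 1
            print("Unable to grab frame")
            if failed >= MAX_FAILED_FRAMES:
                # Too many consecutive failures: tear the capture down
                # and establish a brand new connection.
                cap.release()
                time.sleep(2)
                cap = cv2.VideoCapture(url)
                failed = 0
            continue
        failed = 0
        # ... hand the frame off to detection here ...
```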

I know you responded to this yesterday, but I think in its current form the conditional is behaving as a filter on the largest allowed size.

Just as a tip in case anyone needs it in the future, I was able to get my Dafang Wyze camera to work using the FixedQP video format at 1600x900, a 5000 bitrate, and 15fps, with motion detection and audio turned off. With these settings, the feed has only corrupted/terminated 4 times in 24 hours, which is a massive improvement. For my Unifi cameras, I'm using the 1280 resolution with 5fps. Thanks!


Great project.

@blakeblackshear have you looked at feeding streaming video packets (h264) directly into a TF model? I wonder if the camera's built-in hardware compression to h264 could be leveraged to reduce processing time and cut CPU and memory requirements.

1. If there is little change between frames, there wouldn't be much new data for the TF model to process. Only some of the DNN parameters would have to be recalculated.
2. If the TF model could process the media stream directly, there wouldn't be a need for the intermediate step of taking image snapshots from the video feed. This could save CPU and memory resources.
3. Latency and packet loss. If there are occasional packet delays or losses, the TF model could potentially learn about them and correct for these errors. Denoising is an area of much recent progress in DNNs.
4. Correction for low light (night time or foggy weather). With more compute resources available, it might be possible to apply additional DNN layers to improve detection in low-light images.

I haven't thought of that, but it would require training a completely new model from scratch as the shape of the data would look completely different. The challenge is probably that all the good training data is currently images. Have you seen anyone do this?

Even if it were possible, I'm not sure the ROI would be very high. Detection currently takes <1s, and with hardware-accelerated h264 decoding the CPU usage is much lower as well.

Apparently I can't buy or ship a Coral to Australia. WTF???
Anyone want to buy one for me? I'll pay you back through PayPal.

I have 7 cameras and would ideally like to do 11 regions (or more) in total. I can't seem to get even 5 cameras to be stable without filling the queue. I have them turned down to 4 FPS, and I'm not sure how else I can optimize.

Using Windows (or OSX) computers, with VirtualBox (USB3 passthrough) to share the USB device with the VM… Hmmm.

I would start by trying to see what your inference times are. If you are under 10ms (I am averaging 7-8ms), you should be able to do 100 regions per second. With 7 cameras at 4fps, you should be able to do 3 regions per camera and still stay under 100 per second. If not, there is probably some overhead with the way VirtualBox passes the USB through, or another reason your machine may not be able to max out the USB speed of the Coral.
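
To make that arithmetic explicit, a quick back-of-the-envelope check (the 10ms figure is the worst-case assumption from above):

```python
# Back-of-the-envelope throughput check using the numbers above.
inference_ms = 10                              # assumed worst-case per-region time
budget = 1000 / inference_ms                   # ~100 regions per second

cameras = 7
fps = 4
regions_per_camera = 3
needed = cameras * fps * regions_per_camera    # 7 * 4 * 3 = 84 regions per second

print(f"budget: {budget:.0f} regions/s, needed: {needed} regions/s")
print("fits within budget" if needed <= budget else "over budget")
```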

Thanks. Perhaps a stupid question but how can I check my inference times?

I wrote a quick benchmarking script and included it in the repo here: https://github.com/blakeblackshear/frigate/blob/8218ea569974b83a470b40bf684319a9cac5b05f/benchmark.py

It is not in any of the published Docker images, so you will need to check out the repo and build it yourself. Not sure how technical you are.
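
For anyone who wants a quick sanity check without building the image, the idea behind such a benchmark is simply to time repeated invocations of the model on the Coral. This sketch is not the repo's benchmark.py; it uses the tflite_runtime Edge TPU delegate, and the model filename is an assumption:

```python
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter, load_delegate

MODEL_PATH = "mobilenet_ssd_v2_coco_quant_postprocess_edgetpu.tflite"  # assumed filename

# Load the model onto the Edge TPU via the libedgetpu delegate.
interpreter = Interpreter(
    model_path=MODEL_PATH,
    experimental_delegates=[load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]

# Random input matching the model's expected shape and dtype.
dummy = np.random.randint(0, 256, size=input_details['shape'],
                          dtype=input_details['dtype'])

# Time 100 invocations and report the average.
times = []
for _ in range(100):
    start = time.perf_counter()
    interpreter.set_tensor(input_details['index'], dummy)
    interpreter.invoke()
    times.append(time.perf_counter() - start)

print(f"average inference: {1000 * sum(times) / len(times):.1f} ms")
```

If the average comes back well above 10ms, that points to the USB passthrough (or the host's USB bus) as the bottleneck rather than the number of regions.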