[Share] Local Edge AI Object Detection (YOLO11, Face, Hand) on ESP32-S3 with ESPHome
Hi everyone,
I'm excited to share a custom ESPHome external component I've been developing: s3_vision.
This component allows you to run fully local, on-device Edge AI detection models (including YOLO11 with 80 COCO classes, pedestrian detection, hand detection, and human face detection) directly on a cheap ESP32-S3 camera board!
No external server, no cloud subscription, no Coral TPU needed — everything happens locally on the microcontroller, and it integrates natively with Home Assistant via MQTT.
Key Features
Multi-Model Support: Switch between YOLO11 (80 classes), Pedestrian, Hand, or Face detection with a single line in your YAML.
On-Device Overlays: Automatically draws real-time bounding boxes and labels directly onto the camera frames.
0% CPU Idle Mode: Includes vision.start and vision.stop actions. You can stop inference from a switch (e.g., when no motion is detected or at night), so the detection pipeline uses no CPU time while it is stopped.
Home Assistant Native Integration: Exposes triggers for detected object counts, text summaries, and base64 JPEG snapshots.
Hardware Requirements
| Item | Minimum | Recommended |
|---|---|---|
| MCU | ESP32-S3 (S2/C3/C6 not supported) | ESP32-S3 R8 (with Octal PSRAM) |
| Flash / PSRAM | 8 MB Flash / 2 MB Quad PSRAM | 16 MB Flash / 8 MB Octal PSRAM |
| Camera Module | DVP Parallel (OV2640, OV3660...) | OV2640 running at 320x240 |
Important: The camera module must be configured with pixel_format: RGB565. JPEG decoding is not used, so the camera has to deliver raw RGB565 frames, which are fed directly to the ESP-DL AI pipeline.
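To make the RGB565 requirement a bit more concrete, here is a minimal standalone C++ sketch (not taken from the component) of how large one raw frame is and how a 16-bit RGB565 pixel expands to 8-bit channels. The helper names `rgb565_frame_bytes` and `unpack_rgb565` are hypothetical, and the sketch assumes the pixel value is already in host byte order; a camera driver may deliver the two bytes swapped.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: bytes needed for one raw RGB565 frame (2 bytes per pixel).
constexpr std::size_t rgb565_frame_bytes(std::size_t width, std::size_t height) {
  return width * height * 2;
}

struct Rgb888 {
  std::uint8_t r;
  std::uint8_t g;
  std::uint8_t b;
};

// Hypothetical helper: expand one RGB565 pixel (RRRRRGGG GGGBBBBB) to 8 bits
// per channel. Assumes px is already in host byte order.
inline Rgb888 unpack_rgb565(std::uint16_t px) {
  const unsigned r5 = (px >> 11) & 0x1F;  // top 5 bits
  const unsigned g6 = (px >> 5) & 0x3F;   // middle 6 bits
  const unsigned b5 = px & 0x1F;          // bottom 5 bits
  Rgb888 out;
  // Replicate the high bits into the low bits so full scale maps to 255.
  out.r = static_cast<std::uint8_t>((r5 << 3) | (r5 >> 2));
  out.g = static_cast<std::uint8_t>((g6 << 2) | (g6 >> 4));
  out.b = static_cast<std::uint8_t>((b5 << 3) | (b5 >> 2));
  return out;
}
```

For the recommended 320x240 resolution, `rgb565_frame_bytes(320, 240)` comes out at 153,600 bytes per frame, which is why the frame buffer is placed in PSRAM.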
Tested & Verified Boards:
Waveshare ESP32-S3-SIM7670G-4G
ESP32-S3-DevKitC + OV2640 module
Freenove ESP32-S3 WROOM
Quick Setup Guide
1. Add the External Component
```yaml
external_components:
  - source:
      type: git
      url: https://github.com/youkorr/s3_vision
      ref: main
    components: [ vision ]
```
2. Configure the Camera
```yaml
esp32_camera:
  id: my_camera
  # ... your camera pins configuration ...
  resolution: 320x240
  pixel_format: RGB565 # MANDATORY
  frame_buffer_count: 1
  frame_buffer_location: PSRAM
```
3. Configure the Vision Component
Option A — Model embedded at build time (recommended):
```yaml
vision:
  id: my_vision
  esp32_camera_id: my_camera
  model_family: coco_detect
  model_path: ./coco_detect_yolo11n_320_s8_v3.espdl
  score_threshold: 0.30
  detection_interval_ms: 200
  draw_boxes: true
```
Option B — Model via jesserockz's file: component:
```yaml
external_components:
  - source:
      type: git
      url: https://github.com/youkorr/s3_vision
      ref: main
    components: [ vision ]
  - source: github://jesserockz/esphome-components@1b449c22e749933d11ca57c77d8303f851a817e1
    components: [file]
    refresh: 10s
```
```yaml
file:
  - id: model_coco_detect
    path: ./coco_detect_yolo11n_320_s8_v3.espdl

vision:
  id: my_vision
  esp32_camera_id: my_camera
  model_family: coco_detect
  model_id: model_coco_detect
  score_threshold: 0.30
  detection_interval_ms: 200
  draw_boxes: true
```
4. Add MQTT Triggers
```yaml
vision:
  on_detection_image:
    - then:
        - mqtt.publish:
            topic: device/${name}/camera/snapshot
            payload: !lambda 'return esphome::base64_encode(image.data, image.length);'
  on_object_detected:
    - then:
        - mqtt.publish:
            topic: device/${name}/detection/state
            payload: !lambda |-
              auto dets = id(my_vision).get_detections();
              std::string out = "{\"count\":";
              out += std::to_string(dets.size());
              out += ",\"objects\":[";
              for (size_t i = 0; i < dets.size(); i++) {
                if (i) out += ",";
                char buf[128];
                snprintf(buf, sizeof(buf),
                         "{\"class\":\"%s\",\"score\":%d}",
                         dets[i].label, (int)(dets[i].score * 100));
                out += buf;
              }
              out += "]}";
              return out;
```
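If you want to see what the detection payload looks like without flashing a board, here is a minimal desktop C++ sketch of the same string-building logic as the lambda above. The `Detection` struct is a hypothetical stand-in for whatever `get_detections()` really returns (I'm assuming a C-string label and a 0.0 to 1.0 score), and `build_detection_json` is just an illustrative name.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the component's detection record; the real type
// returned by get_detections() may differ.
struct Detection {
  const char *label;  // class name, e.g. "person"
  float score;        // confidence in the range 0.0 .. 1.0
};

// Builds {"count":N,"objects":[{"class":"...","score":NN},...]}
// using the same steps as the on_object_detected lambda.
std::string build_detection_json(const std::vector<Detection> &dets) {
  std::string out = "{\"count\":";
  out += std::to_string(dets.size());
  out += ",\"objects\":[";
  for (std::size_t i = 0; i < dets.size(); i++) {
    if (i) out += ",";  // comma between entries, none before the first
    char buf[128];
    // The int cast truncates toward zero, so the score is a whole percent.
    std::snprintf(buf, sizeof(buf), "{\"class\":\"%s\",\"score\":%d}",
                  dets[i].label, static_cast<int>(dets[i].score * 100));
    out += buf;
  }
  out += "]}";
  return out;
}
```

With two detections, `{"person", 0.5f}` and `{"dog", 0.25f}`, this returns `{"count":2,"objects":[{"class":"person","score":50},{"class":"dog","score":25}]}`, and an empty list gives `{"count":0,"objects":[]}`. The `count` field is what a Home Assistant `value_json.count` template would read.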
5. Home Assistant MQTT Entities (optional)
```yaml
# In your HA configuration.yaml
mqtt:
  sensor:
    - name: "Vision Object Count"
      state_topic: "device/my-vision-cam/detection/state"
      value_template: "{{ value_json.count }}"
  image:
    - name: "Vision AI Snapshot"
      image_topic: "device/my-vision-cam/camera/snapshot"
      content_type: "image/jpeg"
      image_encoding: "b64"
```
6. Runtime Enable/Disable Switch (optional)
```yaml
switch:
  - platform: template
    name: "${name} Vision Enabled"
    id: vision_enabled
    optimistic: true
    restore_mode: ALWAYS_ON
    turn_on_action:
      - vision.start: my_vision
    turn_off_action:
      - vision.stop: my_vision
```
Partition Table Note
Because the firmware bundles the Espressif ESP-DL framework and the AI models, the binary is about 7 MB. You must use a custom partition table (provided in the repository documentation) that allocates a larger factory app slot (e.g., 12 MB).
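As a quick sanity check on those numbers, here is a small C++ sketch of the flash budget arithmetic. The `AppSlot` struct, the helper names, and the `0x10000` offset are illustrative assumptions, not the layout from the repository's partition file; the one fixed rule used here is that ESP-IDF app partitions start on a 64 KB boundary.

```cpp
#include <cstdint>

constexpr std::uint32_t kMiB = 1024u * 1024u;

// Hypothetical description of a single app partition.
struct AppSlot {
  std::uint32_t offset;  // start address in flash
  std::uint32_t size;    // partition size in bytes
};

// App partitions must start on a 64 KB (0x10000) boundary.
constexpr bool is_aligned_64k(const AppSlot &s) {
  return s.offset % 0x10000u == 0;
}

// The slot must end at or before the end of the flash chip.
constexpr bool fits_in_flash(const AppSlot &s, std::uint32_t flash_bytes) {
  return static_cast<std::uint64_t>(s.offset) + s.size <= flash_bytes;
}

// The firmware image must not be larger than the slot.
constexpr bool holds_binary(const AppSlot &s, std::uint32_t binary_bytes) {
  return binary_bytes <= s.size;
}

// Assumed example: a 12 MB factory slot starting at 0x10000.
constexpr AppSlot kFactory{0x10000u, 12u * kMiB};

static_assert(is_aligned_64k(kFactory), "app slot must be 64 KB aligned");
static_assert(holds_binary(kFactory, 7u * kMiB), "a 7 MB image fits in 12 MB");
static_assert(fits_in_flash(kFactory, 16u * kMiB), "12 MB slot fits in 16 MB flash");
static_assert(!fits_in_flash(kFactory, 8u * kMiB), "12 MB slot cannot fit in 8 MB flash");
```

The last check is the practical takeaway: a 12 MB slot only works on a 16 MB flash board, so an 8 MB board needs a smaller app slot that still leaves room for the roughly 7 MB image.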
Repository & Full Documentation
Check out the full guide, custom partition files, and troubleshooting steps on the GitHub repository:
Full Code & Docs: https://github.com/youkorr/s3_vision
Feedback, issues, and ideas for Phase 2 (like adding pose estimation or face recognition) are very welcome! Let me know if you try it out on your S3 boards!