[Share] ESPHome Custom Component for Espressif Edge AI & YOLO11 on ESP32-S3

[Share] Local Edge AI Object Detection (YOLO11, Face, Hand) on ESP32-S3 with ESPHome

Hi everyone,

I'm excited to share a custom ESPHome external component I've been developing: s3_vision.

This component allows you to run fully local, on-device Edge AI detection models (including YOLO11 with 80 COCO classes, pedestrian detection, hand detection, and human face detection) directly on a cheap ESP32-S3 camera board!

No external server, no cloud subscription, no Coral TPU needed — everything happens locally on the microcontroller, and it integrates natively with Home Assistant via MQTT.

Key Features
- Multi-Model Support: Switch between YOLO11 (80 classes), Pedestrian, Hand, or Face detection with a single line in your YAML.
- On-Device Overlays: Automatically draws real-time bounding boxes and labels directly onto the camera frames.
- 0% CPU Idle Mode: Includes vision.start and vision.stop actions. You can disable inference via a switch (e.g., when no motion is detected or at night) to drop CPU usage to zero instantly.
- Home Assistant Native Integration: Exposes triggers for detected object counts, text summaries, and base64 JPEG snapshots.

Hardware Requirements

| Item | Minimum | Recommended |
| --- | --- | --- |
| MCU | ESP32-S3 (S2/C3/C6 not supported) | ESP32-S3 R8 (with Octal PSRAM) |
| Flash / PSRAM | 8 MB Flash / 2 MB Quad PSRAM | 16 MB Flash / 8 MB Octal PSRAM |
| Camera Module | DVP Parallel (OV2640, OV3660...) | OV2640 running at 320x240 |

Important: The camera module must be configured with pixel_format: RGB565. Hardware JPEG decoding is not used, so raw frames are fed directly to the ESP-DL AI pipeline.

Tested & Verified Boards:

- Waveshare ESP32-S3-SIM7670G-4G
- ESP32-S3-DevKitC + OV2640 module
- Freenove ESP32-S3 WROOM

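As a rough guide, the "Recommended" column of the hardware table might translate into ESPHome's `esp32:` and `psram:` blocks as sketched below. The board id, the ESP-IDF framework choice (ESP-DL is distributed as an ESP-IDF component) and the 80 MHz PSRAM speed are assumptions, so check your module's datasheet and the repository docs before copying them.

```yaml
# Sketch for an ESP32-S3 module with 16 MB flash and 8 MB Octal PSRAM.
esp32:
  board: esp32-s3-devkitc-1   # assumption: use the board id that matches your hardware
  variant: esp32s3
  flash_size: 16MB
  framework:
    type: esp-idf             # assumption: ESP-DL targets ESP-IDF

psram:
  mode: octal                 # use "quad" on modules with 2 MB Quad PSRAM
  speed: 80MHz                # assumption: a common setting for S3 modules
```
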
Quick Setup Guide

1. Add the External Component
```yaml
external_components:
  - source:
      type: git
      url: https://github.com/youkorr/s3_vision
      ref: main
    components: [ vision ]
```
2. Configure the Camera
```yaml
esp32_camera:
  id: my_camera
  # ... your camera pins configuration ...
  resolution: 320x240
  pixel_format: RGB565  # MANDATORY
  frame_buffer_count: 1
  frame_buffer_location: PSRAM
```
3. Configure the Vision Component
Option A — Model embedded at build time (recommended):

```yaml
vision:
  id: my_vision
  esp32_camera_id: my_camera
  model_family: coco_detect
  model_path: ./coco_detect_yolo11n_320_s8_v3.espdl
  score_threshold: 0.30
  detection_interval_ms: 200
  draw_boxes: true
```
Option B — Model via jesserockz's file: component:

```yaml
external_components:
  - source:
      type: git
      url: https://github.com/youkorr/s3_vision
      ref: main
    components: [ vision ]
  - source: github://jesserockz/esphome-components@1b449c22e749933d11ca57c77d8303f851a817e1
    components: [file]
    refresh: 10s

file:
  - id: model_coco_detect
    path: ./coco_detect_yolo11n_320_s8_v3.espdl

vision:
  id: my_vision
  esp32_camera_id: my_camera
  model_family: coco_detect
  model_id: model_coco_detect
  score_threshold: 0.30
  detection_interval_ms: 200
  draw_boxes: true
```
4. Add MQTT Triggers
```yaml
vision:
  on_detection_image:
    - then:
        - mqtt.publish:
            topic: device/${name}/camera/snapshot
            payload: !lambda 'return esphome::base64_encode(image.data, image.length);'
  on_object_detected:
    - then:
        - mqtt.publish:
            topic: device/${name}/detection/state
            payload: !lambda |-
              auto dets = id(my_vision).get_detections();
              std::string out = "{\"count\":";
              out += std::to_string(dets.size());
              out += ",\"objects\":[";
              for (size_t i = 0; i < dets.size(); i++) {
                if (i) out += ",";
                char buf[128];
                snprintf(buf, sizeof(buf),
                  "{\"class\":\"%s\",\"score\":%d}",
                  dets[i].label, (int)(dets[i].score * 100));
                out += buf;
              }
              out += "]}";
              return out;
```
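The triggers in step 4 rely on an `mqtt:` component and a `name` substitution, neither of which appears in the snippets here. A minimal sketch of what they might look like follows; the broker address and secret names are placeholders, and `my-vision-cam` is chosen only so that `${name}` expands to the topics used in the Home Assistant example in step 5.

```yaml
substitutions:
  name: my-vision-cam             # expands ${name} in the MQTT topics

esphome:
  name: ${name}

mqtt:
  broker: 192.168.1.10            # placeholder: your broker's address
  username: !secret mqtt_user     # placeholder secret names
  password: !secret mqtt_password
```

If you pick a different `name`, the topics on the Home Assistant side need to change with it.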
5. Home Assistant MQTT Entities (optional)
```yaml
# In your HA configuration.yaml
mqtt:
  sensor:
    - name: "Vision Object Count"
      state_topic: "device/my-vision-cam/detection/state"
      value_template: "{{ value_json.count }}"
  image:
    - name: "Vision AI Snapshot"
      image_topic: "device/my-vision-cam/camera/snapshot"
      content_type: "image/jpeg"
      image_encoding: "b64"
```
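The JSON published on the detection topic also carries per-object fields, so further entities can be derived from it. Here is a sketch, assuming the payload shape built by the lambda in step 4 (`count`, plus `objects` entries with `class` and `score` as a percentage). The sensor name is arbitrary, and the first array entry is not necessarily the highest-scoring detection.

```yaml
# In your HA configuration.yaml, alongside the existing mqtt: sensor entries
mqtt:
  sensor:
    - name: "Vision First Object"   # hypothetical entity name
      state_topic: "device/my-vision-cam/detection/state"
      value_template: >-
        {% if value_json.count | int(0) > 0 %}
          {{ value_json.objects[0]['class'] }} ({{ value_json.objects[0]['score'] }}%)
        {% else %}
          none
        {% endif %}
```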
6. Runtime Enable/Disable Switch (optional)
```yaml
switch:
  - platform: template
    name: "${name} Vision Enabled"
    id: vision_enabled
    optimistic: true
    restore_mode: ALWAYS_ON
    turn_on_action:
      - vision.start: my_vision
    turn_off_action:
      - vision.stop: my_vision
```
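The same `vision.start` and `vision.stop` actions can also be driven by a motion sensor rather than a manual switch, which is the "no motion detected" case mentioned under Key Features. A sketch, assuming a PIR sensor wired to a free GPIO; the pin number and the 30 s hold time are placeholders, and the pin must not clash with your camera pins.

```yaml
binary_sensor:
  - platform: gpio
    pin: GPIO4                  # hypothetical PIR pin; change to match your wiring
    name: "${name} Motion"
    device_class: motion
    filters:
      - delayed_off: 30s        # keep inference running briefly after motion ends
    on_press:
      - vision.start: my_vision
    on_release:
      - vision.stop: my_vision
```

Because the template switch above is optimistic, its state will not follow these automatic start/stop calls. Calling `switch.turn_on` / `switch.turn_off` on `vision_enabled` instead would keep the two in sync.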
Partition Table Note
Because the firmware bundles the Espressif ESP-DL framework and the AI models, the binary is about 7 MB. You must use a custom partition table (provided in the repository documentation) that allocates a larger factory app slot (e.g., 12 MB).
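In ESPHome, a custom partition table is normally selected through the `partitions` option of the `esp32:` block. A sketch, assuming the CSV from the repository has been copied next to your YAML file; the filename here is hypothetical, and these keys should be merged into your existing `esp32:` block rather than added as a second one.

```yaml
esp32:
  flash_size: 16MB             # a 12 MB app slot only fits on a 16 MB flash chip
  partitions: partitions.csv   # hypothetical filename; use the table from the repository
```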

Repository & Full Documentation
Check out the full guide, custom partition files, and troubleshooting steps on the GitHub repository:

Full Code & Docs: https://github.com/youkorr/s3_vision

Feedback, issues, and ideas for Phase 2 (like adding pose estimation or face recognition) are very welcome! Let me know if you test it on your S3 boards!

Your work is amazing! A huge thumbs up :+1:

Thanks to SeByDocKy (e-2-nomy), who owns the Waveshare ESP32-S3-SIM7670G-4G and tested the component on it.