Hi Andy,
I’m quite new to HA, but I have implemented a solution that matches your requirements.
My solution is based on the doorpi project and I have been using it for two months now.
I run a Raspberry Pi 3 with a PiFace add-on board for connecting the doorbell button, the audio on/off switch, a 3.2-inch Nextion display, and IR LEDs (for night vision). The camera is a Pi NoIR camera fitted with a fisheye lens.
Motion detection is handled by the “motion” software and video streaming by mjpeg_streamer, while the main doorpi software manages events and establishes the SIP call.
When motion is detected, the LED lighting of the button and the Nextion are switched on for one minute.
If a visitor presses the button, a Pushbullet message including a snapshot is sent.
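The one-minute hold can be sketched roughly as below. This is not doorpi's own code, just a minimal illustration: the class name `HoldOnTimer` is made up, and the assumption that a new motion event restarts the minute is mine, not something stated above.

```python
import time


class HoldOnTimer:
    """Keeps an output on for a fixed period after the most recent trigger."""

    def __init__(self, hold_seconds=60.0, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self._clock = clock      # injectable so the logic can be tested
        self._off_at = None      # None means no motion has been seen yet

    def trigger(self):
        """Call on every motion event; restarts the hold period (assumption)."""
        self._off_at = self._clock() + self.hold_seconds

    def is_on(self):
        """True while the hold period after the last trigger is still running."""
        return self._off_at is not None and self._clock() < self._off_at
```

In a polling loop you would call `trigger()` from the motion event handler and write the result of `is_on()` to the output that drives the button LED and the display.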
I experimented with Google Cloud Vision to detect parcel services, but this turned out to be harder than expected: the response times are acceptable, but the results are not reliable. You also have to keep in mind the legal requirements, which depend on where you live.
My SIP server is a Fritzbox (which has built-in SIP support) and my client is a Fritzfon C5, but any other SIP client should work. It should also work with Asterisk as the SIP server.
All status updates are sent to HA via MQTT.
I have also built in a convenience feature that plays a standard message, “Coming to the door”, when I cannot answer the call on the Fritzfon. This message can be triggered from anywhere in the house using an Alexa routine.
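As a rough idea of what such a status update could look like, here is a small sketch. The topic prefix `doorpi`, the event names, and the JSON fields are all hypothetical and not taken from my actual setup; `client` is assumed to be any object with a paho-mqtt style `publish(topic, payload, qos, retain)` method.

```python
import json
import time

BASE_TOPIC = "doorpi"  # hypothetical topic prefix


def build_status(event, state, timestamp=None):
    """Return (topic, payload) for one status update."""
    if timestamp is None:
        timestamp = time.time()
    topic = f"{BASE_TOPIC}/{event}"
    payload = json.dumps({"state": state, "ts": int(timestamp)}, sort_keys=True)
    return topic, payload


def publish_status(client, event, state, timestamp=None):
    """Publish through any client exposing publish(topic, payload, qos, retain)."""
    topic, payload = build_status(event, state, timestamp)
    # retain=True so HA sees the last known state after a restart (design choice)
    client.publish(topic, payload, qos=1, retain=True)
    return topic, payload
```

On the HA side, an MQTT sensor or binary sensor subscribed to the matching topic would then pick up the state from the JSON payload.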
The main obstacles for production readiness were:
a) Making the doorpi weatherproof (rain, heat, cold, insects(!))
b) Securing the home network (cutting the LAN connection when sabotage is detected)
c) Echo cancellation. This is a serious problem that you have to address one way or another. I’m using pulseaudio, and I also rely on the capabilities of the Fritzfon. Don’t underestimate this problem.
EDIT: I tuned the RPi 3 for low power consumption by switching off everything I do not need and fixing the CPU clock at 600 MHz. That should also help with any potential heat problems. Idle power consumption is about 2 watts; with audio, the IR LEDs, and the Nextion active, it rises to about 7 watts.
With motion detection running all the time, CPU utilization is about 15%. Still images of the last 3 visitors are accessible through the doorpi web server.
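To put the power figures in perspective, here is a quick back-of-the-envelope calculation. The 2 W and 7 W values are the ones quoted above; the number of active minutes per day is purely an assumption you would replace with your own.

```python
IDLE_WATTS = 2.0    # idle figure quoted above
ACTIVE_WATTS = 7.0  # figure with audio, IR LEDs and display switched on


def daily_energy_wh(active_minutes_per_day):
    """Energy per day in watt-hours for a given number of active minutes."""
    if not 0 <= active_minutes_per_day <= 24 * 60:
        raise ValueError("active minutes must be between 0 and 1440")
    active_hours = active_minutes_per_day / 60.0
    idle_hours = 24.0 - active_hours
    return IDLE_WATTS * idle_hours + ACTIVE_WATTS * active_hours
```

With an assumed 30 active minutes per day, `daily_energy_wh(30)` gives 50.5 Wh per day, which is roughly 18 kWh per year.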