Voice and facial recognition can be done on the client side. The client is most probably a tablet that is always plugged in, so battery life is not a concern. The challenge, however, is how to make it act like the Echo, which always listens for commands.
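To make the always-listening part a bit more concrete, here is a rough sketch of one possible first stage: a simple energy gate that watches incoming audio frames and only cuts out the loud stretches, so that the heavier recognition step runs on those segments instead of on every frame. This is only an illustration under my own assumptions, not how the Echo or any existing project does it. The names `rms` and `speech_segments`, the threshold of 500 and the two-frame hangover are all made up for the example, and the frames are plain lists of 16-bit PCM sample values rather than a real microphone stream.

```python
import math
from typing import Iterable, List, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_segments(
    frames: Iterable[Sequence[int]],
    threshold: float = 500.0,  # assumed value; would need tuning per microphone
    hangover: int = 2,         # assumed number of quiet frames that ends a segment
) -> List[List[int]]:
    """Group loud frames into segments worth passing to a recognizer.

    A segment starts at the first frame whose RMS reaches `threshold` and is
    closed after `hangover` consecutive quiet frames. Quiet frames themselves
    are not copied into the segment.
    """
    segments: List[List[int]] = []
    current: List[int] = []
    quiet = 0
    for frame in frames:
        if rms(frame) >= threshold:
            current.extend(frame)
            quiet = 0
        elif current:
            quiet += 1
            if quiet >= hangover:
                segments.append(current)
                current = []
                quiet = 0
    if current:
        # The stream ended while a segment was still open.
        segments.append(current)
    return segments
```

On a real tablet the frames would come from the microphone in a loop that never exits, and each returned segment would be handed to whatever does the wake-word or command recognition. A plain energy threshold like this will also fire on any loud noise, so it is only a cheap filter in front of the real recognizer.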
I came across this project, Jasper Integration, which I think is a step closer to this idea.
Anyway, this idea still has rough edges. Please feel free to share your thoughts on it.