This first part isn’t answering the question directly, but reading this made me think about another entertaining but related post. In this case, a camera with line crossing detection is used, and it’s assumed that when the dogs are in a particular location that they are barking.
Apple’s Siri also has sound recognition and support for detecting barking dogs. Not sure if that will work for you.
While I work in the machine learning field, I haven’t attempted embedded ML models, but have a look at TensorFlow Lite and TinyML that can run on an ESP32. I suspect you might even find pre-trained models. Regardless, there will be some DIY involved.