It will come down to adding LLMs that the hardware can use to understand your requests. Check out the guides listed under number 7 in the Cookbook and read through them to get an idea of what is involved.
As for hardware, what Arh is referring to is the ‘Voice PE’. There are several other options around; for example, Seeed Studio has its own hardware. You will find more options if you check the forum.