AI First approach to Voice (Using OpenAI Realtime API)

If you want your HA to be as local as possible, stop reading here :wink:

I would like to experiment with an AI-first approach to voice.

I imagine a pipeline where a wake word connects the voice device to the OpenAI Realtime API, so all the voice features are piped straight to OpenAI and the LLM has access to the same functions and context as the AI agent we have today.
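To make this concrete, here is a very rough Python sketch of how such a session could start: open a Realtime API WebSocket after the wake word fires and declare a single free-text "assist" tool. The model id, event names, and session fields follow the current OpenAI docs and may change; the actual audio streaming is omitted.

```python
# Sketch only: open a Realtime API session after a wake word fires and
# register one "assist" tool the model can call with plain text.
import asyncio
import json
import os

import websockets  # pip install websockets

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

ASSIST_TOOL = {
    "type": "function",
    "name": "assist",
    "description": "Send a plain-language command or question to Home Assistant's Assist.",
    "parameters": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
}


async def run_session() -> None:
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # 'extra_headers' is the keyword in websockets <= 12; newer releases call it 'additional_headers'.
    async with websockets.connect(REALTIME_URL, extra_headers=headers) as ws:
        # Configure the session: server-side turn detection and one tool.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You control a smart home. Use the 'assist' tool for any device query or action.",
                "turn_detection": {"type": "server_vad"},
                "tools": [ASSIST_TOOL],
            },
        }))
        # From here: stream microphone audio in with input_audio_buffer.append
        # events and play back the audio deltas the server sends.
        async for raw in ws:
            event = json.loads(raw)
            print(event.get("type"))


if __name__ == "__main__":
    asyncio.run(run_session())
```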

The LLM does not even have to talk to the raw HA APIs; it can use "assist" and send simpler text commands to HA.

If I say "It's too dark in the kitchen", the LLM can use assist to get the state of the lights in the kitchen and then decide whether it needs to adjust a dimmer or turn something on.
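As a hedged sketch of that flow (not a finished integration), the "assist" tool handler could forward the model's text to Home Assistant's conversation endpoint (`POST /api/conversation/process`) and hand the result back to the Realtime session. `HA_URL` and `HA_TOKEN` are placeholders, and the exact Realtime event names may differ from this.

```python
# Sketch of the tool handler: forward a plain-language command to HA's Assist
# pipeline and return the result to the Realtime session.
import json
import os

import aiohttp  # pip install aiohttp

HA_URL = os.environ.get("HA_URL", "http://homeassistant.local:8123")
HA_TOKEN = os.environ["HA_TOKEN"]  # long-lived access token


async def call_assist(text: str) -> str:
    """Send free text to Home Assistant's conversation/Assist API."""
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"{HA_URL}/api/conversation/process",
            headers={"Authorization": f"Bearer {HA_TOKEN}"},
            json={"text": text, "language": "en"},
        ) as resp:
            return json.dumps(await resp.json())


async def handle_function_call(ws, event: dict) -> None:
    """Answer a completed function call coming from the Realtime session."""
    if event.get("type") != "response.function_call_arguments.done":
        return
    args = json.loads(event["arguments"])
    result = await call_assist(args["text"])
    # Return the Assist response to the model and ask it to keep talking.
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "function_call_output",
            "call_id": event["call_id"],
            "output": result,
        },
    }))
    await ws.send(json.dumps({"type": "response.create"}))
```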

I've tested this in the OpenAI playground (with "fake" assist functions), and in my (limited) testing the LLM understands very well what needs to be done; you can even say things like "a little more please" to turn the dimmer up further.

Benefits:

- Not turn-based: you can even interrupt the assistant and change your mind mid-sentence.
- Smarter, and it evolves as the models get smarter.
- No TTS/STT … just voice (and function calls).

It would be quite easy to build this "outside" of HA and just use the HA APIs, but it would be best as a pipeline in HA so you can mix assistants.

What do you all think? Is this the way to Jarvis?