AI-First Approach to Voice (Using OpenAI Realtime API)

If you want your HA to be as local as possible, stop reading here :wink:

I would like to experiment with an AI-first approach for voice.

I imagine a pipeline where a wake word connects the voice device to the OpenAI Realtime API, so all voice interaction is piped straight to OpenAI and the LLM has access to the same functions and context as the AI agent we have today.

The LLM does not even have to talk to the raw HA APIs; it can use "assist" and send simpler text commands to HA.

If I say "It's too dark in the kitchen", the LLM can use assist to get the state of the lights in the kitchen and then decide whether it needs to change a dimmer or turn something on.

I've tested this in the OpenAI playground (with "fake" assist functions), and in my (limited) testing the LLM understands very well what needs to be done; you can even say things like "a little more please" to turn the dimmer up further.
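For the curious, here is roughly what such an assist function could look like when wired up for real. A minimal sketch, assuming a long-lived HA access token and the standard `/api/conversation/process` REST endpoint; the tool name and schema are mine, not anything HA or OpenAI ships:

```python
import os
import requests

HA_URL = os.environ["HA_URL"]      # e.g. "http://homeassistant.local:8123"
HA_TOKEN = os.environ["HA_TOKEN"]  # long-lived access token from your HA profile

def assist(text: str) -> str:
    """Forward a plain-language command or question to HA's conversation API."""
    resp = requests.post(
        f"{HA_URL}/api/conversation/process",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={"text": text, "language": "en"},
        timeout=10,
    )
    resp.raise_for_status()
    # HA nests the agent's spoken reply under response -> speech -> plain
    return resp.json()["response"]["speech"]["plain"]["speech"]

# Tool schema to advertise to the Realtime session (name/description illustrative)
ASSIST_TOOL = {
    "type": "function",
    "name": "assist",
    "description": (
        "Send a natural-language command or question to Home Assistant and "
        "return its answer, e.g. 'turn on the kitchen lights' or "
        "'which lights are on in the kitchen?'"
    ),
    "parameters": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
}
```

The nice part is that assist already resolves areas and entity names, so the model only has to produce plain sentences.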

Benefits:

- Not turn-based: you can even interrupt the assistant and change your mind mid-sentence.
- Smarter, and it evolves as the models get smarter.
- No TTS/STT … just voice (and function calls).

It would be quite easy to build this "outside" of HA and just use the HA APIs, but it would be best as a pipeline in HA so you can mix assistants.
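To back up that "outside of HA" claim, here is a hedged sketch of the standalone loop, reusing `assist` and `ASSIST_TOOL` from the snippet above. The event names follow the published Realtime websocket docs (`session.update`, `response.function_call_arguments.done`, `conversation.item.create`), but treat the model name and details as assumptions that may have changed since the preview:

```python
import asyncio
import json
import os

import websockets  # pip install websockets

REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

async def run() -> None:
    async with websockets.connect(
        REALTIME_URL,
        additional_headers={  # 'extra_headers' on older websockets releases
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "OpenAI-Beta": "realtime=v1",
        },
    ) as ws:
        # Advertise the assist tool and ask for spoken responses
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "instructions": "You control a smart home through the assist tool.",
                "tools": [ASSIST_TOOL],
            },
        }))
        async for raw in ws:
            event = json.loads(raw)
            # The model decided to call assist: run it, hand back the result,
            # then ask for a follow-up (spoken) response.
            if event.get("type") == "response.function_call_arguments.done":
                args = json.loads(event["arguments"])
                await ws.send(json.dumps({
                    "type": "conversation.item.create",
                    "item": {
                        "type": "function_call_output",
                        "call_id": event["call_id"],
                        "output": assist(args["text"]),
                    },
                }))
                await ws.send(json.dumps({"type": "response.create"}))
            # Mic capture and speaker playback (input_audio_buffer.* and
            # response.audio.delta events) are left out to keep this short.

asyncio.run(run())
```

A pipeline inside HA would replace the mic/speaker plumbing with the voice device, but the tool-calling part would look much the same.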

What do you all think? Is this the way to Jarvis?

I’m a bit astonished that this topic hasn’t seen more activity! The realtime preview via the playground is really impressive.

I’m receiving my Voice Preview Edition this afternoon, with the goal of getting this integrated. I’d opt for the approach where the model has access to (part of) the HA API, instead of having it use assist. Bridging the two doesn’t need to be very difficult.
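To make "bridging the two" concrete: the model could get a state dump from `GET /api/states` at session start plus a handful of service-call tools like the sketch below. Same assumptions as the earlier snippets (`HA_URL`/`HA_TOKEN` env vars); the `set_light` tool name and schema are invented for illustration:

```python
import requests

def call_service(domain: str, service: str, **data) -> None:
    """Call a Home Assistant service, e.g. light.turn_on on light.kitchen."""
    requests.post(
        f"{HA_URL}/api/services/{domain}/{service}",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json=data,  # e.g. {"entity_id": "light.kitchen", "brightness": 200}
        timeout=10,
    ).raise_for_status()

# A tool the Realtime session could expose instead of (or next to) assist
LIGHT_TOOL = {
    "type": "function",
    "name": "set_light",
    "description": "Turn a light on or off, optionally with brightness 0-255.",
    "parameters": {
        "type": "object",
        "properties": {
            "entity_id": {"type": "string"},
            "on": {"type": "boolean"},
            "brightness": {"type": "integer"},
        },
        "required": ["entity_id", "on"],
    },
}

# When the model calls set_light(entity_id="light.kitchen", on=True, brightness=200):
#   call_service("light", "turn_on", entity_id="light.kitchen", brightness=200)
```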


I agree with this… While local is great, I would love to have this as an option. It would make things more fluid, like Alexa and Siri used to be… :slight_smile:

The Realtime API is so good. Looking forward to this integration.

Also, sending non-text audio through would be amazing (like LLM Vision, but realtime). Or having the response be more than just TTS.

subscribing!

Yes please! This would be fantastic, and it still seems not too far out of reach. Maybe the focus on streaming TTS coming in the next version is a small step toward this?

Pay attention to the plumbing being put into the March release, currently in beta… :wink: It doesn't get you 100% of the way there yet, but it supports streaming responses, which would be required for this if my understanding of the architecture is sound… :sunglasses:


Hi folks! I just open-sourced my Arduino ESP32 + OpenAI Realtime API project. Originally I used it to create AI-powered talking toys, but you could use it as a Jarvis-like assistant as well, e.g. to turn off your kitchen lights with voice commands! Try it out here: GitHub - akdeb/ElatoAI: Realtime AI speech with OpenAI Realtime API on Arduino ESP32 with Secure Websockets and Deno edge functions with >10min uninterrupted conversations globally for AI toys, AI companions, AI devices and more
Stack: Arduino, ESP32, PlatformIO, Deno Edge Functions, Supabase as a DB, Vercel for a frontend interface


Amazing project! Any plans to integrate it further with HA?

I just found out this exists :+1:: https://www.reddit.com/r/OpenAI/comments/1ixzra3/i_made_a_free_lifelike_openai_voice_assistant_for/
