With OpenAI’s announcement of GPT-4o, it’s becoming clear that there are significant benefits to systems that can input and output voice directly, without the need for an STT->Conversation->TTS pipeline. The main benefits are:
Decreased latency
Understanding and expressing emotion
Simplicity
While GPT-4o’s audio capabilities are not yet publicly available, there are services you can use right now that offer some of the same benefits (e.g. Vapi). Once open-source models catch up, this will surely become possible locally as well.
I propose we start working on supporting assistants that take voice in and generate voice out, without the need for separate STT and TTS components.
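For concreteness, here’s a rough sketch of the difference between the two architectures. Everything below is a placeholder, not a real API; the callables just stand in for whatever STT/LLM/TTS or speech-to-speech backend an implementation would plug in:

```python
from typing import Callable

# All callables here are placeholders for whatever STT / LLM / TTS or
# speech-to-speech backends get plugged in; none of them name real APIs.

def pipeline_assistant(
    audio_in: bytes,
    stt: Callable[[bytes], str],
    chat: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """Classic STT -> Conversation -> TTS pipeline.

    Each stage adds latency, and prosody/emotion in the user's
    voice is discarded at the STT boundary.
    """
    text_in = stt(audio_in)   # speech -> text
    reply = chat(text_in)     # text -> text
    return tts(reply)         # text -> speech

def direct_assistant(
    audio_in: bytes,
    speech_model: Callable[[bytes], bytes],
) -> bytes:
    """Voice-in/voice-out: a single model consumes and produces audio,
    so there is one hop instead of three, and tone can be both heard
    and expressed.
    """
    return speech_model(audio_in)
```

Supporting the second shape is mostly a matter of letting an assistant accept and return audio blobs instead of forcing everything through text.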
This isn’t related to GPT-4o’s audio capabilities; indeed, there is currently no mention in OpenAI’s API docs of an endpoint that accepts an audio stream (presumably it would go through the /v1/chat/completions endpoint, the way images do).
However, I thought you might be interested in GPT-4o’s vision capabilities, which open up many possibilities, especially for automation.
ha-gpt4vision is a service that takes an image and a prompt and returns GPT’s answer as a response variable, so it can easily be integrated into automations.
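I haven’t verified the exact schema, but based on how custom-integration services are usually exposed, calling it from outside Home Assistant over the standard REST API should look roughly like this. The service name `gpt4vision.image_analyzer` and the `image_file`/`message`/`max_tokens` fields are my assumptions from skimming the project, so check its README before relying on them:

```python
import requests

HA_URL = "http://homeassistant.local:8123"  # your Home Assistant instance
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"      # Profile -> Long-lived access tokens

# Assumed field names -- verify against the ha-gpt4vision README.
payload = {
    "image_file": "/config/www/doorbell_snapshot.jpg",
    "message": "Is there a person at the front door?",
    "max_tokens": 100,
}

resp = requests.post(
    # ?return_response asks HA to hand back the service's response data
    # (supported on recent HA versions for services that return data).
    f"{HA_URL}/api/services/gpt4vision/image_analyzer?return_response",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # GPT's answer comes back in the service response
```

Inside Home Assistant itself, the same call would go in an automation action with `response_variable` set, which is exactly the pattern described above.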
Has anyone here played with using HASS from O1? It’s a cool project that is trying to make a standard voice-assistant interface for LLMs and can also run arbitrary actions on the host computer. It’s a bit terrifying, but fun to play around with on a computer with nothing important on it. I imagine giving it device on-off permissions in HASS could provide a similar experience to HASS’s existing voice assistant feature.
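If anyone wants to try that, the least terrifying version is probably a narrow tool the LLM can call that only allows on/off on a whitelist of entities, rather than handing O1 a full admin token. Here’s a rough sketch against Home Assistant’s standard REST API (`POST /api/services/<domain>/<service>`); the whitelist and the `set_power` function are my own invention, not something O1 or HASS provides:

```python
import requests

HA_URL = "http://homeassistant.local:8123"
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"

# Whitelist of entities the LLM may touch -- my own safety measure,
# not something O1 or Home Assistant enforces for you.
ALLOWED_ENTITIES = {"light.living_room", "switch.desk_fan"}

def set_power(entity_id: str, on: bool) -> None:
    """Tool exposed to the LLM: turn a whitelisted entity on or off
    via Home Assistant's REST service-call endpoint."""
    if entity_id not in ALLOWED_ENTITIES:
        raise ValueError(f"{entity_id} is not on the allowed list")
    domain = entity_id.split(".")[0]           # e.g. "light" or "switch"
    service = "turn_on" if on else "turn_off"
    resp = requests.post(
        f"{HA_URL}/api/services/{domain}/{service}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"entity_id": entity_id},
        timeout=10,
    )
    resp.raise_for_status()

# e.g. set_power("light.living_room", True)
```

That way, even if the model decides to do something weird, the worst it can do is toggle the lights.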