I would like to request an integration that enables direct audio input processing with multimodal models, such as OpenAI's GPT-4o (which handles audio natively) or Google Gemini.
The core idea is to let users speak directly to Home Assistant, with the captured audio sent as-is to the model for interpretation, without requiring a separate speech-to-text (STT) pipeline.
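As a rough illustration of the core call, here is a minimal sketch using the google-generativeai Python SDK with the gemini-1.5-flash model. The file name, prompt, and model choice are assumptions for the example, not a proposed implementation:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")

# Read a captured voice command and send the raw audio inline: no STT step.
with open("command.wav", "rb") as f:
    audio_bytes = f.read()

response = model.generate_content([
    "Interpret this spoken smart-home command and describe the intended action.",
    {"mime_type": "audio/wav", "data": audio_bytes},
])
print(response.text)  # the model's interpretation of the spoken command
```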
Why This Matters
Current voice assistant implementations in Home Assistant rely on converting speech to text before passing it to intent processors or scripts. This introduces:
- Additional latency
- Potential transcription errors
- Unnecessary architectural complexity
Modern multimodal models are now capable of understanding audio natively. Leveraging this directly would streamline the process and improve the user experience significantly.
Benefits
- Reduced Latency: Eliminates the need for an STT step, speeding up response times.
- Higher Accuracy: A model that hears the audio directly can pick up tone, emotion, and phrasing that transcription discards, avoiding errors compounded by a conventional STT/NLU pipeline.
- Simplified Setup: Reduces reliance on external STT services and simplifies assistant configuration.
- Future-Proofing: Keeps Home Assistant compatible with the newest AI innovations in multimodal interaction.
Suggested Features
- Support for sending raw audio input from Home Assistant to a multimodal model API (e.g., OpenAI or Google Gemini).
- Ability to receive and handle text or audio responses from the model (a sketch covering both directions follows below).
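To show both directions at once, audio in and text or audio out, here is a minimal sketch using the OpenAI Python SDK with the gpt-4o-audio-preview chat model. The file names, prompt, and voice are illustrative assumptions:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Base64-encode the captured voice command for the API.
with open("command.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",            # audio-capable chat model
    modalities=["text", "audio"],            # request both text and spoken replies
    audio={"voice": "alloy", "format": "wav"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "You are a smart-home voice assistant. Interpret this command."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)

message = completion.choices[0].message
print(message.audio.transcript)              # text form of the reply
with open("reply.wav", "wb") as f:           # spoken form, ready for playback
    f.write(base64.b64decode(message.audio.data))
```

In an actual integration, the text reply could feed Home Assistant's conversation agent, while the returned audio could be played back through a media_player entity.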