Native Audio Integration with Multimodal GPT (e.g. OpenAI, Google Gemini) in Home Assistant

I would like to request an integration that enables direct audio input processing using multimodal models, such as OpenAI’s GPT-4o (which has native audio capabilities) or Google Gemini.

The core idea is to allow users to speak directly to Home Assistant, with the audio being sent as-is to the AI model for interpretation — without requiring a separate speech-to-text (STT) pipeline.
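To illustrate the idea outside of Home Assistant itself, here is a minimal sketch of "audio in, answer out" against Google's Gemini API. It assumes the google-generativeai Python package, a GEMINI_API_KEY environment variable, and a hypothetical command.wav clip standing in for whatever the voice assistant captured; the model name is only a current example, not a proposal for the final implementation.

```python
import os
import google.generativeai as genai

# Assumption: the API key is supplied via the GEMINI_API_KEY environment variable.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

# "command.wav" is a placeholder for the clip captured by the voice assistant.
# The audio is handed to the model as-is, with no speech-to-text step in between.
audio_clip = genai.upload_file("command.wav")

response = model.generate_content(
    [
        "You are a smart home voice assistant. Interpret the spoken request "
        "and answer concisely.",
        audio_clip,
    ]
)
print(response.text)
```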


Why This Matters

Current voice assistant implementations in Home Assistant rely on converting speech to text before passing it to intent processors or scripts. This introduces:

  • Additional latency
  • Potential transcription errors
  • Unnecessary architectural complexity

Modern multimodal models are now capable of understanding audio natively. Leveraging this directly would streamline the process and improve the user experience significantly.


Benefits

  • Reduced Latency: Eliminates the need for an STT step, speeding up response times.
  • Higher Accuracy: AI models can interpret tone, emotion, and natural language better than conventional STT/NLU pipelines.
  • Simplified Setup: Removes the dependency on a separate STT service and streamlines assistant configuration.
  • Future-Proofing: Keeps Home Assistant compatible with the newest AI innovations in multimodal interaction.

Suggested Features

  • Support for sending raw audio input from Home Assistant to a multimodal GPT API (e.g. OpenAI or Google Gemini), as sketched below.
  • Ability to receive and handle text or audio-based responses from the AI.
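
To make the first bullet concrete, here is a rough sketch of sending raw audio to OpenAI's audio-capable chat completions endpoint and receiving both a text transcript and spoken audio back. It assumes the official openai Python SDK, the gpt-4o-audio-preview model, and the same hypothetical command.wav placeholder; exact model names and parameters may well change before such an integration lands.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# "command.wav" is a placeholder for audio captured by a voice satellite.
with open("command.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    # Request both a text transcript and a spoken reply that Home Assistant
    # could hand straight to a media player.
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Interpret this smart home voice command."},
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                },
            ],
        }
    ],
)

message = completion.choices[0].message
print(message.audio.transcript)                    # text version of the spoken reply
reply_wav = base64.b64decode(message.audio.data)   # raw WAV bytes for playback
```

Requesting text and audio in a single call is what would make separate STT and TTS steps optional: the integration's main job would be wiring the captured audio stream in and routing the returned text or audio back out through the existing Assist pipeline.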