Allow File Uploads (Images, PDFs) to Assist for Multimodal Interaction

Motivation: Large language models (LLMs) increasingly support multimodal input and can process images and documents as well as text. Home Assistant’s Assist could use these capabilities to enable more sophisticated interactions and automations if it accepted file inputs directly. Currently, Assist accepts only text and voice input.

Proposed Solution: Implement functionality within the Assist chat interface (in Home Assistant Core UI and companion apps) to allow users to attach files, primarily focusing on common types like images (JPEG, PNG, HEIC) and documents (PDF).
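As a rough illustration of the server-side check such an upload path might need, here is a minimal Python sketch that accepts only the file types named above, identified by their leading bytes rather than by file extension. All names (Attachment, sniff_mime_type, make_attachment) and the 10 MiB size limit are hypothetical and not part of any existing Home Assistant API; the HEIC check is simplified to two common brand codes.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed upload limit for illustration only; the proposal does not specify one.
MAX_BYTES = 10 * 1024 * 1024

# Simplified set of ISO BMFF brand codes treated as HEIC in this sketch.
HEIC_BRANDS = (b"heic", b"heix")


@dataclass(frozen=True)
class Attachment:
    """Hypothetical container for a file attached to an Assist message."""
    filename: str
    mime_type: str
    data: bytes


def sniff_mime_type(data: bytes) -> Optional[str]:
    """Return a MIME type based on the file's leading bytes, or None."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    # HEIC files carry an 'ftyp' box at offset 4 followed by a brand code.
    if len(data) >= 12 and data[4:8] == b"ftyp" and data[8:12] in HEIC_BRANDS:
        return "image/heic"
    return None


def make_attachment(filename: str, data: bytes) -> Attachment:
    """Validate size and type, raising ValueError for anything unsupported."""
    if len(data) > MAX_BYTES:
        raise ValueError(f"{filename}: file exceeds {MAX_BYTES} bytes")
    mime_type = sniff_mime_type(data)
    if mime_type is None:
        raise ValueError(f"{filename}: unsupported file type")
    return Attachment(filename=filename, mime_type=mime_type, data=data)
```

Checking content rather than trusting the extension means a renamed file is still rejected, which matters if attachments are later forwarded to an OCR or LLM backend.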

Benefits & Use Cases: This feature would enable users and automations to act upon the content of uploaded files, unlocking powerful new possibilities:

  1. Smart Inventory/Receipt Processing:
  • Scenario: A user uploads a photo of a grocery receipt to Assist.
  • Action: The user could say, “Add these items to the pantry.” An automation could then use an appropriate backend (OCR service, local/cloud multimodal LLM) to parse the image, identify items, update inventory helpers (input_text, list), and potentially estimate expiry dates for alerts.
  2. Visual Meter Readings: Upload a photo of a utility meter; an automation parses the reading and updates a sensor.
  3. Device Manual Interaction: Upload a PDF manual; ask Assist, “How do I pair this device?” and have an automation (using an LLM) extract relevant steps.
  4. Object/Plant Identification: Upload a photo of a plant; trigger an automation using image recognition to identify it and perhaps add it to the Plant integration.
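To make the receipt scenario more concrete, the following Python sketch shows one way the "parse the image, identify items, update inventory" step could look once a backend has turned the photo into text. It is only an illustration under assumptions: extract_text stands in for whatever OCR or multimodal LLM backend is used, the pantry is a plain Python list rather than a real Home Assistant helper, and the line format (item name followed by a price) is a guess at what typical receipt text looks like.

```python
import re
from typing import Callable

# Assumed receipt line shape: an item name followed by a price such as 2.49 or $2,49.
PRICE_LINE = re.compile(r"^(?P<name>.+?)\s+\$?\d+[.,]\d{2}$")

# Lines whose name contains one of these words are treated as summary lines, not items.
SKIP_WORDS = frozenset({"total", "subtotal", "tax", "change", "cash", "card"})


def parse_receipt_items(ocr_text: str) -> list[str]:
    """Extract item names from OCR'd receipt text, skipping totals and noise."""
    items = []
    for raw_line in ocr_text.splitlines():
        match = PRICE_LINE.match(raw_line.strip())
        if match is None:
            continue
        name = match.group("name").strip()
        if SKIP_WORDS & set(name.lower().split()):
            continue
        items.append(name.title())
    return items


def handle_receipt_upload(
    image: bytes,
    extract_text: Callable[[bytes], str],  # hypothetical OCR/LLM backend
    pantry: list[str],
) -> list[str]:
    """Parse a receipt image and add any new items to the pantry list."""
    items = parse_receipt_items(extract_text(image))
    for item in items:
        if item not in pantry:
            pantry.append(item)
    return items


if __name__ == "__main__":
    # A stub backend returning canned text stands in for real OCR.
    pantry = ["Milk"]
    found = handle_receipt_upload(b"", lambda _: "MILK 2.49\nEGGS 3.10\nTOTAL 5.59", pantry)
    print(found, pantry)
```

Passing the backend in as a callable keeps the parsing logic independent of whether the text comes from a local OCR service or a cloud multimodal LLM, which matches the backend-agnostic framing of the scenario.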

Conclusion: Adding file attachment capabilities would significantly expand Assist’s functionality, making it a more versatile interface that can take advantage of multimodal models and enable richer, more context-aware smart home automations.