MCP Assist - 95% Token Reduction for Voice Assistants with Local & Cloud LLMs

Hey everyone! :wave:

I’ve been working on a custom integration that solves a problem that’s been bugging me for a while: voice assistants sending massive entity dumps to LLMs with every single request.

The Problem

Traditional voice assistant setups send your entire entity list (lights, switches, sensors, etc.) to the LLM every time you ask a question. For a typical home with 200+ devices, that’s:

  • 12,000+ tokens sent every time
  • Expensive API costs if using cloud LLMs
  • Slow response times
  • Context window limitations
  • Poor performance with large homes

The Solution: MCP Assist

Instead of dumping all entities, MCP Assist uses the Model Context Protocol (MCP) to give your LLM tools for dynamic entity discovery. The LLM only fetches what it needs, when it needs it.

Token reduction: 95% (from 12,000+ tokens down to ~400 per request)

How It Works

  1. MCP Assist starts an MCP server on Home Assistant
  2. Your LLM connects and gets access to discovery tools:
    • get_index - Smart Entity Index with system structure (~400-800 tokens)
    • discover_entities - Find entities by type, area, domain, device_class, or state
    • get_entity_details - Get current state and attributes
    • perform_action - Control devices
    • run_script / run_automation - Execute scripts and automations
    • list_areas / list_domains - List available areas and device types
  3. LLM discovers on-demand instead of getting everything upfront

Example: Complex Query

User: “Do we have a leak?”

Behind the scenes:

1. LLM calls get_index → Sees moisture sensors exist
2. LLM calls discover_entities(device_class="moisture")
   → Returns all leak sensors
3. LLM calls get_entity_details for each
   → Finds laundry sensor is "on", others "off"
4. LLM synthesizes response

Assistant: “Yes, the laundry room leak sensor is detecting water. The bathroom and kitchen sensors are dry.”
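The four steps above can be sketched against a mock of the tool surface. The tool names come from the list earlier in the post; the entity IDs and states are invented purely for illustration:

```python
# Mock of the MCP tool surface to illustrate the discovery chain.
# Tool names match the post; the entity data is invented for this example.

class MockTools:
    ENTITIES = {
        "binary_sensor.laundry_leak": {"device_class": "moisture", "state": "on"},
        "binary_sensor.bathroom_leak": {"device_class": "moisture", "state": "off"},
        "binary_sensor.kitchen_leak": {"device_class": "moisture", "state": "off"},
        "light.kitchen": {"device_class": None, "state": "on"},
    }

    def get_index(self):
        # Smart Entity Index: a compact summary instead of a full entity dump
        return {"domains": ["binary_sensor", "light"], "device_classes": ["moisture"]}

    def discover_entities(self, device_class=None):
        return [
            eid for eid, attrs in self.ENTITIES.items()
            if device_class is None or attrs["device_class"] == device_class
        ]

    def get_entity_details(self, entity_id):
        return self.ENTITIES[entity_id]


def find_wet_sensors(tools):
    tools.get_index()                                             # step 1: sees moisture sensors exist
    sensors = tools.discover_entities(device_class="moisture")    # step 2: finds leak sensors
    return [s for s in sensors
            if tools.get_entity_details(s)["state"] == "on"]      # step 3: checks each state
```

Only the matching entities ever enter the context, which is where the token savings come from.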

Key Features

  • :white_check_mark: 95% Token Reduction - Massive efficiency gain
  • :white_check_mark: Multi-Platform Support - Works with LM Studio, llama.cpp, Ollama, OpenAI, Google Gemini, Anthropic Claude, and OpenRouter
  • :white_check_mark: Multi-turn Conversations - Maintains context and history
  • :white_check_mark: Smart Entity Index - Pre-generated system structure for context-aware queries
  • :white_check_mark: Web Search Tools - Optional DuckDuckGo or Brave Search integration
  • :white_check_mark: Works with 1000+ Entities - Efficient even with large installations
  • :white_check_mark: Multi-Profile Support - Run multiple agents with different models/personalities
  • :white_check_mark: Local or Cloud - Your choice of local LLMs or cloud APIs

Installation

HACS (Recommended)

  1. Add this repository as a custom repository in HACS
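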
  2. Install “MCP Assist”
  3. Restart Home Assistant
  4. Add integration via Settings → Devices & Services

Manual Installation

  1. Copy custom_components/mcp_assist to your HA custom_components directory
  2. Restart Home Assistant

Setup

The integration walks you through a 4-step setup:

  1. Profile name and server type selection
  2. Server URL (for local) or API key (for cloud)
  3. Model selection and prompts (models auto-load from your provider)
  4. Advanced settings (temperature, response mode, web search, etc.)

Then set it as your voice assistant in Settings → Voice Assistants.

What’s New in v0.11.0

Just released a major update with a complete conversation flow system overhaul:

  • User-controlled endings - Say “bye”, “thanks”, “stop” and it actually stops
  • Mode-specific behaviors - None/Smart/Always modes with distinct personalities
  • Configurable detection - Customize follow-up phrases and ending words per profile
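Roughly, the ending-word detection could look like the sketch below. This is illustrative only: the actual phrase lists are configurable per profile, and the integration's real matching logic may differ.

```python
import string

# Hypothetical per-profile defaults; in the integration these are user-configurable
ENDING_WORDS = {"bye", "thanks", "stop"}

def should_end(utterance: str) -> bool:
    # Normalize case and strip surrounding punctuation from each word
    words = [w.strip(string.punctuation) for w in utterance.lower().split()]
    return bool(words) and words[-1] in ENDING_WORDS
```

So "Thanks!" ends the conversation, while "Turn on the lights" keeps it open.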

Feedback & Support

This has been working really well for me, but I’d love feedback from the community! If you run into issues or have suggestions, feel free to reach out here or on GitHub Issues.

Happy to answer any questions! :house:

This is very cool and quite useful, especially for those with GPUs/CPUs that have lower memory bandwidth, and the smart conversation modes are definitely something I think the current HA implementation is lacking.

I just wanted to point something out though:

The claim that your entire entity list gets sent every time is only true if you configure it that way. You can easily decide in HA which entities are passed in, and in most cases there is a small subset of devices that generally needs to be accessed. It still grows depending on your needs, of course, but you can already slim things down considerably by reducing the entities that are passed to Assist instead of having it pass everything.

Yeah, you can absolutely just expose a small number of entities, but then the voice experience is restricted to basic single-function calls: turn on/off, what is the status of…, etc.

The cool thing about using an MCP and having many more entities exposed is that the LLM can connect dots in natural and unexpected ways, which makes the whole experience feel so much more real. :grinning:

Interesting, I’d be curious to know more about that to see what could be missing.

I haven’t had any issues with it connecting multiple tool calls, for example:

  • Adjusting multiple devices at the same time
  • Using a web search to then search for a specific song to play on a speaker
  • I gave it the ability to search the YouTube API, and it is able to search YouTube and then play the search result on the TV that I specify

To be clear, I'm not downplaying the work that has been done; I'm just curious to see more specifics, as I'm not sure Assist's abilities come down to exposing many more entities versus better prompting and tool descriptions that make it more capable.

You’re totally correct, and you can of course chain multiple responses and tool calls together with the current apps, but the big issue is passing the entire context every time. So everything gets continually slower as we add more and more smart devices to our homes that we want connected and available to an LLM.

Anyway, to each their own; I just want the full Jarvis-style experience one day, with everything interconnected. :joy:

Yeah, that is the goal that I have as well. I decided to give this a try, and in my specific case it was a bit slower. I believe this is because, depending on your hardware, the MCP approach can be slower than a larger prompt: a GPU capable of high prompt tok/s doesn’t really get slowed down by a few more tokens from additional devices.

This definitely would have been useful when I was using a 3050 and the prompt reading was slowing everything down. On a 3090 with 4,000 tok/s prompt reading, it is not as much of an issue.

Anyway, sorry for the rambling, just wanted to share my perspective as I don’t think there is one approach to rule them all, there are so many variables depending on priorities :laughing:

To try and provide some helpful feedback:

One thing that makes this harder for me is that the tool calls don’t seem to get added to the debug view in Home Assistant, which makes it more difficult to debug my tools and see exactly what the LLM passed in. For scripts you can use the trace feature, but for other Python-based tools it is harder to see.

This is the missing piece I’m referring to (screenshot of the Assist debug view omitted):

Great suggestion!!

Although I have tons of debug logging in the main HA system logs, I would love for the tool calls to appear within the Voice Assist debug pages too.

I’ll look into it and get it added! Thanks.

@crzynik I have now added the chatlog functionality for the voice assistant debug view you showed above.

Makes it much easier to follow what tools the LLM is using. :grin:

Can it strip emojis from LLM responses? I’m using the Local OpenAI LLM integration with Ollama. It’s been good so far, but I’m keen to test yours, especially to reduce the context window and add more entities.

Nice work!

Thanks.

Yes, you can instruct the LLM to strip emojis, as with other similar integrations.

Just edit the ## Response Rules section of the Technical Instructions prompt.

It is all trial and error though, finding the correct instruction that your LLM will actually listen to.

You could start by just changing the existing line to:

- Short, concise replies in plain text only (no emojis, *, **, markup, or URLs)

Yeah, I’ve tried that, but honestly it’s hit and miss. My prompt is quite long with all the tool definitions and entity context, and the LLM just ignores the formatting instructions half the time.

I’m using Qwen3 4B with 8k context, as it’s the fastest model that fits my RTX 5070 Ti and keeps responses under 1 s, but I’ve also found that models with more parameters can ignore emoji instructions too.

The hass_local_openai_llm integration does this at the code level instead, using the demoji library to strip them during streaming:

import asyncio
import demoji

if strip_emojis:
    # Run the (synchronous) emoji stripping off the event loop thread
    loop = asyncio.get_running_loop()
    content = await loop.run_in_executor(None, demoji.replace, content, "")

Works every time regardless of what model you’re using or how long your prompt is. Would be a pretty small change if you’re open to adding it as a config option? Would be awesome!
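For reference, a similar effect can be approximated with only the standard library, using a regex over the main emoji code-point ranges. This is a rough sketch for environments without the demoji package; demoji’s bundled emoji database is far more complete:

```python
import re

# Rough stdlib-only fallback: strip characters in the common emoji ranges.
# The ranges below are an approximation, not an exhaustive emoji list.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # symbols, pictographs, supplemental pictographs
    "\U00002600-\U000027BF"  # miscellaneous symbols, dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flag pairs)
    "]+"
)

def strip_emojis(text: str) -> str:
    return EMOJI_RE.sub("", text)
```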

@notownblues - Thanks for this great suggestion!

I have implemented it along with some other cleanup elements.

It’s good to get things out of the prompt whenever possible, to simplify things for the LLMs.

Awesome!! :partying_face: Looking forward to trying it

I just set this up but noticed that there isn’t a way to get the voice assistant to ‘prefer’ local first, before going to an LLM. Is that something that can be added?

@sparkydave That’s handled by Home Assistant in their Assist profile settings.

Under “Conversation agent” look for the “Prefer handling commands locally” setting.

I’m trying this out and seeing some odd behaviour where discover_entities only ever returns 20 entities (I have significantly more exposed to Assist).

It’s failing to find entities I know are exposed (to be fair, I think my LLM is also not smart enough to utilise the areas, so I’m experimenting a bit there).

Yeah, sorry. That was my bad, I hadn’t done the final step to actually create the Voice profile… :person_facepalming:

@matchett808-gh The default limit is set to 20, but the model can request up to 50.
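Roughly, the limiting works like this (an illustrative sketch, not the integration’s actual code):

```python
def discover_entities(entities, device_class=None, limit=20, max_limit=50):
    # Sketch of the default (20) and maximum (50) result limits:
    # the model can ask for more, but the request is capped at max_limit.
    limit = min(limit, max_limit)
    matches = [
        e for e in entities
        if device_class is None or e.get("device_class") == device_class
    ]
    return matches[:limit]
```

So with 120 matching entities, a default call returns 20 and even an explicit request for 120 returns at most 50.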

ah, I need closer to 120 lol

Ignore me for now; I’ll come back to this. Someone on GitHub is asking just about the same questions as I am, but better. I’ll check back in a couple of days.