I’ve been working on a custom integration that solves a problem that’s been bugging me for a while: voice assistants sending massive entity dumps to LLMs with every single request.
The Problem
Traditional voice assistant setups send your entire entity list (lights, switches, sensors, etc.) to the LLM every time you ask a question. For a typical home with 200+ devices, that’s:
12,000+ tokens sent every time
Expensive API costs if using cloud LLMs
Slow response times
Context window limitations
Poor performance with large homes
The Solution: MCP Assist
Instead of dumping all entities, MCP Assist uses the Model Context Protocol (MCP) to give your LLM tools for dynamic entity discovery. The LLM only fetches what it needs, when it needs it.
Token reduction: 95% (from 12,000+ tokens down to ~400 per request)
How It Works
MCP Assist starts an MCP server on Home Assistant
Your LLM connects and gets access to discovery tools:
get_index - Smart Entity Index with system structure (~400-800 tokens)
discover_entities - Find entities by type, area, domain, device_class, or state
get_entity_details - Get current state and attributes
perform_action - Control devices
run_script / run_automation - Execute scripts and automations
list_areas / list_domains - List available areas and device types
LLM discovers on-demand instead of getting everything upfront
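To make the tool surface concrete, here is a toy sketch of two of the tools above as plain Python handlers behind a dispatch table. Only the tool names come from the list; the fake state store, filter semantics, and handler signatures are assumptions for illustration, not the integration's actual code (a real server would register these through an MCP SDK):

```python
# Illustrative only: a toy state store standing in for Home Assistant.
FAKE_STATES = {
    "light.kitchen": {"domain": "light", "area": "kitchen", "state": "off"},
    "binary_sensor.laundry_leak": {
        "domain": "binary_sensor", "device_class": "moisture",
        "area": "laundry", "state": "on",
    },
}

def discover_entities(**filters):
    """Return entity IDs whose attributes match every given filter."""
    return [
        eid for eid, attrs in FAKE_STATES.items()
        if all(attrs.get(k) == v for k, v in filters.items())
    ]

def get_entity_details(entity_id):
    """Return the stored state and attributes for one entity."""
    return FAKE_STATES[entity_id]

# Dispatch table mirroring two of the tool names above.
TOOLS = {
    "discover_entities": discover_entities,
    "get_entity_details": get_entity_details,
}

print(TOOLS["discover_entities"](device_class="moisture"))
```

The point is that each call returns only the handful of entities that match, so the LLM's context holds a few dozen tokens per step instead of the whole entity registry.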
Example: Complex Query
User: “Do we have a leak?”
Behind the scenes:
1. LLM calls get_index → Sees moisture sensors exist
2. LLM calls discover_entities(device_class="moisture")
→ Returns all leak sensors
3. LLM calls get_entity_details for each
→ Finds laundry sensor is "on", others "off"
4. LLM synthesizes response
Assistant: “Yes, the laundry room leak sensor is detecting water. The bathroom and kitchen sensors are dry.”
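The four steps above can be sketched as one function driving hypothetical tool calls. The `call_tool` helper, the return shapes, and the stub transport below are assumptions for the sketch; only the tool names are from the integration:

```python
def answer_leak_query(call_tool):
    """Walk the discovery flow for 'Do we have a leak?'."""
    index = call_tool("get_index")  # small system summary (~400-800 tokens)
    if "moisture" not in index.get("device_classes", []):
        return "No leak sensors are set up."
    # Fetch only the moisture sensors, then check each one's state.
    sensors = call_tool("discover_entities", device_class="moisture")
    wet = [s for s in sensors
           if call_tool("get_entity_details", entity_id=s)["state"] == "on"]
    if wet:
        return f"Leak detected: {', '.join(wet)}"
    return "All leak sensors are dry."

# Stub transport standing in for the real MCP connection.
def fake_call_tool(name, **kwargs):
    if name == "get_index":
        return {"device_classes": ["moisture", "motion"]}
    if name == "discover_entities":
        return ["binary_sensor.laundry_leak", "binary_sensor.bath_leak"]
    if name == "get_entity_details":
        state = "on" if "laundry" in kwargs["entity_id"] else "off"
        return {"state": state}

print(answer_leak_query(fake_call_tool))
```

In the real integration the LLM itself decides which tool to call next; this sketch just hard-codes the same sequence to show how little context each step needs.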
Key Features
95% Token Reduction - Massive efficiency gain
Multi-Platform Support - Works with LM Studio, llama.cpp, Ollama, OpenAI, Google Gemini, Anthropic Claude, and OpenRouter
Multi-turn Conversations - Maintains context and history
Smart Entity Index - Pre-generated system structure for context-aware queries
Web Search Tools - Optional DuckDuckGo or Brave Search integration
Works with 1000+ Entities - Efficient even with large installations
Multi-Profile Support - Run multiple agents with different models/personalities
Local or Cloud - Your choice of local LLMs or cloud APIs
Installation
HACS (Recommended)
Click the badge above, or add it manually as a custom repository in HACS
Install “MCP Assist”
Restart Home Assistant
Add integration via Settings → Devices & Services
Manual Installation
Copy custom_components/mcp_assist to your HA custom_components directory
Restart Home Assistant
Setup
The integration walks you through a 4-step setup:
Profile name and server type selection
Server URL (for local) or API key (for cloud)
Model selection and prompts (models auto-load from your provider)
Advanced settings (temperature, response mode, web search, etc.)
Then set it as your voice assistant in Settings → Voice Assistants.
What’s New in v0.11.0
Just released a major update with a complete conversation flow system overhaul:
User-controlled endings - Say “bye”, “thanks”, “stop” and it actually stops
Mode-specific behaviors - None/Smart/Always modes with distinct personalities
Configurable detection - Customize follow-up phrases and ending words per profile
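For a rough idea of what configurable ending-word detection can look like, here is a hypothetical sketch; the actual option names and matching rules in the integration may differ:

```python
# Hypothetical per-profile configuration; not the integration's real schema.
ENDING_WORDS = {"bye", "goodbye", "thanks", "thank you", "stop", "that's all"}

def wants_to_end(utterance: str) -> bool:
    """Return True if the user's utterance signals the conversation is over."""
    text = utterance.lower().strip(" .!?")
    return text in ENDING_WORDS or any(text.endswith(w) for w in ENDING_WORDS)
```

A set lookup plus a suffix check keeps the detection cheap enough to run on every turn.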
This has been working really well for me, but I’d love feedback from the community! If you run into issues or have suggestions, feel free to reach out here or on GitHub Issues.
This is very cool and quite useful, especially for those with GPUs/CPUs that have lower memory bandwidth, and the smart conversation modes are definitely something I think the current HA implementation is lacking.
I just wanted to point something out though:
This is only true if you configure it that way. You can easily decide in HA which entities are passed in, and in most cases there is only a small subset of devices that actually needs to be accessible. It still grows depending on your needs, of course, but you can already slim the context down considerably by reducing the entities that are exposed to Assist instead of passing everything.
Yeah, you can absolutely just expose a small number of entities, but then the voice experience is restricted to basic single function calls: turn on/off, what is the status of … etc.
The cool thing about using an MCP server and having many more entities exposed is that the LLM can connect dots in natural and unexpected ways that make the whole experience feel so much more real.
Interesting, I’d be curious to know more about that to see what could be missing.
I haven’t had any issues with it connecting multiple tool calls, for example:
Adjusting multiple devices at the same time
Using a web search to then search for a specific song to play on a speaker
I gave it the ability to search the YouTube API, and it is able to search YouTube and then play the search result on the TV that I specify
To be clear, I'm not downplaying the work that has been done; I'm just curious to see more specifics, as I'm not sure the abilities of Assist come from exposing so many more entities to it versus better prompting and descriptions on the tools, which make it more capable.
You’re totally correct, and you can of course chain multiple responses and tool calls together with the current apps, but the big issue is passing the entire context every time. So it all gets continually slower as we add more and more smart devices to our homes that we want connected and available to an LLM.
Anyway, to each their own. I just want the full Jarvis-style experience one day, with everything interconnected.
Yeah, that is the goal I have as well. I decided to give this a try, and in my specific case it was a bit slower. I believe this is because, depending on your hardware, the MCP approach can be slower than a larger prompt: a GPU capable of high prompt tok/s doesn’t really get slowed down by a few more tokens from additional devices.
This definitely would have been useful back when I was using a 3050 and prompt reading was slowing everything down. On a 3090 with 4000 tok/s prompt reading, it is not as much of an issue.
Anyway, sorry for the rambling; I just wanted to share my perspective, as I don’t think there is one approach to rule them all. There are so many variables depending on your priorities.
To try and provide some helpful feedback:
One thing that makes this harder for me is that the tool calls don’t appear to get added to the debug view in Home Assistant, which makes it difficult to debug my tools and see exactly what the LLM passed in. For scripts you can use the trace feature, but for other Python-based tools it is harder to see.
Can it strip emojis from LLM responses? I’m using Local OpenAI LLM with Ollama. It’s been good so far, but I’m keen to test yours, especially to reduce the context window and add more entities.
Yeah I’ve tried that but honestly it’s hit and miss. My prompt is quite long with all the tool definitions and entity context, and the LLM just ignores the formatting instructions half the time.
I’m using Qwen3 4b with 8k context, as it’s the fastest model that fits my RTX 5070 Ti and keeps responses under 1s, but I’ve also found that models with more parameters can ignore emoji instructions too.
The hass_local_openai_llm integration does this at the code level instead, using the demoji library to strip them during streaming.
Works every time regardless of what model you’re using or how long your prompt is. Would be a pretty small change if you’re open to adding it as a config option? Would be awesome!
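For reference, a minimal sketch of code-level stripping. This version uses a Unicode-range regex so it stays dependency-free; the character ranges cover only common emoji and are an assumption, whereas the integration mentioned above reportedly uses the demoji library (`demoji.replace(text, "")`), which covers the full emoji set:

```python
import re

# Common emoji ranges; not exhaustive.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, symbols, extended pictographs
    "\U00002600-\U000027BF"  # misc symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flag pairs)
    "\U0000FE0F"             # variation selector-16
    "]+"
)

def strip_emojis(chunk: str) -> str:
    """Remove emoji from a streamed text chunk before TTS/display."""
    return EMOJI_RE.sub("", chunk)
```

Because it runs per chunk and is stateless, it slots cleanly into a streaming response path regardless of which model produced the text.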
I just set this up but noticed that there isn’t a way to get the Voice Assistant to ‘prefer’ local first, before going to an LLM. Is that something that can be added?
Ignore me for now — I’ll come back to this. Someone on GH is asking just about the same questions as I am, but better; I will check back in a couple of days.