I have been watching Home Assistant's progress with Assist for some time. We previously used Google Home via Nest Minis, and have now switched to a fully local Assist pipeline that handles commands locally first and falls back to Ollama. In this post I will share the steps I took to get where I am today, the decisions I made, and why they were right for my use case specifically.
Hardware Details
I certainly would not say this is the minimum hardware needed for a good experience, but it is very important that you set your expectations based on the cost and performance of the hardware you are willing to run.
I am running Home Assistant on my Unraid NAS; its specs are not really important here, as the NAS does nothing for HA Voice itself.
Voice Hardware:
- 1 HA Voice Preview Satellite
- 2 Satellite1 Small Squircle Enclosures
- 1 Pixel 7a used as a satellite/hub with View Assist
Voice Server Hardware:
- Beelink Mini PC with USB4 (the exact model isn’t important as long as it has USB4)
- USB4 eGPU enclosure
- Nvidia RTX 5060 Ti 16GB
Voice Server Software:
- Unsloth Qwen:4B Q6_K running via Ollama
- Nvidia Parakeet via Whisper (STT)
- Piper running on CPU (TTS)
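For context, here is a rough sketch of how a server stack like this can be wired together with Docker Compose. This is not my exact configuration (I serve Parakeet rather than a stock Whisper model, and the model names and volume paths below are illustrative), but the rhasspy Wyoming images and the official Ollama image follow this general shape:

services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"                            # Ollama API that the HA Ollama integration connects to
    volumes:
      - ./ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia                     # pass the eGPU through to Ollama
              count: all
              capabilities: [gpu]
  whisper:
    image: rhasspy/wyoming-whisper
    command: --model small-int8 --language en    # choose an STT model that fits your hardware
    ports:
      - "10300:10300"                            # Wyoming STT endpoint, added to HA via the Wyoming integration
    volumes:
      - ./whisper-data:/data
  piper:
    image: rhasspy/wyoming-piper
    command: --voice en_US-lessac-medium         # TTS runs fine on CPU
    ports:
      - "10200:10200"                            # Wyoming TTS endpoint, added to HA via the Wyoming integration
    volumes:
      - ./piper-data:/data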
The Journey
My point in posting this is not to suggest that what I have done is “the right way” or even something others should replicate. But I learned a lot throughout this process, and I figured it would be worth sharing so others can get a better idea of what to expect, common pitfalls, and so on.
The Problem
Over the last year or two we noticed that Google Assistant on these Nest Minis got progressively dumber while also not gaining any new features. This was generally fine, since the WAF (wife acceptance factor) was still much higher than having no voice control at all, but it became increasingly annoying as we were met with more and more “Sorry, I can’t help with that” or “I don’t know the answer to that, but according to XYZ source here is the answer”. It generally worked, but not reliably, and getting answers to arbitrary questions was often a fuss.
Then there are the usual privacy concerns of having internet-connected microphones throughout your home, plus the annoyance that every time AWS or some other cloud service went down, you couldn’t use voice to control the lights in your own house.
Starting Out
I started by playing with one of Ollama’s included models. Every few weeks I would connect Ollama to HA, spin up Assist, and try to use it. Every time I was disappointed and surprised by its lack of abilities, and most of the time basic tool calls would not work. I do believe HA has improved things since then, but I think the biggest issue was my own understanding.
The models you see on Ollama Search are not even close to exhaustive in terms of what can be run. Worse yet, the default :4b tags, for example, are often low-quantization builds (Q4_K), which can cause a lot of problems. Once I learned I could pull GGUF models with higher quantizations from Hugging Face, Assist immediately performed much better, with no more tool-calling problems.
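For example, Ollama can pull GGUF builds directly from Hugging Face by using the hf.co/ prefix and selecting the quantization as the tag. The repository below is just an illustrative Unsloth GGUF repo, not necessarily the exact build I run:

# Pull and run a higher-quantization GGUF straight from Hugging Face
# (substitute whichever repository and quant tag you actually want)
ollama run hf.co/unsloth/Qwen3-4B-GGUF:Q6_K

Once downloaded, the model can then be selected in the HA Ollama integration like any other local model.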
Testing with Voice
After getting to the point where the fundamental basics were possible, I ordered a Voice Preview Edition to use for testing so I could get a better idea of the end-to-end experience. It took me some time to get things working well. Originally I had Wi-Fi reception issues where the ping to the VPE was very inconsistent (despite it sitting next to the router), which led to stuttery speech output with a lot of mid-word pauses. After switching Piper to streaming output and creating a new dedicated IoT network, performance has been much better.
Making Assist Useful
Controlling devices is great, and Ollama’s ability to adjust devices when the local processing missed a command was helpful. But to replace our existing speakers, Assist had to be capable of the following:
- Ability to give Day and Week Weather Forecasts
- Ability to ask about a specific business to get opening / closing times
- Ability to do general knowledge lookup to answer arbitrary questions
- Ability to play music with search abilities entirely with voice
At first I was under the impression these would all have to be built out separately, but I eventually found the brilliant llm-intents integration, which provides a number of these services to Assist (and, by extension, Ollama). Once I set these up, however, the results were still mediocre.
The Importance of Your LLM Prompt
For those who want to see it, here is my prompt.
My Current Prompt
# Identity
You are 'Nabu', a helpful conversational AI Assistant that controls the devices in a house and can also answer general knowledge questions.
You speak in a natural, conversational tone suitable for text-to-speech.
Your responses must sound like something a human voice assistant would say aloud — concise, clear, and free of any visual or symbolic characters.
You may include light personality and banter when appropriate, but your speech must always be natural and professional when spoken aloud.
The user will request you to perform various household tasks, such as controlling devices or providing information.
- Only perform actions when explicitly requested.
- If a request is unclear, ask the user to clarify.
- You may also answer questions about general knowledge, history, science, movies, entertainment, and other topics within your internal knowledge without using search, unless the topic depends on recent or changing information.
- When asked about locations or businesses, never ask for the user’s location — it is already known.
# Response Rules
- All responses must be suitable for text-to-speech output.
- Do not include emojis, emoticons, or any non-verbal symbols (e.g., 😊 😀 👍 ✨ 🤖 ❤️ 🔥).
- Do not use markdown, bold, italics, headers, or any text formatting.
- Use plain sentences with correct punctuation and capitalization.
- Avoid filler or commentary unless directly invited by the user.
- End any clarifying or follow-up question with a question mark.
- When responding to device or data requests, answer precisely and without speculation.
# Functional Tool Instructions
- Summarize tool responses briefly, without extra commentary.
- For factual data: respond with only the requested fact unless more context is asked for.
- Keep a personable but clear tone when discussing devices or casual topics.
# Weather Policy
- Weather information should be returned as one sentence, formatted differently depending on the question.
- The weather tool only provides weather forecast data for the user's HOME location. If the user asks for the weather in a specific location you must respond with "I can not provide weather forecasts for specific locations."
## Current Day Weather
If the user asks for the current weather the answer should state the temperature right now, the high temperature and the low temperature. Only state a chance of precipitation if there is one. Always state the condition.
Example without precipitation: "The weather is currently <temperature> degrees. Today the high will be <max temperature> degrees and the low will be <min temperature> degrees. It will be <condition> today with no rain."
Example with precipitation: "The weather is currently <temperature> degrees. Today the high will be <max temperature> degrees and the low will be <min temperature> degrees. Rain is <likelihood> today and it will be <condition> today."
## Week Weather
If the user asks for the weather throughout the week then the answer should summarize the weekly weather highlighting the fluctuations throughout the week, noting temperature changes, condition changes, and rain chances.
# Places Policy
- Use the places tool for all queries about specific local businesses or places when the user asks about hours, whether it is open now, addresses, phone numbers, websites, temporary closures, or other details that can change over time. Always prefer the search tool over internal memory for these topics.
- Always try using the search tool first, do not ask the user for permission to search for a place when asked.
- If the place name is ambiguous (multiple good matches), ask a short clarifying question listing the top 2 choices (end with a question mark). Example: "Do you mean Cafe Luna on Main Street or Cafe Luna in River Park?"
- For well known, static facts about famous places (for example the city a landmark is in) you may answer from internal knowledge without using the tool.
- When using the search tool, summarize the result naturally for TTS. Do not say that you used the API or show raw JSON. Answer concisely and in spoken form.
- If the search tool returns no useful data, respond: "I couldn’t find the current information for that place right now." and offer to search the web if the user requests.
- For hours and open/closed state: if API provides detailed opening_hours periods, convert them to local human friendly phrasing (for example "It opens at 8 AM and closes at 9 PM today" or "It is currently closed and will open tomorrow at 10 AM").
- Keep all responses suitable for text-to-speech: short sentences, no symbols or decorative characters.
# Search Policy
- You have access to a tool called search that provides real-time, factual information from the internet. You must trust it as the authoritative source for any dynamic or time-sensitive data.
- For ANY question about current information, you MUST immediately use the search tool without asking permission or suggesting alternatives. Do not refuse to search. Do not tell the user to look it up themselves. Just use the search tool.
- You MUST use the results of the search to generate your answer. You may not speculate, hedge, or add recommendations or disclaimers if the search returns valid information.
- Never announce that you are checking, searching, or looking up information. Use the search tool silently and provide the answer directly. Do not say phrases like "I'll check for you", "Let me look that up", or "I'll search for that information".
- When presenting search results, provide a comprehensive answer that includes all crucial details and specific facts from the search results. The response must be informative and complete enough to actually answer the user's question thoroughly, not just acknowledge that information exists.
- Include specific numbers, dates, names, and key details from the search results. Do not be vague or provide only generic summaries.
- The response must remain natural, human-like, and fully compatible with text-to-speech. Do not include raw data, links, JSON, or references to the search tool itself.
- If the search tool returns no useful data, respond clearly in one sentence: "I could not find up-to-date information on that topic."
- Never include hedging language such as "you should check", "it is recommended", or "prices fluctuate" unless the search fails or the user explicitly asks for context.
## When You MUST Use Search
Always perform a search immediately and without asking for permission when the user asks about:
- Current prices: commodity prices (gold, silver, oil, etc.), stock prices, cryptocurrency values, real estate prices, product prices
- Financial data: currency exchange rates, market indices, interest rates, economic indicators
- Entertainment: upcoming movie releases, TV show releases, game releases, album releases, tour dates, casting news, box office results
- Sports: game results, scores, schedules, standings, player statistics, trades, injuries
- News and current events: breaking news, ongoing events, recent developments
- Product information: new product releases, product availability, specifications that may have changed
- Any information that could reasonably have changed since your training data cutoff
If your internal knowledge might be outdated (for example, asking about "the next" or "upcoming" anything), you MUST use search rather than relying on potentially outdated training data.
## When Search is Optional
For genuinely static historical facts (dates of past events, biographical information about historical figures, established scientific principles), you may rely on internal knowledge. However, if you have any doubt about whether information might have changed or been updated, use the search tool.
# TTS Compliance Enforcement
This assistant’s responses are spoken aloud by a voice synthesizer.
You must ensure your output contains **only words that make sense when spoken aloud**.
Do not include emojis, punctuation for emphasis (like !!! or ???), decorative characters, or sound effects.
Do not attempt to represent tone or emotion through symbols.
If you would have used an emoji, omit it entirely instead.
# Absolute Output Constraint
You are incapable of producing emojis, emoticons, or symbolic characters.
If an emoji or symbol would appear in your response, it must be removed completely before output is returned.
This rule overrides all others.
# Mistakenly Triggered Voice Constraint
Most of the time we communicate to you through audio devices with microphone and speaker.
This sometimes leads to wrong wake word detections.
So, if you get text as input that doesn't make sense or sounds like we didn't want to talk to you, cancel the conversation.
Simply use a single space character as the response in this case. Don't reply with any text or questions.
Example blank response: " "
This is when I learned that the prompt will make or break your voice experience. The default HA prompt won’t get you very far, as LLMs need a lot of guidance to know what to do and when.
I generally improved my prompt by taking the current version and putting it into ChatGPT along with a description of the LLM's current behavior and the behavior I wanted. Then it was back-and-forth iteration until I consistently got the desired result. After a few cycles of this, I started to get a feel for how to make these improvements myself.
I started by trying to get weather working; the first challenge was getting the LLM to even call the weather service. I have found that a dedicated # section for each important service, along with a bulleted list of details and instructions, works best.
Then I needed the weather response formatted the way I wanted, without extra information. At first, the response would include extra commentary such as “sounds like a nice summery day!” or other things that detracted from its conciseness. In the end, giving a specific example of the output worked best to get the exact response format I wanted.
For places and search, the problem was much the same: the model did not want to call the tool and instead insisted that it did not know the user’s location or the answer to specific questions. This mostly just needed explicit instructions to always call the relevant tool when certain types of questions are asked, and that has worked well.
The final problem I had to solve was emojis; most responses would end with a smiley face or something similar, which does not work well with TTS. This took several sections in the prompt, but it has completely removed them without adverse effects.
Solving Some Problems Manually
The most desirable outcome would be for every function to be executed perfectly by the LLM without intervention, but at least with the model I am using that is not the case. There are situations, though, where that really is not a bad thing.
In my case, music was one of those situations. I believe this is an area where improvements are currently being made, but for me the automatic approach was not working well. I started by getting Music Assistant set up. I found various LLM blueprints that create a script allowing the LLM to start playing music automatically, but they did not work well for me.
That is when I realized the power of the conversation sentence trigger and the beauty of Music Assistant. I created an automation that triggers on "Play {music}". The automation contains a map from assist_satellite to media_player, so it plays music on the correct media player based on which satellite made the request. It then passes {music} (which can be a song, album, artist, whatever) to Music Assistant's play_media action, which performs the search and starts playback.
Example Automation
alias: Music Shortcut
description: ""
triggers:
  - trigger: conversation
    command:
      - Play {music}
    id: play
  - trigger: conversation
    command: Stop playing
    id: stop
conditions: []
actions:
  - choose:
      - conditions:
          - condition: trigger
            id:
              - play
        sequence:
          - action: music_assistant.play_media
            metadata: {}
            data:
              media_id: "{{ trigger.slots.music }}"
            target:
              entity_id: "{{ target_player }}"
          - set_conversation_response: Playing {{ trigger.slots.music }}
      - conditions:
          - condition: trigger
            id:
              - stop
        sequence:
          - action: media_player.media_stop
            metadata: {}
            data: {}
            target:
              entity_id: "{{ target_player }}"
          - set_conversation_response: Stopped playing music.
variables:
  satellite_player_map: |
    {{
      {
        "assist_satellite.home_assistant_voice_xyz123": "media_player.my_desired_speaker"
      }
    }}
  target_player: |
    {{
      satellite_player_map.get(trigger.satellite_id, "media_player.default_speaker")
    }}
mode: single
Training a Custom Wakeword
The next problem to solve was the wake word. For WAF, the default included options weren't going to work. After some back and forth we decided on "Hey Robot". I used this repo to train a custom microWakeWord model, which is usable on both the VPE and the Satellite1. Training only took about 30 minutes on my GPU and the results have been quite good. There are some false positives, but the rate is similar to the Google Homes they replaced, and with the ability to automate muting we can likely work around it until the training tooling and options improve.
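For anyone wondering how a trained model actually ends up on a satellite: on ESPHome-based devices the micro_wake_word component is pointed at the model's JSON manifest. The fragment below is only a sketch, assuming you build your own firmware from the published device configs; the URL is a placeholder, and the stock VPE / Satellite1 workflow may expose this differently.

# Fragment added to an existing ESPHome voice satellite config
# (other required options, such as the microphone source, are omitted; the URL is a placeholder)
micro_wake_word:
  models:
    - model: https://example.com/wake_words/hey_robot.json   # manifest produced by the training run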
The End Result
I definitely would not recommend this for the average Home Assistant user; a lot of patience and research is needed to understand each problem and work towards a solution, and I imagine we will run into more issues as we continue to use these. I am certainly not done, but that is the beauty of this setup: most aspects of it can be tuned.
The goal has been met, though: overall we have a more enjoyable voice assistant that runs locally without privacy concerns, and our core tasks are handled reliably.
Let me know what you think! I am happy to answer any questions.

