Going all-in with Voice Assistant (help needed)

tl;dr Scroll down to the “Needs” section where I list what phrases I need and how most don’t work.

Background

By the end of this week, when all my speakers have finally come in the mail, I’ll be fully transitioning over to only using Home Assistant Voice Preview Edition after 10 years of Google Home and Alexa.

I currently use a combination of:

  1. Amazon Echo Dot for basic smart home tasks.
  2. Google Home Mini for music and broadcasts.
  3. Google Chromecast Audio for music to speakers over a 3.5mm cable.

Reasons I use both

  1. Alexa announcements broke this year because of added noise cancellation. Announcements end up getting chopped up to the point where you can’t understand them, or they’re mostly silence. Google Home Mini is also louder than my 2nd Gen Echo Dots.
  2. My 1st gen Google Home Mini devices are super slow for smart home tasks. I mean, for the first year of ownership, they couldn’t even give me the time. On the other hand, their ability to play music is wonderful. I wouldn’t use them if not for that feature.
  3. I like that Alexa lets me find my phone and works 99.9% of the time no matter what I ask it. It even has a feature where you can whisper to it and it whispers back, plus the ability to talk to another Alexa device (intercom) by using the “drop-in” keyword. You can even call people’s phones when you’re still in bed!

Needs when using only Home Assistant Voice PE

Voice Assistant phrases I use just about every day:

  1. “What time is it?”
  2. “What day is it?”
  3. “What’s the temperature [outside]?”
  4. “Turn on Exhaust Fan for 10 minutes”.
  5. “Broadcast (or Announce) Dinner’s ready!”.
  6. “Drop in on Kitchen”: intercom feature.
  7. “Find my phone” feature (rings the phone by calling it or playing an alarm).
  8. “What did you do?” for recalling the last action.

It’s rarer for me to ask “what sound does a badger make” or “what’s 5 times 8042?”. While fun, those are effectively irrelevant because I can always use my phone. For the other ones, I want immediate feedback.

Another issue, Voice Assistant always does weird stuff like this:

What it does today

:white_check_mark: 1. “What time is it?”

:no_entry: 2. “What day is it?”

This question always does some wild stuff when it runs directly on the AI without local handling.

:white_check_mark: 2.1. “What’s the date?”

This works, but any deviation gives weird responses again.

:no_entry: 3. “What’s the temperature [outside]?”



:no_entry: 3.1 Exposing weather entities

It appears I didn’t have any weather entities exposed, so I did that:

And it still didn’t work :cry::

:white_check_mark: 3.2. Ask for the weather

After asking for the weather directly, now I’m getting the outside temperature.

:no_entry: 4. “Turn on Exhaust Fan for 10 minutes”.

:white_check_mark: 4.1 “Turn off Exhaust Fan in 10 minutes.”

Since turning it off on a timer using the “Turn On” command didn’t work, I tried something simpler:

I’ll find out in 10 minutes. Timers like these work correctly in my testing earlier this month.

UPDATE: It worked.

It sucks to have to turn it on first and then ask it to run a command in 10 minutes rather than a 2-for-1 situation.

:no_entry: 5. “Broadcast (or Announce) Dinner’s ready!”.

It’s late, so I’ll have to do this another day. As far as I understand, this does not work.

My earlier test tonight, trying to announce to a single room, also didn’t work.

:no_entry: 6. “Drop in on Kitchen”: intercom feature.

Simply doesn’t work. There’s no voice transfer or recording functionality sadly.

Maybe Sendspin will make this possible in the future, but I’d like something working now.

:no_entry: 7. “Find my phone” feature.

It can’t tell what I’m asking.

It’s also late, so I can’t verify if I can announce to my phone. The most I know I can do is probably send a notification.

:no_entry: 8. “What did you do?” for recalling the last action.

It has no clue what I’m asking when I say this.

It’s useful to have this to check what it did if it says “yes, I did that thing you wanted” and the thing didn’t happen. Now you have to wonder what actually happened.

Conclusion

Pretty much all Voice Assistant can do for me is give me the time and “date”, not “day”.

Music Playback :+1::+1:

The other thing I needed from Voice PE is streaming music, and it does that fantastically well with Music Assistant!

While the onboard speaker is complete and utter trash, not useful unless it’s right by your ear, I’ve attached every single Voice PE to a nice speaker over 3.5mm. This also replaces all my Google Chromecast Audio devices!

I suddenly remembered some other stuff I say (I keep remembering more):

  1. "Turn off <ROOM_NAME> lights."
  2. “Play <SONG_NAME> on <SPEAKER_GROUP_NAME>.”
  3. “Set volume in <ROOM_NAME> to 30%.”

And as I said before, whispering commands to avoid it shouting “OK, TURNING ON LIGHTS IN <ROOM_NAME>” back at me. This probably won’t ever work unless a context like “user is shouting” or “user is whispering” is integrated into Home Assistant.

Wide variation depending on STT provider
Speech to Phrase STT is good but very specific. I find that it does not take kindly to low input volume (speak loudly or be close to the device). It has a set dictionary, and I think my TV in the background (loud) interferes and prevents it from matching.

Faster Whisper STT is good, but if mistakenly woken by the TV it will ramble on with its “It didn’t understand … 30 seconds later … command”.

Nabu Cloud was very good for some reason. No complaints, but I haven’t used it in a while.

I currently use both Speech to Phrase and Faster Whisper. The Alexa and Hey Jarvis wake words each enable one or the other. I’m doing this so I can test each.

The drop-in feature currently does not exist.
You can send a TTS message directly to a device. I use this for now, but drop-in is much needed. Maybe by the end of 2026.

I can appreciate you testing and giving a rundown on what didn’t work, but you also did not tell us what method (Speech to Phrase or LLM) you’re using for voice. It’s NOT plug and play; quite far from it.

The folks here can help you get better accuracy, but we’ll need more info, and honestly, lower your expectations a tad. It is still not a finished product at all (Preview Edition). I say this because I note you bounce products when one doesn’t do something you wanted. The VPE will have a lot you don’t like about it for a while, I’m sure.

If you’re not using an LLM, voice needs extensive tools. Read: you need to ensure tools exist for your ask, and if they don’t, write them yourself or pull blueprints. There’s a great one for music control by Music Assistant, for instance.

If not using an LLM, Speech to Phrase would trip over time of day for sure. No tool.

If you are using an LLM, tools and extensive grounding are required (see the grandma’s cardboard box problem in Friday’s Party; search and it’ll be like hit #1). Simply exposing entities is NOT enough; you also have to prompt effectively to tell the LLM what to do. Without that, well, you see the result: it’ll feel like a drunk college student. But you already found that out.

So are you using an LLM or Speech to Phrase?

Default keyword recognition is bad

@tmjpugh Even shouting, it’s not great at understanding me.

Even next to it, it sometimes does nothing when I say the wake word. Maybe it needs more consonants?

I also have the mic on “High” sensitivity.

AI models

@NathanCu I’m using Home Assistant Cloud with “Ok Nabu” as the keyword and “prefer local”.

I set up Ollama with a bunch of different models, and the only decent experiences were with llama3.2 and qwen3. Anything else I tried had a lot of trouble. llama3.2 was the best overall, but it seemed to get worse as I used it. It would respond correctly a couple of times, then suddenly give me weirdo responses a minute later for the same question, even with a new context. In the end, I decided to avoid any Ollama setup for now.

I also tried changing the prompt, but it seemed like that only made things worse.

Speech to Phrase

@NathanCu What’s Speech to Phrase?

Explorations

I was asking ChatGPT for some ideas last night, and it was telling me to use 100% YAML custom intents and phrases. It’s okay, but I don’t like the idea of putting entity ids in YAML without a UI component. Seems like an easy way to break stuff until you find out months later.

I couldn’t find that Grandma’s Voice Box thing you were talking about either.

Here you go

STT (speech to text) is used to convert the words you say into text (a sentence) that may be used as an input command to an AI, an LLM, basically your voice assistant.

Speech to Phrase is an STT provider.
Whisper is another STT provider.

When setting up your voice assistant, you should have set up one of these.

TTS (text to speech) is the opposite. It is used by the voice assistant to convert the output text response into an audible speech response for playback.

Piper is generally used for this.

For both STT and TTS there are others available, but those above are the HA recommendations, I believe. In both cases you can use Nabu Cloud. I think that used Speech to Phrase for STT and Piper for TTS. I used Nabu for a while and it worked the best, I believe, but I think I need to test again to have a good comparison.
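
The flow described above (speech in, text to the assistant, reply text out, speech out) can be pictured as three pluggable stages. This is only a rough sketch: `run_pipeline` and the three stand-in functions are hypothetical names, not Home Assistant APIs, and the real providers (Speech to Phrase, Whisper, Piper) do far more than these stubs.

```python
from typing import Callable

# Each stage is just a function, so providers can be swapped independently.
SpeechToText = Callable[[bytes], str]
Agent = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


def run_pipeline(audio: bytes, stt: SpeechToText, agent: Agent, tts: TextToSpeech) -> bytes:
    """Run one voice request: audio -> text -> reply text -> audio."""
    text = stt(audio)      # STT: what did the user say?
    reply = agent(text)    # Conversation agent: what should be answered?
    return tts(reply)      # TTS: turn the answer back into audio


# Stand-in stages (hypothetical) that fake "audio" as UTF-8 bytes.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8").strip().lower()


def fake_agent(text: str) -> str:
    if text == "what time is it?":
        return "It is noon."
    return "Sorry, I didn't understand."


def fake_tts(reply: str) -> bytes:
    return reply.encode("utf-8")
```

Swapping the STT provider only changes the first stage, which is why the same assistant can behave so differently depending on which STT it is paired with.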

Are you saying that I only need to add Tater to Ollama, and I’m good? What prompt are you using? Default?

After that I’m still not sure what you use now for processing the commands itself.
Is it another LLM model now?

If you’re not sure whether the problem is your local model, or whether HA simply doesn’t provide the needed tools / capabilities for your requests, you could try one of the cheaper cloud models (e.g. gpt-<any-number>-mini would be a good starting point).
Once you’ve got things running there, try a local model, find out what’s not working anymore, and investigate why.
Might be easier than trying to solve all problems out there at once. :wink:

As you seem to use LLMs instead of speech to phrase based on your comments, I would suggest also disabling “prefer local” for the time being.
Otherwise you’re fighting against 2 systems, where one or both might not support what you’re testing and you end up with mixed results.
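
To make the "two systems" point concrete, here is a rough sketch of local-first handling with an LLM fallback. `handle`, `local_intents`, and `fake_llm` are hypothetical stand-ins, not Home Assistant internals; the only point is that with "prefer local" on, answers in the same session come from two different handlers.

```python
from typing import Callable, Optional


# Hypothetical local handler: returns a reply, or None when no sentence matches.
def local_intents(text: str) -> Optional[str]:
    known = {"what time is it": "local: time intent"}
    return known.get(text.lower().rstrip("?!. "))


# Hypothetical LLM handler: always produces *some* answer.
def fake_llm(text: str) -> str:
    return f"llm: best guess for '{text}'"


def handle(
    text: str,
    prefer_local: bool,
    local: Callable[[str], Optional[str]] = local_intents,
    llm: Callable[[str], str] = fake_llm,
) -> str:
    """Route a request; with prefer_local, the LLM only sees local misses."""
    if prefer_local:
        reply = local(text)
        if reply is not None:
            return reply
    return llm(text)
```

With `prefer_local=False`, every request goes through one handler, so a bad answer can be attributed to the LLM setup rather than to whichever system happened to pick it up.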

What you then need is a good prompt that tells the AI how you want things controlled and how it should behave, so you get repeatable results.

The other part is user intents / scripts that the AI can call to do what you want.
HA simply doesn’t have everything on board so far to make power voice users happy.
That might change over time, as they are adding more and more intents.

Even a search tool for entities is very helpful if you expose a lot of entities, so the LLM can easily look up the needed ones for rooms, user tags, …
Home Assistant only provides a large list of entities with state and some attributes, so how well that works depends on how “smart” your LLM is and how large the entity list is.
Otherwise simple requests like “turn off all lights in the living room” might not always work as expected.
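
A search tool of the kind mentioned above could look roughly like this. It is a sketch over a hand-written list: the `Entity` fields, the sample entity IDs, and `search_entities` are all made up for illustration and are not how Home Assistant stores or exposes entities.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    entity_id: str
    name: str
    area: str
    labels: frozenset = field(default_factory=frozenset)


# Hypothetical sample data.
ENTITIES = [
    Entity("light.living_room_ceiling", "Ceiling Light", "Living Room"),
    Entity("light.living_room_lamp", "Floor Lamp", "Living Room"),
    Entity("light.kitchen_counter", "Counter Light", "Kitchen"),
    Entity("sensor.office_temperature", "Office Temperature", "Office", frozenset({"climate"})),
]


def search_entities(entities, domain=None, area=None, label=None):
    """Return the IDs of entities matching every filter that was given."""
    hits = []
    for entity in entities:
        if domain and entity.entity_id.split(".", 1)[0] != domain:
            continue
        if area and entity.area.lower() != area.lower():
            continue
        if label and label not in entity.labels:
            continue
        hits.append(entity.entity_id)
    return hits
```

For “turn off all lights in the living room”, the LLM would call something like `search_entities(ENTITIES, domain="light", area="living room")` and get back a short list, instead of scanning the full entity dump in its context.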


Once you're at that point you should be ready to add your own scripts to add "capabilities" to your assistant.

Not sure how Tater’s setup is exactly structured. I’ve seen it’s a whole add-on, so I wasn’t able to find any single scripts at a glance.
Is Tater using HA intents / scripts, or are you using a different approach to control things externally, since your package also addresses other things besides HA?

Nate’s Friday’s Party thread, which he linked above, offers the most advanced collection of tooling and prompts shared in the forums.
He’s also bundling everything up at the moment to make it easier to get started.


About your music control: You most likely use the script from TheFes to control Music Assistant (as it offers more options compared to the internal Music Playback script, at least this was the case until 1 or 2 months ago)?

I wrote another script that gives the LLM music search capabilities instead of just playback by search terms, which opens new possibilities.
Wrote about that here.

If you need even more inspiration on some simple scripts or prompting, I try to write down all the problems / solutions I run into while setting up Assist to use in our house:

https://community.home-assistant.io/t/about-making-inexpensive-models-smarter-by-providing-tools-and-context-local-models-gpt-5-mini-gpt-4-1-mini-gpt-4o-mini/

Edit, about your list:

  1. “What time is it?”
    Should be possible out of the box with an LLM based assist.

  2. “What day is it?”
    Should be possible out of the box with an LLM based assist.

  3. “What’s the temperature [outside]?”
    Needs some prompting on how it should retrieve that, so it’s reliable and always done the same way. Also tell it in the prompt if you want to use a temperature sensor or a weather entity for questions like that (which also needs a tool, if that didn’t change recently. I use this one).

  4. “Turn on Exhaust Fan for 10 minutes”.
    I think delayed commands are already possible with a default intent from HA. Not 100% sure, don’t use this often myself.

  5. “Broadcast (or Announce) Dinner’s ready!”.
    Will need a script to do TTS to a media player in a specific room.
    Also currently on my todo list.

  6. “Drop in on Kitchen”: intercom feature.
    As you mentioned, this is most likely not easily possible atm.

  7. “Find my phone” feature (rings the phone by calling it or playing an alarm).
    A script that can send notifications to mobile phones would allow for that too, if you allow the script to also send critical notifications (and allow this for the HA Companion app in the iOS settings).
    Then you can send notifications with a sound that is played even if the device is muted.

  8. “What did you do?” for recalling the last action.
    An LLM based assist should be able to tell you that without additional tools / information. If not, something is broken with conversation history in your LLM setup or the LLM integration you use (which means each request will become a new conversation and the LLM won’t have any information about what you asked before)
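
The failure mode in item 8 can be pictured with a tiny history store. This is a hypothetical sketch (`ConversationStore` is not a Home Assistant class): past actions are looked up by conversation ID, so an integration that starts a fresh ID on every request has nothing to recall.

```python
from collections import defaultdict
from typing import Optional


class ConversationStore:
    """Keeps the actions taken in each conversation, keyed by conversation ID."""

    def __init__(self) -> None:
        self._actions = defaultdict(list)

    def record(self, conversation_id: str, action: str) -> None:
        self._actions[conversation_id].append(action)

    def last_action(self, conversation_id: str) -> Optional[str]:
        # Use .get so that asking about an unknown ID does not create an entry.
        actions = self._actions.get(conversation_id)
        return actions[-1] if actions else None


store = ConversationStore()
store.record("conv-1", "turned on Exhaust Fan")

# Same conversation: the follow-up question can be answered.
same_conversation = store.last_action("conv-1")
# New conversation ID for the follow-up: the history is effectively gone.
new_conversation = store.last_action("conv-2")
```

If “What did you do?” always comes back empty, checking whether the follow-up reuses the same conversation is a reasonable first step.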

I recently bought a Claude AI subscription and connected it to Home Assistant with the unofficial HA-MCP project.

Since then, I’ve been building up automations and fixing things that’ve been in my backlog, and now I’m free to look at Voice Assistant again. I can use Claude to help integrate anything suggested, so I don’t have to do all of it myself.

I would prefer to use local-only if I can make it work the way I’d like. I was only using AI as a fallback because local controls (by default) are very limiting.

I do have tags on everything to make it easier to find something. They’re all clearly named as well.

I’m not using either of these (neither Tater nor Nate’s Friday’s Party). I don’t know what they are or why I’d want one over the other.

I’ll have to check out TheFes’s script.

And your script! :+1:

The other things you wrote were great! Thanks. I’ll go through the list and see how many of my requirements I can get through now!

Home Assistant has had a ton of updates since I first made this thread including Music intents that specifically work with Music Assistant.

New Data

Tried this again with gemma4:e4b doing the processing when local commands don’t work.

1: Current Time

:white_check_mark: 1.1: “What time is it?”

:white_check_mark: 1.2: “What’s the time?”

2: Current Date

:white_check_mark: 2.1: “What day is it?”

:white_check_mark: 2.2 “What’s the date?”

:white_check_mark: 2.3 “What day of the week?”

3: Temperature

:no_entry: 3.1: “What’s the temperature?” (expecting outside)

If it doesn’t know, it should ask me more simply “which area would you like to know?”

:white_check_mark: 3.2: “What’s the temperature outside?”

:white_check_mark: 3.3: “What’s the temperature in the office?”

It’s right!

I tried asking for other rooms with temperature sensors on devices, but not a Thermostat on the wall, and it couldn’t do it:

I just realized that’s because I haven’t exposed those entities. I may or may not do that in the future, but it’s good to know. Anyone have advice on if I should blanket expose all my entities like that?

:no_entry: 3.4: “What’s the temperature downstairs?”

Maybe it’d be able to find the one thermostat on the first floor instead?:

I tried coaxing it into giving me more abstract data, but it couldn’t figure it out:

I can probably add an alias for “Downstairs” and “Upstairs” on the thermostat entities.

4: Weather

:white_check_mark: 4.1: “What’s the weather like?” (expecting today)

I tried to coax it into giving more human-style info, but it couldn’t do that:

I was hoping I could get the “what it feels like” temperature, but I don’t know what I need to ask:


:no_entry: 4.2: “Is it going to rain this week?”

I did expose the “Forecast” entity, so I don’t know what’s going on here:

Even with a different question, it couldn’t answer more than the current conditions:

5: Timers

:no_entry: 5.1: “Turn on Kitchen exhaust fan for 1 minute.”

This is interesting. This is actually something that worked without an LLM in the past, but only if I told it to turn something off in X time. What am I missing?

:no_entry: 5.2: “Set a timer for 1 minute.”

6: Announcements

I know announcements don’t work right. I have a script to do it, but it appears it only announces to satellite speakers, not media player entities.

7: Drop in on Room

Not possible to have a conversation over Voice yet.

8: Find my phone

Kinda… It can call my scripts, but can’t ring my phone or make a call. Is there a better solution for this?

9: “What did you do?”

I have an example above of it being able to tell me what it did or how it got to a conclusion. Much better than before!

10: Music Playback

I tested this already, and I know it works.

It can even find songs with Romanized Japanese titles. Haven’t checked if it can look up songs with Japanese characters though.

Findings

  1. Most tests pass now that didn’t in the past. It helps to have an LLM, but gemma4 is even better than llama3.2.
  2. I don’t have enough entities exposed for the things I wanna ask sometimes, and I don’t know if I should expose them or not.
  3. Timers no longer work. They used to though. I’m not sure what happened.

Expose as FEW as possible; the more you expose, the bigger your context gets. You want a solution like an index for information entities.

Announcements: That all depends on your script. You should be able to send a TTS message to any media player

Find my phone: Also here, if you create a good script, you should be able to make your phone ring.

Timers: Timers don’t work on every device, as the timer runs on the device itself. It should work if you issue it by voice on the HA Voice PE.

The fun thing about voice in HA, is that you can build anything you want using sentence triggered automations and/or scripts. But for some things you need to build the logic yourself.
I created some blueprints which you can use to get e.g. weather forecasts working. You can find more info here (oh wait, I see @Thyraz already linked to them)

Look at my postman script for announce examples; it does all those things, like TTS and find phone. It’s based on Drew’s implementation.

I use TheFes’s scripts for music and weather for Friday, and won’t write my own unless his break.

Timers are a known pain point until HA decides how we do a central timer registry. Right now they’re managed on device, and that really needs a better permanent solution.

Should it? Did you put that in your prompt? I have this type of behavior in my prompt. I am running a larger model (Gemma4 26B-A4B), though, and it has no problem asking when something is not clear.

This requires external tools; the built-in tools do not have this capability at all. I’d recommend skye-harris/llm_intents on GitHub (“Exposes internet search tools for use by LLM-backed Assist in Home Assistant”), which adds lots of tools for things like this.

Please PLEASE PLEASE be very VERY careful with a bare search and scrape tool that does not include prompt injection defense

This (unprotected web search) is the #1 way to compromise an LLM controlled install

I use a module for LiteLLM for this and set LiteLLM between me and any inference services (including local).

No.

No it does not unless you have instructed it otherwise.

Everyone needs to remember in LLM land - when it ‘wakes’, think this way…

To think like an LLM, imagine yourself in a dark room with NOTHING.
You get ABSOLUTELY NOTHING unless you are given instructions. WHO you are, what you are supposed to do, what the world looks like, and how you do the things are all things that need to be explained to you. You know language and can understand instructions and have some basic knowledge about making decisions and understand better and worse. But that’s about it from any practical start.

Take NOTHING for granted, do not assume it CAN do anything. << Good frame to start with.

You are an assistant - OK I’m an assistant what’s that? <<< Literally this is your agent - and why I hate that stupid one liner default prompt.

So this is a 100% burn-this-into-your-brain rule when prompting.

ASSUME THE LLM CANNOT UNDERSTAND WHAT YOU MEAN. It is 8 years old and needs EVERYTHING explained. If you don’t, you get what you get.

  • If you want it to shut up and do nothing - then tell it.
  • If you want it to ask questions and challenge you when it doesn’t know, guess what: you MUST tell it.
  • You cannot ASSUME it can do anything.

For that, add context (the entire story of Friday). You get what you tell it, nothing more. But in an ironic turn, the more context you add, the worse it performs, and there is a hard top limit to context. For most models you’ll realistically use, that top end is about 256K. BUT for performant use you really need that CTX under 128K (and the smaller the better). And at 128K, with most decent, capable home-use LLMs, that context means give or take a 12-second lag before the first token (TTFT). Sooo.

To reduce TTFT go as absolutely small on CTX as possible (but guess what, the smaller you go the less capability you have…LOL) Thus, why you don’t want to dump hundreds of ents in your assist context just to be able to read them. (I’m a big proponent for a readable index of some type outside the assist control surface)

What you’re learning to do is balance instructions v. context size and the RIGHT answer is putting in the things you need while trimming the things you don’t. You’ll have to make a lot of compromises.

You will find yourself basically saying…
“Well, I can make it do this but that’s a 16K block into the prompt or I can make it do that and it’s harder and maybe more convoluted but prevents me from dumping 16K in the prompt just to do XYZ…”
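
For a rough feel of that trade-off, the figures above can be turned into back-of-the-envelope arithmetic. The sketch below assumes time to first token grows linearly with prompt size, anchored on the “128K is about 12 seconds” figure from this post and treating 128K as 128,000 tokens. Both are assumptions for illustration; real numbers depend heavily on hardware and model.

```python
# Reference point taken from the post: ~12 s TTFT at ~128K tokens of context.
REFERENCE_TOKENS = 128_000
REFERENCE_TTFT_SECONDS = 12.0


def estimate_ttft(prompt_tokens: int) -> float:
    """Estimate time to first token, assuming linear scaling (an assumption)."""
    return REFERENCE_TTFT_SECONDS * prompt_tokens / REFERENCE_TOKENS


def cost_of_block(base_tokens: int, extra_tokens: int) -> float:
    """Extra seconds of TTFT caused by adding a block to the prompt."""
    return estimate_ttft(base_tokens + extra_tokens) - estimate_ttft(base_tokens)


print(estimate_ttft(16_000))          # a 16K block on its own
print(cost_of_block(32_000, 16_000))  # adding 16K to a 32K prompt
```

Under this linear assumption, each 16K block costs about 1.5 seconds before the assistant starts answering, which is the kind of number to weigh against the more convoluted alternative.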

Yeah, absolutely. Sometimes it is easy to think of it as a mind reader / expect it to know what you mean from a few words. I have found it quite useful when an LLM doesn’t behave the way I expected to just ask it why it did that. That made me realize some of my rules were actually contradicting each other. It told me it was confusing and that was correct.

tbh at this point I’m finding my time (and my Claude sessions) better spent crafting extensive Sentence Trigger automations and just skipping the LLM conversation agent altogether. I hate that it has to be this way (having to semi hard-code my commands), but slots/variables in custom sentences get me 99% of what I need out of Voice with much less fuss.

It’s ridiculous but all this mushy prompt wrangling, juggling context sizes, having to minimize the number of exposed entities, and generally acting more like a babysitter than a leader is silly, especially if the LLM is just as likely to return or execute incorrect info (after spinning its wheels for 10 seconds!) versus just rejecting the request. I’d rather it reject the request immediately so I can rephrase more clearly.
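
The slot-based approach can be illustrated with a small matcher. This is not Home Assistant’s sentence syntax or its matching engine; `compile_template` and `match` are hypothetical helpers that only show the idea: a fixed phrase with named slots either matches and yields values, or is rejected immediately.

```python
import re
from typing import Optional


def compile_template(template: str) -> re.Pattern:
    """Turn 'turn on {name} for {minutes} minutes' into an anchored regex."""
    # Splitting on a capturing group keeps the slot names at odd indexes.
    parts = re.split(r"\{(\w+)\}", template)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)      # literal text
        else:
            pattern += f"(?P<{part}>.+?)"   # named slot
    return re.compile(f"^{pattern}$", re.IGNORECASE)


def match(template: str, sentence: str) -> Optional[dict]:
    """Return slot values if the sentence fits the template, else None."""
    cleaned = sentence.strip().rstrip(".!?")
    found = compile_template(template).match(cleaned)
    return found.groupdict() if found else None


TEMPLATE = "turn on {name} for {minutes} minutes"
```

`match(TEMPLATE, "Turn on Exhaust Fan for 10 minutes.")` yields the `name` and `minutes` slots, while a sentence that does not fit returns `None` right away, with no 10 seconds of wheel spinning first.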

If I want access to general knowledge questions I’ll create a separate LLM agent that has no control over my smart home, and access it via the 2nd wake word.
