Blueprint on AI using ollama on Apple Silicon

Hey everyone! :waving_hand:

I wanted to share my setup for a completely local voice assistant using Home Assistant with a Mac Mini as the AI inference server. This setup uses zero cloud services - everything runs on your own hardware!

Hardware

  • Mac Mini M2 - Runs all AI services
  • Home Assistant - On separate hardware
  • Home Assistant Voice Preview Edition - ESP32-S3 based voice satellite## The Stack

On Mac Mini (All using Apple Silicon acceleration):

  1. Whisper.cpp (Port 8910)

    • STT using ggml-large-v3-turbo model
    • Launched via LaunchAgent with whisper-server
  2. Wyoming-Whisper-API-Client (Port 10300)

    • Bridges whisper.cpp to Wyoming protocol
    • Installed via Homebrew
  3. Ollama (Port 11434)

    • Local LLM using llama3.2-vision
    • Installed via Homebrew as a service
  4. Wyoming-Piper (Port 10200) - :warning: Important Note

    • TTS with en_US-hfc_female-medium voice
    • Using custom fork: https://github.com/jooray/wyoming-piper
    • This fork fixes compatibility with piper-tts==1.3.0
    • Installed in Python 3.12 venv (Python 3.13 removed audioop module)

Why the Wyoming-Piper Fork?

The official wyoming-piper is currently broken with the latest piper-tts. This fork provides a temporary fix by using command-line invocation instead of the Python API. Use this fork until the official version is updated.

Gotchas I Encountered

  1. HTTPS breaks ESP audio - Use HTTP for internal_url in configuration.yaml
  2. Python 3.13 breaks audio libs - Use Python 3.12 for Wyoming-Piper
  3. Wyoming-Piper needs the fork - Official version has compatibility issues
  4. Ollama needs all interfaces - Set OLLAMA_HOST=0.0.0.0
  5. Choosing a good model (see below) - tool calling
  6. Naming my devices - I have a light called bedroom, but also an AC called bedroom. They are both in my bedroom. When I tell it to turn on the bedroom AC, it usually fails, unless I create an Alias for this entity in Assistant configuration (literally “Bedroom AC”, “Bedroom Air Conditioning”).

Model choice

This is where I would like some help. A lot of information out there is out of date. What is the best model?

HA switched to tool calling for ollama, which is great, but not all models support tool calling. You can find those that do here. Out of these, qwen3-based models work best, but the problem is it is hard to turn off thinking (reasoning), I haven’t figured out how to do it. That means, that the model is thinking too much and that greatly increases the latency. I tried all models up to qwen3:4b.

LLaMA-based models - the smaller ones - seem to be worse and often don’t do what I need (and llama3 does not support tool calling). I generally often use gemma, with small gemma 3n it should be great, but these also don’t support tool calling.

The models specifically trained for HA such as fixt/home-3b-v3 don’t work anymore, because they don’t support function calling. So it’s quite hard to find online what people are using these days - past recommendations are often broken (and in AI world, 6 month old recommendation is basically paleolithic anyway).

Venice.ai

I have also tried venice-ai using their API. This is generally much faster and I can afford to run bigger model, but I also don’t know which one to choose, but 70b llama models seem at least usable. I would still prefer running local models.

Future Improvements

Once the official wyoming-piper is fixed, I’ll update to remove the fork dependency. Also considering adding more voice satellites around the house.

Hope this helps someone else achieve a fully local voice assistant! Happy to answer questions about the setup. :studio_microphone::house:


Note: This is a working setup as of July 2025.

1 Like

Newest ollama integration (2025.7.x) has direct support for think tags and defaults off.

Llama vision is good for images but probably heavy for daily driver.

You’ll want a tool use model that supports a decent number of params (i usually go 4-8b) and your context window size is important. Shoot for 8K or better.

I use venice-ai integration. Had to put the link in comment, as a new user I can only add two links to the post, sorry.

Any particular model recommendation that works for you?

The think tag support is OK, it’s just that the thinking happens, which slows down the response. It works OK, it does not speak its reasoning to me. But the fact that it needs to go through the thinking process makes the reaction slower. (I can see in the logs that it is thinking)

It depends on use case. There is not one model fits all. And if there is it will be a reasoning model with a very high context size. I don’t do that for everything. I use a mix.

Llama3.2:4b or 8b for the single pass Summaries…
Llama3.2:vision for camera and image reads
And For local reasoning qwen.

Honestly qwen is going to give a really good size perf model but a lot of your success is the prompt.

That said I need a gigantic context size for what I’m doing so I use oai GPT4.1-mini (paid) for the Frontline calls until I have a faster local inference farm.

Gigantic - vram you talk a lot about models but never mentioned vram.

What makes these things go is context size. I can make a relatively small model outperform with a large context and good context description. Meanwhile I can make a big model like qwen run like crap with a small context size and no instructions…

I have 32GB, so I can fit most models. I had a great experience with mistral-small now. Qwen has latency and the small llamas need a really good prompt.

For remote inference with venice, I get good results with qwen3-235b, but you need to disable thinking, otherwise it does not work. Patch to the integration incoming.

Check out the newest qwen releases (8/1/2025) - they have fixed most of the issues and have a thinking and non thinking separate model - Should be able to get a 32b that punches WAY above it’s class. I’m eyeballing them hopefully.

1 Like

gpt-oss:latest (the 20b model) works pretty well!

I’m also looking for the right model to run on a Mac mini M4 with 32 GB RAM, and so far the results have been pretty mixed. It’s either too slow or just not smart enough.

I’m currently dealing with 300+ entities, around 100 of them exposed to the voice assistant. That’s obviously quite a performance hit, but if I trim things down too much, the voice assistant loses most of its value.

Have any of you found newer models that work really well with Home Assistant?