Running a Local LLM on an Intel N150 Mini PC for Home Assistant — A Practical Journey
TL;DR: A cheap (<€200) Intel N150 mini PC with 16GB RAM can run a local LLM fast enough to be genuinely useful for async cached Home Assistant automations and announcements, when set up correctly with Vulkan GPU acceleration. This post documents the full journey — hardware, OS, model selection, performance benchmarks, and HA integration.
The Hardware
Soyo M4 Plus 2
- Intel N150 (Alder Lake-N, 4 cores)
- 16GB DDR4 (slow, shared with iGPU)
- 512GB SSD
- Intel UHD integrated graphics (Xe, ADL-N)
This is a fanless, passively cooled mini PC in the €150-€200 range. It was originally purchased to migrate my Mosquitto MQTT broker from an old Raspberry Pi, but repurposed here as a local LLM inference server on a temporary basis, for a POC.
Another identical N150 unit also runs as a separate dedicated machine in my HA stack, as a Whisper/Piper STT/TTS server for Home Assistant voice, using OpenVINO acceleration.
The Goal
Not to replace cloud AI. Not to run a powerful reasoning model. The specific goal was:
Run a local LLM that can generate natural language announcements and summaries for Home Assistant automations, where speed is not critical because results are cached asynchronously.
Examples of this use case:
- Weather condition summaries spoken by a TTS system
- Morning briefings compiled from HA sensor data
- Presence arrival/departure announcements
- Anomaly alerts (window open, temperature warnings, etc.)
This is a niche but real use case. If your automations pre-generate text and cache it in an input_text helper, a slow model is perfectly acceptable. Inspiration for this approach came from this blog post by Strato
The Stack
- OS: Debian 13 (Trixie) — minimal install, SSH server + standard utilities only
- Inference engine: llama.cpp (PrismML fork,
prismbranch) with Vulkan backend - HA integration: home-llm by acon96 (via HACS)
- API: OpenAI-compatible REST on port 8080, served by a systemd service
Note: The stack evolved during this process rather than being planned upfront. I started with the PrismML fork of llama.cpp specifically because the first model I tested — Ternary Bonsai 8B — uses a custom Q2_0 quantization format not yet available in mainline llama.cpp. It’s worth noting that Ternary Bonsai models are primarily designed to run natively on Apple devices (Mac, iPhone, iPad) via MLX, where they truly shine. I used a community GGUF variant to test on x86 hardware — a perfectly valid approach, but not PrismML’s primary target platform. Vulkan support was added later, as a second build step, after realizing that the Intel integrated GPU on the N150 could provide significant acceleration. By the time I settled on the final model choice, mainline llama.cpp would have worked equally well — but the PrismML fork with Vulkan remains in place and works perfectly, and will allow other ternary models from PrismML to be tested if needed. This evolution will make more sense as the model selection journey is described below.
Model Journey
Attempt 1: Ternary Bonsai 8B (Q2_0)
The Ternary Bonsai 8B was the original target. It uses 1.58-bit ternary weights ({-1, 0, +1}), resulting in a 2.03GB GGUF file — remarkably small for an 8B model.
Why it seemed promising:
- Tiny memory footprint (~677MB RAM at runtime)
- Designed specifically for CPU inference efficiency
- Apache 2.0 licensed
Results (CPU only, no Vulkan — PrismML fork uses custom Q2_0 format not yet in mainline llama.cpp):
| Metric | Value |
|---|---|
| Prompt processing | ~2.2 tok/s |
| Token generation | ~2.1 tok/s |
| Weather prompt (280 tokens) | ~150 seconds total |
| RAM usage | ~677MB |
Verdict: Technically functional, but 150 seconds for a weather summary is too slow even for async use. The ternary format also doesn’t benefit from Vulkan acceleration in its current implementation, which is a significant disadvantage on this hardware.
Output quality: Decent for simple queries, but struggled with multi-constraint prompts — hallucinated on some values, ignored formatting instructions, exceeded character limits.
Attempt 2: IBM Granite 3.1 3B a800m Instruct (Q6_K_L)
Discovered via a community benchmark post on Reddit showing impressive N150 performance with Vulkan. First model tested with the Vulkan-enabled llama.cpp build.
The a800m notation indicates 800M active parameters per forward pass out of 3.3B total — IBM’s efficiency-focused architecture designed for constrained environments.
This required a re-build of llama.cpp
Vulkan setup required:
sudo apt install -y libvulkan-dev vulkan-tools mesa-vulkan-drivers intel-media-va-driver glslc spirv-headers
sudo usermod -aG render,video $USER
cmake -B build -DGGML_VULKAN=ON
cmake --build build -j$(nproc)
Results (Vulkan, 33/33 layers offloaded to Intel iGPU):
| Metric | Value |
|---|---|
| Prompt processing | ~51 tok/s |
| Token generation | ~11.9 tok/s |
| Weather prompt | ~10 seconds total |
| RAM / Vulkan VRAM | 76MB CPU + 2604MB GPU |
A 23x improvement in prompt processing over Bonsai on CPU alone.
Output quality: Fast and capable for simple tasks, but unreliable on complex multi-constraint prompts. Examples of failures on the weather summary prompt:
- Invented windchill values not present in the data (“wind chill is a relaxing 0°C”)
- Described 20.9°C as “chilly”
- Ignored the “no forecasting” instruction
- Used km/h despite being explicitly told not to
Verdict: Excellent speed, insufficient instruction following for structured constrained output at 3B parameters. Useful for simpler tasks.
Final Choice: Qwen3 4B Instruct 2507 (Q4_K_M)
Alibaba’s Qwen3 4B in its July 2025 revision. Key differentiator: native support for /no_think suffix to disable chain-of-thought reasoning, significantly improving both speed and instruction following for structured tasks.
File: Qwen_Qwen3-4B-Instruct-2507-Q4_K_M.gguf (~2.5GB) from bartowski’s GGUF repo
Results (Vulkan, /no_think):
| Metric | Value |
|---|---|
| Prompt processing | 22–85 tok/s (scales with prompt length) |
| Token generation | ~5 tok/s |
| Weather prompt | ~10–15 seconds total |
| RAM / Vulkan VRAM | ~3.5GB total |
Note on PP variability: Prompt processing speed scales with prompt length because longer prompts fill llama.cpp’s batch windows more efficiently. A 280-token weather prompt achieves ~85 tok/s PP; a short 8-token query achieves ~17 tok/s PP. Token generation is consistent regardless of prompt length.
Output quality — weather summary comparison:
GPT-4o mini:
“It’s currently 16.4 degrees Celsius outside, which feels the same, with 53% humidity and a light wind coming from the southwest at 7.9 kilometers per hour.”
Qwen3 4B local:
“It’s 16.3 degrees, feels just like that, a bit humid with a gentle 14.8 kilometers per hour wind from the east and a UV index of 7.”
Both are correct, natural, and usable. The local model is slightly more concise and notably more consistent — lower temperature + deterministic hardware = less variance between runs. The data is not identical since they were run at different times.
Other tested prompts:
Device state summary:
“The living room lights are on at 40% brightness, the TV is on, the bedroom lights are off, and the front door is locked.”
Arrival announcement:
“Hey there! Sandro has just arrived home — welcome back, friend!
”
Just needs an additional instruction to not use emojis…
Anomaly alert:
“Just a quick note: the kitchen window has been open for three hours, and it’s 8°C outside. Might want to check and close it soon.”
Morning briefing:
“Good morning, it’s 7:30 am on a Tuesday. The weather is 12 degrees and cloudy. You have a dentist appointment at 10 am and a team meeting at 3 pm today.”
Performance Summary
| Model | Size | PP tok/s | TG tok/s | Weather prompt | Instruction following |
|---|---|---|---|---|---|
| Bonsai 8B Q2_0 | 2.0GB | ~2.2 | ~2.1 | ~150s | Poor on complex prompts |
| Granite 3.1 3B Q6_K_L | 2.7GB | ~51 | ~11.9 | ~10s | Poor on complex prompts |
| Qwen3 4B 2507 Q4_K_M | 2.5GB | ~17–85 | ~5 | ~10-15s | Good with /no_think |
Home Assistant Integration
Using the home-llm HACS integration (v0.4.7), configured for the llama.cpp backend.
Key lessons learned:
-
The
descriptionandrequiredfields in thestructureblock ofai_task.generate_dataare functionally important, not optional. Withoutdescription, structured output extraction returnsdata: {}. -
Add
/no_thinkto the system prompt in the AI Task Handler configuration to disable reasoning globally for all tasks:You are a task-specific assistant. Follow the task instructions and return the requested data. /no_think -
Some changes to the default parameters for the AI Task configuration:
Top P = 0.95
Top K = 0.20
Remote Request Timeout = 120 seconds (increased to account for slow model on this hardware) -
The AI task structure should look like this:
structure: sentence: description: A one-sentence conversational weather summary required: true selector: text: null
Example automation action:
- action: ai_task.generate_data
data:
task_name: Summarize outdoor weather conditions
instructions: >
Act as a local observer. Provide a one-sentence conversational summary
of the weather as if answering "How is it outside?" using this data:
Temperature: {{ states('sensor.outdoor_temperature') }}°C,
feels like {{ states('sensor.feels_like') }}°C,
wind {{ states('sensor.wind_speed') }} kilometers per hours
from {{ ['N','NE','E','SE','S','SW','W','NW','N']
[((states('sensor.wind_direction') | float / 45) | round(0)) % 8] }}
({{ states('sensor.wind_direction') }}°).
Be factual and conversational. No forecasting. Under 250 characters.
/no_think
entity_id: ai_task.your_local_llm_ai_task_entity_id
structure:
sentence:
description: A one-sentence conversational weather summary
required: true
selector:
text: null
response_variable: weather_summary
- action: input_text.set_value
target:
entity_id: input_text.weather_summary_cache
data:
value: "{{ weather_summary.data.sentence }}"
systemd Service
The llama-server runs as a systemd service, auto-starting on boot:
[Unit]
Description=Llama.cpp Server - Qwen3 4B Instruct 2507
After=network.target
[Service]
Type=simple
User=youruser
ExecStart=/home/youruser/llama.cpp/build/bin/llama-server \
-m /home/youruser/models/Qwen_Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
-c 4096 \
--host 0.0.0.0 \
--port 8080 \
-ngl 99
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
The -ngl 99 flag offloads all layers to the Intel iGPU via Vulkan.
Conclusions
This experiment demonstrates a genuine niche use case for low-powered mini PCs in a Home Assistant setup:
Where it works well:
- Async cached text generation where latency of 10–30 seconds is acceptable
- Short structured outputs (announcements, summaries, alerts)
- Offline operation — no cloud dependency, no API costs, no privacy concerns
- Always-on, low power consumption (~6–10W at idle)
Where it doesn’t replace cloud AI:
- Real-time conversation agents
- Complex multi-step reasoning
- Tasks requiring very strict instruction following with many simultaneous constraints
The key insight: The Intel integrated GPU on N150/N250 hardware has meaningful Vulkan compute capability that llama.cpp can leverage. Without Vulkan, this hardware is marginal. With Vulkan, it becomes genuinely useful for the right workload. The jump from ~2 tok/s (CPU only) to ~50+ tok/s prompt processing (Vulkan) is not incremental — it’s transformative for this use case.
A Mac Mini or Mac Studio with Apple Silicon would obviously be far more capable, but for a machine that was already earmarked for another purpose (Mosquitto MQTT broker), getting usable local LLM inference as a bonus is a pleasant surprise.
Setup tested on: Soyo M4 Plus 2 (Intel N150, 16GB DDR4), Debian 13 (Trixie), llama.cpp PrismML fork, home-llm v0.4.7, Home Assistant 2026.4.3