Running a Local LLM on an Intel N150 Mini PC for Home Assistant — A Practical Journey

Running a Local LLM on an Intel N150 Mini PC for Home Assistant — A Practical Journey

TL;DR: A cheap (<€200) Intel N150 mini PC with 16GB RAM can run a local LLM fast enough to be genuinely useful for async cached Home Assistant automations and announcements, when set up correctly with Vulkan GPU acceleration. This post documents the full journey — hardware, OS, model selection, performance benchmarks, and HA integration.


The Hardware

Soyo M4 Plus 2

  • Intel N150 (Alder Lake-N, 4 cores)
  • 16GB DDR4 (slow, shared with iGPU)
  • 512GB SSD
  • Intel UHD integrated graphics (Xe, ADL-N)

This is a fanless, passively cooled mini PC in the €150-€200 range. It was originally purchased to migrate my Mosquitto MQTT broker from an old Raspberry Pi, but repurposed here as a local LLM inference server on a temporary basis, for a POC.

Another identical N150 unit also runs as a separate dedicated machine in my HA stack, as a Whisper/Piper STT/TTS server for Home Assistant voice, using OpenVINO acceleration.


The Goal

Not to replace cloud AI. Not to run a powerful reasoning model. The specific goal was:

Run a local LLM that can generate natural language announcements and summaries for Home Assistant automations, where speed is not critical because results are cached asynchronously.

Examples of this use case:

  • Weather condition summaries spoken by a TTS system
  • Morning briefings compiled from HA sensor data
  • Presence arrival/departure announcements
  • Anomaly alerts (window open, temperature warnings, etc.)

This is a niche but real use case. If your automations pre-generate text and cache it in an input_text helper, a slow model is perfectly acceptable. Inspiration for this approach came from this blog post by Strato


The Stack

  • OS: Debian 13 (Trixie) — minimal install, SSH server + standard utilities only
  • Inference engine: llama.cpp (PrismML fork, prism branch) with Vulkan backend
  • HA integration: home-llm by acon96 (via HACS)
  • API: OpenAI-compatible REST on port 8080, served by a systemd service

Note: The stack evolved during this process rather than being planned upfront. I started with the PrismML fork of llama.cpp specifically because the first model I tested — Ternary Bonsai 8B — uses a custom Q2_0 quantization format not yet available in mainline llama.cpp. It’s worth noting that Ternary Bonsai models are primarily designed to run natively on Apple devices (Mac, iPhone, iPad) via MLX, where they truly shine. I used a community GGUF variant to test on x86 hardware — a perfectly valid approach, but not PrismML’s primary target platform. Vulkan support was added later, as a second build step, after realizing that the Intel integrated GPU on the N150 could provide significant acceleration. By the time I settled on the final model choice, mainline llama.cpp would have worked equally well — but the PrismML fork with Vulkan remains in place and works perfectly, and will allow other ternary models from PrismML to be tested if needed. This evolution will make more sense as the model selection journey is described below.


Model Journey

Attempt 1: Ternary Bonsai 8B (Q2_0)

The Ternary Bonsai 8B was the original target. It uses 1.58-bit ternary weights ({-1, 0, +1}), resulting in a 2.03GB GGUF file — remarkably small for an 8B model.

Why it seemed promising:

  • Tiny memory footprint (~677MB RAM at runtime)
  • Designed specifically for CPU inference efficiency
  • Apache 2.0 licensed

Results (CPU only, no Vulkan — PrismML fork uses custom Q2_0 format not yet in mainline llama.cpp):

Metric Value
Prompt processing ~2.2 tok/s
Token generation ~2.1 tok/s
Weather prompt (280 tokens) ~150 seconds total
RAM usage ~677MB

Verdict: Technically functional, but 150 seconds for a weather summary is too slow even for async use. The ternary format also doesn’t benefit from Vulkan acceleration in its current implementation, which is a significant disadvantage on this hardware.

Output quality: Decent for simple queries, but struggled with multi-constraint prompts — hallucinated on some values, ignored formatting instructions, exceeded character limits.


Attempt 2: IBM Granite 3.1 3B a800m Instruct (Q6_K_L)

Discovered via a community benchmark post on Reddit showing impressive N150 performance with Vulkan. First model tested with the Vulkan-enabled llama.cpp build.

The a800m notation indicates 800M active parameters per forward pass out of 3.3B total — IBM’s efficiency-focused architecture designed for constrained environments.

This required a re-build of llama.cpp

Vulkan setup required:

sudo apt install -y libvulkan-dev vulkan-tools mesa-vulkan-drivers intel-media-va-driver glslc spirv-headers
sudo usermod -aG render,video $USER
cmake -B build -DGGML_VULKAN=ON
cmake --build build -j$(nproc)

Results (Vulkan, 33/33 layers offloaded to Intel iGPU):

Metric Value
Prompt processing ~51 tok/s
Token generation ~11.9 tok/s
Weather prompt ~10 seconds total
RAM / Vulkan VRAM 76MB CPU + 2604MB GPU

A 23x improvement in prompt processing over Bonsai on CPU alone.

Output quality: Fast and capable for simple tasks, but unreliable on complex multi-constraint prompts. Examples of failures on the weather summary prompt:

  • Invented windchill values not present in the data (“wind chill is a relaxing 0°C”)
  • Described 20.9°C as “chilly”
  • Ignored the “no forecasting” instruction
  • Used km/h despite being explicitly told not to

Verdict: Excellent speed, insufficient instruction following for structured constrained output at 3B parameters. Useful for simpler tasks.


Final Choice: Qwen3 4B Instruct 2507 (Q4_K_M)

Alibaba’s Qwen3 4B in its July 2025 revision. Key differentiator: native support for /no_think suffix to disable chain-of-thought reasoning, significantly improving both speed and instruction following for structured tasks.

File: Qwen_Qwen3-4B-Instruct-2507-Q4_K_M.gguf (~2.5GB) from bartowski’s GGUF repo

Results (Vulkan, /no_think):

Metric Value
Prompt processing 22–85 tok/s (scales with prompt length)
Token generation ~5 tok/s
Weather prompt ~10–15 seconds total
RAM / Vulkan VRAM ~3.5GB total

Note on PP variability: Prompt processing speed scales with prompt length because longer prompts fill llama.cpp’s batch windows more efficiently. A 280-token weather prompt achieves ~85 tok/s PP; a short 8-token query achieves ~17 tok/s PP. Token generation is consistent regardless of prompt length.

Output quality — weather summary comparison:

GPT-4o mini:

“It’s currently 16.4 degrees Celsius outside, which feels the same, with 53% humidity and a light wind coming from the southwest at 7.9 kilometers per hour.”

Qwen3 4B local:

“It’s 16.3 degrees, feels just like that, a bit humid with a gentle 14.8 kilometers per hour wind from the east and a UV index of 7.”

Both are correct, natural, and usable. The local model is slightly more concise and notably more consistent — lower temperature + deterministic hardware = less variance between runs. The data is not identical since they were run at different times.

Other tested prompts:

Device state summary:

“The living room lights are on at 40% brightness, the TV is on, the bedroom lights are off, and the front door is locked.”

Arrival announcement:

“Hey there! Sandro has just arrived home — welcome back, friend! :glowing_star:

Just needs an additional instruction to not use emojis…

Anomaly alert:

“Just a quick note: the kitchen window has been open for three hours, and it’s 8°C outside. Might want to check and close it soon.”

Morning briefing:

“Good morning, it’s 7:30 am on a Tuesday. The weather is 12 degrees and cloudy. You have a dentist appointment at 10 am and a team meeting at 3 pm today.”


Performance Summary

Model Size PP tok/s TG tok/s Weather prompt Instruction following
Bonsai 8B Q2_0 2.0GB ~2.2 ~2.1 ~150s Poor on complex prompts
Granite 3.1 3B Q6_K_L 2.7GB ~51 ~11.9 ~10s Poor on complex prompts
Qwen3 4B 2507 Q4_K_M 2.5GB ~17–85 ~5 ~10-15s Good with /no_think

Home Assistant Integration

Using the home-llm HACS integration (v0.4.7), configured for the llama.cpp backend.

Key lessons learned:

  1. The description and required fields in the structure block of ai_task.generate_data are functionally important, not optional. Without description, structured output extraction returns data: {}.

  2. Add /no_think to the system prompt in the AI Task Handler configuration to disable reasoning globally for all tasks:

    You are a task-specific assistant. Follow the task instructions and return the requested data. /no_think
    
  3. Some changes to the default parameters for the AI Task configuration:
    Top P = 0.95
    Top K = 0.20
    Remote Request Timeout = 120 seconds (increased to account for slow model on this hardware)

  4. The AI task structure should look like this:

    structure:
      sentence:
        description: A one-sentence conversational weather summary
        required: true
        selector:
          text: null
    

Example automation action:

- action: ai_task.generate_data
  data:
    task_name: Summarize outdoor weather conditions
    instructions: >
      Act as a local observer. Provide a one-sentence conversational summary
      of the weather as if answering "How is it outside?" using this data:
      Temperature: {{ states('sensor.outdoor_temperature') }}°C,
      feels like {{ states('sensor.feels_like') }}°C,
      wind {{ states('sensor.wind_speed') }} kilometers per hours
      from {{ ['N','NE','E','SE','S','SW','W','NW','N'] 
      [((states('sensor.wind_direction') | float / 45) | round(0)) % 8] }} 
      ({{ states('sensor.wind_direction') }}°).
      Be factual and conversational. No forecasting. Under 250 characters.
      /no_think
    entity_id: ai_task.your_local_llm_ai_task_entity_id
    structure:
      sentence:
        description: A one-sentence conversational weather summary
        required: true
        selector:
          text: null
  response_variable: weather_summary
- action: input_text.set_value
  target:
    entity_id: input_text.weather_summary_cache
  data:
    value: "{{ weather_summary.data.sentence }}"

systemd Service

The llama-server runs as a systemd service, auto-starting on boot:

[Unit]
Description=Llama.cpp Server - Qwen3 4B Instruct 2507
After=network.target

[Service]
Type=simple
User=youruser
ExecStart=/home/youruser/llama.cpp/build/bin/llama-server \
  -m /home/youruser/models/Qwen_Qwen3-4B-Instruct-2507-Q4_K_M.gguf \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

The -ngl 99 flag offloads all layers to the Intel iGPU via Vulkan.


Conclusions

This experiment demonstrates a genuine niche use case for low-powered mini PCs in a Home Assistant setup:

Where it works well:

  • Async cached text generation where latency of 10–30 seconds is acceptable
  • Short structured outputs (announcements, summaries, alerts)
  • Offline operation — no cloud dependency, no API costs, no privacy concerns
  • Always-on, low power consumption (~6–10W at idle)

Where it doesn’t replace cloud AI:

  • Real-time conversation agents
  • Complex multi-step reasoning
  • Tasks requiring very strict instruction following with many simultaneous constraints

The key insight: The Intel integrated GPU on N150/N250 hardware has meaningful Vulkan compute capability that llama.cpp can leverage. Without Vulkan, this hardware is marginal. With Vulkan, it becomes genuinely useful for the right workload. The jump from ~2 tok/s (CPU only) to ~50+ tok/s prompt processing (Vulkan) is not incremental — it’s transformative for this use case.

A Mac Mini or Mac Studio with Apple Silicon would obviously be far more capable, but for a machine that was already earmarked for another purpose (Mosquitto MQTT broker), getting usable local LLM inference as a bonus is a pleasant surprise.


Setup tested on: Soyo M4 Plus 2 (Intel N150, 16GB DDR4), Debian 13 (Trixie), llama.cpp PrismML fork, home-llm v0.4.7, Home Assistant 2026.4.3

8 Likes

Super cool project! I’ve actually got a mini PC doing something similar (not LLM, but running some HA stuff + energy monitoring scripts). Interesting to see how much you can squeeze out of these cheap boxes, especially with GPU acceleration. I’m curious, did you notice any big difference in response speed after switching to Vulkan? Also, how’s the power draw on the N150? I keep an eye on mine since it’s running off my solar/battery setup and every watt counts.

1 Like

Thanks! I was really skeptical at first, very surprised it actually worked in the end.

I am amazed at what these tiny machines are capable of. My other one, which I dedicated exclusively to Whipr/Piper runs quite well for TTS using OpenVino,

About the speed difference, I had only tested the Bonsai without Vulkan and that was very slow (as mentioned, it seems to be very geared towards MLX and perhaps the GGUF variant does not do it justice), and the following models with Vulkan, but the speed increase was very noticeable, especially the prompt processing. I included a table with some basic data from my limited table

It is still slow for a chat interaction, granted, but more than good enough for async work where a few seconds more don’t really matter.

About power draw, have not measured yet, but I have a few Tasmota plugs lying around, happy to connect one and share some readings with you

About 35w. I have one as well. It also has a Nvidia 5070ti lashed to it (that’s on a separate EGPU chassis with its own PS. And doesn’t count in the 35 :rofl:)

…but my Intel side is setup very similar to this and runs my slow inference pipeline for Friday. My Intel XPU is primarily voice and pre-inference rn though.

A setup like this is indeed slower, but great for summarization work… don’t knock how much work you can do in HA with summarization jobs. It’s the entire basis for ZenOS (what Friday runs on)

1 Like

Was wondering if you’d chime in. I am slowly reading through your Friday’s Party: Creating a Private, Agentic AI using Voice Assistant tools topic… fascinating work, very interesting indeed.

Thanks!

The intel side is a NUC14ai. It is indeed great for the work you describe.

In fact I just overhauled the camera handlers last night.

Where pre summary comes in is there’s a lot of junk our homes do that you kind of have to interpret. If you can do that interpretation ahead of time when the system produces it and park it in case you need it in the near future… When it’s called it happens instantly and seems like magic… It also can be used as context for other events.

The back door lock unlocked is a much more urgent signal at 9p with someone recently seen in the back yard. (cached camera summary + lock signal + scheduled change == very interesting status…)

1 Like

I couldn’t agree more. I’m already going down the rabbit hole

I am presently running HA OS on one N150, Whisper/Piper on another and now this slow local LLM on the 3rd one… these have fallback to OpenAI if needed, but are working well so far.

I already have a considerable power draw from the old machine that runs all my cameras with BlueIris and CodeProject AI, so I wanted to use smaller hardware.

I like the idea of delegating tasks/concerns to different machines so they work with more headroom and, despite the added complexity, are less critical if one of them drops out

The latest stuff I’m putting up this month related to alert manager taskmaster and the camera suite are all about a sync work. I’m working on a post detailing how the alert pipeline works. - almost ALL of it could / should run on lower power or longer run summary pipes.

I currently have

All of the KFCs summary according to thier schedule (currently ~20 with ~40 triggers) on average around 10-15 of those fire in any given hour.

Roll smaller summary into larger summary. For instance.

A single camera summarizes itself on event or schedule. After they update, an event fires and the the camera component summarrizes the resulting summaries. Camera summary triggers the security component. The security triggers alert… . and so on and so on. Live agent nexer sees raw data unless it asks.

Result hi Nathan. Uh lock the door please.

2 Likes

It may take a while but I think I will be re-reading the whole topic you started from the top. Thanks a lot for sharing all this work!

Sooner or later I will procure a more capable machine for local inference, but I want to get the slower offline tasks sorted first and use the better machine later for more important items, in the same spirit of a tiered architecture

I have correlated events in mind for my cameras, for example, to work after CodeProject AI has confirmed alerts, and to use in combination with instructions that indicate where each camera is, what its field of view is, etc. and for these I would want as near-real-time vision capabilities as possible.

Hint. Put the first post of the Party and the link to my repo in Notebook Lm, then ask it to collect all the technical articles for your source data. Then build a podcast. Explainer style, long.

Yes it totally works. I do it after every other release to ensure the docs make sense. (and I haven’t run into a ditch)

2 Likes

Ohhh, had not thought of that at all!!! Will do, thanks a lot!!!

1 Like

Update, an interesting point.

I ran my periodic local weather summary, for caching and later TTS use, through OpenAI and also through the local Qwen LLM. I used the same instructions for both.

The instructions contain, amongst other things:

Be conversational and natural but factual. Avoid poetic or flowery language. Always mention the temperature and how it feels — include feels like if it differs meaningfully from actual, or windchill if applicable. Write ‘kilometers per hour’ not ‘km/h’. Avoid other technical abbreviations or symbols. Keep the response under 250 characters.

Only describe current conditions — never forecast, predict, or imply what the weather will or will not do. Do not mention rain unless the rain rate is greater than zero. Omit any condition that is not noteworthy or cannot be confirmed from the data provided.

OpenAI using GPT-4o mini
It’s currently 10.7 degrees Celsius outside, and it feels the same, with a light breeze from the south at 6.4 kilometers per hour and high humidity at 97 percent.

Local LLM using Qwen3 4B Instruct (Q4_K_M)

It’s 10.8 degrees, feels exactly like that, humid with a 9.7 km/h south wind, and it’s pretty cloudy with a steady rain rate of 9.0 mm per hour.

OpenAI decided 9mm/hr rain was not worth mentioning… Qwen did, but then decided to use km/h instead of kilometers per hour…

Clearly neither is perfect and use of an abbreviation is not the best for a TTS announcement, but the result from Qwen seems much more consistent with the data and the instructions provided.

The difference in wind speed is only because they were run a few seconds apart.

1 Like

Did you try other models? Gemma?

Not yet… I am still trying to figure out how much I can push this tiny machine, and what aysnc cached tasks I can delegate to it. Of the very few models I tried this Qwen 3 4b seemed to work best. I was hoping for a Qwen 3.5 4b instruct version of a similar file size, but I don’t see one yet. It’s easy enough to try Gemma though, I will have a look :slight_smile:

OK, so I tested google_gemma-4-E4B-it-Q4_K_M.

With reasoning off, it is marginally slower than the Qwen model I was using.

With reasoning on, of course, it is very slow.

Some details:

Prompt:
Act as a local observer. Provide a one-sentence conversational summary of the weather as if answering “How is it outside?” using this data:

Time/Sun: 12:00 (Daylight). Next Sunset: 19:16. Moon: Waxing Gibbous
Conditions: 10.9°C (Feels like 10.9°C). 98% Humidity. Wind: 7.2 km/h from
SE (128°). Wind Chill: n/a. Sky/Rain: 83.11 W/m2, UV Index: 0. Rain rate:
0.0 mm/hr. Pressure: 1021.9 hPa.

Be conversational and natural but factual. Avoid poetic or flowery language. Always mention the temperature and how it feels — include feels like if it differs meaningfully from actual, or windchill if applicable. Write ‘kilometers per hour’ not ‘km/h’. Avoid other technical abbreviations or symbols. Keep the response under 250 characters. Only describe current conditions — never forecast, predict, or imply what the weather will or will not do. Do not mention rain unless the rain rate is greater than zero. Omit any condition that is not noteworthy or cannot be confirmed from the data provided.

Reasoning ON: took over 4 minutes
It’s 10.9 degrees Celsius and feels just as cool, with a gentle breeze blowing from the southeast at 7.2 kilometers per hour.

Stats:
PP: 317 tokens | 9.8s | 32.39 tokens/s
TG: 824 tokens | 3min 54s | 3.51 t/s

Reasoning OFF: took around 21 seconds
It’s about ten point nine degrees Celsius right now, feeling like ten point nine, and it’s quite humid with a light breeze blowing from the southeast at seven point two kilometers per hour.

Stats:
PP: 317 tokens | 9.8s | 32.44 tokens/s
TG: 41 tokens | 11s | 3.70 t/s

Comparatively, the same prompt on Qwen_Qwen3-VL-4B-Instruct-Q4_K_M:

No specific think or reasoning tag set: took around 18 seconds
t’s 10.9 degrees Celsius, feels just like that, with 98% humidity and a gentle 7.2 kilometers per hour breeze from the southeast. The sun’s still up, and the sky’s bright but not too warm.

Stats:
PP: 282 tokens | 6.9s | 41.10 tokens/s
TG: 53 tokens | 11s | 4.69 t/s

1 Like

Follow-up: Three-Model Benchmark and Dual-Model Routing on Intel N150

Following the discussion and continuing to push how far this tiny machine can go, I spent a little time benchmarking three models and then thought I’d take things a step further with a dual-model routing setup. All tests run on the same N150 hardware with Vulkan acceleration.


Models Tested

Three architecturally distinct approaches, all at Q4_K_M quantization for a fair comparison:

  • LiquidAI LFM2-8B-A1B (5.05GB) — hybrid conv+attention Mixture-of-Experts, only ~1.5B active parameters per token despite 8.3B total
  • Google Gemma4-E4B (Q4_K_M) — MoE transformer with configurable thinking mode, run with --reasoning off
  • Qwen3-4B-Instruct-2507 (2.5GB) — dense transformer, run with /no_think

Part 1: Three-Model Benchmark

Six prompts representative of typical Home Assistant automation tasks, run sequentially against each model with a warmup request before timing begins.

Speed Results

Model Avg PP tok/s Avg TG tok/s
LFM2-8B 23.4 11.2
Qwen3-4B 32.8 4.7
Gemma4-E4B 23.3 3.9

LFM2’s token generation speed is remarkable — nearly 3x faster than Qwen3 and Gemma4. This is the sparse MoE architecture in action: despite being an 8B parameter model, only ~1.5B parameters activate per token, giving generation speed comparable to a much smaller model.

Quality Results

Prompt LFM2-8B Gemma4-E4B Qwen3-4B
Weather summary :white_check_mark: natural, correct units :warning: lost detail :cross_mark: added unrequested UV/sunset
Device state :white_check_mark: :white_check_mark: :white_check_mark:
Arrival announcement :warning: grammar quirk :white_check_mark: perfect :white_check_mark: emoji (TTS issue)
Anomaly alert :white_check_mark: :white_check_mark: best phrasing :white_check_mark:
Morning briefing :warning: editorialised :white_check_mark: clean :warning: unrequested advice
Temperature anomaly :warning: missed one outlier :white_check_mark: found both :warning: speculated on cause

Full Responses

Weather summary (20.9°C, 41% humidity, 5.5 km/h south wind, no rain):

  • LFM2: “It’s a mild, quiet evening with a steady 20.9°C, soft south wind at 5.5 kilometers per hour, and a clear sky—no rain in sight.”
  • Gemma4: “It’s about 21 degrees right now, feeling pretty mild with a light breeze coming from the south.”
  • Qwen3: “It’s 20.9°C outside, feels pretty warm, 41% humidity, light breeze at 5.5 kilometers per hour from the south. UV index is low, no wind chill. Sunset is in 11 minutes.”

Arrival announcement:

  • LFM2: “Sandro’s home has just arrived—welcome back!” (grammar quirk)
  • Gemma4: “Sandro is finally home!” (cleanest for TTS)
  • Qwen3: “Sandro has arrived home! :house_with_garden::blush:(emoji problematic for some TTS engines)

Temperature anomaly (21 sensors, garage at 30.6°C and walk-in closet at 26.2°C are outliers):

  • LFM2: “The Garage stands out with a significantly elevated temperature of 30.6°C, far above the rest of the rooms.” (missed walk-in closet)
  • Gemma4: “The Garage (30.6°C) and Walk-In Closet (26.2°C) are significantly warmer than the other rooms.” (only model to find both)
  • Qwen3: “The garage at 30.6°C is significantly higher than the other rooms, indicating an unusual temperature spike likely due to external heat exposure.” (speculated on cause without data)

Morning briefing:

  • LFM2: “Good morning, it’s Tuesday at 7:30. The weather is 12 degrees and cloudy—perfect for a quiet start. Don’t forget your dentist appointment at 10am and the team meeting at 3pm. Let’s make it a productive day.” (editorialised)
  • Gemma4: “Good morning, it’s 7:30 on a cloudy 12-degree Tuesday. Remember your dentist appointment at ten and the team meeting scheduled for three this afternoon. Have a productive day.” (clean, times as words — better for TTS)
  • Qwen3: “Good morning, it’s 7:30am on a cloudy Tuesday with a chilly 12 degrees outside. Dentist at 10am, then team meeting at 3pm—dress warm and be ready for the day.” (unrequested advice)

Verdict

No single model wins everything. Each has a clear niche:

  • LFM2-8B — speed king on token generation (11.2 tok/s). Best for factual, structured, data-heavy tasks. Occasional grammar quirks on creative tasks.
  • Gemma4-E4B — best overall quality and instruction following. Most natural phrasing for TTS. Slowest TG but worth it for announcement-style tasks. Only model to correctly identify both temperature anomalies.
  • Qwen3-4B — fastest prompt processing (32.8 tok/s), solid all-rounder, but tends to add unrequested information.

Part 2: Dual-Model Routing

The natural conclusion from the benchmark is task routing — use LFM2 for factual/analytical tasks, Gemma4 for anything spoken aloud. Since the N150 has 16GB RAM and these are one-off stateless requests (no conversation history, minimal KV cache growth), both models fit simultaneously in memory.

Memory with both models loaded:

Mem:   15Gi total   10Gi used   236Mi free   5.3Gi available
Swap:  15Gi total   149Mi used

10GB used with both models loaded and the OS running — 5.3GB still available. The 149MB of swap is harmless page shuffling, not memory pressure.

Both servers run as background processes, each on their own port:

  • LFM2 → port 8080
  • Gemma4 → port 8081

Routing Rules

Task type Model Reasoning
Weather summaries LFM2 Factual, structured sensor data
Device state summaries LFM2 Factual, no creative interpretation needed
Temperature/sensor anomaly detection LFM2 Data analysis, speed matters across many sensors
Arrival/departure announcements Gemma4 Spoken, quality and natural phrasing critical
Anomaly alerts Gemma4 Spoken alert, natural phrasing matters
Morning/evening briefings Gemma4 Spoken, benefits from Gemma’s natural sentence rhythm

Routing Benchmark Results

Each prompt sent to its designated model, sequentially (realistic for async cached HA automations):

Prompt Model PP tok/s TG tok/s Total (s) Response
weather_summary LFM2 40.27 11.12 8.8 It’s a mild, quiet evening with a steady 20.9°C, 41% humidity, and a gentle south breeze at 5.5 kilometers per hour—no rain in sight, and the UV index is low.
device_state LFM2 16.22 11.27 6.3 Living room lights are on at 40% brightness, the TV is on, bedroom lights are off, the front door is locked, and the kitchen window is open.
temperature_anomaly LFM2 28.88 11.40 5.9 The Garage stands out with a significantly elevated temperature of 30.6°C, far exceeding the rest of the rooms.
arrival_announcement Gemma4 15.96 4.39 4.0 Sandro is finally home!
anomaly_alert Gemma4 21.88 3.90 7.4 The kitchen window has been open for three hours and it’s only eight degrees outside.
morning_briefing Gemma4 19.87 3.75 15.4 Good morning, it’s 7:30 on a cloudy 12-degree Tuesday. Remember your dentist appointment at ten and the team meeting scheduled for three this afternoon. Have a productive day.

Neither model showed any performance degradation from sharing memory with the other. LFM2 maintained its full 11+ tok/s TG throughout. Gemma4 similarly unaffected.

Every response in this run is production-ready — no hallucinations, no grammar issues, no unrequested information. The routing logic naturally plays to each model’s strengths.

Implementation in Home Assistant

In practice, routing is simply a matter of pointing each automation’s AI task at the appropriate endpoint:

  • Factual/analytical automations → http://[llama.cpp IP address]:8080/v1
  • Announcement/alert automations → http://[llama.cpp IP address]:8081/v1

Using the home-llm HACS integration, this means configuring two separate service instances — one per port — and selecting the appropriate one in each AI task.


Summary

Two small LLMs, two ports, one €150 mini PC — handling six different Home Assistant automation task types with response times between 4 and 15 seconds, all locally, all private, with no API costs.

The N150 with Vulkan acceleration continues to surprise. It was never going to replace a Mac Studio for real-time conversation, but for the async cached pattern — generate when the data is fresh, store it, use it as context when needed — it is genuinely capable and now demonstrably so across a structured benchmark.


Hardware: Soyo M4 Plus 2 — Intel N150, 16GB DDR4, 512GB SSD
OS: Debian 13 (Trixie), minimal install
Inference: llama.cpp (PrismML fork, build 8846) with Vulkan backend
HA integration: home-llm v0.4.7 by acon96 (HACS)

3 Likes

Update: management Script, and switch to mainline llama.cpp

A few more developments since the last update.

systemd services and management script

Both models now start automatically on boot via dedicated systemd service files, with a 25 second stagger between them to avoid memory pressure during loading. Total time from boot to both models ready is around 90 seconds.

For interactive management I wrote a small bash TUI — invoked via SSH — that shows live status of both ports, allows starting, stopping and swapping models (including vision variants), and handles log cleanup. Nothing fancy but very practical for a headless machine.

Switch to mainline llama.cpp

I had been running the PrismML fork of llama.cpp, which was originally needed for the Ternary Bonsai model’s custom Q2_0 quantization format. Since I moved on from that model, there was no longer any reason to stay on the fork. Switching to mainline was straightforward — clone, build with -DGGML_VULKAN=ON, done.

The performance improvement was consistent and measurable:

Model PP tok/s (PrismML) PP tok/s (Mainline) TG tok/s (PrismML) TG tok/s (Mainline)
LFM2-8B ~28 avg ~29 avg 11.3 avg 12.7 avg
Gemma4-E4B ~20 avg ~20 avg 3.9 avg 4.6 avg

PP improvements are modest (~1-3%) but TG improved by ~12-15% across both models — a meaningful gain for generation-heavy tasks, with no regressions anywhere.

1 Like