Fast (low latency) cloud models with OpenAI-compatible endpoints: which do you use and recommend?

I wrote in another thread that I noticed HUGE differences while testing different models and providers with the OpenRouter integration.
Especially compared to the popular OpenAI models like gpt-??-mini.

My main model so far was gpt-4.1-mini, which has now been replaced with gpt-oss-120b hosted by Groq (a cloud model hosting platform, not to be confused with Grok, the LLM from xAI).

They are not the cheapest provider, but they had the lowest latency shown on OpenRouter.
Their gpt-oss-120b pricing is still less than half the price of gpt-4.1-mini from OpenAI.
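If you want to try this outside of HA first: "OpenAI-compatible endpoint" means the standard `openai` Python client works by just swapping `base_url`. Here's a minimal timing sketch; the endpoint URL and model id are assumptions based on Groq's public docs, so check them against your account:

```python
# Minimal latency-timing sketch for any OpenAI-compatible endpoint.
# base_url and model id below are assumptions from Groq's docs;
# substitute your own provider and API key.
import time
from typing import Any, Callable, Tuple

def timed(call: Callable[[], Any]) -> Tuple[Any, float]:
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start

def make_client(api_key: str):
    # pip install openai -- imported lazily so the timing helper
    # above works even without the package installed
    from openai import OpenAI
    return OpenAI(base_url="https://api.groq.com/openai/v1", api_key=api_key)

# Usage (needs a real key):
#   client = make_client("YOUR_GROQ_API_KEY")
#   reply, seconds = timed(lambda: client.chat.completions.create(
#       model="openai/gpt-oss-120b",
#       messages=[{"role": "user",
#                  "content": "What's the height of the Eiffel Tower?"}],
#   ))
#   print(f"{seconds:.2f}s: {reply.choices[0].message.content}")
```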

Sadly, OpenRouter (or the HA integration?) seems to have a lot of problems.
Regardless of the model or provider, I often get API errors or bad response errors in the conversation.
It looks like this isn’t just me; there are a lot of other reports here in the forums and on GitHub.

So I found two integrations that can be used:

How fast is it?

First, I tried a simple question where the answer was in the training data, so the models didn’t need any tool calls: What’s the height of the Eiffel Tower?

  • gpt-4.1-mini feels snappy here, with about 2 seconds reaction time.
  • gpt-oss-120b on Groq took under 0.5 seconds in comparison.

That’s not the real reason to switch, right?

But let’s take a look at a question with tool calls.
The next try was What will the outdoor temperature be tomorrow between 8am and 9am?, which uses the Weather LLM script provided by TheFes.

  • gpt-4.1-mini needed 7 seconds to get the answer and didn’t feel snappy at all anymore.
  • gpt-oss-120b on Groq took less than 2 seconds in comparison.

Fun part: gpt-oss-120b even used one more tool call than gpt-4.1-mini (to make sure which date is tomorrow) and was still way faster.

And lastly, a more complex example:
How much solar energy did we export this September compared to last September? Just give me the kWh for both months and calculate the difference. Also tell me the mean outdoor temperature for the garden thermometer for both months.

At least both models were successful and replied with the same, correct values.

  • gpt-4.1-mini took about 35 seconds to complete.
  • gpt-oss-120b needed only about 7.5 seconds in comparison.
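Putting the three tests together, the speedup is fairly consistent. A quick back-of-the-envelope check, using the rough timings from above:

```python
# Speedup of gpt-oss-120b on Groq vs gpt-4.1-mini,
# using the approximate timings from the three tests above (seconds)
tests = {
    "no tool calls":  (2.0, 0.5),
    "weather script": (7.0, 2.0),
    "energy + temps": (35.0, 7.5),
}
for name, (mini, oss) in tests.items():
    print(f"{name}: {mini / oss:.1f}x faster")
# works out to roughly 3.5x to 4.7x across all three tests
```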

A few words about the hass_local_openai_llm integration linked above:
It seems to be kept up to date with the latest HA development.
You can create separate entries for assistants or AI tasks.
It also allows you to modify the prompt, and it supports STT and TTS streaming.

Feel free to share your own experience in this thread. :slightly_smiling_face:


Thanks for the tip. I managed to install it and it seems to work. So far it indeed feels quite a bit faster than the OpenAI integration. How does the streaming TTS work? Does that only work if you use local TTS, or can it stream to Google Cloud TTS as well? I can’t seem to find any docs on how to enable it.


As I use Nabu Casa cloud TTS (which has supported streaming for quite some time), I never investigated which other TTS integrations support it.

It’s explained in the dev docs, and most likely many integrations have been updated to support it.

You might have to take a look at the docs of whatever you use to see if it’s mentioned there.

This is always streaming emulation, as TTS models must receive sufficient text to generate speech correctly. Splitting is primarily done at specific punctuation marks. Most Wyoming servers and TTS integrations handle this internally (including Nabu Casa cloud TTS, which uses Azure).
Some cloud services handle this on their end; you just need to send them a stream of text.
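To illustrate that punctuation-based splitting (a sketch, not any integration’s actual code): buffer the incoming LLM tokens and flush a chunk to TTS whenever a sentence-ending mark appears. The punctuation set and minimum chunk length are my own illustrative choices; real implementations differ.

```python
# Sketch of streaming "emulation": buffer LLM tokens, flush a speakable
# chunk to TTS at sentence-ending punctuation. Punctuation set and
# min_len are illustrative assumptions, not any integration's values.
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?", ":", ";")

def chunk_for_tts(tokens: Iterable[str], min_len: int = 10) -> Iterator[str]:
    """Yield speakable chunks from a token stream at punctuation boundaries."""
    buf = ""
    for tok in tokens:
        buf += tok
        # flush once we have a complete-enough sentence
        if buf.rstrip().endswith(SENTENCE_END) and len(buf) >= min_len:
            yield buf.strip()
            buf = ""
    if buf.strip():  # flush any trailing partial sentence
        yield buf.strip()

# Example: tokens arrive a few characters at a time
tokens = ["The Eiffel", " Tower is ", "about 330 m", " tall.",
          " It was", " built in 1889."]
print(list(chunk_for_tts(tokens)))
# → ['The Eiffel Tower is about 330 m tall.', 'It was built in 1889.']
```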

Google Cloud is the latter case; StreamingSynthesize can be implemented via gRPC. They handle all text segmentation and normalization. But to implement this, someone needs to be willing to take on the challenge.

As for the TTS option in the Gemini integration, it’s not particularly suitable for regular use (especially if you modify it into a streaming version), as the generation speed isn’t the fastest and the request limits at standard tiers are quite low.


Got it thanks! I’ll keep playing with it. I doubt though that the TTS streaming is the overall biggest latency contributor :slight_smile:

If you look at the debug view of your voice assistant pipeline, you can see how long STT, LLM and TTS took.

Normally the LLM part is by far the largest (‘thinking’, calling tools, maybe repeating this a few times to reason about the response and call more tools, then generating the final response). So yes.
This is why choosing a fast host/model here brings the largest benefit.

TTS streaming is only noticeable if you ask the LLM something that produces a long response. In that case it can make quite a difference (e.g. a summary of today’s news). But for most smart home replies it’s not that important.

Streaming will affect all responses longer than 60 characters. Therefore, using a TTS with streaming support is highly recommended.

And when assessing the latency of the NLU stage, it is important to take into account that speech synthesis and playback can already happen during this stage. This is especially true if there are multi-turn tool calls. Long answers (where token and audio generation happen almost in parallel) have already been mentioned.
