Something else I found out yesterday evening when testing a model in OpenRouter:
The ~7-8 s minimum delay (from end of speaking until start of response) we're used to getting from online models doesn't seem to be set in stone after all.
I am down to a reliable 2-3 seconds for simple questions without a tool call at the moment.
Questions that involve simple tool calls, like stopping music, can also end in a response in about 3 s (and the music being stopped even faster, as that happens in the tool call before the response).
More complex questions about entity history also result in a 4-5 s response time most of the time.
Only web searches take longer here, as they use a separate call with the Gemini web search tool described earlier in this thread.
This already feels snappy enough that you don’t get this “Ahhh, Alexa was really faster” feeling.
So, what made the difference?
It seems like there are quite a few providers on OpenRouter lately that are optimized for very fast responses instead of being as cheap as possible.
And you can even sort by the median time-to-first-token in the provider list for a specific model. Here’s an example for gpt-oss-120b, which I’m currently using for these tests:
In your profile settings on the OpenRouter website, you can choose to only use a list of allowed providers. Pick one (or more, to have a fallback in case one has downtime) with a very low latency here and you’re set.
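If you call OpenRouter directly (outside of the Home Assistant integration), you can also steer this per request instead of via the website settings. Here's a minimal sketch, assuming OpenRouter's provider routing field with `order`, `allow_fallbacks` and `sort` as I remember it from their docs; the provider names are just placeholders, so double-check against the current documentation before relying on it:

```python
# Minimal sketch: per-request provider routing on OpenRouter (assumptions noted above).
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Stop the music in the living room."}],
        "provider": {
            "order": ["SomeFastProvider", "AnotherProvider"],  # placeholder names
            "allow_fallbacks": True,  # fall back if the preferred provider is down
            "sort": "latency",        # prefer low latency over low price
        },
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```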
As I didn’t change anything about STT and TTS (in my case Nabu Casa cloud), they take the same time as before. The difference in model execution time is therefore even more impressive in the Assist chat window, where no STT or TTS runs at all.
It feels like the responses are almost instant sometimes.
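For anyone who wants to check the numbers on their own setup, here's a rough way I'd measure time-to-first-token outside of Home Assistant. It's just a sketch assuming the OpenAI Python SDK pointed at OpenRouter's API base; the question is arbitrary and the model slug is the one I'm testing with:

```python
# Rough sketch: measure time-to-first-token and total response time via streaming.
import os
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What is the capital of Australia?"}],
    stream=True,
)

first_token = None
for chunk in stream:
    if not chunk.choices:
        continue  # some chunks (e.g. usage) carry no choices
    delta = chunk.choices[0].delta.content
    if delta and first_token is None:
        first_token = time.perf_counter() - start
total = time.perf_counter() - start
print(f"time to first token: {first_token:.2f} s, full response: {total:.2f} s")
```

Keep in mind that this only measures the model call itself; the voice pipeline adds STT and TTS on top of that.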
