One of the current downsides to running Ollama as your conversation agent is the delay when a conversation starts, while the LLM evaluates the system prompt. This is particularly noticeable with larger models, large contexts, or older GPUs. In my environment, starting a conversation takes about 25 seconds, but continuing it takes only 5.
What if we could mitigate this by preloading conversations?
Here’s how it would work:
We start by defining a number of “slots” that are available to work with. This maps to Ollama’s request concurrency. I don’t see a way to read this via the API at the moment, but we could make it configurable and default to one.
Each slot would have its state (active or not) tracked by the integration.
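Roughly, the bookkeeping could look something like this (all names here are made up for illustration, not the integration's actual API):

```python
from dataclasses import dataclass

DEFAULT_SLOT_COUNT = 1  # default to one slot, matching a conservative concurrency setting


@dataclass
class PreloadSlot:
    """Integration-side state for one Ollama request slot."""
    slot_id: int
    active: bool = False          # True while a user conversation owns the slot
    last_refreshed: float = 0.0   # monotonic timestamp of the last background preload


def create_slots(count: int = DEFAULT_SLOT_COUNT) -> list[PreloadSlot]:
    """Build the configured number of slots, defaulting to one."""
    return [PreloadSlot(slot_id=i) for i in range(count)]
```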
Slots that don’t have an active conversation would be refreshed in the background on a configurable cadence, perhaps every minute by default (a rough sketch follows the side note below).
Side Note: We might want to provide an option to refresh active conversations as well. This would allow an ongoing conversation to read updated sensor values.
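The refresh could be a simple background task that re-sends the current system prompt for idle slots (and optionally active ones, per the side note) so Ollama's prompt cache stays warm. Here's a rough sketch, continuing the PreloadSlot example above and assuming the ollama Python client's AsyncClient.chat with its keep_alive parameter; render_system_prompt() is a hypothetical helper, and since there's no API to pin a request to a specific server-side slot, the slot tracking is purely on the integration side:

```python
import asyncio
import time

from ollama import AsyncClient

REFRESH_INTERVAL = 60  # seconds; this would be the configurable cadence


def render_system_prompt() -> str:
    """Hypothetical helper: build the current system prompt, including sensor state."""
    return "You are a Home Assistant voice assistant. ..."


async def refresh_slots(
    client: AsyncClient,
    model: str,
    slots: list[PreloadSlot],
    refresh_active: bool = False,  # side-note option: refresh active conversations too
) -> None:
    """Periodically re-evaluate the system prompt so idle slots stay preloaded."""
    while True:
        prompt = render_system_prompt()
        for slot in slots:
            if slot.active and not refresh_active:
                continue
            # Sending just the system prompt primes Ollama's prompt cache, so a
            # real conversation can reuse the evaluated prefix instead of paying
            # the cold-start cost again.
            await client.chat(
                model=model,
                messages=[{"role": "system", "content": prompt}],
                keep_alive="10m",  # keep the model loaded between refresh cycles
            )
            slot.last_refreshed = time.monotonic()
        await asyncio.sleep(REFRESH_INTERVAL)
```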
If the user starts a new conversation and a preloaded slot is available, that slot would be used, letting the user skip the wait for the system prompt to be evaluated.
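Slot selection on conversation start could then be as simple as this (again using the hypothetical PreloadSlot from above):

```python
def acquire_slot(slots: list[PreloadSlot]) -> PreloadSlot | None:
    """Hand a preloaded idle slot to a new conversation, or None if all are busy."""
    for slot in slots:
        if not slot.active:
            slot.active = True  # the conversation now owns this slot
            return slot
    return None  # no preloaded slot free: fall back to a cold start


def release_slot(slot: PreloadSlot) -> None:
    """Mark the slot idle again when the conversation ends, so background refreshes resume."""
    slot.active = False
```

If acquire_slot() returns None, the conversation would just behave as it does today and evaluate the system prompt from scratch.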