I’m erring on the side of caution and using a 30K context, which ends up with Ollama splitting the model between system RAM and GPU RAM, which isn’t ideal.
Is there a way to see how big the call is in tokens, so I can set the context size more precisely and maybe get everything onto the GPU?
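
In case it helps frame what I mean: my rough understanding is that a non-streaming call to Ollama's /api/generate reports token counts in its response, so something like this sketch is what I'm imagining (the model name and the prompt_eval_count / eval_count fields are just my assumptions about what comes back):

```python
import json
import urllib.request

# Sketch: send a non-streaming request to a local Ollama server and read back
# the token counts it reports. Assumes the default port (11434) and that the
# response JSON includes prompt_eval_count / eval_count.
payload = {
    "model": "llama3",  # whichever model you're loading
    "prompt": "paste the same prompt you normally send here",
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

# prompt_eval_count = tokens in the prompt, eval_count = tokens generated
print("prompt tokens:", body.get("prompt_eval_count"))
print("response tokens:", body.get("eval_count"))
```

If that's roughly right, I could size num_ctx from those numbers instead of guessing at 30K.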
Thanks!