HASSOS is in a VM hosted on Proxmox, with only an Intel iGPU that might be able to be passed through.
I moved Whisper and Ollama off to a machine with an eGPU (a 2080 Ti) attached. This is running a medium-int8 Whisper model with beam size 5 (I'd like to try large, but the linuxserver.io image doesn't support it).
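If you want to see what the beam size and model size actually cost on the 2080 Ti, faster-whisper (the library those images wrap) can be driven directly from Python. A rough sketch, assuming `pip install faster-whisper`, a visible CUDA device, and a test clip `audio.wav` (the clip name is a placeholder):

```python
# Sketch: time faster-whisper directly, outside the Docker image.
import time
from faster_whisper import WhisperModel

# "large-v3" should also load here if VRAM allows.
model = WhisperModel("medium", device="cuda", compute_type="int8")

for beam_size in (1, 5):
    start = time.perf_counter()
    segments, info = model.transcribe("audio.wav", beam_size=beam_size)
    text = " ".join(s.text for s in segments)  # segments is lazy; joining runs it
    print(f"beam_size={beam_size}: {time.perf_counter() - start:.2f}s -> {text!r}")
```

Dropping beam size from 5 to 1 is usually the cheapest speedup, at some accuracy cost.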
Is this a reasonable speed? (It's certainly much faster than running it all on CPU in the VM.)
It seems I'm at the point of diminishing returns on the speech-to-text, but it would be cool to get it faster if possible.
Any suggestions on how to get the NLP with Ollama down to under a second? (This is currently with 55 exposed entities.)
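One thing worth ruling out is model load time: if Ollama unloads the model between requests (the default keep-alive is only a few minutes), the first request after an idle period pays the whole load cost. A sketch of a warm-up request, assuming Ollama's default API on localhost:11434 (the model name is just an example):

```python
# Sketch: ask Ollama to keep the model resident so requests skip the cold-start load.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",  # example name; use whatever HA points at
        "prompt": "warm-up",
        "stream": False,
        "keep_alive": -1,  # negative = keep the model loaded indefinitely
    },
).json()

# Durations come back in nanoseconds; this should be ~0 once the model is warm.
print(f"load_duration: {resp['load_duration'] / 1e9:.2f}s")
```

Setting `OLLAMA_KEEP_ALIVE=-1` in the server's environment should do the same thing globally.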
Same idea here: GPT-4o mini takes about two seconds, and my local GPU (a 3060) also takes about two seconds. Normally OpenAI's servers are rockets while I'm a snail, but here there's no difference in time.
Yes, because OpenAI has countless Tesla GPUs, and their model parameters run into the billions; that's something our RTX cards can't compare to. Yet both take about 2 seconds, and I don't know where the bottleneck is.
Does anyone know if it's possible to pass the --verbose argument to the runner when it runs the model? This would provide additional inference statistics…
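If the runner in question is Ollama, the CLI does accept `ollama run <model> --verbose`, which prints timing stats after each reply, and the HTTP API returns the same counters in the final response. A sketch that pulls them out of `/api/generate` (the model name and prompt are placeholders):

```python
# Sketch: read Ollama's inference stats from the API instead of the CLI flag.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": "Turn on the kitchen light.", "stream": False},
).json()

# All *_duration fields are reported in nanoseconds.
for key in ("total_duration", "load_duration", "prompt_eval_duration", "eval_duration"):
    print(f"{key}: {resp.get(key, 0) / 1e9:.2f}s")
print("eval_count (tokens generated):", resp.get("eval_count"))
```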