I currently have Llama3 running on a server (Lenovo P520) with a 12GB 3060. It works great with Home Assistant and is really responsive. Now that I have some Voice Assistant PEs, I'm looking at accelerating and improving Whisper and Piper, so I'm considering adding a second 3060 to the server. Does anyone know if it's possible to segment the GPUs (one card dedicated to Llama3, with the second card split between Whisper and Piper)? Home Assistant is running on a separate server.
I have a 3060 12GB as well, running on a dedicated Ubuntu machine. I have Whisper, Kokoro TTS, and Ollama all running together.
It's a squeeze with 12GB of VRAM, but it works. I find Llama useless; Qwen2.5 7B is the minimum. 14B runs but is slow when Whisper and Kokoro are also loaded, since the combined VRAM use exceeds 12GB.
You can specify which GPU number to use in your Docker configs for Whisper and Piper.
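Something like this compose sketch is what I mean. It assumes the NVIDIA Container Toolkit is installed, the two cards enumerate as devices 0 and 1, and you're using the Wyoming images for Whisper and Piper; the image names, tags, and command arguments here are just examples, so swap in whichever GPU-enabled builds you actually run:

```yaml
# docker-compose.yml sketch: pin each service to a physical GPU.
services:
  ollama:                              # Llama3 gets the whole first card
    image: ollama/ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]        # first 3060
              capabilities: [gpu]

  whisper:                             # STT shares the second card
    image: rhasspy/wyoming-whisper     # example image; use a CUDA build
    command: --model medium-int8 --language en
    ports:
      - "10300:10300"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]        # second 3060
              capabilities: [gpu]

  piper:                               # TTS also shares the second card
    image: rhasspy/wyoming-piper       # example image; use a CUDA build
    command: --voice en_US-lessac-medium
    ports:
      - "10200:10200"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
```

Setting `CUDA_VISIBLE_DEVICES` as an environment variable on each container works too; just keep in mind the container then sees its assigned card remapped as device 0.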