I’ve been looking into self-hosting LLMs or stable diffusion models using something like LocalAI and / or Ollama and LibreChat.
Some questions to get a nice discussion going:
- Any of you have experience with this?
- What are your motivations?
- What are you using in terms of hardware?
- Considerations regarding energy efficiency and associated costs?
- What about renting a GPU? Privacy implications?
Mostly via terminal, yeah. It’s convenient when you’re used to it - I am.
Let’s see, my inference speed now is:
As of quality, I try to avoid quantisation below Q5 or at least Q4. I also don’t see any point in using Q8/f16/f32 - the difference with Q6 is minimal. Other than that, it really depends on the model - for instance, llama-3 8B is smarter than many older 30B+ models.
Thanks! Glad to see the 8x7B performing not too bad - I assume that’s a Mistral model? Also, does the CPU significantly affect inference speed in such a setup, do you know?
If your CPU isn’t ancient, it’s mostly about memory speed. VRAM is very fast, DDR5 RAM is reasonably fast, swap is slow even on a modern SSD.
8x7B is mixtral, yeah.