Tl;dr

I have no idea what I’m doing, and the desire for a NAS and local LLM has spun me down a rabbit hole. Pls send help.

Failed Attempt at a Tl;dr

Sorry for the long post! I’m brand new to home servers, but am thinking about building out the setup below (Machine 1 to be on 24/7, Machine 2 to be spun up only when needed for energy efficiency), with a target budget cap of ~USD 4,000. I’d appreciate any tips, suggestions, pitfalls, or flags for where I’m being a total idiot and have missed something basic:

Machine 1: TrueNAS Scale with Jellyfin, Syncthing/Nextcloud + Immich, Collabora Office, SearXNG if possible, and potentially the *arr apps

On the drive front, I’m considering 6x Seagate IronWolf 8TB in RAIDz2 for 32TB usable space (waaay more than I think I’ll need, but I know it’s a PITA to upgrade a vdev, so I’m trying to future-proof), and I’m thinking I also want to add an L2ARC cache (which I think should be something like a 500GB-1TB M.2 NVMe SSD). I’d read somewhere that the back-of-the-envelope RAM requirement was 1GB of RAM per 1TB of storage (the TrueNAS Scale hardware guide definitely does not say this, but with the L2ARC cache and all of the other things I’m trying to run I probably get to the same number), so I’d be looking at around 48GB; though I’m under the impression that an odd number of DIMMs isn’t great for performance, so that might bump up to 64GB across 4x16GB? I’m ambivalent on DDR4 vs. DDR5 (and unless there’s a good reason not to, I’d be inclined to just use DDR4 for cost), but I’m leaning toward ECC, even though it may not be strictly necessary.
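To show my work, here’s the back-of-the-envelope arithmetic I’m running on. This is just a sketch based on the rules of thumb above; the ~80 bytes/record L2ARC header figure and the default 128KiB recordsize are my own assumptions, so please correct me if I’ve botched it:

```python
# Rough sanity-check math for the proposed pool (rules of thumb only,
# not a substitute for the TrueNAS hardware guide).

drives = 6
drive_tb = 8
parity = 2  # RAIDz2 survives any two drive failures

usable_tb = (drives - parity) * drive_tb
print(f"Usable capacity (before ZFS overhead): {usable_tb} TB")  # 32 TB

# The oft-quoted "1 GB of RAM per 1 TB of raw storage" rule of thumb:
ram_rule_gb = drives * drive_tb
print(f"RAM by the 1GB/TB rule: {ram_rule_gb} GB")  # 48 GB

# L2ARC isn't free: each cached record keeps a small header (~80 bytes,
# as I understand it) in ARC, i.e. in RAM. Assuming the default 128 KiB
# recordsize, a 1 TB L2ARC would cost roughly:
l2arc_bytes = 1e12
recordsize_bytes = 128 * 1024
header_bytes = 80
arc_overhead_gb = l2arc_bytes / recordsize_bytes * header_bytes / 1e9
print(f"ARC overhead for a 1 TB L2ARC: ~{arc_overhead_gb:.1f} GB")  # ~0.6 GB
# (Much bigger with small recordsizes, which is one reason people warn
# against oversizing L2ARC relative to RAM.)
```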

Machine 2: Proxmox with LXC for Llama 3.3, Stable Diffusion, Whisper, OpenWebUI; I’d also like to be able to host a heavily modded Minecraft server (something like All The Mods 9 for 4 to 5 players) likely using Pterodactyl

I am struggling with what to do about GPUs here. I’d love to be able to run the 70b Llama 3.3; it seems like that will require something like 40-50GB of VRAM at a minimum to run comfortably, but I’m not sure of the best way to get there. I’ve seen some folks suggest 2x3090s as the right balance of value and performance, but plenty of other folks seem to advocate for sticking with the newer 40-series architecture (especially with the 50 series around the corner and the expectation that prices might finally come down); on the other end of the spectrum, I’ve also seen people advocate going back to P40s.
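For what it’s worth, here’s the rough arithmetic that got me to the 40-50GB number. The bits-per-weight and overhead values are my own guesses rather than benchmarks, so treat this as a sketch:

```python
# Rough VRAM estimate for a 70B model (assumed figures, not measurements).

params_billion = 70
bits_per_weight = 4.5  # ~Q4-class quantization, some layers at higher precision

weights_gb = params_billion * bits_per_weight / 8
overhead_gb = 5.0  # KV cache, activations, CUDA context (a guess)

total_gb = weights_gb + overhead_gb
print(f"Weights: ~{weights_gb:.0f} GB, estimated total: ~{total_gb:.0f} GB")
# ~39 GB of weights plus overhead lands in the low-to-mid 40s, which is
# why 2x3090 (2x24GB = 48GB) keeps coming up as the value pick.
```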

Am I overcomplicating this? Making any dumb rookie mistakes? Do two machines seem right for my use cases vs. one (or more than two)? Any glaring issues with the hardware I mentioned, or suggestions for a better setup? Ways to better prioritize energy efficiency (even at the risk of more cost up front)? I was targeting something like USD 4,000 as a soft price cap across both machines; does that seem reasonable? How much of a headache is all of this going to be to manage? Is there a light at the end of the tunnel?

Very grateful for any advice or tips you all have!


Hi all,

So sorry again for the long post. Just including a little extra context here in case it’s useful about what I am trying to do (I feel like this is the annoying part of an online recipe where you get a life story instead of the actual ingredient list; at least I tried to put the list first in this post). Essentially I am a total noob, but I have spent the past several months lurking on forums and old Reddit and Lemmy threads, and have watched many hours of YouTube videos just to wrap my head around some of the basics of home networking, and I still feel like I know basically nothing. But I finally got to the point where I felt I could articulate what I am trying to do with enough specificity to not be completely wasting all of your time (I’m very cognizant of Help Vampires and definitely do not want to be one!).

Basically my motivation is to move away from non-privacy respecting services and bring as much in-house as possible, but (as is frequently the case), my ambition has far outpaced my skill. So I am hopeful that I can tap into all of your collective knowledge to make sure I can avoid any catastrophic mistakes I am likely to blithely walk myself into.

Here are the basic things I am trying to accomplish with this setup:

• A NAS with a built in media server and associated apps
• Phone backups (including photos) 
• Collaborative document editing
• A local ChatGPT 4 replacement 
• Locally hosted metasearch
• A place to run a modded Minecraft server for myself and a few friends

The list in the tl;dr represents my best guesses at the right software and (partial) hardware to get all of these done. Based on some of my reading, it seemed that a number of folks recommend running TrueNAS bare metal, as opposed to inside Proxmox, for when the inevitable stability issue hits, and that got me thinking about how it might be valuable to split these functions across two machines: one to handle heavier workloads when needed but be turned off when not (e.g. game server, all local AI), and a second machine to function as a NAS with all the associated apps that would hopefully be more power efficient and run 24/7.

There are three things that I think would be very helpful to me at this point:

  1. High level feedback on whether this strategy sounds right given what I am trying to accomplish. I feel like I am breaking the fundamental Keep It Simple Stupid rule and will likely come to regret it.
  2. Any specific feedback on the right hardware for this setup.
  3. Any thoughts about how to best select hardware to maximize energy efficiency/minimize ongoing costs while still accomplishing these goals.

Also, above I mentioned that I am targeting around USD 4,000, but I am willing to be flexible on that if spending more up front will help keep ongoing costs down, or if spending a bit more will lead to markedly better performance.

Ultimately, I feel like I just need to get my hands on something and start screwing things up to learn, but I’d love to avoid any major, costly screw-ups before I just start ordering parts; hence this post as a reality check before I do just that.

Thanks so much if you read this far down the post, and to all of you who share any thoughts you might have. I don’t really have folks IRL I can talk to about these sorts of things, so I am extremely grateful to be able to reach out to this community.

Edit: Just wanted to say a huge thank you to everyone who shared their thoughts! I posted this fully expecting to get no responses and figured it was still worth doing just to write out my plan as it stood. I am so grateful for all of your thoughtful and generous responses sharing your experience and advice. I have to hop offline now, but look forward to responding tomorrow to any comments I haven’t had a chance to get to. Thanks again! :)

  • Estebiu@lemmy.dbzer0.com · 2 days ago

    For Llama 70B I’m using an RTX A6000; slightly older, but it does the job magnificently with its 48GB of VRAM.

    • libretech@reddthat.com (OP) · 2 days ago

      Wow, that sounds amazing! I think that GPU alone would probably exceed my budget for the whole build lol. Thanks for sharing!

      • Estebiu@lemmy.dbzer0.com · 3 hours ago

        You can still run smaller models on cheaper GPUs; no need for the greatest GPU ever. Btw, I use it for other things too, not only LLMs.

    • sntx@lemm.ee · 2 days ago

      I’m also on p2p 2x3090 with 48GB of VRAM. Honestly it’s a nice experience, but still somewhat limiting…

      I’m currently running deepseek-r1-distill-llama-70b-awq with the aphrodite engine, though the same applies to llama-3.3-70b. It works great and is way faster than ollama, for example. But my max context is around 22k tokens. More VRAM would allow me more context, and even more VRAM would allow for speculative decoding, CUDA graphs, …
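      If anyone’s curious where that ~22k ceiling comes from, the KV-cache math works out roughly like this (approximate architecture numbers for a llama-70b-class model, so take it as a sketch):

      ```python
      # Approximate KV-cache cost per token for a llama-70b-class model:
      # 80 layers, GQA with 8 KV heads, head_dim 128, fp16 cache.

      layers, kv_heads, head_dim = 80, 8, 128
      bytes_per_value = 2  # fp16

      # The factor of 2 covers both the K and the V tensors.
      kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
      print(f"KV cache per token: ~{kv_per_token / 1024:.0f} KiB")  # ~320 KiB

      context = 22_000
      print(f"{context} tokens: ~{context * kv_per_token / 1024**3:.1f} GiB")  # ~6.7 GiB
      # With ~35-40 GB of 4-bit weights on 48 GB total, that ~7 GiB of KV
      # cache is about all the headroom that's left.
      ```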

      Maybe I’ll drop down to a 35b model to get more context and a bit of speed, but I don’t think I can justify the possible decrease in answer quality.

      • Estebiu@lemmy.dbzer0.com · 2 hours ago

        Uhh, a lot of big words here. I mostly just play around with it… I’ve never used LLMs for anything more serious than a couple of tests, so I don’t even know how many tokens my setup can generate…

      • libretech@reddthat.com (OP) · 2 days ago

        This is exactly the sort of tradeoff I was wondering about, thank you so much for mentioning this. I think ultimately I would align with you in prioritizing answer quality over context length (but it sure would be nice to have both!!). My plan for now, based on some of the other comments, is to go ahead with the NAS build and keep my eyes peeled for any GPU deals in the meantime (though honestly I am not holding my breath). Once I’ve proved to myself I can run something stable without burning the house down, I’ll move on to something more powerful for the local LLM. Thanks again for sharing!

    • bradd@lemmy.world · 2 days ago

      I’m running 70b on two used 3090s with an A6000 NVLink bridge. I think I got the cards for $900 each, and maybe $200 for the NVLink. Also works great.

      • libretech@reddthat.com (OP) · 2 days ago

        Thanks for sharing! I’ll probably try to go this route once I get the NAS squared away and turn back to local LLMs. Out of curiosity, are you using the Q4_K_M quantization type?
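        (For reference, my rough understanding is that Q4_K_M averages out to about 4.85 bits per weight, since it mixes 4- and 6-bit blocks; if that’s right, the back-of-the-envelope size is:)

        ```python
        # Rough on-disk/in-VRAM size of a 70B model at Q4_K_M.
        # ~4.85 bits/weight is an approximation, not an exact figure.
        params_billion = 70
        bits_per_weight = 4.85
        print(f"~{params_billion * bits_per_weight / 8:.0f} GB")  # ~42 GB
        ```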