guizzy (in exile) (guizzy@shitposter.club)'s status on Thursday, 10-Aug-2023 02:29:30 JST

@Moon @Christmas_Man @meso This is the guy making 4-bit quantized models for home use: https://huggingface.co/TheBloke
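Grabbing one of his files is just a Hub download. A minimal sketch with the huggingface_hub library; the repo_id and filename here are placeholders, check his actual model cards for the real names:

```python
# Minimal sketch: download a single quantized model file from the Hugging Face Hub.
# The repo_id and filename below are placeholders; look up real ones on
# https://huggingface.co/TheBloke before running this.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="TheBloke/SomeModel-13B-GGML",      # placeholder repo
    filename="somemodel-13b.ggmlv3.q4_0.bin",   # placeholder 4-bit quantized file
)
print(f"Downloaded to: {local_path}")
```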
GPTQ models are for GPU-based inference, GGML models are for CPU-based inference (though you can get a speed boost by offloading some of the load onto your GPU).
I'd recommend installing Oobabooga's Text Generation Webui. It's the Automatic1111 of LLMs: https://github.com/oobabooga/text-generation-webui
With 24 GB of VRAM, you can run 13B to 20B GPTQ models with room to spare for extended (over 2048-token) context while keeping Stable Diffusion loaded at the same time. You should also just about be able to run 30B models with 2048-token context on a headless Linux machine. Expect double-digit tokens per second; answers will pop up in seconds.
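Roughly what loading one of those GPTQ models looks like in Python with the auto-gptq library, if you want to skip the webui (the model name is a placeholder, and Ooba does all of this for you behind the UI anyway):

```python
# Minimal sketch: run a 4-bit GPTQ model entirely in VRAM with auto-gptq.
# The model name is a placeholder; substitute a real GPTQ repo from TheBloke.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/SomeModel-13B-GPTQ"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",       # the whole quantized model lives on the GPU
    use_safetensors=True,
)

inputs = tokenizer("Tell me about llamas.", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```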
With GGML models your RAM is going to be your limit, and speed will depend on your CPU, GPU, RAM speed, and how much you can offload to GPU/VRAM. In general it's likely to be MUCH slower than GPTQ: if you're running as big a model as your machine can fit, expect single-digit tokens per second and sometimes over a minute of waiting for an answer. Sometimes it's worth it, sometimes not. I've heard people say that the returns from 30B to 70B are quite diminished (i.e. it's not really noticeably smarter, just different).
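If you go the GGML route, the llama-cpp-python bindings are one way to do that partial GPU offload. A rough sketch; the model path and n_gpu_layers value are placeholders you'd tune to your VRAM, and the offload only does anything if you installed a CUDA/cuBLAS build:

```python
# Minimal sketch: CPU inference with partial GPU offload via llama-cpp-python.
# Model path and n_gpu_layers are placeholders; how many layers fit is up to your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/somemodel-13b.ggmlv3.q4_0.bin",  # placeholder GGML file
    n_ctx=2048,        # context window
    n_gpu_layers=32,   # layers to offload to VRAM; 0 = pure CPU
)

result = llm("Q: What is GGML? A:", max_tokens=128, stop=["Q:"])
print(result["choices"][0]["text"])
```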