guizzy (in exile) (guizzy@shitposter.club)'s status on Thursday, 10-Aug-2023 02:29:30 JST

@Moon @Christmas_Man @meso This is the guy making 4-bit quantized models for home use: https://huggingface.co/TheBloke
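Grabbing one of his files is just a Hub download. A minimal sketch with the huggingface_hub library; the repo_id and filename here are placeholders, check his actual model cards for the real names:

```python
# Minimal sketch: download a single quantized model file from the Hugging Face Hub.
# The repo_id and filename below are placeholders; look up real ones on
# https://huggingface.co/TheBloke before running this.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="TheBloke/SomeModel-13B-GGML",      # placeholder repo
    filename="somemodel-13b.ggmlv3.q4_0.bin",   # placeholder 4-bit quantized file
)
print(f"Downloaded to: {local_path}")
```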
GPTQ models are for GPU-based inference, GGML models are for CPU-based inference (though you can get a speed boost by offloading some of the load onto your GPU).
I'd recommend installing Oobabooga's Text Generation Webui. It's the Automatic1111 of LLMs: https://github.com/oobabooga/text-generation-webui
With 24 GB of VRAM, you can run 13B to 20B GPTQ models with room to spare for extended (over 2048-token) context while keeping Stable Diffusion loaded at the same time. You should also just about be able to run 30B models with 2048-token context on a headless Linux machine. Expect double-digit tokens per second; answers will pop up in seconds.
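Roughly what loading one of those GPTQ models looks like in Python with the auto-gptq library, if you want to skip the webui (the model name is a placeholder, and Ooba does all of this for you behind the UI anyway):

```python
# Minimal sketch: run a 4-bit GPTQ model entirely in VRAM with auto-gptq.
# The model name is a placeholder; substitute a real GPTQ repo from TheBloke.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_name = "TheBloke/SomeModel-13B-GPTQ"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name,
    device="cuda:0",       # the whole quantized model lives on the GPU
    use_safetensors=True,
)

inputs = tokenizer("Tell me about llamas.", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```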
With GGML models your RAM is going to be your limit, and speed will depend on your CPU, GPU, RAM speed, and how much you can offload to GPU/VRAM. In general it's likely to be MUCH slower than GPTQ: if you're running as big a model as your machine can fit, expect single-digit tokens per second and sometimes over a minute of waiting for an answer. Sometimes it's worth it, sometimes not. I've heard people say that the returns from 30B to 70B are quite diminished (i.e. it's not really noticeably smarter, just different).
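If you go the GGML route, the llama-cpp-python bindings are one way to do that partial GPU offload. A rough sketch; the model path and n_gpu_layers value are placeholders you'd tune to your VRAM, and the offload only does anything if you installed a CUDA/cuBLAS build:

```python
# Minimal sketch: CPU inference with partial GPU offload via llama-cpp-python.
# Model path and n_gpu_layers are placeholders; how many layers fit is up to your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/somemodel-13b.ggmlv3.q4_0.bin",  # placeholder GGML file
    n_ctx=2048,        # context window
    n_gpu_layers=32,   # layers to offload to VRAM; 0 = pure CPU
)

result = llm("Q: What is GGML? A:", max_tokens=128, stop=["Q:"])
print(result["choices"][0]["text"])
```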