@PopulistRight @RustyCrab @Inginsub @IAMAL_PHARIUS There are two different parts to LLMs, or basically any of these models regardless of architecture: there's training and there's inference/generation.
For training, you do need to run very large matrix multiplications a lot of times, so the high-end GPUs are valuable for that step. What DeepSeek seems to suggest, along with other smaller successes from groups with less compute like Mistral in France, is that you do not need the scale people were assuming to make a really good model. The GPU's massive parallelization is most useful here, much more so than for inference. And given we don't know how to reliably make linear progress on model quality, the ability to train a lot of models is good, assuming you're twisting the right knobs and dials. But the reality is that there's no guarantee you will ever get the right combination of factors to make the next big jump, even with all that compute. Which means plenty of companies are going to cut back, because investors aren't going to like them spending $200M on GPUs for a "maybe we beat this model we could use for free." I also remember some people suggesting there are questions about the copyrightability of trained model weights, but I haven't looked into that.
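To make the "training is mostly giant matrix multiplications" point concrete, here's a toy sketch (my own illustration, not anyone's actual training code, and the layer sizes are made up): one training step is a forward pass of big matmuls, a loss, and a backward pass that's roughly twice the matmul work again, which is exactly the kind of work GPUs parallelize well.

```python
# Toy training step, illustration only: the sizes are invented and this is
# nobody's real model, but the shape of the work (big matmuls forward, ~2x
# that backward) is why training wants highly parallel GPU hardware.
import torch

hidden, ffn, vocab, batch_tokens = 1024, 4096, 32000, 2048  # made-up sizes
device = "cuda" if torch.cuda.is_available() else "cpu"

x = torch.randn(batch_tokens, hidden, device=device)
w_up = torch.randn(hidden, ffn, device=device, requires_grad=True)
w_down = torch.randn(ffn, hidden, device=device, requires_grad=True)
w_out = torch.randn(hidden, vocab, device=device, requires_grad=True)
targets = torch.randint(0, vocab, (batch_tokens,), device=device)

# Forward pass: three large matrix multiplications (a real transformer layer
# has attention and more of these, but the work looks the same).
h = torch.relu(x @ w_up) @ w_down
logits = h @ w_out
loss = torch.nn.functional.cross_entropy(logits, targets)

# Backward pass: computes gradients for every weight, roughly doubling the
# matmul work. Repeat this for trillions of tokens and you see why the
# high-end GPUs matter for this phase.
loss.backward()
print(f"loss={loss.item():.2f}, grad shape for w_out: {tuple(w_out.grad.shape)}")
```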
On the inference side, what you primarily need is memory speed. Doing inference on a given set of tokens isn't a particularly complicated or expensive process once the weights are baked, but you have to keep the entire model in RAM at once: loading it from disk or swapping layers between RAM and VRAM is very slow, and splitting it across devices is also quite slow. So if you have massive amounts of very fast GDDR RAM, you can get good tokens/sec output even on a more modest processor, though GPU architecture is still better suited to it. DeepSeek's inference is actually being run on Chinese-made Huawei chips, while the training portion was done on Nvidia hardware (2,048 H800s, I believe).
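Here's a back-of-envelope sketch of why memory bandwidth dominates inference (the bandwidth figures and model sizes below are my own rough examples, not anything from DeepSeek): generating one token means streaming every active weight through the chip once, so tokens/sec is roughly bandwidth divided by the size of the weights you have to read.

```python
# Back-of-envelope only; the bandwidth figures are approximate and the model
# sizes are generic examples, not measurements of any particular system.
def est_tokens_per_sec(params_billion: float, bytes_per_param: float, bandwidth_gb_s: float) -> float:
    """Each generated token reads every active weight once, so throughput
    is roughly memory bandwidth / bytes of weights read per token."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return (bandwidth_gb_s * 1e9) / bytes_per_token

hardware = [
    ("desktop DDR5 (dual channel)", 80),     # ~GB/s, approximate
    ("high-bandwidth unified memory", 400),  # ~GB/s, approximate
    ("datacenter HBM GPU", 3000),            # ~GB/s, approximate
]
for name, bw in hardware:
    small = est_tokens_per_sec(8, 2, bw)    # 8B dense model at 16-bit weights
    large = est_tokens_per_sec(70, 2, bw)   # 70B dense model at 16-bit weights
    print(f"{name:32s} ~{small:6.1f} tok/s (8B)   ~{large:5.1f} tok/s (70B)")
```

That's also why splitting a model between RAM and VRAM hurts so much: the slowest link in the chain sets the effective bandwidth.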
To cover a few other things that are tangential and give context:
There are two kinds of large language models currently in use: dense and Mixture of Experts (MoE). In DeepSeek's case, V3 and the reasoning version of V3, R1, are both MoE models. This means that while the whole model is 685B parameters, only about 37B parameters are active for any one token because of the way they did the architecture. I won't get into the weeds on how that works exactly, but the short version is that it makes inference very fast and very cheap, as long as you have enough VRAM to hold the entire ~600GB model. Compare that to GPT-4, which is estimated to be 1.6T parameters with 8 experts of ~200B each, and you can see how that runs costs up.
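Here's a toy sketch of the MoE idea (my own illustration with made-up sizes, not DeepSeek's actual architecture, which does the routing differently and at much larger scale): a small gating network scores the experts and each token only runs through its top-k picks, so the weights touched per token are a small slice of the total parameter count.

```python
# Toy Mixture-of-Experts layer, illustration only. Expert count, top_k and
# sizes are invented; the point is that each token only touches a couple of
# experts' weights even though the layer as a whole holds many more.
import torch

n_experts, top_k, hidden = 8, 2, 512  # made-up sizes

gate = torch.nn.Linear(hidden, n_experts)
experts = torch.nn.ModuleList([
    torch.nn.Sequential(
        torch.nn.Linear(hidden, 4 * hidden),
        torch.nn.ReLU(),
        torch.nn.Linear(4 * hidden, hidden),
    )
    for _ in range(n_experts)
])

def moe_forward(x: torch.Tensor) -> torch.Tensor:  # x: (tokens, hidden)
    scores = torch.softmax(gate(x), dim=-1)        # score every expert per token
    weights, chosen = scores.topk(top_k, dim=-1)   # but keep only the top-k
    out = torch.zeros_like(x)
    for slot in range(top_k):
        for e in range(n_experts):
            mask = chosen[:, slot] == e
            if mask.any():
                # Only the tokens routed here ever touch expert e's weights.
                out[mask] += weights[mask, slot:slot + 1] * experts[e](x[mask])
    return out

tokens = torch.randn(16, hidden)
print(moe_forward(tokens).shape)  # torch.Size([16, 512])
```

A dense model is basically the degenerate case where every token goes through all the weights, which is why active parameters and total parameters are the same number there.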
Dense models, like Claude's Sonnet and Opus (estimated to be a few hundred billion parameters each), have to pass through every parameter to inference each token. You can think of parameters as the brain cells, for a close-enough conception. A token is a piece of a word; you can visualize tokens over here to get a better idea:
platform.openai.com/tokenizer
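If you'd rather poke at it locally than use the web tool, OpenAI's tiktoken library (pip install tiktoken) does the same thing; the example string below is mine and the exact splits are just whatever that encoding happens to produce.

```python
# Show how a sentence gets chopped into tokens. The exact splits depend on
# the tokenizer; this uses one of OpenAI's public encodings as an example.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Tokenization splits words into pieces."
ids = enc.encode(text)
pieces = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace") for t in ids]
print(ids)     # the integer IDs the model actually sees
print(pieces)  # the word fragments those IDs map back to
```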
The last thing I'll say, since I don't think I've covered it anywhere, is that part of the problem with the US system is that Big Tech promotes and gives out raises based on "impact." So when people see a project coming down the pipeline, it is very common for ladder climbers and other parasites to abandon whatever team they're on to go glom onto the big new thing that's sure to get promotions handed out like candy. They inject themselves into the process to prove contribution to the project, which makes the whole enterprise slower, more inefficient, and more likely to get derailed into non-primary concerns like ethics and safety and so on. We have a decent amount of good data that safety/ethics training actively harms model performance by basically creating these tensor black holes that suck in any passing request and turn it into "As an AI language model" type shit. But that's a different set of complaints.