VRAM modules themselves are fairly cheap, though there's definitely work involved in making sure your PCB layouts and scheduling and so on make effective use of them.
That said, there's no reason high-VRAM needs to mean high board cost. It's not a particularly power-hungry or low-yield part, and GDDR5 is more than fast enough for training and inference at present.
Nvidia has just been raking in insane amounts of money on massive data center orders, while gen over gen bumping the prices of the consumer cards that USED to be the backbone of their business so they could reserve the silicon for datacenter parts. So now people on the PC side are on very slow GPU upgrade cycles, the 50 series is underperforming in early reviews, and it turns out you DON'T need to match Meta or X or OpenAI's massive fleets of H100s (which is what the clueless megacorps were insisting you needed). You can actually do it on a tiny fraction of the compute if you're not being stupid about it.
There's going to be a lot of bad-actor information trying to make it sound like DeepSeek cheated somehow, but they didn't. They'll try to claim they "distilled" OAI's models, but at most they used them to generate synthetic datasets. Everyone does that, even when it's against TOS. Google is using Anthropic's models to make synthetic datasets for Gemini. It's happening all over the industry.
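To make "used them for synthetic datasets" concrete, it's usually nothing fancier than the sketch below: prompt a stronger teacher model, save the prompt/response pairs, fine-tune on them later. This is a minimal illustration assuming the openai Python client; the model name and prompts are placeholders, not anyone's actual pipeline.

```python
# Minimal sketch of synthetic-data generation (the soft sense of "distillation").
# Assumptions: the `openai` Python client is installed and OPENAI_API_KEY is set;
# the model name and prompts below are placeholders, not anyone's real pipeline.
import json
from openai import OpenAI

client = OpenAI()
prompts = [
    "Explain mixture-of-experts routing in two sentences.",
    "Write a short proof that sqrt(2) is irrational.",
]

with open("synthetic_dataset.jsonl", "w") as f:
    for p in prompts:
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder teacher model
            messages=[{"role": "user", "content": p}],
        )
        pair = {"prompt": p, "response": resp.choices[0].message.content}
        f.write(json.dumps(pair) + "\n")  # later used as fine-tuning data
```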
Anyway, Nvidia tanked because speculators are no longer convinced everyone is going to keep placing $100M orders for more H100s and Hopper stuff, so there's not going to be an infinite money printer in Jensen's oven anymore.
@Marakus@RustyCrab@Inginsub@IAMAL_PHARIUS Yep. The company heads are almost exclusively finance douchebags who bought up the top ML guys because ML has been useful for finance for a while now (trend prediction, trade data interpretation, related junk), so they're selling it to investors the way they sell everything in modern tech: "We have a frontier model and the best people, and we're paying them insane amounts to stay with us, so that guarantees we're going to get better models faster." But it doesn't.
Even then, there's a ceiling on how good the "best model" can be, so eventually everyone will have a comparable model and you'll be competing on features/tools and user reach. It's a money free-for-all for whoever has the best model until we reach the meaningful apex of general-use LLMs, so that's what all the jockeying represents at present, even if the investors don't realize it.
To clarify, though - can NVIDIA's GPUs still be used to power Deepseek - so that the increased efficiency of Deepseek + those GPUs would result in a better/faster AI?
@PopulistRight@RustyCrab@Inginsub@IAMAL_PHARIUS There are two distinct phases for LLMs, or basically any of these models, more or less regardless of architecture: training, and inference/generation.
For training, you do need to run very large matrix multiplications an enormous number of times, so the high-end GPUs are valuable for that step. What DeepSeek seems to suggest, along with other smaller successes from groups with less compute like Mistral in France, is that you do not need the scale people were assuming to make a really good model. The GPU's massive parallelization is useful here, much more so than for inference. And since we don't know how to reliably make linear progress on model quality, the ability to train a lot of models is good, assuming you're twisting the right knobs and dials. But the reality is that there's no guarantee you'll ever hit the right combination of factors to make the next big jump, even with all that compute. Which means plenty of companies are going to cut back, because investors aren't going to like them spending $200M on GPUs for a "maybe we beat this model we could use for free." I also remember some people suggesting there are questions about the copyrightability of trained model weights, but I haven't looked into that.
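For a rough sense of scale, there's a common rule of thumb that training costs about 6 FLOPs per (active) parameter per training token. A napkin sketch with purely illustrative numbers (my assumptions, not any lab's real figures):

```python
# Back-of-envelope training cost, using the common rule of thumb:
#   training FLOPs ~= 6 * active parameters * training tokens.
# Every number below is an illustrative assumption, not a measured figure.

active_params = 37e9    # active params per token for an MoE (assumption)
tokens        = 15e12   # training tokens (assumption)
flops_needed  = 6 * active_params * tokens

gpu_flops   = 1e15      # ~1 PFLOP/s of dense BF16 per accelerator (rough)
utilization = 0.35      # real-world utilization sits well below peak
num_gpus    = 2048

seconds = flops_needed / (num_gpus * gpu_flops * utilization)
print(f"~{seconds / 86400:.0f} days of wall-clock time on this cluster")
# order-of-magnitude only: roughly a couple of months on ~2k accelerators
```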
On the inference side, what you primarily need is RAM speed. Doing inference on a given set of tokens isn't a particularly complicated or expensive process once the weights are baked, but you have to keep the entire model in RAM at once: loading it from disk or swapping layers between RAM and VRAM is very slow, and splitting the layers across the two is also quite slow. So if you have massive amounts of very fast GDDR RAM, you can get good tokens/sec output even on a more modest processor, though GPU architecture is still better suited to it. DeepSeek's inference is actually being run on Chinese-made Huawei chips, while the training was done on Nvidia hardware (2048 H800s, I believe).
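The "RAM speed" point is easy to see with napkin math: at batch size 1 you stream every active weight from memory once per generated token, so tokens/sec is roughly memory bandwidth divided by the active-parameter bytes. Illustrative numbers only:

```python
# Napkin math: batch-1 decode is roughly memory-bandwidth bound, because every
# active weight gets streamed from RAM/VRAM once per generated token.
# Numbers are illustrative assumptions, not measurements of any real system.

def tokens_per_sec(active_params, bytes_per_param, bandwidth_gb_s):
    bytes_per_token = active_params * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# 37B active params stored as 8-bit weights:
print(tokens_per_sec(37e9, 1, 3350))  # HBM-class GPU, ~3.35 TB/s -> ~90 tok/s
print(tokens_per_sec(37e9, 1, 100))   # dual-channel desktop DDR5-ish -> ~2.7 tok/s
```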
To cover a few other things that are tangential but give context: there are two kinds of large language models currently in use, dense and Mixture of Experts (MoE). In DeepSeek's case, V3 and the reasoning version of V3, R1, are both MoE models. This means that while the whole model is 685B parameters, only about 37B parameters are active for any one token because of the way the architecture works. I won't get into the weeds on exactly how that works, but the short version is that it makes inference very fast and very cheap as long as you have enough VRAM to hold the entire ~600GB model. Compare that to GPT-4, which is estimated to be 1.6T parameters with 8 experts of ~200B each, and you can see how that runs costs up.
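If you want the MoE idea in miniature (a toy sketch, not DeepSeek's actual routing): a small gating network scores the experts for each token, and only the top-k of them actually run, so most of the parameters sit untouched for any given token.

```python
# Toy Mixture-of-Experts layer: a gate picks the top-k experts per token, so
# only a small fraction of the total parameters is touched for each token.
# Pure-numpy sketch of the concept, not DeepSeek's actual architecture.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
gate_w  = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(x):                        # x: (d_model,) one token's hidden state
    scores = x @ gate_w                    # gate score for each expert
    top = np.argsort(scores)[-top_k:]      # indices of the top-k experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over top-k
    # Only the chosen experts' weight matrices are ever read:
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

out = moe_forward(rng.standard_normal(d_model))
print(out.shape)   # (64,) -- but only 2 of the 8 expert matrices were used
```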
Dense models, like Claude's Sonnet and Opus (estimated to be a few hundred billion parameters each), have to pass through every parameter to generate each token. You can think of parameters as brain cells, for a close-enough conception. A token is a piece of a word; you can play with the tokenizer here to get a better idea: platform.openai.com/tokenizer
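If you'd rather poke at tokens locally than use the web tool, something like this works, assuming the tiktoken package (OpenAI's open-source tokenizer library):

```python
# Quick look at how text splits into tokens.
# Assumes `pip install tiktoken` (OpenAI's open-source tokenizer library).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # GPT-4-era encoding
text = "DeepSeek runs inference on Huawei chips."
ids = enc.encode(text)
print(ids)                                   # token IDs
print([enc.decode([i]) for i in ids])        # each token as a text fragment
```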
The last thing I'll say since I don't think I've covered it anywhere is that a part of the problem with the US system is that Big Tech promotes and gives out raises based on "impact." So when people see a project coming down the pipeline, it is very common for ladder climbers and other parasites to abandon whatever team they're on to go glom onto the big new thing that's sure to get promotions handed out like candy. And they inject themselves into the processes to prove contribution to the project which makes the whole enterprise slower, more inefficient, and more likely to get derailed into non-primary concerns like ethics and safety and so on. We have a decent amount of good data that safety/ethics training actively harms model performance by basically creating these tensor black holes that suck in any passing request and turn it into "As an AI language model" type shit. But that's a different set of complaints.
@snappler@PopulistRight@RustyCrab@Inginsub hey, have you heard anything about deepseek jailbreaking the H100s to do things differently? something about bypassing cuda? what's that about
Here's the article about that. CUDA isn't really something you need to jailbreak past so much as it's a handy library supplied by Nvidia for interfacing with the hardware. PTX is basically just a lower-level form of programming than CUDA, so they can write more efficient functions that are potentially more tailored to their workloads.
This is sort of like writing your own lighting/ambient occlusion/whatever shader for a video game because the one that comes with the engine doesn't work as efficiently as you want or lacks features you'd like to implement. Nothing too spicy, and I would hope the engineers at Meta and OAI and so on are doing similar things, but given their fleet sizes and likely hardware turnover, they may not be. Stuff like that CAN be very fragile across GPU board revisions, driver updates, and different generations of cards, so it can be a pain to keep working.
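If you want a feel for what "dropping below the supplied library" looks like without leaving Python, the closest everyday analogue is writing your own GPU kernel in Triton instead of calling the stock PyTorch op. To be clear, that's not what DeepSeek did (they went from CUDA down to PTX); it's just the same idea one level up, and the sketch assumes a CUDA-capable GPU with torch and triton installed.

```python
# Analogy only: a hand-written Triton kernel instead of the stock torch.add,
# in the same spirit as writing PTX instead of leaning on CUDA's libraries.
# Assumes a CUDA GPU with torch and triton installed; not DeepSeek's code.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # which block of elements we own
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def my_add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)                     # one program per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

a = torch.randn(10_000, device="cuda")
b = torch.randn(10_000, device="cuda")
print(torch.allclose(my_add(a, b), a + b))             # True
```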
@snappler@PopulistRight@RustyCrab@Inginsub understood your explanation. DID NOT understand a single sentence in that article. software sisters need to come back to earth. i thought deepseek violated some kind of aftermarket tos.
For DeepSeek, it's possible there are other restrictions on the H800s but I'm not aware of anything that Nvidia would be able to do other than refuse to sell further hardware if they found out.
I saw rumblings (with no proof) that they did some driver unlocking and a few other things that are fairly common for overclockers to do, but can void warranties. Nvidia is famously shitty about keeping their drivers locked down to the point that even board partners like ASUS and EVGA had to crack/reverse engineer their drivers to be able to build their own overclocked official cards.
As a tech guy who has some understanding of the stuff going on, to me this looks like literally big tech being outcompeted by a plucky company that got lucky with their model. But since that company is Chinese, they're losing their fucking minds and trying to find reasons they cheated and China could NEVER. It's pretty embarrassing to watch.