@pepsi_man@sickburnbro One of my big focuses lately, and I wish I saw more people talking about it, is pacification. Demoralization and humiliation are part of it, but all media teaches "don't take revenge" and "the noble thing is to forgive." So on top of all the rest, the idea that fighting is wrong in the first place is something that's hard to unlearn.
Our boys are pacified from birth and they either cuck to it and become trannies or soft, conflict-averse soys, OR they get in touch with that anger without being raised to understand how to direct or manage it and it's all rage and anger with no ability to direct it well. Some make it through, but role models are important for correcting this. So is our side creating media, even if it's not explicit or overt. We need to make more culture. Murdoch stands as one of the strongest points to that end, I think.
@shippoaster@IAMAL_PHARIUS@Codeki@Inginsub@PopulistRight@RustyCrab I got called a shill like ten years ago for simply explaining that a lot of Nvidia's domination wasn't superior hardware in itself, but developer support. They offered and still offer easy libraries to lazy/burdened AAA game development teams that let them implement useful, flashy features with minimal friction, and they also have TONS of people available to answer developer questions. AMD/ATI did not. So AMD languished and fell behind because their lower sales meant no ability to create impactful libraries or technologies that were cross-platform. FSR has been their most successful one to date and it's very shit, and the new FSR4 is AMD-only, killing any momentum there. That's on the game side, but consumer ubiquity is important for being the de facto choice for a lot of data center uses. It's why Ryzen has been slow to enter the datacenter.
On the other side, the research/AI side, we have the problem of all the people being college-educated morons who work in Python and don't actually code in the way any programmer or engineer thinks about it. This has infected a lot of things, but it came baked in with the researchers. That helped CUDA become the de facto standard and made sure that anything that wasn't as easy and reliable just fell by the wayside, because who is ordering 500 AMD cards to try to run whatever their half-working, 5x slower brand of CUDA is this week?
So here we are. No one really discussing custom hardware beyond Nvidia's lazy slapped together rackmounts, which barely count. In some ways, I don't blame them. I'm not suggesting they're doing it out of wisdom, but the idea that transformers are the end of the road is pretty premature, so purpose building hardware to catch the wave before it crashes is a good way to go bankrupt. BitNet's ternary architecture could be very valuable if it ever gets proven out. Could lead to very fast, very small models with relatively cheap custom hardware (one day).
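For a rough picture of why ternary is interesting for cheap hardware, here's a little sketch of absmean-style quantization as I understand it from the BitNet b1.58 paper (illustrative C++, not anyone's production code): weights get snapped to {-1, 0, +1} with one scale per row, so the matmul inner loop turns into adds and subtracts, which is exactly what dumb, cheap silicon is good at.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// One row of a weight matrix quantized to {-1, 0, +1} plus a single float scale.
struct TernaryRow {
    std::vector<int8_t> q;
    float scale;
};

// Absmean quantization: scale by the mean |w|, round, clip to [-1, 1].
TernaryRow quantize_row(const std::vector<float>& w) {
    float mean_abs = 0.f;
    for (float x : w) mean_abs += std::fabs(x);
    mean_abs /= (float)w.size();
    TernaryRow out{std::vector<int8_t>(w.size()), mean_abs};
    for (std::size_t i = 0; i < w.size(); ++i) {
        int r = (int)std::lround(w[i] / (mean_abs + 1e-8f));
        out.q[i] = (int8_t)std::max(-1, std::min(1, r));
    }
    return out;
}

// The payoff: the dot product needs no multiplies on the weight side,
// just adds/subtracts and one scale at the end.
float ternary_dot(const TernaryRow& row, const std::vector<float>& x) {
    float acc = 0.f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        if (row.q[i] == 1)       acc += x[i];
        else if (row.q[i] == -1) acc -= x[i];
    }
    return acc * row.scale;
}
```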
For DeepSeek, it's possible there are other restrictions on the H800s but I'm not aware of anything that Nvidia would be able to do other than refuse to sell further hardware if they found out.
I saw rumblings (with no proof) that they did some driver unlocking and a few other things that are fairly common for overclockers to do, but can void warranties. Nvidia is famously shitty about locking down their drivers, to the point that even board partners like ASUS and EVGA had to crack/reverse engineer them to be able to build their own overclocked official cards.
As a tech guy who has some understanding of the stuff going on, to me this looks like literally big tech being outcompeted by a plucky company that got lucky with their model. But since that company is Chinese, they're losing their fucking minds and trying to find reasons they cheated and China could NEVER. It's pretty embarrassing to watch.
Here's the article about that. CUDA isn't really something you need to jailbreak past so much as it is a handy supplied library from Nvidia for interfacing with the hardware. PTX is basically just a lower-level form of programming than CUDA, so they can write more efficient functions that are potentially more tailored to their workloads.
This is sort of like writing your own lighting/ambient occlusion/whatever shader for a video game because the one that comes with the game engine doesn't necessarily work as efficiently as you want or lacks features you'd like to implement. Stuff like that. Nothing too spicy, and I would hope the engineers at Meta and OAI and so on are doing similar things, but given their card fleet size and likely hardware turnover, they may not. Stuff like that CAN be very fragile across GPU board revisions, driver updates, and different generations of cards, so it can be a pain to keep working.
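If anyone's curious what "dropping to PTX" even looks like in practice, here's a toy CUDA kernel with one inline PTX instruction (purely illustrative, obviously not DeepSeek's actual kernels). The point is just that asm() lets you hand-pick instructions instead of taking whatever the compiler or the stock libraries give you:

```cuda
// Toy kernel: d[i] = a[i] * b[i] + c[i], with the fused multiply-add written as
// inline PTX instead of letting nvcc choose the instruction. Real-world uses are
// more like cache-control hints on loads or warp-level tricks that the supplied
// libraries don't expose.
__global__ void fma_ptx(const float* a, const float* b, const float* c, float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float out;
        asm volatile("fma.rn.f32 %0, %1, %2, %3;"
                     : "=f"(out)
                     : "f"(a[i]), "f"(b[i]), "f"(c[i]));
        d[i] = out;
    }
}
// Launched like any other kernel: fma_ptx<<<(n + 255) / 256, 256>>>(a, b, c, d, n);
```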
@IAMAL_PHARIUS@Codeki@PopulistRight@RustyCrab@Inginsub Given the 50 series uplift is only about 8%, they were definitely counting on 24/7 A100s (or the H800s in China) burning out and continued fever-pitch demand from Big Tech and get-rich-quick startups. I think they intended the Digits platform, which is basically GDDR in a little shell with a mediocre processor, to sort of placate the prosumer home user/trainer. Nvidia seemed to intentionally resist bumping VRAM on the consumer cards for this reason, especially considering the increasing VRAM use of AAA games.
If anyone wants to run real R1 at home (not the Ollama distills), grab a few Macs with their unified RAM and you can do network-distributed inference of a decent quality quant with llama.cpp. Since it's only 37B active parameters, it should run at a decent clip as long as you can get it loaded up.
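Back-of-envelope on the "few Macs" part, if anyone wants to sanity check before buying hardware. Totally illustrative numbers: I'm assuming a ~4.5 bits-per-weight quant and 192GB machines, and handling KV cache with nothing more than a rough headroom factor.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    double total_params   = 685e9;  // total parameter count; every expert has to sit in memory
    double bits_per_param = 4.5;    // assumed mid-size quant
    double model_gb = total_params * bits_per_param / 8.0 / 1e9;

    double usable_per_mac = 192.0 * 0.75;  // 192GB box, headroom for OS + KV cache (assumption)
    printf("model ~%.0f GB -> about %d machines\n",
           model_gb, (int)std::ceil(model_gb / usable_per_mac));
    return 0;
}
```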
@PopulistRight@RustyCrab@Inginsub@IAMAL_PHARIUS There are two different parts concerning LLMs or basically any of these models, sort of regardless of architecture. There's training and there's inference/generation.
For training, you do need to run very large matrix multiplications a lot of times, so the high-end GPUs are valuable for that step. What DeepSeek seems to suggest, and other smaller successes from groups with less compute like Mistral in France, is that you do not need the scale that people were thinking to make a really good model. The GPU's massive parallelization is useful here, compared with inference. And given we don't know how to reliably make linear progress on model quality, the ability to train a lot of models is good, assuming you're twisting the right knobs and dials. But the reality here is that there's no guarantee you will ever get the right combination of factors to make the next big jump even with all that compute. Which means plenty of companies are going to cut back, because investors aren't going to like them spending $200M on GPUs for a "maybe we beat this model we could use for free." I also remember some people suggesting there are questions about the copyrightability of trained model weights, but I haven't looked into that.
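To put a ballpark number on "very large matrix multiplications a lot of times": the usual napkin estimate for transformer training is ~6 FLOPs per (active) parameter per training token. Everything in the sketch below (token count, per-GPU sustained throughput, GPU count) is an assumption for illustration, not anyone's reported figures.

```cpp
#include <cstdio>

int main() {
    // ~6 * active_params * training_tokens is the standard rough estimate
    // of total training FLOPs for a transformer.
    double active_params = 37e9;   // MoE active params per token
    double tokens        = 15e12;  // assumed training set size (ballpark)
    double flops         = 6.0 * active_params * tokens;

    // Assumed sustained throughput per GPU (FP8-ish, well below peak) -- illustrative only.
    double per_gpu_flops = 4.0e14;
    double gpus          = 2048;
    double days = flops / (per_gpu_flops * gpus) / 86400.0;
    printf("~%.2e FLOPs, roughly %.0f days on %.0f GPUs\n", flops, days, gpus);
    return 0;
}
```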
On the inference side, what you primarily need is RAM speed. Doing inference on a given set of tokens isn't a particularly complicated or expensive process once the weights are baked, but you have to keep the entire model in RAM at once because loading it from disk or swapping layers between RAM and VRAM is very slow (splitting the model between them is also quite slow). So if you can have massive amounts of very fast GDDR RAM, you can get good tokens/sec output even on a more modest processor, though GPU architecture is still better suited to it. DeepSeek's inference is actually being run on Chinese-made Huawei chips, while the training portion was done on Nvidia hardware (2048 H800s, I believe).
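The "RAM speed" point in napkin form: every generated token has to stream the active weights through the chip, so your tokens/sec ceiling is roughly memory bandwidth divided by bytes moved per token. Bandwidth numbers below are ballpark specs, and this ignores KV cache and batching, so treat it as an upper bound, not a benchmark.

```cpp
#include <cstdio>

int main() {
    double active_params = 37e9;   // MoE active params touched per token
    double bytes_per_w   = 0.55;   // ~4.5-bit quant, as an example
    double bytes_per_tok = active_params * bytes_per_w;

    struct { const char* name; double bw; } systems[] = {
        { "dual-channel DDR5 desktop", 80e9   },
        { "Apple M-series Ultra",      800e9  },
        { "H100 (HBM3)",               3350e9 },
    };
    for (const auto& s : systems)
        printf("%-28s ~%.1f tok/s ceiling\n", s.name, s.bw / bytes_per_tok);
    return 0;
}
```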
To cover a few other things that are tangential and give context: There are two kinds of large language models currently in use, dense and Mixture of Experts (MoE). In the case of DeepSeek, V3 and the reasoning version of V3, R1, are both MoE models. This means that while the whole model is 685B parameters, it only has 37B parameters active at any one time because of the way they did the architecture. I won't get into the weeds on how that works exactly, but the short version is that it makes inferencing very fast and very cheap as long as you have enough VRAM to hold the entire ~600GB model. Compare that to GPT-4, which is estimated to be 1.6T parameters with 8 experts of ~200B each, and you can see how that runs costs up.
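For anyone who wants the cartoon version of how "only 37B active" works: each token goes through a small gating network that scores every expert, and only the top-k experts actually run for that token. Illustrative sketch only; DeepSeek's actual router layers shared experts and load-balancing tricks on top of this basic idea.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Given per-expert gate scores for one token, return the indices of the top-k
// experts. Only those experts' FFN weights get touched for this token, which is
// how total parameters and active parameters end up wildly different.
std::vector<int> pick_experts(const std::vector<float>& gate_scores, int k) {
    std::vector<int> idx(gate_scores.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return gate_scores[a] > gate_scores[b]; });
    idx.resize(k);
    return idx;
}
```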
Dense models, like Claude's Sonnet and Opus (estimated to be a few hundred billion parameters), have to pass through every parameter to inference each token. You can think of parameters as the brain cells, for a close-enough conception. And a token is a piece of a word; you can visualize tokens over here to get a better idea: platform.openai.com/tokenizer
The last thing I'll say since I don't think I've covered it anywhere is that a part of the problem with the US system is that Big Tech promotes and gives out raises based on "impact." So when people see a project coming down the pipeline, it is very common for ladder climbers and other parasites to abandon whatever team they're on to go glom onto the big new thing that's sure to get promotions handed out like candy. And they inject themselves into the processes to prove contribution to the project which makes the whole enterprise slower, more inefficient, and more likely to get derailed into non-primary concerns like ethics and safety and so on. We have a decent amount of good data that safety/ethics training actively harms model performance by basically creating these tensor black holes that suck in any passing request and turn it into "As an AI language model" type shit. But that's a different set of complaints.
@IAMAL_PHARIUS Some context no one asked for since fuck you, I'm spamming everywhere about LLM shit today:
DeepSeek V3 is the non-thinking version of R1. It has pretty severe repetition issues in multi-turn (chat) settings but is pretty good overall. I say that with the usual caveat about benchmarks:
The benchmarks here should always be taken with a grain of salt because they poorly reflect real world use, but, broadly speaking, they point toward a vague understanding of baseline ability for the models. Of the benchmarks listed, MMLU-Pro has the most general correlation with end-user ability, but it is still a rough-fit kind of thing. Benchmarks are automated, not human evaluated with an eye for detail, so they can only do so much.
THAT BEING SAID, this is very good performance for a non-thinking/reasoning model so it's very promising and you can try it out yourself at chat.qwenlm.ai
Qwen models tend to be pretty strongly "aligned" (cucked for "safety" purposes), so you're likely going to have less fun with Qwen 2.5 Max than with DeepSeek R1.
@Marakus@RustyCrab@Inginsub@IAMAL_PHARIUS Yep. The company heads are almost exclusively finance douchebags who bought the top ML guys because ML has been useful for finance for a while now (trend prediction, trade data interpretation, related junk), so they're selling it to investors like they sell everything in modern tech: "We have a frontier model and the best people and we're paying them insane amounts to stay with us, so that guarantees we're going to get better models faster." But it does not.
Even then, there's a pretty finite ceiling on how good the "best model" can be, so eventually everyone will have an equivalent model and you'll be competing on features/tools and user reach. It's a money free-for-all for whoever has the best model until we reach the meaningful apex of general-use LLMs, so that's what all the jockeying represents at present, even if the investors don't realize it.
VRAM modules themselves are fairly cheap, though there's definitely work involved in making sure your PCB layouts and scheduling and so on make effective use of them.
That said, there's no reason high-VRAM needs to mean high board cost. It's not a particularly power hungry or low-yield part and GDDR5 is more than fast enough for training and inference at present.
Nvidia has just been raking in insane amounts of money on insane data center orders while bumping the prices of the consumer cards gen-over-gen (the cards that USED to be the backbone of their business) so they could reserve the silicon for datacenter stuff. So now people are on very slow GPU upgrade cycles on the PC side, with the 50 series underperforming in early reviews, at the same time as it turns out you DON'T need to match Meta or X or OpenAI's massive fleets of H100s (which is what the clueless megacorps were insisting you needed). You can actually do it on a tiny fraction of the compute if you're just not retarded.
There's going to be a lot of bad-actor information trying to make it sound like DeepSeek cheated somehow, but they didn't. They'll try to claim they "distilled" OAI's models, but they used them for synthetic datasets at most. Everyone does that, even if it's against TOS. Google is using Anthropic's models to make synthetic datasets for Gemini. It's all over.
Anyway, Nvidia tanked because the speculators are not convinced everyone is cancelling their $100M orders for more H100s and Hopper stuff and so there's not going to be an infinite money printer in Jensen's oven anymore.
@Wiz@Zergling_man@dotnet@Hoss@bronze@skylar@hancoktom41 I like shondo. I think she's a cool guy. I have seen the leaked Gura sex tape and I didn't enjoy that so much. And 2025 is the year of Linux on the desktop. And I think we should all name our favorite Nijisanji personality. I don't know any, but I'm sure I'd like one of them.
@dotnet@hancoktom41@Hoss@Wiz@bronze I'm also outraged about Wiz, who apparently is rich, for some reason (I didn't read the thread and don't know what's happening). I think he should buy us all Shondo marketable plushies and dakimakuras as a form of restitution for whatever is happening.
@Will2Power It's rough out in the woods and we probably all take it a little bit too easy on heavy, high-powered machines. If one got Ken Block (snowmobile, but same idea), it can get anybody. I'm surprised these backwoods nature/zipline resorts still rent pretty high power Can-Ams and Polaris stuff to any day-drinking visitor with a driver's license.