The real reason it is dropped is because when training models it’s all about crunching numbers. The underlying transformation algorithms aren’t that hard to comprehend, but being able to go through iterations faster adds up quickly. I could implement these algorithms to be correct, but could I implement them fast compared to someone who really knows the domain and has spent years tinkering with pointer tricks in C? Probably not. Not overnight, at least. I’d probably have to spend years, just like them, learning all about this shit.
And that’s before we even bring up TPUs and ASICs.