Fascinating report from 404 Media's Samantha Cole on a trove of leaked NVIDIA Slack messages and emails about how they're scraping millions of YouTube videos to train their own new foundation video generation model: https://www.404media.co/nvidia-ai-scraping-foundational-model-cosmos-project/
Posted a few of my own notes here: https://simonwillison.net/2024/Aug/5/nvidia-scraping-videos/
It's not surprising to learn that they're doing this - that's practically the industry standard right now - but is still really interesting to see internal details of what they're collecting and why