@paninid @argv_minus_one @Uair @hosford42 @actuallyautistic @neurodivergence
It depends on what you are trying to use the LLM for.
I'm not sure what the purpose is of injecting random data that is curated only by language. Even for "English", which English are you curating? What population?
Even the notion that the worldviews of all English-speaking populations are "universal" is false, and thus an uncurated data set is GIGO (Garbage In, Garbage Out). It's an inappropriate use of the tech.