There's a newish genre of LLM paper I'm starting to see, where the authors put together an enormous suite of benchmarks, test a bunch of models on them, and write a 50-100 page opus talking about how LLMs can now do shiny new thing.
The reader is absolutely buried in some enormous matrix of tests (models x benchmarks, in the simplest case), and each benchmark goes by way too quickly to actually establish construct validity.