There's a newish genre of LLM paper I'm starting to see, where the authors put together an enormous suite of benchmarks, test a bunch of models on them, and write a 50-100 page opus talking about how LLMs can now do shiny new thing.
The reader is absolutely buried in some enormous matrix of tests (models x benchmarks, in the simplest case), and each benchmark goes by way too quickly to actually establish construct validity.