Embed Notice

HTML Code

<blockquote style="position: relative; padding-left: 55px;"><section><a href="https://wandering.shop/users/xgranade/statuses/113409169139560378">Cassandra Granade 🏳️‍⚧️ (xgranade@wandering.shop)'s status on Saturday, 02-Nov-2024 07:58:04 JST</a><a href="https://wandering.shop/@xgranade" title="xgranade@wandering.shop"><img src="https://gnusocial.jp/avatar/114762-48-20230711182559.webp" width="48" height="48" alt="Cassandra Granade 🏳️‍⚧️" style="position: absolute; left: 0; top: 0;">Cassandra Granade 🏳️‍⚧️</a><div><a href="https://wandering.shop/@xgranade/113409097245295996" rel="in-reply-to">in reply to</a></div></section><article><p>I pulled a random arXiv preprint¹ off a Google Scholar search for prompt engineering, and this is the size of the datasets² that they're using for all their conclusions.</p><p>The quality of evidence in that field is almost nonexistent. They're reporting on deltas of less than 1% based on a sampling procedure that can at *best* give 3% margins of error.</p><p>¹accepted at a major conference, using a preprint to avoid paywalls<br>²apologies for shitty alt-text, getting alt-text of tables is tricky</p></article><footer><a rel="bookmark" href="https://gnusocial.jp/conversation/3920832#notice-7660898">In conversation</a><time datetime="2024-11-02T07:58:04+09:00" title="Saturday, 02-Nov-2024 07:58:04 JST">about 20 days ago</time> <span>from <span><a href="https://wandering.shop/@xgranade/113409169139560378" rel="external" title="Sent from wandering.shop via ActivityPub">wandering.shop</a></span></span><a href="https://wandering.shop/@xgranade/113409169139560378">permalink</a><h4>Attachments</h4><ol><li><label><a rel="external" href="https://gnusocial.jp/attachment/3390682">Dataset Subtasks |Ttrain| |Tdev| |Ttest| # Random Samples
MultiArith (Roy and Roth, 2015) - 100 100 400 1
GSM8K (Cobbe et al., 2021) - 100 100 1319 1
Instruction Induction (Honovich et al., 2023) 14 Subtasks 100 20 100 5
Counterfactual Eval (Wu et al., 2024) 12 Subtasks 100 20 100 5
BIG-Bench Hard (BBH format used in Suzgun et al. (2023)) 27 Subtasks 100 100 50 1
BIG-Bench Hard (Alternative format; see §A.2) 2 Subtasks 100 100 500 1</a></label><br><a href="https://stockroom.wandering.shop/media_attachments/files/113/409/149/155/302/516/original/a72f8b2770756503.png" rel="external">https://stockroom.wandering.shop/media_attachments/files/113/409/149/155/302/516/original/a72f8b2770756503.png</a></li></ol></footer></blockquote>

Corresponding Notice

Cassandra Granade 🏳️‍⚧️ (xgranade@wandering.shop)'s status on Saturday, 02-Nov-2024 07:58:04 JST
In reply to https://wandering.shop/@xgranade/113409097245295996
I pulled a random arXiv preprint¹ off a Google Scholar search for prompt engineering, and this is the size of the datasets² that they're using for all their conclusions.
The quality of evidence in that field is almost nonexistent. They're reporting on deltas of less than 1% based on a sampling procedure that can at *best* give 3% margins of error.
¹accepted at a major conference, using a preprint to avoid paywalls
²apologies for shitty alt-text, getting alt-text of tables is tricky
In conversation, about 20 days ago, from wandering.shop (permalink: https://wandering.shop/@xgranade/113409169139560378)
Attachments
1. Dataset Subtasks |Ttrain| |Tdev| |Ttest| # Random Samples
   MultiArith (Roy and Roth, 2015) - 100 100 400 1
   GSM8K (Cobbe et al., 2021) - 100 100 1319 1
   Instruction Induction (Honovich et al., 2023) 14 Subtasks 100 20 100 5
   Counterfactual Eval (Wu et al., 2024) 12 Subtasks 100 20 100 5
   BIG-Bench Hard (BBH format used in Suzgun et al. (2023)) 27 Subtasks 100 100 50 1
   BIG-Bench Hard (Alternative format; see §A.2) 2 Subtasks 100 100 500 1
  https://stockroom.wandering.shop/media_attachments/files/113/409/149/155/302/516/original/a72f8b2770756503.png
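
The margin-of-error figure in the post can be roughly checked against the |Ttest| column of the table. The following is a minimal sketch, not the post author's calculation. It assumes a 95% confidence level, the normal approximation for a binomial proportion, and the worst case of 50% accuracy. The function name `worst_case_margin` is hypothetical, and each |Ttest| value is treated as the number of independent test examples behind one accuracy figure, which the table itself does not state.

```python
import math

# Test-set sizes copied from the |Ttest| column of the attached table.
# Assumption: each value is the number of examples behind one reported accuracy.
TEST_SET_SIZES = {
    "MultiArith": 400,
    "GSM8K": 1319,
    "Instruction Induction": 100,
    "Counterfactual Eval": 100,
    "BIG-Bench Hard (Suzgun et al. format)": 50,
    "BIG-Bench Hard (alternative format)": 500,
}

Z_95 = 1.96  # assumption: two-sided 95% confidence level


def worst_case_margin(n: int, z: float = Z_95) -> float:
    """Normal-approximation margin of error for an accuracy estimated from
    n independent examples, evaluated at the worst case p = 0.5."""
    return z * math.sqrt(0.25 / n)


for name, n in TEST_SET_SIZES.items():
    margin_points = 100 * worst_case_margin(n)
    print(f"{name}: n={n}, margin of error = ±{margin_points:.1f} points")
```

Under these assumptions, the largest test set (1319 examples) gives a margin of about ±2.7 percentage points, in line with the post's "at best 3%", and the smaller test sets give wider margins.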

Public
