Dataset                                                  Subtasks     |Ttrain|  |Tdev|  |Ttest|  # Random Samples
MultiArith (Roy and Roth, 2015)                          -            100       100     400      1
GSM8K (Cobbe et al., 2021)                               -            100       100     1319     1
Instruction Induction (Honovich et al., 2023)            14 Subtasks  100       20      100      5
Counterfactual Eval (Wu et al., 2024)                    12 Subtasks  100       20      100      5
BIG-Bench Hard (BBH format used in Suzgun et al. (2023)) 27 Subtasks  100       100     50       1
BIG-Bench Hard (Alternative format; see §A.2)            2 Subtasks   100       100     500      1

https://stockroom.wandering.shop/media_attachments/files/113/409/149/155/302/516/original/a72f8b2770756503.png

Cassandra Granade 🏳️‍⚧️ (xgranade@wandering.shop)'s status on Saturday, 02-Nov-2024 07:58:04 JST
I pulled a random arXiv preprint¹ off a Google Scholar search for prompt engineering, and this is the size of the datasets² that they're using for all their conclusions.
The quality of evidence in that field is almost nonexistent. They're reporting on deltas of less than 1% based on a sampling procedure that can at *best* give 3% margins of error.
¹accepted at a major conference, using a preprint to avoid paywalls
²apologies for shitty alt-text, getting alt-text of tables is tricky
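
As a rough check on the margin-of-error figure, here is a minimal Python sketch that computes a 95% margin of error for an accuracy estimate at each test-set size in the table. It rests on assumptions that are not in the post or the paper: simple random sampling, the normal approximation to the binomial, the worst case p = 0.5, and that the listed |Ttest| sizes apply per subtask. The function name `margin_of_error` and the dictionary labels are illustrative only.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    Assumptions (not from the post): simple random sampling, worst-case p = 0.5,
    and z = 1.96 for a roughly 95% interval.
    """
    return z * math.sqrt(p * (1 - p) / n)


# |Ttest| sizes copied from the table; treating them as per-subtask sizes
# is an assumption.
test_sizes = {
    "MultiArith": 400,
    "GSM8K": 1319,
    "Instruction Induction": 100,
    "Counterfactual Eval": 100,
    "BIG-Bench Hard (BBH format)": 50,
    "BIG-Bench Hard (alternative format)": 500,
}

margins = {name: margin_of_error(n) for name, n in test_sizes.items()}

for name, moe in margins.items():
    print(f"{name}: n={test_sizes[name]}, +/- {100 * moe:.1f} percentage points")
```

Under these assumptions only the largest test set (GSM8K, n = 1319) gets under 3 percentage points, at about 2.7; the 100-example test sets come out near 9.8 and the 50-example ones near 13.9. Pooling subtasks or averaging over several random samples would change these numbers, so treat this as an order-of-magnitude check rather than a reanalysis of the paper.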
