Dataset                                                    Subtasks      |T_train|  |T_dev|  |T_test|  # Random Samples
MultiArith (Roy and Roth, 2015)                            -             100        100      400       1
GSM8K (Cobbe et al., 2021)                                 -             100        100      1319      1
Instruction Induction (Honovich et al., 2023)              14 Subtasks   100        20       100       5
Counterfactual Eval (Wu et al., 2024)                      12 Subtasks   100        20       100       5
BIG-Bench Hard (BBH format used in Suzgun et al. (2023))   27 Subtasks   100        100      50        1
BIG-Bench Hard (Alternative format; see §A.2)              2 Subtasks    100        100      500       1