I pulled a random arXiv preprint¹ off a Google Scholar search for prompt engineering, and this is the size of the datasets² that they're using for all their conclusions.
Evidence of any quality in that field is almost nonexistent. They're reporting deltas of less than 1% based on a sampling procedure that can at *best* give a 3% margin of error, so those differences sit well inside the noise.
¹accepted at a major conference; I'm using the preprint to avoid paywalls
²apologies for the shitty alt-text, writing alt-text for tables is tricky
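
For a rough sense of where a margin of error around 3% comes from, here's a minimal sketch. It assumes a simple random sample, a 95% confidence level, and the normal approximation for a proportion (e.g. accuracy on a benchmark). The sample size of 1,000 is a hypothetical round number picked for illustration, not the paper's actual figure, and the function names are mine.

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level (assumed)


def margin_of_error(n: int, p: float = 0.5, z: float = Z_95) -> float:
    """Half-width of the normal-approximation confidence interval for a proportion.

    p = 0.5 is the worst case, i.e. it gives the widest margin.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


def required_sample_size(moe: float, p: float = 0.5, z: float = Z_95) -> int:
    """Smallest n whose margin of error is at most `moe` under the same assumptions."""
    return math.ceil((z / moe) ** 2 * p * (1.0 - p))


if __name__ == "__main__":
    n = 1000  # hypothetical evaluation set size, NOT taken from the paper
    print(f"n={n}: margin of error ~ {margin_of_error(n):.1%}")
    print(f"n needed for a 1% margin: {required_sample_size(0.01)}")
```

Under these assumptions, about a thousand samples gets you a margin of roughly 3%, and shrinking that to 1% takes on the order of ten thousand. Comparing two conditions against each other needs even more, since both measurements carry their own noise.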