The second paper highlights two problems with the first, and a third has since surfaced on Twitter:
1. 4% of the problems in the test set cannot be solved with the information provided, or in some cases at all. For example:
> Below you are given the delays for the different gates you were permitted to use in part D above. Compute the propagation delay of your circuit from D.
That's the entire question. There is no part D above, yet the claim is that GPT-4 answered it correctly. There are many questions like this in the test set; the second paper links to a spreadsheet with the complete list of questions, good and bad.
2. The answers provided by GPT-4 are scored by GPT-4. If GPT-4 tells GPT-4 that GPT-4 got the question wrong, GPT-4 gets to try again indefinitely.
Supposedly the answers were verified manually, but if so, the verifiers did a poor job, because they missed all of the broken questions.
3. Not included in the paper, but posted today on Twitter: the original code used to run the tests leaks the answers used for verification to the GPT-4 instance answering the questions.
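To make point 2 concrete, the grade-your-own-answer setup amounts to a retry loop with no cap, where success is declared by the same model being evaluated. This is a hypothetical sketch of that flaw, not the paper's actual code; the function names and stubbed-out model calls are invented for illustration:

```python
# Hypothetical reconstruction of the flawed evaluation loop.
# In the real setup, both answer() and grade() would be GPT-4 calls;
# here they are stubs so the shape of the problem is visible.

def answer(question: str, attempt: int) -> str:
    # Stand-in for the GPT-4 instance producing an answer.
    return f"answer-{attempt}"

def grade(question: str, proposed: str) -> bool:
    # Stand-in for the GPT-4 instance grading its own answer.
    # Pretend the (unreliable) grader only accepts the third try.
    return proposed == "answer-3"

def solve_until_graded_correct(question: str) -> tuple[str, int]:
    attempt = 0
    while True:  # no retry limit: loop until the self-grader says "correct"
        attempt += 1
        proposed = answer(question, attempt)
        if grade(question, proposed):
            return proposed, attempt

result, tries = solve_until_graded_correct("Compute the propagation delay...")
```

With an unbounded retry budget and a grader that shares the answering model's blind spots, "100% accuracy" is close to guaranteed by construction, which is the core of the criticism.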