Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.
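To make that setup concrete, here is a minimal sketch of what "different instantiations of the same question" means in practice: a templated GSM8K-style word problem whose names and numbers are resampled while the reasoning chain stays fixed. This is not the authors' actual GSM-Symbolic pipeline; the template, names, and helper function below are hypothetical illustrations.

```python
import random

# Hypothetical template: the surface values change, the reasoning does not.
TEMPLATE = (
    "{name} has {x} apples. {name} buys {y} more bags with {z} apples each. "
    "How many apples does {name} have now?"
)

def instantiate(seed: int) -> tuple[str, int]:
    """Return one concrete instantiation of the template and its ground-truth answer."""
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Liam", "Mia", "Noah"])
    x, y, z = rng.randint(2, 20), rng.randint(2, 6), rng.randint(2, 12)
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    answer = x + y * z  # same arithmetic structure in every variant
    return question, answer

if __name__ == "__main__":
    # A model that genuinely reasons should score identically on every variant;
    # the paper reports measurable variance across such instantiations.
    for seed in range(3):
        q, a = instantiate(seed)
        print(q, "->", a)
```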
Thanks to the folks at Apple for experimentally proving what I’ve been repeating for a decade or so: #AI without formal inference is a statistical parrot and a glorified pattern matcher.
Throwing more neurons and layers at the problem may just make it harder to spot this obvious fallacy. There’s no magical “emergence” of reasoning from statistics.
A technology that needs to be shown thousands of appropriately normalized and labeled pictures of cats in order to recognize a cat, while a child needs to see only a couple of cats to robustly assimilate the pattern, is not a technology that has actually "learned" anything, let alone one that manifests any sort of intelligence.
That’s not to say that we should throw neural networks away, but living beings don’t learn only by statistical exposure to examples.