If you come up with a new benchmark, they'll just guzzle it up as part of the training data and then claim to do "reasoning" on it.
It is so mind-boggling to me that people even have to spend time debunking these claims. Such a waste of resources that could be going toward actual science and engineering work.