So, I’m doing some automated comparison testing with various publicly available LLMs: classifying posts in a subreddit against a fixed list of flair categories, and seeing how well different models do.
It's hit or miss in many cases, but about 1 out of 8 posts just makes certain models WIG OUT. Instead of responding with the name of one of the set categories, phi3.5 started regurgitating the summary of a paper on gene polymorphism in dopamine receptors. Another responded with a snippet of Python.
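
For anyone wanting to measure this kind of thing, here's a rough sketch of how off-list replies could be flagged and counted. It's not my actual harness: the `CATEGORIES` list and the `parse_category` and `off_list_rate` helpers are made-up names for illustration, and it assumes a reply only counts if it is exactly one category name (ignoring case, surrounding whitespace, and stray quotes or a trailing period).

```python
from typing import Optional, Sequence

# Hypothetical flair list; the real subreddit's categories are not given here.
CATEGORIES = ["Question", "Discussion", "News", "Tutorial"]


def parse_category(response: str, categories: Sequence[str] = CATEGORIES) -> Optional[str]:
    """Return the matching category, or None if the reply is not exactly one category name."""
    # Drop surrounding whitespace, then stray quotes/backticks/periods, then compare case-insensitively.
    cleaned = response.strip().strip("\"'`.").strip().casefold()
    for category in categories:
        if cleaned == category.casefold():
            return category
    return None


def off_list_rate(responses: Sequence[str], categories: Sequence[str] = CATEGORIES) -> float:
    """Fraction of replies that do not map to any allowed category."""
    if not responses:
        return 0.0
    misses = sum(1 for r in responses if parse_category(r, categories) is None)
    return misses / len(responses)


if __name__ == "__main__":
    replies = ["News", " question.\n", "import re\nprint('hi')", "Discussion"]
    print([parse_category(r) for r in replies])  # ['News', 'Question', None, 'Discussion']
    print(off_list_rate(replies))                # 0.25
```

Exact matching is deliberately strict: a reply like "I think this is News" counts as a miss, which seems like the right call when the prompt asks for the category name only. Loosening it to substring matching would hide some of the failures this is meant to catch.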