Alex @alexalbert__

Fun story from our internal testing on Claude 3 Opus. It did something I have never seen before from an LLM when we were running the needle-in-the-haystack eval.

For background, this tests a model’s recall ability by inserting a target sentence (the "needle") into a corpus of random documents (the "haystack") and asking a question that can only be answered using the information in the needle.

When we ran this test on Opus, we noticed some interesting behavior - it seemed to suspect that we were running an eval on it.
https://files.mastodon.social/media_attachments/files/112/048/520/745/753/953/original/f2cef70e38a9c637.png
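For anyone curious what the eval looks like mechanically, here is a minimal sketch of a needle-in-a-haystack check. The needle text, question, filler corpus, and helper names below are illustrative placeholders, not the actual harness; it assumes the Anthropic Python SDK with an API key in the environment.

```python
# Illustrative needle-in-a-haystack sketch (not the actual eval harness).
import anthropic

NEEDLE = "The secret ingredient in the award-winning chili is dark chocolate."  # illustrative needle
QUESTION = "What is the secret ingredient in the award-winning chili?"
EXPECTED = "dark chocolate"


def build_haystack(filler_docs: list[str], needle: str, depth: float = 0.5) -> str:
    """Concatenate filler documents and bury the needle at a relative depth (0.0 = start, 1.0 = end)."""
    corpus = "\n\n".join(filler_docs)
    insert_at = int(len(corpus) * depth)
    return corpus[:insert_at] + "\n\n" + needle + "\n\n" + corpus[insert_at:]


def run_needle_check(filler_docs: list[str]) -> bool:
    """Ask the model the needle question over the haystack and check recall."""
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    haystack = build_haystack(filler_docs, NEEDLE)
    prompt = f"{haystack}\n\nAnswer using only the documents above: {QUESTION}"
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.content[0].text
    # Recall check: did the model surface the information from the needle?
    return EXPECTED.lower() in answer.lower()
```

In practice the test is swept over many context lengths and needle depths; the single-call sketch above just shows the core mechanic of burying the needle and scoring recall.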