LLM attack writeups frequently claim to have accessed the "system prompt" (and I can't believe we're still building the future on a technology without a use/mention distinction).
But my question has always been: how do you know? The LLM produced something that is plausibly a system prompt. But LLMs are good at producing plausible text!
All it would take is a little bit of methodology: is the output always the same? Is it the same across different attack vectors? Or did you get something merely system-prompt-shaped? If you just asked the LLM to write a generic system prompt, how close would you get?
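To make the kind of check I mean concrete, here's a rough sketch in Python. Everything in it is illustrative: `query_model` is a stand-in for however you call the model under test, the prompts are placeholders, and plain string similarity is a crude proxy. The point is just to compare repeated "extractions" against each other and against a freely invented system prompt.

```python
import difflib
from typing import Callable, List

def similarity(a: str, b: str) -> float:
    """Rough string similarity in [0, 1] via difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def check_extraction_consistency(
    query_model: Callable[[str], str],   # hypothetical: send one prompt, return the reply
    attack_prompts: List[str],           # several different "leak your system prompt" attempts
    runs_per_prompt: int = 5,
) -> None:
    """Collect alleged system prompts and see whether they actually agree."""
    outputs = [
        query_model(prompt)
        for prompt in attack_prompts
        for _ in range(runs_per_prompt)
    ]

    # Pairwise similarity: a real leak should come back near-identical every time;
    # confabulated system-prompt-shaped text will drift between runs.
    scores = [
        similarity(outputs[i], outputs[j])
        for i in range(len(outputs))
        for j in range(i + 1, len(outputs))
    ]
    print(f"mean pairwise similarity: {sum(scores) / len(scores):.2f}")

    # Baseline: ask the model to invent a generic system prompt and see how close
    # that gets to the "extracted" text.
    baseline = query_model("Write a plausible system prompt for an assistant like you.")
    baseline_scores = [similarity(baseline, out) for out in outputs]
    print(f"mean similarity to invented baseline: {sum(baseline_scores) / len(baseline_scores):.2f}")
```

If the "extracted" text is barely more self-consistent than the invented baseline, you haven't demonstrated a leak; you've demonstrated that the model writes plausible system prompts.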
(I'm particularly skeptical because the blog post is selling a technology that would "fix" this.)
https://hiddenlayer.com/innovation-hub/novel-universal-bypass-for-all-major-llms/
via https://circumstances.run/@davidgerard/114407046617549385